From bb101_at at yahoo.com Fri Oct 1 05:18:45 2004 From: bb101_at at yahoo.com (Brady Bonty) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] HPC Survey Message-ID: <20041001121845.63443.qmail@web20621.mail.yahoo.com> Salutations, My name is Brady Black, I am currently a student enrolled in a High Performance Computing curriculum. As part of my school work and internship, I am gathering information relating to current High Performance Computer installations and their operations. I plan on using this information to better understand the hurdles currently facing High Performance Computing and to provide some insight into solutions of common problems. I would be very appreciative if you would take 5 – 10 minutes out of your busy schedule to answer this 20 question survey. http://www.unc.edu/~bradyb/hpcSurvey.html Please be assured that all information gathered from this survey will remain anonymous unless specific consent is provided. My plan is to use the aggregate data to provide an overview of challenges currently faced by the High Performance Computing industry. If you would like a copy of the aggregate data, please let me know. Thank you, Brady Black bradyb[at]unc[dot]edu __________________________________ Do you Yahoo!? Yahoo! Mail Address AutoComplete - You start. We finish. http://promotions.yahoo.com/new_mail From 050675 at student.unife.it Fri Oct 1 00:59:25 2004 From: 050675 at student.unife.it (050675@student.unife.it) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] raw results Message-ID: <51685.192.84.144.228.1096617565.squirrel@student.unife.it> Hi all, someone in this list (Robert J. Brown, probably), a few months ago asked me to post one or twice a month if I had interesting results with raw ethernet on which I'm making my first level degree. Now I've some interesting results: packets loss have been decreased (even if under 300 bytes payload I experience so many loss, but probably the problem is in the used 32 bits architecture or in the fact that I use a 2.4.x kernel without NAPI, for the moment). For any other payload value (in a range between 300 and 1500 bytes) I've no losses, even with jumbo frames (achieved throughput 111 MB/sec in a point-to-point connection with Gbit ethernet cards). I've also experienced NAPI, but with a kernel 2.4, not 2.6 (next goal). Thank you for your attention and help in these months. Simone Saravalli From rgb at phy.duke.edu Fri Oct 1 06:42:01 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] raw results In-Reply-To: <51685.192.84.144.228.1096617565.squirrel@student.unife.it> Message-ID: On Fri, 1 Oct 2004 050675@student.unife.it wrote: > Hi all, > someone in this list (Robert J. Brown, probably), a few months ago > asked me to post one or twice a month if I had interesting results with > raw ethernet on which I'm making my first level degree. > Now I've some interesting results: packets loss have been decreased (even > if under 300 bytes payload I experience so many loss, but probably the > problem is in the used 32 bits architecture or in the fact that I use a > 2.4.x kernel without NAPI, for the moment). > For any other payload value (in a range between 300 and 1500 bytes) I've > no losses, even with jumbo frames (achieved throughput 111 MB/sec in a > point-to-point connection with Gbit ethernet cards). > I've also experienced NAPI, but with a kernel 2.4, not 2.6 (next goal). > Thank you for your attention and help in these months. Cool! robert >>G<< brown (rgb:-) -- Robert G. 
Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From gerry.creager at tamu.edu Fri Oct 1 07:44:49 2004 From: gerry.creager at tamu.edu (Gerry Creager n5jxs) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] Somewhat OT, but still...: Has anyone seen... Message-ID: <415D6D61.2040707@tamu.edu> problems on Supermicro dual Xeon motherboards/systems with the 2.6 kernels, especially with interrupts and keyboard controllers? I've got a system that will lock up using FC2, and the latest updates for 2.6.8-1.521smp, run fine in the uniproc mode, boot but not allow local keyboard access in 2.6.5-1.358smp and work fine in uniprocessor. I'm thinking it's hardware, but I'm askin' if anyone else has seen something similar... Thanks, Gerry -- Gerry Creager -- gerry.creager@tamu.edu Network Engineering -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578 Pager: 979.228.0173 Office: 903A Eller Bldg, TAMU, College Station, TX 77843 From gerry.creager at tamu.edu Fri Oct 1 07:23:34 2004 From: gerry.creager at tamu.edu (Gerry Creager n5jxs) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] raw results In-Reply-To: References: Message-ID: <415D6866.6090309@tamu.edu> One note: Please review RFC2544 regarding packet loss and mitigation for small packets. We tested a number of switches about 18 months ago using an Anritsu MD1230 automated tester, and saw a lot of packet loss in switches that were not crafted for small packets. Something to consider. There are manufacturers who have engineered for both jumbo and small packets. The tuning for this is not trivial: the problems for small packets are not well translated to jumbo frames. As packet size decreases, overhead increases and the packet transmission rate goes 'way up. Jumbo frames may adversely impact switch fabric memory if you're testing store-and-forward devices not designed with sufficient memory for jumbos originally. Option 'B' is fragmentation, removing the benefit of jumbos immediately. Thanks, all the same, for posting your results. We're always interested in independent reports and independent methods! Regards, Gerry Robert G. Brown wrote: > On Fri, 1 Oct 2004 050675@student.unife.it wrote: > > >>Hi all, >> someone in this list (Robert J. Brown, probably), a few months ago >>asked me to post one or twice a month if I had interesting results with >>raw ethernet on which I'm making my first level degree. >>Now I've some interesting results: packets loss have been decreased (even >>if under 300 bytes payload I experience so many loss, but probably the >>problem is in the used 32 bits architecture or in the fact that I use a >>2.4.x kernel without NAPI, for the moment). >>For any other payload value (in a range between 300 and 1500 bytes) I've >>no losses, even with jumbo frames (achieved throughput 111 MB/sec in a >>point-to-point connection with Gbit ethernet cards). >>I've also experienced NAPI, but with a kernel 2.4, not 2.6 (next goal). >>Thank you for your attention and help in these months. > > > Cool! 
> > robert >>G<< brown (rgb:-) > -- Gerry Creager -- gerry.creager@tamu.edu Network Engineering -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578 Pager: 979.228.0173 Office: 903A Eller Bldg, TAMU, College Station, TX 77843 From jrajiv at hclinsys.com Mon Oct 4 04:43:33 2004 From: jrajiv at hclinsys.com (Rajiv) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] Dual Boot in Master and Client Message-ID: <04b001c4aa07$60aab630$39140897@PMORND> Dear All, I would like to have dual boot - Windows and Linux in master and all clients. In which beowulf package this is possible? Regards, Rajiv -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041004/3a0d3bb9/attachment.html From mphil39 at hotmail.com Mon Oct 4 12:23:38 2004 From: mphil39 at hotmail.com (Matt Phillips) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] How to find a swapped out, runnable process? Message-ID: I am running RH9 (2.4.20-9SGI_XFS_1.2.0smp) on a 16-node cluster. I noticed the load on the I/O node to be consistently high after one of the clients crashed during rsync. I did vmstat and found that 1 or more process are always in the runnable but swapped out queue.. Here's a sample output of vmstat procs memory swap io system cpu r b w swpd free buff cache si so bi bo in cs us sy id 2 0 1 30120 9508 176 1609060 0 0 1 1 0 2 0 0 0 0 0 1 30120 9508 176 1609060 0 0 0 0 118 237 0 3 97 0 0 2 30120 9508 176 1609060 0 0 0 1064 252 670 0 1 99 0 0 1 30120 9508 176 1609060 0 0 0 36 156 330 0 0 100 0 0 1 30120 9508 176 1609060 0 0 0 268 157 305 0 0 100 0 0 1 30120 9508 176 1609060 0 0 0 8 129 245 0 0 100 As you can see, there is always one process in procs/w queue.. How do I find which process is this? I tried various combos of ps (like looking at wchan, stat outputs etc, variations of top).. but ps/top only show 1-2 process in the runnable queue and doesnt indicated if they are swapped. Maybe I am reading the man pages incorrectly. Anyone has ideas how I can catch this errant process? TIA, Matt _________________________________________________________________ On the road to retirement? Check out MSN Life Events for advice on how to get there! http://lifeevents.msn.com/category.aspx?cid=Retirement From mikee at mikee.ath.cx Mon Oct 4 13:40:42 2004 From: mikee at mikee.ath.cx (Mike) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] OT: effective amount of data through gigabit ether? Message-ID: <20041004204042.GQ7153@mikee.ath.cx> I know this is off topic, but I've not found an answer anywhere. On one IBM doc it says the effective throughput for 10Mb/s is 5.7GB/hour, 100Mb/s is 17.6GB/hour, but only lists TBD for 1000MB/s. Does anyone know what this effective number is? This is for calculating how long backups should take through my backup network. (I'm not interested in how long it takes to read/write the disk, just the network throughput.) Mike From agrajag at dragaera.net Mon Oct 4 14:23:05 2004 From: agrajag at dragaera.net (Sean Dilda) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] OT: effective amount of data through gigabit ether? In-Reply-To: <20041004204042.GQ7153@mikee.ath.cx> References: <20041004204042.GQ7153@mikee.ath.cx> Message-ID: <1096924985.4303.37.camel@pel> On Mon, 2004-10-04 at 16:40, Mike wrote: > I know this is off topic, but I've not found an answer anywhere. > On one IBM doc it says the effective throughput for 10Mb/s is > 5.7GB/hour, 100Mb/s is 17.6GB/hour, but only lists TBD for 1000MB/s. 
> Does anyone know what this effective number is? This is for > calculating how long backups should take through my backup network. > Those are interesting numbers. I calculate the peak numbers to be: 10MB/s - 4.19GB/hour (less than your IBM number) 100MB/s - 41.9GB/hour (way more than your IBM number) 1000MB/s - 419GB/hour In reality, the answer depends on your hardware. With my setup I've pushed 114MB/s over gigabit for an excess of ten minutes, which tends to average out all the bursts. If I took that out, it'd come to about 400GB/hour. > (I'm not interested in how long it takes to read/write the disk, > just the network throughput.) See now, that's the trick. Gigabit maxes out around 119MB/s. I've not tended to see disks that can actually preform that well (maybe with sequential data, but not with random data). From mwill at penguincomputing.com Mon Oct 4 14:13:57 2004 From: mwill at penguincomputing.com (Michael Will) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] OT: effective amount of data through gigabit ether? In-Reply-To: <20041004204042.GQ7153@mikee.ath.cx> References: <20041004204042.GQ7153@mikee.ath.cx> Message-ID: <200410041413.57525.mwill@penguincomputing.com> On Monday 04 October 2004 01:40 pm, Mike wrote: > I know this is off topic, but I've not found an answer anywhere. > On one IBM doc it says the effective throughput for 10Mb/s is > 5.7GB/hour, 100Mb/s is 17.6GB/hour, but only lists TBD for 1000MB/s. I would assume 900Mb/s as an optimistic best case throughput for the GigE, which would be about 395GB/hour. 17.6GB/hour seems like a really low estimate, that would be only about 40Mb/s effective transfer rate over an 100Mb/s link? Maybe that number is really measuring the tape writing speed instead? Michael Will > Does anyone know what this effective number is? This is for > calculating how long backups should take through my backup network. > > (I'm not interested in how long it takes to read/write the disk, > just the network throughput.) > > Mike > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Michael Will, Linux Sales Engineer NEWS: We have moved to a larger iceberg :-) NEWS: 300 California St., San Francisco, CA. Tel: 415-954-2822 Toll Free: 888-PENGUIN Fax: 415-954-2899 www.penguincomputing.com From pa_bosje at yahoo.co.uk Mon Oct 4 13:47:09 2004 From: pa_bosje at yahoo.co.uk (Patricia) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] myrinet (scali) or ethernet Message-ID: <20041004204709.89858.qmail@web25402.mail.ukl.yahoo.com> Hi People, I am user of two clusters: One runs under myrinet and the other under scali. In both cases I installed my software to run under each of them (but not ethernet). All I want to know is how to check whether my parallel jobs are indeed running under myrinet (scali) or ethernet. I have this question because I have observed a strong decay in the performance after a power outage. thanks for any input! Patricia ___________________________________________________________ALL-NEW Yahoo! Messenger - all new features - even more fun! 
http://uk.messenger.yahoo.com From nican at nsc.liu.se Mon Oct 4 13:45:33 2004 From: nican at nsc.liu.se (Niclas Andersson) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] Call for Participation - LCSC and NGN Message-ID: CALL FOR PARTICIPATION National Supercomputer Centre in Sweden (NSC) and Norwegian High Performance Computing Consortium (NOTUR) welcome you to participate in 5th Annual Workshop on Linux Clusters For Super Computing (LCSC) and workshop on Nordic Grid Neighbourhood (NGN) 18-20 October, 2004 Hosted by National Supercomputer Centre Linkoping University, SWEDEN http://www.nsc.liu.se/lcsc The LCSC workshop are brimful of knowledgeable speakers giving exciting talks about Linux clusters and distributed applications requiring vast computational resources. Just a few samples: - LCSC Keynote: Cluster Computing - You've come a long way in a short time Jack Dongarra, University of Tennessee - Application Performance on High-End and Commodity-class Computers Martyn Guest, CLRC Daresbury Laboratory - The BlueGene/L Supercomputer and LOFAR/LOIS Bruce Elmegreen, IBM Watson Research Center - MPI Micro-benchmarks: Misleading and Dangerous Greg Lindahl, Pathscale Inc. and many more. In the NGN workshop we gather speakers from the Nordic countries, Baltic states and northwest Russia to talk about Grids and efforts made in the field of Grid technology. There will be presentations of applications, Grid middleware and national initiatives as well as industrial solutions. NGN is supported by the Nordplus Neighbour program of the Research Council of Norway. Additionally, during these days there will be valuable vendor presentations, exciting exhibits, instructive tutorials, seminars and several other meetings. For more information and registration: http://www.nsc.liu.se/lcsc From wathey at salk.edu Mon Oct 4 15:19:37 2004 From: wathey at salk.edu (Jack Wathey) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] ammonite In-Reply-To: <20040927235904.GA21014@piskorski.com> References: <20040927182134.GA23662@piskorski.com> <20040927235904.GA21014@piskorski.com> Message-ID: The ammonite cluster can now be seen in pictures and words at http://jessen.ch/ammonite/ thanks to the generosity and skill of Per Jessen, and thanks to Tom Bartol, who not only took the photos, but also helped bring ammonite to life. Ammonite is a 200-processor cluster of bare diskless motherboards. Best wishes, Jack From James.P.Lux at jpl.nasa.gov Mon Oct 4 16:11:30 2004 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] ammonite In-Reply-To: References: <20040927235904.GA21014@piskorski.com> <20040927182134.GA23662@piskorski.com> <20040927235904.GA21014@piskorski.com> Message-ID: <5.2.0.9.2.20041004160726.018615d0@mail.jpl.nasa.gov> At 03:19 PM 10/4/2004 -0700, Jack Wathey wrote: >http://jessen.ch/ammonite/ Very nice.. particularly the efforts you put into the cooling air. The comment about flow against a pressure drop is very sound. I realize it's a time thing, but had you considered removing the fans from the power supplies? What's a rough order of magnitude cost on the blower/VFD? Does this VFD have a "servo" input that could be used to automatically change blower speed in response to temperature? (some VFDs have an analog input that can be used to set up a speed = linear function of input voltage, and by cleverly setting the gain and offset, you can do quite nicely) James Lux, P.E. 
Spacecraft Radio Frequency Subsystems Flight Telecommunications Systems Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From brett at nssl.noaa.gov Mon Oct 4 16:24:48 2004 From: brett at nssl.noaa.gov (Brett Morrow) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Oklahoma Supercomputing Symposium 2004 Message-ID: <4161DBC0.8050403@nssl.noaa.gov> Anyone else out there attending this event? ------------------------------------------------------------------------------------------------------------------------------------------ Join us for the Oklahoma Supercomputing Symposium 2004, Wed Oct 6 - Thu Oct 7, here at OU (CCE Forum). To register for the Symposium, go to http://symposium2004.oscer.ou.edu/ and follow the links. Some 400 people have registered for the Symposium, from 38 academic institutions, 40 companies, 18 government agencies and 2 non- governmental organizations, in 17 states and Canadian provinces. The Symposium is free, with meals provided, and it's a great way to meet leaders, potential collaborators, colleagues, and potential future employers and employees, from academia, government and industry. Our speaker list includes: * Sangtae Kim, new Division Director, Shared Cyberinfrastructure Division, Director for Computer & Information Science & Engineering, National Science Foundation * S. Ramakrishnan, Director, Center for Development of Advanced Computing, India * Stephen Wheat, Principal Scientist, Intel Corp * Joerg Schwartz, Senior Program Manager, Sun Labs * Steve Modica, Principal Engineer, SGI * Ian Lumb, Grid Solutions Manager, Platform Computing Inc. * Kurt Snodgrass, Vice Chancellor, Information Technology and Telecommunications, Oklahoma State Regents for Higher Education * Mark Musser, Senior Solutions Architect, Oracle Corporation * Viswa Sharma, Chief Technical Officer, CorEdge Networks * Anil Srivastava, Executive Chairman & Chief Strategic Officer, AcrossWorld Communications * Krzysztof Kuczera, Associate Professor, Department of Chemistry, University of Kansas * Ed Seidel, Director, Center for Computation & Technology, Louisiana State University * Mary Fran Yafchak, IT Program Coordinator, Southeastern Universities Research Association * Art Vandenberg, Director of Advanced Campus Services, Georgia State University * Dennis Aebersold, Chief Information Officer, University of Oklahoma * Amy Apon, Associate Professor of Computer Science, University of Arkansas * Richard Braley, Professor & Chair, Department of Technology, Cameron University * Paul Gray, Assistant Professor, Department of Computer Science, University of Northern Iowa * John Matrow, System Administrator/Trainer, High Performance Computing Center, Wichita State University We'll also have a vendor exposition, where you'll have an opportunity to learn about existing and emerging HPC technologies. Also, if you know of any students -- grad and undergrad -- who might be interested in the Symposium, this is a great opportunity to introduce them to conferences, especially because it's free. Our academic sponsors include Oklahoma EPSCoR, the Oklahoma Chamber of Commerce, and the OU Department of Information Technology, the Ou Vice President for Research, and the OU Supercomputing Center for Education & Research (OSCER). And if there are colleagues or students that you think might be interested, please forward this note to them. 
-- Brett Morrow, NSSL/SPC Alternate Program Manager INDUS Corporation National Severe Storms Laboratory (405) 366-0515 Brett.Morrow@noaa.gov http://www.induscorp.com From redboots at ufl.edu Mon Oct 4 15:31:31 2004 From: redboots at ufl.edu (JOHNSON,PAUL C) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] ethernet switch, dhcp question Message-ID: <265814864.1096929091119.JavaMail.osg@osgjas04.cns.ufl.edu> All: Im fairly new to beowulf clusters so please excuse the question if it is trivial. Ive installed mpich on several computers and have run several programs but the performance seems a little slow. All the computers in my lab are connected directly to the campus network. Would I see an increase in performance if I instead had slaves connected through a switch in my room connected to a master computer using dhcp to assign ip's? Thanks for any help, Paul -- JOHNSON,PAUL C From tmattox at gmail.com Mon Oct 4 16:32:11 2004 From: tmattox at gmail.com (Tim Mattox) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Dual Boot in Master and Client In-Reply-To: <04b001c4aa07$60aab630$39140897@PMORND> References: <04b001c4aa07$60aab630$39140897@PMORND> Message-ID: Hi Rajiv, I would think you could do this with Warewulf. http://warewulf-cluster.org/ Just make the BIOS on each node first attempt to boot with PXE, and upon PXE failure, boot from a locally installed Windows on the node's hard drive. To switch from Linux to Windows, turn off the dhcpd server on the master, and reboot the nodes. They should then come up in Windows. To switch back, you would turn on the dhcpd server on the master, and then using some "unknown-to-me" windows utility to remotely reboot the nodes, which should then come back up into Linux via the PXE+ramdisk booting with Warewulf. As for making the master dual boot, that is up to your local Linux guru to configure LILO or Grub for dual boot. I don't do Windows, so I can't help you there. Other Beowulf methods for diskless nodes should also work similarly. ----- Original Message ----- From: Rajiv Date: Mon, 4 Oct 2004 17:13:33 +0530 Subject: [Beowulf] Dual Boot in Master and Client To: beowulf@beowulf.org Dear All, I would like to have dual boot - Windows and Linux in master and all clients. In which beowulf package this is possible? Regards, Rajiv -- Tim Mattox - tmattox@gmail.com - http://homepage.mac.com/tmattox/ From wathey at salk.edu Mon Oct 4 16:37:15 2004 From: wathey at salk.edu (Jack Wathey) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] ammonite In-Reply-To: <5.2.0.9.2.20041004160726.018615d0@mail.jpl.nasa.gov> References: <20040927235904.GA21014@piskorski.com> <20040927182134.GA23662@piskorski.com> <20040927235904.GA21014@piskorski.com> <5.2.0.9.2.20041004160726.018615d0@mail.jpl.nasa.gov> Message-ID: On Mon, 4 Oct 2004, Jim Lux wrote: > At 03:19 PM 10/4/2004 -0700, Jack Wathey wrote: >> http://jessen.ch/ammonite/ > > > Very nice.. particularly the efforts you put into the cooling air. The > comment about flow against a pressure drop is very sound. > > I realize it's a time thing, but had you considered removing the fans from > the power supplies? The linear flow rate through the whole rack is typically about 120 to 200 fpm, which is sigificantly less than the linear flow rate through a PS. If the PS fan was there *only* to cool the mid-tower box it was meant to go into, then removing it would do no harm. But a PS generates a fair amount of heat of its own, so I left the fans. And as you say, it would have taken time. 
> > What's a rough order of magnitude cost on the blower/VFD? Does this VFD have > a "servo" input that could be used to automatically change blower speed in > response to temperature? (some VFDs have an analog input that can be used to > set up a speed = linear function of input voltage, and by cleverly setting > the gain and offset, you can do quite nicely) The blower was about $1300, the Teco inverter about $1000. The inverter has an rs232 interface for computer control, which I don't use at present, but hope to some day. It also has the analog control feature you describe. From hahn at physics.mcmaster.ca Mon Oct 4 21:53:04 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] ethernet switch, dhcp question In-Reply-To: <265814864.1096929091119.JavaMail.osg@osgjas04.cns.ufl.edu> Message-ID: > run several programs but the performance seems a little slow. All > the computers in my lab are connected directly to the campus > network. Would I see an increase in performance if I instead had > slaves connected through a switch in my room connected to a master > computer using dhcp to assign ip's? possibly - I think you should ssh to your slaves and look at /proc/loadavg while running the MPI program. (actually, I usually run "vmstat 1" on slaves, since it aggregates lots of other potentially valuable information.) if your network is a bottleneck, slaves will be not-fully-busy. if your campus network is 10 or 100bT or not full-duplex, that's very likely the case. if your campus net is gigabit, then I would be surprised to see much improvement by using a local switch (assuming your lab machines are plugged into the same campus-owned switch). if your lab machines are not all equivalent in speed, or if your MPI problem is not well-balanced, I'd expect to see some nodes busy and others not. similarly, if there are pesky users running netscape on some nodes, that's probably going to hurt (assuming your code is fairly tight-coupled.) regards, mark hahn. From hahn at physics.mcmaster.ca Mon Oct 4 21:56:38 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Dual Boot in Master and Client In-Reply-To: Message-ID: > I would think you could do this with Warewulf. I didn't look, but I suspect warewulf uses all the usual open-source tools. > To switch from Linux to Windows, turn off the dhcpd server on if warewulf uses pxelinux, you can much more nicely configure particular nodes to boot with particular default configs, including different kernels, windows, etc by providing per-node pxelinux.conf/ files. > To switch back, you would turn on the dhcpd server on the master, > and then using some "unknown-to-me" windows utility to remotely I suspect that cygwin and ssh could do this nicely. but being the blunt-object sort of guy, I'd rather reset the windows machines remotely via IPMI-over-lan ;) From eugen at leitl.org Tue Oct 5 08:04:50 2004 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Cray XD1 out Message-ID: <20041005150450.GW1457@leitl.org> http://investors.cray.com/phoenix.zhtml?c=98390&p=irol-newsArticle&ID=622736&highlight= Cray Announces General Availability of the Cray XD1 Opteron/Linux-based Supercomputer SEATTLE--(BUSINESS WIRE)--Oct. 4, 2004--Global supercomputer leader Cray Inc. (Nasdaq:CRAY) today announced the general availability of the new Cray XD1(TM) supercomputer, an Opteron/Linux-based system priced from under $100,000 to about $2 million (U.S. 
list) that handily outperforms similarly priced Linux clusters. The company also announced the United States Department of Agriculture Forest Service is a Cray XD1 customer, which adds to an impressive list of early users, including the Ohio Supercomputer Center, the Pacific Northwest National Laboratory (PNNL), Germany's Helmut Schmidt University and the SAHA Institute of Nuclear Physics (Calcutta, India). "Tracking the evolving chemical composition of a smoke plume produces a task so computationally intense that we assumed we would not be able to afford any computer capable of performing it," said Bryce Nordgren, Physical Scientist with the Forest Service's Fire Science Lab. "Reviewing the test case results from Cray restored our hope that we would be able to perform a scientifically meaningful simulation on our budget. We were particularly impressed with the Cray XD1's awesome scalability on this challenging interdisciplinary problem." The Cray XD1 supercomputer is ideal for the special needs of high-performance computing (HPC) applications used by government and academia, as well as computer-aided engineering (CAE) in the aerospace, automotive and marine industries; weather forecasting and climate modeling; petroleum exploration; financial modeling; and life sciences research. "We evaluated many proposals from leading IT companies and decided on Cray because of the Cray XD1 system's excellent price-to-performance ratio," said Professor Hendrik Rothe, chair of Helmut Schmidt University's Laboratory for Measurement and Information Technology. According to Rich Partridge, Enterprise Systems analyst with D.H. Brown Associates, "With the XD1, Cray leverages its strong heritage to bring highly parallel, affordable supercomputing to a broad market of industrial, government and academic users. The Cray XD1 is not merely an Opteron/Linux parallel system; it is a 'Cray,' and that makes all the difference. This is a true supercomputer, with balanced performance that commodity designs just cannot achieve." About the Cray XD1 Supercomputer The Cray XD1 features the direct connect processor (DCP) architecture, which removes PCI bottlenecks and memory contention to deliver superior sustained performance. According to the HPC Challenge benchmarks, the Cray XD1 has the lowest latency of any HPC system, with MPI latency of 1.8 microseconds and random ring latency of 1.3 microseconds. Tests conducted by the Ohio Supercomputer Center show that the Cray XD1 ships messages with four times lower MPI latency than common cluster interconnects such as Infiniband, Quadrics or Myrinet, and 30 times lower than Gigabit Ethernet employed in lowest-cost clusters. The Cray XD1's interconnect delivers twice the bandwidth of 4X Infiniband for messages up to 1 KB and 60 percent higher throughput for very large messages. The Linux/Opteron system runs x86 32/64 bit codes. Field programmable gate arrays (FPGAs) are available to accelerate applications, and the Active Manager subsystem provides single system command and control and high availability features. A 3VU (5.25") chassis provides 12 compute processors, 58 peak gigaflops, 96 GB/second aggregate switching capacity, 1.8-microsecond MPI interprocessor latency, 84 GB maximum memory and 1.5 TB maximum disk storage. A 12-chassis rack provides 144 compute processors, 691 peak gigaflops, 1TB/second aggregate switching capacity, 2 microsecond MPI interprocessor latency, 922 GB/second aggregate memory bandwidth, 1 TB maximum memory and 18 TB maximum disk storage. 
About Cray Inc. The world's leading supercomputer company, Cray Inc. pioneered high-performance computing with the introduction of the Cray-1 in 1976. The only company dedicated to meeting the specific needs of HPC users, Cray designs and manufactures supercomputers used by government, industry and academia worldwide for applications ranging from scientific research to product design, testing to manufacturing. Cray's diverse product portfolio delivers superior performance, scalability and reliability to the entire HPC market, from the high-end capability user to the department workgroup. For more information, go to www.cray.com. Safe Harbor Statement This press release contains forward-looking statements. There are certain factors that could cause Cray's execution plans to differ materially from those anticipated by the statements above. These include the successful porting of application programs to Cray systems and general economic and market conditions. For a discussion of these and other risks, see "Factors That Could Affect Future Results" in Cray's most recent Quarterly Report on Form 10-Q filed with the SEC. Cray is a registered trademark, and Cray XD1 is a trademark, of Cray Inc. All other trademarks are the property of their respective owners. CONTACT: Cray Inc. Victor Chynoweth, 206-701-2280 (Investors) victorc@cray.com or Steve Conway, 651-592-7441 (Media) sttico@aol.com SOURCE: Cray Inc. "Safe Harbor" Statement under the Private Securities Litigation Reform Act of 1995: Statements in this press release regarding Cray Inc.'s business which are not historical facts are "forward-looking statements" that involve risks and uncertainties. For a discussion of such risks and uncertainties, which could cause actual results to differ from those contained in the forward-looking statements, see "Risk Factors" in the Company's Annual Report or Form 10-K for the most recently ended fiscal year. -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041005/dffedbee/attachment.bin From kc7rad at radstream.com Mon Oct 4 20:04:45 2004 From: kc7rad at radstream.com (Ken Linder (kc7rad)) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] ethernet switch, dhcp question References: <265814864.1096929091119.JavaMail.osg@osgjas04.cns.ufl.edu> Message-ID: <00a101c4aa88$11843e10$4f32a8c0@kc7rad> Paul, I don't think the DHCP has much to do with mpich performance... I think the only additional overhead it adds is when the node computer first comes on-line. It just sends a request to the DHCP server for an IP address (and gateway, netmask, etc... I think :-) Now, what WILL affect your performance is using your campus network. You are relying on the cable and infrastructure of that network. Your cluster is also competing for resources with all other network users, at the switch level. One could argue that this delay is negligable but if you add all the delays that may exist in a small public network, it could be considerable. I recommend you spend a little money and get yourself a switch. I just saw an 8-port on e-bay for $13. I suggest you make your own cables. 
For me anyway, it is an almost cathartic release for me to build my own cables. :-) With your own switch, you control the traffic. If you do this and response time is still slow, at least you have your own network to analyze. Ken www.radstream.com ----- Original Message ----- From: "JOHNSON,PAUL C" To: Sent: Monday, October 04, 2004 4:31 PM Subject: [Beowulf] ethernet switch, dhcp question > All: > > Im fairly new to beowulf clusters so please excuse the question if > it is trivial. Ive installed mpich on several computers and have > run several programs but the performance seems a little slow. All > the computers in my lab are connected directly to the campus > network. Would I see an increase in performance if I instead had > slaves connected through a switch in my room connected to a master > computer using dhcp to assign ip's? > Thanks for any help, > Paul > > -- > JOHNSON,PAUL C From joachim at ccrl-nece.de Tue Oct 5 00:46:07 2004 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] myrinet (scali) or ethernet In-Reply-To: <20041004204709.89858.qmail@web25402.mail.ukl.yahoo.com> References: <20041004204709.89858.qmail@web25402.mail.ukl.yahoo.com> Message-ID: <4162513F.2070701@ccrl-nece.de> Patricia wrote: > Hi People, > > I am user of two clusters: One runs under myrinet and > the other under scali. In both cases I installed my > software to run under each of them (but not ethernet). > All I want to know is how to check whether my parallel > jobs are indeed running under myrinet (scali) or > ethernet. For Myrinet, you can check if an MPI-programm linked against the MPICH-GM library runs correctly. With Scali ("MPI Connect"), it's more complicated as it can fallback to Ethernet if the other interconnect (SCI?) does not work. Just run a ping-pong benchmark to measure latency (there's one included with Scali MPI), and if you get < 10us latency, you are not using Ethernet. Next to this, there should also be diagnostic tools included. Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From patrick at myri.com Tue Oct 5 05:05:29 2004 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] myrinet (scali) or ethernet In-Reply-To: <20041004204709.89858.qmail@web25402.mail.ukl.yahoo.com> References: <20041004204709.89858.qmail@web25402.mail.ukl.yahoo.com> Message-ID: <41628E09.1020809@myri.com> Hi Patricia, Patricia wrote: > I am user of two clusters: One runs under myrinet and > the other under scali. In both cases I installed my Myrinet is hardware and Scali makes software. Do you run Scali's software on Myrinet ? > jobs are indeed running under myrinet (scali) or > ethernet. Do you link with Scali's software or MPICH-GM ? If it's MPICH-GM, binaries will only run on Myrinet. If it's Scali, I don't know, I guess it chooses Myrinet or Ethernet at runtime. In this case, you can look at the output of gm_board_info on a node where your job is running, and see if any of the PIDs and the command lines of programs using a GM port matches your application process. It may still be possible that Scali opens a GM port without using it. Another solution would be to unplug a few nodes but Scali may be able to use Ethernet only for the nodes where Myrinet has been unplugged. You can also look at the GM counters (with gm_counters) and see if the number of packets sent/received goes up. 
However, you would not be sure if another process is using Myrinet at that time or if IP/Myrinet is up and running too. I guess there should be a way with Scali to know which device is used at runtime, but I really don't know how. Is it the same problem than the Myricom Help ticket #30885 ? Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From john.hearns at clustervision.com Tue Oct 5 07:22:32 2004 From: john.hearns at clustervision.com (John Hearns) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Dual Boot in Master and Client In-Reply-To: References: <04b001c4aa07$60aab630$39140897@PMORND> Message-ID: <1096986151.17462.21.camel@vigor12> On Tue, 2004-10-05 at 00:32, Tim Mattox wrote: > Hi Rajiv, > I would think you could do this with Warewulf. > http://warewulf-cluster.org/ > Just make the BIOS on each node first attempt to boot with PXE, > and upon PXE failure, boot from a locally installed Windows on > the node's hard drive. That's a good idea. In addition to that, I saw this project for Windows installs: http://unattended.sourceforge.net/ Depends if you want to re-install, or quickly boot an already installed setup. Your suggestion is probably better. > To switch from Linux to Windows, turn off the dhcpd server on > the master, and reboot the nodes. They should then come up > in Windows. > To switch back, you would turn on the dhcpd server on the master, > and then using some "unknown-to-me" windows utility to remotely > reboot the nodes, It seems possible to use Samba to do this, using the "net rpc shutdown" command. So assuming you run Samba on your head node you could probably reboot your master mode from Windows to Linux then issue this command. From tmattox at gmail.com Tue Oct 5 07:32:29 2004 From: tmattox at gmail.com (Tim Mattox) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Dual Boot in Master and Client In-Reply-To: References: Message-ID: Hi Mark, On Tue, 5 Oct 2004 00:56:38 -0400 (EDT), Mark Hahn wrote: > > I would think you could do this with Warewulf. > > I didn't look, but I suspect warewulf uses all the usual > open-source tools. Not sure what you mean exactly, but Warewulf itself is GPL'ed, and it leverages things like yum, rpm, pxelinux or Etherboot, rsync, etc. You can install whatever Beowulf tools you want such as LAM MPI, SGE, and pdsh for a small list of examples. If you have an RPM of whatever package, it's easy to install for the nodes. If you have a SRPM it takes just a few more steps. > > To switch from Linux to Windows, turn off the dhcpd server on > > if warewulf uses pxelinux, you can much more nicely configure > particular nodes to boot with particular default configs, > including different kernels, windows, etc by providing per-node > pxelinux.conf/ files. For now, Warewulf automatically creates those config files for pxelinux to do it's thing... so your custom configs would get clobbered by Warewulf when it generates it's own. Similarly, Warewulf rebuilds the dhcpd.conf file based it's node "database" and config files. I don't foresee putting any effort myself into making a dual boot into Window's a config option for Warewulf directly. However, if someone really needs this functionality, I doubt Greg (gmk to his friends) or I would reject their contributed code, as long as it was general enough to support dual/multi-booting into a locally installed Linux on nodes as well. 
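(For readers who have not run into the per-node pxelinux.cfg mechanism mentioned above, the sketch below shows roughly what "providing per-node files" amounts to. It is not Warewulf code, and as noted Warewulf would clobber hand-written files like these when it regenerates its own configs; the TFTP path, node names, IPs and kernel names are made-up placeholders. pxelinux looks for a config file named after the booting client's IP address in uppercase hex, and LOCALBOOT 0 hands control back to the local disk, e.g. an installed Windows.)

#!/usr/bin/env python
# Illustrative sketch only: write one pxelinux.cfg file per node.
# Paths, node names, IPs and kernel names below are hypothetical.
import os
import socket
import struct

TFTP_DIR = "/tftpboot/pxelinux.cfg"   # assumed TFTP layout, adjust to taste

# node name -> (IP address, boot target): "linux" = network-boot a kernel,
# "local" = fall through to whatever is installed on the node's disk
NODES = {
    "node01": ("10.0.0.11", "linux"),
    "node02": ("10.0.0.12", "local"),
}

LINUX_CFG = """DEFAULT linux
LABEL linux
  KERNEL vmlinuz
  APPEND initrd=initrd.img root=/dev/ram0
"""

LOCAL_CFG = """DEFAULT local
LABEL local
  LOCALBOOT 0
"""

def hex_ip(ip):
    """pxelinux looks for a file named after the client IP in uppercase hex,
    e.g. 10.0.0.11 -> 0A00000B."""
    return "%08X" % struct.unpack("!I", socket.inet_aton(ip))[0]

def main():
    for name, (ip, target) in sorted(NODES.items()):
        cfg = LINUX_CFG if target == "linux" else LOCAL_CFG
        path = os.path.join(TFTP_DIR, hex_ip(ip))
        with open(path, "w") as f:
            f.write(cfg)
        print("%s (%s) -> %s [%s]" % (name, ip, path, target))

if __name__ == "__main__":
    main()

Rewriting a node's one small file and rebooting it is then all it takes to flip that node between network-booted Linux and the locally installed OS.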
-- Tim Mattox - tmattox@gmail.com - http://homepage.mac.com/tmattox/ From jakob at unthought.net Tue Oct 5 07:56:53 2004 From: jakob at unthought.net (Jakob Oestergaard) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] ammonite In-Reply-To: References: <20040927182134.GA23662@piskorski.com> <20040927235904.GA21014@piskorski.com> Message-ID: <20041005145653.GR18307@unthought.net> On Mon, Oct 04, 2004 at 03:19:37PM -0700, Jack Wathey wrote: > > The ammonite cluster can now be seen in pictures and words at > > http://jessen.ch/ammonite/ Cool! On the SMP/UP problem: Do you have ACPI support in your kernel? Newer kernels can use ACPI for parsing SMP information from the motherboard, rather than guess-working on the old MP tables at magic locations in memory. This *could* be worth a shot I think. I am 100% sure that your SMP/UP problem has *nothing* to do, what so ever, with NFS server contention - either your kernel loads and boots, or it doesn't load and boot. The kernel does not use NFS (or local disk) during the early stages of boot, where the processors are set up. NFS problems would result in a failed boot, not a missing CPU. Alternatively, I'd try out 2.4.27 (or 2.6.8.1 if you're feeling lucky), your 2.4.20 kernel is *really* dated. Cheers, -- / jakob From rgb at phy.duke.edu Tue Oct 5 09:17:29 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] OT: effective amount of data through gigabit ether? In-Reply-To: <200410041413.57525.mwill@penguincomputing.com> Message-ID: On Mon, 4 Oct 2004, Michael Will wrote: > On Monday 04 October 2004 01:40 pm, Mike wrote: > > I know this is off topic, but I've not found an answer anywhere. > > On one IBM doc it says the effective throughput for 10Mb/s is > > 5.7GB/hour, 100Mb/s is 17.6GB/hour, but only lists TBD for 1000MB/s. > > I would assume 900Mb/s as an optimistic best case throughput for the GigE, > which would be about 395GB/hour. > > 17.6GB/hour seems like a really low estimate, that would be only > about 40Mb/s effective transfer rate over an 100Mb/s link? Maybe > that number is really measuring the tape writing speed instead? I agree with Michael (and Sean) here -- it is pretty straightforward to compute a theoretical peak bandwidth -- just convert the Mbps into MBps by dividing by 8. 10 Mbps ethernet can thus manage 1.25 MBps, 100 Mbps -> 12.5 MBps, 1000 Mbps -> 125 MBps. Most people don't consider the packet headers to be part of the "throughput" -- with a standard ethernet MTU of 1500 bytes + 18 bytes ethernet header, take away 64 bytes for TCP/IP header, one has at most 1436/1518 or 94.6% of peak This leaves a STILL theoretical peak of 1.18 x 10^{n-1} MBps (where n is the log_10 of the raw BW in Mbps). One reason people like to use switches and NICs that support oversize packets is that doing so reduces both this 5.4% chunk of header-based overhead and the ordinarily "invisible" mandatory pause between packets, per packet, letting you get a bit closer to peak. On top of this, switches and so forth will typically add a small bit of latency per packet on top of the minimum interpacket interval, and the TCP stack on both ends of the connection will add another slice of latency per packet. These are typically of order 50-200 microseconds (which end of this wide range you see depending on lots of things like switch quality and load, NIC quality and type, CPU speed and load, OS revision, and probably whether or not it is a Tuesday). 
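Putting the arithmetic above into a short script makes it easy to re-run for other MTUs or efficiency guesses. A minimal sketch (Python; it reuses the 1518/1436-byte figures from above, and the 0.90 factor is just the rough real-world efficiency being discussed here, not a measurement):

#!/usr/bin/env python
# Back-of-the-envelope wire speed -> payload MB/s and GB/hour, using the
# same numbers as the discussion above (1500-byte MTU, 18 bytes of Ethernet
# framing, 64 bytes of TCP/IP headers per packet).

ETH_OVERHEAD = 18      # Ethernet header + CRC, bytes
IP_TCP_HDRS  = 64      # TCP/IP headers per packet, bytes (figure used above)

def effective_rate(link_mbps, mtu=1500, efficiency=0.90):
    """Return (MB/s, GB/hour) of useful payload for a given link speed."""
    wire_bytes_per_sec = link_mbps * 1e6 / 8.0           # raw bytes on the wire
    payload_fraction   = (mtu - IP_TCP_HDRS) / float(mtu + ETH_OVERHEAD)
    mb_per_sec = wire_bytes_per_sec * payload_fraction * efficiency / 1e6
    return mb_per_sec, mb_per_sec * 3600.0 / 1000.0      # 3600 s/hour, 1000 MB/GB

if __name__ == "__main__":
    for mbps in (10, 100, 1000):
        mbs, gbh = effective_rate(mbps)
        print("%4d Mbps: ~%6.1f MB/s, ~%6.1f GB/hour" % (mbps, mbs, gbh))

For GbE this prints roughly 106 MB/sec and about 383 GB/hour, in the same ballpark as the other estimates in this thread.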
By the time all is said and done, one usually ends up with somewhere between 80% and 94% of theoretical peak, or 1.0 x 10^{n-1} and 1.174 x 10^{n-1} MBps, although for particularly poor NICs I've done worse than 80% in years past. Higher numbers (closer to theoretical peak, as noted) for giant packets. This is likely to be the relevant estimate for moving large data files. For moving SMALL data files or messages, one moves from bandwidth-dominated communications to latency-dominated communications bottlenecks. For small messages (say, less than 1K for the sake of argument) the "bandwidth" increasingly is simply the size of the data portion of the packet times the number of packets per second your interface can manage. To convert into GBph is trivial: there are 3600 seconds/hour, and 1000 MB in one GB, so multiplying the numbers above by 3.6 seems in order. This gives a theoretical peak in the ballpark of 4.25 x 10^{n - 1} GBph for a standard MTU (higher with large packets), a probable real world peak more like 4.05 x 10^{n - 1} GBph at 90% of wirespeed. FWIW, I think that GbE tends to perform closer to its theoretical peak than do the older 10 or 100 BT. This is both because it is much more likely that the interfaces and switches will handle large frames and because the hardware tends to be more expensive and better built, with more attention paid to details like how things are cached and DMA that can make a big difference in overall performance efficiency. Hope this helps, although as has already been noted (and will likely be noted again:-) the network isn't necessarily going to be the rate limiting bottleneck for backup. rgb > > Michael Will > > Does anyone know what this effective number is? This is for > > calculating how long backups should take through my backup network. > > > > (I'm not interested in how long it takes to read/write the disk, > > just the network throughput.) > > > > Mike > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Tue Oct 5 09:42:23 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] ethernet switch, dhcp question In-Reply-To: <265814864.1096929091119.JavaMail.osg@osgjas04.cns.ufl.edu> Message-ID: On Mon, 4 Oct 2004, JOHNSON,PAUL C wrote: > All: > > Im fairly new to beowulf clusters so please excuse the question if > it is trivial. Ive installed mpich on several computers and have > run several programs but the performance seems a little slow. All > the computers in my lab are connected directly to the campus > network. Would I see an increase in performance if I instead had > slaves connected through a switch in my room connected to a master > computer using dhcp to assign ip's? Quite probably, depending on how your campus is networked. A local switch is the preferred method of building a cluster. If you are using a 100 BT network and only have a small cluster, a 100BT switch is so cheap it is almost a non-issue. Even if you DO leave them connected to the campus network, if you do this by interconnecting your switch and the campus network you'll likely see a performance increase.
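Before rewiring, it is also worth confirming that the network (and not the nodes themselves) really is the bottleneck. As Mark suggested earlier in this thread, watching the per-node load while a job runs usually settles it; a minimal sketch (assuming passwordless ssh from the head node and Python; the hostnames are placeholders):

#!/usr/bin/env python
# Poll the 1-minute load average on each slave while an MPI job is running.
import subprocess

NODES = ["node01", "node02", "node03", "node04"]   # placeholder hostnames

def loadavg(host):
    """Return the 1-minute load average reported by host, or None on error."""
    try:
        out = subprocess.check_output(["ssh", host, "cat", "/proc/loadavg"],
                                      universal_newlines=True, timeout=10)
        return float(out.split()[0])
    except Exception:
        return None

if __name__ == "__main__":
    for host in NODES:
        load = loadavg(host)
        print("%-10s 1-min load: %s" %
              (host, "unreachable" if load is None else "%.2f" % load))

If the load sits well below one per CPU while your job runs, the slaves are waiting on communication and a dedicated switch is likely to help.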
Putting them on a private network gives you an even quieter networking environment and better control over the network, but yes, it will make you learn a whole bunch of things (e.g. DHCP/PXE and more) to get it right. If you go this route, I'd strongly urge that you get PXE-capable network cards and set up fully automated installation and booting at the same time. You'll spend a month learning all sorts of complicated networking, but at the end of it you'll REALLY save time on installation, operation, and so forth and your cluster will be upwardly scalable in size with very little additional effort. rgb > Thanks for any help, > Paul > > -- > JOHNSON,PAUL C > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From wathey at salk.edu Tue Oct 5 14:40:24 2004 From: wathey at salk.edu (Jack Wathey) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] ammonite In-Reply-To: <20041005145653.GR18307@unthought.net> References: <20040927182134.GA23662@piskorski.com> <20040927235904.GA21014@piskorski.com> <20041005145653.GR18307@unthought.net> Message-ID: On Tue, 5 Oct 2004, Jakob Oestergaard wrote: > On the SMP/UP problem: > > Do you have ACPI support in your kernel? Newer kernels can use ACPI for > parsing SMP information from the motherboard, rather than guess-working > on the old MP tables at magic locations in memory. This *could* be > worth a shot I think. Here are what I suspect are the relevant lines from the .config file: CONFIG_ACPI=y CONFIG_ACPI_DEBUG=y CONFIG_ACPI_BUSMGR=y CONFIG_ACPI_SYS=y CONFIG_ACPI_CPU=y CONFIG_ACPI_BUTTON=y CONFIG_ACPI_AC=y CONFIG_ACPI_EC=y CONFIG_ACPI_CMBATT=y CONFIG_ACPI_THERMAL=y So I guess the answer is 'yes'. > I am 100% sure that your SMP/UP problem has *nothing* to do, what so > ever, with NFS server contention - either your kernel loads and boots, > or it doesn't load and boot. The kernel does not use NFS (or local > disk) during the early stages of boot, where the processors are set up. > NFS problems would result in a failed boot, not a missing CPU. > > Alternatively, I'd try out 2.4.27 (or 2.6.8.1 if you're feeling lucky), > your 2.4.20 kernel is *really* dated. I'll try updating the kernel when I get a chance. It is, as you say, rather old now. Thanks, Jack From haavardw at ifi.uio.no Tue Oct 5 23:52:53 2004 From: haavardw at ifi.uio.no (=?ISO-8859-1?Q?H=E5vard_Wall?=) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] myrinet (scali) or ethernet In-Reply-To: <4162513F.2070701@ccrl-nece.de> References: <20041004204709.89858.qmail@web25402.mail.ukl.yahoo.com> <4162513F.2070701@ccrl-nece.de> Message-ID: <41639645.3090306@ifi.uio.no> Joachim Worringen wrote: > Patricia wrote: > With Scali ("MPI Connect"), it's more complicated as it can fallback to > Ethernet if the other interconnect (SCI?) does not work. Just run a > ping-pong benchmark to measure latency (there's one included with Scali > MPI), and if you get < 10us latency, you are not using Ethernet. Next to > this, there should also be diagnostic tools included. > With Scampi MPI Connect, you can check which interconnects are in use by setting the environment variable SCAMPI_NETWORKS_VERBOSE=2. 
It is true that scampi will try to fallback to another interconnect if the primary fails. The interconnects used is listed in /opt/scali/etc/ScaMPI.conf. You may override this by setting the environment variable SCAMPI_NETWORKS (or use the -net switch with mpimon). For example SCAMPI_NETWORKS="smp,sci,tcp" will first try communication through shared memory, then SCI, and at last (tcp) if this fails. -- hw From Hakon.Bugge at scali.com Tue Oct 5 23:54:34 2004 From: Hakon.Bugge at scali.com (=?iso-8859-1?Q?H=E5kon?= Bugge) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] myrinet (scali) or ethernet In-Reply-To: <200410051551.i95FofLX020150@bluewest.scyld.com> References: <200410051551.i95FofLX020150@bluewest.scyld.com> Message-ID: <6.1.2.0.0.20041006083354.03f4a6f0@elin.scali.no> At 05:51 PM 10/5/04, Patrick Geoffray wrote: >Patricia wrote: > > I am user of two clusters: One runs under myrinet and > > the other under scali. In both cases I installed my > >Myrinet is hardware and Scali makes software. Do you run Scali's >software on Myrinet ? Patricia, would be nice to know if you run Scali MPI Connect (SMC) or some older ancient versions. If you run SMC, it would be nice to know if you run: o) Gbe through the TCP/IP stack (-net tcp) o) Gbe through the Direct Ethernet Transport (-net det0) o) Myrinet (-net gm0) o) Infiniband (-net ib0) o) 10Gbe through 3rd party DAPLs (network name will vary) I assume you do not run combinations of the above although that is possible. I quick sanity check of your system could be to run sample benchmarks to assess the system(s) performance (latency and bandwidth). I usually use bandwidth and all2all, both located in /opt/scali/examples/bin. Another obvious check is to see if the system is idle when it is supposed to be idle. You could use the utility scatop for that purpose. Another way out is of course mailto:support _AT_ scali.com >[snip] > >I guess there should be a way with Scali to know which device is used at >runtime, but I really don't know how. # SCAMPI_NETWORKS_VERBOSE=2 mpimon -net smp,gm0,tcp /opt/scali/examples/bin/hello -- `scahosts` Hakon Bugge Hakon.Bugge _ AT_ scali.com From anandv at singnet.com.sg Wed Oct 6 03:41:50 2004 From: anandv at singnet.com.sg (Anand Vaidya) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Another bare motherboard cluster in a box Message-ID: <200410061841.51046.anandv@singnet.com.sg> Another bare motherboard based cluster, with Via CPU. Found this one on via arena "I found myself deeply interested (disturbed?) in the latest clustering software called HPC (High Performance Clustering). Most of the software is Linux based therefore free to download but I still needed an actual cluster to run the stuff on. Rather than standing in line somewhere like Stanford for a brief encounter with a cluster I went about building my own." http://www.slipperyskip.com/page10.html From Umesh.Chaurasia at siemens.com Wed Oct 6 04:38:24 2004 From: Umesh.Chaurasia at siemens.com (Chaurasia Umesh) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Linux memory leak? Message-ID: Hello, I am Umesh Chaurasia working in Siemens. I got your mail id from Linux forum . We have developed application on Linux 7.2 Kernel 2.4. System H/W configuration is 1 GB RAM, P-4, 2.4 GHZ. When we are putting our system on load after whole night we found only 5 MB memory left whereas in start it was 800 MB. Is there any patch or special configuration required to save the memory. Your input will really help us to build our system for Linux plateform. 
Regards, Umesh Chaurasia From rgb at phy.duke.edu Wed Oct 6 07:42:46 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage Message-ID: Dear List, I'm turning to you for some top quality advice as I have so often in the past. I'm helping assemble a grant proposal that involves a grid-style cluster with very large scale storage requirements. Specifically, it needs to be able to scale into the 100's of TB in "central disk store" (whatever that means:-) in addition to commensurate amounts of tape backup. The tape backup is relatively straightforward -- there is a 100 TB library available to the project already that will hold 200 TB after an LTO1->LTO2 upgrade, and while tapes aren't exactly cheap, they are vastly cheaper than disk in these quantities. The disk is a real problem. Raw disk these days is less than $1/GB for SATA in 200-300 GB sizes, a bit more for 400 GB sizes, so a TB of disk per se costs in the ballpark of $1000. However, HOUSING the disk in reliable (dual power, hot swap) enclosures is not cheap, adding RAID is not cheap, and building a scalable arrangement of servers to provide access with some controllable degree of latency and bandwidth for access is also not cheap. Management requirements include 3 year onsite service for the primary server array -- same day for critical components, next day at the latest for e.g. disks or power supplies that we can shelve and deal with ourselves in the short run. The solution we adopt will also need to be scalable as far as administration is concerned -- we are not interested in "DIY" solutions where we just buy an enclosure and hang it on an over the counter server and run MD raid, not because this isn't reliable and workable for a departmental or even a cluster RAID in the 1-8 TB range (a couple of servers) it isn't at all clear how it will scale to the 10-80 TB range, when 10's of servers would be required. Management of the actual spaces thus provided is not trivial -- there are certain TB-scale limits in linux to cope with (likely to soon be resolved if they aren't already in the latest kernels, but there in many of the working versions of linux still in use) and with an array of partitions and servers to deal with, just being able to index, store and retrieve files generated by the compute component of the grid will be a major issue. SO, what I want to know is: a) What are listvolken who have 10+ TB requirements doing to satisfy them? b) What did their solution(s) cost, both to set up as a base system (in the case of e.g. a network appliance) and c) incremental costs (e.g. filled racks)? d) How does their solution scale, both costwise (partly answered in b and c) and in terms of management and performance? e) What software tools are required to make their solution work, and are they open source or proprietary? f) Along the same lines, to what extent is the hardware base of their solution commodity (defined here as having a choice of multiple vendors for a component at a point of standardized attachment such as a fiber channel port or SCSI port) or proprietary (defined as if you buy this solution THIS part will always need to be purchased from the original vendor at a price "above market" as the solution is scaled up). Rules: Vendors reply directly to me only, not the list. I'm in the market for this, most of the list is not. 
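As a back-of-the-envelope aid to the scaling question, the sketch below turns a target usable capacity into box, drive and raw-disk-cost counts. Every number in it (drive size, $/GB, drives per enclosure, parity/spare overhead) is an illustrative assumption loosely based on the figures quoted above, not a quote from anyone; the point it makes is that raw disk is the cheap part, and the enclosures, servers and administration it deliberately omits are where questions a) through f) bite.

#!/usr/bin/env python
# Rough sizing of a disk store from commodity parts; all inputs are
# illustrative assumptions, not vendor figures.
import math

TARGET_TB      = 100     # usable capacity wanted
DRIVE_GB       = 300     # assumed SATA drive size
DOLLARS_PER_GB = 1.0     # assumed raw disk price (as above)
DRIVES_PER_BOX = 16      # assumed drives per server/enclosure
RAID_PARITY    = 2       # parity drives per box (e.g. RAID-6), plus...
HOT_SPARES     = 1       # ...hot spares per box

def plan(target_tb=TARGET_TB):
    usable_per_box_gb = (DRIVES_PER_BOX - RAID_PARITY - HOT_SPARES) * DRIVE_GB
    boxes  = int(math.ceil(target_tb * 1000.0 / usable_per_box_gb))
    drives = boxes * DRIVES_PER_BOX
    raw_disk_cost = drives * DRIVE_GB * DOLLARS_PER_GB
    return boxes, drives, usable_per_box_gb, raw_disk_cost

if __name__ == "__main__":
    boxes, drives, per_box, cost = plan()
    print("usable per box : %.1f TB" % (per_box / 1000.0))
    print("boxes needed   : %d" % boxes)
    print("drives needed  : %d" % drives)
    print("raw disk cost  : ~$%.0f (excludes enclosures, RAID, servers)" % cost)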
Note also that I've already gotten a decent picture of at least two or three solutions offered by tier 1 cluster vendors or dedicated network storage vendors although I'm happy to get more. However, I think that beowulf administrators, engineers, and users should likely answer on list as the real-world experiences are likely to be of interest to lots of people and therefore would be of value in the archives. I'm hoping that some of you bioinformatics people have experience here, as well as maybe even people like movie makers. FWIW, the actual application is likely to be Monte Carlo used to generate huge data sets (per node) and cook them down to smaller (but still multiGB) data sets, and hand them back to the central disk store for aggregation and indexed/retrievable intermediate term storage, with migration to the tape store on some as yet undetermined criterion for frequency of access and so forth. Other uses will likely emerge, but this is what we know for now. I'd guess that bioinformatics and movie generation (especially the latter) are VERY similar in the actual data flow component and also require multiTB central stores and am hoping that you have useful information to share. Thanks in advance, rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From john.hearns at clustervision.com Wed Oct 6 09:14:38 2004 From: john.hearns at clustervision.com (John Hearns) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Linux memory leak? In-Reply-To: Message-ID: On Wed, 6 Oct 2004, Chaurasia Umesh wrote: > Hello, > > I am Umesh Chaurasia working in Siemens. I got your mail id from Linux forum > . > We have developed application on Linux 7.2 Kernel 2.4. System H/W > configuration is 1 GB RAM, P-4, 2.4 GHZ. > When we are putting our system on load after whole night we found only 5 MB > memory left whereas in start it was 800 MB. Are you SURE that you are not counting the buffer memory as used? Linux uses free memory as disk buffer, which is released on demand. Please send us your output from 'free' and 'vmstat' Also, and I hate to say this, I guess you mean Redhat 7.2 which is very long in the tooth. Redhat 9 is end-of-life. You should consider Redhat Enterprise or Fedora Core 1 for 2.4 series kernels. (And yes, From alvin at Mail.Linux-Consulting.com Thu Oct 7 00:42:35 2004 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage - housing 100TB In-Reply-To: Message-ID: hi ya robert our solution is scalable using off the shelf commodity parts and open source software - we also recommend a duplicate system for "live backups" - we can customize our products ( hardware solutions ) to fit the clients requirements and budget - example large 100TB disk-subsystem on 4 disks per blade ........ 1.2TB per blade with 300GB disks 10 blades per 4U chassis .... 12TB per 4U chassis 10 4U chassis per rack ...... 
120TB per 42U rack http://www.itx-blades.net/Dwg/BLADE.jpg - model shown holds 4 disks, but we can fit 8-disks in it http://www.itx-blades.net/Dwg/4U-BLADE.jpg - cooling ( front to back or top to bottom ) is our main concerns that we try to solve with one solution - system runs on +12V dc input - 2x 600W 2U powersupply is enough power for driving the system - i'd be more than happy to send a demo chassis and blades, no charge if we can get feedback that you've used it and built it out as you needed - hopefully you can provide the disks, mb, cpu, memory, - we can provide the "system assembly and testing time" at "evaluation" costs ( all fees credited toward the purchase ) - you keep the 4U chassis afterward ( no charge ) On Wed, 6 Oct 2004, Robert G. Brown wrote: > I'm helping assemble a grant proposal that involves a grid-style cluster > with very large scale storage requirements. Specifically, it needs to > be able to scale into the 100's of TB in "central disk store" (whatever > that means:-) in addition to commensurate amounts of tape backup. The good .. sounds like fun > tape backup is relatively straightforward -- there is a 100 TB library > available to the project already that will hold 200 TB after an > LTO1->LTO2 upgrade, and while tapes aren't exactly cheap, they are > vastly cheaper than disk in these quantities. - tape backups are not cheap ... - tape backups are not reliable ( to save the tapes and restore ) - dirty heads, tapes that need to be swapped, .. - tape backups are too slow ( to restore ) > The disk is a real problem. Raw disk these days is less than $1/GB for > SATA in 200-300 GB sizes, a bit more for 400 GB sizes, so a TB of disk > per se costs in the ballpark of $1000. yup.. good ball park > However, HOUSING the disk in > reliable (dual power, hot swap) enclosures is not cheap, adding RAID is > not cheap, it can be ... does it need to be dual-hot-swap power supplies ?? - no problem... we can provide that (though not a pretty "case" ) raid is cheap ... but why use raid ... there is no benefit to using software or hardware raid at this size data ... - time is better spent in optimizing data and backup of the data to a 2nd system - it is NOT trivial to backup 20TB - 100TB of data - raid'ing reduces the overall reliability ( more things to fail ) and increases the system admin costs ( more testing ) > and building a scalable arrangement of servers to provide > access with some controllable degree of latency and bandwidth for access > is also not cheap. not sure what the issues are .. - it'd depend on the switch/hub, and "disk subsystem/infrastructure" > Management requirements include 3 year onsite > service for the primary server array -- same day for critical > components, we'd be using a duplicate "hot swap backup system" > next day at the latest for e.g. disks or power supplies that > we can shelve and deal with ourselves in the short run. most everything we use is off the shelf and be kept on the shelf for emergencies power supplies, disks, motherboards, cpu, memory, fans > The solution we > adopt will also need to be scalable as far as administration is > concerned -- scaling is easy in our case ... 
> we are not interested in "DIY" solutions where we just buy > an enclosure and hang it on an over the counter server and run MD raid, we can build and test for you ( onsite if needed ) > not because this isn't reliable and workable for a departmental or even > a cluster RAID in the 1-8 TB range (a couple of servers) it isn't at all > clear how it will scale to the 10-80 TB range, when 10's of servers > would be required. we don't forecast any issues with sw raid ... on 4 disks per blade ........ 1.2TB per blade with 300GB disks 10 blades per 4U chassis .... 12TB per 4U chassis 10 4U chassis per rack ...... 120TB per 42U rack > Management of the actual spaces thus provided is not trivial actual data to save would be a bigger issue than the saving of it onto the disk subsystems > -- there > are certain TB-scale limits in linux to cope with (likely to soon be > resolved if they aren't already in the latest kernels, but there in many > of the working versions of linux still in use) and with an array of individual file size issues would limit the raw data one can save other way around it is to use custom device drivers like oracle that uses their own "raw data" drivers to get around file size limiations > partitions and servers to deal with, just being able to index, store and > retrieve files generated by the compute component of the grid will be a > major issue. that depends on how the data is created and stored ??? - we dont think it as a major issue, as long as each "TB-sized files" can be indexed properly at the time of its creation > SO, what I want to know is: > > a) What are listvolken who have 10+ TB requirements doing to satisfy > them? we prefer non-raided systems ... and duplicate disk-systems for backup > b) What did their solution(s) cost, both to set up as a base system > (in the case of e.g. a network appliance) and raw components is roughly $25K per 12TB in one 4U chassis http://www.itx-blades.net/Systems/ - add marketing/sales/admin/contract/onsite costs to it ( $250K for fully Managed - 3yr contracts w/ 2nd backup system ) http://www.itx-blades.net/Systems/ > c) incremental costs (e.g. filled racks)? the system is expandable as needed per 1.2TB blade or 12TB ( 4U chassis ) additional costs to intall additional blades into the disk-subsystem is incremental for the time needed to add its config to the existing config files for the disk subsystem ( fairly simple, since the rest of the system has already been tested and operational ) > d) How does their solution scale, both costwise (partly answered in b > and c) and in terms of management and performance? partly and answered above scalable solutions is accomplished with modular blades and blade chassis > e) What software tools are required to make their solution work, and > are they open source or proprietary? 
just the standard linux software raid tools in the kernel everything is open source > f) Along the same lines, to what extent is the hardware base of their > solution commodity (defined here as having a choice of multiple vendors everything is off-the-shelf we have the proprietory 4U blade chassis for "holding the blades" in place along with the power supply ( the system can be changed per customer requirements > for a component at a point of standardized attachment such as a fiber > channel port or SCSI port) fiber channel cards may be used if needed, but it'd require some reconfigurations - fiber channel PCI cards are expensive and it is unclear if its required or not > or proprietary (defined as if you buy this > solution THIS part will always need to be purchased from the original > vendor at a price "above market" as the solution is scaled up). everything is off-the-shelf > Rules: Vendors reply directly to me only, not the list. i was wondering why nobody replied publicly :-) > I'm in the > market for this, most of the list is not. Note also that I've already > gotten a decent picture of at least two or three solutions offered by > tier 1 cluster vendors or dedicated network storage vendors although I'm > happy to get more. i hope "name brand" is not the primary evaluation consideration > However, I think that beowulf administrators, engineers, and users > should likely answer on list as the real-world experiences are likely to > be of interest to lots of people and therefore would be of value in the > archives. I'm hoping that some of you bioinformatics people have > experience here, as well as maybe even people like movie makers. we've been indirectly selling small systems to the movie industry ( by the hundred's of systems ) .. its just a simple mpeg player :-) > FWIW, the actual application is likely to be Monte Carlo used to > generate huge data sets (per node) and cook them down to smaller (but > still multiGB) data sets, and hand them back to the central disk store > for aggregation and indexed/retrievable intermediate term storage, with good ... > migration to the tape store on some as yet undetermined criterion for > frequency of access and so forth. Other uses will likely emerge, but i'd avoid tape storage due to costs and index/restore/uptime issues > this is what we know for now. I'd guess that bioinformatics and movie > generation (especially the latter) are VERY similar in the actual data > flow component and also require multiTB central stores and am hoping > that you have useful information to share. have fun alvin From hahn at physics.mcmaster.ca Thu Oct 7 15:08:27 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage In-Reply-To: Message-ID: > that means:-) in addition to commensurate amounts of tape backup. The ick! our big-storage plans very, very much hope to eliminate tape. > tape backup is relatively straightforward -- there is a 100 TB library > available to the project already that will hold 200 TB after an > LTO1->LTO2 upgrade, and while tapes aren't exactly cheap, they are > vastly cheaper than disk in these quantities. hmm, LTO2 is $0.25/GB; disks are about double that. considering the issues of tape reliability, access time and migration, I think disk is worth it. from what I hear in the storage industry, this is a growing consensus among, for instance, hospitals - they don't want to spend their time reading tapes to see whether the media is failing and content needs to be migrated. 
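In the per-TB units used elsewhere in this thread, the media-only figures quoted above work out roughly as follows (a back-of-envelope sketch using only the numbers given here; drive housings, servers and tape robotics are extra):

    # media cost only, per TB, from the $/GB figures quoted above
    echo "LTO2 tape: \$$(( 25 * 1000 / 100 )) per TB"   # at ~$0.25/GB
    echo "raw SATA:  \$$(( 50 * 1000 / 100 )) per TB"   # at roughly double that

Those media numbers are an order of magnitude below the delivered cost of served, RAID-protected disk that comes up later in the thread, which is really what the comparison hinges on.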
migrating content that's online is ah, easier. in the $ world, online data is attractive in part so its lifetime can be more explicitly managed (ie, deleted!) > The disk is a real problem. Raw disk these days is less than $1/GB for > SATA in 200-300 GB sizes, a bit more for 400 GB sizes, so a TB of disk > per se costs in the ballpark of $1000. However, HOUSING the disk in > reliable (dual power, hot swap) enclosures is not cheap, adding RAID is > not cheap, and building a scalable arrangement of servers to provide > access with some controllable degree of latency and bandwidth for access > is also not cheap. no insult intended, but have you looked closely, recently? I did some quick web-pricing this weekend, and concluded: vendor capacity size $Cad list per TB density dell/emc 12x250 3U $7500 1.0 TB/U apple 14x250 3U $4000 1.166 hp/msa1500cs 12x250x4 10U $3850 1.2 (divide $Cad by 1.25 or so to get $US.) all three plug into FC. the HP goes up to 8 shelves per controller or 24 TB per FC port, though. > Management requirements include 3 year onsite > service for the primary server array -- same day for critical > components, next day at the latest for e.g. disks or power supplies that > we can shelve and deal with ourselves in the short run. The solution we pretty standard policies. > adopt will also need to be scalable as far as administration is > concerned -- we are not interested in "DIY" solutions where we just buy > an enclosure and hang it on an over the counter server and run MD raid, > not because this isn't reliable and workable for a departmental or even > a cluster RAID in the 1-8 TB range (a couple of servers) it isn't at all > clear how it will scale to the 10-80 TB range, when 10's of servers > would be required. Robert, are you claiming that 10's of servers are unmanagable on a *cluster* mailing list!?! or are you thinking of the number of moving parts? > Management of the actual spaces thus provided is not trivial -- there > are certain TB-scale limits in linux to cope with (likely to soon be > resolved if they aren't already in the latest kernels, but there in many > of the working versions of linux still in use) and with an array of I can understand and even emphathize with some people's desire to stick to old and well-understood kernels. but big storage is a very good reason to kick them out of this complacency - the old kernel are justifiable only on not-broke-don't-fix grounds... > partitions and servers to deal with, just being able to index, store and > retrieve files generated by the compute component of the grid will be a > major issue. how so? I find that people still use sensible hierarchical organization, even if the files are larger and more numerous than in the past. > a) What are listvolken who have 10+ TB requirements doing to satisfy > them? we're acquiring somewhere between .2 and 2 PB, and are planning machinrooms around the obvious kinds of building blocks: lots of servers that are in the say 4-20 TB range, preferably connected by some fast fabric (IB seems attractive, since it's got mediocre latency but good bandwidth.) > b) What did their solution(s) cost, both to set up as a base system > (in the case of e.g. a network appliance) and I'm fairly certain that if I were making all the decisions here, I'd go for fairly smallish modular servers plugged into IB. > c) incremental costs (e.g. filled racks)? ? > d) How does their solution scale, both costwise (partly answered in b > and c) and in terms of management and performance? 
my only real concern with management is MTBF: if we had a hypothetical collection 2PB of 250G SATA disks with 1Mhour MTBF, we'd go 5 days between disk replacements. to me, this motivates toward designs that have fairly large numbers of disks that can share a hot spare (or maybe raid6?) > e) What software tools are required to make their solution work, and > are they open source or proprietary? I'd be interested in knowing what the problem is that you're asking to be solved. just that you don't want to run "find / -name whatever" on a filesystem of 20 TB? or that you don't want 10 separate 2TB filesystems? > f) Along the same lines, to what extent is the hardware base of their > solution commodity (defined here as having a choice of multiple vendors > for a component at a point of standardized attachment such as a fiber > channel port or SCSI port) or proprietary (defined as if you buy this > solution THIS part will always need to be purchased from the original > vendor at a price "above market" as the solution is scaled up). as far as I can see, the big vendors are somehow oblivious of the fact that customers *HATE* the proprietary, single-source attitude. oh, you can plug any FC devices you want into your san, as long as they're all our products and we've "qualified" them. > Rules: Vendors reply directly to me only, not the list. I'm in the > market for this, most of the list is not. Note also that I've already I think you'd be surprised at how many, many people are buying multi-TB systems for isolated labs. there are good reasons that this kind of scattershot approach is not wise in, say, a university setting, where a shared resource pool can respond better to burstiness, consistent maintenance, stable environment, etc. regards, mark hahn. From rgb at phy.duke.edu Thu Oct 7 16:59:59 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage In-Reply-To: References: Message-ID: On Thu, 7 Oct 2004, Mark Hahn wrote: > > that means:-) in addition to commensurate amounts of tape backup. The > > ick! our big-storage plans very, very much hope to eliminate tape. > > > tape backup is relatively straightforward -- there is a 100 TB library > > available to the project already that will hold 200 TB after an > > LTO1->LTO2 upgrade, and while tapes aren't exactly cheap, they are > > vastly cheaper than disk in these quantities. > > hmm, LTO2 is $0.25/GB; disks are about double that. considering the > issues of tape reliability, access time and migration, I think > disk is worth it. from what I hear in the storage industry, this > is a growing consensus among, for instance, hospitals - they don't > want to spend their time reading tapes to see whether the media is > failing and content needs to be migrated. migrating content that's > online is ah, easier. in the $ world, online data is attractive in part > so its lifetime can be more explicitly managed (ie, deleted!) It isn't the media, it's the way it is served. Tape is ballpark of $250/TB, but once you've invested in a general shell -- a tape library of whatever size you want to pay for -- cost scales linearly, and it (tape) is easy and relatively safe to transport. Disk, by the time you wrap it up, serve it, connect it to this and that, and provide it with this and that costs much more. Otherwise I agree with most of what you say, but remember, I didn't write the RFP specs. Besides, today they decided to drop the 60 TB of tape spec. Oops! 
We'll still meet or exceed it anyway, as we have a big tape library that is conveniently underutilized handy, so we REALLY just pay for the media (plus maybe kick in a drive or two). > > The disk is a real problem. Raw disk these days is less than $1/GB for > > SATA in 200-300 GB sizes, a bit more for 400 GB sizes, so a TB of disk > > per se costs in the ballpark of $1000. However, HOUSING the disk in > > reliable (dual power, hot swap) enclosures is not cheap, adding RAID is > > not cheap, and building a scalable arrangement of servers to provide > > access with some controllable degree of latency and bandwidth for access > > is also not cheap. > > no insult intended, but have you looked closely, recently? I did some > quick web-pricing this weekend, and concluded: > > vendor capacity size $Cad list per TB density > dell/emc 12x250 3U $7500 1.0 TB/U > apple 14x250 3U $4000 1.166 > hp/msa1500cs 12x250x4 10U $3850 1.2 > > (divide $Cad by 1.25 or so to get $US.) all three plug into FC. > the HP goes up to 8 shelves per controller or 24 TB per FC port, though. So you add FC switch and server(s) and end up at a minimum of around $5K/TB. The maximum prices I'm seeing reported by respondants and that we've seen in quotes or prices of actual systems are well over $10K/TB, some as high as $30K/TB. Price depends on how fast and scalable you want it to be, which in turn depends on how proprietary it is. But I'll summarize all of this when I get through the proposal and can breathe again. The cheapest solutions are those you build yourself, BTW -- as one might expect -- followed by ones that a vendor assembles for you, followed in order by proprietary/named solutions that require special software or special software and special hardware. Some of the solutions out there use basically "no" commodity parts that you can replace through anybody but the vendor -- they even wrap up the disks themselves in their own custom packaging and firmware and double the price in the process. > > Management requirements include 3 year onsite > > service for the primary server array -- same day for critical > > components, next day at the latest for e.g. disks or power supplies that > > we can shelve and deal with ourselves in the short run. The solution we > > pretty standard policies. > > > adopt will also need to be scalable as far as administration is > > concerned -- we are not interested in "DIY" solutions where we just buy > > an enclosure and hang it on an over the counter server and run MD raid, > > not because this isn't reliable and workable for a departmental or even > > a cluster RAID in the 1-8 TB range (a couple of servers) it isn't at all > > clear how it will scale to the 10-80 TB range, when 10's of servers > > would be required. > > Robert, are you claiming that 10's of servers are unmanagable > on a *cluster* mailing list!?! or are you thinking of the number > of moving parts? I'm thinking of scalability of management at all levels, and performance at all levels. I don't >>think<< that I'm crazy in thinking that this is an issue in large scale storage design -- at least one respondant so far suggested that I wasn't radical enough and that off-the shelf or homemade SAN solutions are doomed to nasty failure at very large (100+ TB) sizes. I'm not certain that I believe him (I had several people describe their off-the-shelf solutions that scale to 100+ TB sizes, and was directed to e.g. 
http://www.archive.org/web/petabox.php) but think of me as being hypercautious in my already admitted ignorance;-) That is, if there are no issues and people are running stacks of 6.4 TB enclosures hanging off of OTC linux boxes and managing the volumes and data issues transparently and they scale to 100's of TB, sure, I'd love to hear about it. Now I have, although there are issues, there are issues. As I said, I'll summarize (and maybe start some lovely arguments:-) when I'm done but I'm still DIGESTING all the data I've gotten from vendors and list-friends (all of whom I profoundly thank!). > > Management of the actual spaces thus provided is not trivial -- there > > are certain TB-scale limits in linux to cope with (likely to soon be > > resolved if they aren't already in the latest kernels, but there in many > > of the working versions of linux still in use) and with an array of > > I can understand and even emphathize with some people's desire to > stick to old and well-understood kernels. but big storage is a very > good reason to kick them out of this complacency - the old kernel are > justifiable only on not-broke-don't-fix grounds... Again, agreed, but one wants to be very conservative in a project proposal, especially when we HAVE NO CHOICE as to the actual kernel or OS distribution -- we will have to just "install the grid" with a package developed elsewhere by people that you or I might or might not agree with. Historically, in fact, I think that there is no chance that either one of us would do things the way they have done them so far, and maybe we will ultimately influence the design, but when writing the proposal we have to assume that we'll be using their linux. Where at least we've talked them up from some -- shall we say old? obsolete? non-x64 supporting? versions of linux and the associated kernels and libraries as a base... (you get the idea). > > partitions and servers to deal with, just being able to index, store and > > retrieve files generated by the compute component of the grid will be a > > major issue. > > how so? I find that people still use sensible hierarchical organization, > even if the files are larger and more numerous than in the past. It's a grid, and we're trying to avoid direct NFS mounts on all the nodes for a variety of reasons (like performance, reliability, security) and because in this kind of grid people will need to use fully automated schema for data storage, retrieval, and archival migration on and off the main data store. Honestly, I personally think that the data management issue and toolset is MORE important than the hardware. As you note, we can build arrays of disk servers or arrays of disk and associated servers or network appliances and arrays of disk a variety of ways, including DIY with a fairly obvious design. In order for people to be able to direct a node to run for a week and drop its results, properly indexed and crossreferenced by user/group/program/parameters in a database, somewhere into the data store where it will be transparently migrated onto and off of an attached tape archive as needed AND possibly resync'd back to a project CENTRAL store AND possibly sync'd back to the home LAN and store of the grid user for local processing --- it is doable, sure, but I wouldn't call it trivial or necessarily doable without some hacking or involvement in OS projects addressing this issue or purchase of proprietary software ditto. 
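To make the shape of that deposit-and-index step concrete, here is a minimal node-side sketch. The store host, directory layout and metadata fields are all invented for illustration; none of this comes from an existing grid package discussed here.

    #!/bin/sh
    # deposit.sh -- hypothetical sketch of the node-side "hand results back,
    # with metadata" step.  Store host, paths and field names are made up.
    # usage: deposit.sh <result-file> <program> "<parameters>"
    set -e
    FILE=$1; PROGRAM=$2; PARAMS=$3
    STORE=store.example.org           # hypothetical central-store host
    DEST=/export/grid/incoming        # hypothetical drop directory
    SUM=`md5sum "$FILE" | awk '{print $1}'`
    META="$FILE.meta"
    cat > "$META" <<EOF
    file=`basename "$FILE"`
    md5=$SUM
    user=`id -un`
    group=`id -gn`
    node=`hostname`
    program=$PROGRAM
    params=$PARAMS
    date=`date -u +%Y%m%dT%H%M%SZ`
    EOF
    # data first, metadata last, so an indexer on the store side only ever
    # sees complete pairs and can load the .meta records into its database
    rsync -a "$FILE" "$META" "$STORE:$DEST/"

The hard part, as the surrounding discussion makes clear, is everything downstream of this: the store-side sweep that loads the metadata into a searchable database, schedules tape migration, and resyncs results back to the owner's home LAN.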
If it is trivial, and there is a simple package that does all this and eats your meatloaf for you out to a PB of data, please enlighten me...:-) > > a) What are listvolken who have 10+ TB requirements doing to satisfy > > them? > > we're acquiring somewhere between .2 and 2 PB, and are planning machinrooms > around the obvious kinds of building blocks: lots of servers that are in > the say 4-20 TB range, preferably connected by some fast fabric (IB seems > attractive, since it's got mediocre latency but good bandwidth.) Ya. > > b) What did their solution(s) cost, both to set up as a base system > > (in the case of e.g. a network appliance) and > > I'm fairly certain that if I were making all the decisions here, I'd > go for fairly smallish modular servers plugged into IB. Any idea of what that would cost? > > > c) incremental costs (e.g. filled racks)? I meant "cost of additional filled disk enclosures" once you've bought in. Some solutions involve network appliances with a large capital investment before you buy your first disk enclosure, and then scale linearly with filled enclosures to some point, where you buy another appliance. Some solutions already specify an appliance interconnect so that the whole thing is transparent to your cluster. Some solutions are expensive, expensive. I'm just trying to figure out HOW expensive, and how far we can go for what we can afford with the different alternatives. I'm happy for anyone to tell me the virtues of the expensive systems (the benefits) as long as I have the costs in hand as well, so I can ultimately do the old fashioned CBA. > > d) How does their solution scale, both costwise (partly answered in b > > and c) and in terms of management and performance? > > my only real concern with management is MTBF: if we had a hypothetical > collection 2PB of 250G SATA disks with 1Mhour MTBF, we'd go 5 days between > disk replacements. to me, this motivates toward designs that have fairly > large numbers of disks that can share a hot spare (or maybe raid6?) Right, but if your hypothetical array of disks also involved a stack of over the counter servers, network switches (of any sort, eg IB or FC or GE), and so on, there isn't just the disk to worry about -- in fact, in a good RAID enclosure it is relatively straightforward to deal with the disk (and hot swap power and hot swap fan) failures. Dealing with intrinsic server failures, e.g. toasted memory, CPU, CPU fans, CPU power supply (maybe, unless server has dual power) and sundry networking or peripheral card failures takes a lot more time and expertise, and can take down whole blocks of disks if the disk is provided only via direct connections to specific servers. Both human effort and expertise required and projected downtime depend a lot on how you build and set things up. Or rather, I >>expect<< it to, and am seeking war-stories (stories of profound failures where some design was FUBAR and ultimately abandoned for cause, especially) so I can figure out which designs to avoid because they DON'T scale in management. Performance scaling is also important, but we're not looking for the fastest possible solution or truly superior performance scaling (the kinds of solutions that cost the $10K+/TB sorts of prices). Unless of course all the other solutions simply choke to death at some e.g. 80 TB scale. I don't "expect" them too, sure, but if I knew the answer, why'd I ask? > > > e) What software tools are required to make their solution work, and > > are they open source or proprietary? 
> > I'd be interested in knowing what the problem is that you're asking to be > solved. just that you don't want to run "find / -name whatever" on > a filesystem of 20 TB? or that you don't want 10 separate 2TB filesystems? Partially described above. The dataflow we are expecting isn't unique to our problem, BTW. One respondant with almost exactly the same needs described a tool they are developing that is designed fairly specifically to manage the dataflow and archival/migration issues transparently. I'm waiting to hear whether it interfaces with any sort of indexing schema or toolset -- if so, it would simply solve the problem. Solve it for the cheapest possible (hardware reliable, COTS component) data stack -- a pile of OTS multiTB servers -- as well! In case the above wasn't clear, think: a) Run 1 day to 1 week, generate some 100+ GB per CPU on node local storage; b) Run hours to days, reduce the data to "interesting" and compressed form, occupying maybe 10% of this space. How the actual data is originally created (one big file or many little files, e.g.) I haven't a clue yet. How it is aggregated ditto. At some point, though; c) condensed data (be it in one 10 GB file or 10 1 GB files or larger or smaller fragments) is sent in to the central store, where it has to be saved in a way that is transparent to the user, indexed by the generating program, its parameters, the generating/owning group, various node and timestamp metadata, all in a DB that is searchable by the large community that wants to SHARE this data. So "find" is clearly out, even find with really long filenames. Find is REALLY out if you think about its performance scaling as you fill the store with lots of inodes. d) Once on the central store, the data has to be able to stay there (if it is being used), be backed up to tape (regardless), be MIGRATED to tape to free space on the central store for other data that IS being used, be retreiveable from backup or archive, be downloadable by the generating user to a home faraway for local processing, be downloadable by OTHER groups/users to THEIR homes faraway, and be uploadable to a PB-scale toplevel store and centralized archive in a higher tier of the grid. e) and maybe other stuff. The RFP wasn't horribly detailed (it wasn't at ALL detailed) and the material we've obtained from grid prototype sites isn't very helpful at the design phase. So we may NEED to export NFS space to the nodes or use XFS and some fancy toolsets or the like, but we're hoping to avoid this if the actual workflow permits it. On a grid, it "should", since grid tasks should all use "grid functions" to accomplish macroscopic tasks, not Unix/linux/posix functions or tools. > > f) Along the same lines, to what extent is the hardware base of their > > solution commodity (defined here as having a choice of multiple vendors > > for a component at a point of standardized attachment such as a fiber > > channel port or SCSI port) or proprietary (defined as if you buy this > > solution THIS part will always need to be purchased from the original > > vendor at a price "above market" as the solution is scaled up). > > as far as I can see, the big vendors are somehow oblivious of the fact > that customers *HATE* the proprietary, single-source attitude. > oh, you can plug any FC devices you want into your san, > as long as they're all our products and we've "qualified" them. You are now the third or fourth person to make THAT observation. "Standards? We don' care about no stinkin' standards..." 
(apologies to Mel Brooks and Blazing Saddles...;-) > > > Rules: Vendors reply directly to me only, not the list. I'm in the > > market for this, most of the list is not. Note also that I've already > > I think you'd be surprised at how many, many people are buying > multi-TB systems for isolated labs. there are good reasons that > this kind of scattershot approach is not wise in, say, a university > setting, where a shared resource pool can respond better to burstiness, > consistent maintenance, stable environment, etc. I agree again. Hell, I maintain a 3x80 GB disk IDE RAID in my HOME server these days, and the only thing special about the "80" is the age of the disks -- next time I upgrade it I'll likely make it close to a TB just because I can. So TB-scale storage is to be expected in most departmental size computing efforts at $1/GB plus housing and server. 100 TB-scale storage is a different beast. One is really engineering a storage "cluster" and like all cluster engineering, the optimal result depends on the application mix and expected usage; a "recipe" based solution might work or it might lead to disaster and effectively unusuable resources due to bottlenecks, contention, or management issues. Cluster engineering I have a reasonable understanding of; storage cluster engineering at this scale is way beyond my ken, although I'm learning fast. If only I had a couple of hundred thousand dollars, now, I'd build and buy a bunch of prototypes and really learn it the right way...;-) Thanks enormously for the response, rgb > > regards, mark hahn. > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Thu Oct 7 17:27:13 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage In-Reply-To: References: Message-ID: On Thu, 7 Oct 2004, Robert G. Brown wrote: > In case the above wasn't clear, think: > > a) Run 1 day to 1 week, generate some 100+ GB per CPU on node local > storage; I hate to reply to myself, but I meant "per node" -- on a hundred node dual CPU cluster, generate as much as 2 TB of raw data a week, which reduces to maybe 0.2 TB of data a week in hundreds of files. Multiply by 50 and we'll fill 10+ TB in a year, in tens of thousands of files (or more). And this is the lower-bound estimate, likely off by a factor of 2-4 and certain to be off by even more as the cluster scales up in size over the next few years to as many as 500 nodes sustained, all cranking out data according to this prescription but amplified by Moore's Law by exponentially increasing factors. This is why I'm worried about scaling so much. Even the genomics people have some sort of linear bounds on their data production rate. This has exponential growth in productivity matching (hopefully) expected growth in storage, so it might not get relatively easier... and if the exponents mismatch, it could get a lot worse. rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From csamuel at vpac.org Thu Oct 7 18:54:06 2004 From: csamuel at vpac.org (Chris Samuel) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Strange NFS corruption on Linux cluster to AIX 5.2 NFS server Message-ID: <200410081154.09031.csamuel@vpac.org> Hi folks, (system details at the end) I'm having a real hard time trying to track down a really bizzare NFS related issue on some clusters we're helping out on and I'm wondering if anyone here quickly knows the answer to this question before I go off trawling through the kernel sources. I have a 72K assembler file (the results of a day of narrowing down the problem) that when I do: as -o /tmp/file.o file.s generates a valid .o file, but when I do: as -o /some/nfs/directory/file.o file.s creates a corrupted object file (and in the original case leads to a link error due to the corrupted ELF format). However, cp'ing or cat'ing the object file from /tmp to the NFS filesystem is fine, it's just the assemblers output that is corrupted. I thought that this was just an NFS probem until I used strace to dump out the entire contents of the file descriptors that 'as' reads and writes to for the assembler file and for the object file, and then diff'd them. The only significant differences is that the write(2)'s to the object files are not the same, which I find extremely puzzling, I can see no way that the assembler can generate different output depending on whether the file it's just open()'d is on NFS or local disk. :-( My only thought is that strace (which uses ptrace(2)) is reading the data from the kernel at some point after it has been corrupted, presumably at some point in the NFS parts of the kernel. The problem with this file goes away (MD5 matches that of the one in created in /tmp) if I change rsize & wsize from 8192 to 4096, but then other object files get corrupted instead. :-( We've tried this out on three nodes in the cluster, and they all corrupt the output file, so it's unlikely to be a particular hardware problem. What is hurting my brain is that there is a mirror of this cluster both in OS installs (identical RPMs of the OS, especially kernel, gcc, assembler & libraries were used) and in firmware (BIOS and firmware updates were from the same CD) where this problem does not occur at all. In both situations the NFS server is an AIX 5.2 box, it is possible that there are minor differences there, but I cannot see how a difference in the NFS server could affect the output of the assembler on the Linux box before it goes anywhere near hitting the wire, let alone making it to the NFS server. The mount options are identical (we've checked both /etc/fstab and /proc/mounts) and rpm -Va doesn't show any unusual discrepancies between the two clusters. OS: RHEL3 Kernel: kernel-smp-2.4.21-15.EL Binutils: binutils-2.14.90.0.4-35 NFS-utils: nfs-utils-1.0.6-21EL Hardware: IBM x335 and IBM x345 dual Xeons. cheers! Chris -- Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041008/1e5e9377/attachment.bin From srgadmin at cs.hku.hk Fri Oct 8 00:56:49 2004 From: srgadmin at cs.hku.hk (SRG Administrator) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] NPC2004: Call For Participation Message-ID: <41664841.5020303@cs.hku.hk> Call For Participation 2004 IFIP Internation Conference on Network and Parallel Computing (NPC 2004) http://grid.hust.edu.cn/npc04 **************************************************************************** INTRODUCTION: The goal of NPC 2004 is to establish an international forum for engineers and scientists to present their excellent ideas and experiences in all system fields of network and parallel computing. NPC 2004, hosted by Huazhong University of Science and Technology, will be held at Oct 18 - 20, 2004 in Wuhan, China. All accepted papers will be published by Springer-Verlag in the Lecture Notes in Computer Science Series (cited by SCI). There are many scenic spots and historical sites in Wuhan including the Yellow Crane Tower with the 1,700 years history, one of the three famous towers in South China, and the East Lake whose natural beauty rivals that of the West Lake in Hangzhou. The main topics of interest include, but not limited to: Parallel & Distributed Architectures Network Security Multimedia Streaming Services Performance Modeling/Evaluation Network Storage Middleware Frameworks and Toolkits Network & Interconnect Architecture Parallel Programming Environments and Tools Parallel & Distributed Applications/Algorithms Advanced Web and Proxy Services Peer-to-peer Computing Cluster & Grid Computing ************************************************************************** KEYNOTE SPEAKERS: Prof. Kai Hwang Director of Internet and Grid Computing Laboratory University of Southern California Topic: Secure Grid Computing with Trusted Resources and Internet Datamining Dr. Thomas Sterling Faculty Associate Center for Advanced Computing Research California Institute of Technology Topic: Towards Memory Oriented Scalable Computer Architecture and High Efficiency Petaflops Computing Prof. Jose A.B. Fortes Director of Advanced Computing and Information Systems (ACIS) Laboratory University of Florida Topic: In-VIGO: Making the grid virtually yours Dr. Robert Kuhn Intel Americas, Inc Topic: Productivity in HPC Clusters Dr. Mootaz Elnozahy IBM Topic: PERCS: IBM Effort in HPCS ************************************************************************ REGISTRATION FEE (ON-SITE FEES) Regular : Euro 400 or US$ 480 Student : Euro 200 or US $240 Accompany : Euro 150 or US $180 EXTRA ROCEEDINGS : EURO 100 (or US $ 120) FOR EVERY EXTRA PROCEEDINGS The registration form can be downloaded from the following address. http://grid.hust.edu.cn/npc04/download/registration-form-npc04.pdf ************************************************************************ VENUE The conference will be held in Wuhan Lake View Garden Hotel that is the only traditional ancient style hotel in Wuhan City to meet the international five- star hotel standard. It is located in the beautiful East Lake scenery site ,and the East Lake High Technological Development Zone. The nice environment and the convenient traffic make it the ideal accommodation for travelers. 200 all kinds of well-equipped rooms are delightfully decorated and offer an array of comforts and amenities. 
http://www.lakeviewgarden.com/english/ ************************************************************************ PROGRAM The detailed program can be found at http://grid.hust.edu.cn/npc04/program.htm ************************************************************************ TOURISM Wuhan, the capital of Hubei Province, is the largest city in Central China, with a population of over 7 million and an area of 8,467 square kilometers. It lies at the confluence of the Yangtze and Han rivers and is comprised of three towns--Wuchang, Hankou, and Hanyang--that face each other across the rivers and are linked by two bridges. A major junction of traffic and communication, it is the center of economy, culture and politics in Central China and is proud of metallurgy, automobiles, machinery and high-tech industries. A core of national air, water and land transportation it offers great potential for further development and foreign investment. Wuhan is rich in culture and history. Its civilization began about 3,500 years ago, and is of great importance in Chinese culture, military, economy and politics. It shares the same culture of Chu, formed since the ancient Kingdom of Chu more than 2,000 years ago. Numerous natural and artificial attractions and scenic spots are scattered around. Famous scenic spots in Wuhan include Yellow Crane Tower, Guiyuan Temple, East Lake, and Hubei Provincial Museum with the famous chimes playing the music of different styles. Yellow Crane Tower is the symbol of Wuhan. It is located on the Snake Hill in Wuchang, at the south bank of Yangtze River; it is called one of the three most famous towers in southern China, together with Yueyang Tower in Hunan Province and Tengwang Tower in Jiangxi Province. The East Lake is one of the first state scenic spots in the east of Wuhan. The lake covers an area of 33 square Km, and is the largest lake of a city throughout China. In 1999, it was granted by the State as National Civilized Scenic Spot Model Site. Moshan Hill located in the East Lake scenic spot, it is surrounded by the lake from the East, the West and the North. From East to West, it is 2,200 meter long; from North to South, about 500 meter broad. Moshan Hill has six peaks, with Chu culture as its subject, such as Chu Bazaar, Chu Heaven Platform, Chu Talents Park, are full of antique flavor and classic beauty of Chu culture. ************************************************************************ For more information, please contact the program vice-chair, Dr. Hai Jin Tel:+86-27-87543529 Fax:+86-27-87557354 Email:hjin@hust.edu.cn From janfrode at parallab.uib.no Fri Oct 8 12:51:32 2004 From: janfrode at parallab.uib.no (Jan-Frode Myklebust) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage In-Reply-To: References: Message-ID: <20041008195132.GC32425@ii.uib.no> On Thu, Oct 07, 2004 at 07:59:59PM -0400, Robert G. Brown wrote: > If it is trivial, and there is a simple package that does all this and > eats your meatloaf for you out to a PB of data, please enlighten > me...:-) Have you had a look at SRB -> http://www.npaci.edu/DICE/SRB/ ? Sounds to me like it fullfills all your requirements (except for the meatloaf part, but I could be wrong). -jf From jonathan.hujsak at baesystems.com Fri Oct 8 13:15:26 2004 From: jonathan.hujsak at baesystems.com (Hujsak, Jonathan T (US SSA)) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] 64bit comparisons Message-ID: <436A69503F99214D9EBC27415F66C40C3288E1@blums0008.sd.gd.com> Hi! 
We're looking at implementing a large G5 cluster here at BAE Systems. Have you gained any new 'lessons learned' since the communication below? Can you recommend a good version of MPI to use for these? We've been looking at MPICH, MPIPro and also the Apple xgrid... Thanks! Jonathan Hujsak BAE Systems San Diego Bill Broadley bill at cse.ucdavis.edu Fri May 14 11:48:21 PDT 2004 * Previous message: [Beowulf] 64bit comparisons * Next message: [Beowulf] 64bit comparisons * Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] _____ On Fri, May 14, 2004 at 09:44:01AM -0700, Robert B Heckendorn wrote: > One of the options we are strongly considering for our next cluster is > going with Apple X-servers. There performance is purported to be good Careful to benchmark both processors at the same time if that is your intended usage pattern. Are the dual-g5's shipping yet? Last I heard yield problems were resulting in only uniprocessor shipments. My main concern that despite the marketing blurb of 2 10GB/sec CPU interfaces or similar that there is a shared 6.4 GB/sec memory bus. > and their power consumption is small. Has anyone measured a dual g5 xserv with a kill-a-watt or similar? > Can people comment on any comparisons betwee Apple and (Athlon64 > or Opteron)? Personally I've had problems, I need to spend more time resolving them, things like: * Need to tweak /etc/rc to allow Mpich to use shared memory * Latency between two mpich processes on the same node is 10-20 times the linux latency. I've yet to try LAM. * Differences in semaphores requires a rewrite for some linux code I had * Difference in the IBM fortran compiler required a rewrite compared to code that ran on Intel's, portland group's, and GNU's fortran compiler. Given all that I'm still interested to see what the G5 is good at and under what workloads the G5 wins perf/price or perf/watt. -- Bill Broadley Computational Science and Engineering UC Davis -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041008/19c8e119/attachment.html From rgb at phy.duke.edu Fri Oct 8 14:16:28 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage In-Reply-To: <20041008195132.GC32425@ii.uib.no> References: <20041008195132.GC32425@ii.uib.no> Message-ID: On Fri, 8 Oct 2004, Jan-Frode Myklebust wrote: > On Thu, Oct 07, 2004 at 07:59:59PM -0400, Robert G. Brown wrote: > > > If it is trivial, and there is a simple package that does all this and > > eats your meatloaf for you out to a PB of data, please enlighten > > me...:-) > > Have you had a look at SRB -> http://www.npaci.edu/DICE/SRB/ ? Sounds > to me like it fullfills all your requirements (except for the > meatloaf part, but I could be wrong). Ah, but read: http://www.npaci.edu/dice/srb/srbOpenSource.html where it is clear that the answer to the question "is it GPL-level open source" (essentially free softare) is "no". Worse, it is one of those really evil packages that requires that you contact a University's "Technology Transfer staff" for anything but carefully prescribed kinds of usage (by Universities, basically). This kinds of licensing drives me somewhat wild, especially since in this particular project we/Duke (an academic institution) will be partnering with MCNC (a state funded center) and other area schools and universities. Oops. State funded centers have to dicker for the right to use the toolset. 
In spite of its impressive list of projects (and features), this makes it, as you say, a package that does NOT eat your meatloaf for you. This is the general idea of the project's data management package tool as well (and some others folks have pointed out) and I appreciate the reference. I just wish that Universities would stop taking software developed (generally) with generous support from federal and state grants and putting these silly "we want to make money from this" licenses. Just GPL them and do things right... Condor used to drive me nuts the same way. SGE ditto. PBS even more so. Tools like this need to be REAL open source, free like air, especially when it is almost dead certain that they began with all sorts of ideas and possibly code contributed by a free source community, built on top of free tools contributed by that community. rgb > > > -jf > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From lindahl at pathscale.com Fri Oct 8 14:47:12 2004 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] 64bit comparisons In-Reply-To: <436A69503F99214D9EBC27415F66C40C3288E1@blums0008.sd.gd.com> References: <436A69503F99214D9EBC27415F66C40C3288E1@blums0008.sd.gd.com> Message-ID: <20041008214712.GA3602@greglaptop.internal.keyresearch.com> On Fri, Oct 08, 2004 at 01:15:26PM -0700, Hujsak, Jonathan T (US SSA) wrote: > We're looking at implementing a large G5 cluster here at BAE Systems. > > Have you gained any new 'lessons learned' since the communication > below? Can you recommend a good version of MPI to use for these? Jonathan, The most important lesson learned for large clusters is that you should gain your own experience -- buy one of each potential node and run your apps on it. As for MPI implementations, it usually depends on the interconnect that you're planning on using. -- greg From landman at scalableinformatics.com Fri Oct 8 14:59:36 2004 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage In-Reply-To: References: <20041008195132.GC32425@ii.uib.no> Message-ID: On Fri, 8 Oct 2004, Robert G. Brown wrote: > This is the general idea of the project's data management package tool > as well (and some others folks have pointed out) and I appreciate the > reference. I just wish that Universities would stop taking software > developed (generally) with generous support from federal and state > grants and putting these silly "we want to make money from this" > licenses. Just GPL them and do things right... Unless you negotiate this as part of your employment package (and my understanding is that few universities are willing to give up their Bayh-Dole based rights to your work), that this probably won't happen. Notice the intense resistance from certain interested groups to the NIH-NCRR policy of requesting software developed with federal money to be open-source. University tech transfer folks were among the interested parties. I think what needs to evolve is a two pronged model ala mysql. If you are going to spin it out and turn it into a profit center, then by all means, pay for a license. 
If you are going to use it in research (not for products or derivative works), then GPL it (or similar). > Condor used to drive me nuts the same way. SGE ditto. PBS even more so. For some reason, Condor has not released their code. I find this odd. I thought they had. > > Tools like this need to be REAL open source, free like air, especially > when it is almost dead certain that they began with all sorts of ideas > and possibly code contributed by a free source community, built on top > of free tools contributed by that community. > Remember, the poor starving universities need to eat too... :( There are valid reasons to ask for money for software. There are valid reasons not to distribute everything gratis (GPL is *not* a business plan) and to constrain redistribution. These reasons make sense for businesses. Universities generally have a different mission than businesses (though arguably, Bayh-Dole has blurred this significantly). As with other employers, they own in most cases, everything you do. If you want to build a company based upon what you have done in your lab, you have to negotiate with the tech transfer office. Joe From rgb at phy.duke.edu Fri Oct 8 17:24:43 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage In-Reply-To: References: <20041008195132.GC32425@ii.uib.no> Message-ID: On Fri, 8 Oct 2004, Joe Landman wrote: > Remember, the poor starving universities need to eat too... :( Oh, I agree, I just dislike it immensely when University's start to resemble occult, tax protected venture capital investment firms with a built-in money laundering business in the form of "open and free" teaching and research. Supposedly they exist to teach and to do research in the most philosophical of senses, but for far too many of them the moment there is a sniff of money in a development they are all over it. For "patentable" technology, one can just barely justify this, at least historically. For software, which typically lives on a copyright, the university has no more business getting involved than it has trying to co-opt a publication to a learned journal and force the author to republish it with copyright belonging to the University for money. In practice, I think even the patent co-opting is middling Evil. Rather than even partnering with the actual developer who likely had the idea, did all the groundwork, got the grant, so that the University MADE MONEY (likely money exceeding the developer's salary) from every step of the process at basically no risk to themselves, they just assert, sorry, this belongs to us now and we'll give back some tiny fraction of anything we make from it to your research program, if you are fortunate enough to get tenure and keep your grants and still be working here when we do. But for software, especially software developed by academics and grant-paid employees in association with federally funded projects, this kind of nonsense is just unforgiveable. One thing I like about Duke is that they understand the clear benefit to open source software, and more or less insist that stuff developed by systems staff that is reusable be GPL or equivalent. But then you run into a place that doesn't and is grasping mine mine mine... while freely using the pieces WE contribute back to the GPL pool. > There are valid reasons to ask for money for software. There are valid > reasons not to distribute everything gratis (GPL is *not* a business > plan) and to constrain redistribution. 
These reasons make sense for > businesses. Universities generally have a different mission than > businesses (though arguably, Bayh-Dole has blurred this significantly). One would hope. > > As with other employers, they own in most cases, everything you do. If > you want to build a company based upon what you have done in your lab, you > have to negotiate with the tech transfer office. "Negotiate" isn't exactly the word -- generally it is laid out pretty clearly in the faculty and staff bylaws. Any negotiations had better start before you even start the project, and to keep something you may have to formally leave the University before you start or risk their just taking it no matter when or how you finish. Greed is a universal human trait, I guess, even in the Ivory Tower. Doesn't mean I have to like it. rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From alvin at Mail.Linux-Consulting.com Fri Oct 8 22:20:10 2004 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage - vendors In-Reply-To: Message-ID: hi ya On Thu, 7 Oct 2004, Robert G. Brown wrote: > On Thu, 7 Oct 2004, Mark Hahn wrote: ... > > vendor capacity size $Cad list per TB density > > dell/emc 12x250 3U $7500 1.0TB/U > > apple 14x250 3U $4000 1.166 > > hp/msa1500cs 12x250x4 10U $3850 1.2 1Us w/ 8 drives 3 * 8 * 250 3U $ 9K /6TB 2.0TB/U Blades w/ 4drive 10 * 4 * 250 4U $20K /10TB 2.5TB/U * plain old 1Us with 8 drives per 1U allows ( 2.0TB per 1U ) http://linux-1u.net/Dwg/jpg.sm/c2610.jpg * 10 mini-itx blades per 4U chassis w/ 4 disks ( 1TB per blade, 10 blades per 4U chassis ) http://itx-blades.net * adding FC cards will increase the system costs :-) - the FC/SAN market is a fairly tight market and very expensive > > (divide $Cad by 1.25 or so to get $US.) all three plug into FC. > > the HP goes up to 8 shelves per controller or 24 TB per FC port, though. ... > The cheapest solutions are those you build yourself, BTW -- as one might > expect -- followed by ones that a vendor assembles for you, followed in > order by proprietary/named solutions that require special software or > special software and special hardware. "costs" are usually based on "vendor name recognition" compared to the raw costs of parts and the closed market of competitors selling their widgets ( the cost of parts is minimal compared to their retail pricing ) > describe their off-the-shelf solutions that scale to 100+ TB sizes, and > was directed to e.g. http://www.archive.org/web/petabox.php) but think > of me as being hypercautious in my already admitted ignorance;-) their design is also based on their ability to use rs232 to log into the adjacent box if it goes down for some reason, but rs232 might not work if the power failed or the machine didn't boot to get to the init level to turn on agetty/uugetty ---- it'd be good to have a 2nd 100TB backup subsystem ... as it's not trivial to backup and restore ( from bare metal ) that amount of data and you want to be certain you don't lose yesterday's or last week's data due to today's faulty backup c ya alvin From hanzl at noel.feld.cvut.cz Fri Oct 8 14:33:11 2004 From: hanzl at noel.feld.cvut.cz (hanzl@noel.feld.cvut.cz) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage and cachefs on nodes?
Message-ID: <20041008233311G.hanzl@unknown-domain> Anybody using cachefs(-alike) and local disks on nodes for reboot-persistent cache of huge central storage? (I periodically and obsessively repeat this poll, with a negative answer so far; obviously I am the only person with data storage needs perverted this way. Given the recent interest in storage, I dare to ask again...) Thanks Vaclav Hanzl From laurence at scalablesystems.com Fri Oct 8 18:20:02 2004 From: laurence at scalablesystems.com (Laurence Liew) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] 64bit comparisons In-Reply-To: <436A69503F99214D9EBC27415F66C40C3288E1@blums0008.sd.gd.com> References: <436A69503F99214D9EBC27415F66C40C3288E1@blums0008.sd.gd.com> Message-ID: <41673CC2.5060605@scalablesystems.com> Hi Jonathan 1) Buy 1 or 2 nodes of each platform and test with your apps to see - do they work on G5/Linux or G5/Mac OS - which gives best price/performance 2) MPI - depends on your budget for the interconnect - Quadrics, Myrinet, Infiniband are all candidates - your performance requirements and budget will determine which one suits best 3) IO - notice you did not mention anything about IO - spend some time thinking about IO - depending on your needs you may need a parallel filesystem or a simple NAS Hope this helps. Cheers! Laurence Hujsak, Jonathan T (US SSA) wrote: > Hi! > > We're looking at implementing a large G5 cluster here at BAE Systems. > > Have you gained any new 'lessons learned' since the communication > below? Can you recommend a good version of MPI to use for these? > > We've been looking at MPICH, MPIPro and also the Apple xgrid? > > Thanks! > > Jonathan Hujsak > BAE Systems > San Diego > > Bill Broadley bill at cse.ucdavis.edu > Fri May 14 11:48:21 PDT 2004 > > On Fri, May 14, 2004 at 09:44:01AM -0700, Robert B Heckendorn wrote: > >> One of the options we are strongly considering for our next cluster is >> going with Apple X-servers. Their performance is purported to be good > > > > Careful to benchmark both processors at the same time if that is your > > intended usage pattern. Are the dual-g5's shipping yet? Last I heard > > yield problems were resulting in only uniprocessor shipments. My main > > concern is that despite the marketing blurb of 2 10GB/sec CPU interfaces > > or similar that there is a shared 6.4 GB/sec memory bus. > > >> and their power consumption is small. > > > > Has anyone measured a dual g5 xserv with a kill-a-watt or similar? > > >> Can people comment on any comparisons between Apple and (Athlon64 >> or Opteron)? > > > > Personally I've had problems, I need to spend more time resolving them, > > things like: > > * Need to tweak /etc/rc to allow Mpich to use shared memory > > * Latency between two mpich processes on the same node is 10-20 times the > > linux latency. I've yet to try LAM. > > * Differences in semaphores required a rewrite for some linux code I had > > * Difference in the IBM fortran compiler required a rewrite compared to code > > that ran on Intel's, portland group's, and GNU's fortran compiler.
> > > > Given all that I'm still interested to see what the G5 is good at and under > > what workloads the G5 wins perf/price or perf/watt. > > > > -- > > Bill Broadley > > Computational Science and Engineering > > UC Davis > > > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- ========================================== Visit us at Supercomputing2004. Booth #400 ========================================== Laurence Liew, CTO Email: laurence@scalablesystems.com Scalable Systems Pte Ltd Web : http://www.scalablesystems.com (Reg. No: 200310328D) 7 Bedok South Road Tel : 65 6827 3953 Singapore 469272 Fax : 65 6827 3922 From jrajiv at hclinsys.com Fri Oct 8 21:55:34 2004 From: jrajiv at hclinsys.com (Rajiv) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] HPC in Windows Message-ID: <01ee01c4adbc$3618d330$39140897@PMORND> Dear All, Are there any Beowulf packages for windows? Regards, Rajiv -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041009/eb3fc4ef/attachment.html From jrajiv at hclinsys.com Fri Oct 8 21:54:34 2004 From: jrajiv at hclinsys.com (Rajiv) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Application Deployment Message-ID: <01df01c4adbc$1229c240$39140897@PMORND> Dear All, Is there any software available for application deployment- both linux and windows. I would like to install packages from master to all the clients through a management console. Regards, Rajiv -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041009/315c2948/attachment.html From clwang at cs.hku.hk Fri Oct 8 22:57:52 2004 From: clwang at cs.hku.hk (Cho Li Wang) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] CFP: CCGrid2005 (Cardiff, UK) Message-ID: <41677DE0.5070006@csis.hku.hk> CLUSTER COMPUTING AND GRID (CCGrid 2005) http://www.cs.cf.ac.uk/ccgrid2005/ 9-12 May 2005 Cardiff, UK ****************************************************************** IMPORTANT DATE: Paper submission: November 15, 2004 ****************************************************************** SCOPE ===== Commodity-based clusters and Grid computing technologies are rapidly developing, and are key components in the emergence of a novel service-based fabric for high capability computing. Cluster-powered Grids not only provide access to cost-effective problem-solving power, but also promise to enable a more collaborative approach to the use of distributed resources, and new economic products and services. CCGrid2005, sponsored by the IEEE Computer Society (final approval pending), is designed to bring together international leaders who are pioneering researchers, developers, and users of clusters, networks, and Grid architectures and applications. The symposium will also serve as a forum to present the latest work, and highlight related activities from around the world. 
CCGrid2005 is interested in topics including, but not limited to: o Hardware and Software (based on PCs, Workstations, SMPs or Supercomputers) o Middleware for Clusters and Grids o Dynamic Optical Network Architectures for Grid Computing o Parallel File Systems, including wide area file systems, and Parallel I/O o Scheduling and Load Balancing o Programming Models, Tools, and Environments o Performance Evaluation and Modeling o Resource Management and Scheduling o Computational, Data, and Information Grid Architectures and Systems o Grid Economies, Service Architectures, and Resource Exchange Architectures o Grid-based Problem Solving Environments o Scientific, Engineering, and Commercial Grid Applications o Portal Computing / Science Portals TECHNICAL PAPER SUBMISSION ========================== Authors are invited to submit papers of not more than 8 pages of double column text using single spaced 10 point size type on 8.5 x 11 inch pages, as per IEEE 8.5 x 11 manuscript guidelines, see http://www.computer.org/cspress/instruct.htm. Authors should submit a PostScript (level 2) or PDF file that will print on a PostScript printer. Paper submission instructions will be placed on this webpage (http://www.cs.cf.ac.uk/ccgrid2005). It is expected that the proceedings will be published by the IEEE Computer Society Press, USA. POSTER SUBMISSION ================= Authors may also submit short papers, of no more than 4 pages, of double column text using single spaced 10 point size type on 8.5 x 11 inch pages, as per IEEE 8.5 x 11 manuscript guidelines, see http://www.computer.org/cspress/instruct.htm. Authors should submit a PostScript (level 2) or PDF file that will print on a PostScript printer. Paper submission instructions will be placed on this webpage (http://www.cs.cf.ac.uk/ccgrid2005). Please contact the Poster's chair -- Dr Yan (Coral) Huang -- if you have queries. Dr Huang can be reached at: yan.huang@cs.cardiff.ac.uk IMPORTANT DATES =============== Paper Submission November 15, 2004 Notification January 10, 2005 Final (Camera Ready) February 9, 2005 Version SPECIAL EVENTS ============== Those wishing to organize workshops, present tutorials on emerging topics or participate in the industry track are invited to send the following information to: Workshops: workshops-ccgrid2005@cs.cf.ac.uk, Tutorials: tutorials-ccgrid2005@cs.cf.ac.uk, or Industry Track: industrytrack-ccgrid2005@cs.cf.ac.uk. COMMITTEES ========== Honorary Chair -------------- Tony Hey, EPSRC, UK Conference Chairs ----------------- David W. Walker, Cardiff University, UK Carl Kesselman, USC/ISI, US Programme Committee Chair ------------------------- Omer F. Rana, Cardiff University, UK Programme Committee Vice-Chairs ------------------------------- Jack Dongarra, University of Tenneesee, US Luc Moreau, University of Southampton, UK Sven Graupner, HP Labs, US Peter Sloot, University of Amsterdam, The Netherlands Craig Lee, The Aerospace Corporation, US Publications Chair ------------------ Rajkumar Buyya, University of Melbourne, Australia Workshops Chair --------------- Craig Lee, Aerospace Corporation, US Publicity Chairs ---------------- Vladimir Getov, University of Westminster, UK (Europe) Marcin Paprzycki, Oaklahoma State University, US (Europe) C. L. 
Wang, University of Hong Kong (Asia Pacific) Ken Hawick, Massey University, New Zealand (Asia Pacific) Manish Parashar, Rutgers University, US (America) Tutorials Chair --------------- Michael Gerndt, TU Munich, Germany Industry Track Chair -------------------- Alistair Dunlop, OMII, UK Exhibits Chair -------------- Steven Newhouse, OMII, UK Posters Chair ------------- Yan Huang, Cardiff University, UK Finance Chair ------------- John Oliver, Welsh eScience Centre, UK Registration Chair ------------------ Tracey Lavis, Cardiff University, UK Local Arrangements Chair ------------------------ Linda Wilson, Welsh eScience Centre, UK PROGRAMME COMMITTEE ------------------- Seif Haridi, KTH Stockholm, Sweden Bruno Schulze, Laboratsrio Nacional de Computagco Cientmfica, Brazil David Abramson, Monash University, Australia Steven Willmott, Universitat Polithcnica de Catalunya, Spain Xian-He Sun, Illinois Institute of Technology, US Yun-Heh (Jessica) Chen-Burger, University of Edinburgh, UK Thilo Kielmann, Vrije Universiteit, The Netherlands Brian Matthews, RAL/CCLRC and Oxford Brookes University, UK Maozhen Li, Brunel University, UK Greg Astfalk, HP Labs, US Marty Humphrey, University of Virginia, US Geoffrey Fox, University of Indiana, US Martin Berzins, University of Leeds, UK Hai Jin, Huazhong University of Science and Technology, China Giovanni Chiola, Universita' di Genova, Italy Domenico Talia, Universita' della Calabria/ICAR-CNR, Italy Josi Cunha, Universidade Nova de Lisboa, Portugal Ron Perrott, Queens University Belfast, UK Ewa Deelman, ISI/USC, US Stephen Jarvis, Warwick University, UK Niclas Andersson, Linkvping University, Sweden Putchong Uthayopas, Kasetsart University, Thailand John Morrison, University College Cork, Ireland Stephen Scott, Oak Ridge National Lab, US Luciano Serafini, ITC-IRST, Italy David A. Bader, University of New Mexico, US Mark Baker, University of Portsmouth, UK Emilio Luque, Universitat Autrnoma de Barcelona, Spain Akhil Sahai, HP Labs, US Gregor von Laszewski, Argonne National Lab, US Fethi Rabhi, University of New South Wales, Sydney, Australia Fabrizio Petrini, Los Alamos National Lab, US Kate Keahey, Argonne National Lab, US Sergei Gorlatch, Universitdt M|nster, Germany Brian Tierney, Lawrence Berkeley National Lab, US Rauf Izmailov, NEC Labs, US Stephen J. Turner, Nanyang Technological University, Singapore Savas Parastatidis, University of Newcastle, UK Elias Houstis, University of Thessaly, Greece -- and Purdue University, US Karl Aberer, EPFL, Switzerland Rolf Hempel, DLR, Germany Anne Elster, NTNU, Norway Artur Andrzejak, Zuse Institute Berlin, Germany Jennifer Schopf, Argonne National Laboratory, US John Gurd, University of Manchester, UK Domenico Laforenza, ISTI/CNR, Italy Wolfgang Rehm, TU Chemnitz, Germany Gabriel Antoniu, IRISA, France From janfrode at parallab.uib.no Sat Oct 9 02:08:54 2004 From: janfrode at parallab.uib.no (Jan-Frode Myklebust) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage In-Reply-To: References: <20041008195132.GC32425@ii.uib.no> Message-ID: <20041009090854.GB21880@ii.uib.no> On Fri, Oct 08, 2004 at 05:16:28PM -0400, Robert G. Brown wrote: > > Ah, but read: > > http://www.npaci.edu/dice/srb/srbOpenSource.html Ouch! Thanks for pointing this out. 
-jf From gustavo at martinelli.etc.br Sat Oct 9 12:49:10 2004 From: gustavo at martinelli.etc.br (Gustavo Gobi Martinelli) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] rsh don't see the real variables Message-ID: <1097351349.416840b60272e@www.martinelli.etc.br> I'm trying to make pvm 3.4.5 work, but I'm having a problem with "rsh". If I execute the command: # rsh 192.168.0.2 'set' I see a list of variables that is not the one defined in the root user's .bash_profile. But if I execute this: # rsh 192.168.0.1 the login occurs and I can execute # set Now I can see the variable that I need. What is happening? Where can I declare the variables so that they appear with the " rsh 192.168.0.1 'set' " command? Because of this, PVM doesn't work: it needs to see the $PVM_ROOT variable, which exists in .bash_profile but not in the "rsh" session. Does anyone know anything about this? I'm using Fedora Core 2 with kernel 2.6.7. -- Atenciosamente, Gustavo Gobi Martinelli Linux User# 270627 From alvin at Mail.Linux-Consulting.com Sat Oct 9 14:43:26 2004 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage - storage cache In-Reply-To: Message-ID: hi ya i deleted the email and decided to reply to the prev post about disk cache - have you checked into webcache/file cache apps ?? - those $5K - $15K apps will cache your files on their hw ... your local clients would fetch their data from the local disk cache, 2x - 100x faster than going across to the far away colo on the internet ( it's intended for making your far away colo look like it's in your local lan ) - it's sorta like a fancy "file" proxy or fancy version control that moves data around behind the scenes - its capacity is limited to the disk space in its cache c ya alvin file cache apps... riverbed.com actona.com ( now cisco ) tacitnetworks.com From rgb at phy.duke.edu Sat Oct 9 15:11:01 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] HPC in Windows In-Reply-To: <01ee01c4adbc$3618d330$39140897@PMORND> References: <01ee01c4adbc$3618d330$39140897@PMORND> Message-ID: On Sat, 9 Oct 2004, Rajiv wrote: > Dear All, > Are there any Beowulf packages for windows? Not that I know of. In fact, the whole concept seems a bit oxymoronic, as the definition of a beowulf is a cluster supercomputer running an open source operating system. However, there are windows based clusters, and there are parallel libraries that will compile and work on windows based LANs. A lot of things will be more difficult, as Windows is missing a few million moving parts that are standard everyday fare under an *nix OS (like xterms, shells, secure remote logins) UNLESS you pay for them or build them yourself where open source versions exist. Back when I still used WinXX, one could find a small suite of *nix-alike tools, but back then WinXX was still based on DOS. So, you can certainly use windows machines in a cluster (or a grid) if you can manage the hassle of paying for all the operating systems, compilers, associated tools to facilitate remote login and shell operations. Or, you can just use linux on all those systems, at worst on a dual boot basis. Run Windows by day, a linux cluster by night. These days, Open Office (and a few other packages) renders most linux boxes so copacetic that they can coexist in a Windows environment and a WinXX user can learn to do pretty much everything they need to do under Linux (and in a GUI) in a day or two. rgb > > Regards, > Rajiv Robert G.
Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Sat Oct 9 15:24:32 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Application Deployment In-Reply-To: <01df01c4adbc$1229c240$39140897@PMORND> References: <01df01c4adbc$1229c240$39140897@PMORND> Message-ID: On Sat, 9 Oct 2004, Rajiv wrote: > Dear All, > Is there any software available for application deployment- both linux > and windows. I would like to install packages from master to all the > clients through a management console. Again, I don't know about Windows -- most software running on a WinXX box will be proprietary, and simply cannot be installed in the way you describe without either a lot of knowledge and/or Windows-specific tools. After all, you've got all those CD codes and serial numbers and other proprietary bullshit to manage, and you're at serious risk of lawsuit if you fail to manage them perfectly. Windows does remote install these days, as I understand it, although I doubt that it remotely approaches kickstart in its ease of use and transparency. In linux, there are a variety of solutions, depending on whether you use RPMs or Debian. With Red Hat and descendents (Fedora, Centos) you can use kickstart, which is a lovely tool for installing clusters. Kickstart run on top of PXE and DHCP makes installing most systems a matter of turning them on (after making a single host specific MAC address entry in a table or two, and even this can be automated). It makes reinstalling non-servers at any time just as easy. Servers, of course, require knowledge, experience, wisdom, and time to do right, which is why sysadmins get paid and are worth a very decent salary. To install packages from a master to clients, there are both shrink wrapped tools and general approaches. For fairly obvious reasons, I'd suggest yum and possibly a mix of rsync and any of the packages that let you execute an ordinary shell command on a list of hosts). This is both because yum manages dependencies for you and because once installed it also manages automatic updates and even upgrades from your repository. Altering a client configuration is often just a matter of e.g. making an entry in a table that is read in by a script that calls yum update whatever, that is itself pushed out in an installed rpm that yum updates. Add an entry to the table and wait a day, or use the shell distribution tool if you are in a hurry. I don't know about "management consoles", though. Again, this sounds WinXX-ish -- you're hoping for something to hide all the detail of several distributions, packaging systems, software installation tools, and operating systems and still make them all work transparently for you without your needing to know what they are doing. Not in this Universe, at least not unless you pay a real expert a lot of money (for their software) and are willing to live with something that doesn't work horribly well at best anyway. Your best bet is to learn the specific systems you're working with well enough to make them dance through hoops, and not to rely on interfaces that are very expensive to maintain (and which to my experience NEVER work anyway). rgb > Regards, > Rajiv Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From landman at scalableinformatics.com Sat Oct 9 15:24:14 2004 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage and cachefs on nodes? In-Reply-To: <20041008233311G.hanzl@unknown-domain> References: <20041008233311G.hanzl@unknown-domain> Message-ID: <4168650E.8050800@scalableinformatics.com> hanzl@noel.feld.cvut.cz wrote: >Anybody using cachefs(-alike) and local disks on nodes for >reboot-persistent cache of huge central storage? > > Not really, though we have a package under development that might address some aspect of this. Contact me offline if you want to hear more. We are in early stages of this work, so it will be a while before it is ready. > >(I periodically and obsessively repeat this poll with negative answer >ever so far, obviously I am the only person with data storage needs >perverted this way. Given the recent interest in storage, I dare to >ask again...) > >Thanks > >Vaclav Hanzl >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From rgb at phy.duke.edu Sat Oct 9 15:32:14 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] rsh don't see the real variables In-Reply-To: <1097351349.416840b60272e@www.martinelli.etc.br> References: <1097351349.416840b60272e@www.martinelli.etc.br> Message-ID: On Sat, 9 Oct 2004, Gustavo Gobi Martinelli wrote: > > I'm trying to make the pvm 3.4.5 work, but I'm having a problem with "rsh". > > if I execute the command: > > # rsh 192.168.0.2 'set' > > I will see a list of variables that isn't in the .bash_profile of root user. > > But, if I execute this: > > # rsh 192.168.0.1 > > the login occurs and I can execute > > # set > > Now, I can see the variable that I need. > > What did happen? How can I find the local where I can declare the variables that > will be appear with " rsh 192.168.0.1 'set' " command? > > Because this, the PVM doesn't work. It needs see $PVM_ROOT variable that exists > in .bash_profile but not in "rsh" session. > > Someone know anything about it? I'm using the Fedora Core 2 with kernel 2.6.7. Use ssh, and look into the environment commands. rsh has many flaws, one of which is a failure to pass environment variables at all sanely from the calling host. So this is one way to proceed. Also note (from man bash): When bash is invoked as an interactive login shell, or as a non-inter- active shell with the --login option, it first reads and executes com- mands from the file /etc/profile, if that file exists. After reading that file, it looks for ~/.bash_profile, ~/.bash_login, and ~/.profile, in that order, and reads and executes commands from the first one that exists and is readable. The --noprofile option may be used when the shell is started to inhibit this behavior. In other words, there is a difference between the behavior of an interactive shell (what you get when you execute rsh hostname to log in) and a non-interactive shell -- they actually read and execute different .??* files in a different order. In fact, you can control the order to some extent with the call syntax. 
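For example (a minimal sketch only -- I believe the Fedora pvm rpm puts things under /usr/share/pvm3, but check your own install): bash, at least as built on Red Hat-ish systems, tries to detect when it has been started by rshd or sshd to run a remote command and reads ~/.bashrc in that case, so that is usually the right file for PVM_ROOT and friends:

  # in ~/.bashrc on every node -- read by the non-interactive shell that
  # rshd/sshd starts for "rsh host 'command'", unlike ~/.bash_profile
  export PVM_ROOT=/usr/share/pvm3
  export PVM_ARCH=LINUX
  export PATH=$PATH:$PVM_ROOT/lib:$PVM_ROOT/bin/$PVM_ARCH

  # then, from the master, verify that a non-interactive shell sees it:
  rsh 192.168.0.2 'echo $PVM_ROOT'     # should print /usr/share/pvm3
  rsh 192.168.0.2 'set | grep PVM'     # compare with a full interactive login

ssh behaves the same way for "ssh host command", so the same ~/.bashrc entry covers both.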
Assuming that you're using bash, you might read the man page carefully and experiment -- if you put the requisite environment variable definitions in the right place you should still be able to have them initialized even over rsh, as long as you don't have to pass them via the remote shell itself. If you do, you'll NEED to look into ssh in more detail. HTH, rgb > > -- > Atenciosamente, > Gustavo Gobi Martinelli > Linux User# 270627 > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Sat Oct 9 15:39:06 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] rsh don't see the real variables In-Reply-To: <1097351349.416840b60272e@www.martinelli.etc.br> References: <1097351349.416840b60272e@www.martinelli.etc.br> Message-ID: On Sat, 9 Oct 2004, Gustavo Gobi Martinelli wrote: > > I'm trying to make the pvm 3.4.5 work, but I'm having a problem with "rsh". One more comment. pvm under Red Hat * (and now under Fedora Core 2) IS a shell script (in /usr/bin/pvm). It should set PVM_ROOT correctly for you, automagically, whether or not it is set in your original or nodes shells, as PVM is started somewhere and used to build a virtual cluster IF you invoke PVM by name on the default path. This can probably still screw up with some ways you might use PVM, but there SHOULD be ways to do your project that don't require a PVM_ROOT to be spelled out in your node shells. rgb > > if I execute the command: > > # rsh 192.168.0.2 'set' > > I will see a list of variables that isn't in the .bash_profile of root user. > > But, if I execute this: > > # rsh 192.168.0.1 > > the login occurs and I can execute > > # set > > Now, I can see the variable that I need. > > What did happen? How can I find the local where I can declare the variables that > will be appear with " rsh 192.168.0.1 'set' " command? > > Because this, the PVM doesn't work. It needs see $PVM_ROOT variable that exists > in .bash_profile but not in "rsh" session. > > Someone know anything about it? I'm using the Fedora Core 2 with kernel 2.6.7. > > -- > Atenciosamente, > Gustavo Gobi Martinelli > Linux User# 270627 > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From james.p.lux at jpl.nasa.gov Sat Oct 9 16:56:22 2004 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Application Deployment References: <01df01c4adbc$1229c240$39140897@PMORND> Message-ID: <000801c4ae5b$957423d0$33a8a8c0@LAPTOP152422> ----- Original Message ----- From: "Robert G. Brown" To: "Rajiv" Cc: Sent: Saturday, October 09, 2004 3:24 PM Subject: Re: [Beowulf] Application Deployment > On Sat, 9 Oct 2004, Rajiv wrote: > > > Dear All, > > > Is there any software available for application deployment- both linux > > and windows. 
I would like to install packages from master to all the > > clients through a management console. > > Again, I don't know about Windows -- most software running on a WinXX > box will be proprietary, and simply cannot be installed in the way you > describe without either a lot of knowledge and/or Windows-specific > tools. After all, you've got all those CD codes and serial numbers and > other proprietary bullshit to manage, and you're at serious risk of > lawsuit if you fail to manage them perfectly. Windows does remote > install these days, as I understand it, although I doubt that it > remotely approaches kickstart in its ease of use and transparency. Actually, Windows DOES provide a central management capability, with fairly good control of the client images, management of those pesky licensing issues, etc. It's called SMS (Systems Management Server), and it's been around for about 10 years (at least), and in its current form is a godsend for people who have to manage all those thousands of WinXX desktops in big companies. With the huge volume of patches required to keep a Windows environment working, you'd have to have something like it. Ever since the earliest Windows versions that supported networking, there have been ways to do centralized network installs (I can't remember if Windows for Workgroups did it, but certainly, the first versions of Win NT did) and fairly automated rollouts of new software versions. Typically, the documentation came in the "resource kits", and lately, would be in things like "enterprise version back office resource kits". Often, if you forked out the kilobuck for a Visual whatever development kit, it would include all that stuff (along with the driver development kit, the SDK, etc., ) I might note that part of the incentive behind the ".NET" initiative is to simplify the whole configuration management across an enterprise scale installation. Part of it is a "late binding" to the services/components that your application needs, and the ability for your application to be insensitive to just how that component becomes available. Not incidentally, of course, the architecture includes ways to manage charging for the use of a component, both on a traditional licensing model, and on a per use model, and probably all manner of complex ways in between. Let's see, I want to watch a HDTV movie, so I have a monthly license for the decompression engine, a per use/per view license for the movie, a per day license for the ability to stop/start/rewind the movie, all with complex cross costing among the various and sundry providers (Now, for 3 days only, watch Star Trek XXI without paying decompression license fees, with the purchase of a Nokia cellphone (available only in areas served by Adelphia, within 6 miles of a qualifying retail outlet, 3 year contract required, mail-in rebate may take 6-8 months to process, void where prohibited, see etc.etc. for details.) You're just not going to get it at your local Comp-USA or download it off the web. And, it's not free or even particularly cheap (although, compared to the cost of all those licenses for the desktops, it's quite reasonable.) And, in a somewhat limited version, it's fairly inexpensive (so that developers can readily develop their software to fit within the MS Windows deployment model). Heck, you can probably even go and test your application for free on a Windows cluster at Microsoft. They used to provide the Jolt cola while you were at the facility testing for free as well, and may still well do. 
Qualitatively, managing 1000 computers (particularly with identical configurations) under Windows is probably not much more difficult than doing it under Linux. In both cases, you're going to need some training and/or experience to make it work. > > which is why sysadmins get paid and are worth a very decent salary. In both the Windows and *nix world, this is true. > > I don't know about "management consoles", though. Again, this sounds > WinXX-ish -- you're hoping for something to hide all the detail of > several distributions, packaging systems, software installation tools, > and operating systems and still make them all work transparently for you > without your needing to know what they are doing. > > Not in this Universe, at least not unless you pay a real expert a lot of > money (for their software) and are willing to live with something that > doesn't work horribly well at best anyway. Your best bet is to learn > the specific systems you're working with well enough to make them dance > through hoops, and not to rely on interfaces that are very expensive to > maintain (and which to my experience NEVER work anyway). Modern, enterprise scale Windows installations do this just as well as Linux, to a first order. Don't judge the large scale management capabilities of Windows by the consumer Windows experience. And, as far as cost goes, I suspect that configuration and software management costs for large Windows installations are not much different than for *nix. Both require training, expertise, etc. Sure, for Linux, the actual software is free, but that's a small fraction of the $100K/yr you're paying to folks to use the software. Microsoft is VERY aware of where their bread is buttered, and has worked VERY hard to make sure that managing 1000 Windows desktops (or server farms) in a corporate enviroment isn't a whole lot more difficult or expensive than managing 1000 Linux boxes. The last thing MS wants to hear from a Fortune 500 CIO is that they are dumping Windows for Linux because of the management costs. As you point out, the desktop "office productivity" applications are typically just as good under Linux as under Windows. It's a very different model from the consumer world, where once you've bought the heavily discounted box with the manufacturer OEM install of the OS, you're really on your own. In the cluster arena, things are different yet. Typically, people want pedal to the metal speed, they don't give a rat's fuzzy behind for the office productivity tools, and they're going to write all their production code themselves, so they want something that is conceptually simple to work with (OS interface wise), they generally have no need for sophisticated digital rights management and revenue schemes, etc. > > rgb > > > Regards, > > Rajiv > > Robert G. Brown http://www.phy.duke.edu/~rgb/ > Duke University Dept. of Physics, Box 90305 > Durham, N.C. 
27708-0305 > Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gustavo at martinelli.etc.br Sat Oct 9 18:39:44 2004 From: gustavo at martinelli.etc.br (Gustavo Gobi Martinelli) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] rsh don't see the real variables In-Reply-To: References: <1097351349.416840b60272e@www.martinelli.etc.br> Message-ID: <1097372384.416892e05302f@www.martinelli.etc.br> Robert, > Use ssh, and look into the environment commands. rsh has many flaws, > one of which is a failure to pass environment variables at all sanely > from the calling host. So this is one way to proceed. rsh doesn't pass environment variables; it picks up the variables on the remote host. And I need to know where I can declare these variables, because they are different from the ones in .bash_profile and /etc/profile. > > Also note (from man bash): > > When bash is invoked as an interactive login shell, or as a > non-inter- > active shell with the --login option, it first reads and executes > com- > mands from the file /etc/profile, if that file exists. After > reading > that file, it looks for ~/.bash_profile, ~/.bash_login, and > ~/.profile, > in that order, and reads and executes commands from the first one > that > exists and is readable. The --noprofile option may be used when > the > shell is started to inhibit this behavior. > > In other words, there is a difference between the behavior of an > interactive shell (what you get when you execute rsh hostname to log in) > and a non-interactive shell -- they actually read and execute different > .??* files in a different order. I will study it. > In fact, you can control the order to some extent with the call syntax. PVM executes its own commands; I don't have any control over that, so I can't change the syntax. > Assuming that you're using bash, you > might read the man page carefully and experiment -- if you put the > requisite environment variable definitions in the right place you should > still be able to have them initialized even over rsh, I will create the other files and test them. > as long as you > don't have to pass them via the remote shell itself. I don't have to pass variables over the rsh session; it executes the remote code using the variables on that host. > If you do, you'll > NEED to look into ssh in more detail. I will look into that too. > HTH, > rgb Thanks for the help. From gustavo at martinelli.etc.br Sat Oct 9 18:57:04 2004 From: gustavo at martinelli.etc.br (Gustavo Gobi Martinelli) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] rsh don't see the real variables In-Reply-To: References: <1097351349.416840b60272e@www.martinelli.etc.br> Message-ID: <1097373424.416896f038293@www.martinelli.etc.br> Robert, > One more comment. pvm under Red Hat * (and now under Fedora Core 2) IS > a shell script (in /usr/bin/pvm). It should set PVM_ROOT correctly for > you, automagically, whether or not it is set in your original or nodes > shells, as PVM is started somewhere and used to build a virtual cluster > IF you invoke PVM by name on the default path. Yes, I didn't include the pvm path in the default path. But I remember that the PATH I saw in the rsh session was incomplete compared with the original. I need to know where these variables are declared.
> This can probably still screw up with some ways you might use PVM, but > there SHOULD be ways to do your project that don't require a PVM_ROOT to > be spelled out in your node shells. I agree, but I have to make PVM work quickly. I will try other ways. Thanks again Gustavo Gobi Martinelli From mark.westwood at ohmsurveys.com Sun Oct 10 05:52:18 2004 From: mark.westwood at ohmsurveys.com (mark.westwood@ohmsurveys.com) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Re: HPC in Windows In-Reply-To: <01ee01c4adbc$3618d330$39140897@PMORND> References: <01ee01c4adbc$3618d330$39140897@PMORND> Message-ID: Hi Rajiv I don't know about Windows-based clusters, but you might want to check out Beowulf Cluster Computing with Windows edited by Thomas Sterling MIT Press, 2001 The book runs to 488 pages so must have something to say on the topic. I have the companion volume Beowulf Cluster Computing with Linux and would recommend that as a good introduction to the topic. Regards Mark Westwood OHM Ltd Rajiv writes: > Dear All, > Are there any Beowulf packages for windows? > > Regards, > Rajiv From hahn at physics.mcmaster.ca Sun Oct 10 08:43:11 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Application Deployment In-Reply-To: Message-ID: > RPMs or Debian. With Red Hat and descendents (Fedora, Centos) you can > use kickstart, which is a lovely tool for installing clusters. > Kickstart run on top of PXE and DHCP makes installing most systems a > matter of turning them on (after making a single host specific MAC don't forget the zero-install approach - nothing installed on nodes. just export the nodes' root filesystem from a fileserver, and you never have to do anything per-node. yum and rpm both let you install within a separate tree, so the fileserver doesn't need to be running the same config as the nodes. obviously, this results in a certain amount of NFS traffic, as opposed to having those files installed on the node's disk. issues: - diskless nodes are very attractive in many contexts: reliability, price, maintainability, etc. - running NFS-root is a way of tolerating local disk faults; lack of swap may or may not be a problem. - NFS can easily be faster than local disk IO. - in aggregate, a bunch of diskless nodes will, in the worst case, create much more traffic than your net and fileserver can handle. - my experience so far with 50-100-node clusters is that a single NFS-connected fileserver is actually pretty good. (our nodes have a local disk used for things like checkpoints of big parallel applications.) - for big MPI clusters, it's extremely attractive to put fileservers directly onto the MPI fabric. suddenly, gigabit is no longer a limiter for file IO and systems like Lustre can give some pretty impressive data rates. - this scheme is probably optimal for very heterogeneous datacenters as well, where you might boot a node in some random OS purely for a particular user/app. that kind of thing seems very dubious to me, but it would only take a few minutes of perl scripting to write a web frontend to select things like IP, distro, kernel, server, etc for a particular node, and propagate the changes. I think that for a small cluster, I'd consider having the nodes with full installs on them. for anything larger than say 4 nodes, I definitely prefer the root-on-fileserver approach with "ephemeral" nodes. it's also pretty sexy to take a node out of the box, plug it in and have it accept jobs in a minute or so with no manual intervention.
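to make that concrete, here is roughly what the server-side plumbing looks like for one such node -- every MAC, address and path below is invented for illustration, the exact syntax depends on your dhcpd/pxelinux versions, and the kernel needs NFS-root support built in:

  # /etc/dhcpd.conf on the boot server: hand the node an IP and point it
  # at pxelinux
  subnet 192.168.1.0 netmask 255.255.255.0 {
    next-server 192.168.1.1;                # tftp server
    filename "pxelinux.0";
    host node01 {
      hardware ethernet 00:11:22:33:44:55;
      fixed-address 192.168.1.101;
    }
  }

  # /tftpboot/pxelinux.cfg/default: boot a kernel whose root lives on NFS
  DEFAULT linux
  LABEL linux
    KERNEL vmlinuz
    APPEND initrd=initrd.img ip=dhcp root=/dev/nfs nfsroot=192.168.1.1:/export/nodes/node01 ro

  # /etc/exports: one root tree per node (or a mostly read-only shared one)
  /export/nodes/node01  192.168.1.101(rw,no_root_squash,sync)

  # maintain the node's tree from the fileserver, without the fileserver
  # having to run the node's configuration itself:
  yum --installroot=/export/nodes/node01 -y install openssh-server

the nice property is that "installing" a node is then just a MAC entry plus a directory on the fileserver; the box itself stays generic and replaceable.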
> course, require knowledge, experience, wisdom, and time to do right, > which is why sysadmins get paid and are worth a very decent salary. hmm. anyone for a cluster-admin salary survey? regards, mark hahn. From hahn at physics.mcmaster.ca Sun Oct 10 08:55:54 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Application Deployment In-Reply-To: <000801c4ae5b$957423d0$33a8a8c0@LAPTOP152422> Message-ID: > require training, expertise, etc. Sure, for Linux, the actual software is > free, but that's a small fraction of the $100K/yr you're paying to folks to > use the software. very interesting. one structural disadvantage that the windows ecosystem does labor under is that it must stick to the OS-really-installed-on-desktop model. that is, msft is not quite ready to go to a ephemeral-client model, where desktops just PXE-boot and mount everything of consequence across the net. (not just thin-client, where clients are all hard-installed, but use only a thin app like a browser for whatever the user needs.) with lan-ipmi and pxe, it's almost reasonable to claim that support doesn't scale with increasing nodes. there are still costs that scale with number of users, number of apps. hardware maintenance always scales with number of moving parts, but for ephemeral clients, it's far easier to have spares. the infrastructure to support 1K clients all booting monday morning would be nontrivial, but very tractable. no doubt the lack of nazi DRM (uncontrolled and dangerous network!) is why the msft community hasn't taken this approach. From james.p.lux at jpl.nasa.gov Sun Oct 10 10:26:50 2004 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Application Deployment References: Message-ID: <000601c4aeee$545a73c0$33a8a8c0@LAPTOP152422> ----- Original Message ----- From: "Mark Hahn" To: Sent: Sunday, October 10, 2004 8:55 AM Subject: Re: [Beowulf] Application Deployment > > require training, expertise, etc. Sure, for Linux, the actual software is > > free, but that's a small fraction of the $100K/yr you're paying to folks to > > use the software. > > very interesting. one structural disadvantage that the windows ecosystem > does labor under is that it must stick to the OS-really-installed-on-desktop > model. that is, msft is not quite ready to go to a ephemeral-client model, > where desktops just PXE-boot and mount everything of consequence across > the net. (not just thin-client, where clients are all hard-installed, but > use only a thin app like a browser for whatever the user needs.) I'm not sure, but MS is certainly heading towards the ephemeral client (with local cacheing) model, since it enables such things as revenue based on a per use of a component basis. Say you're a small software developer and you've developed a really nifty piechart algorithm for Excel. MS wants to give you a way that you could generate revenue from each use of this component, and that sort of implies that the component is fetched from some repository on the fly. Same for "notepad" or you name it. I think they're desperately trying to get away from the "transfer of tangible property" for software, because sooner or later, shrink wrap licenses are going to get hammered in court (if it looks, walks, and talks like a sale, then it IS a sale, and you should be able to resell, etc., freely). On the other hand, if each and every time you use the component (be it "MS Word", that clever Excel chart, etc.) 
you are separately engaging in a revenue transaction, then you don't get into those sticky areas. > > with lan-ipmi and pxe, it's almost reasonable to claim that support > doesn't scale with increasing nodes. there are still costs that scale > with number of users, number of apps. hardware maintenance always > scales with number of moving parts, but for ephemeral clients, it's > far easier to have spares. the infrastructure to support 1K clients > all booting monday morning would be nontrivial, but very tractable. > > no doubt the lack of nazi DRM (uncontrolled and dangerous network!) > is why the msft community hasn't taken this approach. Precisely so... MS, and the legions of developers who develop for the Windows environment, generally want some mechanism to be paid for their work. Per use revenue is a nice way of getting around the "bootleg copy" problem. Who cares if you copy it, if every time it runs, you have to hit a license server and pay your little micropayment. In fact, bootlegs are great... they cost the originator of the software nothing. It's the whole compatibility, configuration managment thing that is a big problem (all those components have to be compatible with all the other components, etc.). MS, for all of their faults, doesn't have stupid people working for it. If they could find a better way to "sell" software (or, more properly, the added value provided by the software/content/what-have-you) that doesn't rely on copyright (which everyone admits is poorly suited to such things), they'd love it. From landman at scalableinformatics.com Sun Oct 10 11:22:32 2004 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Application Deployment In-Reply-To: <000601c4aeee$545a73c0$33a8a8c0@LAPTOP152422> References: <000601c4aeee$545a73c0$33a8a8c0@LAPTOP152422> Message-ID: <41697DE8.2090301@scalableinformatics.com> Jim Lux wrote: >Precisely so... > >MS, and the legions of developers who develop for the Windows environment, >generally want some mechanism to be paid for their work. > (please note: not intended to be flame bait or trolling) hrm... so do those of us who try to survive and grow in companies that specialize in the Linux environment. As the owner of 1.5 such companies, I really would like to see them thrive and grow, and this requires getting paid for our work. I have had some customers wish to freely share our work with others, which I take as an indication of the value of the stuff; it wouldn't be shared if it did not have value/merit, but as I said before, I have to pay the developers. Can't pay them with a check that reads "3000 goodwill dollars, not redeemable at your mortgage company, or at the food store, but you sure made some of our `customers' happy". As I have said privately to others, the GPL is not a business model. Moreover, the ivory tower concept of giving away the source to sell the consulting seems to work for very few groups, if any (I can think of one, mySQL AB). As it is unlikely that there will be millions of linux clusters out there, the MySQL model of leveraging the needs of a huge installed user base will not work here. Most of us who are betting their families well being and livelihood on this, would like to be able to earn a living from this. There is nothing wrong with asking for money for the value you provide. Most of the consumers of the OSS stuff do so for varied reasons. One of the larger components is the "Free as in beer"* approach to cost containment. 
The control is nice, but as I see it, control is not what is driving the adoption of Linux based systems. Its TCO. If it costs you $2500 for that RHAT install, it doesn't cost you per client connecting to it. This immediately puts it at a (tremendous) cost advantage over any MSFT based solution. Add to this that this is a real market, with several competing vendors with mostly overlapping offerings. Hence there is real competition. Prices are held in check (to a degree). (* I know, I know, not too much free beer out there ...) MSFT has (at last check) the wrong licensing model for their software for clusters. They would need to change it in order to make it sensible from a financial perspective, to deploy such clusters. As this is a tiny market compared to their major market, I think that the needed changes will not happen. I could be wrong (and I think the MSFT HPC folks read this stuff in stealth mode, so feel free to correct me on/offline). >Per use revenue is >a nice way of getting around the "bootleg copy" problem. Who cares if you >copy it, if every time it runs, you have to hit a license server and pay >your little micropayment. In fact, bootlegs are great... they cost the >originator of the software nothing. It's the whole compatibility, >configuration managment thing that is a big problem (all those components >have to be compatible with all the other components, etc.). > > The per use view presumes you are using a consumable resource in some sense, and then attaching a value to it. It reminds me of the innumerable requests for registration going on now, coupled with the "would you like to view this article? only 1.95$USD right now..." I see linked from various news sites. Of course, apart from electrical power, it is hard to understand what resource you are consuming when this occurs. I suspect that consumer backlash against this will likely halt this march. I do see a subscription model becoming far more likely, whereby for a fixed (continuous) fee, you get access to content (much like a magazine, but software content in this case). I think people generally would be more accepting of this model than a micropayment per click. >MS, for all of their faults, doesn't have stupid people working for it. If >they could find a better way to "sell" software (or, more properly, the >added value provided by the software/content/what-have-you) that doesn't >rely on copyright (which everyone admits is poorly suited to such things), >they'd love it. > > Winston Churchill had something to say about this being the worst possible model, apart from the others. Of course the context was different, but generally the idea is correct. Software companies have a model that forces continuous "innovation" in order to maintain an upgrade cycle, and therefore get revenues flowing "continuously". The problem for them is convincing you to upgrade. Why upgrade if it works well enough? So they need to even out their revenue, get it more predictable, and break the upgrade cycle. Oddly, by breaking the upgrade cycle, you can spend more time fixing stuff, and less time inventing new broken stuff. Similar to what OSS gives you (to a degree). The important part (for business types) is that your revenue is now much more predictable. Now do something interesting (which I have not seen done yet by MSFT, though I expect in in short order). Change the acquisition model to that of a subscription. So instead of paying $500 for an install with a free set of patches, pay $50 to acquire the base + $100/year of subscription. 
Roll the next "versions" out in phases, with inter-function dependencies rather than entire version dependencies. The software becomes the platform. Talk about lock-in. There will be no upgrade cycle to contend with. Changes can be made quite modular. New features better tested and rolled in in an evolutionary manner. Brand new functionality could be created into different subscription paths. Copying and "pirating" would be encouraged (you need that subscription after all) as each machine would require its own subscription to function. Rough guess, but I would bet on something much like this emerging. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 612 4615 From james.p.lux at jpl.nasa.gov Sun Oct 10 12:02:44 2004 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Application Deployment References: <000601c4aeee$545a73c0$33a8a8c0@LAPTOP152422> <41697DE8.2090301@scalableinformatics.com> Message-ID: <000601c4aefb$c8fe7480$33a8a8c0@LAPTOP152422> ----- Original Message ----- From: "Joe Landman" To: "Jim Lux" Cc: "Mark Hahn" ; Sent: Sunday, October 10, 2004 11:22 AM Subject: Re: [Beowulf] Application Deployment > > > Jim Lux wrote: > > >Precisely so... > > > >MS, and the legions of developers who develop for the Windows environment, > >generally want some mechanism to be paid for their work. > > > (please note: not intended to be flame bait or trolling) > > > > hrm... so do those of us who try to survive and grow in companies that > specialize in the Linux environment. Indeed, this IS true as well. I suppose it comes down to what you actually want to pay for. In the generalized (oversimplified) Linux/GPL/free as in beer model, the income stream comes from providing support, etc. while the software development is provided altruistically (or, as advertising, good will garnering). The U.S.Gov't and various EU institutions (ESA) pay for development of all sorts of software, some of which has general usefulness. USGS map data might be a good example here. > > As I have said privately to others, the GPL is not a business model. > Moreover, the ivory tower concept of giving away the source to sell the > consulting seems to work for very few groups, if any (I can think of > one, mySQL AB). As it is unlikely that there will be millions of linux > clusters out there, the MySQL model of leveraging the needs of a huge > installed user base will not work here. Very true, I think. Fortunately, there are people who are willing to work with stars in their eyes. > > Most of us who are betting their families well being and livelihood on > this, would like to be able to earn a living from this. There is > nothing wrong with asking for money for the value you provide. Most of > the consumers of the OSS stuff do so for varied reasons. One of the > larger components is the "Free as in beer"* approach to cost > containment. The control is nice, but as I see it, control is not what > is driving the adoption of Linux based systems. Its TCO. If it costs > you $2500 for that RHAT install, it doesn't cost you per client > connecting to it. This immediately puts it at a (tremendous) cost > advantage over any MSFT based solution. Add to this that this is a real > market, with several competing vendors with mostly overlapping > offerings. Hence there is real competition. Prices are held in check > (to a degree). 
However, in the enterprise market, I think that MSFT is holding their own (yes, partly by anticompetitive practices, I suspect) in the TCO area. I don't know the details of how MS licenses thousand unit installations, but I'll bet it's not a "per copy, per year" basis. More likely, it's a "you have X thousand computers, so you fit in our 3000-6000 unit bulk rate". The last thing MS wants to have to do is count desktops and audit the numbers. This is how MS deals with OEM consumer mfrs.. you shipped X number of computers (a publically available number), so send us a check for Y*X dollars. We don't really care if you installed Win on them or not. > > (* I know, I know, not too much free beer out there ...) > > MSFT has (at last check) the wrong licensing model for their software > for clusters. They would need to change it in order to make it sensible > from a financial perspective, to deploy such clusters. As this is a > tiny market compared to their major market, I think that the needed > changes will not happen. I think you're right. The relatively small number of cases involved could be handled by special deals with PR tie-in. I could be wrong (and I think the MSFT HPC > folks read this stuff in stealth mode, so feel free to correct me > on/offline). > > > > >Per use revenue is > >a nice way of getting around the "bootleg copy" problem. Who cares if you > >copy it, if every time it runs, you have to hit a license server and pay > >your little micropayment. In fact, bootlegs are great... they cost the > >originator of the software nothing. It's the whole compatibility, > >configuration managment thing that is a big problem (all those components > >have to be compatible with all the other components, etc.). > > > > > > The per use view presumes you are using a consumable resource in some > sense, and then attaching a value to it. It reminds me of the > innumerable requests for registration going on now, coupled with the > "would you like to view this article? only 1.95$USD right now..." I see > linked from various news sites. I would never maintain that the price someone is willing to pay for a commodity is related to its intrinsic value. Something is worth what someone is willing to pay for it. I pay (grumpily) $7 to watch a movie at the theater. I don't imagine that the marginal cost to show the movie to me is even close to $7. > > Of course, apart from electrical power, it is hard to understand what > resource you are consuming when this occurs. I suspect that consumer > backlash against this will likely halt this march. I do see a > subscription model becoming far more likely, whereby for a fixed > (continuous) fee, you get access to content (much like a magazine, but > software content in this case). I think people generally would be more > accepting of this model than a micropayment per click. But people are willing to pay "per click" for things like SMS messages and phone calls. (There are various "bulk purchase" schemes for both...500 free minutes, etc., but I think those have more importance from a marketing standpoint than from a revenue standpoint. Industries with "cost reimbursement" models (most gov't contractors, gov't agencies, health care, etc.) are very attracted to a "dollars per click" model, because it allows easy allocation of costs to multiple accounts. > > Now do something interesting (which I have not seen done yet by MSFT, > though I expect in in short order). Change the acquisition model to > that of a subscription. 
So instead of paying $500 for an install with a > free set of patches, pay $50 to acquire the base + $100/year of > subscription. Roll the next "versions" out in phases, with > inter-function dependencies rather than entire version dependencies. > The software becomes the platform. > > Talk about lock-in. There will be no upgrade cycle to contend with. > Changes can be made quite modular. New features better tested and > rolled in in an evolutionary manner. Brand new functionality could be > created into different subscription paths. Copying and "pirating" would > be encouraged (you need that subscription after all) as each machine > would require its own subscription to function. > > Rough guess, but I would bet on something much like this emerging. This IS really the .NET model... Get away from software being "WinXX compatible" and to "generalized Windows platform compatible", with a steady revenue stream, just like the gas company. From hvidal at tesseract-tech.com Sun Oct 10 09:01:37 2004 From: hvidal at tesseract-tech.com (H.Vidal, Jr.) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] SATA vs SCSI drives Message-ID: <41695CE1.80507@tesseract-tech.com> Hello all. We are building some Network Area Storage gear around some high-end imaging and data acq. systems. Reliability for storage of this data is a big time must. To date, we have built all of this lab's gear around SCSI drives because it has been our research and experience that SCSI drives are better built than IDE drives. However, when looking at these drive arrays and NAS appliances, it is very clear that SATA drives are really driving large scale storage. What has been the general experience on this list of SATA vs SCSI in terms of performance, reliability, quoted as well as real-world failure rates, etc? Which SATA drives are considered 'the best' the way, say Seagate drives are held in high esteem for SCSI? And, if anybody likes any particular RAID and/or NAS system, let's hear your stories. About 1.4-1.7 Terabyte raw space. Thanks for your collective help and attention. Hernando Vidal, Jr. Tesseract Technology From kallio at ebi.ac.uk Sun Oct 10 13:10:01 2004 From: kallio at ebi.ac.uk (Kimmo Kallio) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Storage and cachefs on nodes? In-Reply-To: <20041008233311G.hanzl@unknown-domain> Message-ID: I can guarantee you are not the only one interested in this... we've been even semi-seriously thinking of implementing this ourself, but there is never enough time as usual. I've been toying on the idea of extending Linux (memory) buffer cache to utilise local disk as raw block device instead of going through a filesystem. This wouldn't be reboot-persistent, but wouldn't suffer from filesystem corruption or such meaning that there would be no extra (human) management overhead on individual machines. However, I haven't gotten as far as finding out if this is realistically doable or not. As for the storage in general we use NetApp filers and their DNFS (NetCache) caching devices. It's reliable but doesn't come cheap. We are looking for distributed filesystems (Lustre, Terragrid, ...) to complement the existing setup. Regards, Kimmo Kallio, Europen Bioinformatics Institute On Fri, 8 Oct 2004 hanzl@noel.feld.cvut.cz wrote: > Anybody using cachefs(-alike) and local disks on nodes for > reboot-persistent cache of huge central storage? 
> > > (I periodically and obsessively repeat this poll with negative answer > ever so far, obviously I am the only person with data storage needs > perverted this way. Given the recent interest in storage, I dare to > ask again...) > > Thanks > > Vaclav Hanzl > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From landman at scalableinformatics.com Sun Oct 10 15:26:59 2004 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Application Deployment In-Reply-To: <000601c4aefb$c8fe7480$33a8a8c0@LAPTOP152422> References: <000601c4aeee$545a73c0$33a8a8c0@LAPTOP152422> <41697DE8.2090301@scalableinformatics.com> <000601c4aefb$c8fe7480$33a8a8c0@LAPTOP152422> Message-ID: <4169B733.7040303@scalableinformatics.com> Jim Lux wrote: >----- Original Message ----- > > > >>As I have said privately to others, the GPL is not a business model. >>Moreover, the ivory tower concept of giving away the source to sell the >>consulting seems to work for very few groups, if any (I can think of >>one, mySQL AB). As it is unlikely that there will be millions of linux >>clusters out there, the MySQL model of leveraging the needs of a huge >>installed user base will not work here. >> >> > >Very true, I think. Fortunately, there are people who are willing to work >with stars in their eyes. > > Stars don't pay (my) bills. :( [...] >>> >>> >>The per use view presumes you are using a consumable resource in some >>sense, and then attaching a value to it. It reminds me of the >>innumerable requests for registration going on now, coupled with the >>"would you like to view this article? only 1.95$USD right now..." I see >>linked from various news sites. >> >> > >I would never maintain that the price someone is willing to pay for a >commodity is related to its intrinsic value. Something is worth what >someone is willing to pay for it. I pay (grumpily) $7 to watch a movie at >the theater. I don't imagine that the marginal cost to show the movie to me >is even close to $7. > > Value in this case != marginal cost. Value is that difficult to define aspect of something. The marginal cost of viewing the movie has to be quite low, far below the $7 you pay. The "value" in the case is "entertainment" (though I leave that to other threads) from the movie. That is, you make a judgment in your mind before parting with the money that the thing you are buying serves some need that you ascribe a "value" to. In the case of movie tickets, it is a simple binary system; either it has value or it does not (e.g. your need to be entertained). If you have a problem similar to Robert's storage issue, what is the value to you (not the marginal cost) of solving it? E.g. will other projects be delayed? Value includes opportunity cost, and many other softer calculcations/guesstimates. > > >>Of course, apart from electrical power, it is hard to understand what >>resource you are consuming when this occurs. I suspect that consumer >>backlash against this will likely halt this march. I do see a >>subscription model becoming far more likely, whereby for a fixed >>(continuous) fee, you get access to content (much like a magazine, but >>software content in this case). I think people generally would be more >>accepting of this model than a micropayment per click. >> >> > >But people are willing to pay "per click" for things like SMS messages and >phone calls. 
(There are various "bulk purchase" schemes for both...500 free > > I am subscribing to a service in such a way so as not to pay per minute. I don't SMS (I simply do not see the value in it, and would welcome someone explaining this to me (offline)). I am not talking about basic paging or blackberry stuff (the latter being quite cool). I don't mind this recurring cost. >minutes, etc., but I think those have more importance from a marketing >standpoint than from a revenue standpoint. Industries with "cost >reimbursement" models (most gov't contractors, gov't agencies, health care, >etc.) are very attracted to a "dollars per click" model, because it allows >easy allocation of costs to multiple accounts. > > > ok... I anthropomorphised. "There are more business models in the economy Joseph, than are dreamt of in your philosophy." I stand ... er ... sit... corrected. > > > >>Now do something interesting (which I have not seen done yet by MSFT, >>though I expect in in short order). Change the acquisition model to >>that of a subscription. So instead of paying $500 for an install with a >>free set of patches, pay $50 to acquire the base + $100/year of >>subscription. Roll the next "versions" out in phases, with >>inter-function dependencies rather than entire version dependencies. >>The software becomes the platform. >> >>Talk about lock-in. There will be no upgrade cycle to contend with. >>Changes can be made quite modular. New features better tested and >>rolled in in an evolutionary manner. Brand new functionality could be >>created into different subscription paths. Copying and "pirating" would >>be encouraged (you need that subscription after all) as each machine >>would require its own subscription to function. >> >>Rough guess, but I would bet on something much like this emerging. >> >> > >This IS really the .NET model... >Get away from software being "WinXX compatible" and to "generalized Windows >platform compatible", with a steady revenue stream, just like the gas >company. > > > > > Yup. I started looking at Mono to see if it made sense to start targetting commercial apps for it. Still not sure, but it is getting there. Not HPC apps, but user interfaces and other things. What I remember from a committee I sat on about a decade ago about how to handle cost distribution/sharing was "chargeback kills usage". Per usage fees were not appealing to most end users we spoke with (academe). This of course suggests adaptive business micro-models for software in specific market contexts (government, academe, industry). -- Joe From bropers at cct.lsu.edu Sun Oct 10 15:38:46 2004 From: bropers at cct.lsu.edu (Brian D. Ropers-Huilman) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Application Deployment In-Reply-To: References: Message-ID: <4169B9F6.3050101@cct.lsu.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Mark Hahn said the following on 2004-10-10 10:43: | | - NFS can easily be faster than local disk IO. | How so? Under what configurations, versions, etc.? - -- Brian D. Ropers-Huilman .::. Manager .::. HPC and Computation Center for Computation & Technology (CCT) bropers@cct.lsu.edu Johnston Hall, Rm. 
350 +1 225.578.3272 (V) Louisiana State University +1 225.578.5362 (F) Baton Rouge, LA 70803-1900 USA http://www.cct.lsu.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) iD8DBQFBabn2wRr6eFHB5lgRAot3AJ42WJ3vN3rGXrf01BTTkcmwur6lAACcC29H HY22RxHrcAzXU7c/LmiwwY8= =YRfV -----END PGP SIGNATURE----- From george at galis.org Sun Oct 10 16:44:14 2004 From: george at galis.org (George Georgalis) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] SATA vs SCSI drives In-Reply-To: <41695CE1.80507@tesseract-tech.com> References: <41695CE1.80507@tesseract-tech.com> Message-ID: <20041010234414.GB777@trot.local> On Sun, Oct 10, 2004 at 12:01:37PM -0400, H.Vidal, Jr. wrote: >Which SATA drives are considered 'the best' the way, say Seagate drives are >held in high esteem for SCSI? > >And, if anybody likes any particular RAID and/or NAS system, let's hear >your stories. About 1.4-1.7 Terabyte raw space. I've heard these are a good value http://www.winsys.com/products/flata.php If you build your own, the 3com controllers can be had under $400 and are said to be quite good. I'm booting SATA with a $35 addonics controller on a workstation -- which I consider as reliable, faster and cheaper than ATA. But that setup wasn't without difficulty setting up. // George -- George Georgalis, systems architect, administrator Linux BSD IXOYE http://galis.org/george/ cell:646-331-2027 mailto:george@galis.org From hanzl at noel.feld.cvut.cz Sun Oct 10 15:10:46 2004 From: hanzl at noel.feld.cvut.cz (hanzl@noel.feld.cvut.cz) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Storage and cachefs on nodes? In-Reply-To: References: <20041008233311G.hanzl@unknown-domain> Message-ID: <20041011001046L.hanzl@unknown-domain> >> Anybody using cachefs(-alike) and local disks on nodes for >> reboot-persistent cache of huge central storage? > > I can guarantee you are not the only one interested in this... > ... > ... Europen Bioinformatics Institute Great, thanks. I always believed that this data access pattern must appear in bioinformatics. > ... we've been even semi-seriously thinking of implementing this > ourself, but there is never enough time as usual. People start to implement this again and again but none of the small nice projects seems to survive in long term. > We are looking for distributed filesystems (Lustre, Terragrid, ...) Problem with most huge projects going this way is that they involve special server while many users could be quite happy with just a special client (NFS client with local filesystem cache and certain degree of filesystem semantics screwup). Most discussions on this topic end by "It can be done, if you need it, just implement it". But the real question is how to implement it and let it survive in long term - across changing kernel versions etc. I think persistent file caching should be as independent as it can get, using standard commodity server and being careful to minimize dependencies in client. Solaris cachefs looked good from this point. I am not sure how much can I expect from linux cachefs as seen in e.g. 2.6.9-rc3-mm3 - if I got it right, it is a kernel subsystem with intra-kernel API, being now tested with AFS and intended as usable for NFS. It is however "low" on NFS team priority list. So linux cachefs might provide cleaner solutions than Solaris cachefs - if it ever provides them. 
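Just to make the client-side idea concrete, the Solaris cachefs setup I have in mind is only a couple of commands (a sketch; the export fileserver:/data and the cache directory /local/cachedir are just example names):

    cfsadmin -c /local/cachedir                 # create the on-disk cache (example path)
    mount -F cachefs -o backfstype=nfs,cachedir=/local/cachedir \
          fileserver:/data /data                # fileserver:/data is a placeholder export

Something that simple on the client - persistent across reboots and needing nothing special on the server - is what I would hope the Linux cachefs work eventually offers NFS users.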
Regards Vaclav Hanzl From daniel.kidger at quadrics.com Mon Oct 11 00:12:34 2004 From: daniel.kidger at quadrics.com (Dan Kidger) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Application Deployment In-Reply-To: <4169B9F6.3050101@cct.lsu.edu> References: <4169B9F6.3050101@cct.lsu.edu> Message-ID: <200410110812.34738.daniel.kidger@quadrics.com> On Sunday 10 October 2004 11:38 pm, Brian D. Ropers-Huilman wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Mark Hahn said the following on 2004-10-10 10:43: > | - NFS can easily be faster than local disk IO. > > How so? Under what configurations, versions, etc.? easy - Fileserver is RAID and/or Lustre and you have a high bandwidth network. This can easily outstrip the IO available from a single local disk. This point also crops up when people do ftp (or scp) performance tests across their high BW network (like our QsNet). Unfortunately they end up measuring and hence reporting the bandwidth of the disks at either end. :-( Daniel. -------------------------------------------------------------- Dr. Dan Kidger, Quadrics Ltd. daniel.kidger@quadrics.com One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 ----------------------- www.quadrics.com -------------------- From hahn at physics.mcmaster.ca Mon Oct 11 10:23:15 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] SATA vs SCSI drives In-Reply-To: <41695CE1.80507@tesseract-tech.com> Message-ID: > imaging and data acq. systems. Reliability for storage of this data is > a big time must. good - reliability is easy. the ubiquity of raid has made inherent drive reliability less of a critical factor. > To date, we have built all of this lab's gear around SCSI drives because it > has been our research and experience that SCSI drives are better built > than IDE drives. research = web opinions? there are some differences between traditional scsi products and everything else. I don't think most customers, even ones who've tried to inform themselves, understand what the differences really are. basically, SCSI is and has always been driven by "enterprise" database needs. for instance, higher RPM is not a way to get higher bandwidth - indeed, higher density, lower-RPM disks often deliver higher bandwidth, and in any case, bandwidth is easily scaled by striping. the real differences have more to do with expected duty cycle and lifespan. again, the enterprise DB market expects to be able to do max seeks/second for the full service life, 24x365.2425x5. your use almost certainly does not involve constant, maxed-out activity. as such, your solution should not use parts designed and priced for that. > However, when looking at these drive arrays and NAS > appliances, it is very clear that SATA drives are really driving large scale > storage. enterprise DB's are not driven by density, whereas the rest of the market is. there are good technical reasons for this divergence (more heads means slower seeks; greater density means slower seeks). > What has been the general experience on this list of SATA vs SCSI in terms > of performance, reliability, quoted as well as real-world failure rates, > etc? somewhat higher infant mortality due to a lower-quality supply chain for low-end (*ata) disks. in use, failure rates more or less in keeping with the drive's mtbf or warranty. > Which SATA drives are considered 'the best' the way, say Seagate drives are > held in high esteem for SCSI?
your statement about Seagate is pure aesthetics, and I very much doubt that there's a clear taste preference for Seagate. (for instance Fujitsu, HGST and Maxtor all make drives of equal quality/reliability/performance.) > And, if anybody likes any particular RAID and/or NAS system, let's hear > your stories. About 1.4-1.7 Terabyte raw space. small servers like this are no longer much of a challenge or interest. just slap 8x250G sata disks into a box, raid5 with one hot spare, and relax. personally, I'm fond of Promise s150tx4 controllers since they're cheap and effective. any 3-5yr warranty disk from seagate/maxtor/hgst/wd will work perfectly fine. yes, of course the box should have decent airflow, and a hefty power supply, but none of that is hard anymore (it also helps that disks themselves have become cooler.) regards, mark hahn. From andrew at ceruleansystems.com Mon Oct 11 11:08:09 2004 From: andrew at ceruleansystems.com (J. Andrew Rogers) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] SATA vs SCSI drives Message-ID: <1097518089.32071@whirlwind.he.net> I purchase many, many terabytes of disk array every year, and use a mixture of SCSI and (S)ATA depending on the specific application. My observations and experiences with the current crop of technology: SCSI attached to a decent RAID controller (e.g. LSI MegaRAID) will generally outperform a roughly equivalent SATA array for many purposes, and if you have money to burn you can build significantly faster arrays. This is due to a combination of physically faster drives and mature drive and controller implementations that work very well together. That said, for single-process access, streaming, and similar, the performance is largely similar. A 10k SATA array will perform about as well as a 10k SCSI array in most cases. For applications that are bound by access/seek times (e.g. databases), SCSI still seems to have substantially more throughput in practice. The bandwidth issue is almost a non-issue in my experience, as you'll run into access/seek limitations first for most apps. So to summarize, they are mostly differentiated by the effective access/seek throughput; SATA is the cheaper choice if you aren't significantly bound by this parameter. And as SATA firmware in both the drives and controllers improves, and fast SCSI drive hardware is adapted to SATA interfaces, I expect this gap to close. It hasn't closed yet, but in a couple years I expect it will be. j. andrew rogers From jlb17 at duke.edu Mon Oct 11 06:06:18 2004 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] SATA vs SCSI drives In-Reply-To: <20041010234414.GB777@trot.local> References: <41695CE1.80507@tesseract-tech.com> <20041010234414.GB777@trot.local> Message-ID: On Sun, 10 Oct 2004 at 7:44pm, George Georgalis wrote > On Sun, Oct 10, 2004 at 12:01:37PM -0400, H.Vidal, Jr. wrote: > >Which SATA drives are considered 'the best' the way, say Seagate drives are > >held in high esteem for SCSI? > > > >And, if anybody likes any particular RAID and/or NAS system, let's hear > >your stories. About 1.4-1.7 Terabyte raw space. > > I've heard these are a good value > http://www.winsys.com/products/flata.php > > If you build your own, the 3com controllers can be had under $400 and are ^^^^ I think you meant "3ware" there... > said to be quite good. I'm booting SATA with a $35 addonics controller on > a workstation -- which I consider as reliable, faster and cheaper than > ATA. But that setup wasn't without difficulty setting up. 
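For anyone who wants to try Mark's "8x250G SATA, RAID5 with one hot spare" recipe using Linux software RAID instead of a hardware controller, a minimal sketch (assuming the eight drives appear as /dev/sdb through /dev/sdi; device names and filesystem choice are just examples):

    mdadm --create /dev/md0 --level=5 --raid-devices=7 --spare-devices=1 \
          /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi
    mkfs -t ext3 /dev/md0        # xfs or reiserfs would do just as well
    mount /dev/md0 /export       # /export is an example mount point

Seven active 250 GB spindles in RAID5 give roughly 1.5 TB usable, which lands in the 1.4-1.7 TB range Hernando asked about.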
-- Joshua Baker-LePain Department of Biomedical Engineering Duke University From laurenceliew at yahoo.com.sg Sat Oct 9 23:00:09 2004 From: laurenceliew at yahoo.com.sg (Laurence Liew) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] HPC in Windows In-Reply-To: <01ee01c4adbc$3618d330$39140897@PMORND> References: <01ee01c4adbc$3618d330$39140897@PMORND> Message-ID: <4168CFE9.7010903@yahoo.com.sg> Yes.. You can download the Windows HPC package.. go to Microsoft and look for it. If you are a Microsoft partner... you can register and attend the MS Windows HPC training classes held at Cornell University. Speak with your Microsoft rep and they will be able to arrange it. Laurence Rajiv wrote: > Dear All, > Are there any Beowulf packages for windows? > > Regards, > Rajiv > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- A non-text attachment was scrubbed... Name: laurenceliew.vcf Type: text/x-vcard Size: 150 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041010/8ead9717/laurenceliew.vcf From iwao at rickey-net.com Sun Oct 10 09:59:42 2004 From: iwao at rickey-net.com (Iwao Makino) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] HPC in Windows In-Reply-To: <01ee01c4adbc$3618d330$39140897@PMORND> References: <01ee01c4adbc$3618d330$39140897@PMORND> Message-ID: Rajiv, You may want to check this web site; they have plenty of resources there for you to study and get started. They even had a starter demo kit (not sure about its current status). However, I am not sure if it was Beowulf, but it was certainly an HPC cluster. At 10:25 AM +0530 04.10.9, Rajiv wrote: >Dear All, > Are there any Beowulf packages for windows? > >Regards, >Rajiv > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041011/b3374e9e/attachment.html From matej.ciesko at stud.uni-erlangen.de Sun Oct 10 11:25:32 2004 From: matej.ciesko at stud.uni-erlangen.de (Matej Ciesko) Date: Wed Nov 25 01:03:28 2009 Subject: RE: [Beowulf] HPC in Windows In-Reply-To: <01ee01c4adbc$3618d330$39140897@PMORND> Message-ID: Hi, In fact there is everything on the market that you need to build Windows based computational clusters as easily as with other operating systems. OS: The Windows 2003 operating system provides functionality for everything you need to deploy, run and maintain computational clusters. For starters, go look for the free Microsoft Computational Clustering Technical Preview Toolkit (CCTP). It includes some useful tools to get you started. More info is available at the MS HPC web page: http://www.microsoft.com/windowsserver2003/hpc/default.mspx If you are looking for references, here's a tip: The Cornell Theory Center (CTC) is and has been using large scale Windows based computational clusters for quite some time now. They also have "best practice" documentation available on how to build them on their web site. Middleware: Most common middleware packages used for HPC clustering are ported to the windows platform.
Look for MPI/PRO for example (or NT-MPICH if you go for free stuff). Google for more :-) Beyond that, most of these (commercial) middleware packages work well together with Microsoft development environments (Visual Studio), which makes development very comfortable. AND, if you or your research facility are part of a university or other educational institution, the chance is high that you can get all of the MS products for free for the purpose of your research by applying to one of their many academic alliance programs. (Go look for MSDN AA). http://msdn.microsoft.com/academic/ (at the bottom of the page). Best Regards, Matej Ciesko. _____ From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org] On Behalf Of Rajiv Sent: Saturday, October 09, 2004 6:56 AM To: beowulf@beowulf.org Subject: [Beowulf] HPC in Windows Dear All, Are there any Beowulf packages for windows? Regards, Rajiv --- Incoming mail is certified Virus Free. Checked by AVG anti-virus system (http://www.grisoft.com). Version: 6.0.775 / Virus Database: 522 - Release Date: 10/8/2004 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041010/6c8ae8e9/attachment.html From john.hearns at clustervision.com Sun Oct 10 23:47:22 2004 From: john.hearns at clustervision.com (John Hearns) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Re: HPC in Windows In-Reply-To: References: <01ee01c4adbc$3618d330$39140897@PMORND> Message-ID: <1097477241.1977.3.camel@vigor12> On Sun, 2004-10-10 at 13:52, mark.westwood@ohmsurveys.com wrote: > Hi Rajiv > > I don't know about Windows-based clusters, but you might want to check out > > Beowulf Cluster Computing with Windows > edited by Thomas Sterling > MIT Press, 2001 There's also an HPC edition of Windows Server 2003 http://www.microsoft.com/windowsserver2003/hpc/default.mspx Can't comment further as I have never used it, but that page has lots of links. From pjs at eurotux.com Mon Oct 11 03:49:07 2004 From: pjs at eurotux.com (Paulo Silva) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] MPI problem Message-ID: <1097491747.6316.21.camel@valen> Hello, I'm testing a 12 node Beowulf cluster using torque (an OpenPBS based program) and mpich with rsh/nfs. When I submitted a program that generates a big output file at the end of execution, I got this error: /opt/mpich/bin/mpirun: line 1: 31098 File size limit exceeded /home/xpto/QCD/su3_ora -p4pg /home/xpto/QCD/PI30888 - p4wd /home/xpto/QCD The output file stops at 2.0 GB. Since this error only occurs when I use MPI programs, I suspect this is some issue related to mpich. Does anyone know what the problem is? Thanks for any help -- Paulo Silva Eurotux Informática, SA -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part Url : http://www.scyld.com/pipermail/beowulf/attachments/20041011/db2f1bec/attachment.bin From epaulson at cs.wisc.edu Mon Oct 11 10:16:50 2004 From: epaulson at cs.wisc.edu (Erik Paulson) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] HPC in Windows In-Reply-To: References: <01ee01c4adbc$3618d330$39140897@PMORND> Message-ID: <20041011171650.GE15038@cobalt.cs.wisc.edu> On Sat, Oct 09, 2004 at 06:11:01PM -0400, Robert G. Brown wrote: > On Sat, 9 Oct 2004, Rajiv wrote: > > > Dear All, > > Are there any Beowulf packages for windows? > > Not that I know of.
In fact, the whole concept seems a bit oxymoronic, > as the definition of a beowulf is a cluster supercomputer running an > open source operating system. > It's really time that we gave up on trying to hold a strong definition to "beowulf". It's like kleenex or hacker/cracker. The world doesn't care. Clusters of x86 PCs doing "HPC" = beowulf. And on the Beowulf on Windows bit - http://www.amazon.com/exec/obidos/tg/detail/-/0262692759/qid=1097514164/sr=8-1/ref=sr_8_xs_ap_i1_xgl14/104-7091285-1915902?v=glance&s=books&n=507846 "Beowulf Cluster Computing with Windows (Scientific and Engineering Computation) by Thomas Sterling" - If Tom says that you can build a beowulf on Windows, I think you can. -Erik ps - define "supercomputer" :) From michael.fitzmaurice at ngc.com Mon Oct 11 09:31:14 2004 From: michael.fitzmaurice at ngc.com (Fitzmaurice, Michael) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] bwbug: BWBUG meeting tomorrow at 3:00 PM in McLean Virginia Message-ID: <1C0477C28F2A16489E5765E92BCD9A04029418CB@xcgva008.northgrum.com> The next BWBUG meeting will be held October 12, 2004 at Northrop Grumman Corporation, 7575 Colshire Drive, McLean, Virginia 22102, from 3:00 PM to 5:00 PM. There will be two presentations. One on Global File Systems, a key component of large, reliable data storage systems. The second talk will be on a comprehensive study that provides a unique insight into the Linux HPC market from the users' perspective. Join us for two great talks this Tuesday. Speaker: Sudhir Srinivasan, Ph.D., CTO and VP Engineering for IBRIX, a leader in Global File Systems Description: As the market coalesces around cluster-based computing, one of the primary impediments to scalability and performance is the file system. As such, technology vendors have developed distributed parallel file systems to overcome these I/O challenges. This session will offer a brief overview of how today's parallel file system offerings take advantage of clustered environments to get the best performance and scalability possible. Further, it will endeavor to explain how design and architectural elements such as segmentation, metadata algorithms, and non-hierarchical architectures make large clustered file systems more scalable and practical. We believe attendees will come away with a better understanding of the elements that make file system solutions appropriate given the cluster environment and applications users are running. John L Payne is president of JLP Associates, a consulting company specializing in computer and communication technologies. HPC Clusters - The best technology buy! The first independent study of HPC clusters and HPC industry growth. Based on comprehensive interviews with more than 40 users. The talk will provide insights into users' operational experience, reliability and performance. It will also look at their views on COTS versus Blades. Future design options and requirements will be discussed. T.
Michael Fitzmaurice Coordinator of the BWBUG 8110 Gatehouse Road 400W Falls Church, Virginia 22042 Office 703-205-3132 Cell 703-625-9054 http://www.it.northropgrumman.com/index.asp http://www.bwbug.org _______________________________________________ bwbug mailing list bwbug@pbm.com http://www.pbm.com/mailman/listinfo/bwbug From eugen at leitl.org Mon Oct 11 14:43:31 2004 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] HPC in Windows In-Reply-To: <20041011171650.GE15038@cobalt.cs.wisc.edu> References: <01ee01c4adbc$3618d330$39140897@PMORND> <20041011171650.GE15038@cobalt.cs.wisc.edu> Message-ID: <20041011214331.GV1457@leitl.org> On Mon, Oct 11, 2004 at 12:16:50PM -0500, Erik Paulson wrote: > It's really time that gave up on trying to hold a strong definition > to "beowulf". It's like kleenex or hacker/cracker. The world doesn't > care. Clusters of x86 PCs doing "HPC" = beowulf The term "Beowulf" is completely unknown outside of a small community. > And on the Beowulf on Windows bit - > http://www.amazon.com/exec/obidos/tg/detail/-/0262692759/qid=1097514164/sr=8-1/ref=sr_8_xs_ap_i1_xgl14/104-7091285-1915902?v=glance&s=books&n=507846 > > "Beowulf Cluster Computing with Windows (Scientific and Engineering Computation) > by Thomas Sterling" - If Tom says that you can build a beowulf on > Windows, I think you can. Yeah, if your node licenses are subsidized, and you don't care for worst-case message passing latency, and lack of tools, I guess you can... Sorry, but you seem to subscribe to a very peculiar definition of COTS. -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041011/6def4452/attachment.bin From hunting at ix.netcom.com Mon Oct 11 17:25:52 2004 From: hunting at ix.netcom.com (Michael Huntingdon) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] SATA vs SCSI drives In-Reply-To: <41695CE1.80507@tesseract-tech.com> References: <41695CE1.80507@tesseract-tech.com> Message-ID: <6.1.2.0.2.20041011170124.01d64600@popd.ix.netcom.com> Hernando Though slightly dated, I hope the attachment is helpful....btw....I didn't do an exhaustive search, but found the 10K SATA drives only offered at 72GB's and under. The higher cap drives are 7200RPM. cheers michael At 09:01 AM 10/10/2004, H.Vidal, Jr. wrote: >Hello all. > >We are building some Network Area Storage gear around some high-end >imaging and data acq. systems. Reliability for storage of this data is >a big time must. > >To date, we have built all of this lab's gear around SCSI drives because it >has been our research and experience that SCSI drives are better built >than IDE drives. However, when looking at these drive arrays and NAS >appliances, it is very clear that SATA drives are really driving large scale >storage. > >What has been the general experience on this list of SATA vs SCSI in terms >of performance, reliability, quoted as well as real-world failure rates, etc? >Which SATA drives are considered 'the best' the way, say Seagate drives are >held in high esteem for SCSI? > >And, if anybody likes any particular RAID and/or NAS system, let's hear >your stories. 
About 1.4-1.7 Terabyte raw space. > >Thanks for your collective help and attention. > >Hernando Vidal, Jr. >Tesseract Technology > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- A non-text attachment was scrubbed... Name: SATA vs. SAS disk technology.pdf Type: application/pdf Size: 393316 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041011/00b7ca2c/SATAvs.SASdisktechnology.pdf -------------- next part -------------- ********************************************************************* Systems Performance Consultants Michael Huntingdon Higher Education Technology Office (408) 294-6811 131-A Stony Circle, Suite 500 Cell (707) 478-0226 Santa Rosa, CA 95401 fax (707) 577-7419 Web: <http://www.spcnet.com> hunting@ix.netcom.com ********************************************************************* From george at galis.org Mon Oct 11 16:44:21 2004 From: george at galis.org (George Georgalis) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] SATA vs SCSI drives In-Reply-To: References: <41695CE1.80507@tesseract-tech.com> <20041010234414.GB777@trot.local> Message-ID: <20041011234421.GD9260@trot.local> On Mon, Oct 11, 2004 at 09:06:18AM -0400, Joshua Baker-LePain wrote: >On Sun, 10 Oct 2004 at 7:44pm, George Georgalis wrote > >> If you build your own, the 3com controllers can be had under $400 and are > ^^^^ >I think you meant "3ware" there... Indeed. Thanks. // George -- George Georgalis, systems architect, administrator Linux BSD IXOYE http://galis.org/george/ cell:646-331-2027 mailto:george@galis.org From mark.westwood at ohmsurveys.com Tue Oct 12 01:11:39 2004 From: mark.westwood at ohmsurveys.com (Mark Westwood) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Grid Engine question Message-ID: <416B91BB.8070102@ohmsurveys.com> Hi All We use the open source Grid Engine, Enterprise Edition v5.3, here to manage job submission to our 70 processor Beowulf. I'm rather new to managing Grid Engine and my users have me baffled with a question of priorities. The scenario is this: - suppose that there is a job running on 40 processors, leaving 30 free; - a high priority job, requesting 64 processors, is submitted; - a low priority, but long, job, requesting 24 processors is submitted. Currently, with our configuration, the low priority job would be run immediately, since there are more than 24 processors available. However, my users want to hold that job until the high priority job has run. Can we configure Grid Engine so that the low priority job is not started until after the high priority job, even though there are resources available for the low priority job when it is submitted ? Thanks for any help you can provide. PS Yes I have RTFMed and am not much the wiser on this specific question. -- Mark Westwood Software Engineer OHM Ltd The Technology Centre Offshore Technology Park Claymore Drive Aberdeen AB23 8GD United Kingdom +44 (0)870 429 6586 www.ohmsurveys.com From rgb at phy.duke.edu Tue Oct 12 08:35:05 2004 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] HPC in Windows In-Reply-To: <20041011171650.GE15038@cobalt.cs.wisc.edu> References: <01ee01c4adbc$3618d330$39140897@PMORND> <20041011171650.GE15038@cobalt.cs.wisc.edu> Message-ID: On Mon, 11 Oct 2004, Erik Paulson wrote: > On Sat, Oct 09, 2004 at 06:11:01PM -0400, Robert G. Brown wrote: > > On Sat, 9 Oct 2004, Rajiv wrote: > > > > > Dear All, > > > Are there any Beowulf packages for windows? > > > > Not that I know of. In fact, the whole concept seems a bit oxymoronic, > > as the definition of a beowulf is a cluster supercomputer running an > > open source operating system. > > > > It's really time that gave up on trying to hold a strong definition > to "beowulf". It's like kleenex or hacker/cracker. The world doesn't > care. Clusters of x86 PCs doing "HPC" = beowulf Now look what you did. Used up my whole morning, just about. The easily bored can skip the rant below. This (what's a beowulf?) is list discussion #389, actually. Or maybe it is that the discussion has occurred 389 times, I can't remember. I do remember that the first time I participated in it was around seven or eight years ago, that I advanced the point of view that you espouse here -- and that I changed my mind. The definition of beowulf as OPPOSED to "just" a cluster of systems (nuttin' in the definition about them being "PC"s, just COTS systems) was given by the members of the original beowulf project with explicit reasons for each component. Note well that cluster supercomputing was at the time not new -- I'd been doing it myself by then for years (on COTS systems, for that matter, if Unix workstations can be considered off the shelf), and I was far, far from the first. At that time, there were already NOWs, COWs, PoPs and more. See Pfister's "In Search of Clusters" for a lovely, balanced, and not terribly beowulf-centric historical review. Two things differentiated the beowulf from earlier cluster efforts. a) Custom software designed to present a view of the cluster as "a supercomputer" in the same sense (precisely) that e.g. an SP2 or SP3 is "a supercomputer" -- a single "head" that is identified as being "the computer", specialized communications channels to augment the speed of communications (then quite slow on 10 Mbps ethernet), stuff like bproc designed to support the member computers being "processors" in a multiprocessor machine rather than standalone computers. Note that this idea was NOT totally original to the beowulf project, as PVM already had incorporated much of this vision years earlier. b) The fact that the beowulf utilized an open source operating system and was built on top of open source software. The reasons for this at the time were manifest, and really haven't changed. In order to realize their design goals that >>extended<< the concepts already in place in PVM, they had to write numerous kernel drivers (hard to do without the kernel source) as well as a variety of support packages. Don Becker wrote (IIRC) something like -- would that be all of the linux kernel's network drivers at the time or just 80% of them? -- hard to remember at this point, but a grep on Becker in /usr/src/linux/drivers/net is STILL pretty revealing. Now look for Sterling and Becker's contributions to the WinXX networking stack. Hmmmm.... 
The insistence on COTS hardware, actually, is what I'd consider the "weakest" component of the original definition, as it is the one component that was readily bent by the community in order to better realize the design goal of a parallel supercomputer capable of running fine grained parallel code competitively with "big iron" supercomputers. The beowulf community readily embraced non-commodity networks when they appeared. Note that I consider "commodity" as meaning multisourced with real competition holding down prices and generally built on an "open" standard, e.g. ethernet is open and has many vendors, myrinet is not open and is available only from Myricom (although at all points there has been at least some generic competition at least between high end proprietary networks). Myrinet historically was perhaps >>the<< key component that permitted beowulves to reach and even exceed the performance of so-called big iron supercomputers for precisely the kind of fine grained numerical problems that the supercomputers had historically dominated. I remember well Greg Lindahl, for example, showing graphs of Alpha/Myrinet speedup scaling compared to e.g. SP-series systems and others, with the beowulf model actually winning (at less than 1/3 the price, even using the relatively expensive hardware involved). > And on the Beowulf on Windows bit - > http://www.amazon.com/exec/obidos/tg/detail/-/0262692759/qid=1097514164/sr=8-1/ref=sr_8_xs_ap_i1_xgl14/104-7091285-1915902?v=glance&s=books&n=507846 > > "Beowulf Cluster Computing with Windows (Scientific and Engineering Computation) > by Thomas Sterling" - If Tom says that you can build a beowulf on > Windows, I think you can. I can only reply with: http://www.beowulf.org/community/column2.html by Don Becker, in which he points out that when they first met, Sterling was "obsessed with writing open source network drivers". Or if you prefer, Question Number One of the beowulf FAQ: 1. What's a Beowulf? Beowulf Clusters are scalable performance clusters based on commodity hardware, on a private system network, with open source software (Linux) infrastructure. Each consists of a cluster of PCs or workstations dedicated to running high-performance computing tasks. The nodes in the cluster don't sit on people's desks; they are dedicated to running cluster jobs. It is usually connected to the outside world through only a single node. Some Linux clusters are built for reliability instead of speed. These are not Beowulfs. Or check out my "snapshot" of the original beowulf website, preserved in electronic amber (so to speak) from back when I ran a mirror: http://www.phy.duke.edu/resources/computing/brahma/Resources/beowulf/ The introduction and overview contains a number of lovely tidbits concerning the beowulf design and how it differs from a NOW. It makes it pretty clear that the only way a pile of WinXX boxes could be "a beowulf" (as opposed to a NOW) would be if Microsoft Made it So -- the WinXX kernels and networking stack and job scheduling and management are essentially inaccessible to developers in an open community, which is why WinXX clusters like Cornell's (however well they work) stand alone, supported only to the extent that MS or Cornell pay for it with little community synergy. Nobody would argue, of course, that one can't build a NOW based on WinXX boxes. A number exist. 
WinXX boxes run PVM or MPI (and have been able to for many years, probably even predating the beowulf project although I'm too lazy to check the mod dates of the WinXX ifdefs in PVM). One can also obviously build a grid with WinXX boxes in it, probably more easily than one can build a true parallel cluster. Grid-style clusters (a.k.a. "compute farms") predate even virtual supercomputers in cluster taxonomy, for all that they have a new name and a relatively new set of high-level support software (just as the beowulf has, in the form of bproc implemented in clustermatic and scyld). Those of use who used to "roll our own" gridware to permit the use of entire LANs of workstations on embarrassingly parallel problems find this (toplevel support software) a welcome development, and it has indeed blurred the lines between beowulfs and other NOWs to some degree, but if anything it is DIMINISHING the identification of all clusters as "beowulfs". Look at all the Grid projects in the universe -- BioGRID, the smallpox grid, ATLAS grid, PatriotGrid -- grids are proliferating like crazy, but they aren't considered or referred to as beowulfs. In most cases "beowulf" isn't even mentioned in their toplevel documentation. One of the fundamental reasons for differentiation is this very list. Few people who have been on the list for a long time and who have worked with beowulfs and other kinds of open source clusters for a long time have any particular interest in providing community support to cluster computing under Windows. For one thing, it is nearly impossible -- it requires somebody with trans-MCSE knowledge of Windows' kernels, libraries, drivers, networking stack, and tools including the various WinXX ports of key cluster software where it exists. For another, people who work in that community who DO have that level of expertise don't seem to want to share -- they want to sell. One has to pay to become a MCSE; one then expects a high rate of consultative return on the investment. One cannot easily obtain access to WinXX source code, and open or not, access to kernel-level source code turns out to be essential to getting maximal performance out of a true beowulf or even advanced non-beowulf style cluster. Besides, nearly all the tools involved (beyond userspace stuff like PVM or MPI in certain flavors) are SOLD and supported by Microsoft (only) or other Microsoft-connected commercial developers and the only "benefit" we get back in the community from providing support for them is to increase their profits and to encourage them to turn around and resell us our own developments and ideas at a high cost. So let THEM provide the consultation and expertise and "intellectual property" they prize so highly; I will not contribute. Contrast that with the really rather unbelieveable level of support freely offered via this list to (yes) general cluster computer users and builders (not just "beowulf" builders by the strict definition). This support is predicated on the fundamental notions of open source software -- that effort expended on it comes back to you amplified tenfold as the COMMUNITY is strengthened in the open and free exchange of ideas. Consider the many tools and products that support beowulfery (or generalized cluster computer operation) that would simply be impossible to develop in a closed source proprietary model. 
People who participate in this sort of development have no desire to do all the work to create new tools and products only to have Microsoft and its software lackeys do its usual job of co-opting the tool, branding it, shifting the core standard from open to proprietary, and then squeezing out the original inventors (extended rant available on request:-). For all of these reasons, I think that it is worthwhile to maintain the moderately strict definition of "a beowulf" as a particular isolated network arrangement of COTS systems running open source software and functioning as a cluster capable of running anything from fine grained parallel problems down to distributed single tasks with a single "view" of task ID space. This is a fairly open and embracing definition -- people on the list run "beowulfs" with a single head, multiple heads, many operating systems other than Linux (most of them open source -- WinXX users are subjected to fairly merciless teasing if nothing else ...hotter:-). It is differentiated from (recently emerging) definitions of Grid-style clusters, from my much older definition of a "distributed parallel supercomputer" (built largely of dual use workstations that function as desktop machines in a LAN while still permitting long-running numerical tasks to be run in the background), from MUCH older definitions of NOWs, COWs, Piles of PCs. So, if somebody says they've "built a beowulf" out of a bunch of WinXX boxes, yes, I know what they mean, even though what they say is almost certainly not correct. The list is fairly tolerant of pretty much anybody doing any kind of cluster computing, even Windows based NOWs or Grids. "Extreme Linux" as a more general vehicle for linux cluster development never quite took off, and www.extremelinux.org continues to be a blank page as it has been for years now. As I said above, I personally don't even DO "real" beowulf computing and never have -- my clusters tend to be NOWs, although we're gradually shifting more towards a Grid model as improved software makes this the easy path support-wise. As a final note, I personally view the original PVM team as the "inventors of commodity cluster computing" even more than Sterling and Becker (much as I revere their contributions). If a "beowulf" is a network of computers running e.g. PVM on top of proprietary software, Dongarra et. al. beat Sterling and Becker to the punch by years. This isn't a crazy idea -- PVM already contains "out of the box" many of the design goals of the beowulf project -- a unified process id space (tids), a single control head that supports the "virtual machine" model, the ability to run on commodity hardware. It just does it in userspace, and hence has limits on what can be accomplished performance-wise, and has the usual PVM vs MPI problems with the older supercomputer programmers (who all used MPI, for interesting historical reasons). 
(Interestingly, "old hands" in the beowulf/cluster business nearly all tell me that they used to use and still prefer PVM, while MPI is still the "commercially salable" parallel library that better favors the traditional big iron supercomputing model;-) To what PVM already provided, Sterling and Becker contributed the notions of >>network isolation<< to achieve predictable network latency, >>channel bonding<< of network channels, built on top of open source network drivers, to improve network bandwidth (an accomplishment somewhat overshadowed by the rapid development faster networks and low-latency networks), and eventually >>kernel-level modifications<< that truly converted a cluster of PCs into a "single machine" the components of which could no longer stand alone but were merely "processors" in a massively parallel system with a single user-level kernel interface. So how in the world can Sterling argue that this >>beowulf<< software, developed by the original beowulf team, is available for Windows? Did I miss something? Network isolation, fine, that's a matter of trivial network arrangement that anybody with $50 for an OTC router/firewall can now accomplish, but channel bonded networks? Unified process id spaces? Kernel modifications that make nodes into virtual processors in a single "machine"? Not that I know of, anyway, and obviously impossible without fairly open access to Windows source code in any event. At a guess, it would require such a violent modification even to the more modern and POSIX compliant WinXX's that the result could be called "Windows" only in the sense that linux running a windowing system can be called "Windows" -- pretty much a complete rewrite and de-integration of the GUI from the OS kernel would be required (something that Microsoft has argued in court is impossible, amusingly enough, as they have sought to convince an ignorant public that Internet Explorer -- a userspace program if ever there was one -- cannot be be de-integrated from Windows:-). Asserting that there are truly Windows-based beowulfs does not make it so, and coopting the term "beowulf" to apply to generic computing models and tools that preceded the project by years is a kind of newspeak. I'll have to just go on thinking of the idea as an oxymoronic one, at least until Microsoft opens its source code or somebody succeeds in rewriting history and the original definition and goals of the beowulf project. > ps - define "supercomputer" :) AT THE TIME of the beowulf project, the definition was actually pretty clear, if only by example. I'd say it is still pretty clear, actually. At that time (and still today, mostly) the generic term "computer" embraced: a) Mainframes (the oldest example of "computer", still annoyingly common in business, industry and academe). b) Minicomputers (e.g. PDP's, Vaxes, Harris's). Basically cheaper/smaller versions of mainframes that generally stood alone although of course a number of them were used as the core servers for Unix-based workstation LANs. c) Workstations (e.g. Suns, SGIs). Typically desktop-sized computers in a client-server arrangement on a LAN. Server-class Suns and SGIs were sometimes refrigerator-sized units that were de facto minicomputers, blurring the lines between b) and c) in the case where both were running Unix flavors (or at least real multitasking operating systems). d) Personal computers. 
A "personal" computer was always a desktop sized unit, and the term "PC" generally applied to x86-family examples, although clearly Apples were (and continue to be) PCs as well. Note that PCs were sometimes as capable, hardware-wise, as workstations and had been networkable for years, so networking or hardware per se had nothing to do with being a PC vs a workstation. A PC really was differentiated from being a workstation by a key feature of its operating system -- the INability to login to the system remotely over a network. To use a PC, you had to sit at the PC's actual interface. (Note that aftermarket tools like "PC anywhere" did not a PC a workstation make). e) Supercomputers. A supercomputer was (and continues to be) a generic term for a "computer" capable of doing numerical (HPC) computations much faster than the CURRENT GENERATION of a-d computers. Obviously a moving target, given Moore's Law. From the "first" so-called supercomputer, the 12 MFLOP Cray-1, through to today's top 500 list, the differentiating feature is obviously RELATIVE performance, as the Palm Tungsten C in my pocket (with its 400 MHz CPU) is faster than the Cray 1. f) Today there is a weak association between "supercomputer" and "single task" HPC (so Grids and compute farms of various sorts are somewhat excluded, probably BECAUSE of the top500 list and its insistence on parallel linpack-y sorts of stuff as the relevant measure of supercomputer performance). So Grids have emerged as a kind of cluster in their own right that isn't ordinarily viewed as a supercomputer although a Grid is essentially unbounded from above in terms of aggregate floating point capacity in a way that supercomputers are not. One could make a grid of all the top500 supercomputers, in fact... Note that historically supercomputers are differentiated from other a-d class computers not by being "mainframe" or not, not by being vector processor based vs interconnected parallel multiprocessor based, not by its operating system, not even by its underlying computational paradigm (e.g. shared memory vs message passing), certainly not by its ABSOLUTE performance, but strictly by relative numerical performance. My Palm a decade ago would have been an export-restricted munition supercomputer, usable by rogue nations to simulate nuclear blasts and build WMD. Today it is a casual tool used by businessmen to check the web and email and remind them of appointments, while other munitions-quality systems are now toys, used by my kids to race virtual motorcycles around hyperrealistically rendered city streets. Talk about swords into plowshares...;-) The exact multiplier between "ordinary computer" performance and supercomputer performance is of course not terribly sharp. Over the years, a factor of order ten has often sufficed. In the original beowulf project, aggregating 16 80486DX processors (at best a few hundred aggregate FLOPS, again, my Palm probably would beat it at a walk) was enough. Nowadays perhaps we are jaded, and only clusters consisting of hundreds or thousands of CPUs, instead of tens, are in the running. Maybe only the top500 systems are "supercomputers. Maybe the term itself is really obsolete, as fewer and fewer systems that are anything BUT a beowulf style cluster (even if it is assembled and sold as a big iron "single system" with its internal cluster CPUs and IPC network and memory model hidden by a custom designed operating system) appear in the HPC marketplace. Still, I think most people still know what "supercomputer" means. 
In fact, when one looks over the current top500, it appears that it has >>almost<< become synonymous with the term "beowulf";-) But not (note well!) with the term "grid", as grids aren't architected to excell at linpack, and a grid is very definitely not a beowulf. As far as I can tell, just about 100% of the top500 are clusters (COTS or otherwise) architected along the lines laid out by the beowulf project, with 95% of them having lots scalar processors and the remaining 5% having lots of vector processors. Unfortunately, the top500 (which I continue to think of as being almost totally useless for anything but advertising) doesn't present us with a clear picture of the operating systems or systems software architectures in place on most of the clusters. In fact, it provides remarkably little useful information except the name of the cluster manufacturer/integrator/reseller (imagine that;-). Two clusters on the list (#146 at Cornell and #233 in Korea) are explicitly indicated as running Windows. Looking over the general cluster hardware architectures and manufacturer/integrator/resellers, I would guess that linux is overwhelmingly dominant, followed by freebsd and other (proprietary) flavors of Unix, with WinXX quite possibly dead last. Open source development is an evolutionary model, capable of paradigm shifts, far jumps in parametric space, and N^3 advantage in searching high dimensional spaces. Proprietary software development is by its nature a gradient search process, prone to optimizing in perpetuity around a slowly evolving local minimum, making long jumps only when it steals fully developed memetic patterns (such as the Internet, cluster computing, and many more) more often than not produced by evolutionary communities. To be fair, new patterns are sometimes introduced a priori by brilliant individuals without clear roots in open communities (e.g. "Turbo" compilers), although that is less common in recent years as the open source development process has itself evolved. The individuals only RARELY work for major corporations any more, and the corporations that are famous as idea factories -- e.g. Bell Labs -- created internal "open" communities of their very own where the new ideas were incubated and exchanged and kicked around. It's just a matter of mathematics, you see. Linux = mammal (sorry, Tux:-) Evolving at a stupendous speed (compare everything from kernel to complete distributions over the last decade) WinXX = Great White Shark Evolutionarily frozen, remarkably efficient at what it does, immensely yet curiously vulnerable... Well, that's enough rant for the day. I've GOT to get some actual work done... rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From craig at craigsplanet.force9.co.uk Tue Oct 12 01:18:30 2004 From: craig at craigsplanet.force9.co.uk (Craig Robertson) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Re: HPC in Windows Message-ID: <416B9356.40400@craigsplanet.force9.co.uk> All, Not wishing to be pedantic, but as RGB points out, the definition of Beowulf means that there isn't such a thing as a MS based Beowulf. Yes, you can use COTS hardware but Windows is a non-free OS (in both senses of the word). Perhaps this was a disguised advertisement of some kind ;0) A likely scenario with an MS based cluster would be that a problem would present itself and there really would be no way of fixing it. 
Time, effort and money would then have to be expended on a kludge since you've already paid out thousands of dollars on licensing fees. -- Craig. --------------------------------------------------------- Dr. C. Robertson Craig's Planet Ltd. tel/fax: +44 1383 411123 fax2email: +44 870 7050992 mobile: +44 7890 565695 email: craig@craigspla.net http://www.craigspla.net --------------------------------------------------------- The information contained within this e-mail is confidential and may be privileged. It is intended for the addressee only. If you have received this e-mail in error please inform the sender and delete this e-mail and any attachments immediately. The contents of this e-mail must not be disclosed or copied without the sender's consent. Statements made in email are binding in honour only. --- Litigous people force us to put this statement here --- From john.hearns at clustervision.com Tue Oct 12 07:42:30 2004 From: john.hearns at clustervision.com (John Hearns) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Grid Engine question In-Reply-To: <416B91BB.8070102@ohmsurveys.com> Message-ID: On Tue, 12 Oct 2004, Mark Westwood wrote: > Hi All > > We use the open source Grid Engine, Enterprise Edition v5.3, here to > manage job submission to our 70 processor Beowulf. I'm rather new to > managing Grid Engine and my users have me baffled with a question of > priorities. Mark, you would be better off asking this on the Gridengine mailing list. And if you don't mind me being a little forward, Gridengine version 6.0u1 is now available. > > The scenario is this: > > - suppose that there is a job running on 40 processors, leaving 30 free; > - a high priority job, requesting 64 processors, is submitted; > - a low priority, but long, job, requesting 24 processors is submitted. > > Currently, with our configuration, the low priority job would be run > immediately, since there are more than 24 processors available. > However, my users want to hold that job until the high priority job has run. > > Can we configure Grid Engine so that the low priority job is not started > until after the high priority job, even though there are resources > available for the low priority job when it is submitted ? > I'm not sure of the exact answer here. But SGE 6 does have advance reservations - so a hold could be put on processors till 64 become free. From hahn at physics.mcmaster.ca Tue Oct 12 08:46:07 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] SATA vs SCSI drives In-Reply-To: <6.1.2.0.2.20041011170124.01d64600@popd.ix.netcom.com> Message-ID: > Though slightly dated, I hope the attachment is helpful....btw....I didn't > do an exhaustive search, but found the 10K SATA drives only offered at > 72GB's and under. The higher cap drives are 7200RPM. that's correct. but remember - RPM is mainly for latency, not bandwidth. if your workload is not incredibly seeky, then you don't want to pay for latency, since higher density leads to lower cost, bigger disks, higher bandwidth and slower seeks. in summary: - meet your reliability requirements using raid. it's insane to think about relying on a single disk in any non-ephemeral setting anyway. raid lets you achieve pretty much any reliability you want (as well as offering a broad spectrum of performance.) - meet your seek-rate requirements using RPM. 
I find very, very few applications are really seek-limited - really it's only very databases with uniform-random distribution of reads of tiny data from monumentally large tables. in particular, if there's any data locality or reuse at all, spend money on RAM not RPM. - for anything large, get MTBF specs for prospective disks. this lets you calculate how often you'll be replacing hardware, physically. your raid has taken care of data robustness; this is purely a maintenance issue. there's no dramatic difference in any of the families of disks available (well, avoid 1yr warranties, of course!). consider, for instance, that you can easily build raids based on 300G SATA disks that have half as many moving parts as with 147G SCSI disks. even if the MTBF's differ by 50% (guess 1.0 and 1.5 Mhours respectively) SATA is more reliabile. it'll probably also be 1/4 the price and sometimes actually faster. regards, mark hahn. From James.P.Lux at jpl.nasa.gov Tue Oct 12 11:49:07 2004 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] SATA vs SCSI drives References: <6.1.2.0.2.20041011170124.01d64600@popd.ix.netcom.com> Message-ID: <5.2.0.9.2.20041012113818.017dca28@mail.jpl.nasa.gov> At 11:46 AM 10/12/2004 -0400, Mark Hahn wrote: > > Though slightly dated, I hope the attachment is helpful....btw....I didn't > > do an exhaustive search, but found the 10K SATA drives only offered at > > 72GB's and under. The higher cap drives are 7200RPM. > >that's correct. but remember - RPM is mainly for latency, not bandwidth. >if your workload is not incredibly seeky, then you don't want to pay >for latency, since higher density leads to lower cost, bigger disks, higher >bandwidth and slower seeks. > >in summary: > - meet your reliability requirements using raid. it's insane > to think about relying on a single disk in any non-ephemeral > setting anyway. raid lets you achieve pretty much any reliability > you want (as well as offering a broad spectrum of performance.) > > - meet your seek-rate requirements using RPM. I find very, very > few applications are really seek-limited - really it's only very > databases with uniform-random distribution of reads of tiny data > from monumentally large tables. in particular, if there's any > data locality or reuse at all, spend money on RAM not RPM. > > - for anything large, get MTBF specs for prospective disks. > this lets you calculate how often you'll be replacing hardware, > physically. your raid has taken care of data robustness; > this is purely a maintenance issue. > >there's no dramatic difference in any of the families of disks available >(well, avoid 1yr warranties, of course!). consider, for instance, that >you can easily build raids based on 300G SATA disks that have half as >many moving parts as with 147G SCSI disks. even if the MTBF's differ >by 50% (guess 1.0 and 1.5 Mhours respectively) SATA is more reliabile. >it'll probably also be 1/4 the price and sometimes actually faster. Read those MTBF specs carefully... Typically they'll have some sort of usage tied to it (so many seeks per second, number of power up/power down cycles). ALso check the temperature effects on MTBF. It's not unheard of for mfrs to specify MTBF assuming a 20C drive temperature, which is unrealistically cold. Typically MTBF halves for each 10C rise in temperature. All the big mfrs have fairly decent descriptions of how they rate MTBF for their various drive classes. Note well that the assumptions of use for drives intended for, e.g. 
consumer PCs, are very different from those intended for server duty, and this is primarily determined by how they are positioned in the market. James Lux, P.E. Spacecraft Radio Frequency Subsystems Flight Telecommunications Systems Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From cnsidero at syr.edu Tue Oct 12 12:59:27 2004 From: cnsidero at syr.edu (Chris Sideroff) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect Message-ID: <1097611167.28704.104.camel@syru212-207.syr.edu> I'm sure posing this may raise more questions than answers, but which high-speed interconnect would offer the best 'bang for the buck': 1) myrinet 2) quadrics qsnet 3) mellanox infiniband Currently, our 30 node dual Opteron (MSI K8D Master-FT boards) cluster uses Gig/E and we are looking to upgrade to a faster network. As well, what components would one need for each setup? The reason I ask is, for example, the Myrinet switches accept different line cards and I am not sure which one to use. Sorry if this is a bit of a newbie question but I have no experience with any of this kind of hardware. I am reading the docs for each but thought your feedback would be good. Thanks Chris Sideroff From mathog at mendel.bio.caltech.edu Tue Oct 12 11:38:10 2004 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Tyan 2466 crashes, no obvious reason why Message-ID: Just thought I'd share the final outcome of this. After much swapping around of components and days of running memtest86, the problem was moving with the power supply. Swapping in the spare PS fixed it and that node has not so much as hiccupped in the month since. Note in particular that all of the voltages seen by the motherboard were always in range. My working hypothesis is that the PS either passes too much noise or just glitches occasionally (for instance, an intermittent internal short). The PS was a Zippy power supply with a power cord that attached via spades to the socket at the back of the 2U case. model AX2-5300FB-2S P/N 6AX2-300B055 ser no: T21905564M1A977732 Big EMACS logo, tiny www.zippy.com.tw down at the bottom. It was still under Zippy's warranty and the good folks at PSSC handled the exchange promptly. A day (!) after the replacement unit came in, a second node started doing the exact same thing - unexplained crashes and lock ups with nothing in the log file. Logging lm_sensors every 2 minutes showed nothing untoward up through the last entry. Crashes were every few hours. This time I just swapped the PS first thing and it has been ok now for over 4 days. Same type of power supply inside, this one with Serial No. T21905562M1A977732, which differs by only one digit from the first one that failed. Could be a coincidence but I'm beginning to suspect that there may be a bad component in this lot of power supplies, in which case an unpleasant series of node failures can probably be expected in the not too distant future.
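As a rough cross-check on the MTBF comparison earlier in the thread (the guessed 1.0 vs 1.5 Mhour figures for 300G SATA vs 147G SCSI, and the point about temperature derating), here is a back-of-the-envelope sketch. It assumes a constant failure rate, so expected replacements per year is roughly N * 8766 / MTBF; the disk counts are hypothetical arrays of similar raw capacity, chosen only to show the arithmetic, not anyone's real configuration:

/* mtbf_est.c -- rough expected-replacement arithmetic for a disk farm.
 * Uses the guessed MTBFs from this thread (1.0 Mhr SATA, 1.5 Mhr SCSI),
 * not vendor data, and a crude constant-failure-rate model.
 */
#include <stdio.h>

static double fails_per_year(int ndisks, double mtbf_hours)
{
    const double hours_per_year = 8766.0;   /* 365.25 days */
    return ndisks * hours_per_year / mtbf_hours;
}

int main(void)
{
    /* hypothetical arrays of ~4.2 TB raw capacity either way */
    int n_sata = 14;  double mtbf_sata = 1.0e6;   /* 14 x 300 GB */
    int n_scsi = 29;  double mtbf_scsi = 1.5e6;   /* 29 x 147 GB */

    printf("SATA array: %.2f expected replacements/year\n",
           fails_per_year(n_sata, mtbf_sata));
    printf("SCSI array: %.2f expected replacements/year\n",
           fails_per_year(n_scsi, mtbf_scsi));

    /* temperature caveat: if MTBF roughly halves per +10C, a drive rated
     * at 20C but run at 40C looks about 4x worse on paper */
    printf("SATA at +20C over spec: %.2f expected replacements/year\n",
           fails_per_year(n_sata, mtbf_sata / 4.0));
    return 0;
}

With these numbers the SATA array comes out ahead simply because it has half as many spindles, which is exactly the "fewer moving parts" argument made above; the temperature line shows how quickly that advantage can be eaten by a hot chassis.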
Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From agrajag at dragaera.net Tue Oct 12 13:25:32 2004 From: agrajag at dragaera.net (Sean Dilda) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Grid Engine question In-Reply-To: <416B91BB.8070102@ohmsurveys.com> References: <416B91BB.8070102@ohmsurveys.com> Message-ID: <1097612732.18951.14.camel@pel> On Tue, 2004-10-12 at 04:11, Mark Westwood wrote: > Hi All > > We use the open source Grid Engine, Enterprise Edition v5.3, here to > manage job submission to our 70 processor Beowulf. I'm rather new to > managing Grid Engine and my users have me baffled with a question of > priorities. > > The scenario is this: > > - suppose that there is a job running on 40 processors, leaving 30 free; > - a high priority job, requesting 64 processors, is submitted; > - a low priority, but long, job, requesting 24 processors is submitted. > > Currently, with our configuration, the low priority job would be run > immediately, since there are more than 24 processors available. > However, my users want to hold that job until the high priority job has run. > > Can we configure Grid Engine so that the low priority job is not started > until after the high priority job, even though there are resources > available for the low priority job when it is submitted ? In SGE 6.0 they added a feature they call 'advanced reservations'. Its not really advanced, and its not what I consider 'reservations' to be, but it is exactly what you want. When reservations are enabled on the cluster, and the job is submitted with '-R y', the mutli-processor job will be able to 'hold' available resources until it has enough to run, and thus keep lower priority jobs from using them. However, to do this you need to upgrade to at least version 6.0. However, 6.0 also has cluster queues which I find makes administration much easier (it allows you to create one queue setup and assign it to multiple hosts instead of doing a separate setup for each compute host). From landman at scalableinformatics.com Tue Oct 12 13:48:57 2004 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097611167.28704.104.camel@syru212-207.syr.edu> References: <1097611167.28704.104.camel@syru212-207.syr.edu> Message-ID: <416C4339.3040309@scalableinformatics.com> First questions first: Why do you think you need a faster network, and what aspect of fast do you think you need? Low latency? High bandwidth? Then... What codes are you running? Across how many CPUS? Have you done a performance analysis on your system to observe "slow" runs in progress, and are you convinced that the network is the issue? We have done lots of tuning bits for customers where the issues wound up being something else than what they had thought. It is worth at least looking into for your code/problems, and identifying the bottleneck (if you haven't already done so). That said, all the below require an external "switch" fabric. All range from $500-$2000 per HBA, and about $1000 or more per switch port. Varies a bit. Performance is comparible in most cases, with IB seeming to have a higher ceiling than the others. 
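Multiplying out those rough per-HBA and per-switch-port ranges for a cluster the size in question gives a quick feel for the total outlay. The sketch below is just that arithmetic -- the $500-$2000 per HBA and ~$1000 per port figures quoted above, not vendor pricing, and the node count is the 30-node cluster from the original question:

/* net_cost.c -- ballpark interconnect cost from the rough ranges quoted
 * in this thread ($500-$2000 per HBA, ~$1000 per switch port).
 * Pure arithmetic, not a quote from any vendor.
 */
#include <stdio.h>

int main(void)
{
    int nodes = 30;                          /* dual-Opteron cluster in question */
    double hba_lo = 500.0, hba_hi = 2000.0;  /* per-node adapter */
    double port   = 1000.0;                  /* per switch port, lower bound */

    double lo = nodes * (hba_lo + port);
    double hi = nodes * (hba_hi + port);
    printf("30 nodes: roughly $%.0fk to $%.0fk total, i.e. $%.0f-$%.0f per node\n",
           lo / 1000.0, hi / 1000.0, lo / nodes, hi / nodes);
    return 0;
}

That $45k-$90k spread is a sizeable fraction of what the nodes themselves cost, which is why the "profile first" advice keeps coming up.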
Joe Chris Sideroff wrote: >I'm sure posing this may raise more questions than answer but which >high-speed interconnect would offer the best 'bang for the buck': > >1) myrinet >2) quardics qsnet >3) mellanox infiniband > >Currently, our 30 node dual Opteron (MSI K8D Master-FT boards) cluster >uses Gig/E and are looking to upgrade to a faster network. > >As well, what are the components would one need for each setup? The >reason I ask is for example the Myrinet switches accept different line >cards and am not sure which one to use. Sorry if this a bit of a newbie >question but I have no experience with any of this kind of hardware. I >am reading the docs for each but thought your feedback would be good. > >Thanks > >Chris Sideroff > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 612 4615 From hahn at physics.mcmaster.ca Tue Oct 12 14:06:41 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097611167.28704.104.camel@syru212-207.syr.edu> Message-ID: > I'm sure posing this may raise more questions than answer but which > high-speed interconnect would offer the best 'bang for the buck': > > 1) myrinet > 2) quardics qsnet > 3) mellanox infiniband at least in the last cluster I bought, Myrinet and IB had similar overall costs and MPI latency. so far at least, I haven't found any users who are bandwidth-limited, and so no reason there to prefer IB. (Myri can match the others in bandwidth if you go dual-port; that approximately doubles the Myri cost, though, making it clearly more expensive than IB.) quadrics is more expensive, but also much faster in latency, and competitive with IB in bandwidth. (there are only three interconnects that can claim <2 us latency: quadrics elan4, SGI's numalink and the cray xd1/octigabay.) IB vendors swear up and down that they're cheaper than Myri, lower-latency, higher bandwidth and taste great with iced cream. I must admit to some skepticism in spite of lacking any IB experience ;) it does seem clear that upcoming PCI-e systems will let IB vendors drop a few more chips off their nic, and theoretically come down to the $2-300/nic range. as far as I know, switches are staying more or less at the same price. and it's worth remembering that IB still doesn't have *that* much field-proof (questions regarding whether IB will continue to be a sole-source ecosystem, issues of integrating with Linux, rumors of sticking points regarding pinned memory, qpair scaling in large clusters, handling congestion, etc.) > Currently, our 30 node dual Opteron (MSI K8D Master-FT boards) cluster > uses Gig/E and are looking to upgrade to a faster network. why? how have you evaluated your need for faster networking? do you know whether by "faster" you mean latency or bandwidth? offhand, I'd be a little surprised if a 30-node cluster made a lot of sense with quadrics, since you're unlikley to *need* the superior latency. (ie, it seems like people jones for low-lat mainly when they have frequent, large collective operations. where large means "hundreds" of MPI workers...) > As well, what are the components would one need for each setup? 
The > reason I ask is for example the Myrinet switches accept different line > cards and am not sure which one to use. Sorry if this a bit of a newbie > question but I have no experience with any of this kind of hardware. I > am reading the docs for each but thought your feedback would be good. hmm, myrinet's pages aren't stunningly clear, but also not *that* hard, since they do describe some sample configs. for instance, you can see the "small switches" section of http://www.myrinet.com/myrinet/product_list.html and notice that it's all based on a single 3U enclosure, one or two 8-way cards (M3-SW16-8F) and an optional monitoring card (M3-M). for a 32-node cluster, you'd need 32 nics, a 5-slot cab, 4x M3-SW16-8F's, either a monitoring card or a blanking panel, and 32 cables. if you have fairly firm and short-term plans for adding more nodes, consider getting a bigger chassis. if you have any reason to do IO over myrinet (speed!), consider giving the fileserver(s) dual-port access... configuring other networks is not drastically different, though they often have different terminology, etc. for instance, quadrics switches can be configured with "slim" fat-trees (partially populated with spine/switching cards.) configuration beyond a single switch cab also tends to be interesting ;) regards, mark hahn. From landman at scalableinformatics.com Tue Oct 12 14:23:53 2004 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097615581.28704.126.camel@syru212-207.syr.edu> References: <1097611167.28704.104.camel@syru212-207.syr.edu> <416C4339.3040309@scalableinformatics.com> <1097615581.28704.126.camel@syru212-207.syr.edu> Message-ID: <416C4B69.1020509@scalableinformatics.com> Chris Sideroff wrote: >On Tue, 2004-10-12 at 16:48, Joe Landman wrote: > > >>First questions first: >> >> Why do you think you need a faster network, and what aspect of fast do >>you think you need? Low latency? High bandwidth? >> >> > > To tell you the truth I can't answer that with more than, "I have a >gut feeling". I am in the process of profiling the performance of our >current cluster with our programs. Any suggestions ??? > > Yes, measure the performance as a function of number of CPUs, and then trying this on another similar cluster with the faster interconnect. Do this for "real" runs. Contact me offline if you would like to discuss. > > >>Then... >> >> What codes are you running? Across how many CPUS? Have you done a >>performance analysis on your system to observe "slow" runs in progress, >>and are you convinced that the network is the issue? >> >> > > We run exclusively computation fluid dynamics on it. One program is >Fluent the other is an in-house turbo-machinery code. My experiences so >far have led me to believe Fluent is much more sensitive to the >network's performance than the in-house program. Thus my inquiry into a >higher performance network. > > I haven't run fluent in the last few months, but it is a latency sensitive code. Would be worth exploring your models performance on a faster (e.g. lower latency) net. > > >>We have done lots of tuning bits for customers where the issues wound up >>being something else than what they had thought. It is worth at least >>looking into for your code/problems, and identifying the bottleneck (if >>you haven't already done so). >> >> > > Do you have more information on this 'tuning for customers'. I am >interested in your results. 
Again any suggestions on how to go about >this are welcomed. > > Get atop (http://freshmeat.net/projects/atop/), it is your friend. Profile your code with the profile tools. If you see lots of time spent in "do_writ" and similar, as well as high IO percentages in run times from sar, atop, and other tools, you might want to look at IO tuning. The important aspect of this is to gather real data about where your program spends its time. That is invaluable in deciding how to speed it up. Joe >Thanks, Chris > > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 612 4615 From cnsidero at syr.edu Tue Oct 12 14:13:46 2004 From: cnsidero at syr.edu (Chris Sideroff) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <416C4339.3040309@scalableinformatics.com> References: <1097611167.28704.104.camel@syru212-207.syr.edu> <416C4339.3040309@scalableinformatics.com> Message-ID: <1097615626.28704.129.camel@syru212-207.syr.edu> On Tue, 2004-10-12 at 16:48, Joe Landman wrote: > First questions first: > > Why do you think you need a faster network, and what aspect of fast do > you think you need? Low latency? High bandwidth? To tell you the truth I can't answer that with more than, "I have a gut feeling". I am in the process of profiling the performance of our current cluster with our programs. Any suggestions ??? > Then... > > What codes are you running? Across how many CPUS? Have you done a > performance analysis on your system to observe "slow" runs in progress, > and are you convinced that the network is the issue? We run exclusively computation fluid dynamics on it. One program is Fluent the other is an in-house turbo-machinery code. My experiences so far have led me to believe Fluent is much more sensitive to the network's performance than the in-house program. Thus my inquiry into a higher performance network. > We have done lots of tuning bits for customers where the issues wound up > being something else than what they had thought. It is worth at least > looking into for your code/problems, and identifying the bottleneck (if > you haven't already done so). Do you have more information on this 'tuning for customers'. I am interested in your results. Again any suggestions on how to go about this are welcomed. Thanks, Chris From laytonjb at charter.net Tue Oct 12 14:43:37 2004 From: laytonjb at charter.net (Jeffrey B. Layton) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097611167.28704.104.camel@syru212-207.syr.edu> References: <1097611167.28704.104.camel@syru212-207.syr.edu> Message-ID: <416C5009.7070902@charter.net> Chris Sideroff wrote: >I'm sure posing this may raise more questions than answer but which >high-speed interconnect would offer the best 'bang for the buck': > >1) myrinet >2) quardics qsnet >3) mellanox infiniband > > Just as a data point, I've recently seen IB prices as low as $600 a port including HBA's, cables, software, etc. To also had a little fuel to the fire, if you are using your own codes, try a different MPI. There are a couple of MPI's with really good performance over GigE. Another option is to look at a RDMA NIC. For example, Ammasso has a low-latency GigE NIC. I don't know prices, but be sure to do some testing on these NICs vs. IB and Myrinet. Then you can make a better decision. Good Luck! 
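Before spending on new hardware, it is worth measuring what the current GigE (or any loaner fabric) actually delivers to MPI. A bare-bones ping-pong like the sketch below -- standard MPI-1 calls only, run with exactly two ranks, one per node -- reports half-round-trip latency and bandwidth per message size, which is usually enough to compare MPI stacks or NICs. The numbers it prints are whatever your network and MPI give you; nothing here is vendor data:

/* pingpong.c -- minimal MPI ping-pong for comparing interconnects and MPI stacks.
 * Run with exactly two ranks, one per node, e.g.:  mpirun -np 2 ./pingpong
 * Prints half-round-trip latency and bandwidth for a range of message sizes.
 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, reps, i;
    double t0, half_rtt;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (size = 1; size <= (1 << 20); size *= 4) {
        char *buf = malloc(size);
        reps = (size > 65536) ? 100 : 1000;       /* fewer reps for big messages */

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        half_rtt = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way estimate */
        if (rank == 0)
            printf("%8d bytes  %10.2f us  %10.2f MB/s\n",
                   size, half_rtt * 1e6, (size / half_rtt) / 1e6);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}

Running the same binary over the stock TCP MPI, an alternative GigE-tuned MPI, and a borrowed Myrinet or IB setup makes the latency-vs-bandwidth tradeoff very concrete.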
Jeff From mprinkey at aeolusresearch.com Tue Oct 12 14:05:27 2004 From: mprinkey at aeolusresearch.com (Michael T. Prinkey) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <416C4339.3040309@scalableinformatics.com> Message-ID: This won't help with your Opteron systems as they probably have broadcom (tg3) NICs, but GAMMA has just released an update that supports Intel (e1000) gigabit cards: http://www.disi.unige.it/project/gamma/index.html They have an MPI implementation as well: http://www.disi.unige.it/project/gamma/mpigamma/index.html They claim vastly improved latency and incrementally improved bandwidth on gigabit hardware relative to TCP/IP. We are planning to test it with the new Xeon cluster we will be building next month. It will be interesting to see how it fairs with LINPACK and the MFIX CFD code. Anyone given GAMMA a try? Mike On Tue, 12 Oct 2004, Joe Landman wrote: > First questions first: > > Why do you think you need a faster network, and what aspect of fast do > you think you need? Low latency? High bandwidth? > > Then... > > What codes are you running? Across how many CPUS? Have you done a > performance analysis on your system to observe "slow" runs in progress, > and are you convinced that the network is the issue? > > We have done lots of tuning bits for customers where the issues wound up > being something else than what they had thought. It is worth at least > looking into for your code/problems, and identifying the bottleneck (if > you haven't already done so). > > That said, all the below require an external "switch" fabric. All range > from $500-$2000 per HBA, and about $1000 or more per switch port. > Varies a bit. Performance is comparible in most cases, with IB seeming > to have a higher ceiling than the others. > > Joe > > Chris Sideroff wrote: > > >I'm sure posing this may raise more questions than answer but which > >high-speed interconnect would offer the best 'bang for the buck': > > > >1) myrinet > >2) quardics qsnet > >3) mellanox infiniband > > > >Currently, our 30 node dual Opteron (MSI K8D Master-FT boards) cluster > >uses Gig/E and are looking to upgrade to a faster network. > > > >As well, what are the components would one need for each setup? The > >reason I ask is for example the Myrinet switches accept different line > >cards and am not sure which one to use. Sorry if this a bit of a newbie > >question but I have no experience with any of this kind of hardware. I > >am reading the docs for each but thought your feedback would be good. > > > >Thanks > > > >Chris Sideroff > > > >_______________________________________________ > >Beowulf mailing list, Beowulf@beowulf.org > >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > > > > From rgb at phy.duke.edu Tue Oct 12 15:39:19 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097615626.28704.129.camel@syru212-207.syr.edu> References: <1097611167.28704.104.camel@syru212-207.syr.edu> <416C4339.3040309@scalableinformatics.com> <1097615626.28704.129.camel@syru212-207.syr.edu> Message-ID: On Tue, 12 Oct 2004, Chris Sideroff wrote: > On Tue, 2004-10-12 at 16:48, Joe Landman wrote: > > First questions first: > > > > Why do you think you need a faster network, and what aspect of fast do > > you think you need? Low latency? High bandwidth? 
> > To tell you the truth I can't answer that with more than, "I have a > gut feeling". I am in the process of profiling the performance of our > current cluster with our programs. Any suggestions ??? Analyze the applications, preferrably at the code level. If they exchange a few, big messages then they are likely bandwidth limited. If they exchange many, small messages then they are likely latency limited. If you don't have access to the code, then run a tool such as xmlsysd/wulfstat that lets you watch the (ether)net on a whole cluster at once as it runs your applications and take note on e.g. packet counts per second per node, net data throughput per second per node. Joe's question is dead on the money. Until you do this, you cannot be sure that your application is choking due to a network that is "slow" in any dimension. Even if it IS slow due the network, it may not be slow in a sense that can be substantively fixed by changing networks, if you're already using gigE. gigE's latency isn't great, but its bandwidth should be at least comparable (within a factor of 1-3) of the faster networks. Sometimes, also, the problem is the network but not at the physical layer; rather in the way the code itself is organized and uses the network. If the code is YOUR code, then a trip through e.g. Ian Foster's book on parallel programming and algorithms (there are several others with good reputations) is indicated before investing a LOT of money in a new network. If the code is somebody else's code, then the list is a great place to get actual feedback on what the essential bottlenecks are and to learn of actual clusters that are successful designs. It sounds (below) like you have a bit of both -- good luck finding Fluent users or a Fluent-savvy consultant on the list (both seem pretty likely). Before departing, I'd suggest working with vendors to arrange a loaner network and prototyping it with your programs before finally buying it. These networks are a substantial investment, as the companies that sell them well know. The companies are quite competitive and want your business. They are usually pretty willing to let their hardware "speak for itself" so you aren't investing $1-2K/node only to learn afterwards that it doesn't speed your code up at all. That is an outcome that benefits nobody, really, not even the network vendor (as you'll doubtless later poison their reputation in this very competitive and reputation-sensitive marketplace). rgb > > > Then... > > > > What codes are you running? Across how many CPUS? Have you done a > > performance analysis on your system to observe "slow" runs in progress, > > and are you convinced that the network is the issue? > > We run exclusively computation fluid dynamics on it. One program is > Fluent the other is an in-house turbo-machinery code. My experiences so > far have led me to believe Fluent is much more sensitive to the > network's performance than the in-house program. Thus my inquiry into a > higher performance network. > > > We have done lots of tuning bits for customers where the issues wound up > > being something else than what they had thought. It is worth at least > > looking into for your code/problems, and identifying the bottleneck (if > > you haven't already done so). > > Do you have more information on this 'tuning for customers'. I am > interested in your results. Again any suggestions on how to go about > this are welcomed. 
> > Thanks, Chris > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Tue Oct 12 16:38:51 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: References: <1097611167.28704.104.camel@syru212-207.syr.edu> <416C4339.3040309@scalableinformatics.com> <1097615626.28704.129.camel@syru212-207.syr.edu> Message-ID: On Tue, 12 Oct 2004, Robert G. Brown wrote: > you're already using gigE. gigE's latency isn't great, but its > bandwidth should be at least comparable (within a factor of 1-3) of the > faster networks. Correction (as Greg pointed out offline, very kindly): My gross generalization is grossly incorrect -- you CAN get nearly an order of magnitude improvement in bandwidth with Quadrics and IB. I stand humbly corrected. But you still should verify that bandwidth is your problem before investing in more of it. rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From daniel.kidger at quadrics.com Tue Oct 12 16:44:28 2004 From: daniel.kidger at quadrics.com (Dan Kidger) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097611167.28704.104.camel@syru212-207.syr.edu> References: <1097611167.28704.104.camel@syru212-207.syr.edu> Message-ID: <200410130044.28260.daniel.kidger@quadrics.com> Chris, > I'm sure posing this may raise more questions than answer but which > high-speed interconnect would offer the best 'bang for the buck': > > 1) myrinet > 2) quadrics qsnet > 3) mellanox infiniband > > Currently, our 30 node dual Opteron (MSI K8D Master-FT boards) cluster > uses Gig/E and are looking to upgrade to a faster network. Well, I am from one of the vendors that you cite, so perhaps my reply is biased. But hopefully I can reply without it seeming like a sales pitch. Our QsNetII interconnect sells for around $1700 per node (card=$999, rest is cable and share of the switch). A 4U high 32-way switch would be the nearest match in terms of size for a 30-node cluster. (c $14K iirc) MPI bandwidth is 875MB/s on Opteron (higher on say IA64/Nocona but the AMD PCI-X bridge limits us), MPI latency is 1.5us on Opteron - only slightly better than the Cray/Octigabay Opteron product (usually quoted as 1.7us.) Infiniband bandwidth is only a little less than ours, and latency not much worse than twice ours. Myrinet lags a fair bit currently but they do have a new faster product soon to hit the market which you should look out for. All vendors have a variety of switch sizes - either as a fixed size configuration - or as a chassis that takes one or more line cards that can be upgraded if your cluster gets expanded. Some solutions such as Myrinet revE cards need two switch ports per node but otherwise you just need a switch big enough for your node count and allowing for possible future expansion. Very large clusters have multiple switch cabinets arranged as node-level switches which have links to the nodes and top-level 'spine' switch cabinets that interconnect the node-level cabinets.
If you have the same number of links to the spine switches as you do to the actual nodes then you should have 'full bisectionall bandwidth'. However you can save money by cutting back on the amount of spine switching you buy. Many interconnect vendors offer a choice of copper or fibre cabling. The former is often cheaper (no expensive lasers) but the latter can be used for longer cable runs and is often easier to physically manage particularly when installing very large clusters. What to buy depends very much on your application. Maybe you haven't proved that your GigE is the limiting factor. I do have figures for Fluent on ours and other interconnects but the Beowulf list is not the correct place to post these. As Robert pointed out, most vendors will loan equipment for a month or so and indeed many can provide external access to clusters for benchmarking purposes. Also for example the AMD Developer Center has large Myrient and Infiniband clusters that you can ask to get access to. Hope this helps, Daniel -------------------------------------------------------------- Dr. Dan Kidger, Quadrics Ltd. daniel.kidger@quadrics.com One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 ----------------------- www.quadrics.com -------------------- From evan.cull at duke.edu Tue Oct 12 18:47:29 2004 From: evan.cull at duke.edu (Evan Cull) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] a cluster to drive a wall of monitors Message-ID: <416C8931.4030805@duke.edu> Hi all, I was told this list would be a good place to ask for advice on the following project. (I've tried to search through list archives for related info, but I haven't managed to spot anything so far.) I'm helping with a project that want's to drive a wall of about 50 LCD panels with a linux cluster running Syzygy: http://www.isl.uiuc.edu/syzygy.htm I was considering a cluster of either 50 single processor nodes or 25 dual processor + dual output graphics card nodes. I suppose 50 dual processor nodes would be nice, but I'm pretty sure that's well out of my budget range. I'm betting that the 50 single processor nodes would easily have twice the graphics performance of the 25 dual nodes because they have 2x as many video cards. The tradeoff here is that the dual processor nodes might be more useful for other more general computing tasks we could run on them. Does anyone here have experience buying rackmountable cluster nodes *with graphics cards* who can point me to a vendor? For that matter, have any of you built a similar system & have any suggestions / comments? thanks, Evan Cull From mlleinin at hpcn.ca.sandia.gov Tue Oct 12 20:42:48 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt L. Leininger) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: References: Message-ID: <1097638968.8496.184.camel@trinity> On Tue, 2004-10-12 at 17:06 -0400, Mark Hahn wrote: > > IB vendors swear up and down that they're cheaper than Myri, > lower-latency, higher bandwidth and taste great with iced cream. > I must admit to some skepticism in spite of lacking any IB experience ;) > it does seem clear that upcoming PCI-e systems will let IB vendors > drop a few more chips off their nic, and theoretically come down to > the $2-300/nic range. as far as I know, switches are staying more or > less at the same price. 
and it's worth remembering that IB still > doesn't have *that* much field-proof (questions regarding whether IB > will continue to be a sole-source ecosystem, issues of integrating > with Linux, rumors of sticking points regarding pinned memory, qpair > scaling in large clusters, handling congestion, etc.) > There are multiple 128 node (and greater) IB systems that are stable and are being used for production apps. The #7 top500 machine from RIKEN is using IB and has been in production for over six months. My cluster at Sandia (about 128 nodes) is being used for IB R&D and production science runs. The science runs have produce many papers over the last 9 months. We've purchased other IB clusters ranging from 64 to >300 nodes that are for production use. All run great under Linux, and you have multiple IB vendors to choose from (Voltaire, Topspin, InfiniCon, and Mellanox). Almost all of the IB software development is done under Linux first and then ported to other OSes. QP scaling isn't as critical an issue if the MPI implementation sets up the connections as needed (kinda of a lazy connection setup). Why set up an all-to-all QP connectivity if the MPI implements an all-to-all or collectives as tree based pt2pt algorithms. Network congestion on larger clusters can be reduced by using source based adaptive (multipath) routing instead of the standard IB static routing. Also remember that IB has a lot more field experience than the latest Myricom hardware and MX software stack. - Matt From Thomas_Hoeffel at chiron.com Tue Oct 12 14:54:09 2004 From: Thomas_Hoeffel at chiron.com (Hoeffel, Thomas) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] torque vs openpbs? Message-ID: <1D07750058CEAC4396F1FAB701900301028C7E21@emvosiris.chiron.com> our cluster environment is beginning to tax our openpbs installation. It runs fine on our old cluster (PIII's/10/100 switch) but is a bit quirky on the newer opterons (gig switches, more mem...etc.) Pricing for PBSPro is, well, a bit outrageous and I'm considering Torque/Maui combo. Any thoughts/feedback on the size of the torque community, it's life expectancy..etc. SGE is currently not an option as we have 3rd party code which interfaces well w/ PBS but not SGE. Thomas J. Hoeffel Computational Chemistry Chiron Corp. MS 4.2 4560 Horton St. Emeryville, CA 94608 From alvin at Mail.Linux-Consulting.com Tue Oct 12 22:05:40 2004 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] a cluster to drive a wall of monitors In-Reply-To: <416C8931.4030805@duke.edu> Message-ID: hi ya evan On Tue, 12 Oct 2004, Evan Cull wrote: > I'm helping with a project that want's to drive a wall of about 50 LCD > panels with a linux cluster running Syzygy: > http://www.isl.uiuc.edu/syzygy.htm i didn't see 4, 16, 50 monitors at that site :-) but maybe i didnt look in the right places or with the right eyeballs for 2x3 or more monitors .. 
http://www.linux-1u.net/X11/Quad/ - a wall of 16 monitors http://www.linux-1u.net/X11/Quad/gstreamer.net/vw1.png http://www.linux-1u.net/X11/Quad/gstreamer.net/vw2.png http://www.linux-1u.net/X11/Quad/gstreamer.net/video-wall-howto.html the trick is to divide out the one pic into 1/4 pics each and the bracket between each adjacent lcd to be minimal and non-distracting fromt eh whole image displayed on 4 or more monitors lots of XF86Config editing and tweeking doing that with *.jpg is almost trivial doing that with *.mpeg with mplayer/zine becomes a fun project > I was considering a cluster of either 50 single processor nodes or 25 > dual processor + dual output graphics card nodes. I suppose 50 dual an itty bitty P3-800 equivalent cpu can trivially play an mpeg file ( you dont need horsepower to play mpegs ) if you are encoding ... that might be trickier .. and that you'd need to keep the video and audio in sync ( not trivial ) - lots of rejected *.mpegs due to sound and video being out of sync ( even on the fastest pcs ) > Does anyone here have experience buying rackmountable cluster nodes > *with graphics cards* who can point me to a vendor? we sell those puppises, which is half the fun .. :-) > For that matter, have any of you built a similar system & have any > suggestions / comments? depending on where the movies are being played, remote admin or not and if they "hit reset" or powerfailures will be yur biggest problem - we have 100 systems in 100 cities across this itty-bitty-land c ya alvin From hahn at physics.mcmaster.ca Tue Oct 12 22:40:03 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097638968.8496.184.camel@trinity> Message-ID: > There are multiple 128 node (and greater) IB systems that are stable > and are being used for production apps. The #7 top500 machine from I thank you for this street-level information! it's frustrating to only know a technology based on marketing... > RIKEN is using IB and has been in production for over six months. My > cluster at Sandia (about 128 nodes) is being used for IB R&D and still, 128 nodes is fairly small these days. would you characterize your applications as fairly bandwidth-intensive? I know that many of the apps that run on really big weapons-related labs tend to emphasize latency to an extreme degree, but perhaps your codes are not like that? > >300 nodes that are for production use. All run great under Linux, and > you have multiple IB vendors to choose from (Voltaire, Topspin, > InfiniCon, and Mellanox). well, aren't all of those just minor modifications of the same mellanox chip? that's what I meant by "not-really-multi-vendor". the IB world would like to compare itself to the eth world, but it's a very, very long way away from being really vendor-independent. > Almost all of the IB software development is > done under Linux first and then ported to other OSes. very interesting! do you mean user-level IB software and middleware? I had the impression (circa OLS in July) that there was no real unification of linux IB stacks, and significant problems with windows-centricness of the code. > QP scaling isn't as critical an issue if the MPI implementation sets > up the connections as needed (kinda of a lazy connection setup). Why > set up an all-to-all QP connectivity if the MPI implements an all-to-all > or collectives as tree based pt2pt algorithms. that sounds reasonable, but does it work out well? 
I guess it would depend mainly on whether the actual collective groups change frequently and are reused. > Network congestion on > larger clusters can be reduced by using source based adaptive > (multipath) routing instead of the standard IB static routing. interesting, again! in the most recent visit by S&M people from an IB vendor, they claimed that there was no problem and that any reasonably smart switch would have a routing manager smart enough to prevent the non-problem. > Also remember that IB has a lot more field experience than the latest > Myricom hardware and MX software stack. to me, "recent myricom" means e-cards, which I, perhaps naively, think are more of a known quantity than anything IB. and I haven't managed to lay hands on MX yet . I'm really glad to hear early adopters of IB speak up; I still claim that they actually are early adopters, though ;) regards, mark hahn. From nixon at nsc.liu.se Wed Oct 13 07:15:11 2004 From: nixon at nsc.liu.se (Leif Nixon) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] Grid Engine question In-Reply-To: <1097612732.18951.14.camel@pel> (Sean Dilda's message of "Tue, 12 Oct 2004 16:25:32 -0400") References: <416B91BB.8070102@ohmsurveys.com> <1097612732.18951.14.camel@pel> Message-ID: <873c0iwvps.fsf@nsc.liu.se> Sean Dilda writes: > In SGE 6.0 they added a feature they call 'advanced reservations'. Its > not really advanced, and its not what I consider 'reservations' to be, > but it is exactly what you want. That's "advance reservations", not "advanced reservations". -- Leif Nixon Systems expert ------------------------------------------------------------ National Supercomputer Centre Linkoping University ------------------------------------------------------------ From brian at cypher.acomp.usf.edu Wed Oct 13 09:05:57 2004 From: brian at cypher.acomp.usf.edu (Brian R Smith) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] a cluster to drive a wall of monitors In-Reply-To: <416C8931.4030805@duke.edu> References: <416C8931.4030805@duke.edu> Message-ID: <1097683557.23424.11.camel@cypher.acomp.usf.edu> Evan, We just built a smaller version of that, a 4x3 display wall. We went the dual-output video card route and deeply regret it. The performance is rather lackluster and getting the screens to align correctly e.g. not displaying broken chunks of images, is damn near impossible with DMX/Chromium. We originally configured it so that each video card ran two frame buffers, one for each screen, allowing us to configure each displays bounderies manually. The problem was that running two frame buffers on each card resulted in further degrading each machine's performance. If at all possible, make sure you have one machine for each screen. Of course, YMMV, but this has been our experience in the matter. Good luck. Brian Smith On Tue, 2004-10-12 at 21:47, Evan Cull wrote: > Hi all, > > I was told this list would be a good place to ask for advice on the > following project. (I've tried to search through list archives for > related info, but I haven't managed to spot anything so far.) > > I'm helping with a project that want's to drive a wall of about 50 LCD > panels with a linux cluster running Syzygy: > http://www.isl.uiuc.edu/syzygy.htm > > I was considering a cluster of either 50 single processor nodes or 25 > dual processor + dual output graphics card nodes. I suppose 50 dual > processor nodes would be nice, but I'm pretty sure that's well out of my > budget range. 
I'm betting that the 50 single processor nodes would > easily have twice the graphics performance of the 25 dual nodes because > they have 2x as many video cards. The tradeoff here is that the dual > processor nodes might be more useful for other more general computing > tasks we could run on them. > > Does anyone here have experience buying rackmountable cluster nodes > *with graphics cards* who can point me to a vendor? > > For that matter, have any of you built a similar system & have any > suggestions / comments? > > thanks, > Evan Cull > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From chliaskos at yahoo.gr Wed Oct 13 00:10:53 2004 From: chliaskos at yahoo.gr (Chris LS) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] Optimal Number of nodes? Message-ID: <20041013071053.82519.qmail@web86906.mail.ukl.yahoo.com> Hello, I'm an electrical engineering student. Can anyone help me on the following subject? I have to theoretically design a clustered server, and although most parts of the procedure are complete, I can't find a way to calculate or estimate the optimal number of nodes needed. I've spent countless hours searching but I can't find anything but very general advice, while I'm interested in the exact procedure - if it exists. Can anyone give me any related links or any other info on this? Thanks in advance! Chris chliaskos@yahoo.gr -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041013/f27d219d/attachment.html From cnsidero at syr.edu Wed Oct 13 07:55:05 2004 From: cnsidero at syr.edu (Chris Sideroff) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097611167.28704.104.camel@syru212-207.syr.edu> References: <1097611167.28704.104.camel@syru212-207.syr.edu> Message-ID: <1097679300.28704.160.camel@syru212-207.syr.edu> Thanks for all the excellent replies. The consensus seems to be to carry out some performance profiling with the current hardware and compare with a high-speed network cluster. The former I am in the process of doing and the latter I will try to do. More specifically, I believe the most important testing for our cluster will be Fluent's scalability and sensitivity to the network. The reason I say this is because there are multiple users (~6-8) running large Fluent jobs (1-10 million cells) with various solvers, which have different CPU and memory requirements. While the in-house code is run by one person, a lot less frequently and (for other reasons) currently does not use more than 8 processors. Following rgb's suggestions, it will be difficult to analyze Fluent at the code level since we don't have the code, but it has some built-in monitors that I can use combined with some Linux tools. When the time comes to scale the in-house code to >8 procs we will have greater flexibility tuning it. BTW, thanks for the parallel programming reference. I did find some benchmarks on Fluent's website which indicate that Myrinet definitely scales better than GigE but I still want to carry out my own benchmarking. If anyone has any experience benchmarking their clusters using Fluent feel free to supply your thoughts.
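For the scalability side, the usual first pass is to run the same representative case at 1, 2, 4, 8, 16 CPUs and reduce the wall-clock times to speedup and parallel efficiency before pointing a finger at the network. A tiny sketch of that bookkeeping is below; the times[] entries are made-up placeholders to show the arithmetic, not measurements from any cluster or from Fluent:

/* scaling.c -- turn wall-clock times for the same job at different CPU
 * counts into speedup and parallel efficiency.  The times[] values below
 * are placeholders; substitute measured wall-clock seconds from real runs.
 */
#include <stdio.h>

int main(void)
{
    int    ncpu[]  = { 1, 2, 4, 8, 16 };
    double times[] = { 3600.0, 1850.0, 980.0, 560.0, 390.0 };   /* hypothetical */
    int    n = sizeof(ncpu) / sizeof(ncpu[0]);
    int    i;

    printf("%6s %10s %8s %11s\n", "CPUs", "wall (s)", "speedup", "efficiency");
    for (i = 0; i < n; i++) {
        double speedup = times[0] / times[i];
        double eff     = speedup / ncpu[i];
        printf("%6d %10.1f %8.2f %10.0f%%\n", ncpu[i], times[i], speedup, 100.0 * eff);
    }
    /* Rule of thumb: if efficiency stays high (say above 80%) out to the CPU
     * counts you actually use, a faster interconnect is unlikely to buy much;
     * if it collapses, find out why (latency, bandwidth, I/O) before buying. */
    return 0;
}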
From joachim at ccrl-nece.de Wed Oct 13 07:37:18 2004 From: joachim at ccrl-nece.de (joachim@ccrl-nece.de) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097638968.8496.184.camel@trinity> References: <1097638968.8496.184.camel@trinity> Message-ID: <3836.221.114.211.251.1097678238.squirrel@postman.ccrl-nece.de> > QP scaling isn't as critical an issue if the MPI implementation sets > up the connections as needed (kinda of a lazy connection setup). Why > set up an all-to-all QP connectivity if the MPI implements an all-to-all > or collectives as tree based pt2pt algorithms. Network congestion on Good MPI collectives often are not tree based, but need more connectivity. Of course, the best collectives are optimized for a certain interconnect. Joachim From john.hearns at clustervision.com Wed Oct 13 00:26:35 2004 From: john.hearns at clustervision.com (John Hearns) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] torque vs openpbs? In-Reply-To: <1D07750058CEAC4396F1FAB701900301028C7E21@emvosiris.chiron.com> References: <1D07750058CEAC4396F1FAB701900301028C7E21@emvosiris.chiron.com> Message-ID: <1097652395.8836.8.camel@vigor12> On Tue, 2004-10-12 at 22:54, Hoeffel, Thomas wrote: > our cluster environment is beginning to tax our openpbs installation. > It runs fine on our old cluster (PIII's/10/100 switch) but is a bit quirky > on the newer opterons (gig switches, more mem...etc.) > Pricing for PBSPro is, well, a bit outrageous and I'm considering > Torque/Maui combo. > > Any thoughts/feedback on the size of the torque community, it's life > expectancy..etc. > SGE is currently not an option as we have 3rd party code which interfaces > well w/ PBS but not SGE. Is this a package called Materials Studio by any chance? I posted something a few weeks ago to the Gridengine list about that. Looking at the interface, it didn't look too hard to port to Gridengine. From laurenceliew at yahoo.com.sg Wed Oct 13 06:04:21 2004 From: laurenceliew at yahoo.com.sg (Laurence Liew) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] a cluster to drive a wall of monitors In-Reply-To: <416C8931.4030805@duke.edu> References: <416C8931.4030805@duke.edu> Message-ID: <416D27D5.4020802@yahoo.com.sg> Hi, Check out the Visualization Roll over in www.rocksclusters.org... The SDCS Rocks guys have got such a system setup.... 3x3 panels... driven by a Rocks cluster with 9 compute nodes (Shuttle XPC + NVidia cards) and 1 frontend.... Contact me offline if there is more interest... or post to the Rocks mailing list. Cheers! laurence Evan Cull wrote: > Hi all, > > I was told this list would be a good place to ask for advice on the > following project. (I've tried to search through list archives for > related info, but I haven't managed to spot anything so far.) > I'm helping with a project that want's to drive a wall of about 50 LCD > panels with a linux cluster running Syzygy: > http://www.isl.uiuc.edu/syzygy.htm > > I was considering a cluster of either 50 single processor nodes or 25 > dual processor + dual output graphics card nodes. I suppose 50 dual > processor nodes would be nice, but I'm pretty sure that's well out of my > budget range. I'm betting that the 50 single processor nodes would > easily have twice the graphics performance of the 25 dual nodes because > they have 2x as many video cards. The tradeoff here is that the dual > processor nodes might be more useful for other more general computing > tasks we could run on them. 
> Does anyone here have experience buying rackmountable cluster nodes > *with graphics cards* who can point me to a vendor? > > For that matter, have any of you built a similar system & have any > suggestions / comments? > > thanks, > Evan Cull > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- A non-text attachment was scrubbed... Name: laurenceliew.vcf Type: text/x-vcard Size: 150 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041013/432b1a47/laurenceliew.vcf From oliviacal at earthlink.net Tue Oct 12 21:55:24 2004 From: oliviacal at earthlink.net (Olivia) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] Beowulf Illustrated Message-ID: <410-220041031345524410@earthlink.net> Is There a picture of Beowulf? I have to draw it on a poster board for a project. Olivia Calzada oliviacal@earthlink.net Why Wait? Move to EarthLink. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041012/65bc212c/attachment.html From bill at cse.ucdavis.edu Wed Oct 13 13:31:39 2004 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097611167.28704.104.camel@syru212-207.syr.edu> References: <1097611167.28704.104.camel@syru212-207.syr.edu> Message-ID: <20041013203139.GB23216@cse.ucdavis.edu> I'm glossing over many details, but in general I've found the below mentioned strategies a good first order approximation. I'd suggest taking several representative production runs and try graph the performance on 1,2,4,8,16,32 processors or whatever is feasible for your jobs and cluster. If you see good scaling I.e. each jump gets almost twice as much work done you very likely will not benefit from a faster interconnect. If you do not see good scaling then you might be bottlenecked by latency or bandwidth, or possibly other factors like a faster than linear increase in work with extra nodes, and disk I/O performance among others. Hope for the first, it will save you money, time and effort. If it's the later then it would be worth your while to try to find out exactly why your code isn't scaling. Even the simplest measures can help, for instance recording the before and after packet counts as reported by ifconfig. Graphing how the number of packets increases with N and how the performance scales with N might provide valuable insight. Another dirty hack can be to force your interfaces to 100 Mbit and see how the performance changes. If it's minimal it's likely to be either latency (100 Mbit and GigE usually don't vary by much) or not bandwidth constrained. Also something like ganglia can provide you with a significant amount of additional info for a run, so you can watch memory, network, load, memory used, buffers used etc. See how these variables change with the timestep and with the number of nodes can be very helpful for getting a general idea of how your job is behaving. One particular job I was running had network traffic increasing with each iteration, above a certain point the wall clock time per timestep increased. Calculations showed I was getting 30% of peak GigE performance, it is likely that between the packet overhead and MPI overhead that was as fast as I was likely to see. 
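A minimal sketch of the before/after counter idea, reading the counters straight from /proc/net/dev on Linux instead of parsing ifconfig output (the script name and the wrapped command are only examples):

# netdelta.py (hypothetical name): snapshot per-interface byte/packet
# counters from /proc/net/dev, run the command given on the command line,
# then print the deltas. Linux only.
import subprocess, sys

def read_counters():
    counters = {}
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:            # skip the two header lines
            iface, data = line.split(":", 1)
            fields = data.split()
            counters[iface.strip()] = {
                "rx_bytes": int(fields[0]), "rx_packets": int(fields[1]),
                "tx_bytes": int(fields[8]), "tx_packets": int(fields[9]),
            }
    return counters

before = read_counters()
subprocess.call(sys.argv[1:])        # e.g. python netdelta.py mpirun -np 4 ./a.out
after = read_counters()

for iface in sorted(after):
    if iface in before:
        delta = dict((k, after[iface][k] - before[iface][k]) for k in after[iface])
        if delta["rx_packets"] or delta["tx_packets"]:
            print(iface, delta)

Dividing the byte deltas by the wall-clock time of the run gives the average bandwidth actually achieved, which can then be compared against the theoretical peak of the link.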
Certainly none of the above will give you as good as an idea as a source code analysis or a fully profiled run, but they can help steer you in the right direction. -- Bill Broadley Computational Science and Engineering UC Davis From landman at scalableinformatics.com Tue Oct 12 22:12:31 2004 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097638968.8496.184.camel@trinity> References: <1097638968.8496.184.camel@trinity> Message-ID: <416CB93F.40502@scalableinformatics.com> Hi Matt: Good to see you here ... :) Matt L. Leininger wrote: > > > > There are multiple 128 node (and greater) IB systems that are stable >and are being used for production apps. The #7 top500 machine from >RIKEN is using IB and has been in production for over six months. My >cluster at Sandia (about 128 nodes) is being used for IB R&D and > > FWIW I used the nice setup that the AMD Dev center team have set up for benchmarking and testing. They have a nice IB platform there. [...] > QP scaling isn't as critical an issue if the MPI implementation sets >up the connections as needed (kinda of a lazy connection setup). Why >set up an all-to-all QP connectivity if the MPI implements an all-to-all >or collectives as tree based pt2pt algorithms. Network congestion on >larger clusters can be reduced by using source based adaptive >(multipath) routing instead of the standard IB static routing. > > On features utility ... (qp scaling, ...) (more to Mark than Matt here) One of the things I remember as a "feature" much touted by the marketeers in the ccNUMA 6.5 IRIX days was page migration. This feature was supposed to ameliorate memory access hotspots in parallel codes. Enough hits on a page from a remote CPU, and whammo, off it went to the remote CPU. Turns out this was "A Bad Thing(TM)". There were many reasons for this, but in the end, page migration was little more than a marginal feature, best used in specific corner cases. Sure, someone will speak up and tell me how much pain it saved them, or made their code 3 orders of magnitude faster. I never saw that in general. I got better results from dplace, and large pages than I ever got from some of these other features. The point is that there are often lots of features. Some of which might even be generally useful. Others might simply not be useful as the application level issues might be better served by other methods (as you pointed out). IB works pretty nicely on clusters. So do many of the other interconnects. If you have latency bound or bandwidth bound problems, certainly it would be worth looking into. The original question was which to look at. First the need needs to be assessed, and from there, a reasonable comparison may be made. IB does look like it is drawing wide support right now, and is not single sourced. It may be possible (though I haven't done much in the way of measurement) that tcp offload systems might help as well. If you are not extremely sensitive to latency, you might be able to use these. If you are, you should stick to the low latency fabrics. > Also remember that IB has a lot more field experience than the latest >Myricom hardware and MX software stack. 
> > Joe From deadline at linux-mag.com Wed Oct 13 15:41:32 2004 From: deadline at linux-mag.com (Douglas Eadline, Cluster World Magazine) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] a cluster to drive a wall of monitors In-Reply-To: <416C8931.4030805@duke.edu> Message-ID: We did an issue on this: http://www.clusterworld.com/issues/jul-04-preview.shtml BTW: Issue gallery is here: http://www.clusterworld.com/issues.shtml We are working on way make back issues available. For now, if you know someone that gets ClusterWorld, maybe you can borrow an issue. Oh, I see you are at Duke. Maybe contact rgb. (http://www.phy.duke.edu/~rgb/) Doug On Tue, 12 Oct 2004, Evan Cull wrote: > Hi all, > > I was told this list would be a good place to ask for advice on the > following project. (I've tried to search through list archives for > related info, but I haven't managed to spot anything so far.) > > I'm helping with a project that want's to drive a wall of about 50 LCD > panels with a linux cluster running Syzygy: > http://www.isl.uiuc.edu/syzygy.htm > > I was considering a cluster of either 50 single processor nodes or 25 > dual processor + dual output graphics card nodes. I suppose 50 dual > processor nodes would be nice, but I'm pretty sure that's well out of my > budget range. I'm betting that the 50 single processor nodes would > easily have twice the graphics performance of the 25 dual nodes because > they have 2x as many video cards. The tradeoff here is that the dual > processor nodes might be more useful for other more general computing > tasks we could run on them. > > Does anyone here have experience buying rackmountable cluster nodes > *with graphics cards* who can point me to a vendor? > > For that matter, have any of you built a similar system & have any > suggestions / comments? > > thanks, > Evan Cull > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- ---------------------------------------------------------------- Editor-in-chief ClusterWorld Magazine Desk: 610.865.6061 Fax: 610.865.6618 www.clusterworld.com From rgb at phy.duke.edu Wed Oct 13 16:01:55 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] a cluster to drive a wall of monitors In-Reply-To: References: Message-ID: On Wed, 13 Oct 2004, Douglas Eadline, Cluster World Magazine wrote: > > We did an issue on this: > > http://www.clusterworld.com/issues/jul-04-preview.shtml > > BTW: Issue gallery is here: > http://www.clusterworld.com/issues.shtml > > We are working on way make back issues available. For now, if > you know someone that gets ClusterWorld, maybe you can borrow an issue. > > Oh, I see you are at Duke. Maybe contact rgb. > (http://www.phy.duke.edu/~rgb/) Oh yeah, kinda forgot about that. Busy week. I'll see if I can dig out the issue from my neatly organized stash (ha!) rgb > > Doug > > > On Tue, 12 Oct 2004, Evan Cull wrote: > > > Hi all, > > > > I was told this list would be a good place to ask for advice on the > > following project. (I've tried to search through list archives for > > related info, but I haven't managed to spot anything so far.) 
> > > > I'm helping with a project that want's to drive a wall of about 50 LCD > > panels with a linux cluster running Syzygy: > > http://www.isl.uiuc.edu/syzygy.htm > > > > I was considering a cluster of either 50 single processor nodes or 25 > > dual processor + dual output graphics card nodes. I suppose 50 dual > > processor nodes would be nice, but I'm pretty sure that's well out of my > > budget range. I'm betting that the 50 single processor nodes would > > easily have twice the graphics performance of the 25 dual nodes because > > they have 2x as many video cards. The tradeoff here is that the dual > > processor nodes might be more useful for other more general computing > > tasks we could run on them. > > > > Does anyone here have experience buying rackmountable cluster nodes > > *with graphics cards* who can point me to a vendor? > > > > For that matter, have any of you built a similar system & have any > > suggestions / comments? > > > > thanks, > > Evan Cull > > > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > > -- > ---------------------------------------------------------------- > Editor-in-chief ClusterWorld Magazine > Desk: 610.865.6061 > Fax: 610.865.6618 www.clusterworld.com > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From kus at free.net Wed Oct 13 11:09:54 2004 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: Message-ID: In message from "Michael T. Prinkey" (Tue, 12 Oct 2004 17:05:27 -0400 (EDT)): > >This won't help with your Opteron systems as they probably have >broadcom >(tg3) NICs, but GAMMA has just released an update that supports Intel >(e1000) gigabit cards: > >http://www.disi.unige.it/project/gamma/index.html > >They have an MPI implementation as well: > >http://www.disi.unige.it/project/gamma/mpigamma/index.html We had some experience with older GAMMA versions, but we stopped using them because of the absence of reliable SMP support and because of the e1000 cards we had begun to use. Now e1000 is supported only for 2.6 kernels (we are using 2.4 for x86_64). The latest pair of GAMMA versions for the 2.4 and 2.6 kernels is attractive and may give GAMMA "new life". But I'm wary of the instability of Intel Pro/1000 NICs, in the sense that Intel exchanges the NIC chips extremely often between different NIC "versions". The GAMMA developers mention the i82546 chipset; what about the others? Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow >They claim vastly improved latency and incrementally improved >bandwidth on >gigabit hardware relative to TCP/IP. We are planning to test it with >the >new Xeon cluster we will be building next month. It will be >interesting >to see how it fairs with LINPACK and the MFIX CFD code. > >Anyone given GAMMA a try?
> >Mike > From hugo at dolphinics.no Wed Oct 13 11:03:09 2004 From: hugo at dolphinics.no (Hugo Kohmann) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] Re: HPC in Windows In-Reply-To: <416B9356.40400@craigsplanet.force9.co.uk> References: <416B9356.40400@craigsplanet.force9.co.uk> Message-ID: All, A very good and stable open source MPI package for Windows ( and Linux/Solaris ) can be found at http://www.lfbs.rwth-aachen.de/content/index.php?ctl_pos=172 This package has been available for several years. Best regards Hugo ========================================================================================= Hugo Kohmann | Dolphin Interconnect Solutions AS | E-mail: P.O. Box 150 Oppsal | hugo at dolphinics.com N-0619 Oslo, Norway | Web: Tel:+47 23 16 71 73 | http://www.dolphinics.com Fax:+47 23 16 71 80 | Visiting Address: Olaf Helsets vei 6 | From michael at halligan.org Wed Oct 13 22:38:33 2004 From: michael at halligan.org (Michael T. Halligan) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] RedHat Satellite Server as a cluster management tool. Message-ID: <416E10D9.70006@halligan.org> Has anybody used (or tried to use) the RHN system as a HPC management tool. I've implemented this successfully in a 100 host environment for a customer of mine, and am in the process of re-architecting an infrastructure with about 150 nodes.. That's about as far as I've gotten with it. Once I get past the cost, the poor documentation, and "OK" support, I'm finding that it's actually a great (though slightly immature) piece of software for the enterprise. The ease of keeping an infrastructure in sync, and tthe lowered workload for sysadmins At 100 nodes, the pricing seems to be about $274/year per node including licensing, entitlements, and the software cost of a RHN server (add another $5k-$7k for a pair of beefy boxes to act as the RHN server.. though as far as I can tell, redhat's specs on the RHN server are far exagerrated.. I could get by with $2500 worth of servers on that end for the environments I've deployed on). So, in the end, $28k/year for an enterprise of 100 servers, in one environment has meant being able to shrink the next year staffing needs by 2 people, and in one by one person, it pays for itself.. We have a 512 node render farm project we're bidding on for a new customer, and I'm wondering how those in the beowulf community who have used RHN satellite server perceive it. So far we're considering LFS and Enfusion, which are both more HPC oriented, but I'm really enjoying RHN as a management system. ---------------- BitPusher, LLC http://www.bitpusher.com/ 1.888.9PUSHER (415) 724.7998 - Mobile From rgb at phy.duke.edu Thu Oct 14 10:39:55 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] RedHat Satellite Server as a cluster management tool. In-Reply-To: <416E10D9.70006@halligan.org> Message-ID: On Wed, 13 Oct 2004, Michael T. Halligan wrote: > Has anybody used (or tried to use) the RHN system as a HPC management > tool. I've implemented this > successfully in a 100 host environment for a customer of mine, and am in > the process of > re-architecting an infrastructure with about 150 nodes.. That's about as > far as I've gotten > with it. Once I get past the cost, the poor documentation, and "OK" > support, I'm finding > that it's actually a great (though slightly immature) piece of software > for the enterprise. The ease of keeping > an infrastructure in sync, and tthe lowered workload for sysadmins I can only say "why bother". 
Everything it does can be done easier, faster, and better with PXE/kickstart for the base install followed by yum for fine tuning the install, updates and maintenance (all totally automagical). Yum is in RHEL, is fully GPL, is well documented, has a mailing list providing the active support of LOTS of users as well as the developers/maintainers, and is free as in air. Oh, and it works EQUALLY well with Centos, SuSE, Fedora Core 2, and other RPM-based distros, and is in wide use in clusters (and LANs) across the country. With PXE/kickstart/yum, you just build and test a kickstart file for the basic node install (necessary in any event), bootstrap the install over the net via PXE, and then forget the node altogether. yum automagically handles updates, and can also manage things like distributed installs and locking a node to a common specified set of packages. It manages all dependencies for you so that things work properly. It takes me ten minutes to install ten nodes, mostly because I like to watch the install start before moving on to handle the rare install that is interrupted for some reason (e.g. a faulty network connection). One can do a lot more than this much faster if you control the boot strictly from PXE so you don't even need to interact with the node on the console at all. How much better than that can you do? Alternatively, there are things like warewulf and scyld where even commercial solutions probably won't work out to be much more (if any more) expensive. Especially when you add in the cost of those two "beefy boxes acting as RHN servers". What a waste! We use a single repository to manage installs and updates for our entire campus (close to 1000 systems just in clusters, plus that many more in LANs and on personal desktops). And the server isn't terribly beefy -- it is actually a castoff desktop being pressed into extended service, although we finally have plans to put a REAL server in pretty soon. I mean, what kind of load does a cluster node generally PLACE on a repository server after the original install? Try "none" and you'd be really close to the truth -- an average of a single package a week updated is probably too high an estimate, and that consumes (let's see) something like 1 network-second of capacity between server and node a week with plain old 100BT. There are solutions that are designed to be scalable and easy to understand and maintain, and then there are solutions designed to be topdown manageable with a nifty GUI (and sell a lot of totally unneeded resources at the same time). Guess which one RHN falls under. Flamingly yours (not at you, but at RHN) rgb > > At 100 nodes, the pricing seems to be about $274/year per node including > licensing, entitlements, and the > software cost of a RHN server (add another $5k-$7k for a pair of beefy > boxes to act as the > RHN server.. though as far as I can tell, redhat's specs on the RHN > server are far exagerrated.. I > could get by with $2500 worth of servers on that end for the > environments I've deployed on). So, in the > end, $28k/year for an enterprise of 100 servers, in one environment has > meant being able to shrink the > next year staffing needs by 2 people, and in one by one person, it pays > for itself.. > > We have a 512 node render farm project we're bidding on for a new > customer, and I'm wondering how those in the > beowulf community who have used RHN satellite server perceive it. 
So far > we're considering LFS and Enfusion, > which are both more HPC oriented, but I'm really enjoying RHN as a > management system. > > ---------------- > BitPusher, LLC > http://www.bitpusher.com/ > 1.888.9PUSHER > (415) 724.7998 - Mobile > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Thu Oct 14 12:14:51 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] RedHat Satellite Server as a cluster management tool. In-Reply-To: <53761.66.150.251.142.1097780055.squirrel@mail3.bitpusher.com> Message-ID: On Thu, 14 Oct 2004, Michael T. Halligan wrote: > > Robert, > > So have you actually used the satellite server? My biggest problem with > using RHN has been the strong lack of deployments it's had.. A lot of We looked at it pretty seriously at Duke -- RH is a short walk away, we've had a long and productive relationship with them, and they were offering us a "deal" on their RHN supported product. The problem (and the likely reason for the strong lack of deployment) is the cost scaling and minimum buy-in. Frankly, if they gave it away free the server requirements are kind of crazy, given the number of machines we run on campus (and the fact that we manage it now, quite successfully, on a shoestring and a mix/choice of Centos and FC2). I think that RHN's major advantage to consumers is topdown network management in corporate environments where the costs of this sort of management tool are swallowed in the greater TCO issues of running a major data center (and where local sysadmin competence is likely to be "red hat certified systems engineers" who've gone through their training and are roughly as deep-roots hapless as their MCSE counterparts tend to be). That is, they know how to use RH's GUI tools, but they really don't UNDERSTAND that much about the systems they manage. For a corporation who just wants it to work and considers spending $100K on this and that so it works with the human resources they have available to be petty cash, that's fine. In the University/Research world, resources tend to be tight, expertise levels are relatively high, and there is even opportunity cost labor in the expert labor pools that can be diverted to learning how to do something really cheaply AND really well. That's why you have Debian clusters, ROCKS clusters, RH/FC/Centos clusters, SuSE clusters, Mandrake clusters, Warewulf clusters, Clustermatic clusters -- all largely "homebrew" at the administrative level (although the cluster-specific projects can get pretty fancy wrapping up the brew;-) -- that avoid using a) anything that you have to pay for if possible; b) anything that you have to pay a LOT for period; c) anything that doesn't scale. Yum requires more investment of effort at the beginning to learn it, as it is a real, command line sysadmin tool and yes, you'll need to read the documentation (some of which I wrote:-), work with it, play with it, figure out how to make it jump through hoops, and ultimately realize that it is REALLY powerful. Designed by sysadmins, for sysadmins. Designed and maintained by people who use it every day in large scale deployments in resource starved institutions. 
That sort of thing. Like all GUI tools vs command line tools, there is the usual "learn to use it in a day, pay for using it forever" that plagues the user of any windowing interface that actually has to manipulate large numbers of files and complex relationships (GUIs are all about simplicity, but not everything is "simple"). So Duke has at least for the moment tabled the RHN issue until there is a clear and burning need for it that justifies the cost, including the cost of diverting our human resources AWAY from using a tool that manifestly scales better once it is mastered. > people just naturally assume redhat is bad (hell, I even do. I use debian > for all of my personal and corporate servers).. But very few who > automatically take that stance have actually worked with the products > enough to give emperical evidence as to why. I love RH. I used to pay them money for their OS distro every major release voluntarily, until they went hypercommercial. Now I use FC2 and may migrate even further away. RH's pricing model is purely corporate. I just don't think they've grokked either the university or the personal or the HPC cluster market, or maybe they have and just don't care. rgb > > It took a while to gather enthusiasm enough to evaluate it, and a couple > of months of solid testing before I could recommend it. I've built about > 1/2 dozen similar deployment/management tools at this point, each one > built for a customer (hence the reason building 6 instead of just > improving upon the same one). > > Imaging is one thing, and yeah kickstart is easy, no objections to that.. > RHN just makes it a lot easier to deal with kickstart. It also gives a > rather useful, but more enterprise focused management system to allow you > to manage (software|config) channels, server groups, and a good method to > deal with groups with unions & intersections. I'm finding it especially > nice at one site at which 1/2 of their servers are used for testing and > 1/2 for their production environment. Pushing new patches, scripts, > commands, files to select sets of systems requires very little effort. > > RedHat's configuration management system is actually really nice. They've > put a simple (but extensible) macro system into it, which allows you to > keep one configuration file for all of the servers in a given class, when > only a few things change, and having system-specific variables be parsed > out when servers pull configs from the gold server.. Sure, you can do this > with cfengine or pikt, but uploading a config file to a webform is a lot > simpler than setting up cfengine/pikt and implementing it (I know this > from a lot of experience. > > One of the lackings of using a yum/pxe/kickstart environment (of which I'm > rather familiar with, currently managing 6 customers with a similar > environment) is that there's no "already there" configuration/versioning > management system. That was one of the key points of redhat, the fact > that I can do at-will repurposing/reprovisioning (like turning a 100 > server 30/70 app db server/app server environment into a 70/30 app/db > server environment in 5 minutes without kickstarting and zero manual > interaction).. Sure, I agree. Although it needn't take THAT long to do with yum and the cluster shell of your choice. Versioning is currently done fairly casually, or outside yum itself. There is no point and click package selector (although any editor works just fine). 
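For what it's worth, the "yum plus the cluster shell of your choice" approach can be sketched in a few lines of Python; the node names below are placeholders, and a real parallel shell does the same job with far more polish:

# Run the same command (here a yum update) on a list of nodes over ssh.
# Node names are placeholders; substitute your own, or use a real cluster shell.
import subprocess

nodes = ["node01", "node02", "node03"]
command = "yum -y update"

failed = []
for node in nodes:
    print("=== %s ===" % node)
    rc = subprocess.call(["ssh", node, command])
    if rc != 0:
        failed.append(node)

if failed:
    print("nodes needing attention: %s" % ", ".join(failed))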
However, you can buy an awful lot of FTE minutes for hundreds of dollars a seat plus thousands for servers, and clusters typically DON'T change often, or much, once their prototypes are completed and debugged. > In the end, it's probably just an apples/oranges comparison.. in a science > lab/school cluster environment, it's probably more a more valuable place > to use a more manual process because grad students are cheap, and interns > are free.. :) In a corporate world, the $28k i'd spend for a 100 server > environment to save a sysadmin's worth of time, pays for itself 10 fold in > terms of environment consistency.. For servers, especially heterogenous servers, it might be worth it. If by "servers" you mean identical nodes (server or otherwise) I'd say this is a waste of money. In HPC it is the latter. In a lot of corporate environments, it is a mix of mostly the latter and some of the former. But I totally agree, the tool is designed for that kind of environment -- structurally complex and deep pocketed. > Either way, I'm not trying to evangelize, just relate my own experiences, > and try to find the best solution for a given problem. What tools out > there are good for this type of a situation, then? Thanks for the refs to > werewulf, I'm checking it out now. No problem. I just am a cost-benefit fanatic. You have to work to convince me that spending order of 20% of the nodes you might be able to buy in a compute cluster on RHN will get more work done per total dollar invested, in most HPC cluster environments, compared to any of a number of GPL free alternatives (many of which have further benefits to their use anyway). rgb > > > > > > > > > On Wed, 13 Oct 2004, Michael T. Halligan wrote: > > > >> Has anybody used (or tried to use) the RHN system as a HPC management > >> tool. I've implemented this > >> successfully in a 100 host environment for a customer of mine, and am in > >> the process of > >> re-architecting an infrastructure with about 150 nodes.. That's about as > >> far as I've gotten > >> with it. Once I get past the cost, the poor documentation, and "OK" > >> support, I'm finding > >> that it's actually a great (though slightly immature) piece of software > >> for the enterprise. The ease of keeping > >> an infrastructure in sync, and tthe lowered workload for sysadmins > > > > > > > > I can only say "why bother". Everything it does can be done easier, > > faster, and better with PXE/kickstart for the base install followed by > > yum for fine tuning the install, updates and maintenance (all totally > > automagical). Yum is in RHEL, is fully GPL, is well documented, has a > > mailing list providing the active support of LOTS of users as well as > > the developers/maintainers, and is free as in air. Oh, and it works > > EQUALLY well with Centos, SuSE, Fedora Core 2, and other RPM-based > > distros, and is in wide use in clusters (and LANs) across the country. > > > > With PXE/kickstart/yum, you just build and test a kickstart file for the > > basic node install (necessary in any event), bootstrap the install over > > the net via PXE, and then forget the node altogether. yum automagically > > handles updates, and can also manage things like distributed installs > > and locking a node to a common specified set of packages. It manages > > all dependencies for you so that things work properly. 
> > > > It takes me ten minutes to install ten nodes, mostly because I like to > > watch the install start before moving on to handle the rare install that > > is interrupted for some reason (e.g. a faulty network connection). One > > can do a lot more than this much faster if you control the boot strictly > > from PXE so you don't even need to interact with the node on the console > > at all. How much better than that can you do? > > > > Alternatively, there are things like warewulf and scyld where even > > commercial solutions probably won't work out to be much more (if any > > more) expensive. Especially when you add in the cost of those two > > "beefy boxes acting as RHN servers". What a waste! We use a single > > repository to manage installs and updates for our entire campus (close > > to 1000 systems just in clusters, plus that many more in LANs and on > > personal desktops). And the server isn't terribly beefy -- it is > > actually a castoff desktop being pressed into extended service, although > > we finally have plans to put a REAL server in pretty soon. > > > > I mean, what kind of load does a cluster node generally PLACE on a > > repository server after the original install? Try "none" and you'd be > > really close to the truth -- an average of a single package a week > > updated is probably too high an estimate, and that consumes (let's see) > > something like 1 network-second of capacity between server and node a > > week with plain old 100BT. > > > > There are solutions that are designed to be scalable and easy to > > understand and maintain, and then there are solutions designed to be > > topdown manageable with a nifty GUI (and sell a lot of totally unneeded > > resources at the same time). Guess which one RHN falls under. > > > > > > Flamingly yours (not at you, but at RHN) > > > > rgb > > > >> > >> At 100 nodes, the pricing seems to be about $274/year per node including > >> licensing, entitlements, and the > >> software cost of a RHN server (add another $5k-$7k for a pair of beefy > >> boxes to act as the > >> RHN server.. though as far as I can tell, redhat's specs on the RHN > >> server are far exagerrated.. I > >> could get by with $2500 worth of servers on that end for the > >> environments I've deployed on). So, in the > >> end, $28k/year for an enterprise of 100 servers, in one environment has > >> meant being able to shrink the > >> next year staffing needs by 2 people, and in one by one person, it pays > >> for itself.. > >> > >> We have a 512 node render farm project we're bidding on for a new > >> customer, and I'm wondering how those in the > >> beowulf community who have used RHN satellite server perceive it. So far > >> we're considering LFS and Enfusion, > >> which are both more HPC oriented, but I'm really enjoying RHN as a > >> management system. > >> > >> ---------------- > >> BitPusher, LLC > >> http://www.bitpusher.com/ > >> 1.888.9PUSHER > >> (415) 724.7998 - Mobile > >> > >> > >> _______________________________________________ > >> Beowulf mailing list, Beowulf@beowulf.org > >> To change your subscription (digest mode or unsubscribe) visit > >> http://www.beowulf.org/mailman/listinfo/beowulf > >> > > > > -- > > Robert G. Brown http://www.phy.duke.edu/~rgb/ > > Duke University Dept. of Physics, Box 90305 > > Durham, N.C. 27708-0305 > > Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu > > > > > > > > > > > ------------------- > BitPusher, LLC > http://www.bitpusher.com/ > 1.888.9PUSHER > (415) 724.7998 - Mobile > -- Robert G. 
Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From agrajag at dragaera.net Thu Oct 14 11:30:41 2004 From: agrajag at dragaera.net (Sean Dilda) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] RedHat Satellite Server as a cluster management tool. In-Reply-To: <416E10D9.70006@halligan.org> References: <416E10D9.70006@halligan.org> Message-ID: <1097778641.22262.27.camel@pel> On Thu, 2004-10-14 at 01:38, Michael T. Halligan wrote: > So, in the > end, $28k/year for an enterprise of 100 servers, in one environment has > meant being able to shrink the > next year staffing needs by 2 people, and in one by one person, it pays > for itself.. As the only sysadmin for a 260-node cluster, I'm extremely curious what jobs those 2 people were supposed to be doing. I have an operations staff to rely on for some environmental stuff and for handling service calls with vendors (I report the problem to them and do the hw replacement, they just take care of the phone call). However, even with 260 nodes I still find a lot of my time spent in trying to improve the cluster as opposed to just keeping it running. From michael at halligan.org Thu Oct 14 11:54:15 2004 From: michael at halligan.org (Michael T. Halligan) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] RedHat Satellite Server as a cluster management tool. In-Reply-To: References: <416E10D9.70006@halligan.org> Message-ID: <53761.66.150.251.142.1097780055.squirrel@mail3.bitpusher.com> Robert, So have you actually used the satellite server? My biggest problem with using RHN has been the strong lack of deployments it's had.. A lot of people just naturally assume redhat is bad (hell, I even do. I use debian for all of my personal and corporate servers).. But very few who automatically take that stance have actually worked with the products enough to give emperical evidence as to why. It took a while to gather enthusiasm enough to evaluate it, and a couple of months of solid testing before I could recommend it. I've built about 1/2 dozen similar deployment/management tools at this point, each one built for a customer (hence the reason building 6 instead of just improving upon the same one). Imaging is one thing, and yeah kickstart is easy, no objections to that.. RHN just makes it a lot easier to deal with kickstart. It also gives a rather useful, but more enterprise focused management system to allow you to manage (software|config) channels, server groups, and a good method to deal with groups with unions & intersections. I'm finding it especially nice at one site at which 1/2 of their servers are used for testing and 1/2 for their production environment. Pushing new patches, scripts, commands, files to select sets of systems requires very little effort. RedHat's configuration management system is actually really nice. They've put a simple (but extensible) macro system into it, which allows you to keep one configuration file for all of the servers in a given class, when only a few things change, and having system-specific variables be parsed out when servers pull configs from the gold server.. Sure, you can do this with cfengine or pikt, but uploading a config file to a webform is a lot simpler than setting up cfengine/pikt and implementing it (I know this from a lot of experience. 
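The one-template-per-class idea, with host-specific variables filled in when the config is delivered, can be illustrated generically in a few lines of Python; this shows only the concept, not RHN's actual macro syntax, and the hostnames and fields are made up:

# Generate per-host config files from one shared template by substituting
# host-specific values. Purely illustrative; hostnames and fields are made up.
from string import Template

template = Template(
    "hostname = $hostname\n"
    "role     = $role\n"
    "dbserver = $dbserver\n"
)

hosts = {
    "app01": {"role": "app", "dbserver": "db01"},
    "app02": {"role": "app", "dbserver": "db01"},
    "db01":  {"role": "db",  "dbserver": "localhost"},
}

for hostname, values in hosts.items():
    config = template.substitute(hostname=hostname, **values)
    with open("%s.conf" % hostname, "w") as f:
        f.write(config)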
One of the lackings of using a yum/pxe/kickstart environment (of which I'm rather familiar with, currently managing 6 customers with a similar environment) is that there's no "already there" configuration/versioning management system. That was one of the key points of redhat, the fact that I can do at-will repurposing/reprovisioning (like turning a 100 server 30/70 app db server/app server environment into a 70/30 app/db server environment in 5 minutes without kickstarting and zero manual interaction).. In the end, it's probably just an apples/oranges comparison.. in a science lab/school cluster environment, it's probably more a more valuable place to use a more manual process because grad students are cheap, and interns are free.. :) In a corporate world, the $28k i'd spend for a 100 server environment to save a sysadmin's worth of time, pays for itself 10 fold in terms of environment consistency.. Either way, I'm not trying to evangelize, just relate my own experiences, and try to find the best solution for a given problem. What tools out there are good for this type of a situation, then? Thanks for the refs to werewulf, I'm checking it out now. > > On Wed, 13 Oct 2004, Michael T. Halligan wrote: > >> Has anybody used (or tried to use) the RHN system as a HPC management >> tool. I've implemented this >> successfully in a 100 host environment for a customer of mine, and am in >> the process of >> re-architecting an infrastructure with about 150 nodes.. That's about as >> far as I've gotten >> with it. Once I get past the cost, the poor documentation, and "OK" >> support, I'm finding >> that it's actually a great (though slightly immature) piece of software >> for the enterprise. The ease of keeping >> an infrastructure in sync, and tthe lowered workload for sysadmins > > > > I can only say "why bother". Everything it does can be done easier, > faster, and better with PXE/kickstart for the base install followed by > yum for fine tuning the install, updates and maintenance (all totally > automagical). Yum is in RHEL, is fully GPL, is well documented, has a > mailing list providing the active support of LOTS of users as well as > the developers/maintainers, and is free as in air. Oh, and it works > EQUALLY well with Centos, SuSE, Fedora Core 2, and other RPM-based > distros, and is in wide use in clusters (and LANs) across the country. > > With PXE/kickstart/yum, you just build and test a kickstart file for the > basic node install (necessary in any event), bootstrap the install over > the net via PXE, and then forget the node altogether. yum automagically > handles updates, and can also manage things like distributed installs > and locking a node to a common specified set of packages. It manages > all dependencies for you so that things work properly. > > It takes me ten minutes to install ten nodes, mostly because I like to > watch the install start before moving on to handle the rare install that > is interrupted for some reason (e.g. a faulty network connection). One > can do a lot more than this much faster if you control the boot strictly > from PXE so you don't even need to interact with the node on the console > at all. How much better than that can you do? > > Alternatively, there are things like warewulf and scyld where even > commercial solutions probably won't work out to be much more (if any > more) expensive. Especially when you add in the cost of those two > "beefy boxes acting as RHN servers". What a waste! 
We use a single > repository to manage installs and updates for our entire campus (close > to 1000 systems just in clusters, plus that many more in LANs and on > personal desktops). And the server isn't terribly beefy -- it is > actually a castoff desktop being pressed into extended service, although > we finally have plans to put a REAL server in pretty soon. > > I mean, what kind of load does a cluster node generally PLACE on a > repository server after the original install? Try "none" and you'd be > really close to the truth -- an average of a single package a week > updated is probably too high an estimate, and that consumes (let's see) > something like 1 network-second of capacity between server and node a > week with plain old 100BT. > > There are solutions that are designed to be scalable and easy to > understand and maintain, and then there are solutions designed to be > topdown manageable with a nifty GUI (and sell a lot of totally unneeded > resources at the same time). Guess which one RHN falls under. > > > Flamingly yours (not at you, but at RHN) > > rgb > >> >> At 100 nodes, the pricing seems to be about $274/year per node including >> licensing, entitlements, and the >> software cost of a RHN server (add another $5k-$7k for a pair of beefy >> boxes to act as the >> RHN server.. though as far as I can tell, redhat's specs on the RHN >> server are far exagerrated.. I >> could get by with $2500 worth of servers on that end for the >> environments I've deployed on). So, in the >> end, $28k/year for an enterprise of 100 servers, in one environment has >> meant being able to shrink the >> next year staffing needs by 2 people, and in one by one person, it pays >> for itself.. >> >> We have a 512 node render farm project we're bidding on for a new >> customer, and I'm wondering how those in the >> beowulf community who have used RHN satellite server perceive it. So far >> we're considering LFS and Enfusion, >> which are both more HPC oriented, but I'm really enjoying RHN as a >> management system. >> >> ---------------- >> BitPusher, LLC >> http://www.bitpusher.com/ >> 1.888.9PUSHER >> (415) 724.7998 - Mobile >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > -- > Robert G. Brown http://www.phy.duke.edu/~rgb/ > Duke University Dept. of Physics, Box 90305 > Durham, N.C. 27708-0305 > Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu > > > > ------------------- BitPusher, LLC http://www.bitpusher.com/ 1.888.9PUSHER (415) 724.7998 - Mobile From michael at halligan.org Thu Oct 14 11:58:46 2004 From: michael at halligan.org (Michael T. Halligan) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] RedHat Satellite Server as a cluster management tool. In-Reply-To: <1097778641.22262.27.camel@pel> References: <416E10D9.70006@halligan.org> <1097778641.22262.27.camel@pel> Message-ID: <53784.66.150.251.142.1097780326.squirrel@mail3.bitpusher.com> > As the only sysadmin for a 260-node cluster, I'm extremely curious what > jobs those 2 people were supposed to be doing. I have an operations > staff to rely on for some environmental stuff and for handling service > calls with vendors (I report the problem to them and do the hw > replacement, they just take care of the phone call). 
However, even with > 260 nodes I still find a lot of my time spent in trying to improve the > cluster as opposed to just keeping it running. Well, this is probably an apples to oranges comparison.. I've worked in environments where I was the only systems administrator, and ran 500 servers on my own.. It's rather trivial to administer a real cluster, where there's only one or two functions for the entire thing.. It's exponentially more work to keep good process in terms of consistency, configuration management, version control, patch management, and the general overall health in a non-cluster environment where you might have 100 servers, in groups of 2 or 4 servers per function, and maybe even several one-off servers. This is my first forray into building a single-function cluster in several years, and I'm trying to determine if tried & true enterprise management techniques can be a value or a detriment in a beowulf environment, or at least figure out which concepts carry over, which are superfluous, and which just aren't applicable. ------------------- BitPusher, LLC http://www.bitpusher.com/ 1.888.9PUSHER (415) 724.7998 - Mobile From tmattox at gmail.com Thu Oct 14 19:56:44 2004 From: tmattox at gmail.com (Tim Mattox) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] RedHat Satellite Server as a cluster management tool. In-Reply-To: <53761.66.150.251.142.1097780055.squirrel@mail3.bitpusher.com> References: <416E10D9.70006@halligan.org> <53761.66.150.251.142.1097780055.squirrel@mail3.bitpusher.com> Message-ID: Hello Michael, I'm one of the co-developers of Warewulf. (http://warewulf-cluster.org/) We try to make it as sysadmin friendly as we can. If you haven't seen it yet, check out the README file inside the RPM (it gets put in /usr/share/doc/warewulf...) It explains the philosophy behind Warewulf's development. If you have any questions about Warewulf, feel free to post to it's mailing list. I also follow the Beowulf mailing list, but not daily. I admit Warewulf's documentation can be lacking (or there, but talking about a previous version), but once you get into it a bit, the system makes quite a lot of sense... mostly ;-) For your specific task you describe, I would think Warewulf would work well for you. It's not perfect, but we eat our own dogfood, and this is the best tasting "dogfood" I've used for cluster management. ;-) Managing a cluster with Warewulf is kind of like sysadmining less than two machines... the boot server, and then a "virtual node"... which is just a chroot on the boot server. If your cluster is heterogeneous, you can set up more than one VNFS (virtual node file system). And I can't pass up commenting about the costs for "per node" software... I grimace at anything where the cost of the software has ANY non-zero multiple related to the number of nodes. Why? The hardware costs in the cluster's I've helped build tend to be far under $1k per node, and usually under $500 per node. RHN is just not an option for that kind of cluster. Anyway, good luck choosing a cluster management tool for your setup. The ones rgb mentioned are all worth considering. -- Tim Mattox - tmattox@gmail.com - http://homepage.mac.com/tmattox/ From llwaeva at 21cn.com Fri Oct 15 11:38:45 2004 From: llwaeva at 21cn.com (llwaeva@21cn.com) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] about managment Message-ID: <20041016022741.B9B6.LLWAEVA@21cn.com> Hi all, I am running 8-node LAM/MPI parallel computing system. 
I have found it troublesome to maintain the user accounts and software distribution on all the nodes. For example, whenever I install new software, I have to repeat the job 8 times! The most annoying thing is that the configuration or management of the user accounts over the network is a heavy job. Someone suggested that I should utilize NFS and NIS. However, in my case, it's difficult to have an additional computer as a server. Would anyone please share your experience in maintaining the beowulf cluster? Thanks in advance. From rgb at phy.duke.edu Fri Oct 15 15:55:25 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] about managment In-Reply-To: <20041016022741.B9B6.LLWAEVA@21cn.com> References: <20041016022741.B9B6.LLWAEVA@21cn.com> Message-ID: On Sat, 16 Oct 2004 llwaeva@21cn.com wrote: > Hi all, > I am running 8-node LAM/MPI parallel computing system. I found it's > trouble to maintain the user accounts and software distribution on all > the nodes. For example, whenever I install a new software , I have to > repeat the job 8 times! The most annoying thing is that the > configuration or managment of the user accounts over the network is a > heavy job. Someone suggests that I should utilize NFS and NIS. However, > in my case, it's difficult to have an additional computer as a server. > Would anyone please share your experience in maintaining the beowulf > cluster? You don't need an additional computer as a server to run NIS -- just use your head node as an NIS server. Ditto crossmounting disk space with NFS -- just export a partition (it doesn't have to be huge, just big enough to hold your user home directories and workspace) to all the nodes. Remember, NIS and NFS were in use some twenty years ago -- when I first started managing Unix systems a "server" supporting dozens of user accounts might have one or two hundred MEGABYTES of disk in exported directories (an amount of disk that cost many thousands of dollars) and might deliver a whopping 4 MIPS of performance, and yet the server would still be useable for NIS, NFS and even some modest amounts of computation in its relatively miniscule memory. For a mini-cluster with only eight nodes, serving NIS and NFS won't even warm it up, and the smallest disks being sold today are some 60 GB in size -- an amount that even five or six years ago would have constituted a departmental server's collective store (and that server might have served 50 or 100 workstations, 100 or so users, and managed it with a processor ten to twenty times slower). There are also numerous alternative solutions, if setting up servers is something you don't know how to do or don't want to do for other reasons. You could use rsync to synchronize user accounts across nodes. This works well (and even yields a performance advantage) if they change slowly, but will be a pain if they change a lot (the advantage of NIS is that it pretty much completely automates this after a bit of work setting it up originally). You can and should use tools like the ones that were just discussed, e.g. kickstart and yum, to automate installation and maintenance. Finally, look into the various "cluster distributions" that do it all for you, notably ROCKS and warewulf. Be aware, though, that they are pretty likely to use things like NFS and NIS as (possibly optional) components of their solutions. [BTW, y'all are just spiled, spiled rotten.
Why, back in the OLD days geeks were geeks and had to slam massive amounts of cola to get real work done on networks, CPUs, memory that my PDA has beat hands down today. Here you put together a networked cluster the least component of which would have been "inconceivably" powerful (in every dimension) three decades ago, which is more powerful all by itself than the first twenty or so beowulfs ever built, and you can't find a "server"...;-)] rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From tmattox at gmail.com Fri Oct 15 16:00:29 2004 From: tmattox at gmail.com (Tim Mattox) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] about managment In-Reply-To: <20041016022741.B9B6.LLWAEVA@21cn.com> References: <20041016022741.B9B6.LLWAEVA@21cn.com> Message-ID: You should look at any of a variety of cluster management software packages. Some are free, some are commercial. Here is a short list that I can name off the top of my head: Rocks, Clustermatic, Oscar, Warewulf, Scyld and others. I'm a co-developer of Warewulf, a free one that has a fairly unique approach to the problem. You can find out more on it's website: http://warewulf-cluster.org/ The short version is that Warewulf builds a ramdisk image that it uses to network boot the nodes. The ramdisk is built from a Virtual Node File System (VNFS) that is maintained on the boot server. You can use pretty much any RPM based Linux distribution for the boot server and the VNFS. With this approach, the nodes get a fresh filesystem at boot time, without any worries about version or package creep. Adding or upgrading programs and changing the list of users is very easy. Good luck. On Sat, 16 Oct 2004 02:38:45 +0800, llwaeva@21cn.com wrote: > Hi all, > I am running 8-node LAM/MPI parallel computing system. I found it's > trouble to maintain the user accounts and software distribution on all > the nodes. For example, whenever I install a new software , I have to > repeat the job 8 times! The most annoying thing is that the > configuration or managment of the user accounts over the network is a > heavy job. Someone suggests that I should utilize NFS and NIS. However, > in my case, it's difficult to have an additional computer as a server. > Would anyone please share your experience in maintaining the beowulf > cluster? > > Thanks in advance. -- Tim Mattox - tmattox@gmail.com - http://homepage.mac.com/tmattox/ From andrewxwang at yahoo.com.tw Fri Oct 15 19:59:46 2004 From: andrewxwang at yahoo.com.tw (Andrew Wang) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] Grid Engine question In-Reply-To: <1097612732.18951.14.camel@pel> Message-ID: <20041016025946.91468.qmail@web18010.mail.tpe.yahoo.com> They call it "Resource Reservation". For a list of new features, see this paper: http://www.sun.com/products-n-solutions/edu/whitepapers/pdf/N1GridEngine6.pdf Andrew. --- Sean Dilda ªº°T®§¡G > In SGE 6.0 they added a feature they call 'advanced > reservations'. Its > not really advanced, and its not what I consider > 'reservations' to be, > but it is exactly what you want. When reservations > are enabled on the > cluster, and the job is submitted with '-R y', the > mutli-processor job > will be able to 'hold' available resources until it > has enough to run, > and thus keep lower priority jobs from using them. > > However, to do this you need to upgrade to at least > version 6.0. 
> However, 6.0 also has cluster queues which I find > makes administration > much easier (it allows you to create one queue setup > and assign it to > multiple hosts instead of doing a separate setup for > each compute host). > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or > unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > ----------------------------------------------------------------- Yahoo!©_¼¯¹q¤l«H½c 100MB §K¶O«H½c¡A¹q¤l«H½c·s¬ö¤¸±q³o¶}©l¡I http://mail.yahoo.com.tw/ From andrewxwang at yahoo.com.tw Fri Oct 15 20:17:41 2004 From: andrewxwang at yahoo.com.tw (Andrew Wang) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] 64bit comparisons In-Reply-To: <436A69503F99214D9EBC27415F66C40C3288E1@blums0008.sd.gd.com> Message-ID: <20041016031741.31345.qmail@web18007.mail.tpe.yahoo.com> I believe you can get more info from the following mailing lists: "hpc", "scitech", "xgrid-users" see: http://lists.apple.com/mailman/listinfo Also, people on those lists (if I remember correctly) use LAM-MPI, GridEngine, and also the IBM xlc/xlf compilers (XL compilers generate faster code for the G5). Andrew. --- "Hujsak, Jonathan T (US SSA)" ªº°T®§¡G > Have you gained any new 'lessons learned' since the > communication > below? Can you recommend a good version of MPI to > use for these? > > We've been looking at MPICH, MPIPro and also the > Apple xgrid... > > > > Thanks! > > > > Jonathan Hujsak > > BAE Systems > > San Diego > > > > Bill Broadley bill at cse.ucdavis.edu > s&In-Reply-To=200405141644.i4EGi1Aq023213%40marvin.ibest.uidaho.edu> > > Fri May 14 11:48:21 PDT 2004 > > * Previous message: [Beowulf] 64bit comparisons > > > * Next message: [Beowulf] 64bit comparisons > > > * Messages sorted by: [ date ] > > [ > thread ] > > [ > subject ] > > [ author ] > > > > _____ > > On Fri, May 14, 2004 at 09:44:01AM -0700, Robert B > Heckendorn wrote: > > One of the options we are strongly considering for > our next cluster is > > going with Apple X-servers. There performance is > purported to be good > > Careful to benchmark both processors at the same > time if that is your > intended usage pattern. Are the dual-g5's shipping > yet? Last I heard > yield problems were resulting in only uniprocessor > shipments. My main > concern that despite the marketing blurb of 2 > 10GB/sec CPU interfaces > or similar that there is a shared 6.4 GB/sec memory > bus. > > > and their power consumption is small. > > Has anyone measured a dual g5 xserv with a > kill-a-watt or similar? > > > Can people comment on any comparisons betwee Apple > and (Athlon64 > > or Opteron)? > > Personally I've had problems, I need to spend more > time resolving them, > things like: > * Need to tweak /etc/rc to allow Mpich to use > shared memory > * Latency between two mpich processes on the > same node is 10-20 > times the > linux latency. I've yet to try LAM. > * Differences in semaphores requires a rewrite for > some linux code I > had > * Difference in the IBM fortran compiler required > a rewrite compared > to code > that ran on Intel's, portland group's, and GNU's > fortran compiler. > > > Given all that I'm still interested to see what the > G5 is good at and > under > what workloads the G5 wins perf/price or perf/watt. 
> > -- > Bill Broadley > Computational Science and Engineering > UC Davis > > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or > unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > ----------------------------------------------------------------- Yahoo!©_¼¯¹q¤l«H½c 100MB §K¶O«H½c¡A¹q¤l«H½c·s¬ö¤¸±q³o¶}©l¡I http://mail.yahoo.com.tw/ From john.hearns at clustervision.com Fri Oct 15 21:52:10 2004 From: john.hearns at clustervision.com (John Hearns) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] about managment In-Reply-To: <20041016022741.B9B6.LLWAEVA@21cn.com> References: <20041016022741.B9B6.LLWAEVA@21cn.com> Message-ID: <1097902329.6223.945.camel@vigor12> On Fri, 2004-10-15 at 19:38, llwaeva@21cn.com wrote: > Hi all, > I am running 8-node LAM/MPI parallel computing system. I found it's > trouble to maintain the user accounts and software distribution on all > the nodes. For example, whenever I install a new software , I have to > repeat the job 8 times! There are several answers to this question, which you can learn about by staying on this group, and consulting online resources. A quick answer is that you could construct the cluster using one of the toolkits, such as Rocks or Warewulf - many others. And a very quick answer to your current dilemma. There are utilities which allow parallel execution of commands on a set of machines, or even to have a terminal session in parallel across a set of machines. Once you have a server (below) you can rsync each node to that. > The most annoying thing is that the > configuration or managment of the user accounts over the network is a > heavy job. Someone suggests that I should utilize NFS and NIS. However, > in my case, it's difficult to have an additional computer as a server. Not meaning to be rude, but you are wrong there. Just use one of your compute nodes as the server. The additional CPU load will not be great. You should use some sort of centralised account management NIS or LDAP. Even if you point blank refuse to do that, a cron job to rsync the relevant files will help cut down your admin load. And remember - eight machines may not seem a lot. But what happens if you make a mistake on one machine, or one machine is down when you are adding an account or software. Are you sure to run identical commands by hand the next time it is up? From eugen at leitl.org Sat Oct 16 11:45:31 2004 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] InfiniBand Drivers Released for Xserve G5 Clusters (fwd from brian-slashdotnews@hyperreal.org) Message-ID: <20041016184531.GF1457@leitl.org> ----- Forwarded message from brian-slashdotnews@hyperreal.org ----- From: brian-slashdotnews@hyperreal.org Date: 16 Oct 2004 01:26:01 -0000 To: slashdotnews@hyperreal.org Subject: InfiniBand Drivers Released for Xserve G5 Clusters User-Agent: SlashdotNewsScooper/0.0.3 Link: http://slashdot.org/article.pl?sid=04/10/15/2135211 Posted by: pudge, on 2004-10-15 23:30:00 from the insert-grunting-noise-here dept. A user writes, "A company called [1]Small Tree just [2]announced the release of InfiniBand drivers for the Mac, for more supercomputing speed. People have already been making supercomputer clusters for the Mac, including Virginia Tech's [3]third-fastest supercomputer in the world, but InfiniBand is supposed to make the latency drop. A lot. 
[4]Voltaire also makes some sort of Apple InfiniBand products, though it's not clear whether they make the drivers or hardware." IFRAME: [5]pos6 References 1. http://www.small-tree.com/ 2. http://www.wistechnology.com/article.php?id=1255 3. http://www.macobserver.com/article/2003/11/17.1.shtml 4. http://www.voltaire.com/apple.html 5. http://ads.osdn.com/?ad_id=2936&alloc_id=10685&site_id=1&request_id=2846371 ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041016/ff36ac07/attachment.bin From atp at piskorski.com Sat Oct 16 12:01:34 2004 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] s_update() missing from AFAPI ? Message-ID: <20041016190133.GA42332@piskorski.com> The old 1997 paper by Dietz, Mattox, and Krishnamurthy, "The Aggregate Function API: It's Not Just For PAPERS Anymore", briefly mentions that their AFAPI library also supports, "fully coherent, polyatomic, replicated shared memory". It even gives a little chart showing how many microseconds their s_update() function takes to update that shared memory. That sounds interesting (even given the extremely low bandwith of the PAPERS hardware, etc.), but, no such function exists in the last 1999-12-22 AFAPI release! s_update() just isn't in there at all. Why? Tim M., I know you follow the Beowulf list, so could you fill us in a bit on what what happened there? http://aggregate.org/TechPub/lcpc97.html http://aggregate.org/AFAPI/AFAPI_19991222.tgz While I'm at it I might as well ask this too: That same old PAPERS papers says "UDPAPERS", using Ethernet and UDP, was implemented, but it doesn't seem to be in the AFAPI release either. What happened with that? Did it work? As well as the custom PAPERS hardware? If so, how? Dirt cheap 10/100 cards and UTP cable would certainly be a lot more convenient than custom PAPERS hardware for anyone wanting to experiment with the AFAPI stuff, but I'm confused about what part of the ethernet network could be magically made to act as the NAND gate for the aggregate operations. Did it need to use some particular programmable ethernet switch? Or the aggregate operations were actually done on each of the nodes? -- Andrew Piskorski http://www.piskorski.com/ From hahn at physics.mcmaster.ca Sat Oct 16 14:24:01 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] InfiniBand Drivers Released for Xserve G5 Clusters (fwd from brian-slashdotnews@hyperreal.org) In-Reply-To: <20041016184531.GF1457@leitl.org> Message-ID: > world, but InfiniBand is supposed to make the latency drop. A lot. sigh. small-tree claims 6.13 us, which is certainly not exceptional latency these days. for instance, there are three vendors who are shipping <2 us MPI today. maybe I'm just being extra-surly, but if you crow too much about a non-novel accomplishment, you look pretty silly to anyone in the field... From hahn at physics.mcmaster.ca Sat Oct 16 14:36:01 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] bandwidth: who needs it? 
Message-ID: do you have applications that are pushing the limits of MPI bandwidth? for instance, code that actually comes close to using the 8-900 MB/s that current high-end interconnect provides? we have a fairly wide variety of codes inside SHARCnet, but I haven't found anyone who is even complaining about our last-generation fabric (quadrics elan3, around 250 MB/s). is it just that we don't have the right researchers? I've heard people mutter about earthquake researchers being able to pin a 800 MB/s network, and claims that big FFT folk can do so as well. by contrast, many people claim to notice improvements in latency from old/mundane (6-7 us) to new/good (<2 us). I'd be interested in hearing about applications you know of which are very sensitive to having large bandwidth (say, .8 GB/s today). thanks, mark hahn. From tmattox at gmail.com Sat Oct 16 15:15:14 2004 From: tmattox at gmail.com (Tim Mattox) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] s_update() missing from AFAPI ? In-Reply-To: <20041016190133.GA42332@piskorski.com> References: <20041016190133.GA42332@piskorski.com> Message-ID: Hello Andrew (and the Beowulf list as well), You ask some very good questions, and you asked the person who should know the answers. Hopefully my answers below make sense. On Sat, 16 Oct 2004 15:01:34 -0400, Andrew Piskorski wrote: > The old 1997 paper by Dietz, Mattox, and > Krishnamurthy, "The Aggregate Function API: It's > Not Just For PAPERS Anymore", briefly mentions > that their AFAPI library also supports, "fully > coherent, polyatomic, replicated shared memory". > It even gives a little chart showing how many > microseconds their s_update() function takes to > update that shared memory. > > That sounds interesting (even given the extremely > low bandwith of the PAPERS hardware, etc.), but, > no such function exists in the last 1999-12-22 > AFAPI release! s_update() just isn't in there at > all. Why? Tim M., I know you follow the Beowulf > list, so could you fill us in a bit on what what > happened there? The s_update() function went away because we changed the underlying implementation of the "asyncronous" s_ routines. The new approach had a hardware limit of 3 "fast" signals, and we deemed that it was best to not hard code any of those for this rarely used shared memory functionality. We had intended to supply a routine that replaced the functionality of s_update() that you could install as one of the 3 signal handlers if you chose to. I'm not sure why that code wasn't released. But, over time, it became a moot point, since the speed of the processors improved so much, that the busy-wait/polling scheme we were using for the s_ routines made it very difficult to get any speedup using the equivalent of the s_update() routine. With the parallel port not actively causing an interrupt, all the nodes had to poll for pending s_ operations. Going from the 486 to the Pentium was a dramatic change on the relative overheads for this polling operation and general computations. Basically, the Pentium and later processors were slowed down so dramatically whenever you would do a single IO space read (the polling function) to see if any pending shared memory operations needed to be dealt with, that it was difficult to get any speedup, even with only two processors. On the testing codes I wrote at the time, it was hard to find the right balance for how frequently to poll. If you polled too frequently, the Pentium was slowed down to a crawl on purely local operations. 
We speculated that the IO instruction caused a flush of the Pentium's pipeline, but we didn't explore it to great detail. Also, if you polled too infrequently, the shared memory operations were stalled for long periods of time, causing the other processor(s) to sit idle waiting to get their shared memory writes processed. Yes, the performance numbers in the LCPC 1997 paper are measured on a 4 node Pentium cluster, but I don't think we had time yet to play with "real" codes that used the s_update routine on a Pentium cluster. That was a long time ago, so I might not be remembering this part very well. But I do remember that once we had more time to play with it on Pentiums, it was clear that no performance critical codes would be using the s_update routine, much less any of the s_ routines as far as we could tell. So, that is why the s_update routine was pulled from the library, to free up the signaling slot for potentially more useful things. > http://aggregate.org/TechPub/lcpc97.html > http://aggregate.org/AFAPI/AFAPI_19991222.tgz > > While I'm at it I might as well ask this too: > That same old PAPERS papers says "UDPAPERS", using > Ethernet and UDP, was implemented, but it doesn't > seem to be in the AFAPI release either. What > happened with that? The UDPAPERS code was being worked on by a colleague of mine for his parallel file system work, and unfortunately for the rest of us, he only implemented the minimum amount of functionality that he needed for his project, not the full AFAPI. Back in 1999 I had hoped to have time to finish it off myself, but it wasn't my top priority, and if you have followed our work, the KLAT2 cluster in the spring of 2000 brought in some much more interesting new ideas with the FNN stuff. > Did it work? Yes, to some degree, but there were still some important corner cases (certain packet loss scenarios) that hadn't been dealt with, and as I said, the full AFAPI wasn't implemented, just a few basic routines. > As well as the custom PAPERS hardware? No, not as well as the custom hardware. Speaking of which: The custom PAPERS hardware has had some additional work since we last published on it. But due to changing priorities, it has been sitting waiting for the next bright student or two to revive it for more modern IO ports (USB, Firewire, ???). You can see the last parts list and board layouts here: http://aggregate.org/AFN/000601/ Unfortunately, the assembly documentation for that board was never written. It's a "small change" from the PAPERS 960801 board, but enough that if you don't know what each thing is intended for, you might not get it right. That's why we haven't posted a public link to the 000601 board design (until now). We almost made a 12 port version of the PCB, but again, the student involved on that finished their project, and the design hasn't been validated, so it's not been sent out to a PCB fab to be built. As a group we decided it would be better to find students interested in doing a new design that used more modern IO ports than the parallel printer port. Know anyone interested in a Masters project were they have to build hardware that actually works? ;-) Academically, it's hard to make such a thing be for a Ph.D. due to the fact that it's mostly just "implementation/development" at this point, with little "academic" research. > If so, how? 
Dirt cheap 10/100 cards and UTP cable > would certainly be a lot more convenient than > custom PAPERS hardware for anyone wanting to > experiment with the AFAPI stuff, but I'm confused > about what part of the ethernet network could be > magically made to act as the NAND gate for the > aggregate operations. Yep, no NAND gate in the ethernet... > Did it need to use some particular programmable > ethernet switch? Or the aggregate operations > were actually done on each of the nodes? Yeah, the aggregate operations were actually performed within each node on local copies of the data from all the nodes. The basic idea was to have each node send its new data along with all the known data from anyone else for the current (and previous) operation with a UDP broadcast/multicast. Just this semester we finally have a new student working on a UDP/Multicast implementation of AFAPI... or something like it. They are just now getting up to speed on things, so don't hold your breath. Also, it's unlikely we would actually target a new AFAPI release. With the dominance of MPI, it would only make sense to build such a thing for use as a module for LAM-MPI or the new OpenMPI. I hope this answers your questions, but if not, feel free to ask more. I am busy with my own FNN dissertation work now (plus Warewulf), so I won't be working on AFN/AFAPI/PAPERS stuff to any degree until my Ph.D. is finished. -- Tim Mattox - tmattox@gmail.com http://homepage.mac.com/tmattox/ From atp at piskorski.com Sat Oct 16 19:20:47 2004 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] Re: s_update() missing from AFAPI ? In-Reply-To: References: <20041016190133.GA42332@piskorski.com> Message-ID: <20041017022047.GA44676@piskorski.com> On Sat, Oct 16, 2004 at 06:15:14PM -0400, Tim Mattox wrote: > I hope this answers your questions, but if not, feel free to ask > more. I am busy with my own FNN dissertation work now (plus > Warewulf), so I won't be working on AFN/AFAPI/PAPERS stuff to any > degree until my Ph.D. is finished. Actually, that was excellent, seemed to fill in most of the important PAPERS-related holes in my basic background knowledge. Thanks! -- Andrew Piskorski http://www.piskorski.com/ From agrajag at dragaera.net Sun Oct 17 04:42:38 2004 From: agrajag at dragaera.net (Sean Dilda) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] about managment In-Reply-To: <20041016022741.B9B6.LLWAEVA@21cn.com> References: <20041016022741.B9B6.LLWAEVA@21cn.com> Message-ID: <1098013358.4390.44.camel@pel> On Fri, 2004-10-15 at 14:38, llwaeva@21cn.com wrote: > Hi all, > I am running 8-node LAM/MPI parallel computing system. I found it's > trouble to maintain the user accounts and software distribution on all > the nodes. For example, whenever I install a new software , I have to > repeat the job 8 times! The most annoying thing is that the > configuration or managment of the user accounts over the network is a > heavy job. Someone suggests that I should utilize NFS and NIS. However, > in my case, it's difficult to have an additional computer as a server. > Would anyone please share your experience in maintaining the beowulf > cluster? My cluster relies heavily on NIS and NFS. NIS is used to share user login information so that I only have to make an account once, then create the NIS maps. (I actually have pam configured to authenticate off of the campus kerberos server on the head nodes, and using ssh's host-based authentication across the cluster) We also use NFS for users home directories. 
However, I make a point of trying to package up third party software that's used and install the rpms on all of the machines in the cluster. As for an extra server, as RGB pointed out, NFS and NIS can easily run on the same box. Do you currently have a scheduler? If so, you can run NIS/NFS on that box. If nothing else, you can just run them on whatever box people use to login to the cluster. The de facto standard for small clusters is to have a single head node that users login to to launch jobs, servers out home directories and possibly NIS, as well as any scheduler you might have. From gary at sharcnet.ca Mon Oct 18 05:59:01 2004 From: gary at sharcnet.ca (Gary Molenkamp) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] about managment In-Reply-To: <1097902329.6223.945.camel@vigor12> Message-ID: On Sat, 16 Oct 2004, John Hearns wrote: > On Fri, 2004-10-15 at 19:38, llwaeva@21cn.com wrote: > > Hi all, > > I am running 8-node LAM/MPI parallel computing system. I found it's > > trouble to maintain the user accounts and software distribution on all > > the nodes. For example, whenever I install a new software , I have to > > repeat the job 8 times! > There are several answers to this question, > which you can learn about by staying on this group, and consulting > online resources. > > A quick answer is that you could construct the cluster using one of the > toolkits, such as Rocks or Warewulf - many others. > > And a very quick answer to your current dilemma. There are utilities > which allow parallel execution of commands on a set of machines, > or even to have a terminal session in parallel across a set of machines. > > Once you have a server (below) you can rsync each node to that. For software distribution, I use systemimager from the Systeminstaller Suite. It simplifies using rsync for managing images of nodes, and works well even across cluster (I have 4 clusters running off one server, that is also the master node of a cluster). > > The most annoying thing is that the > > configuration or managment of the user accounts over the network is a > > heavy job. Someone suggests that I should utilize NFS and NIS. However, > > in my case, it's difficult to have an additional computer as a server. > Not meaning to be rude, but you are wrong there. > Just use one of your compute nodes as the server. The additional CPU > load will not be great. > You should use some sort of centralised account management NIS or LDAP. I've recently deployed LDAP at SHARCNET and it really simplifies the account management process. I still nfs mount home accounts, but I used to rcp the passwd,shadow, and group files around. This made it difficult for users to maintain there account info, and had a long delay to propigate to 200+ busy machines. > Even if you point blank refuse to do that, a cron job to rsync the > relevant files will help cut down your admin load. > > And remember - eight machines may not seem a lot. But what happens if > you make a mistake on one machine, or one machine is down when you are > adding an account or software. Are you sure to run identical commands by > hand the next time it is up? 
> > > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Gary Molenkamp SHARCNET Systems Administrator University of Western Ontario gary@sharcnet.ca http://www.sharcnet.ca (519) 661-2111 x88429 (519) 661-4000 From kus at free.net Mon Oct 18 09:10:42 2004 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] Mellanox IB problem: xp0 module ? In-Reply-To: Message-ID: Dear colleagues ! I've some problem w/Mellanox IBHPC-0.5.0 software inst (in particular, the absence of xp0 kernel module) on "standalone" node which isn't connected currently w/IB switch or other node w/IB device. I've installed Infiniband HCA (PCI-X Infinihost MT23108 low profile) to upgarde my interconnect from GigEth to IB. Tyan S2880/Opteron 242 under SuSE Linux 9.0 (2.4.21-243) is used as the node for this installation. It's official software platform supported by whole Mellanox IBHPC-0.5.0 software collection (it includes , in aprticular, THCA-3.2 driver). Software environment is "fixed" because of a set of binary applications requiremnets, so last IBHPC-1.6.0 looks as inappropriate for as. 1)After minor source modification (in mosal.c) the the IBHPC installation (INSTALL script) was finished successfully. IPoIB parameters setting was also performed in the frames of INSTALL script dialog. 2)But after finish of INSTALL and reboot I see that a) mst tools started successfully b) and I see then following boot messages: Setting up network interfaces : eth0 eth1 - both done ib0: modprobe: modprobe: can't locate module xp0 and ib0 interface is down (I should note that IB cable isn't connected to HCA really). But I may do ib0 "up" manually; in particular, /etc/init.d/network start put ib0 in "up" state. I didn't find xb0.o in /lib/modules/..., and in any Mellanox software rpm's also ! I don't know what do xp0 module and where I may found it :-( Any reccomendations/ideas are welcome ! (FYI: some IB things like FLINT verification are OK, and opensm & mst started successfully). 2) I configured IPoIB at IBHPC installation. (To try IBsNice) I issued vapi start after boot, and then I see in particular the message Loading mod_ib_mgt FAILED "Manual" modprobe mod_ib_mgt leads to the message init_module: device or resource busy If I run IBsNice.sh, then I receive the same message about mod_ib_mgt but IBsNice creates virtual eth2 , and ping to the IP of eth2 works normally. I'll be very appreciate if somebody clarify me this situation w/mod_ib_mgt. May be it's simple because of some misconfiguration of some IB software component ? (I didn't configure anythings after running of INSTALL script). Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow From bropers at cct.lsu.edu Mon Oct 18 11:29:28 2004 From: bropers at cct.lsu.edu (Brian D. Ropers-Huilman) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] about managment In-Reply-To: References: Message-ID: <41740B88.10400@cct.lsu.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Gary Molenkamp said the following on 2004-10-18 07:59: | I've recently deployed LDAP at SHARCNET and it really simplifies the | account management process. How do you allow users to change their passwords, shells, or GECOS information? - -- Brian D. Ropers-Huilman .:. Asst. Director .:. HPC and Computation Center for Computation & Technology (CCT) bropers@cct.lsu.edu Johnston Hall, Rm. 
350 +1 225.578.3272 (V) Louisiana State University +1 225.578.5362 (F) Baton Rouge, LA 70803-1900 USA http://www.cct.lsu.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.6 (GNU/Linux) iD8DBQFBdAuIwRr6eFHB5lgRAiCSAKCa+75nNOWGFDxmZV/hGfb0wW85yQCguMo0 v8F2Mp3kzpCvK4dYBw1SUWk= =iAHb -----END PGP SIGNATURE----- From cjoung at tpg.com.au Mon Oct 18 19:14:53 2004 From: cjoung at tpg.com.au (cjoung@tpg.com.au) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] MPI & ScaLAPACK: error in MPI_Comm_size: Invalid communicator Message-ID: <1098152093.4174789d863ce@postoffice.tpg.com.au> Hi, I was hoping someone could help me with a F77,MPI & ScaLAPACK problem. Basically, I have a problem making the Scalapack libraries work in my program. Programs with MPI-only calls work fine, e.g. the "pi.f" MPI program that comes with the MPI installation works fine (the one that predicts pi), as do other examples I've gotten from books & simple ones I've written myself, but whenever I try an example with scalapack & blacs calls, it falls over with the same error message (which I can't decipher). If you can help, then I have a more detailed account of whats going on below, Any advice would be most gratefully appreciated, Clint Joung Postdoctoral Research Associate, Department of Chemical Engineering University of SYdney, NSW 2006 Australia ************************************************************** I'm just learning parallel programming. The netlib scalapack website has an example program called 'example1.f' It uses a scalapack subroutine PSGESV to solve the standard matrix equation [A]x=b, and return the answer, vector x. It seemed to compile ok, but on running, I got some error messages. So I systematically stripped down 'example1.f' in stages, recompling & running each time, trying to achieve a working program, eliminating potential bugs & rebuild it from there. Eventually I got down to the following emaciated F77 program (see below). All it does now is initialize a 2x3 process grid, then release it - thats all. 
****example2.f******************************************* program example2 integer ictxt,mycol,myrow,npcol,nprow nprow=2 nocol=3 call SL_INIT(ictxt,nprow,npcol) call BLACS_EXIT(0) STOP END ********************************************************* Yet, it still doesn't work!, the following is the output when I try to compile and run it, ********************************************************* [tony@carmine clint]$ mpif77 -o example2 example2.f -L/opt/intel/mkl70cluster/lib/32 -lmkl_scalapack -lmkl_blacsF77init -lmkl_blacs -lmkl_blacsF77init -lmkl_lapack -lmkl_ia32 -lguide -lpthread -static-libcxa [tony@carmine clint]$ mpirun -n 6 ./example2 aborting job: Fatal error in MPI_Comm_size: Invalid communicator, error stack: MPI_Comm_size(82): MPI_Comm_size(comm=0x5b, size=0x80d807c) failed MPI_Comm_size(66): Null Comm pointer aborting job: Fatal error in MPI_Comm_size: Invalid communicator, error stack: MPI_Comm_size(82): MPI_Comm_size(comm=0x5b, size=0x80d807c) failed MPI_Comm_size(66): Null Comm pointer rank 5 in job 17 carmine.soprano.org_32782 caused collective abort of all ranks exit status of rank 5: return code 13 aborting job: Fatal error in MPI_Comm_size: Invalid communicator, error stack: MPI_Comm_size(82): MPI_Comm_size(comm=0x5b, size=0x80d807c) failed MPI_Comm_size(66): Null Comm pointer rank 1 in job 17 carmine.soprano.org_32782 caused collective abort of all ranks exit status of rank 1: return code 13 rank 0 in job 17 carmine.soprano.org_32782 caused collective abort of all ranks exit status of rank 0: return code 13 [tony@carmine clint]$ ********************************************************* ..so apparently somethings wrong with MPI_Comm_size, but beyond that, I can't figure it out. My system details: * I am running this on a '1 node' cluster - i.e. my notebook. (just to prototype before I run on a proper cluster) * O/S: Redhat Fedora Core 1, Kernel 2.4.22 * Compiler: Intel Fortran Compiler for linux 8.0 * MPI: MPICH2 ver 0.971 (was compiled with the ifort compiler, so it should work ok with the ifort compiler) * The Scalapack, blacs, blas and lapack come from the Intel Cluster Maths Kernel Library for Linux 7.0 If you know how to fix this problem, I'd appreciate to hear from you. Please consider me a NOVICE with all three - linux, MPI and Scalapack. The simpler the explanation, the better! with thanks, clint joung From cjoung at tpg.com.au Mon Oct 18 21:59:55 2004 From: cjoung at tpg.com.au (cjoung@tpg.com.au) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] Re: MPI & ScaLAPACK: error in MPI_Comm_size: Invalid communicator Message-ID: <1098161995.41749f4b65333@postoffice.tpg.com.au> Hi again - I need to correct something in my earlier email - I left out the subroutine SL_INIT, Here is the entire code again, ****example2.f*************** program example2 integer ictxt,npcol,nprow nprow=2 nocol=3 call sl_init(ictxt,nprow,npcol) call BLACS_EXIT(0) stop end subroutine sl_init(ictxt,nprow,npcol) integer ictxt,nprow,npcol,iam,nprocs external BLACS_GET,BLACS_GRIDINIT,BLACS_PINFO,BLACS_SETUP call BLACS_PINFO(iam,nprocs) if (nprocs.lt.1) then if (iam.eq.0) nprocs=nprow*npcol call BLACS_SETUP(iam,nprocs) endif call BLACS_GET(-1,0,ictxt) call BLACS_GRIDINIT(ictxt,'Row-major',nprow,npcol) return end ***************************** The errors are still the same however - it doesn't like any of my BLACS calls. 
Any help would be greatly appreciated, thanks Clint Joung ----- Forwarded message from cjoung@tpg.com.au ----- Date: Tue, 19 Oct 2004 12:14:53 +1000 From: cjoung@tpg.com.au Subject: MPI & ScaLAPACK: error in MPI_Comm_size: Invalid communicator To: beowulf@beowulf.org Hi, I was hoping someone could help me with a F77,MPI & ScaLAPACK problem. Yet, it still doesn't work!, the following is the output when I try to compile and run it, ********************************************************* [tony@carmine clint]$ mpif77 -o example2 example2.f -L/opt/intel/mkl70cluster/lib/32 -lmkl_scalapack -lmkl_blacsF77init -lmkl_blacs -lmkl_blacsF77init -lmkl_lapack -lmkl_ia32 -lguide -lpthread -static-libcxa [tony@carmine clint]$ mpirun -n 6 ./example2 aborting job: Fatal error in MPI_Comm_size: Invalid communicator, error stack: MPI_Comm_size(82): MPI_Comm_size(comm=0x5b, size=0x80d807c) failed MPI_Comm_size(66): Null Comm pointer aborting job: Fatal error in MPI_Comm_size: Invalid communicator, error stack: MPI_Comm_size(82): MPI_Comm_size(comm=0x5b, size=0x80d807c) failed MPI_Comm_size(66): Null Comm pointer rank 5 in job 17 carmine.soprano.org_32782 caused collective abort of all ranks exit status of rank 5: return code 13 aborting job: Fatal error in MPI_Comm_size: Invalid communicator, error stack: MPI_Comm_size(82): MPI_Comm_size(comm=0x5b, size=0x80d807c) failed MPI_Comm_size(66): Null Comm pointer rank 1 in job 17 carmine.soprano.org_32782 caused collective abort of all ranks exit status of rank 1: return code 13 rank 0 in job 17 carmine.soprano.org_32782 caused collective abort of all ranks exit status of rank 0: return code 13 [tony@carmine clint]$ ********************************************************* .so apparently somethings wrong with MPI_Comm_size, but beyond that, I can't figure it out. From kus at free.net Mon Oct 18 11:34:13 2004 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Re: [suse-amd64] Mellanox Infiniband on SuSE 9.0 - xp0 module, etc In-Reply-To: <20041018184354.GA26195@Baruch.pantasys.com> Message-ID: In message from Bob Lee (Mon, 18 Oct 2004 11:43:54 -0700): >On Sun, Oct 17, 2004 at 11:00:13PM +0400, Mikhail Kuzminsky wrote: >> Dear colleagues ! > >> I've some problem w/Mellanox IB software installation (in >>particular, >> the absence of xp0 kernel module). > >> I've installed Infiniband HCA (PCI-X Infinihost MT23108 low profile) >> to upgarde my interconnect on Tyan S2880 under SuSE Linux 9.0 >> (2.4.21-243). > >> It's official software platform supported by whole Mellanox >> IBHPC-0.5.0 software collection. > > I had difficulty with the 0.5.0 release on SuSE (9.1), but > the latest release (1.6.0) which is available through the > their web site (with registration). This worked seamlessly > IPoIB came right up no problem. The resulting package is > a bit bloated, but you do get everything. I beleive (according 1.6.0 documentations) that it'll not work under SuSE 9.0 :-( Yours Mikhail > > ... > >> 3) TO BE MORE CORRECT: pls take into account, that my host >>w/installed >> software& Mellanox hardware *isn't connected* currently with IB >>switch >> (i.e. is "standalone" server !) > > Remember that you need some form of subnet management to > assign LIDs to the ports (minism or opensm on one node > after the port is in "INIT" state -- using vstat). > > ... remaining deleted to save to old growth electrons ... 
> >> Yours >> Mikhail Kuzminsky >> Zelinsky Institute of Organic Chemistry >> Moscow > >hope this helps >-bob From Nout.Gemmeke at nl.fujitsu.com Tue Oct 19 08:33:31 2004 From: Nout.Gemmeke at nl.fujitsu.com (Gemmeke Nout) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Bonding results in 1 HBA for Tx and 1 HBA for Rx. Message-ID: Hi there Beowulf, Just a quick question. When using bonding on RehHat AS V3.0 I see that eth0 is used for sending data (Tx) and eth1 for receive of data (Rx). This results in bandwidth of only 1Gbit/sec..... Fail over works fine however. Any idea if this can be configured ?? Thanks, Nout Gemmeke Consultant Enterprise Services FUJITSU SERVICES Fujitsu Services B.V., Het Kwadrant 1 P.O. Box 1067, 3600 BB Maarssen, The Netherlands Tel: +31 346 598451 Mob: +31 651 218661 Fax: +31 346 561909 Email: nout.gemmeke@nl.fujitsu.com Web: nl.fujitsu.com Fujitsu Services B.V., Registered in the Netherlands no 30078286 ___________________________________________________________________________ The information in this e-mail (and its attachments) is confidential and intended solely for the addressee(s). If this message is not addressed to you, please be aware that you have no authorisation to read this e-mail, to copy it or to forward it to any person other than the addressee(s). Should you have received this e-mail by mistake, please bring this to the attention of the sender, after which you are kindly requested to destroy the original message and delete any copies held in your system. Fujitsu Services and its affiliated companies cannot be held responsible or liable in any way whatsoever for and/or in connection with any consequences and/or damage resulting from the contents of this e-mail and its proper and complete dispatch and receipt. Fujitsu Services does not guarantee that this e-mail has not been intercepted and amended, nor that it is virus-free. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041019/6a64df37/attachment.html From hahn at physics.mcmaster.ca Tue Oct 19 12:33:22 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] MPI & ScaLAPACK: error in MPI_Comm_size: Invalid communicator In-Reply-To: <1098152093.4174789d863ce@postoffice.tpg.com.au> Message-ID: > MPI_Comm_size(82): MPI_Comm_size(comm=0x5b, size=0x80d807c) failed I have no insight into any of this, except that these parameters are obviously reversed. 0x5b is a sensible size, and 0x80d807c is not. but 0x80d807c is a sensible pointer... if this isn't a source-level transposition, perhaps it has to do with mixed calling conventions? From henry.gabb at intel.com Tue Oct 19 14:21:31 2004 From: henry.gabb at intel.com (Gabb, Henry) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] MPI & ScaLAPACK: error in MPI_Comm_size: Invalid communicator Message-ID: Hi Clint, The "Null Comm pointer" error that you're seeing is almost always due to a header mismatch. I don't think Intel Cluster MKL 7.0 supports MPICH2 yet. According to the Intel Cluster MKL system requirements (http://www.intel.com/software/products/clustermkl/sysreq.htm), MPICH-1.2.5 is supported. MPICH-1.2.6 should work too. You might post a question to the Intel MKL forum (http://softwareforums.intel.com/ids) about MPICH2 support. 
I did a quick test of your program on one of my clusters and it ran fine (after fixing the typo in the nocol=3 statement): [henry@castor1 cmkl-test]$ /opt/mpich-1.2.6-gcc/bin/mpif77 -o example example.f \ -L/opt/intel/mkl70cluster/lib/64 -lmkl_scalapack \ -lmkl_blacsF77init_gnu -lmkl_blacs -lmkl_blacsF77init_gnu \ -lmkl_lapack -lmkl -lguide -lpthread [henry@castor1 cmkl-test]$ /opt/mpich-1.2.6-gcc/bin/mpirun -n 6 ./example [henry@castor1 cmkl-test]$ I used a GNU-built MPICH-1.2.6 because I didn't have an Intel-ready MPICH installation handy. Best regards, Henry Gabb Intel Parallel and Distributed Solutions Division From ashley at quadrics.com Wed Oct 20 06:44:49 2004 From: ashley at quadrics.com (Ashley Pittman) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] bandwidth: who needs it? In-Reply-To: References: Message-ID: <1098279888.854.220.camel@ashley> On Sat, 2004-10-16 at 22:36, Mark Hahn wrote: > do you have applications that are pushing the limits of MPI bandwidth? > for instance, code that actually comes close to using the 8-900 MB/s > that current high-end interconnect provides? > > we have a fairly wide variety of codes inside SHARCnet, but I haven't > found anyone who is even complaining about our last-generation fabric > (quadrics elan3, around 250 MB/s). is it just that we don't have the > right researchers? I've heard people mutter about earthquake researchers > being able to pin a 800 MB/s network, and claims that big FFT folk can > do so as well. by contrast, many people claim to notice improvements > in latency from old/mundane (6-7 us) to new/good (<2 us). > > I'd be interested in hearing about applications you know of which are > very sensitive to having large bandwidth (say, .8 GB/s today). It's not so much that you don't have the right researchers, it's the type of projects they are researching or at least the way they are attacking the problem. Latency is every bit as critical as bandwidth and in many cases more so. Latency at scale is also critical, multi-hop networks dictate the need to use nearest-neighbour algorithms and therefore have trouble scaling to large CPU counts. It's also harder for newcomers and non technical people to conceptualise latency and especially scalable latency. >From code optimisation that I've done in the past I've also found that bandwidth is easier to hide via pipelining than latency and therefore is less critical to wall clock time. Also don't forget that SMP boxes are getting wider, think in terms of Mb/s/CPU and todays 900Mb/s network bandwidth suddenly doesn't sound that much. The good news here however is that the large SMPs tend to have multiple PCI-X busses so can use multiple networks effectively. Ashley, From tony at mpi-softtech.com Wed Oct 20 06:39:19 2004 From: tony at mpi-softtech.com (Anthony Skjellum) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] MPI & ScaLAPACK: error in MPI_Comm_size: Invalid communicator In-Reply-To: References: Message-ID: <41766A87.5000003@mpi-softtech.com> Just my two cents... Size is an out parameter, so it should be an address. Depending on the MPI, MPI_Comm is also secretly mapped to a pointer, but it could also be an index into an array structure (or hash) inside the MPI. So, it is hard to infer anything from the value of comm... It would be interesting to do printf("%x %x",(int)MPI_COMM_WORLD,&size) before calling the Scalapack to get an idea of these quantities... 
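A minimal sketch expanding on the printf idea above (everything below, including the file name comm_check.c, is illustrative and not from the original posts). It prints how the MPI you are actually running encodes MPI_COMM_WORLD and where the out parameter lives, then calls MPI_Comm_size() normally. Incidentally, the failing comm=0x5b is 91 decimal, which -- if memory serves -- is the handle value MPICH-1's mpif.h assigns to MPI_COMM_WORLD; that would be consistent with the header/library mismatch diagnosed earlier in the thread (an MKL BLACS built against MPICH-1 handing its communicator to an MPICH2 runtime). If this little program runs cleanly under the same wrappers used for the ScaLAPACK build while the BLACS calls still fail, the problem is in the BLACS/MPI pairing rather than in MPI itself.

**** comm_check.c (hypothetical) ****************************
/* Print this MPI's representation of MPI_COMM_WORLD and the address
 * of the 'size' out parameter, then call MPI_Comm_size() normally. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int size = -1, rank = -1;

    MPI_Init(&argc, &argv);

    /* MPI_Comm is an opaque handle: a small integer in some
     * implementations, a pointer in others. */
    printf("MPI_COMM_WORLD handle = 0x%lx, &size = %p\n",
           (unsigned long) MPI_COMM_WORLD, (void *) &size);

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
*************************************************************

Compile and run it with the same wrappers used for example2.f, e.g. "mpicc -o comm_check comm_check.c" and "mpirun -n 2 ./comm_check" (wrapper names assumed).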
-Tony

Mark Hahn wrote:
>>MPI_Comm_size(82): MPI_Comm_size(comm=0x5b, size=0x80d807c) failed
>>
>
>I have no insight into any of this, except that these parameters
>are obviously reversed. 0x5b is a sensible size, and 0x80d807c
>is not. but 0x80d807c is a sensible pointer...
>
>if this isn't a source-level transposition, perhaps it has to do
>with mixed calling conventions?
>
>
>_______________________________________________
>Beowulf mailing list, Beowulf@beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

From eugen at leitl.org Thu Oct 21 02:35:52 2004 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] PARALLEL PROGRAMMING WORKSHOP Nov 29 - Dec 1, 2004, Juelich, Call for Participation (fwd from rabenseifner@hlrs.de) Message-ID: <20041021093552.GS1457@leitl.org> ----- Forwarded message from Rolf Rabenseifner ----- From: Rolf Rabenseifner Date: Thu, 21 Oct 2004 11:15:53 +0200 (CEST) To: eugen@leitl.org Subject: PARALLEL PROGRAMMING WORKSHOP Nov 29 - Dec 1, 2004, Juelich, Call for Participation

Dear Madam or Sir, could you please forward this announcement to any interested colleagues, since people interested in courses on "Parallel Programming" often cannot be reached directly through this mailing list. Places are still available in the MPI/OpenMP course at Forschungszentrum Jülich. The lectures are in German; the slides are in English. Kind regards, Rolf Rabenseifner

======================================================================
Call for Participation
======================================================================

PARALLEL PROGRAMMING WORKSHOP Fall 2004

Date             Location         Content (for beginners/advanced)
---------------- ---------------- --------------------------------
Nov. 29 - Dec. 1 FZ Jülich, ZAM   Parallel Programming (70% / 30%)
(3-day course in German)

Registration and further information: http://www.hlrs.de/news-events/events/2004/parallel_prog_fall2004/ (course G)

The aim of this workshop is to give people with some programming experience an introduction to the basics of parallel programming. The focus is on the programming models MPI and OpenMP, and on PETSc. Language support is given for Fortran and C. The course was developed by HLRS, EPCC, NIC and ZHR. Hands-on sessions will allow users to test and understand the basic constructs of MPI, OpenMP, and PETSc. Message passing with MPI is the major programming model on large distributed-memory systems in high-performance computing. OpenMP is dedicated to shared-memory systems. PETSc is a high-level programming interface for parallel solvers. Lectures will be given by Dr. Rolf Rabenseifner (HLRS, member of the MPI-2 Forum). Extended registration deadline: Nov. 12, 2004. The course language is German. All slides and handouts are in English.

---------------------------------------------------------------------
Please forward this announcement to any colleagues who may be interested. Our apologies if you receive multiple copies.
---------------------------------------------------------------------
---------------------------------------------------------------------
Dr.
Rolf Rabenseifner High Performance Computing Parallel Computing Center Stuttgart (HLRS) Rechenzentrum Universitaet Stuttgart (RUS) Phone: ++49 711 6855530 Allmandring 30 FAX: ++49 711 6787626 D-70550 Stuttgart rabenseifner@hlrs.de Germany http://www.hlrs.de/people/rabenseifner --------------------------------------------------------------------- ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041021/a8a96556/attachment.bin From scheinin at crs4.it Thu Oct 21 01:48:33 2004 From: scheinin at crs4.it (Alan Scheinine) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations Message-ID: <200410210848.i9L8mXfm005676@dali.crs4.it> Many venders sell U1 cases for dual Opteron based on the Tyan main boards, but on the other hand, a vender here says that the product Newisys 2100 is much more reliable than Tyan though it costs 10 to 20 percent more. I have not previously heard of Newisys and I do not recall it being mentioned in this mailing list. Would anyone like to comment? best regards, Alan Scheinine Email: scheinin@crs4.it From mphelps at cfa.harvard.edu Thu Oct 21 09:54:31 2004 From: mphelps at cfa.harvard.edu (Matt Phelps) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: <200410210848.i9L8mXfm005676@dali.crs4.it> References: <200410210848.i9L8mXfm005676@dali.crs4.it> Message-ID: <4177E9C7.601@cfa.harvard.edu> Alan Scheinine wrote: > Many venders sell U1 cases for dual Opteron based on the Tyan > main boards, but on the other hand, a vender here says that the > product Newisys 2100 is much more reliable than Tyan though it > costs 10 to 20 percent more. I have not previously heard of > Newisys and I do not recall it being mentioned in this mailing > list. Would anyone like to comment? > best regards, > Alan Scheinine Email: scheinin@crs4.it > Alan, These are available from Sun as the V20z server. We have a 32 node cluster of 'em and (so far ;-) are happy. -- Matt Phelps System Administrator, Computation Facility Harvard - Smithsonian Center for Astrophysics mphelps@cfa.harvard.edu, http://cfa-www.harvard.edu From lindahl at pathscale.com Thu Oct 21 10:29:04 2004 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: <200410210848.i9L8mXfm005676@dali.crs4.it> References: <200410210848.i9L8mXfm005676@dali.crs4.it> Message-ID: <20041021172904.GA1351@greglaptop.internal.keyresearch.com> > Many venders sell U1 cases for dual Opteron based on the Tyan > main boards, but on the other hand, a vender here says that the > product Newisys 2100 is much more reliable than Tyan though it > costs 10 to 20 percent more. Newisys came out with one of the first Opteron motherboards, and their 2 cpu motherboard is still shipped by Sun as the Sunfire 20z. It's a pretty expensive motherboard, but the % added to the final price depends on how much memory you buy and which cpus you're using. There are a *lot* of clusters out there using that Tyan motherboard. 
I don't think anyone's decided that it's any less reliable than any other low-end server motherboard. -- g From lindahl at pathscale.com Thu Oct 21 10:58:55 2004 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] bandwidth: who needs it? In-Reply-To: References: Message-ID: <20041021175855.GA1724@greglaptop.internal.keyresearch.com> > do you have applications that are pushing the limits of MPI bandwidth? > for instance, code that actually comes close to using the 8-900 MB/s > that current high-end interconnect provides? Bandwidth is important not only for huge messages that hit 900 MB/s, but also for medium sized messages. A naive formula for how long it takes to send a message is: T_size = T_0 + size / max_bandwidth For example, for a 4k message with T_0 = 5 usec and either 400 MB/s or 800 MB/s, T_4k_400M = 5 + 4k/400M = 5 + 10 = 15 usec T_4k_800M = 5 + 4k/800M = 5 + 5 = 10 usec A big difference. But you're only getting 266 MB/s and 400 MB/s bandwidth, respectively. Of course performance is usually a bit less than this naive model. But the effect is real, becoming unimportant for packets smaller than ~ 2k in this example. The size at which this effect becomes unimportant depends on T_0 and the bandwidth. -- greg From joelja at darkwing.uoregon.edu Thu Oct 21 10:32:24 2004 From: joelja at darkwing.uoregon.edu (Joel Jaeggli) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: <200410210848.i9L8mXfm005676@dali.crs4.it> References: <200410210848.i9L8mXfm005676@dali.crs4.it> Message-ID: I haven't personally dealt with newisys although we're looking at them for larger multiway boxen. however if you're looking at tyan, they sell completely integrated building blocks built around their motherboards. take a look at the tyan tranport gx28 which comes in 2 bay and 4 bay flavors http://www.tyan.com/products/html/gx28b2882.html On Thu, 21 Oct 2004, Alan Scheinine wrote: > Many venders sell U1 cases for dual Opteron based on the Tyan > main boards, but on the other hand, a vender here says that the > product Newisys 2100 is much more reliable than Tyan though it > costs 10 to 20 percent more. I have not previously heard of > Newisys and I do not recall it being mentioned in this mailing > list. Would anyone like to comment? > best regards, > Alan Scheinine Email: scheinin@crs4.it > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- -------------------------------------------------------------------------- Joel Jaeggli Unix Consulting joelja@darkwing.uoregon.edu GPG Key Fingerprint: 5C6E 0104 BAF0 40B0 5BD3 C38B F000 35AB B67F 56B2 From jholmes at psu.edu Thu Oct 21 11:34:29 2004 From: jholmes at psu.edu (Jason Holmes) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: References: <200410210848.i9L8mXfm005676@dali.crs4.it> Message-ID: <41780135.6020806@psu.edu> FWIW, we have 80 of the Sun v20z's (Newisys 2100) and we've been very happy with them (very well built, reliable, and no problems so far). We have 16 Angstrom blades with dual Opteron Tyan motherboards (s2882) in them as well. They initially had problems, but Tyan figured out the issue and shipped us replacement motherboards overnight at no cost. 
Thanks, -- Jason Holmes Joel Jaeggli wrote: > I haven't personally dealt with newisys although we're looking at them > for larger multiway boxen. > > however if you're looking at tyan, they sell completely integrated > building blocks built around their motherboards. > > take a look at the tyan tranport gx28 which comes in 2 bay and 4 bay > flavors > > http://www.tyan.com/products/html/gx28b2882.html > > On Thu, 21 Oct 2004, Alan Scheinine wrote: > >> Many venders sell U1 cases for dual Opteron based on the Tyan >> main boards, but on the other hand, a vender here says that the >> product Newisys 2100 is much more reliable than Tyan though it >> costs 10 to 20 percent more. I have not previously heard of >> Newisys and I do not recall it being mentioned in this mailing >> list. Would anyone like to comment? >> best regards, >> Alan Scheinine Email: scheinin@crs4.it >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > From james.p.lux at jpl.nasa.gov Thu Oct 21 11:53:17 2004 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] bandwidth: who needs it? References: <20041021175855.GA1724@greglaptop.internal.keyresearch.com> Message-ID: <000801c4b79f$3d016080$33a8a8c0@LAPTOP152422> Bandwidth is also important if you have any sort of "store and forward" process in the comm link (say, in a switch), because, typically, you have to wait for the entire message to arrive (so you can check the CRCC) before you can send it on it's way to the next destination. I'm sure that there are high performance switches around that only wait 'til enough of the header arrives to make the routing decision, but then, the switch has to passively pass the data through without error checking. ----- Original Message ----- From: "Greg Lindahl" To: "Mark Hahn" Cc: Sent: Thursday, October 21, 2004 10:58 AM Subject: Re: [Beowulf] bandwidth: who needs it? > > do you have applications that are pushing the limits of MPI bandwidth? > > for instance, code that actually comes close to using the 8-900 MB/s > > that current high-end interconnect provides? > > Bandwidth is important not only for huge messages that hit 900 MB/s, > but also for medium sized messages. A naive formula for how long it > takes to send a message is: From seth at integratedsolutions.org Thu Oct 21 10:37:50 2004 From: seth at integratedsolutions.org (Seth Bardash) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: <200410210848.i9L8mXfm005676@dali.crs4.it> Message-ID: <200410211737.i9LHbu409434@integratedsolutions.org> -----Original Message----- From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org] On Behalf Of Alan Scheinine Sent: Thursday, October 21, 2004 2:49 AM To: Beowulf@beowulf.org Subject: [Beowulf] dual Opteron recommendations Many venders sell U1 cases for dual Opteron based on the Tyan main boards, but on the other hand, a vender here says that the product Newisys 2100 is much more reliable than Tyan though it costs 10 to 20 percent more. I have not previously heard of Newisys and I do not recall it being mentioned in this mailing list. Would anyone like to comment? 
best regards, Alan Scheinine Email: scheinin@crs4.it _______________________________________________ Alan and the list, First, let me say that I am not trying to suggest to anyone to buy our products, only provide information we have seen integrating Dual Opteron systems: 1) The Newisys boards are made expecially for them and only come in their cases. I can not comment on their reliability as we have not used nor tried to integrate their systems. I think Sun uses them and charges appropriately for Sun. 2) We have built over 250 Dual Opteron systems, mostly 1U's used in large linux clusters. Initially, we used Tyan MB's and found that we were getting around a 10% to 15% DOA rate here before burn-in. After burn-in we had no failures here or deployed. So.... My take on the Tyan Dual Opteron MB's is that they work fine once they have gone through burn-in but the DOA rate out of the box is not good. YMMV. We then switched to the Arima (www.accelertech.com) HDAMA, ATO-2161 motherboards. These have had DOA's only caused by the UPS gorillas - All 2 of the HDAMA MB's that were DOA were received in badly damaged boxes. Over 190 have now been received and integrated with no failures - either here, in burn-in or in the field. BTW, this motherboard is the AMD reference design and has been very robust even with Enhanced Latency memory (CL 2-3-2-6-1). We have installed Fedora Core 2, RH ES 3.0, White Box Linux and SUSE 9.1 and they all work fine. We are testing the Server and Workstation Iwill motherboards (www.iwillusa.com) and they seem to work fine so far. There are many other factors that should influence Dual Opteron vendor selection. These factors are: cooling, performance, configuration, I/O, reliability, technical expertise, support and price - your order of importance will usually dictate a vendor. Hope this provides the feedback required to make an informed decision about motherboard selection and system vendors. Seth Bardash Integrated Solutions and Systems http://www.integratedsolutions.org Supplier of AMD and Intel Servers and Systems running Windows and Linux. *Failure can not cope with perseverance* --- Outgoing mail is certified Virus Free. Checked by AVG anti-virus system (http://www.grisoft.com). Version: 6.0.779 / Virus Database: 526 - Release Date: 10/19/2004 From lindahl at pathscale.com Thu Oct 21 12:06:59 2004 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] bandwidth: who needs it? In-Reply-To: <000801c4b79f$3d016080$33a8a8c0@LAPTOP152422> References: <20041021175855.GA1724@greglaptop.internal.keyresearch.com> <000801c4b79f$3d016080$33a8a8c0@LAPTOP152422> Message-ID: <20041021190659.GF1724@greglaptop.internal.keyresearch.com> On Thu, Oct 21, 2004 at 11:53:17AM -0700, Jim Lux wrote: > I'm sure that there are > high performance switches around that only wait 'til enough of the header > arrives to make the routing decision, but then, the switch has to passively > pass the data through without error checking. Actually, it's more common to route immediately after you've seen the header, but to compute the entire CRC and then do something minimal if the CRC turns out to be wrong. I may be confusing the exact details, but I think that Infiniband just counts the bad packets and depends on the endpoint to discard the packet. Myrinet both counts the error, and sticks a zero into the CRC, so that subsequent hops will know that the CRC was found to be bad earlier in the path. 
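Before the next message, a back-of-the-envelope sketch (not part of the original thread; file name and numbers chosen for illustration) of the store-and-forward point Jim Lux raises above: a switch that must buffer an entire frame before forwarding it adds frame_bits / link_rate of serialization delay per hop, on top of any fixed switch latency. For a 1500-byte Ethernet frame this works out to roughly 12 us per hop at 1 Gbit/s and 1.2 us at 10 Gbit/s, in the same ballpark as the ~15 us vs ~1.5 us figures quoted in the reply that follows (wire-level overheads push the observed numbers a little higher).

**** sf_delay.c (hypothetical) *******************************
/* Per-hop store-and-forward serialization delay:
 * t = frame_bits / link_rate. */
#include <stdio.h>

/* Serialization time in microseconds for one frame on one link. */
static double serialization_us(double frame_bytes, double gbit_per_s)
{
    return (frame_bytes * 8.0) / (gbit_per_s * 1e9) * 1e6;
}

int main(void)
{
    const double frame = 1500.0;          /* standard Ethernet MTU */
    const double rates[] = { 1.0, 10.0 }; /* GigE and 10 GigE */
    int i;

    for (i = 0; i < 2; i++)
        printf("%.0f-byte frame at %.0f Gbit/s: %.1f us per store-and-forward hop\n",
               frame, rates[i], serialization_us(frame, rates[i]));

    return 0;
}
*************************************************************

A cut-through switch that forwards as soon as the header has arrived avoids most of this term, which is the trade-off against end-to-end CRC checking discussed in this part of the thread.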
GigE and 10 gigE switches receive the whole packet before sending it on, and so the higher bandwidth is a huge help for 10 gig's latency -- drops it from 15 usec for a 1500 byte packet to 1.5 usec. -- greg From hanzl at noel.feld.cvut.cz Thu Oct 21 13:09:16 2004 From: hanzl at noel.feld.cvut.cz (hanzl@noel.feld.cvut.cz) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Storage and cachefs on nodes - NFS support exists In-Reply-To: <20041011001046L.hanzl@unknown-domain> References: <20041008233311G.hanzl@unknown-domain> <20041011001046L.hanzl@unknown-domain> Message-ID: <20041021220916X.hanzl@unknown-domain> I have the pleasure to give you very optimistic update on persistent file caching. Few days ago I wrote these skeptic lines: > I am not sure how much can I expect from linux cachefs as seen in > e.g. 2.6.9-rc3-mm3 - if I got it right, it is a kernel subsystem with > intra-kernel API, being now tested with AFS and intended as usable for > NFS. It is however "low" on NFS team priority list. So linux cachefs > might provide cleaner solutions than Solaris cachefs - if it ever > provides them. and now I see that NFS already can use this local-disk-caching subsystem! There is linux-cachefs maillist for this, you may want to read: http://www.redhat.com/archives/linux-cachefs/2004-October/msg00027.html - 2.6.9-rc4-mm1 patch that will enable NFS (even NFS4) to do persistent file caching on the local harddisk http://www.redhat.com/archives/linux-cachefs/2004-October/msg00004.html - older message explaining what is going on http://www.redhat.com/archives/linux-cachefs/2004-October/msg00019.html - about ways to get this to the mainline kernel http://www.redhat.com/mailman/listinfo/linux-cachefs - list archives and subscription page I believe that this subsystem will be an immense help for work on huge data with mostly read access. And much less administrative hassle - once this gets to the mainline kernel (well, yeah, any help to push it there is welcome!) it will be much much easier to use. Just a normal NFS server. Just a normal NFS client with the NFS_MOUNT_FSCACHE or NFS4_MOUNT_FSCACHE mounting option ON. Hope that this attempt to make relatively simple persistent caching for Linux will catch up and survive even kernel_version+=0.2 (usual killer for similar projects). Best Regards Vaclav Hanzl From redboots at ufl.edu Thu Oct 21 13:19:45 2004 From: redboots at ufl.edu (JOHNSON,PAUL C) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] parallel sparse linear solver choice? Message-ID: <1377588425.1098389985998.JavaMail.osg@osgjas02.cns.ufl.edu> All: I was wondering whats your choice for a parallel sparse linear solver? I have a beowulf cluster(~4 nodes, ok really small I know) connected at 100Mbps. The computers are P4 2.2GHz with 1Gb ram. The matrices are formed by a finite element program. They are sparse, square, symmetric, and I would like to solve problems with more than 200000 columns. Which of the solvers is easiest to set up and utilize? One problem I am trying to solve is 156,240 x 156,240 with 6,023,241 non-zero entries. Thanks for any help, Paul -- JOHNSON,PAUL C From rbw at ahpcrc.org Thu Oct 21 13:59:52 2004 From: rbw at ahpcrc.org (Richard Walsh) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] bandwidth: who needs it? In-Reply-To: <20041021175855.GA1724@greglaptop.internal.keyresearch.com> Message-ID: <20041021205952.CF20C6EB10@clam.ahpcrc.org> Greg Lindahl wrote: >> do you have applications that are pushing the limits of MPI bandwidth? 
>> for instance, code that actually comes close to using the 8-900 MB/s >> that current high-end interconnect provides? > >Bandwidth is important not only for huge messages that hit 900 MB/s, >but also for medium sized messages. A naive formula for how long it >takes to send a message is: > >T_size = T_0 + size / max_bandwidth > >For example, for a 4k message with T_0 = 5 usec and either 400 MB/s or >800 MB/s, > >T_4k_400M = 5 + 4k/400M = 5 + 10 = 15 usec >T_4k_800M = 5 + 4k/800M = 5 + 5 = 10 usec > >A big difference. But you're only getting 266 MB/s and 400 MB/s >bandwidth, respectively. > >Of course performance is usually a bit less than this naive model. But >the effect is real, becoming unimportant for packets smaller than ~ 2k >in this example. The size at which this effect becomes unimportant >depends on T_0 and the bandwidth. The above also makes a point about a mid-range regime of message sizes whose transfer times are affected ~equally by bandwidth and latency changes. Halving the latency in the 4K/800M case above is equivalent to doubling the bandwidth for a message of this size: T_4k_800M. = 2.5 + 4k/800M = 2.5 + 5.0 = 7.5 usec T_4k_800M = 5.0 + 4k/800M = 5.0 + 5.0 = 10.0 usec T_4k_1600M = 5.0 + 4k/1600M = 5.0 + 2.5 = 7.5 usec For a given interconnect with a known latency and bandwidth there is a "characteristic" message size whose transfer time is equally sensitive to perturbations in bandwidth and latency (latency and bandwidth piece of the transfer time are equal). So, for an "Elan-4-like" interconnect characteristic message length would be 1.6k: T_4k_800M = 1.0 + 1.6k/800M = 1.0 + 2.0 = 3.0 usec T_4k_800M = 2.0 + 1.6k/800M = 2.0 + 2.0 = 4.0 usec T_4k_1600M = 2.0 + 1.6k/1600M = 2.0 + 1.0 = 3.0 usec Messages sizes in the vicinity of the characteristic length will respond approximately equally to improvements in either factor. Messages much larger in size will be more sensitive to bandwidth improvements in an interconnect upgrade while message sizes much smaller will be more sensitive to latency improvements in an upgrade. One might argue that bandwidth actually matters more because message sizes (along with problem sizes) can in theory grow indefinitely (drop in some more memory and double you array sizes) while they can be made only be so small -- this is a position supported by the rate of storage growth, but undermined by slower bandwidth growth and processor count increases. I think I will keep my bandwidth though ... and take any off of the hands of those who ... don't need it ... ;-) ... rbw #--------------------------------------------------- # Richard Walsh # Project Manager, Cluster Computing, Computational # Chemistry and Finance # netASPx, Inc. # 1200 Washington Ave. So. # Minneapolis, MN 55415 # VOX: 612-337-3467 # FAX: 612-337-3400 # EMAIL: rbw@networkcs.com, richard.walsh@netaspx.com # rbw@ahpcrc.org # #--------------------------------------------------- # "What you can do, or dream you can, begin it; # Boldness has genius, power, and magic in it." # -Goethe #--------------------------------------------------- # "Without mystery, there can be no authority." # -Charles DeGaulle #--------------------------------------------------- # "Why waste time learning when ignornace is # instantaneous?" 
-Thomas Hobbes #--------------------------------------------------- From bill at cse.ucdavis.edu Thu Oct 21 19:03:50 2004 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: <200410210848.i9L8mXfm005676@dali.crs4.it> References: <200410210848.i9L8mXfm005676@dali.crs4.it> Message-ID: <20041022020350.GA32640@cse.ucdavis.edu> On Thu, Oct 21, 2004 at 10:48:33AM +0200, Alan Scheinine wrote: > Many venders sell U1 cases for dual Opteron based on the Tyan > main boards, but on the other hand, a vender here says that the > product Newisys 2100 is much more reliable than Tyan though it > costs 10 to 20 percent more. I have not previously heard of > Newisys and I do not recall it being mentioned in this mailing > list. Would anyone like to comment? > best regards, I'm familar with 48 sun v20z (newisys) machines around here, only one died so far with a hard memory error (I.e. won't boot). We also have 40 ish Tyan systems without any failures, all were burned in by the vendor. Speaking of which, has anyone done anything useful with the v20z LCD display, ours just say something like IP address of the management interface and OS booted or similar. I was hoping for hostname, maybe system load, even a way to pull a node out of the queue (there are several buttons under the LCD). -- Bill Broadley Computational Science and Engineering UC Davis From bill at cse.ucdavis.edu Thu Oct 21 19:10:54 2004 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: <200410210848.i9L8mXfm005676@dali.crs4.it> References: <200410210848.i9L8mXfm005676@dali.crs4.it> Message-ID: <20041022021054.GB32640@cse.ucdavis.edu> Oh, speaking of which the main advantage I've seen in the newisys is the remote managability. You can ssh to the management interface, check temperatures, turn the machine on/off, and other related functionality. Alas, as far as I can tell the passthru for the management interface (it has 2 ethernet ports for the management) isn't usable in any sane way. Unless you want to do something like: ssh node001 ssh node002 ssh node003 ssh node004 .... turn node off exit ... exit exit The idea of not requiring a masterswitch+ for power management, a cyclades or similar for serial management, or a switch for a seperate management network can be attractive. Not that other motherboards don't have management options. -- Bill Broadley Computational Science and Engineering UC Davis From cjoung at tpg.com.au Thu Oct 21 23:02:17 2004 From: cjoung at tpg.com.au (cjoung@tpg.com.au) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] p4_error: interrupt SIGSEGV: 11 Killed by signal 2. Message-ID: <1098424937.4178a269ccd90@postoffice.tpg.com.au> Firstly, Thank you to Mark H, Anthony S and in particular Henry G for your helpful comments on my MPI/scaLAPACK problem. After reading your comments, I removed MPICH2, installed mpich1.2.6 and this did indeed fix the "Null Comm Pointer" error! > Date: Tue, 19 Oct 2004 14:21:31 -0700 > From: "Gabb, Henry" > Subject: Re: [Beowulf] MPI & ScaLAPACK: > error in MPI_Comm_size: Invalid communicator > The "Null Comm pointer" error that you're seeing is almost always due to > a header mismatch. I don't think Intel Cluster MKL 7.0 supports MPICH2 yet. 
> > > Error message: > > aborting job: Fatal error in MPI_Comm_size: Invalid communicator, error stack: > > MPI_Comm_size(82): MPI_Comm_size(comm=0x5b, size=0x80d807c) failed > > MPI_Comm_size(66): Null Comm pointer ********************************************************** Current Problem: I have another problem now, also related to MPI/scaLAPACK. I tried to compile and run the example1.f program - It uses the scalapack PDGESV subroutine to solve the equation [A]x=b (from the netlib/scalapack website - I did NOT modify it) (see end of email for copy of example1.f) It seems to compile ok, but on execution it gives this error message: **************************************** [tony@carmine clint]$ mpif77 -o ex1 example1.f -L/opt/intel/mkl70cluster/lib/32 -lmkl_scalapack -lmkl_blacsF77init -lmkl_blacs -lmkl_blacsF77init -lmkl_lapack -lmkl_ia32 -lguide -lpthread -static-libcxa [tony@carmine clint]$ mpirun -np 6 ./ex1 p0_24505: p4_error: interrupt SIGSEGV: 11 Killed by signal 2. Killed by signal 2. Killed by signal 2. Killed by signal 2. Killed by signal 2. [tony@carmine clint]$ **************************************** (there are also comments about "broken pipes", but I didn't include that part above) ..as far as I can discover on the web, SIGSEGV represents a "segmentation fault", beyond this, I don't know what to do for a fix. Some people have suggested fixes such as: * increasing memory sizes: i.e. type in: > >export P4_GLOBMEMSIZE=536870912 > > > >(=512MB). ..but this didn't do anything for me. Generally, I have not seen any other solutions which have had a positive response. They say, a problem like this is normally attributed to a bug in the source code, but seeing as the netlib/scalapack developers give this source code out as the beginners basic 'hello world' program, I doubt this program would carry a bug in it! I would have to guess that there's something ELSE wrong. (In any case, I tried another program out of a book that does basically the same thing - same problem) I was hoping a reader in this forum has seen this problem before, and knows of a solution. Any suggestions, even speculative, would be most appreciated, (Please use simple language - I am quite a novice at all of this...) with many thanks, Clint Joung Postdoctoral Research Associate Department of Chemical Engineering University of Sydney, NSW 2006 Australia ps: My system details and the source code 'example1.f': OS: Linux Redhat Fedora Core 2 FC: Intel Fortran Compiler V8.0 (ifort) (I've also tried building MPI libraries using GNU g77 - same problem) CC: GNU gcc (for some reason, MPI libraries don't 'make' properly using intel icpc) Scalapack et al: Intel Cluster Maths Kernel Library for Linux v7.0 MPI: mpich-1.2.6 (I've also tried mpich1.2.5.2 - same problem) The example1.f program. It runs ok as far as the actual call to PDGESV, then it falls over..... **EXAMPLE1.F*************************************** PROGRAM EXAMPLE1 * * Example Program solving Ax=b via ScaLAPACK routine PDGESV * * .. Parameters .. INTEGER DLEN_, IA, JA, IB, JB, M, N, MB, NB, RSRC, $ CSRC, MXLLDA, MXLLDB, NRHS, NBRHS, NOUT, $ MXLOCR, MXLOCC, MXRHSC PARAMETER ( DLEN_ = 9, IA = 1, JA = 1, IB = 1, JB = 1, $ M = 9, N = 9, MB = 2, NB = 2, RSRC = 0, $ CSRC = 0, MXLLDA = 5, MXLLDB = 5, NRHS = 1, $ NBRHS = 1, NOUT = 6, MXLOCR = 5, MXLOCC = 4, $ MXRHSC = 1 ) DOUBLE PRECISION ONE PARAMETER ( ONE = 1.0D+0 ) * .. * .. Local Scalars .. INTEGER ICTXT, INFO, MYCOL, MYROW, NPCOL, NPROW DOUBLE PRECISION ANORM, BNORM, EPS, RESID, XNORM * .. * .. Local Arrays .. 
INTEGER DESCA( DLEN_ ), DESCB( DLEN_ ), $ IPIV( MXLOCR+NB ) DOUBLE PRECISION A( MXLLDA, MXLOCC ), A0( MXLLDA, MXLOCC ), $ B( MXLLDB, MXRHSC ), B0( MXLLDB, MXRHSC ), $ WORK( MXLOCR ) * .. * .. External Functions .. DOUBLE PRECISION PDLAMCH, PDLANGE EXTERNAL PDLAMCH, PDLANGE * .. * .. External Subroutines .. EXTERNAL BLACS_EXIT, BLACS_GRIDEXIT, BLACS_GRIDINFO, $ DESCINIT, MATINIT, PDGEMM, PDGESV, PDLACPY, $ SL_INIT * .. * .. Intrinsic Functions .. INTRINSIC DBLE * .. * .. Data statements .. DATA NPROW / 2 / , NPCOL / 3 / * .. * .. Executable Statements .. * * INITIALIZE THE PROCESS GRID * CALL SL_INIT( ICTXT, NPROW, NPCOL ) CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL ) * * If I'm not in the process grid, go to the end of the program * IF( MYROW.EQ.-1 ) $ GO TO 10 * * DISTRIBUTE THE MATRIX ON THE PROCESS GRID * Initialize the array descriptors for the matrices A and B * CALL DESCINIT( DESCA, M, N, MB, NB, RSRC, CSRC, ICTXT, MXLLDA, $ INFO ) CALL DESCINIT( DESCB, N, NRHS, NB, NBRHS, RSRC, CSRC, ICTXT, $ MXLLDB, INFO ) * * Generate matrices A and B and distribute to the process grid * CALL MATINIT( A, DESCA, B, DESCB ) * * Make a copy of A and B for checking purposes * CALL PDLACPY( 'All', N, N, A, 1, 1, DESCA, A0, 1, 1, DESCA ) CALL PDLACPY( 'All', N, NRHS, B, 1, 1, DESCB, B0, 1, 1, DESCB ) * * CALL THE SCALAPACK ROUTINE * Solve the linear system A * X = B * CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, $ INFO ) * IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN WRITE( NOUT, FMT = 9999 ) WRITE( NOUT, FMT = 9998 )M, N, NB WRITE( NOUT, FMT = 9997 )NPROW*NPCOL, NPROW, NPCOL WRITE( NOUT, FMT = 9996 )INFO END IF * * Compute residual ||A * X - B|| / ( ||X|| * ||A|| * eps * N ) * EPS = PDLAMCH( ICTXT, 'Epsilon' ) ANORM = PDLANGE( 'I', N, N, A, 1, 1, DESCA, WORK ) BNORM = PDLANGE( 'I', N, NRHS, B, 1, 1, DESCB, WORK ) CALL PDGEMM( 'N', 'N', N, NRHS, N, ONE, A0, 1, 1, DESCA, B, 1, 1, $ DESCB, -ONE, B0, 1, 1, DESCB ) XNORM = PDLANGE( 'I', N, NRHS, B0, 1, 1, DESCB, WORK ) RESID = XNORM / ( ANORM*BNORM*EPS*DBLE( N ) ) * IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN IF( RESID.LT.10.0D+0 ) THEN WRITE( NOUT, FMT = 9995 ) WRITE( NOUT, FMT = 9993 )RESID ELSE WRITE( NOUT, FMT = 9994 ) WRITE( NOUT, FMT = 9993 )RESID END IF END IF * * RELEASE THE PROCESS GRID * Free the BLACS context * CALL BLACS_GRIDEXIT( ICTXT ) 10 CONTINUE * * Exit the BLACS * CALL BLACS_EXIT( 0 ) * 9999 FORMAT( / 'ScaLAPACK Example Program #1 -- May 1, 1997' ) 9998 FORMAT( / 'Solving Ax=b where A is a ', I3, ' by ', I3, $ ' matrix with a block size of ', I3 ) 9997 FORMAT( 'Running on ', I3, ' processes, where the process grid', $ ' is ', I3, ' by ', I3 ) 9996 FORMAT( / 'INFO code returned by PDGESV = ', I3 ) 9995 FORMAT( / $ 'According to the normalized residual the solution is correct.' $ ) 9994 FORMAT( / $ 'According to the normalized residual the solution is incorrect.' $ ) 9993 FORMAT( / '||A*x - b|| / ( ||x||*||A||*eps*N ) = ', 1P, E16.8 ) STOP END SUBROUTINE MATINIT( AA, DESCA, B, DESCB ) * * MATINIT generates and distributes matrices A and B (depicted in * Figures 2.5 and 2.6) to a 2 x 3 process grid * * .. Array Arguments .. INTEGER DESCA( * ), DESCB( * ) DOUBLE PRECISION AA( * ), B( * ) * .. * .. Parameters .. INTEGER CTXT_, LLD_ PARAMETER ( CTXT_ = 2, LLD_ = 9 ) * .. * .. Local Scalars .. INTEGER ICTXT, MXLLDA, MYCOL, MYROW, NPCOL, NPROW DOUBLE PRECISION A, C, K, L, P, S * .. * .. External Subroutines .. EXTERNAL BLACS_GRIDINFO * .. * .. Executable Statements .. 
* ICTXT = DESCA( CTXT_ ) CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL ) * S = 19.0D0 C = 3.0D0 A = 1.0D0 L = 12.0D0 P = 16.0D0 K = 11.0D0 * MXLLDA = DESCA( LLD_ ) * IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN AA( 1 ) = S AA( 2 ) = -S AA( 3 ) = -S AA( 4 ) = -S AA( 5 ) = -S AA( 1+MXLLDA ) = C AA( 2+MXLLDA ) = C AA( 3+MXLLDA ) = -C AA( 4+MXLLDA ) = -C AA( 5+MXLLDA ) = -C AA( 1+2*MXLLDA ) = A AA( 2+2*MXLLDA ) = A AA( 3+2*MXLLDA ) = A AA( 4+2*MXLLDA ) = A AA( 5+2*MXLLDA ) = -A AA( 1+3*MXLLDA ) = C AA( 2+3*MXLLDA ) = C AA( 3+3*MXLLDA ) = C AA( 4+3*MXLLDA ) = C AA( 5+3*MXLLDA ) = -C B( 1 ) = 0.0D0 B( 2 ) = 0.0D0 B( 3 ) = 0.0D0 B( 4 ) = 0.0D0 B( 5 ) = 0.0D0 ELSE IF( MYROW.EQ.0 .AND. MYCOL.EQ.1 ) THEN AA( 1 ) = A AA( 2 ) = A AA( 3 ) = -A AA( 4 ) = -A AA( 5 ) = -A AA( 1+MXLLDA ) = L AA( 2+MXLLDA ) = L AA( 3+MXLLDA ) = -L AA( 4+MXLLDA ) = -L AA( 5+MXLLDA ) = -L AA( 1+2*MXLLDA ) = K AA( 2+2*MXLLDA ) = K AA( 3+2*MXLLDA ) = K AA( 4+2*MXLLDA ) = K AA( 5+2*MXLLDA ) = K ELSE IF( MYROW.EQ.0 .AND. MYCOL.EQ.2 ) THEN AA( 1 ) = A AA( 2 ) = A AA( 3 ) = A AA( 4 ) = -A AA( 5 ) = -A AA( 1+MXLLDA ) = P AA( 2+MXLLDA ) = P AA( 3+MXLLDA ) = P AA( 4+MXLLDA ) = P AA( 5+MXLLDA ) = -P ELSE IF( MYROW.EQ.1 .AND. MYCOL.EQ.0 ) THEN AA( 1 ) = -S AA( 2 ) = -S AA( 3 ) = -S AA( 4 ) = -S AA( 1+MXLLDA ) = -C AA( 2+MXLLDA ) = -C AA( 3+MXLLDA ) = -C AA( 4+MXLLDA ) = C AA( 1+2*MXLLDA ) = A AA( 2+2*MXLLDA ) = A AA( 3+2*MXLLDA ) = A AA( 4+2*MXLLDA ) = -A AA( 1+3*MXLLDA ) = C AA( 2+3*MXLLDA ) = C AA( 3+3*MXLLDA ) = C AA( 4+3*MXLLDA ) = C B( 1 ) = 1.0D0 B( 2 ) = 0.0D0 B( 3 ) = 0.0D0 B( 4 ) = 0.0D0 ELSE IF( MYROW.EQ.1 .AND. MYCOL.EQ.1 ) THEN AA( 1 ) = A AA( 2 ) = -A AA( 3 ) = -A AA( 4 ) = -A AA( 1+MXLLDA ) = L AA( 2+MXLLDA ) = L AA( 3+MXLLDA ) = -L AA( 4+MXLLDA ) = -L AA( 1+2*MXLLDA ) = K AA( 2+2*MXLLDA ) = K AA( 3+2*MXLLDA ) = K AA( 4+2*MXLLDA ) = K ELSE IF( MYROW.EQ.1 .AND. MYCOL.EQ.2 ) THEN AA( 1 ) = A AA( 2 ) = A AA( 3 ) = -A AA( 4 ) = -A AA( 1+MXLLDA ) = P AA( 2+MXLLDA ) = P AA( 3+MXLLDA ) = -P AA( 4+MXLLDA ) = -P END IF RETURN END SUBROUTINE SL_INIT( ICTXT, NPROW, NPCOL ) * * .. Scalar Arguments .. INTEGER ICTXT, NPCOL, NPROW * .. * * Purpose * ======= * * SL_INIT initializes an NPROW x NPCOL process grid using a row-major * ordering of the processes. This routine retrieves a default system * context which will include all available processes. In addition it * spawns the processes if needed. * * Arguments * ========= * * ICTXT (global output) INTEGER * ICTXT specifies the BLACS context handle identifying the * created process grid. The context itself is global. * * NPROW (global input) INTEGER * NPROW specifies the number of process rows in the grid * to be created. * * NPCOL (global input) INTEGER * NPCOL specifies the number of process columns in the grid * to be created. * * ===================================================================== * * .. Local Scalars .. INTEGER IAM, NPROCS * .. * .. External Subroutines .. EXTERNAL BLACS_GET, BLACS_GRIDINIT, BLACS_PINFO, $ BLACS_SETUP * .. * .. Executable Statements .. 
* * Get starting information * CALL BLACS_PINFO( IAM, NPROCS ) * * If machine needs additional set up, do it now * IF( NPROCS.LT.1 ) THEN IF( IAM.EQ.0 ) $ NPROCS = NPROW*NPCOL CALL BLACS_SETUP( IAM, NPROCS ) END IF * * Define process grid * CALL BLACS_GET( -1, 0, ICTXT ) CALL BLACS_GRIDINIT( ICTXT, 'Row-major', NPROW, NPCOL ) * RETURN * * End of SL_INIT * END *************************************************** From philippe.blaise at cea.fr Fri Oct 22 00:41:58 2004 From: philippe.blaise at cea.fr (Philippe Blaise) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] bandwidth: who needs it? In-Reply-To: <20041021205952.CF20C6EB10@clam.ahpcrc.org> References: <20041021205952.CF20C6EB10@clam.ahpcrc.org> Message-ID: <4178B9C5.4060004@cea.fr> The naive formula T_size = T_0 + size / max_bandwidth for size1/2 = T_0 * max_bandwith gives T_size1/2 = 2 * T0 which is a characteristic message length : you reach (more or less) half the bandwith, and it takes 2 * latency seconds to send / recv the message. For example, with T_0 = 5 usec and max_bandwith = 400 or 800 MB/s you obtain T_size1/2(5, 400) = 5 * 400 = 2 kB T_size1/2(5, 800) = 5 * 800 = 4 kB Phil. Richard Walsh wrote: >Greg Lindahl wrote: > > > >>>do you have applications that are pushing the limits of MPI bandwidth? >>>for instance, code that actually comes close to using the 8-900 MB/s >>>that current high-end interconnect provides? >>> >>> >>Bandwidth is important not only for huge messages that hit 900 MB/s, >>but also for medium sized messages. A naive formula for how long it >>takes to send a message is: >> >>T_size = T_0 + size / max_bandwidth >> >>For example, for a 4k message with T_0 = 5 usec and either 400 MB/s or >>800 MB/s, >> >>T_4k_400M = 5 + 4k/400M = 5 + 10 = 15 usec >>T_4k_800M = 5 + 4k/800M = 5 + 5 = 10 usec >> >>A big difference. But you're only getting 266 MB/s and 400 MB/s >>bandwidth, respectively. >> >>Of course performance is usually a bit less than this naive model. But >>the effect is real, becoming unimportant for packets smaller than ~ 2k >>in this example. The size at which this effect becomes unimportant >>depends on T_0 and the bandwidth. >> >> > >The above also makes a point about a mid-range regime of message sizes >whose transfer times are affected ~equally by bandwidth and latency >changes. Halving the latency in the 4K/800M case above is equivalent >to doubling the bandwidth for a message of this size: > > T_4k_800M. = 2.5 + 4k/800M = 2.5 + 5.0 = 7.5 usec > T_4k_800M = 5.0 + 4k/800M = 5.0 + 5.0 = 10.0 usec > T_4k_1600M = 5.0 + 4k/1600M = 5.0 + 2.5 = 7.5 usec > >For a given interconnect with a known latency and bandwidth there is >a "characteristic" message size whose transfer time is equally sensitive >to perturbations in bandwidth and latency (latency and bandwidth piece >of the transfer time are equal). So, for an "Elan-4-like" interconnect >characteristic message length would be 1.6k: > > T_4k_800M = 1.0 + 1.6k/800M = 1.0 + 2.0 = 3.0 usec > T_4k_800M = 2.0 + 1.6k/800M = 2.0 + 2.0 = 4.0 usec > T_4k_1600M = 2.0 + 1.6k/1600M = 2.0 + 1.0 = 3.0 usec > >Messages sizes in the vicinity of the characteristic length will >respond approximately equally to improvements in either factor. >Messages much larger in size will be more sensitive to bandwidth >improvements in an interconnect upgrade while message sizes much >smaller will be more sensitive to latency improvements in an upgrade. 
> >One might argue that bandwidth actually matters more because message >sizes (along with problem sizes) can in theory grow indefinitely (drop >in some more memory and double you array sizes) while they can be made >only be so small -- this is a position supported by the rate of storage >growth, but undermined by slower bandwidth growth and processor count >increases. > >I think I will keep my bandwidth though ... and take any off of the >hands of those who ... don't need it ... ;-) ... > >rbw > >#--------------------------------------------------- ># Richard Walsh ># Project Manager, Cluster Computing, Computational ># Chemistry and Finance ># netASPx, Inc. ># 1200 Washington Ave. So. ># Minneapolis, MN 55415 ># VOX: 612-337-3467 ># FAX: 612-337-3400 ># EMAIL: rbw@networkcs.com, richard.walsh@netaspx.com ># rbw@ahpcrc.org ># >#--------------------------------------------------- ># "What you can do, or dream you can, begin it; ># Boldness has genius, power, and magic in it." ># -Goethe >#--------------------------------------------------- ># "Without mystery, there can be no authority." ># -Charles DeGaulle >#--------------------------------------------------- ># "Why waste time learning when ignornace is ># instantaneous?" -Thomas Hobbes >#--------------------------------------------------- > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > From Thomas.Alrutz at dlr.de Fri Oct 22 05:05:47 2004 From: Thomas.Alrutz at dlr.de (Thomas Alrutz) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: <200410210848.i9L8mXfm005676@dali.crs4.it> References: <200410210848.i9L8mXfm005676@dali.crs4.it> Message-ID: <4178F79B.5050903@dlr.de> Alan Scheinine schrieb: > Many venders sell U1 cases for dual Opteron based on the Tyan > main boards, but on the other hand, a vender here says that the > product Newisys 2100 is much more reliable than Tyan though it > costs 10 to 20 percent more. I have not previously heard of > Newisys and I do not recall it being mentioned in this mailing > list. Would anyone like to comment? > best regards, > Alan Scheinine Email: scheinin@crs4.it > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf Hi Alan, we have some Opteron cluster nodes and some Opteron workstations based on the Arima HDAMA MB (Rioworks http://www.rioworks.com/HDAMA.htm). We are quite happy with those systems, because there are running without any failure since installation (one year ago). And as far as I know, there are some 1U barebones with this type of board, too. If you looking for a remote managment system on your motherboards, there is the possibility to attach an IPMI-Board called ARMC to the HDAMA. This would give you the same mamagement features like the Newisys boards, but for additional costs (~300 US$ ??) and size. Thomas -- __/|__ | Dipl.-Math. Thomas Alrutz /_/_/_/ | DLR Institute of Aerodynamics and Flow Technology |/ | Numerical Methods Department DLR | Bunsenstr. 10 | D-37073 Goettingen/Germany From thomas.clausen at aoes.com Fri Oct 22 02:58:26 2004 From: thomas.clausen at aoes.com (Thomas Clausen) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] parallel sparse linear solver choice? 
In-Reply-To: <1377588425.1098389985998.JavaMail.osg@osgjas02.cns.ufl.edu> References: <1377588425.1098389985998.JavaMail.osg@osgjas02.cns.ufl.edu> Message-ID: <20041022095824.GE19008@aoes.com> Hi Paul, You might want to have a look at http://www-unix.mcs.anl.gov/petsc/petsc-2/index.html Thomas On Thu, Oct 21, 2004 at 04:19:45PM -0400, JOHNSON,PAUL C wrote: > All: > > I was wondering whats your choice for a parallel sparse linear > solver? I have a beowulf cluster(~4 nodes, ok really small I > know) connected at 100Mbps. The computers are P4 2.2GHz with 1Gb > ram. The matrices are formed by a finite element program. They > are sparse, square, symmetric, and I would like to solve problems > with more than 200000 columns. Which of the solvers is easiest to > set up and utilize? One problem I am trying to solve is 156,240 > x 156,240 with 6,023,241 non-zero entries. > > Thanks for any help, > Paul > > -- > JOHNSON,PAUL C > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Thomas Clausen, PhD. AOES Group BV, http://www.aoes.com Phone +31(0)71 5795563 Fax +31(0)71572 1277 From hahn at physics.mcmaster.ca Fri Oct 22 09:41:12 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: <20041022021054.GB32640@cse.ucdavis.edu> Message-ID: > Oh, speaking of which the main advantage I've seen in the newisys is > the remote managability. You can ssh to the management interface, check > temperatures, turn the machine on/off, and other related functionality. this is not newisis-specific, of course! we've got a cluster of HP DL145's (which look a LOT like Celestica a2210's). they have a nice lan-enabled IPMI card, which you can telnet to or use the reasonably secure IPMItools. HP certainly has ssh on other of their products (switches, for instance), so I'd expect them to add ssh support everywhere. > Alas, as far as I can tell the passthru for the management interface > (it has 2 ethernet ports for the management) isn't usable in any sane way. ipmitools seem to work nicely. I've got an perl/expect script for dealing with the telnet interface, if anyone wants it. > The idea of not requiring a masterswitch+ for power management, a > cyclades or similar for serial management, or a switch for a seperate decent IPMI support obsoletes all that junk. the DL145 gives you working bios redirection, as well as power control, warm/cold reset, lm_sensors-type data, etc. we're requiring this kind of remote management in all future purposes, and I'm convinced everyone doing clusters should do so as well. regards, mark hahn. From bmayer at cs.umn.edu Fri Oct 22 08:26:46 2004 From: bmayer at cs.umn.edu (Benjamin W. Mayer) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] parallel sparse linear solver choice? In-Reply-To: <20041022095824.GE19008@aoes.com> References: <1377588425.1098389985998.JavaMail.osg@osgjas02.cns.ufl.edu> <20041022095824.GE19008@aoes.com> Message-ID: I am not sure if this is a good pointer or not. Yousef Saad has a lot of work which may be useful for this application. http://www-users.cs.umn.edu/~saad/software/home.html SPARSKIT A basic tool-kit for sparse matrix computations. pARMS , parallel Algebraic recursive Multilevel Solver. There are also links to related packages on the above page. 
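For a feel of what is inside these packages: Paul's matrix is sparse, square and symmetric, and if it is also positive definite it is the natural target for a conjugate gradient iteration over compressed-sparse-row (CSR) storage. The C sketch below is a minimal serial illustration of that kernel only -- it is not code from SPARSKIT, pARMS or PETSc, and the 3x3 test system, tolerance and iteration cap are made up. The real packages add preconditioning, reorderings and the MPI data distribution.

/* cg_csr.c -- serial conjugate gradient on a CSR matrix (illustration only).
 * Build: cc -O2 cg_csr.c -lm -o cg_csr
 */
#include <math.h>
#include <stdio.h>

#define N     3        /* tiny made-up SPD test system */
#define NNZ   7
#define MAXIT 1000
#define TOL   1e-10

/* CSR storage of A = [4 1 0; 1 3 1; 0 1 2], symmetric positive definite */
static const int    row_ptr[N + 1] = { 0, 2, 5, 7 };
static const int    col_idx[NNZ]   = { 0, 1, 0, 1, 2, 1, 2 };
static const double val[NNZ]       = { 4, 1, 1, 3, 1, 1, 2 };

static void spmv(const double *x, double *y)           /* y = A*x */
{
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            s += val[k] * x[col_idx[k]];
        y[i] = s;
    }
}

static double dot(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += a[i] * b[i];
    return s;
}

int main(void)
{
    double b[N] = { 1, 2, 3 };
    double x[N] = { 0 }, r[N], p[N], Ap[N];

    for (int i = 0; i < N; i++) { r[i] = b[i]; p[i] = r[i]; }   /* x0 = 0 */
    double rs_old = dot(r, r);

    int it = 0;
    while (it < MAXIT && sqrt(rs_old) > TOL) {
        spmv(p, Ap);
        double alpha = rs_old / dot(p, Ap);
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rs_new = dot(r, r);
        for (int i = 0; i < N; i++) p[i] = r[i] + (rs_new / rs_old) * p[i];
        rs_old = rs_new;
        it++;
    }

    printf("iterations %d, residual %g, x = %g %g %g\n",
           it, sqrt(rs_old), x[0], x[1], x[2]);
    return 0;
}

The same spmv / dot-product / vector-update structure is what the parallel solvers distribute across nodes, at the cost of one halo exchange and a couple of allreduces per iteration -- which is where a 100 Mbps interconnect will make itself felt.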
On Fri, 22 Oct 2004, Thomas Clausen wrote: > Hi Paul, > > You might want to have a look at > > http://www-unix.mcs.anl.gov/petsc/petsc-2/index.html > > Thomas > > On Thu, Oct 21, 2004 at 04:19:45PM -0400, JOHNSON,PAUL C wrote: > > All: > > > > I was wondering whats your choice for a parallel sparse linear > > solver? I have a beowulf cluster(~4 nodes, ok really small I > > know) connected at 100Mbps. The computers are P4 2.2GHz with 1Gb > > ram. The matrices are formed by a finite element program. They > > are sparse, square, symmetric, and I would like to solve problems > > with more than 200000 columns. Which of the solvers is easiest to > > set up and utilize? One problem I am trying to solve is 156,240 > > x 156,240 with 6,023,241 non-zero entries. > > > > Thanks for any help, > > Paul > > > > -- > > JOHNSON,PAUL C > > > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > > > -- > > Thomas Clausen, PhD. > AOES Group BV, http://www.aoes.com > Phone +31(0)71 5795563 Fax +31(0)71572 1277 > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From mathog at mendel.bio.caltech.edu Fri Oct 22 09:45:04 2004 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Tyan mobo and /proc/mtrr Message-ID: How do mtrr settings affect performance? Anybody know what /proc/mtrr "should" say on various Tyan mobos? Three types of Tyan systems here, this is what /proc/mtrr has for each: S2468UGN 2.6.8.1 (in MDK10.0) reg00: base=0x00000000 ( 0MB), size= 512MB: write-back, count=1 reg01: base=0xf5000000 (3920MB), size= 1MB: write-combining, count=1 S2466N 2.6.8.1 (in MDK10.0) reg00: base=0x00000000 ( 0MB), size=1024MB: write-back, count=1 reg01: base=0xf5000000 (3920MB), size= 1MB: write-combining, count=1 S2466N 2.4.18-10 (in RH7.3) reg00: base=0x00000000 ( 0MB), size=1024MB: write-back, count=1 The first line describes the total memory in the system. The second line, if present, corresponds to a setting from the ATI RAGE XL graphics card. The ones running the older OS don't even do that. Should there be additional mtrr settings, and if so, why? Note, lspci reports this for the RAGE XL graphics (on all): Memory at f5000000 (32-bit, non-prefetchable) [size=16M] I/O ports at 1000 [size=256] Memory at f4000000 (32-bit, non-prefetchable) [size=4K] Unclear to me why only 1M of the reported 16M is mapped (apparently) by the mtrr. Possibly this relates to these messages: /var/log/messages:Oct 20 12:44:59 safserver kernel: mtrr:\ 0xf5000000,0x400000 overlaps existing 0xf5000000,0x100000 On the other hand, the graphics aren't really used on the S2466N compute nodes. Graphics are used more on the S2468UGN server. Beats me where one would change this value though. Thanks, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From brian.dobbins at yale.edu Fri Oct 22 10:51:29 2004 From: brian.dobbins at yale.edu (Brian Dobbins) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] parallel sparse linear solver choice? 
In-Reply-To: <1377588425.1098389985998.JavaMail.osg@osgjas02.cns.ufl.edu> References: <1377588425.1098389985998.JavaMail.osg@osgjas02.cns.ufl.edu> Message-ID: <1098467489.12272.23.camel@rdx.eng.yale.edu> Hi Paul, If you're looking for packaged solvers, then PETSc (already mentioned) and Aztec could be of interest to you. Aztec is a parallel iterative solver for sparse systems developed by Sandia Labs. (Aztec: http://www.cs.sandia.gov/CRF/aztec1.html ) There is also a direct solver for symmetric positive definite matrices called PSPASES (http://www-users.cs.umn.edu/~mjoshi/pspases/), but I haven't ever looked at that myself, so I can't really tell you much about it. Another direct solution package, originally written for the RS6000 platform but now available for linux (I believe!), is the Watson Sparse Matrix Package (WSMP) (http://www-users.cs.umn.edu/~agupta/wsmp.html), by Anshul Gupta. Not only is the package said to be very good, but Gupta has done a LOT of research on sparse matrices, and it couldn't hurt to read some of his publications. On a different note, if you're not looking for packaged solvers, and just want to know about various methods or want to implement your own -often faster, if you have a known structure in your matrix-, you might want to read up on Yousef Saad's work, as mentioned before. Also, a very useful (and thick!) book is Golub and Van Loan's "Matrix Computations". Finally, if you find you're having difficulty with getting decent preconditioners (if necessary), I'd also suggest taking a look at Michele Benzi's work on sparse approximate inverse preconditioners. (http://www.mathcs.emory.edu/~benzi/) (That last link also has links to other people/places doing research that may be of interest to you.) Hope some of that is of interest to you, - Brian From hahn at physics.mcmaster.ca Fri Oct 22 15:04:12 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Tyan mobo and /proc/mtrr In-Reply-To: Message-ID: > How do mtrr settings affect performance? they, along with per-page attributes, determine how the CPU treats the cachability of an address. > Anybody know what /proc/mtrr "should" say on various Tyan > mobos? why do you think there's a problem? > Three types of Tyan systems here, this is what /proc/mtrr has for each: > > S2468UGN 2.6.8.1 (in MDK10.0) > reg00: base=0x00000000 ( 0MB), size= 512MB: write-back, count=1 > reg01: base=0xf5000000 (3920MB), size= 1MB: write-combining, count=1 > S2466N 2.6.8.1 (in MDK10.0) > reg00: base=0x00000000 ( 0MB), size=1024MB: write-back, count=1 > reg01: base=0xf5000000 (3920MB), size= 1MB: write-combining, count=1 > S2466N 2.4.18-10 (in RH7.3) > reg00: base=0x00000000 ( 0MB), size=1024MB: write-back, count=1 > > The first line describes the total memory in the system. > The second line, if present, corresponds to a setting from > the ATI RAGE XL graphics card. The ones running the older OS > don't even do that. mtrr values are supposed to be set by the bios; the OS certainly can change them, but doesn't have to. I forget for sure, but suspect that in the absence of an mtrr, the mem-mapped video area on the third machine would default to write-through or -combining. of course, all this stuff was terribly novel back in the dark ages of RH 7.x. it wouldn't be shocking if RH7.3 got it wrong... > Should there be additional mtrr settings, and if so, why? 
the main thing is simply to have all real ram in write-back; mtrr's for video can make a difference, but often not as much as you might expect, because the CPU does a certain amount of write coalescing before data even gets to the cache. > Note, lspci reports this for the RAGE XL graphics (on all): > > Memory at f5000000 (32-bit, non-prefetchable) [size=16M] > I/O ports at 1000 [size=256] > Memory at f4000000 (32-bit, non-prefetchable) [size=4K] > > Unclear to me why only 1M of the reported 16M is mapped (apparently) > by the mtrr. Possibly this relates to these messages: > > /var/log/messages:Oct 20 12:44:59 safserver kernel: mtrr:\ > 0xf5000000,0x400000 overlaps existing 0xf5000000,0x100000 IIRC, this is perfectly legal - in fact, the correct way to define a 3MB region is to define a 4MB region and an overlapping 1MB region. > On the other hand, the graphics aren't really used > on the S2466N compute nodes. so ignore them. they can't possibly matter... > Graphics are used more on the > S2468UGN server. Beats me where one would change this value > though. there's an mtrr doc on kernel/Documentation which is all you need, *if* you actually need to change anything. it seems like X started fixing mtrr's several years ago using this interface (/proc/mtrr). I would guess that mtrr's on the framebuffer were less relevant for a number of years (using a video card's hw acceleration obviates the need for the host to directly manipulate the pixels). this may be less true now with some of the newfangled client-side rendering, etc. From mkamranmustafa at gmail.com Fri Oct 22 22:53:11 2004 From: mkamranmustafa at gmail.com (Kamran Mustafa) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Need Help...! Message-ID: Hi, I am working as an IT Manager at NED University of Engineering & Technology, Karachi, Pakistan, and currently managing a Linux based Cluster of 50 nodes. I just wanted to ask you that how to manage licensing issues on a beowulf cluster. Lets say, if you want to run an application software on 50 nodes then will you purchase 50 licenses of that software or if there is any other alternative to handle this licensing issue, because purchasing such a huge number of licences will definitely be very expensive. Actually, I also want to purchase different software for my 50 noded cluster but purchasing 50 licences of each software costs me alot, thats why I am in need of your guidance and kind suggestions. Regards, Muhammad Kamran Mustafa I.T. Manager Centre for Simulation & Modeling, NED University of Engineering & Technology, Karachi, Pakistan. Tel: (9221) 9243261-8 ext 2372 Fax: (9221) 9243248 From rgb at phy.duke.edu Sat Oct 23 09:14:22 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Need Help...! In-Reply-To: References: Message-ID: On Sat, 23 Oct 2004, Kamran Mustafa wrote: > Hi, > > I am working as an IT Manager at NED University of Engineering & > Technology, Karachi, Pakistan, and currently managing a Linux based > Cluster of 50 nodes. I just wanted to ask you that how to manage > licensing issues on a beowulf cluster. Lets say, if you want to run an > application software on 50 nodes then will you purchase 50 licenses of > that software or if there is any other alternative to handle this > licensing issue, because purchasing such a huge number of licences > will definitely be very expensive. 
Actually, I also want to purchase > different software for my 50 noded cluster but purchasing 50 licences > of each software costs me alot, thats why I am in need of your > guidance and kind suggestions. > > Regards, > > Muhammad Kamran Mustafa Dear Kamran, Please give us a bit more detail. In particular, what software are we talking about? Different packages have very different licensing schmea, and one usually has to go with what a package supports. For example, matlab is in use on some clusters on campus here. matlab uses a license manager that can regulate the number of instances of matlab in use on a cluster. Quite a few packages, actually, use a license manager that can regulate the number of packages one has to buy relative to the number of platforms one wishes to run them on, but of course this is a case by case thing. Compilers have a slightly different issue. There there may be floating license managers, but because compiler usage is sporadic many sites just buy a single license and put in on a specific node, e.g. the head node or the server node (which has direct access to the disk and thus avoids a networking hit). The issue there is libraries -- many compilers come with special libraries that are part of how they get good performance. In some cases the libraries can be used on many systems as long as you buy the compiler/library package for one. I don't know the exact state of things now but at one point in time at least you had to by library licenses for every node for at least some compilers out there in order to run the binaries generated by a compiler-licensed node. Finally there is the OS itself -- commercial linux distributions. There the licensing arrangements are whatever you dicker out of the company. Unfortunately, most of the companies about clusters and what consitutes "reasonable" cost scaling in a cluster where 50-500 systems are literally clones of a basic node configuration, and will cheerily charge hundreds of dollars per node as if those nodes generate some sort of incremental cost for "support". I think it is safe to say that "most" cluster sites avoid this cost by using e.g. Centos (logo-free GPL-based rebuild of RHEL), Fedora Core, Debian, Caosity -- one of the still-free linux distributions. As a FC user, I can attest to the fact that it is entirely possible to assemble a stable and highly functional cluster node (or desktop workstation) on top of FC. Admins tend to lean a bit more towards Centos for high availability/mission critical servers in the expectation of a bit more immediate support, but in the case of a cluster server I'd fully expect FC to be adequately stable and provide good performance. So if your issue is OS license management, I'd suggest going toward one of the fully open/free linuces -- those will certainly minimize your per-box outlay, and from what I can tell there is basically no difference whatsoever in ease of installation or maintenance. You can even get your cluster installation prepackaged for you (for free) from e.g. ROCKS or wulfware, which seem to be stabilizing and have active participants that are keeping them nicely current. Hope this helps. If you want better help, please include detail -- the specific packages you're concerned about, the particular setup of your cluster, and what sort of licensing scheme the packages are supposed to use (the vendors should be able to help you out here). rgb > I.T. Manager > Centre for Simulation & Modeling, > NED University of Engineering & Technology, > Karachi, Pakistan. 
> Tel: (9221) 9243261-8 ext 2372 > Fax: (9221) 9243248 > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From atp at piskorski.com Sat Oct 23 13:03:56 2004 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Re: Beowulf of bare motherboards In-Reply-To: <000c01c4a564$c967c430$33a8a8c0@LAPTOP152422> References: <20040927182134.GA23662@piskorski.com> <4158D794.9090704@verizon.net> <20040928040034.GA93760@piskorski.com> <000c01c4a564$c967c430$33a8a8c0@LAPTOP152422> Message-ID: <20041023200355.GA44250@piskorski.com> I recently experimented with running multiple motherboards off a single power supply. This is pretty easy now, because you can buy Y power cables now - no soldering necessary: "ZIPPY Power Cable Splitter: ATX 20 pin to Two ATX 20 pin for ATX Power Supplies", $11 for one; 8 of these cost me $8.49 each shipped: http://www.micer.com/viewItem.asp?idProduct=453056250 I wanted to find out how many nodes I could power from a single supply, and I happened to have 4 different power supplies on hand for testing. In all cases, I plugged the supply into my Kill-a-Watt, attached 3 of the above y-cables to the supply, and simply varied the number of motherboards plugged into those 4 connectors. The 4 nodes in question are all Ebay specials, configured like so: - Motherboard: ECS P4VXMS http://www.ecsusa.com/products/p4vxms.html - 1 Pentium 4 CPU, socket 423, 256 KB cache, 400 FSB; speed GHz: 1.3, 1.4, 1.5, 1.7 - 1 stick RAM, 512 MB PC133 CL3 - 1 AGP graphics card installed, various models. - 1 Panaflo fan (80 mm, 12 V, 0.1 A) blowing on the CPU heat exchanger. - THAT'S IT. (No hard drives, etc.) Below, the reported Watts is simply the approximate maximum W value I saw on the Kill-a-Watt as the nodes booted. The Powe Factor is the lowest and/or most typical PF reported by the Kill-a-Watt: ThermalTake Purepower HPC-420-302 DF, Active PFC, 420 W http://www.newegg.com/app/ViewProductDesc.asp?description=17-153-005 http://www.newegg.com/app/viewProductDesc.asp?description=17-153-005R $53 +$7 from newegg.com 2 nodes, 175 W, PF 0.98, $34.25 per node 3 nodes, would not boot, [$25.67 per node] 4 nodes, would not boot, [$21.38 per node] MGE SuperCharger, 600W http://www.newegg.com/app/viewProductDesc.asp?description=17-167-010 $48 +$7 from newegg.com 2 nodes, 175 W, PF 0.66, $31.75 per node 3 nodes, 255 W, PF 0.67, $24.00 per node 4 nodes, would not boot, [$20.13 per node] Enermax EG301P-VB, 300 W http://www.newegg.com/app/viewProductDesc.asp?description=17-103-423 $31.50 +$7 from newegg.com 2 nodes, 155 W, PF 0.67, $23.50 per node 3 nodes, 226 W, PF 0.68, $18.50 per node 4 nodes, would not boot, [$16.00 per node] Sparkle FSP250-61GT, 250 W Ancient, used to power my old AMD K6-II 380 MHz dektop. 2 nodes, 170 W, PF 0.64 3 nodes, 241 W, PF 0.64 4 nodes, 331 W, PF 0.65 Note that I didn't actually RUN anything on the nodes at all, I just plugged in a monitor and verified that they got through the POST ok and attempted to boot. (They attempt to PXE boot, but I don't yet have anything set up for them to PXE boot FROM.) 
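One more way to read the PF column: real watts divided by power factor gives the apparent power (VA) that the wall circuit or a UPS actually has to source, which is where a supply without PFC hurts. The short C sketch below just does that division for the measurements above; the 120 V line voltage is an assumption.

/* va_load.c -- convert Kill-a-Watt readings (real watts and power factor)
 * into apparent power and line current.  The W/PF pairs are the
 * measurements reported above; the 120 V mains figure is an assumption.
 */
#include <stdio.h>

int main(void)
{
    const struct { const char *psu; double watts, pf; } m[] = {
        { "ThermalTake 420W, 2 nodes", 175, 0.98 },
        { "MGE 600W, 3 nodes",         255, 0.67 },
        { "Enermax 300W, 3 nodes",     226, 0.68 },
        { "Sparkle 250W, 4 nodes",     331, 0.65 },
    };
    const double mains_v = 120.0;   /* assumed line voltage */

    for (int i = 0; i < 4; i++) {
        double va = m[i].watts / m[i].pf;          /* apparent power */
        printf("%-28s %3.0f W  PF %.2f  -> %5.1f VA, %4.1f A at %.0f V\n",
               m[i].psu, m[i].watts, m[i].pf, va, va / mains_v, mains_v);
    }
    return 0;
}

By that arithmetic the non-PFC supplies pull roughly half again as many volt-amps as watts, which matters more for UPS and branch-circuit sizing than for the power bill.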
Newegg used to advertise the MGE 600 W supply above as having active PFC, (which is why I bought it), but nothing on the supply itself says anything about PFC, and the Kill-a-Watt results definitely show that it doesn't have PFC. I find it interesting that the smallest, oldest, and probably cheapest supply is the only one that successfully booted all 4 nodes at once. Perhaps it is running out of spec, and simply lacks the circuitry to shut down in such cases? These motherboards each beep once when they boot, and the beeps seemed to all come very close together with some supplies, and further apart with others. I didn't pay attention to which supplies did this, but this is probably why the Kill-a-Watt seemed to show lower peak Watts for the Enermax supply? Unfortunately I didn't have any el-cheap $12 (plus shipping) supplies to test. Particularly since these nodes are diskless, those might actually work just fine. Newegg is also now selling the slightly larger 480 W ThermalTake active PFC supply for about the same price as the 420 W supply above, which would be worth trying if you really want PFC. -- Andrew Piskorski http://www.piskorski.com/ From hahn at physics.mcmaster.ca Sat Oct 23 13:49:36 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Re: Beowulf of bare motherboards In-Reply-To: <20041023200355.GA44250@piskorski.com> Message-ID: > testing. In all cases, I plugged the supply into my Kill-a-Watt, > attached 3 of the above y-cables to the supply, and simply varied the > number of motherboards plugged into those 4 connectors. very interesting, but somewhat hard to interpret. I love my killawatt, too, but unfortunately for this experiment, you need to know how much current the nodes were drawing on each voltage (and the PS specs.) for instance, the TT PS might actually provide 420, but only enough on 3.3 to support a single CPU's VRM, but *lots* of 12V umph. that wouldn't be unreasonable, given the market for "normal" uses - big PS's support machines with lots of disks, or possibly systems that use extra 12V for hot AGP cards, etc. on the other hand, Sparkle is the only one of these PS vendors that I see in OEM settings - TT/MGE/Enermax all seem to be mostly after-market vendors. interpret that how you will ;) regards, mark hahn. From Glen.Gardner at verizon.net Sat Oct 23 09:20:36 2004 From: Glen.Gardner at verizon.net (Glen Gardner) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Need Help...! References: Message-ID: <417A84D4.8030303@verizon.net> Muhamed; In general, one uses a computer as a license server for all the other machines. This might be on the development node of the cluster, or on a dedicated server. This way you can have all of your licenses served from one machine. As far as commecrcial software being expensive is concerned.... The basic idea behind Beowulf is to use readily available freeware to cut the operating costs. For most Beowulf projects, using payware is a great luxury and most tools are developed from existing freeware resources. However, using commecrially available compilers on an otherwise freeware Beowulf is becoming commonplace (but probably not as big an advantage as some would have you think). 
If you already have the personnel resources it might prove cheaper to develop much of your own software from existing freeware tools than to buy payware all along the way, and augment that with a few carefully chosen commercial compilers and maybe one or two carefully chosen commercial applications programming libraries. This approach is economical and effective, but will require some time from system admins and programmers to make it all go. If you need to have something that is turnkey, prepackaged and ready to go all the way, then perhaps Beowulf is the wrong concept for your needs and you should consider a commercially built cluster or supercomputer with prepackaged software so your people can just login and go straight to work, with little or no development time. Glen Gardner Kamran Mustafa wrote: >Hi, > >I am working as an IT Manager at NED University of Engineering & >Technology, Karachi, Pakistan, and currently managing a Linux based >Cluster of 50 nodes. I just wanted to ask you that how to manage >licensing issues on a beowulf cluster. Lets say, if you want to run an >application software on 50 nodes then will you purchase 50 licenses of >that software or if there is any other alternative to handle this >licensing issue, because purchasing such a huge number of licences >will definitely be very expensive. Actually, I also want to purchase >different software for my 50 noded cluster but purchasing 50 licences >of each software costs me alot, thats why I am in need of your >guidance and kind suggestions. > >Regards, > >Muhammad Kamran Mustafa >I.T. Manager >Centre for Simulation & Modeling, >NED University of Engineering & Technology, >Karachi, Pakistan. >Tel: (9221) 9243261-8 ext 2372 >Fax: (9221) 9243248 >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > -- Glen E. Gardner, Jr. AA8C AMSAT MEMBER 10593 http://members.bellatlantic.net/~vze24qhw/index.html From michael at halligan.org Sat Oct 23 10:45:08 2004 From: michael at halligan.org (Michael T. Halligan) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Need Help...! In-Reply-To: References: Message-ID: <417A98A4.5030602@halligan.org> Kamran, My first approach would be to call up the application manufacturer directly and ask them how their licensing works. Explain your situation, and get their recommendation. After you find out the deal on that piece of software, start talking about bulk pricing. A lot of companies, especially ones with specialized software, are willing to give good bulk discounts to universities. If that fails, find a VAR (Value Added Reseller). Typically a good VAR can get you discounts on anything they sell of 25-45% off of the manufacturer's price. You'll get even better discounts if you can purchase say the 50 pieces of software, as well as some servers, or some other software, or some support licenses through the var. I always try to make my purchases large, and combined. If I know I'm going to need to spend $250k in a quarter on software & hardware, but I won't need a portion of it until the end of the quarter, I'll let my vendor know what I plan on ordering, and how much I want to order initially, and they will almost always offer me extra discounts to do the entire purchase at once. >Hi, > >I am working as an IT Manager at NED University of Engineering & >Technology, Karachi, Pakistan, and currently managing a Linux based >Cluster of 50 nodes. 
I just wanted to ask you that how to manage >licensing issues on a beowulf cluster. Lets say, if you want to run an >application software on 50 nodes then will you purchase 50 licenses of >that software or if there is any other alternative to handle this >licensing issue, because purchasing such a huge number of licences >will definitely be very expensive. Actually, I also want to purchase >different software for my 50 noded cluster but purchasing 50 licences >of each software costs me alot, thats why I am in need of your >guidance and kind suggestions. > >Regards, > >Muhammad Kamran Mustafa >I.T. Manager >Centre for Simulation & Modeling, >NED University of Engineering & Technology, >Karachi, Pakistan. >Tel: (9221) 9243261-8 ext 2372 >Fax: (9221) 9243248 >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > -- ------------------- BitPusher, LLC http://www.bitpusher.com/ 1.888.9PUSHER (415) 724.7998 - Mobile From reuti at staff.uni-marburg.de Sat Oct 23 13:41:52 2004 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Need Help...! In-Reply-To: References: Message-ID: <1098564112.417ac2102bb5b@home.staff.uni-marburg.de> Hi, > Please give us a bit more detail. In particular, what software are we > talking about? Different packages have very different licensing schmea, > and one usually has to go with what a package supports. For example, > matlab is in use on some clusters on campus here. matlab uses a > license manager that can regulate the number of instances of matlab in > use on a cluster. Quite a few packages, actually, use a license manager > that can regulate the number of packages one has to buy relative to the > number of platforms one wishes to run them on, but of course this is a > case by case thing. also when there is no license manager included, you have to stay in the range of the bought licenses with some counter in the queuing system you are using (with some of them e.g. SGE you can also control the interactive usage). Some software companies also have different license conditions for commercial usage (pay per machine or sometimes pay per CPU in the machine) or academical usage (pay per platform). Depending on the price, it may be cheaper to buy a site license in some cases (although you will use it in your cluster only). As pointed out, this you have to check for each software you intend to use in detail. > Compilers have a slightly different issue. There there may be floating > license managers, but because compiler usage is sporadic many sites just > buy a single license and put in on a specific node, e.g. the head node > or the server node (which has direct access to the disk and thus avoids Agreed. > a networking hit). The issue there is libraries -- many compilers come > with special libraries that are part of how they get good performance. > In some cases the libraries can be used on many systems as long as you > buy the compiler/library package for one. I don't know the exact state > of things now but at one point in time at least you had to by library > licenses for every node for at least some compilers out there in order > to run the binaries generated by a compiler-licensed node. E.g. the Portland license allows you also to sell the compiled program and distribute some .so files without any extra fee. For the Intel ones, you may in addition distribute the .a files. 
In each case there is a detailed list, what library files are valid for it. So it should be save to use them (the libraries) on all nodes in a cluster also. > Unfortunately, most of the companies about clusters and what consitutes > "reasonable" cost scaling in a cluster where 50-500 systems are > literally clones of a basic node configuration, and will cheerily charge > hundreds of dollars per node as if those nodes generate some sort of > incremental cost for "support". I think it is safe to say that "most" > cluster sites avoid this cost by using e.g. Centos (logo-free GPL-based > rebuild of RHEL), Fedora Core, Debian, Caosity -- one of the still-free What about SuSE? You can download some floppies from their server and install it over net. And if you want: you can buy support. Cheers - Reuti From taylor65 at cox.net Sat Oct 23 10:52:15 2004 From: taylor65 at cox.net (Ryan Taylor) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Question about v9fs_wire Message-ID: <000e01c4b929$088a8530$1702a8c0@secondneverman> Getting a kmod: failed to exec /sbin/mdprobe ..... v9fs_wire errorno=2, "Unable to handle kernel NULL pointer..." Couldn't find much info on v9fs_wire, can someone help. Using RedHat Linux 9 with ClusterMatic 4. Booting the node (just one test node for now) straight from the CD. Anyone have any suggestions for me. Thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041023/9357eb6c/attachment.html From tmattox at gmail.com Sat Oct 23 09:26:19 2004 From: tmattox at gmail.com (Tim Mattox) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Need Help...! In-Reply-To: References: Message-ID: Hello Kamran Mustafa, Our research group tends to avoid software with "per node" licensing fees. Depending on the kind of software, you should have free/open source alternatives that require no licensing fees at all. If you could list the specific software you are worried about, maybe we (the beowulf list) can suggest free (or single cost) alternatives. For operating systems, I would suggest you look at http://caosity.org/ which I am involved with. Similarly, for a cluster management software, check out http://warewulf-cluster.org/ which I am also involved with. There are plenty more alternatives to those two, but they are a good start looking for the base stuff. As for application level software and/or compilers, I'll leave that to other bewoulfers to comment on. Our group tends to work with application developers, so they have their own codes they compile. On Sat, 23 Oct 2004 10:53:11 +0500, Kamran Mustafa wrote: > Hi, > > I am working as an IT Manager at NED University of Engineering & > Technology, Karachi, Pakistan, and currently managing a Linux based > Cluster of 50 nodes. I just wanted to ask you that how to manage > licensing issues on a beowulf cluster. Lets say, if you want to run an > application software on 50 nodes then will you purchase 50 licenses of > that software or if there is any other alternative to handle this > licensing issue, because purchasing such a huge number of licences > will definitely be very expensive. Actually, I also want to purchase > different software for my 50 noded cluster but purchasing 50 licences > of each software costs me alot, thats why I am in need of your > guidance and kind suggestions. > > Regards, > > Muhammad Kamran Mustafa > I.T. Manager > Centre for Simulation & Modeling, > NED University of Engineering & Technology, > Karachi, Pakistan. 
> Tel: (9221) 9243261-8 ext 2372 > Fax: (9221) 9243248 -- Tim Mattox - tmattox@gmail.com - http://homepage.mac.com/tmattox/ From atp at piskorski.com Sat Oct 23 20:52:24 2004 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Re: Beowulf of bare motherboards In-Reply-To: <20041023200355.GA44250@piskorski.com> References: <20040927182134.GA23662@piskorski.com> <4158D794.9090704@verizon.net> <20040928040034.GA93760@piskorski.com> <000c01c4a564$c967c430$33a8a8c0@LAPTOP152422> <20041023200355.GA44250@piskorski.com> Message-ID: <20041024035224.GA44285@piskorski.com> At Mark Hahn's suggestion, I checked the rated amperage on the +3.3 volt line is for each supply. That didn't seem to correlate with anything though, so I've recorded all the amps here: Rated Amps for each line: Volts: +3.3, +5, +12, -5, -12, +5 Sb Nodes, PSU ---- -- --- --- --- ----- 2, TTake: 30, 40, 18, 0.3, 0.8, 2.0 3, MGE: 20, 45, 24, 0.6, 0.6, 2.0 3, Enermax: 28, 30, 22, 1.0, 1 , 2.2 4, Sparkle: 14, 25, 8, 0.8, 0.8, 0.8 The ratings on the -5 V line seem to line up pretty closely with my "how many motherboards can this supply power up" metric, but, the 20 pin ATX power connector on the motherboard doesn't even HAVE a -5 V line, right? Any thoughts on what the driving factor here could be? ThermalTake Purepower HPC-420-302 DF, Active PFC, 420 W +3.3 V: 30 A http://www.newegg.com/app/ViewProductDesc.asp?description=17-153-005 http://www.newegg.com/app/viewProductDesc.asp?description=17-153-005R $53 +$7 from newegg.com 2 nodes, 175 W, PF 0.98, $34.25 per node 3 nodes, would not boot, [$25.67 per node] 4 nodes, would not boot, [$21.38 per node] MGE SuperCharger, 600W +3.3 V: 20 A http://www.newegg.com/app/viewProductDesc.asp?description=17-167-010 $48 +$7 from newegg.com 2 nodes, 175 W, PF 0.66, $31.75 per node 3 nodes, 255 W, PF 0.67, $24.00 per node 4 nodes, would not boot, [$20.13 per node] Enermax EG301P-VB, 300 W +3.3 V: 28 A http://www.newegg.com/app/viewProductDesc.asp?description=17-103-423 $31.50 +$7 from newegg.com 2 nodes, 155 W, PF 0.67, $23.50 per node 3 nodes, 226 W, PF 0.68, $18.50 per node 4 nodes, would not boot, [$16.00 per node] Sparkle FSP250-61GT, 250 W +3.3 V: 14 A Ancient, used to power my old AMD K6-II 380 MHz dektop. 2 nodes, 170 W, PF 0.64 3 nodes, 241 W, PF 0.64 4 nodes, 331 W, PF 0.65 -- Andrew Piskorski http://www.piskorski.com/ From Glen.Gardner at verizon.net Sat Oct 23 22:37:22 2004 From: Glen.Gardner at verizon.net (Glen Gardner) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Re: Beowulf of bare motherboards References: <20040927182134.GA23662@piskorski.com> <4158D794.9090704@verizon.net> <20040928040034.GA93760@piskorski.com> <000c01c4a564$c967c430$33a8a8c0@LAPTOP152422> <20041023200355.GA44250@piskorski.com> <20041024035224.GA44285@piskorski.com> Message-ID: <417B3F92.5060104@verizon.net> You might try turning on one node at a time if you can. You ought to try to run two nodes at full throttle on the same psu. I suspect you will run into more problems. Install an OS on each computer and run something like a heapsort benchmark or linpack on both at the same time and see if you get a crash. Andrew Piskorski wrote: >At Mark Hahn's suggestion, I checked the rated amperage on the +3.3 >volt line is for each supply. 
That didn't seem to correlate with >anything though, so I've recorded all the amps here: > > Rated Amps for each line: > Volts: +3.3, +5, +12, -5, -12, +5 Sb >Nodes, PSU ---- -- --- --- --- ----- >2, TTake: 30, 40, 18, 0.3, 0.8, 2.0 >3, MGE: 20, 45, 24, 0.6, 0.6, 2.0 >3, Enermax: 28, 30, 22, 1.0, 1 , 2.2 >4, Sparkle: 14, 25, 8, 0.8, 0.8, 0.8 > >The ratings on the -5 V line seem to line up pretty closely with my >"how many motherboards can this supply power up" metric, but, the 20 >pin ATX power connector on the motherboard doesn't even HAVE a -5 V >line, right? Any thoughts on what the driving factor here could be? > >ThermalTake Purepower HPC-420-302 DF, Active PFC, 420 W >+3.3 V: 30 A > http://www.newegg.com/app/ViewProductDesc.asp?description=17-153-005 > http://www.newegg.com/app/viewProductDesc.asp?description=17-153-005R > $53 +$7 from newegg.com >2 nodes, 175 W, PF 0.98, $34.25 per node >3 nodes, would not boot, [$25.67 per node] >4 nodes, would not boot, [$21.38 per node] > >MGE SuperCharger, 600W >+3.3 V: 20 A > http://www.newegg.com/app/viewProductDesc.asp?description=17-167-010 > $48 +$7 from newegg.com >2 nodes, 175 W, PF 0.66, $31.75 per node >3 nodes, 255 W, PF 0.67, $24.00 per node >4 nodes, would not boot, [$20.13 per node] > >Enermax EG301P-VB, 300 W >+3.3 V: 28 A > http://www.newegg.com/app/viewProductDesc.asp?description=17-103-423 > $31.50 +$7 from newegg.com >2 nodes, 155 W, PF 0.67, $23.50 per node >3 nodes, 226 W, PF 0.68, $18.50 per node >4 nodes, would not boot, [$16.00 per node] > >Sparkle FSP250-61GT, 250 W >+3.3 V: 14 A > Ancient, used to power my old AMD K6-II 380 MHz dektop. >2 nodes, 170 W, PF 0.64 >3 nodes, 241 W, PF 0.64 >4 nodes, 331 W, PF 0.65 > > > -- Glen E. Gardner, Jr. AA8C AMSAT MEMBER 10593 Glen.Gardner@verizon.net http://members.bellatlantic.net/~vze24qhw/index.html From hanzl at noel.feld.cvut.cz Sun Oct 24 01:35:49 2004 From: hanzl at noel.feld.cvut.cz (hanzl@noel.feld.cvut.cz) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Question about v9fs_wire In-Reply-To: <000e01c4b929$088a8530$1702a8c0@secondneverman> References: <000e01c4b929$088a8530$1702a8c0@secondneverman> Message-ID: <20041024103549H.hanzl@unknown-domain> > Getting a kmod: failed to exec /sbin/mdprobe ..... v9fs_wire > errorno=2, "Unable to handle kernel NULL pointer..." > Couldn't find much info on v9fs_wire, can someone help. > Using RedHat Linux 9 with ClusterMatic 4. Booting the node (just one > test node for now) straight from the CD. I guess you could just disable the corresponding kernel module in the config file. My knowledge is not quite up-to-date but I know they (Clustermatic team, Ron Minnich in particular) did interesting experiments with Plan-9-like filesystem which could export filesystems and create private namespaces on per-user basis. Then they did not pay too much attention to this for some time; I do not know the current status. You might get more details on the bproc list: http://lists.sourceforge.net/lists/listinfo/bproc-users HTH Vaclav Hanzl From cflau at clc.cuhk.edu.hk Mon Oct 25 03:58:00 2004 From: cflau at clc.cuhk.edu.hk (John Lau) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Can we set MPICH to use ssh instead of rsh at runtime? Message-ID: <1098701880.2791.287.camel@nuts.clc.cuhk.edu.hk> Hi, Can we set MPICH to use ssh instead of rsh at runtime? I know it can be set in compile time by configure opinion. And LAM can do that by setting $LAMRSH environment variable. 
Best regards, John Lau From edemir_at_andrew at yahoo.com Sun Oct 24 21:31:54 2004 From: edemir_at_andrew at yahoo.com (Ergin Demir) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Re: Beowulf of bare motherboards In-Reply-To: <20041023200355.GA44250@piskorski.com> Message-ID: <20041025043154.58715.qmail@web53705.mail.yahoo.com> How do you boot or shut down individual mobos? I think in this configuration all mobos will boot up or shut down simultaneously. Andrew Piskorski wrote: I recently experimented with running multiple motherboards off a single power supply. This is pretty easy now, because you can buy Y power cables now - no soldering necessary: __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041024/bf7d81e6/attachment.html From mkamranmustafa at gmail.com Sun Oct 24 23:04:52 2004 From: mkamranmustafa at gmail.com (Kamran Mustafa) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Need Help...! In-Reply-To: <1098564112.417ac2102bb5b@home.staff.uni-marburg.de> References: <1098564112.417ac2102bb5b@home.staff.uni-marburg.de> Message-ID: Hi, Thanks alot for the prompt reply. Right at the moment I am asked to purchased the following for my cluster: 1) MPI/Pro by verari systems 2) PGI CDK Cluster Development Kit by Portland Group Purchasing 100 processes of MPI/Pro is really very expensive for me. Similarly, for my 100 processors I have to purchase 256 licences of PGI CDK because they offer licences in groups of 16/64/256 CPUs. Even if I purchase 256 licences for just 2 simultaneous counts, it costs me a lot... Kindly help me in this issue as soon as possible. I will be thankful to you. Regards, Muhammad Kamran Mustafa I.T. Manager Centre for Simulation & Modeling, NED University of Engineering & Technology, Karachi, Pakistan. Tel: (9221) 9243261-8 ext 2372 Fax: (9221) 9243248 ------------------------------------------------------------------------------------------------------------------------ On Sat, 23 Oct 2004 22:41:52 +0200, Reuti wrote: > Hi, > > > > > Please give us a bit more detail. In particular, what software are we > > talking about? Different packages have very different licensing schmea, > > and one usually has to go with what a package supports. For example, > > matlab is in use on some clusters on campus here. matlab uses a > > license manager that can regulate the number of instances of matlab in > > use on a cluster. Quite a few packages, actually, use a license manager > > that can regulate the number of packages one has to buy relative to the > > number of platforms one wishes to run them on, but of course this is a > > case by case thing. > > also when there is no license manager included, you have to stay in the range > of the bought licenses with some counter in the queuing system you are using > (with some of them e.g. SGE you can also control the interactive usage). > > Some software companies also have different license conditions for commercial > usage (pay per machine or sometimes pay per CPU in the machine) or academical > usage (pay per platform). Depending on the price, it may be cheaper to buy a > site license in some cases (although you will use it in your cluster only). As > pointed out, this you have to check for each software you intend to use in > detail. > > > Compilers have a slightly different issue. 
There there may be floating > > license managers, but because compiler usage is sporadic many sites just > > buy a single license and put in on a specific node, e.g. the head node > > or the server node (which has direct access to the disk and thus avoids > > Agreed. > > > a networking hit). The issue there is libraries -- many compilers come > > with special libraries that are part of how they get good performance. > > In some cases the libraries can be used on many systems as long as you > > buy the compiler/library package for one. I don't know the exact state > > of things now but at one point in time at least you had to by library > > licenses for every node for at least some compilers out there in order > > to run the binaries generated by a compiler-licensed node. > > E.g. the Portland license allows you also to sell the compiled program and > distribute some .so files without any extra fee. For the Intel ones, you may in > addition distribute the .a files. In each case there is a detailed list, what > library files are valid for it. So it should be save to use them (the > libraries) on all nodes in a cluster also. > > > Unfortunately, most of the companies about clusters and what consitutes > > "reasonable" cost scaling in a cluster where 50-500 systems are > > literally clones of a basic node configuration, and will cheerily charge > > hundreds of dollars per node as if those nodes generate some sort of > > incremental cost for "support". I think it is safe to say that "most" > > cluster sites avoid this cost by using e.g. Centos (logo-free GPL-based > > rebuild of RHEL), Fedora Core, Debian, Caosity -- one of the still-free > > What about SuSE? You can download some floppies from their server and install > it over net. And if you want: you can buy support. > > Cheers - Reuti > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From reuti at staff.uni-marburg.de Mon Oct 25 01:21:38 2004 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Need Help...! In-Reply-To: References: <1098564112.417ac2102bb5b@home.staff.uni-marburg.de> Message-ID: <1098692498.417cb792cc1d9@home.staff.uni-marburg.de> Hi again, > Thanks alot for the prompt reply. Right at the moment I am asked to > purchased the following for my cluster: > > 1) MPI/Pro by verari systems > 2) PGI CDK Cluster Development Kit by Portland Group > Purchasing 100 processes of MPI/Pro is really very expensive for me. > Similarly, for my 100 processors I have to purchase 256 licences of > PGI CDK because they offer licences in groups of 16/64/256 CPUs. Even > if I purchase 256 licences for just 2 simultaneous counts, it costs me > a lot... is there a direct use of the cluster features of the PGI CDK? According to their websites it's just a combination of the compilers, MPICH and a part of OpenPBS. And OpenMP is also included in standard package of their compilers. Is your main application to use the compilers for software development or to use the compiled programs? So you could just buy one license of the normal compiler, download MPICH at http://www-unix.mcs.anl.gov/mpi/mpich and choose a queuing system (maybe OpenPBS or better) SGE from SUN at http://gridengine.sunsource.net , it's also free. In contrast to OpenPBS, you can kill the slave processes on the nodes nicely with SGE and it runs really stable. 
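A rough sketch of that free combination (the install prefix, the SGE parallel environment name "mpich" and the job script name are only examples here; the configure options for picking the PGI compilers are described in the MPICH install notes):

# build MPICH once on the head node, against the single licensed compiler
# (prefix and version number are placeholders)
./configure --prefix=/opt/mpich-1.2.6
make && make install

# make the ch_p4 device start remote processes with ssh instead of rsh
export P4_RSHCOMMAND=ssh

# quick interactive test with a hand-written machine file and the cpi example
/opt/mpich-1.2.6/bin/mpirun -np 4 -machinefile ./machines ./cpi

# later, submit through SGE with a parallel environment set up for MPICH
qsub -pe mpich 8 myjob.sh
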
Just test the performance with this combination, and you can add MPI/Pro later, after you tested the performance gain with a demo of it. Because the performance will also depend on the used network. I don't know the prices of MPI/pro, but maybe a second dedicaded network only for MPI communication is also an option and speed up the things (or Myrinet, Infiniband - this depends on the amount of MPI traffic your applications will generate). Best greetings - Reuti From lusk at mcs.anl.gov Mon Oct 25 08:35:48 2004 From: lusk at mcs.anl.gov (Rusty Lusk) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Can we set MPICH to use ssh instead of rsh at runtime? In-Reply-To: <1098701880.2791.287.camel@nuts.clc.cuhk.edu.hk> References: <1098701880.2791.287.camel@nuts.clc.cuhk.edu.hk> Message-ID: <20041025.103548.29035877.lusk@localhost> > Hi, > > Can we set MPICH to use ssh instead of rsh at runtime? I know it can be > set in compile time by configure opinion. And LAM can do that by setting > $LAMRSH environment variable. > No, there is not a way to do that. I would recommend the use of MPICH2 (http://www.mcs.anl.gov/mpi/mpich2), which doesn't use rsh or ssh to start jobs, but rather a set of daemons. The daemons can be started any way you like, including either rsh or ssh. Regards, Rusty Lusk From reuti at staff.uni-marburg.de Mon Oct 25 09:08:35 2004 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Can we set MPICH to use ssh instead of rsh at runtime? In-Reply-To: <1098701880.2791.287.camel@nuts.clc.cuhk.edu.hk> References: <1098701880.2791.287.camel@nuts.clc.cuhk.edu.hk> Message-ID: <417D2503.3060805@staff.uni-marburg.de> > Can we set MPICH to use ssh instead of rsh at runtime? I know it can be > set in compile time by configure opinion. And LAM can do that by setting > $LAMRSH environment variable. export P4_RSHCOMMAND=ssh or the appropiate path direct to your binary. - Reuti From lusk at mcs.anl.gov Mon Oct 25 10:50:39 2004 From: lusk at mcs.anl.gov (Rusty Lusk) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Can we set MPICH to use ssh instead of rsh at runtime? In-Reply-To: <417D2503.3060805@staff.uni-marburg.de> References: <1098701880.2791.287.camel@nuts.clc.cuhk.edu.hk> <417D2503.3060805@staff.uni-marburg.de> Message-ID: <20041025.125039.103756653.lusk@localhost> From: Reuti Subject: Re: [Beowulf] Can we set MPICH to use ssh instead of rsh at runtime? Date: Mon, 25 Oct 2004 18:08:35 +0200 > > Can we set MPICH to use ssh instead of rsh at runtime? I know it can be > > set in compile time by configure opinion. And LAM can do that by setting > > $LAMRSH environment variable. > > export P4_RSHCOMMAND=ssh > > or the appropiate path direct to your binary. - Reuti My face is red. I forgot you could do it in MPICH1 with an environment variable. -Rusty From john.hearns at clustervision.com Mon Oct 25 08:59:39 2004 From: john.hearns at clustervision.com (John Hearns) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] New OReilly book on clusters Message-ID: <1098719978.17215.144.camel@vigor12> A friend pointed me towards the new OReilly book on clusters. http://www.oreilly.com/catalog/highperlinuxc/ High Performance Linux Clusters with OSCAR, Rocks, OpenMosix, and MPI by Joseph D Sloan As I'm a sucker for OReilly books, no doubt I'll be adding this one to my menagerie. From john.hearns at clustervision.com Mon Oct 25 12:00:49 2004 From: john.hearns at clustervision.com (John Hearns) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Need Help...! 
In-Reply-To: <1098692498.417cb792cc1d9@home.staff.uni-marburg.de> References: <1098564112.417ac2102bb5b@home.staff.uni-marburg.de> <1098692498.417cb792cc1d9@home.staff.uni-marburg.de> Message-ID: <1098730849.17215.166.camel@vigor12> On Mon, 2004-10-25 at 09:21, Reuti wrote: > Hi again, > > > Thanks alot for the prompt reply. Right at the moment I am asked to > > purchased the following for my cluster: > > > > 1) MPI/Pro by verari systems > > 2) PGI CDK Cluster Development Kit by Portland Group > So you could just buy one license of the normal compiler, download MPICH at > http://www-unix.mcs.anl.gov/mpi/mpich and choose a queuing system (maybe > OpenPBS or better) SGE from SUN at http://gridengine.sunsource.net , it's also > free. In contrast to OpenPBS, you can kill the slave processes on the nodes > nicely with SGE and it runs really stable. I agree with what Reuti says. Witht he caveat that Portland make excellent products (as I'm sure everyone on this list agrees). In addition, I would say that you should start by looking at online resources such as www.clusterworld.com www.beowulf.org http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book.php Once you are a little bit familiar with these, you should try to find a friendly company which provides turn-key Linux clusters. They can help you with the choice of toolkits, compilers. monitoring applications etc. And if no such company exists in Karachi... well then there's an opportunity for somebody. From cflau at clc.cuhk.edu.hk Mon Oct 25 19:21:01 2004 From: cflau at clc.cuhk.edu.hk (John Lau) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Can we set MPICH to use ssh instead of rsh at runtime? In-Reply-To: <20041025.125039.103756653.lusk@localhost> References: <1098701880.2791.287.camel@nuts.clc.cuhk.edu.hk> <417D2503.3060805@staff.uni-marburg.de> <20041025.125039.103756653.lusk@localhost> Message-ID: <1098757261.2791.307.camel@nuts.clc.cuhk.edu.hk> Hi, It works! Thank you very much. John Lau 2004-10-26 01:50, Rusty Lusk¡G > From: Reuti > Subject: Re: [Beowulf] Can we set MPICH to use ssh instead of rsh at runtime? > Date: Mon, 25 Oct 2004 18:08:35 +0200 > > > > Can we set MPICH to use ssh instead of rsh at runtime? I know it can be > > > set in compile time by configure opinion. And LAM can do that by setting > > > $LAMRSH environment variable. > > > > export P4_RSHCOMMAND=ssh > > > > or the appropiate path direct to your binary. - Reuti > > My face is red. I forgot you could do it in MPICH1 with an environment > variable. > > -Rusty From icub3d at gmail.com Tue Oct 26 12:08:00 2004 From: icub3d at gmail.com (Joshua Marsh) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] High Performance for Large Database Message-ID: <38242de90410261208b9ae5f2@mail.gmail.com> Hi all, I'm currently working on a project that will require fast access to data stored in a postgreSQL database server. I've been told that a Beowulf cluster may help increase performance. Since I'm not very familar with Beowulf clusters, I was hoping that you might have some advice or information on whether a cluster would increase performance for a PostgreSQL database. The major tables accessed are around 150-200 million records. On a stand alone server, it can take several minutes to perform a simple select query. It seems like once we start pricing for servers with 16+ processors and 64+ GB of RAM, the prices sky rocket. If I can acheive high performance with a cluster, using 15-20 dual processor machines, that would be great. Thanks for any help you may have! 
-Josh From reuti at staff.uni-marburg.de Tue Oct 26 14:03:30 2004 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: <38242de90410261208b9ae5f2@mail.gmail.com> References: <38242de90410261208b9ae5f2@mail.gmail.com> Message-ID: <1098824610.417ebba20d313@home.staff.uni-marburg.de> Hi, > I'm currently working on a project that will require fast access to > data stored in a postgreSQL database server. ?I've been told that a > Beowulf cluster may help increase performance. ?Since I'm not very > familar with Beowulf clusters, I was hoping that you might have some > advice or information on whether a cluster would increase performance > for a PostgreSQL database. ?The major tables accessed are around > 150-200 million records. ?On a stand alone server, it can take several > minutes to perform a simple select query. > > It seems like once we start pricing for servers with 16+ processors > and 64+ GB of RAM, the prices sky rocket. ?If I can acheive high > performance with a cluster, using 15-20 dual processor machines, that > would be great. what is your configuration now and what disks are you using at this time? Any RAID array with SCSI? - Reuti From rgb at phy.duke.edu Tue Oct 26 14:21:57 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: <38242de90410261208b9ae5f2@mail.gmail.com> References: <38242de90410261208b9ae5f2@mail.gmail.com> Message-ID: On Tue, 26 Oct 2004, Joshua Marsh wrote: > Hi all, > > I'm currently working on a project that will require fast access to > data stored in a postgreSQL database server. I've been told that a > Beowulf cluster may help increase performance. Since I'm not very > familar with Beowulf clusters, I was hoping that you might have some > advice or information on whether a cluster would increase performance > for a PostgreSQL database. The major tables accessed are around > 150-200 million records. On a stand alone server, it can take several > minutes to perform a simple select query. > > It seems like once we start pricing for servers with 16+ processors > and 64+ GB of RAM, the prices sky rocket. If I can acheive high > performance with a cluster, using 15-20 dual processor machines, that > would be great. This sort of cluster isn't a "beowulf" cluster; rather it is a variant of a high availability cluster. It's Extreme Linux, just not beowulf. The beowulf design (and focus of this list) is "high performance computing" clusters, aka supercomputing clusters. With that said, there may be some resources out there that can help you, and listening in on this list and learning how HPC clusters work will certainly help you with other kinds, as the issues are in many cases similar. The first/best place to look is the September issue of Cluster World Magazine (www.clusterworld.com/issues.html). Its cover focus is on "Database Clusters". My copy is at Duke (and I'm at home:-) so although I'm pretty sure it covers mysql used in a cluster environment I cannot recall if it discusses alternatives such as oracle or postgres. Other CWM issues will also be pertinent, regardless. One major issue associated with any kind of file access is assembling a large, shared file store that avoids the file and communications bottlenecks that are as much an issue in HPC as they are in HA. 
A series of articles just begun by Jeff Layton deals with SAN's and massive scalable storage in general -- he's only done a couple of articles so far, so if there are still September/October issues around you'd be in great shape. CWM also abounds with ads for large and scalable and blindingly fast storage solutions. We just had an extensive discussion on this very list on storage (I kicked it off as we have a big proposal out that had a very large storage component and I needed to learn -- fast!). The recent list archives should show you the thread. Finally, there are some companies out there that make their bread and butter by assembling custom clusters to accomplish very specific tasks at a cost (as you note) far less than the cost of a big multiprocessor machine even though they make a healthy (and well earned) profit on the deal. Some of them have employees or owners on this list -- if any of them can help you I expect they'll talk to you offline. That's about all the help I personally can offer; I haven't built a large database cluster and only have listened halfheartedly when they were discussed on list in the past (although there have been previous discussions you can also google for in the list archives, I think). The problem is a fairly complex one -- not just various file latency and bandwidth issues (these are likely the "easy part") but the issue of sharing the underlying DB brings up locking. It is one thing to provide lots of nodes read-only access to a DB on a SAN engineered for fast, cached, read-only access; it is another to provide all the nodes with read AND write access, as writing requires a lock, and a lock effectively serializes access. This (and related problems) are serious issues with speeding up databases through parallelism. I vaguely recall that big companies like Oracle have dumped pretty serious money into this kind of thing looking for solutions that scale well. Maybe somebody else on list knows more than I do, though, and maybe they'll tell all of us! rgb > > Thanks for any help you may have! > > -Josh > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From laurenceliew at yahoo.com.sg Tue Oct 26 18:29:58 2004 From: laurenceliew at yahoo.com.sg (Laurence Liew) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: <38242de90410261208b9ae5f2@mail.gmail.com> References: <38242de90410261208b9ae5f2@mail.gmail.com> Message-ID: <417EFA16.8020503@yahoo.com.sg> Hi, You may wish to search thru the beowulf list or google for "beowulf and databases and postgresql"... there were a couple of threads on exeactly this issue. Very briefly 1. Beowulf clusters CANNOT help make Postgresql or any databases run faster. You need the database code to be modified to do that (think Oracle 10g). I met a company at Supercomputer 03 last year that had Mysql running on a cluster... you may wish to query for them. 2. You could try to sponsor the development of a parallel postrgresql - talk to the postgresql development team... when I broached the idea in 1998.. there was some interest.. unfortunately.. I could not afford the development/sponsorship costs then. 3. 
Try running Postgresql on a cluster filesystem like PVFS - it is not gauranteed as it probably fails the ACID test for a SQL compliant database. The basic idea is that if we cannot parallelise the database - we make the underlying IO parallel and hence boost the IO performance of the system.. and any applications that run on them.. and this includes Postgresql. Hope this helps. Cheers! Laurence Scalable Systems Singapore Joshua Marsh wrote: > Hi all, > > I'm currently working on a project that will require fast access to > data stored in a postgreSQL database server. I've been told that a > Beowulf cluster may help increase performance. Since I'm not very > familar with Beowulf clusters, I was hoping that you might have some > advice or information on whether a cluster would increase performance > for a PostgreSQL database. The major tables accessed are around > 150-200 million records. On a stand alone server, it can take several > minutes to perform a simple select query. > > It seems like once we start pricing for servers with 16+ processors > and 64+ GB of RAM, the prices sky rocket. If I can acheive high > performance with a cluster, using 15-20 dual processor machines, that > would be great. > > Thanks for any help you may have! > > -Josh > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- A non-text attachment was scrubbed... Name: laurenceliew.vcf Type: text/x-vcard Size: 150 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041027/bb931a9a/laurenceliew.vcf From kmurphy at dolphinics.com Wed Oct 27 03:39:41 2004 From: kmurphy at dolphinics.com (Keith Murphy) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] High Performance for Large Database References: <38242de90410261208b9ae5f2@mail.gmail.com> <417EFA16.8020503@yahoo.com.sg> Message-ID: <037701c4bc11$455857e0$6901a8c0@dolphinics.no> Check out this url http://www.linuxlabs.com/clusgres.html they look like they have a solution for scaleable Postgres Kindest Regards Keith Murphy Dolphin Interconnect 818-292-5100 kmurphy@dolphinics.com www.dolphinics.com ----- Original Message ----- From: "Laurence Liew" To: "Joshua Marsh" Cc: Sent: Wednesday, October 27, 2004 3:29 AM Subject: Re: [Beowulf] High Performance for Large Database > Hi, > > You may wish to search thru the beowulf list or google for "beowulf and > databases and postgresql"... there were a couple of threads on exeactly > this issue. > > Very briefly > > 1. Beowulf clusters CANNOT help make Postgresql or any databases run > faster. You need the database code to be modified to do that (think > Oracle 10g). I met a company at Supercomputer 03 last year that had > Mysql running on a cluster... you may wish to query for them. > > 2. You could try to sponsor the development of a parallel postrgresql - > talk to the postgresql development team... when I broached the idea in > 1998.. there was some interest.. unfortunately.. I could not afford the > development/sponsorship costs then. > > 3. Try running Postgresql on a cluster filesystem like PVFS - it is not > gauranteed as it probably fails the ACID test for a SQL compliant > database. The basic idea is that if we cannot parallelise the database - > we make the underlying IO parallel and hence boost the IO performance of > the system.. and any applications that run on them.. 
and this includes > Postgresql. > > Hope this helps. > > Cheers! > Laurence > Scalable Systems > Singapore > > > > Joshua Marsh wrote: > > Hi all, > > > > I'm currently working on a project that will require fast access to > > data stored in a postgreSQL database server. I've been told that a > > Beowulf cluster may help increase performance. Since I'm not very > > familar with Beowulf clusters, I was hoping that you might have some > > advice or information on whether a cluster would increase performance > > for a PostgreSQL database. The major tables accessed are around > > 150-200 million records. On a stand alone server, it can take several > > minutes to perform a simple select query. > > > > It seems like once we start pricing for servers with 16+ processors > > and 64+ GB of RAM, the prices sky rocket. If I can acheive high > > performance with a cluster, using 15-20 dual processor machines, that > > would be great. > > > > Thanks for any help you may have! > > > > -Josh > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > ---------------------------------------------------------------------------- ---- > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From hanzl at noel.feld.cvut.cz Wed Oct 27 02:42:15 2004 From: hanzl at noel.feld.cvut.cz (hanzl@noel.feld.cvut.cz) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: References: <38242de90410261208b9ae5f2@mail.gmail.com> Message-ID: <20041027114215V.hanzl@unknown-domain> > > I'm currently working on a project that will require fast access to > > data stored in a postgreSQL database server. I've been told that a > > ... > > and 64+ GB of RAM, the prices sky rocket. If I can acheive high > > performance with a cluster, using 15-20 dual processor machines, that > > would be great. > > This sort of cluster isn't a "beowulf" cluster; rather it is a variant > of a high availability cluster. It's Extreme Linux, just not beowulf. > The beowulf design (and focus of this list) is "high performance > computing" clusters, aka supercomputing clusters. I think that while this is true in many particular cases, it is far from being true in general. There are applications which involve databases and could be as beowulfish as it can get. I know reseachers who work with extremely huge and complex graphs and use a database for this. Should they have say a MPI-based database with all data in RAM they could get tremendous speedups. They would be happy to copy the database to the distributed cluster RAM, do few zillions of operations on it and then copy some results back. I do agree that a database might not be the best tool for their job and complete rewrite of all the code they have might help :-) However I consider programming against a db API to be an important knowledge reuse and nice split of their problem into two parts which together take more computer time than one monolith would but one of them (the db searches) is a problem with commodity solutions. (And I might even argue that even high availability databases may very well use The True Beowulf as a component doing searches on mostly read-only data cached in cluster RAM or even cached in local harddisks.) 
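(To make that last idea concrete, here is a crude shell sketch; the host names, the file name and the search pattern are made up for illustration:

# copy the read-only data into a RAM-backed filesystem (tmpfs) on each node
for n in node01 node02 node03 node04; do
    scp big-readonly.dat $n:/dev/shm/
done

# any node can now answer a query from its in-RAM copy, so queries can be
# spread round robin over the nodes, e.g.:
ssh node03 "grep -c 'some pattern' /dev/shm/big-readonly.dat"

No real database is involved here, but the data-distribution pattern is the same.)
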
The only difference I can see is the application (which is not a CFD or galactic evolution or similar). From the point of wiew of interconnects, OS types, parallel libraries used, RAM, processors, cluster management etc. I see no reason why databases and beowulf could not overlap. Best Regards Vaclav Hanzl From mechti01 at luther.edu Tue Oct 26 22:09:48 2004 From: mechti01 at luther.edu (Timo Mechler) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Clic 2.0 lockup problems Message-ID: <2668.172.22.17.130.1098853788.squirrel@172.22.17.130> Hi all, I just finished installing Clic 2.0 on a cluster of 1 server and 12 nodes. After running the setup_auto_cluster script I got everything installed. I created a "cluster user" and proceeded to test out some of the included mpi sample code. This ran fine. I next tried to start this code remotely (through SSH), but when I did this, the server locked up and had to be rebooted. It actually locked up while connecting via SSH, not when executing the sample mpi code. Any idea what might cause this? The server has 3 network interfaces: eth0 - administration eth1 - outside (internet) eth2 - message passing (computing) (The nodes each have 2 interfaces, one for administration and one for message passing) It also seems that when I logon as the "cluster user" or root, and try to access an external website (e.g. google), the server will lockup and need to be rebooted again. Any idea why I'm experiencing these lockups? Is something configured incorrectly? Is it a faulty network card? I was able to access outside websites fine before I ran the setup scripts. Thanks in advance for your help. Regards, -Timo Mechler -- Timo R. Mechler mechti01@luther.edu From rgb at phy.duke.edu Wed Oct 27 08:08:04 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: <20041027114215V.hanzl@unknown-domain> References: <38242de90410261208b9ae5f2@mail.gmail.com> <20041027114215V.hanzl@unknown-domain> Message-ID: On Wed, 27 Oct 2004 hanzl@noel.feld.cvut.cz wrote: (a bunch of stuff, leading to...) > The only difference I can see is the application (which is not a CFD or > galactic evolution or similar). From the point of wiew of > interconnects, OS types, parallel libraries used, RAM, processors, > cluster management etc. I see no reason why databases and beowulf > could not overlap. And I would agree, and even pointed out that there WERE areas of logical overlap. The problems being solved and bottlenecks involved are in many cases the same. However, by convention "beowulf" clusters per se and MOST of the energy of this list is devoted to HPC -- numerical computations and applications. It is undeniable that numerical applications exist that interact with data stores of many different forms, including I'm sure databases. Databases are also used to manage clusters. Grids, in particular, tend to integrate tightly with databases to be able to manage distributed storage resources that aren't necessarily viewable or accessible as "mounts" of an ordinary filesystem. When one has thousands of (potential) users and millions of inodes spread out across hundreds of disks with a complicated set of relationships regarding access to the HPC-generated data, a DB is needed just to permit search and retrieval of your OWN results, let alone somebody else's. 
Nevertheless, databases per se are not numerical HPC, and a cluster built to do SQL transactions on a collective shared database is not properly called a "beowulf" cluster or even a more general HPC cluster or grid. That is one reason that they ARE rarely discussed on this list. In fact, most of the discussion that has occurred and that is in the archives concerns why database server clusters aren't really HPC or beowulf clusters, not how one might build a cluster-based database. The latter is more the purview of: http://www.linux-ha.org/ and its associated list; at least they address the data reliability, failover, data store design, logical volume management aspects of shared DB access. Even this list doesn't actually address "cluster implementation of a database server program" though, because that is actually a very narrow topic. So narrow that it is arguably confined to particular database servers, one at a time. To put it another way, writing a SQL database server is a highly nontrivial task, and good open source servers are rare. Mysql is common and open source (if semi-commercial) for example, but there exist absolute rants on the web against mysql as being a high quality, scalable DB for all of that. (I'm not religiously involved in this debate, BTW, so flames->/dev/null or find the referents with google and flame them instead, I'm just pointing out that the debate exists;-). Writing a PARALLEL SQL database server is even MORE nontrivial, and while yes, some reasons for this are shared by the HPC community, the bulk of them are related directly to locking and the file system and to SQL itself. Indeed, most are humble variants of the time-honored problem of how to deal with race conditions and the like when I'm trying to write to a particular record with a SQL statement on node A at the same time you're trying to read from the record on node B, or worse yet, when you're trying to write at the same time. Most of the solutions to this problem lead to godawful and rapid corruption of the record itself if not the entire database. Robust solutions tend to (necessarily) serialize access, which defeats all trivial parallelizations. NONtrivial parallelizations are things like distributing the execution of actual SQL search statements across a cluster. Whether there is any point in this depends strongly on the design of the database store itself; if it is a single point of presence to the entire cluster, there is an obvious serial bottleneck there that AGAIN defeats most straightforward parallelizations (depending a bit on just how long a node has to crunch a block of the DB compared to the time required to read it in from the server). It also depends strongly on how the DB itself is organized, as the very concept of "block of the DB" may be meaningless. In fact, to make a really efficient parallel DB program, I believe that you have to integrate a design from the datastore on up to avoid serializing bottlenecks. The actual DB has to be stored in a way that can be read in units that can be independently processed. It has to be organized in such a way that the hashing and information-theoretic-efficient parsing of the blocked information can proceed efficiently on the nodes (not easy when there is record linkage in a relational DB -- maybe not POSSIBLE in general in a relational DB). The distributed tasks have to be rewritten from scratch by Very Smart Humans to use parallelizable algorithms (that integrate with the underlying file store and with the underlying DB organization). 
These algorithms are likely to be so specialized as to be patentable (and I'll bet that e.g. Oracle owns a slew of patents on this very thing). Finally the specter of locking looms over everything, threatening all of your work unless you can arrange for record modification not to serialize everything. For read only access, life is probably livable if not good. RW access to a large relational DB to be distributed across N nodes -- just plain ouch... So yes, it is fun to kick around on this list in the abstract BECAUSE lots of these are also problems in parallel applications that work with data (in a DB per se or not) but in direct reference to the question, no, this list isn't going to provide direct guidance on how to parallelize mysql or oracle or sybase or postgres or peoplesoft because EACH of these has to engineer an efficient parallel solution all the way from the file store to the user interface and API, at least if one wants to get reliable/predictable and beneficial scalability. There may, however, be people on the list that have messed with parallelized versions of at least some of these DBs. There has certainly been list discussion on parallizing postgres before (e.g. http://beowulf.org/pipermail/beowulf/1998-October/002070.html Which is alas no longer accessible in the archives at this address, although google still turns it and a number of other hits up; perhaps it is a part of what was lost when beowulf.org crashed a short while ago. Unfortunately, I failed to capture the list archives in my last mirror of this site. And Doug Eadline probably can say a few words about the parallelization of mysql (which has ALSO been discussed on the list back in 1999 and is ALSO missing from the archives). Both mysql and postgres appear to have a parallel implementation with at least some scalability, see: http://www.illusionary.com/snort_db.html Mysql's is an actual cluster implementation: http://www.mysql.com/products/cluster/ (note that bit about "Designed for 99.999% Availability" -- high availability, not HPC). A thread on mysql and postgres clustering on slashdot: http://developers.slashdot.org/comments.pl?sid=62549&cid=5843509 (a search is complicated by the fact that postgres refers to relational database structures on disk as "clusters" and has actual commands to create them etc.). Postgres based clustering project (of sorts) lives here: http://gborg.postgresql.org/project/erserver/projdisplay.php There is a sourceforge project trying to implement some sort of lowest-common-denominator embarrassingly parallel cluster DB solution that can be implemented "on top of" SQL DBs (as I make it out, read it for yourself). http://ha-jdbc.sourceforge.net/ Really, google is your friend in this. In a nutshell, it IS possible to find support for cluster-type access to at least mysql and postgres in the open source community, in at least two ways (native and as an add-on layer in each case). Add-on layer clustering provides a better than nothing solution to the serial bottleneck problem, but it will not scale well for all kinds of access and has the usual problems with the design of the data store itself. I can't comment on the native implementations beyond observing that mysql looks like it is in production while postgres looks like it is still very much under development, and that both of them are "replication" models that likely won't scale well at all for write access (they likely handle locking at the granularity of the entire DB, although I >>do not know<< and don't plan to look it up:-). 
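For the read-mostly case, the poor man's version of the replication model is simple enough to script by hand -- the host names and database name below are invented, and this cheerfully ignores every locking and consistency issue raised above:

# dump the master copy once
pg_dump mydb > /tmp/mydb.sql

# reload it onto each read-only query node (node and db names are placeholders)
for n in dbnode01 dbnode02 dbnode03; do
    ssh $n "dropdb mydb; createdb mydb"
    psql -h $n -d mydb < /tmp/mydb.sql
done

# the application then spreads its SELECTs over dbnode01..dbnode03, sends all
# writes to the master, and reruns this refresh as often as it can afford
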
Hope this helps somebody. If nothing else, it is likely worthwhile to reinsert a discussion on this into the archives because of recent developments and because previous discussions have gone away. rgb > > Best Regards > > Vaclav Hanzl > > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From hahn at physics.mcmaster.ca Wed Oct 27 10:25:58 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: Message-ID: > relationships regarding access to the HPC-generated data, a DB is needed > just to permit search and retrieval of your OWN results, let alone > somebody else's. right. the distinction here is that HPC and filesystems tend to have a very simple DB schema ;) > Writing a PARALLEL SQL database server is even MORE nontrivial, and > while yes, some reasons for this are shared by the HPC community, the > bulk of them are related directly to locking and the file system and to > SQL itself. depends. for instance, it's not *that* uncommon to have DB's which see almost nothing but read-only queries (and updates, if they happen at all, can be batched during an off-time.) that makes a parallel version quite easy, actually: imagine a bunch of 8GB dual-opterons running queries on a simple NFS v3 server over Myrinet. for a read-mostly load, especially one with enough locality to make 8GB caches effective, this would probably *fly*. tweak it with iSCSI and go to 64 GB quad- opterons. how many tables out there wouldn't have a good hit rate in 64GB? > NONtrivial parallelizations are things like distributing the execution > of actual SQL search statements across a cluster. Whether there is any it's easy to imagine that a stream of SQL queries could actually be handled in sort of an adaptive data refinement manner, where most of the thought goes in to managing division of the query labor (distributed indices searched in parallel, etc) , and in placement of data (especially ownership/locking of writable data). I have no idea whether Oracle-level DB's do this, but it wouldn't surprise me. the irony is that most of the thought that goes into advanced Beowulf applications is doing exactly this sort of labor/data division/balancing. I'd hazard a guess that the place to start putting parallelism in a DB is the underlying isam-like table layer... From gary at sharcnet.ca Wed Oct 27 09:22:58 2004 From: gary at sharcnet.ca (Gary Molenkamp) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: <20041022020350.GA32640@cse.ucdavis.edu> Message-ID: On Thu, 21 Oct 2004, Bill Broadley wrote: > I'm familar with 48 sun v20z (newisys) machines around here, only one > died so far with a hard memory error (I.e. won't boot). So far, one of 24 with a power issue (still debugging). > Speaking of which, has anyone done anything useful with the v20z LCD > display, ours just say something like IP address of the management > interface and OS booted or similar. > > I was hoping for hostname, maybe system load, even a way to pull a node > out of the queue (there are several buttons under the LCD). Just a hostname. 
:) -- Gary Molenkamp SHARCNET Systems Administrator University of Western Ontario gary@sharcnet.ca http://www.sharcnet.ca (519) 661-2111 x88429 (519) 661-4000 From hanzl at noel.feld.cvut.cz Wed Oct 27 10:42:40 2004 From: hanzl at noel.feld.cvut.cz (hanzl@noel.feld.cvut.cz) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: References: <20041027114215V.hanzl@unknown-domain> Message-ID: <20041027194240L.hanzl@unknown-domain> Hello RGB, I had no intention to take up your whole morning :-) (Neither did I intend to exploit your susceptibility to DoS attack by making provocative comments :-)) ) Of course, I agree with your explanation about databases. However, > MOST of the energy of this list is devoted to HPC -- numerical > computations and applications. > ... > Nevertheless, databases per se are not numerical HPC, and a cluster > built to do SQL transactions on a collective shared database is not > properly called a "beowulf" yes, but I still have a feeling that you are trying to squeeze _numerical_ to definition of beowulf which would be a pity because there are problems with _numerical/symbolic_ mix best solved on exactly the same type of hardware as the _numerical_ ones. I hope these can live on this list as well, unless cooling the FPU portion of the chip bacames the main topic here :-)) OK, I do confess that I do pursue my selfish goals because my problems are numerical/symbolic mix :-) And no, I do not use SQL databases for them. However I know people who do misuse SQL databases this way (in the similar manner we lazy people waste computer power with perl or Matlab) and who could make easy progress by MPI implementing very very limited subset of SQL, just enough to run those stupid select()s found in their code. But I repeat, normal SQL databases are mostly out of topic here, no doubt. Best Regards Vaclav Hanzl From atp at piskorski.com Wed Oct 27 10:48:19 2004 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Re: High Performance for Large Database In-Reply-To: <38242de90410261208b9ae5f2@mail.gmail.com> References: <38242de90410261208b9ae5f2@mail.gmail.com> Message-ID: <20041027174819.GB29850@piskorski.com> On Tue, Oct 26, 2004 at 01:08:00PM -0600, Joshua Marsh wrote: > Hi all, > > I'm currently working on a project that will require fast access to > data stored in a postgreSQL database server. I've been told that a > Beowulf cluster may help increase performance. Unlikely in general, although possible in certain cases. For example, look into Clusgres, memcached, and Backplane. I've previously given links and discussion here: http://openacs.org/forums/message-view?message_id=128060 http://openacs.org/forums/message-view?message_id=179348 > Since I'm not very familar with Beowulf clusters, I was hoping that It is more important that you are extensively familiar with RDBMSs in general, and PostgreSQL in particular. Are you? > you might have some advice or information on whether a cluster would > increase performance for a PostgreSQL database. The major tables > accessed are around 150-200 million records. On a stand alone > server, it can take several minutes to perform a simple select > query. 200 million rows is not that big. What's the approximate total size of your database on disk? Your "several minutes for a simple select" query performance is abysmal, and this is unlikely to be because of your hardware. 
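A first, concrete sanity check (the table and column names below are invented for the example):

# ask PostgreSQL how it actually executes one of the slow queries
psql -d mydb -c "EXPLAIN ANALYZE SELECT * FROM big_table WHERE customer_id = 42;"

# a "Seq Scan on big_table" over 150-200 million rows is the classic culprit;
# an index on the filtered column often turns minutes into milliseconds
psql -d mydb -c "CREATE INDEX big_table_customer_id_idx ON big_table (customer_id);"
psql -d mydb -c "VACUUM ANALYZE big_table;"
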
Most likely, your queries just suck, and you need to do some serious SQL tuning work before even considering big huge fancy hardware. Once you have tables with tens or hundreds of millions of rows, doing ANY full table scans of that table at all sucks really badly, so you MUST profile and tune your queries. And eliminating full table scans of large tables is just the first and most obvious step, it is not unusual to still have very sucky queries after that. > It seems like once we start pricing for servers with 16+ processors > and 64+ GB of RAM, the prices sky rocket. If I can acheive high > performance with a cluster, using 15-20 dual processor machines, that > would be great. If you are even thinking about buying an 8-way or larger box, then you are certainly a candidate for several 2-way boxes with an (expensive) SCI interconnect, so see Clusgres in my links above. Or, if want want to spend a lot less money, your access is read-mostly, and you DON'T need full ACID transactional support for your read-only queries, look into using memcached to cache query results in other machines' RAM. However, VERY few people need such large RDBMS boxes. What makes you think you do? What exactly is your application doing, and what sort of load do you need it to sustain? Have you profiled and tuned all your SQL? Tuned your PostgreSQL and Linux kernel settings? Have you read and worked through all the PostgreSQL docs on tuning? (You didn't install PostgreSQL with its DEFAULT settings, did you? Those are intended to just get it up and running on ALL the platforms PostgreSQL supports, not to give good performance.) Investing hundreds of thousands of dollars in fancy server hardware without first doing your basic RDBMS homework makes no sense at all. If your database is dog slow because of poor data modeling or grossly untuned queries, throwing $300k of hardware at the problem may not help much at all. -- Andrew Piskorski http://www.piskorski.com/ From reuti at staff.uni-marburg.de Wed Oct 27 11:28:11 2004 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Clic 2.0 lockup problems In-Reply-To: <2668.172.22.17.130.1098853788.squirrel@172.22.17.130> References: <2668.172.22.17.130.1098853788.squirrel@172.22.17.130> Message-ID: <1098901691.417fe8bb76972@home.staff.uni-marburg.de> Hi, > I just finished installing Clic 2.0 on a cluster of 1 server and 12 nodes. > After running the setup_auto_cluster script I got everything installed. > I created a "cluster user" and proceeded to test out some of the included > mpi sample code. This ran fine. I next tried to start this code remotely > (through SSH), but when I did this, the server locked up and had to be > rebooted. It actually locked up while connecting via SSH, not when > executing the sample mpi code. Any idea what might cause this? The > server has 3 network interfaces: > > eth0 - administration > eth1 - outside (internet) > eth2 - message passing (computing) > > (The nodes each have 2 interfaces, one for administration and one for > message passing) > > It also seems that when I logon as the "cluster user" or root, and try to > access an external website (e.g. google), the server will lockup and need > to be rebooted again. > > Any idea why I'm experiencing these lockups? Is something configured > incorrectly? Is it a faulty network card? I was able to access outside > websites fine before I ran the setup scripts. 
From reuti at staff.uni-marburg.de Wed Oct 27 11:28:11 2004 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Clic 2.0 lockup problems In-Reply-To: <2668.172.22.17.130.1098853788.squirrel@172.22.17.130> References: <2668.172.22.17.130.1098853788.squirrel@172.22.17.130> Message-ID: <1098901691.417fe8bb76972@home.staff.uni-marburg.de> Hi, > I just finished installing Clic 2.0 on a cluster of 1 server and 12 nodes. > After running the setup_auto_cluster script I got everything installed. > I created a "cluster user" and proceeded to test out some of the included > mpi sample code. This ran fine. I next tried to start this code remotely > (through SSH), but when I did this, the server locked up and had to be > rebooted. It actually locked up while connecting via SSH, not when > executing the sample mpi code. Any idea what might cause this? The > server has 3 network interfaces: > > eth0 - administration > eth1 - outside (internet) > eth2 - message passing (computing) > > (The nodes each have 2 interfaces, one for administration and one for > message passing) > > It also seems that when I logon as the "cluster user" or root, and try to > access an external website (e.g. google), the server will lockup and need > to be rebooted again. > > Any idea why I'm experiencing these lockups? Is something configured > incorrectly? Is it a faulty network card? I was able to access outside > websites fine before I ran the setup scripts. Have you tried to log in to the server on all three of the interfaces, to check whether it's really completely down and not only one interface? - Reuti From rgb at phy.duke.edu Wed Oct 27 11:41:31 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: <20041027194240L.hanzl@unknown-domain> References: <20041027114215V.hanzl@unknown-domain> <20041027194240L.hanzl@unknown-domain> Message-ID: On Wed, 27 Oct 2004 hanzl@noel.feld.cvut.cz wrote: > Hello RGB, > > I had no intention to take up your whole morning :-) (Neither did I > intend to exploit your susceptibility to DoS attack by making > provocative comments :-)) ) Naaa, it was fun. And the list has always been fairly tolerant of a broad definition of HPC as long as it is fun and relevant (I really think that fun is more the criterion than whether it is floating point intensive) just as it is tolerant of those people who do HPC on clusters that aren't really "beowulf" style clusters in the original sense of its definition (like me:-). I'm also very interested in just what sort of symbolic manipulation you are working on. I've worked with some of the various algebraic manipulation packages that have existed running back into the dawn of time -- FORMAC, Macsyma, maple, mathematica .. and agree that there is much in the token parsing and algebraic reconstruction process that could be parallelized, as parts of algebra are intrinsically independent. There are also still an abundance of problems (in physics, especially) where a good non-commutative algebra engine that can be taught about a set of generators/commutators can really help out. And then there is geometric algebra (the descendant of quaternions, Grassmann algebras, Clifford algebras) where I think of things as barely being begun, especially since there is a geometric/visualization component that tags along with the algebraic component. Anything like that? rgb > > Of course, I agree with your explanation about databases. However, > > > MOST of the energy of this list is devoted to HPC -- numerical > > computations and applications. > > ... > > Nevertheless, databases per se are not numerical HPC, and a cluster > > built to do SQL transactions on a collective shared database is not > > properly called a "beowulf" > > yes, but I still have a feeling that you are trying to squeeze > _numerical_ to definition of beowulf which would be a pity because > there are problems with _numerical/symbolic_ mix best solved on > exactly the same type of hardware as the _numerical_ ones. I hope > these can live on this list as well, unless cooling the FPU portion of > the chip bacames the main topic here :-)) > > OK, I do confess that I do pursue my selfish goals because my problems > are numerical/symbolic mix :-) And no, I do not use SQL databases for > them. However I know people who do misuse SQL databases this way (in > the similar manner we lazy people waste computer power with perl or > Matlab) and who could make easy progress by MPI implementing very very > limited subset of SQL, just enough to run those stupid select()s found > in their code. > > But I repeat, normal SQL databases are mostly out of topic here, no > doubt. > > Best Regards > > Vaclav Hanzl > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C.
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Wed Oct 27 11:54:04 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: References: Message-ID: On Wed, 27 Oct 2004, Mark Hahn wrote: > > NONtrivial parallelizations are things like distributing the execution > > of actual SQL search statements across a cluster. Whether there is any > > it's easy to imagine that a stream of SQL queries could actually > be handled in sort of an adaptive data refinement manner, where most > of the thought goes in to managing division of the query labor (distributed > indices searched in parallel, etc) , and in placement of data (especially > ownership/locking of writable data). I have no idea whether Oracle-level DB's > do this, but it wouldn't surprise me. the irony is that most of the thought > that goes into advanced Beowulf applications is doing exactly this sort of > labor/data division/balancing. > > I'd hazard a guess that the place to start putting parallelism in a DB > is the underlying isam-like table layer... As always, google is your friend. parallel database algorithms turns up lots of current work; I'm sure a look at specific open source projects would turn up more (and maybe more relevant) work. Some of the tools I turned up in my short former query do exploit the kind of simple read-only data parallelism you described, though, and wrap it up all pretty. For small read only databases (backing e.g. a website), the very simplest approach is likely to put a full copy of the DB on each server and distribute the transactions themselves round robin. Use rsync to periodically update the images to accomodate distributed changes, if you permit distributed write and you dare (merging in changes is nontrivial). Or write an engine that uses idle time of the node engines themselves to distribute inserts to be scheduled. Google itself is a pretty good example, actually. Complex searches of a read-only and truly encyclopediac database. All I really know is that this is real computer science, and I am only an amateur at best. I suspect that large scale database parallelism is the subject of much current algorithmic research, as parts of the problem are likely NP complete or at least NP hard. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From mechti01 at luther.edu Wed Oct 27 12:16:42 2004 From: mechti01 at luther.edu (mechti01) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Clic 2.0 lockup problems In-Reply-To: <1098901691.417fe8bb76972@home.staff.uni-marburg.de> References: <2668.172.22.17.130.1098853788.squirrel@172.22.17.130> <1098901691.417fe8bb76972@home.staff.uni-marburg.de> Message-ID: <2277.172.17.4.251.1098904602.squirrel@172.17.4.251> Hi Thanks for your help. No, eth1, because that is the only external interface. In other words, I tried to logon from an external machine. Ssh works fine on other two interfaces though. Thanks again. -Timo > Hi, > >> I just finished installing Clic 2.0 on a cluster of 1 server and 12 >> nodes. >> After running the setup_auto_cluster script I got everything installed. >> I created a "cluster user" and proceeded to test out some of the >> included >> mpi sample code. This ran fine. 
I next tried to start this code >> remotely >> (through SSH), but when I did this, the server locked up and had to be >> rebooted. It actually locked up while connecting via SSH, not when >> executing the sample mpi code. Any idea what might cause this? The >> server has 3 network interfaces: >> >> eth0 - administration >> eth1 - outside (internet) >> eth2 - message passing (computing) >> >> (The nodes each have 2 interfaces, one for administration and one for >> message passing) >> >> It also seems that when I logon as the "cluster user" or root, and try >> to >> access an external website (e.g. google), the server will lockup and >> need >> to be rebooted again. >> >> Any idea why I'm experiencing these lockups? Is something configured >> incorrectly? Is it a faulty network card? I was able to access outside >> websites fine before I ran the setup scripts. > > you tried to login to the server on all of the three interfaces, to check > whether it's really completely down and not only one interface? - Reuti > -- From hanzl at noel.feld.cvut.cz Wed Oct 27 12:52:06 2004 From: hanzl at noel.feld.cvut.cz (hanzl@noel.feld.cvut.cz) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: References: <20041027194240L.hanzl@unknown-domain> Message-ID: <20041027215206U.hanzl@unknown-domain> > I'm also very interested in just what sort of symbolic manipulation you > are working on. My numerical/symbolic mix underlying my opinions is from natural language processing, mostly speech recognition. Involves training phase which uses huge amount of recorded speech which is iteratively turned into estimated statistical distributions of phoneme sounds (multivariate gaussians with some 500.000 parameters, work for the FPU) and huge amount of text turned into dictionaries and grammar rules (symbolic and maybe even SQL). This phase is not very beowulfish, processes can work locally for minutes. Then there is the recognition phase when we match unknown utterances against our models of sounds and pronunciation and dictionaries and grammar and this is very beowulfish as we need to estimate zillions of partial hypothesis and compose them together somehow, likely in real time, and we are happy to pass quick messages around and keep most things in aggregated cluster RAM. Training on huge speech data has very much the pattern just described by Mark Hahn: > depends. for instance, it's not *that* uncommon to have DB's which > see almost nothing but read-only queries (and updates, if they happen > at all, can be batched during an off-time.) that makes a parallel > version quite easy (thought we do not have wav files in SQL :-) ) and we are much interested in ways to divide our data to chunks cached on local harddisks on nodes and repeatedly processed again and again (say 30 times during one itarative process, and we try many variants of this process on the same data.) 
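A minimal sketch of that chunk-caching pattern, under made-up assumptions (the corpus sits on a shared /home, every node has a local /scratch, chunks are assigned by hashing the file name, and the node names are placeholders):

# Sketch only: cache this node's share of the corpus on local disk once,
# then iterate over the local copies.  Paths and node names are invented.
import os, shutil, socket, zlib

CORPUS_DIR  = "/home/speech/corpus"     # shared (e.g. NFS) -- assumed
SCRATCH_DIR = "/scratch/speech/corpus"  # local to each node -- assumed
NODES = ["node001", "node002", "node003", "node004"]   # hypothetical names

def my_chunk(files, hostname):
    """Deterministically pick the subset of files this node owns."""
    rank = NODES.index(hostname)
    return [f for f in files if zlib.crc32(f.encode()) % len(NODES) == rank]

def ensure_cached(files):
    """Copy this node's chunk to local disk once; later passes reuse it."""
    os.makedirs(SCRATCH_DIR, exist_ok=True)
    local = []
    for name in files:
        dst = os.path.join(SCRATCH_DIR, name)
        if not os.path.exists(dst):
            shutil.copy(os.path.join(CORPUS_DIR, name), dst)
        local.append(dst)
    return local

if __name__ == "__main__":
    all_files = sorted(os.listdir(CORPUS_DIR))
    mine = ensure_cached(my_chunk(all_files, socket.gethostname()))
    for iteration in range(30):         # ~30 passes over the same data
        for path in mine:
            pass                        # run the local training pass here

After the first pass each node reads only from its own disk, so the shared file server is hit once per chunk instead of once per iteration.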
Of course we have just one cluster for both things, so it constantly switches between being a beowulf and not being a beowulf :-) Best Regards Vaclav Hanzl From mwill at penguincomputing.com Wed Oct 27 09:29:07 2004 From: mwill at penguincomputing.com (Michael Will) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: <037701c4bc11$455857e0$6901a8c0@dolphinics.no> References: <38242de90410261208b9ae5f2@mail.gmail.com> <417EFA16.8020503@yahoo.com.sg> <037701c4bc11$455857e0$6901a8c0@dolphinics.no> Message-ID: <200410270929.07862.mwill@penguincomputing.com> On Wednesday 27 October 2004 03:39 am, Keith Murphy wrote: > Check out this url http://www.linuxlabs.com/clusgres.html they look like > they have a solution for scaleable Postgres > > Kindest Regards > > Keith Murphy > Dolphin Interconnect Hey Keith, that is a really cool link. What interconnect does that lock them into again, though? On a more serious side: They advertise a beowulf-with-shared-memory solution, which demands low latency high bandwidth interconnects, and AFAIK they only support Dolphin Interconnect (SCI). Has anybody tried their product yet and can comment on its efficiency and scalability ? It does sound promising for any SMP type software that does not run well on a cluster because of its lack of shared memory. Also check out mysql.com's in-ram database product - they created a database that is not relying on any shared memory, but instead redundandly distributes the data out onto a cluster, using RAM only and claiming to be really fast. http://www.mysql.com/products/cluster/ And then there is oracle that advertises together with Infinicon, HP and AMD they would have set a new TPC-H One-Terabyte record: http://www.oracle.com/corporate/press/home/index.html Michael > 818-292-5100 > kmurphy@dolphinics.com > www.dolphinics.com > ----- Original Message ----- > From: "Laurence Liew" > To: "Joshua Marsh" > Cc: > Sent: Wednesday, October 27, 2004 3:29 AM > Subject: Re: [Beowulf] High Performance for Large Database > > > > Hi, > > > > You may wish to search thru the beowulf list or google for "beowulf and > > databases and postgresql"... there were a couple of threads on exeactly > > this issue. > > > > Very briefly > > > > 1. Beowulf clusters CANNOT help make Postgresql or any databases run > > faster. You need the database code to be modified to do that (think > > Oracle 10g). I met a company at Supercomputer 03 last year that had > > Mysql running on a cluster... you may wish to query for them. > > > > 2. You could try to sponsor the development of a parallel postrgresql - > > talk to the postgresql development team... when I broached the idea in > > 1998.. there was some interest.. unfortunately.. I could not afford the > > development/sponsorship costs then. > > > > 3. Try running Postgresql on a cluster filesystem like PVFS - it is not > > gauranteed as it probably fails the ACID test for a SQL compliant > > database. The basic idea is that if we cannot parallelise the database - > > we make the underlying IO parallel and hence boost the IO performance of > > the system.. and any applications that run on them.. and this includes > > Postgresql. > > > > Hope this helps. > > > > Cheers! > > Laurence > > Scalable Systems > > Singapore > > > > > > > > Joshua Marsh wrote: > > > Hi all, > > > > > > I'm currently working on a project that will require fast access to > > > data stored in a postgreSQL database server. 
I've been told that a > > > Beowulf cluster may help increase performance. Since I'm not very > > > familar with Beowulf clusters, I was hoping that you might have some > > > advice or information on whether a cluster would increase performance > > > for a PostgreSQL database. The major tables accessed are around > > > 150-200 million records. On a stand alone server, it can take several > > > minutes to perform a simple select query. > > > > > > It seems like once we start pricing for servers with 16+ processors > > > and 64+ GB of RAM, the prices sky rocket. If I can acheive high > > > performance with a cluster, using 15-20 dual processor machines, that > > > would be great. > > > > > > Thanks for any help you may have! > > > > > > -Josh > > > _______________________________________________ > > > Beowulf mailing list, Beowulf@beowulf.org > > > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > > > > ---------------------------------------------------------------------------- > ---- > > > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Michael Will, Linux Sales Engineer NEWS: We have moved to a larger iceberg :-) NEWS: 300 California St., San Francisco, CA. Tel: 415-954-2822 Toll Free: 888-PENGUIN Fax: 415-954-2899 www.penguincomputing.com From eugen at leitl.org Thu Oct 28 01:50:18 2004 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Re: MySQL Cluster with SCI interconnect (fwd from tamada@acornnetworks.co.jp) Message-ID: <20041028085018.GN1457@leitl.org> Of tenuous relevance, but it's a cluster database which uses SCI for signalling fabric. MySQL 4.1 (just out) actually has cluster functionality built-in (but for Sun Solaris, which is a release bug, about to be fixed). For those of you who're interested in HA Linux, the next release is not far away, and removes the 2-node limitation: http://linuxha.trick.ca/ ----- Forwarded message from Junzo Tamada ----- From: "Junzo Tamada" Date: Thu, 28 Oct 2004 16:00:53 +0900 To: "Hugo Kohmann" Cc: "Hugo Kohmann" , Subject: Re: MySQL Cluster with SCI interconnect Reply-To: "Junzo Tamada" X-Mailer: Microsoft Outlook Express 6.00.2900.2180 Dear Hugo, Yes, I'm working on it. The latest code from Mikael works well with Ethernet. I built it with SCI yesterday and will test today. I'll get back to you after verification of MySQL Cluster with SCI. Thank you very much. /Junzo ----- Original Message ----- From: "Hugo Kohmann" To: "Junzo Tamada" Cc: "Hugo Kohmann" ; Sent: Thursday, October 28, 2004 4:02 AM Subject: Re: MySQL Cluster with SCI interconnect > >Dear Junzo, > >MySQL cluster has native support for SCI our you can use the SCI Socket >software. You will find more information about this and some >benchmark numbers at >http://dev.mysql.com/doc/mysql/en/MySQL_Cluster_Interconnects.html > >With SCI Socket, you can configure MySQL cluster for regular networking >and just use SCI Socket between the nodes that requires low latency/high >bandiwidth connections. 
SCI Socket has typically 10x lower latency than >Gigagit Ethernet and should support any version of MySQL Cluster that runs >properly over Ethernet. > >You will find the SCI Socket software and more information at >http://www.dolphinics.com/products/software/sci_sockets.html > >Best regards > >Hugo > > >On Tue, 26 Oct 2004, Junzo Tamada wrote: > >>Date: Tue, 26 Oct 2004 23:46:49 +0900 >>From: Junzo Tamada >>To: cluster@lists.mysql.com >>Subject: MySQL Cluster with SCI interconnect >> >>Hello >> >>Very recently I found sections in the chapter of Cluster, which is >>decsribing utilization of high-speed interconnection called SCI. >>I would like to test it. >>Currently I have totaly 5 servers and 4 out of 5 are equiped with SCI >>network card. >>Please anyone provide me any suggestions regarding the following cluster >>configuration. >>All server are connected together with Ethernet. >>Node a1 through a4 are with SCI. >>I would like to assign mgmd and api(mysqld) to front-end and ndbd for >>a[1-4]. >>Is this feasible and reasonable configuration ? >> >>I am using MySQL 4.1.7-gamma as of Oct. 26th, 2004. >> >>Thank you in advance. >>/Junzo >> >>-- >>MySQL Cluster Mailing List >>For list archives: http://lists.mysql.com/cluster >>To unsubscribe: >>http://lists.mysql.com/cluster?unsub=hugo@dolphinics.com >> > > >========================================================================================= >Hugo Kohmann | >Dolphin Interconnect Solutions AS | E-mail: >P.O. Box 150 Oppsal | hugo@dolphinics.com >N-0619 Oslo, Norway | Web: >Tel:+47 23 16 71 73 | http://www.dolphinics.com >Fax:+47 23 16 71 80 | Visiting Address: Olaf Helsets >vei 6 | > -- MySQL Cluster Mailing List For list archives: http://lists.mysql.com/cluster To unsubscribe: http://lists.mysql.com/cluster?unsub=eugen@leitl.org ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041028/f6fca634/attachment.bin From john at clustervision.com Thu Oct 28 02:00:48 2004 From: john at clustervision.com (john@clustervision.com) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Mac OS X and High Performance Heterogenous Environments - London Message-ID: <1098954048.4180b540b0544@www.unreal-inc.net> There has been interest on this list on MacOS clusters. This event in London on Monday should be of interest. The UK Unix Users Group is good at organising technically focussed events. I would say that this won't be a company product puff. http://www.ukuug.org/events/apple04/ Who: Jordan Hubbard (Apple), and other speakers When: Monday 1st November, 2004 10:30am start; 10am doors Where: Institute of Physics, 76 Portland Place, London Cost: Free entry although preregistration is required and places are strictly limited. You do not have to be a UKUUG member to attend Sadly I doubt if I will be able to attend. 
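Back on the large-database thread, here is a minimal sketch of the read-mostly scheme suggested earlier (a full copy of the database on every server, with the stream of read-only queries dealt out round robin). It assumes the psycopg2 driver; the replica host names and connection parameters are invented, and writes are deliberately left out because they would still have to go through a single master and be propagated separately.

# Sketch only: deal independent read-only queries out to replicas in
# round-robin order.  Host names and connection details are placeholders.
import itertools
import psycopg2

READ_REPLICAS = ["db1.example.org", "db2.example.org", "db3.example.org"]
_next_host = itertools.cycle(READ_REPLICAS)

def read_query(sql, params=()):
    """Send one read-only query to the next replica in the cycle."""
    host = next(_next_host)
    conn = psycopg2.connect(host=host, dbname="mydb", user="reader")
    try:
        cur = conn.cursor()
        cur.execute(sql, params)
        return cur.fetchall()
    finally:
        conn.close()

# e.g. rows = read_query("SELECT count(*) FROM big_table WHERE customer_id = %s", (42,))

This only helps for the workload several posters describe: a large number of independent SELECTs. An individual query still runs at single-node speed, and anything that writes needs its own consistent path to all of the copies.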
From jakob at unthought.net Thu Oct 28 03:03:56 2004 From: jakob at unthought.net (Jakob Oestergaard) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: <38242de90410261208b9ae5f2@mail.gmail.com> References: <38242de90410261208b9ae5f2@mail.gmail.com> Message-ID: <20041028100356.GN12752@unthought.net> On Tue, Oct 26, 2004 at 01:08:00PM -0600, Joshua Marsh wrote: > Hi all, > > I'm currently working on a project that will require fast access to > data stored in a postgreSQL database server. I've been told that a > Beowulf cluster may help increase performance. Since I'm not very > familar with Beowulf clusters, I was hoping that you might have some > advice or information on whether a cluster would increase performance > for a PostgreSQL database. The major tables accessed are around > 150-200 million records. On a stand alone server, it can take several > minutes to perform a simple select query. > > It seems like once we start pricing for servers with 16+ processors > and 64+ GB of RAM, the prices sky rocket. If I can acheive high > performance with a cluster, using 15-20 dual processor machines, that > would be great. It depends. I was involved in one project where we had some hosts doing a *massive* number of queries against postgres, but no or few updates. This parallelizes very well. A single quiery would not run faster, but when you run thousands of queries, running them against a cluster of postgresql databases will even out the load just nicely, giving you linear scaling (sustained queries per second versus machines in the cluster). I don't think you'll have any luck finding off-the-shelf production-quality database software that will parallelize a single query on a number of nodes. If you just want throughput, large numbers of queries on a large number of databases, and you are doing mostly selects with very few (if any) updates/inserts/deletes, then PostgreSQL comes with software that can help you mirror your database. What you do is, you have a 'master' database - you will perform all updates/deletes/inserts against this master. The master will relay updates to a number of slave databases. You perform all selects against the slaves. Simply, stable, and works perfectly within the limits inherent in such a setup (eg. a single query won't parallelize, the master cannot scale to more updates than what is possible on a single system, etc.) -- / jakob From Dennis_Currit at ATK.COM Thu Oct 28 10:09:00 2004 From: Dennis_Currit at ATK.COM (Currit, Dennis) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Newbie question. Message-ID: <7B8D37027B57D7459984035D83864E360682AC68@exchangeut1.atk.com> We are thinking of putting up a cluster to run MSC Nastran and have about $25,000 budgeted for hardware. Is this enough to get get started? Any suggestions as to what I should buy? Currently we are running large jobs on an older AIX multiprocessor system and smaller jobs on a Xeon 2.8 system, so I think even a small cluster should be an improvement. From landman at scalableinformatics.com Thu Oct 28 11:06:09 2004 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Updated NCBI rpms released for the 2.2.10 toolkit Message-ID: <41813511.6000507@scalableinformatics.com> Folks: We rebuilt the NCBI rpms for AMD64, i386, i586 (non-P4), athlon, and i686 (p4). 
Feel free to grab them from our site http://downloads.scalableinformatics.com/downloads/ncbi/ They are named NCBI-2.2.10-1.*.rpm, where * = {src,x86_64,i386,i586,i686} They were built on RHEL/SuSE/Fedora Core2 machines. Should install without problems (and use the source if you have problems). Please note that if you have a non-pentium4/non-athlon machine (PIII) you want the i586 or i386 version. If you have a pentium4 based machine, you want the i686 version. AMD64 (and probably EM64T) will use the x86_64. Athlon's will use the athlon version. Unless someone supplies me with G5 or Itanium2, I probably wont be able to do builds for those platforms. Enjoy, and as usual, bug reports/problems to us, not to NCBI. We built the RPMS, so if they are broken, we need to know. Joe -------- Original Message -------- Subject: [blast-announce] [ BLAST_Announce #044] BLAST 2.2.10 released Date: Thu, 28 Oct 2004 11:35:37 -0400 From: Mcginnis, Scott (NIH/NLM/NCBI) To: 'blast-announce@ncbi.nlm.nih.gov' Binaries can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.10/ Source code can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/old/20041020/ Additionally, NCBI now provides anoncvs access (http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=too lkit.section.cvs_external) to toolkit sources. A cvsweb source browser (http://www.ncbi.nlm.nih.gov/cvsweb/index.cgi/internal/c++/src/algo/blast/co re/) and doxygen documentation (http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/group__AlgoBlast.h tml) are also available. Notes for the 2.2.10 release New engine We have been rewriting and restructuring the BLAST engine in order to make BLAST more modular and extensible. bl2seq and megablast currently support the new engine; it can be enabled with the -V F option. Using the new engine may result in significant performance improvements in some cases. General changes -megablast now performs ungapped extensions in order to prevent suboptimal alignments -consolidated formatting code -removed fmerge.c -small fixes to sum statistics code -better error handling -fixed masking of translated queries -fixed several readdb threading bugs -improved protein neighbor generation -hsp sorting/inclusion fixes -many changes in HSP linking -several fixes for translated RPS blast BlastPGP -added code to spread out gap costs when extracting data from the sequence alignment to build PSSM -changed handling of all-zero columns of residue frequencies to use the underlying scoring matrix frequency ratios rather than scoring matrix's scores - disallowed an ungapped search if more than one iteration is specified scoremat.asn specification -added a new 'formatrpsdb' application; given a collection of Score-mat ASN.1 files, this application creates a database suitable for use with RPSBLAST -Simplified NCBI-ScoreMat specification to represent PSSMs instead of arbitrary scoring matrices. blastpgp and formatrpsdb can deal with this format. 
If you have any questions please write to blast-help@ncbi.nlm.nih.gov -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 612 4615 _______________________________________________ Bioclusters maillist - Bioclusters@bioinformatics.org https://bioinformatics.org/mailman/listinfo/bioclusters -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 612 4615 From rokrau at yahoo.com Thu Oct 28 12:06:40 2004 From: rokrau at yahoo.com (Roland Krause) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron Message-ID: <20041028190640.9782.qmail@web52907.mail.yahoo.com> Folks, does anybody here have positive or negative experiences with using the Intel EMT-64 Fortran compiler on AMD Opteron systems? I am at this point not so much interested in speed issues but more stability and correctness especially with respect to OpenMP. Or in other words: Is it worth trying yet? Best regards, Roland __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From ctierney at HPTI.com Thu Oct 28 12:39:52 2004 From: ctierney at HPTI.com (Craig Tierney) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron In-Reply-To: <20041028190640.9782.qmail@web52907.mail.yahoo.com> References: <20041028190640.9782.qmail@web52907.mail.yahoo.com> Message-ID: <1098992392.3052.83.camel@hpti9.fsl.noaa.gov> On Thu, 2004-10-28 at 13:06, Roland Krause wrote: > Folks, > does anybody here have positive or negative experiences with using the > Intel EMT-64 Fortran compiler on AMD Opteron systems? > Is it even going to work until the Opteron supports SSE3? I suspect if you don't vectorize, or only build 32-bit apps you will be ok. However, for most applications the vectorization is going to give you the big win. Craig > I am at this point not so much interested in speed issues but more > stability and correctness especially with respect to OpenMP. Or in > other words: Is it worth trying yet? > > Best regards, > Roland > > > > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pathscale.com Thu Oct 28 13:28:35 2004 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron In-Reply-To: <1098992392.3052.83.camel@hpti9.fsl.noaa.gov> References: <20041028190640.9782.qmail@web52907.mail.yahoo.com> <1098992392.3052.83.camel@hpti9.fsl.noaa.gov> Message-ID: <20041028202835.GB2227@greglaptop.internal.keyresearch.com> On Thu, Oct 28, 2004 at 01:39:52PM -0600, Craig Tierney wrote: > However, for most applications the vectorization > is going to give you the big win. People think that, but did you know that SIMD vectorization doesn't help any of the codes in SPECfp? Remember that the Opteron can use both fp pipes with scalar code. This is very different from the Pentium4. I'd say this myth is the #1 myth in the HPC industry right now. 
-- greg From csamuel at vpac.org Thu Oct 28 21:46:56 2004 From: csamuel at vpac.org (Chris Samuel) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097615626.28704.129.camel@syru212-207.syr.edu> References: <1097611167.28704.104.camel@syru212-207.syr.edu> <416C4339.3040309@scalableinformatics.com> <1097615626.28704.129.camel@syru212-207.syr.edu> Message-ID: <200410291446.58866.csamuel@vpac.org> On Wed, 13 Oct 2004 07:13 am, Chris Sideroff wrote: > We run exclusively computational fluid dynamics on it. One program is > Fluent, the other is an in-house turbo-machinery code. My experiences so > far have led me to believe Fluent is much more sensitive to the > network's performance than the in-house program. Thus my inquiry into a > higher performance network. Fluent is very latency sensitive, and apparently the next release of Fluent will support Myrinet on Opteron, which will be nice to see. The list of technologies that they support is at: http://www.fluent.com/about/news/newsletters/03v12i2_fall/img/a26i1_lg.gif Yes, that really is an image.. :-/ Chris -- Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041029/6ec8516d/attachment.bin From Glen.Gardner at verizon.net Thu Oct 28 18:37:17 2004 From: Glen.Gardner at verizon.net (Glen Gardner) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Newbie question. References: <7B8D37027B57D7459984035D83864E360682AC68@exchangeut1.atk.com> Message-ID: <41819ECD.1080007@verizon.net> $25K will build a nice cluster. I'd be tempted to go with something in small cube cases with on-board gigabit. P4 is good, Xeon is very good, as is Opteron. Mac is an option, but costs a lot more. Depending on how you go about it, $25K ought to be enough money to build a really nice 16 node cluster and maybe you can throw in 4-5 cheap machines to use as PVFS I/O servers. It all depends on how much you want to buy pre-assembled and how much you are willing to build yourself. Glen Currit, Dennis wrote: >We are thinking of putting up a cluster to run MSC Nastran and have about >$25,000 budgeted for hardware. Is this enough to get get started? Any >suggestions as to what I should buy? Currently we are running large jobs on >an older AIX multiprocessor system and smaller jobs on a Xeon 2.8 system, so >I think even a small cluster should be an improvement. >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > -- Glen E. Gardner, Jr. AA8C AMSAT MEMBER 10593 Glen.Gardner@verizon.net http://members.bellatlantic.net/~vze24qhw/index.html From i.kozin at dl.ac.uk Fri Oct 29 02:17:19 2004 From: i.kozin at dl.ac.uk (Kozin, I (Igor)) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron Message-ID: Hi Roland, I think it is. It seems like 8.1 has much better OpenMP support than the previous versions although it is still not perfect. As far as I know it works on Opterons. Best, Igor I.
Kozin (i.kozin at dl.ac.uk) CCLRC Daresbury Laboratory tel: 01925 603308 http://www.cse.clrc.ac.uk/disco > -----Original Message----- > From: Roland Krause [mailto:rokrau@yahoo.com] > Sent: 28 October 2004 20:07 > To: beowulf@beowulf.org > Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron > > > Folks, > does anybody here have positive or negative experiences with using the > Intel EMT-64 Fortran compiler on AMD Opteron systems? > > I am at this point not so much interested in speed issues but more > stability and correctness especially with respect to OpenMP. Or in > other words: Is it worth trying yet? > > Best regards, > Roland > > > > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) > visit http://www.beowulf.org/mailman/listinfo/beowulf > From kus at free.net Fri Oct 29 08:55:26 2004 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron In-Reply-To: <20041028202835.GB2227@greglaptop.internal.keyresearch.com> Message-ID: In message from Greg Lindahl (Thu, 28 Oct 2004 13:28:35 -0700): >On Thu, Oct 28, 2004 at 01:39:52PM -0600, Craig Tierney wrote: > >> However, for most applications the vectorization >> is going to give you the big win. > >People think that, but did you know that SIMD vectorization doesn't >help any of the codes in SPECfp? It's interesting ! Opteron SPECfp2000 results obtained w/help of PGI 5.1-3 includes -fastsse copmiler option. SPECfp2000 results (for Opteron) based on old ifc 7.0 compiler include options like -xW which allow to create SIMD instructions. Etc. There is 2 possibilities a) These compilers didn't generate SSE2-containing codes for any program from SPECfp2000 - what looks strange for me b) In the case we'll re-translate the source of SPECfp2000 w/suppression of SSE commands generation, performance results will be the same. Do I understand you correctly, that you say about case b) ? BTW, if I remember correctly, ATLAS dgemm codes for Opteron are better if they are using SIMD fp operations - but of course, it's "out of SPECfp2000 codes" > Remember that the Opteron can use >both fp pipes with scalar code. This is very different from the >Pentium4. Yes, but 32-bit ifc compilers (which don't know about Opteron microarchitecture) gave better results than pgi compilers oriented to "right" microarchitecture. Of course, I don't say about yours PathScale compilers which usually are the best (in the perofrmance of codes generated) but too expensive :-( . Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow >I'd say this myth is the #1 myth in the HPC industry right >now. > From kus at free.net Fri Oct 29 08:58:07 2004 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron In-Reply-To: <1098992392.3052.83.camel@hpti9.fsl.noaa.gov> Message-ID: In message from Craig Tierney (Thu, 28 Oct 2004 13:39:52 -0600): >On Thu, 2004-10-28 at 13:06, Roland Krause wrote: >> Folks, >> does anybody here have positive or negative experiences with using >>the >> Intel EMT-64 Fortran compiler on AMD Opteron systems? >> > >Is it even going to work until the Opteron supports SSE3? 
You may generate the codes w/o SSE3 commands and w/64-bit support. Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow >I suspect if you don't vectorize, or only build 32-bit apps >you will be ok. However, for most applications the vectorization >is going to give you the big win. > >Craig > >> I am at this point not so much interested in speed issues but more >> stability and correctness especially with respect to OpenMP. Or in >> other words: Is it worth trying yet? > > >> >> Best regards, >> Roland >> >> >> >> >> __________________________________________________ >> Do You Yahoo!? >> Tired of spam? Yahoo! Mail has the best spam protection around >> http://mail.yahoo.com >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >>http://www.beowulf.org/mailman/listinfo/beowulf > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf From kus at free.net Fri Oct 29 09:09:46 2004 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron In-Reply-To: <20041028190640.9782.qmail@web52907.mail.yahoo.com> Message-ID: In message from Roland Krause (Thu, 28 Oct 2004 12:06:40 -0700 (PDT)): >Folks, >does anybody here have positive or negative experiences with using >the >Intel EMT-64 Fortran compiler on AMD Opteron systems? We have very small ifort/8.1.023 experiense on our Opteron - it's not enough to say about compilers comparison. But you may found comparison results at //www.polyhedron.com site. As I remember, best results are for PathScale, which is really the best in a lot of tests, and ifort is on the second position. But you should take into account that ifort 8.1.023 at some highest optimization level compiler keys generate the codes, which check (at the run time) the processor used and will not work on Opteron. > >I am at this point not so much interested in speed issues but more >stability and correctness especially with respect to OpenMP. According our experience w/ifc -7.1, it is realtive stable. I beleive ifort-8.1 will be also good. But we didn't use OpenMP in our application programs (we checked OpenMP only on some tests). Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow > Or in >other words: Is it worth trying yet? > >Best regards, >Roland > > > > >__________________________________________________ >Do You Yahoo!? >Tired of spam? Yahoo! Mail has the best spam protection around >http://mail.yahoo.com >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf From mack.joseph at epa.gov Fri Oct 29 05:38:01 2004 From: mack.joseph at epa.gov (Joseph Mack) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database References: <38242de90410261208b9ae5f2@mail.gmail.com> <20041028100356.GN12752@unthought.net> Message-ID: <418239A9.70C3A907@epa.gov> Jakob Oestergaard wrote: > > On Tue, Oct 26, 2004 at 01:08:00PM -0600, Joshua Marsh wrote: > > Hi all, > > > > I'm currently working on a project that will require fast access to > > data stored in a postgreSQL database server. I've been told that a > > Beowulf cluster may help increase performance. 
The Linux Virtual Server (LVS) project www.linuxvirtualserver.org is a load balancer which allows multiple requests to be balanced amongst a set of backend machines. It works perfectly for readonly. If clients write to the backend machines, then the updates have to be propagated to the other backend machines, and you have to do this outside LVS. If your usage is read mostly and you want a low cost solution, then LVS will do what you want. If you want a real parallel database, be prepared to pay lots of money to Oracle. disclaimer: I'm part of the LVS project Joe -- Joseph Mack PhD, High Performance Computing & Scientific Visualization LMIT, Supporting the EPA Research Triangle Park, NC 919-541-0007 Federal Contact - John B. Smith 919-541-1087 - smith.johnb@epa.gov From nick at brealey.org Fri Oct 29 02:13:36 2004 From: nick at brealey.org (Nicholas Brealey) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron In-Reply-To: <20041028190640.9782.qmail@web52907.mail.yahoo.com> References: <20041028190640.9782.qmail@web52907.mail.yahoo.com> Message-ID: <418209C0.8090007@brealey.org> Roland Krause wrote: > Folks, > does anybody here have positive or negative experiences with using the > Intel EMT-64 Fortran compiler on AMD Opteron systems? > > I am at this point not so much interested in speed issues but more > stability and correctness especially with respect to OpenMP. Or in > other words: Is it worth trying yet? > Take a look at the 64 bit AMD Opteron benchmarks results at http://www.polyhedron.com/ and google comp.lang.fortran. It seemed to be able to run all the benchmarks correctly. The Polyhedron benchmarks showed the Intel compiler coming in just behind the Pathscale compiler in 64 bit mode on an Opteron. The benchmarks don't use OpenMP though. Nick From rsweet at aoes.com Fri Oct 29 02:18:59 2004 From: rsweet at aoes.com (Ryan Sweet) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Newbie question. In-Reply-To: <7B8D37027B57D7459984035D83864E360682AC68@exchangeut1.atk.com> References: <7B8D37027B57D7459984035D83864E360682AC68@exchangeut1.atk.com> Message-ID: On Thu, 28 Oct 2004, Currit, Dennis wrote: > We are thinking of putting up a cluster to run MSC Nastran and have about > $25,000 budgeted for hardware. Have you also discussed licensing with MSC? It may have changed recently, as my experience here is a few months old, but Distributed Parallel features are licensed differently than the SMP Parallel features, and are significantly more expensive. > Is this enough to get get started? Certainly it is enough to buy a few nodes, but also if you are just getting started with clustering it may be good to think a couple of times about the direction you want to go and to analyse the associated costs of things like additional power consumption, air-conditioning, etc.... If you already have a sufficient server room infrastructure (racks, power, ac) then perhaps this isn't much of a consideration. > Any suggestions as to what I should buy? We are currently running NASTRAN with success on opteron, though we have also had good experience with PIV and Athlon MPs. I think the best thing to do is to try and benchmark your actual jobs on a few different systems if you can. > Currently we are running large jobs on > an older AIX multiprocessor system and smaller jobs on a Xeon 2.8 system, so > I think even a small cluster should be an improvement. It depends upon the jobs that you are running. 
If your jobs do benefit well from running in SMP on the AIX system, then you may also have good efficiency with DMP on a cluster. We have had mixed results with our structural analyses: some had near linear speedups on smallish (<16 cpus) cluster runs, and others would gain only about 20% for each cpu added. For many of our NASTRAN runs high speed disk IO is nearly as important as as the CPU. Find out what your job mix is like, and then spend your money accordingly (you may want to splash on fast SATA local disks for scratch space, or on more RAM to use RAM disks for scratch). Also have the engineers read through the manual regarding the DMP options because you have to pay a bit more attention to how your jobs are configured when using distributed parallel. good luck, -Ryan -- Ryan Sweet Advanced Operations and Engineering Services AOES Group BV http://www.aoes.com Phone +31(0)71 5795521 Fax +31(0)71572 1277 From Serguei.Patchkovskii at sympatico.ca Thu Oct 28 15:56:48 2004 From: Serguei.Patchkovskii at sympatico.ca (Serguei.Patchkovskii@sympatico.ca) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron In-Reply-To: <20041028190640.9782.qmail@web52907.mail.yahoo.com> Message-ID: <418140F0.24192.5C7CF22@localhost> On 28 Oct 2004 at 12:06, Roland Krause wrote: > does anybody here have positive or negative experiences with using the > Intel EMT-64 Fortran compiler on AMD Opteron systems? It works fine, as long as you do not use Prescott new instructions (which are in any event of less importance on Opterons - they are not as decode-crippled as P4s are), I've built a few moderately complex quantum chemistry codes with EM64 ifort, and they run OK on Opterons. > I am at this point not so much interested in speed issues but more > stability and correctness especially with respect to OpenMP. Or in > other words: Is it worth trying yet? I can't comment on OpenMP, but for serial code it is a nice, fast and reasonably stable compiler. Not as stable or as fast as Pathscale's, but far superior to PGI's. As usual, YMMV - and probably will. Serguei From vinicius at centrodecitricultura.br Fri Oct 29 05:33:23 2004 From: vinicius at centrodecitricultura.br (Vinicius) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] error in tstmachines In-Reply-To: <20041025.125039.103756653.lusk@localhost> References: <1098701880.2791.287.camel@nuts.clc.cuhk.edu.hk> <417D2503.3060805@staff.uni-marburg.de> <20041025.125039.103756653.lusk@localhost> Message-ID: <1099053203.11351.2.camel@swingle> help me!!!! rsh ok: [swingle@swingle bin]$ /usr/bin/rsh swingle3 Last login: Tue Oct 26 16:26:46 from swingle [swingle@swingle3 swingle]$ [swingle@swingle bin]$ ./tstmachines Errors while trying to run /usr/bin/rsh swingle3 -n /bin/ls /home/swingle/programs/mpich-1.2.6/bin/mpichfoo Unexpected response from swingle3: --> /bin/ls: /home/swingle/programs/mpich-1.2.6/bin/mpichfoo: No such file or directory The ls test failed on some machines. This usually means that you do not have a common filesystem on all of the machines in your machines list; MPICH requires this for mpirun (it is possible to handle this in a procgroup file; see the documentation for more details). Other possible problems include: The remote shell command /usr/bin/rsh does not allow you to run ls. See the documentation about remote shell and rhosts. You have a common file system, but with inconsistent names. See the documentation on the automounter fix. 
1 errors were encountered while testing the machines list for LINUX Only these machines seem to be available swingle From hahn at physics.mcmaster.ca Fri Oct 29 17:36:37 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron In-Reply-To: <1098992392.3052.83.camel@hpti9.fsl.noaa.gov> Message-ID: > > does anybody here have positive or negative experiences with using the > > Intel EMT-64 Fortran compiler on AMD Opteron systems? > > Is it even going to work until the Opteron supports SSE3? what SSE3 adds over SSE2 is remarkably minor. > I suspect if you don't vectorize, or only build 32-bit apps > you will be ok. However, for most applications the vectorization > is going to give you the big win. the big win is getting away from the x87 FP stack. vectorization is a wonderful thing, but practically any FP code will see a nice speedup with purely scalar SSE usage (such as you'd get with current gcc.) regards, mark hahn. From gvinodh1980 at yahoo.co.in Sat Oct 30 00:38:35 2004 From: gvinodh1980 at yahoo.co.in (Vinodh) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] MPICH fault handling Message-ID: <20041030073835.95072.qmail@web8503.mail.in.yahoo.com> hello, i established a four node beowulf cluster using MPICH. while testing, i started mpd daemon in all the nodes from the master by mpdboot, then i unplugged one slave node from LAN, and now i tried to execute a program using mpiexec, the master node is not recognising that one of the node has failed. then i checked in www.beowulf.org - Archives, the last discussion about the mpi node failure was at Jan - 2003. so now i want to know, whether there is any update of MPI fault handling. what can i do if 1. any slave node fails. 2. master node fails. __________________________________ Do you Yahoo!? Yahoo! Mail Address AutoComplete - You start. We finish. http://promotions.yahoo.com/new_mail From mechti01 at luther.edu Fri Oct 29 23:02:46 2004 From: mechti01 at luther.edu (Timo Mechler) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Rocks Cluster and 2 Ethernet networks Message-ID: <2937.172.22.17.130.1099116166.squirrel@172.22.17.130> Hi all, I'm considering installing the Rocks cluster distro on a cluster that uses only ethernet. As I understand it, eth0 (or first network interface) is used for administration and also message passing if no other high speed interface is present (e.g. myrinet). My question is, if each of my compute node have two ethernet interfaces, say eth0 and eth1, can the cluster be configured that message passing takes place only over eth1? It would be nice to have an interface devoted to just message passing. If it is possible, how would I go about setting it up? If it's not possible, is there are a lot of performance loss due to the fact that other tasks (such as administration, etc.) are also taking place over eth0? Thanks in advance for your help. -Timo Mechler -- Timo R. Mechler mechti01@luther.edu From jcandy at san.rr.com Sat Oct 30 12:41:47 2004 From: jcandy at san.rr.com (Jeff Candy) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] PVFS on 80 proc (40 node) cluster Message-ID: <4183EE7B.7050804@san.rr.com> Greetings, Does anyone have experience with PVFS on a cluster in the range of 80 processors (40 dual nodes with gigE)? I am considering this over the usual NFS-master node stup since we expect to multiple users/jobs running concurrently. I am interested to hear any information/horror stories, etc. 
Thanks, Jeff From reuti at staff.uni-marburg.de Sat Oct 30 14:45:42 2004 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Rocks Cluster and 2 Ethernet networks In-Reply-To: <2937.172.22.17.130.1099116166.squirrel@172.22.17.130> References: <2937.172.22.17.130.1099116166.squirrel@172.22.17.130> Message-ID: <1099172742.41840b869e823@home.staff.uni-marburg.de> Hi, > I'm considering installing the Rocks cluster distro on a cluster that uses > only ethernet. As I understand it, eth0 (or first network interface) is > used for administration and also message passing if no other high speed > interface is present (e.g. myrinet). My question is, if each of my > compute node have two ethernet interfaces, say eth0 and eth1, can the > cluster be configured that message passing takes place only over eth1? It > would be nice to have an interface devoted to just message passing. If it > is possible, how would I go about setting it up? If it's not possible, is > there are a lot of performance loss due to the fact that other tasks (such > as administration, etc.) are also taking place over eth0? Thanks in > advance for your help. do you want to use the ch_p4 device of MPICH for communication? Then you simply have to set the machinefile for mpirun to include only the names of the second interface in all nodes. Maybe your queuingsystem can do this already for you. Furthermore, you have to change the setting in mpirun.args that way, that instead: MPI_HOST=`hostname` will be substituded with the name of the second interface. E.g. MPI_HOST=`hostname | sed "s/^node/internal/"` to change the name from node001 to internal001 or whatever names you use. Otherwise your machinefile will be scanned in a wrong way (wrong distribution of the processes to the nodes in the end), and the communication back from the slaves to the head node of the job will still use the wrong interface. You can simply include this at the beginning of the mpirun.arg file. If it's already set, it will no be set later in the script. Cheers - Reuti From reuti at staff.uni-marburg.de Sat Oct 30 15:50:48 2004 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] PVFS on 80 proc (40 node) cluster In-Reply-To: <4183EE7B.7050804@san.rr.com> References: <4183EE7B.7050804@san.rr.com> Message-ID: <1099176648.41841ac8ad892@home.staff.uni-marburg.de> Hi, > Does anyone have experience with PVFS on a cluster > in the range of 80 processors (40 dual nodes with > gigE)? > > I am considering this over the usual NFS-master > node stup since we expect to multiple users/jobs > running concurrently. on the one hand it sounds interesting. I would fear that in a cluster (where each node should do heavy calculations and use the own disk for local scratch data) the performance will be worse than a dedicated file server with a RAID. What programs will your cluster run and how are the users submitting the jobs? - Reuti From jcandy at san.rr.com Sat Oct 30 21:14:43 2004 From: jcandy at san.rr.com (Jeff Candy) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] PVFS on 80 proc (40 node) cluster In-Reply-To: <1099176648.41841ac8ad892@home.staff.uni-marburg.de> References: <4183EE7B.7050804@san.rr.com> <1099176648.41841ac8ad892@home.staff.uni-marburg.de> Message-ID: <418466B3.4000809@san.rr.com> Jeff: >>Does anyone have experience with PVFS on a cluster >>in the range of 80 processors (40 dual nodes with >>gigE)? 
>> >>I am considering this over the usual NFS-master >>node stup since we expect to multiple users/jobs >>running concurrently. Reuti: > on the one hand it sounds interesting. I would fear that in a cluster (where > each node should do heavy calculations and use the own disk for local scratch > data) the performance will be worse than a dedicated file server with a RAID. > What programs will your cluster run and how are the users submitting the jobs? - the program is a large physics code that does I/O (200KB or less) every 10 to 60 sec. Every 10min or so, a 100MB file is written. - users will submit with PBS (typically, I expect <= 3 jobs to run concurrently). - I want a *single* filesystem, so no local scratch will be used. Are you in favour of a single master with a RAID filesystem, NFS mounted by all nodes? I wonder what fraction of systems now use this scheme. Thanks for your input. Jeff From mechti01 at luther.edu Sat Oct 30 21:54:41 2004 From: mechti01 at luther.edu (mechti01) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Rocks Cluster and 2 Ethernet networks In-Reply-To: <1099172742.41840b869e823@home.staff.uni-marburg.de> References: <2937.172.22.17.130.1099116166.squirrel@172.22.17.130> <1099172742.41840b869e823@home.staff.uni-marburg.de> Message-ID: <1152.172.22.17.130.1099198481.squirrel@172.22.17.130> Hi Reuti, Thanks for your help. I have not installed Rocks just yet. Can you explain to me what the ch_p4 device of MPI_CH is? Nochmals, vielen Dank! -Timo > Hi, > >> I'm considering installing the Rocks cluster distro on a cluster that >> uses >> only ethernet. As I understand it, eth0 (or first network interface) is >> used for administration and also message passing if no other high speed >> interface is present (e.g. myrinet). My question is, if each of my >> compute node have two ethernet interfaces, say eth0 and eth1, can the >> cluster be configured that message passing takes place only over eth1? >> It >> would be nice to have an interface devoted to just message passing. If >> it >> is possible, how would I go about setting it up? If it's not possible, >> is >> there are a lot of performance loss due to the fact that other tasks >> (such >> as administration, etc.) are also taking place over eth0? Thanks in >> advance for your help. > > do you want to use the ch_p4 device of MPICH for communication? Then you > simply > have to set the machinefile for mpirun to include only the names of the > second > interface in all nodes. Maybe your queuingsystem can do this already for > you. > Furthermore, you have to change the setting in mpirun.args that way, that > instead: > > MPI_HOST=`hostname` > > will be substituded with the name of the second interface. E.g. > > MPI_HOST=`hostname | sed "s/^node/internal/"` > > to change the name from node001 to internal001 or whatever names you use. > Otherwise your machinefile will be scanned in a wrong way (wrong > distribution > of the processes to the nodes in the end), and the communication back from > the > slaves to the head node of the job will still use the wrong interface. You > can > simply include this at the beginning of the mpirun.arg file. If it's > already > set, it will no be set later in the script. 
From mechti01 at luther.edu Sat Oct 30 21:54:41 2004
From: mechti01 at luther.edu (mechti01)
Date: Wed Nov 25 01:03:31 2009
Subject: [Beowulf] Rocks Cluster and 2 Ethernet networks
In-Reply-To: <1099172742.41840b869e823@home.staff.uni-marburg.de>
References: <2937.172.22.17.130.1099116166.squirrel@172.22.17.130> <1099172742.41840b869e823@home.staff.uni-marburg.de>
Message-ID: <1152.172.22.17.130.1099198481.squirrel@172.22.17.130>

Hi Reuti,

Thanks for your help. I have not installed Rocks just yet. Can you explain to me
what the ch_p4 device of MPICH is? Thanks again!

-Timo

> Hi,
>
>> I'm considering installing the Rocks cluster distro on a cluster that uses
>> only ethernet. As I understand it, eth0 (or the first network interface) is
>> used for administration and also for message passing if no other high-speed
>> interface is present (e.g. Myrinet). My question is: if each of my
>> compute nodes has two ethernet interfaces, say eth0 and eth1, can the
>> cluster be configured so that message passing takes place only over eth1? It
>> would be nice to have an interface devoted to just message passing. If it
>> is possible, how would I go about setting it up? If it's not possible, is
>> there a lot of performance loss due to the fact that other tasks (such
>> as administration, etc.) are also taking place over eth0? Thanks in
>> advance for your help.
>
> Do you want to use the ch_p4 device of MPICH for communication? Then you simply
> have to set the machinefile for mpirun to include only the names of the second
> interface on all nodes. Maybe your queuing system can already do this for you.
> Furthermore, you have to change the setting in mpirun.args so that, instead of:
>
> MPI_HOST=`hostname`
>
> the name of the second interface is substituted, e.g.:
>
> MPI_HOST=`hostname | sed "s/^node/internal/"`
>
> to change the name from node001 to internal001, or whatever names you use.
> Otherwise your machinefile will be scanned in the wrong way (leading to a wrong
> distribution of the processes to the nodes), and the communication back from the
> slaves to the head node of the job will still use the wrong interface. You can
> simply include this at the beginning of the mpirun.args file; if MPI_HOST is
> already set there, it will not be set again later in the script.
>
> Cheers - Reuti
> --

From reuti at staff.uni-marburg.de Sun Oct 31 01:54:10 2004
From: reuti at staff.uni-marburg.de (Reuti)
Date: Wed Nov 25 01:03:31 2009
Subject: [Beowulf] Rocks Cluster and 2 Ethernet networks
In-Reply-To: <1152.172.22.17.130.1099198481.squirrel@172.22.17.130>
References: <2937.172.22.17.130.1099116166.squirrel@172.22.17.130> <1099172742.41840b869e823@home.staff.uni-marburg.de> <1152.172.22.17.130.1099198481.squirrel@172.22.17.130>
Message-ID: <1099216450.4184b6425cef7@home.staff.uni-marburg.de>

Hi Timo,

> Thanks for your help. I have not installed Rocks just yet. Can you
> explain to me what the ch_p4 device of MPICH is? Thanks again!

MPI is a standardized interface for writing parallel programs. MPICH is one
implementation of this standard (there are others, including commercial ones).
Inside MPICH you have different "devices" for the communication between the
nodes to choose from, so you can pick the one that best fits your computer
system and network. The ch_p4 device is the one that uses the p4 communication
library. It uses rsh/ssh to start the tasks on the nodes; other devices need a
special daemon running on each node. All the programs we use rely on the ch_p4
device.

Cheers - Reuti
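To make the ch_p4 workflow concrete, a minimal session might look like the sketch below. The program name, host names, and the choice of ssh are assumptions for illustration (ch_p4 starts remote tasks with rsh by default; the P4_RSHCOMMAND environment variable selects an alternative).

    # build against MPICH with its compiler wrapper
    mpicc -o my_program my_program.c

    # have ch_p4 start the remote processes with ssh instead of rsh
    P4_RSHCOMMAND=ssh
    export P4_RSHCOMMAND

    # machines: one host name per line, e.g. the eth1 names from the
    # earlier posting if the traffic should stay on the second network
    mpirun -np 4 -machinefile machines ./my_program

No daemon is needed on the compute nodes for this device; each task is started over rsh/ssh, which is why passwordless login from the head node to every node is required.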
From reuti at staff.uni-marburg.de Sun Oct 31 04:29:43 2004
From: reuti at staff.uni-marburg.de (Reuti)
Date: Wed Nov 25 01:03:31 2009
Subject: [Beowulf] PVFS on 80 proc (40 node) cluster
In-Reply-To: <418466B3.4000809@san.rr.com>
References: <4183EE7B.7050804@san.rr.com> <1099176648.41841ac8ad892@home.staff.uni-marburg.de> <418466B3.4000809@san.rr.com>
Message-ID: <1099225783.4184dab7129b7@home.staff.uni-marburg.de>

> Reuti:
>
> > What programs will your cluster run, and how are the users submitting the
> > jobs?
>
> Jeff:
>
> - The program is a large physics code that does I/O
> (200KB or less) every 10 to 60 sec. Every 10 min or
> so, a 100MB file is written.

That is completely different from our requirements. We share /home with the
(small) input files, and each node needs a large local /scratch space (100GB
and more).

> - I want a *single* filesystem, so no local scratch
> will be used.

A single file system for /home and /scratch (will your software need a common
/scratch space)? Is there any fault tolerance in PVFS in case a disk or node
fails? Another option could be IBM's GPFS if you need a big and fast common
file space.

Cheers - Reuti

From brian at cypher.acomp.usf.edu Sun Oct 31 19:14:44 2004
From: brian at cypher.acomp.usf.edu (Brian Smith)
Date: Wed Nov 25 01:03:31 2009
Subject: [Beowulf] PVFS on 80 proc (40 node) cluster
In-Reply-To: <4183EE7B.7050804@san.rr.com>
References: <4183EE7B.7050804@san.rr.com>
Message-ID: <1099278884.24797.17.camel@ava>

Jeff,

You should definitely consider PVFS or another parallel filesystem over NFS
mounting for concurrent scratch space. I read your requirements, the number of
writes, etc., in another post, and those writes would likely flood even the most
respectable file server. PVFS2 has much-improved fault tolerance over PVFS1 in
that there can be redundant file nodes, whereas with PVFS1, if one node dropped
dead, your FS was toast. If you go to their web site, there should be plenty of
documentation on how to set it up. You may also want to consider investigating
GFS from Red Hat and Lustre.

Brian Smith

On Sat, 2004-10-30 at 12:41 -0700, Jeff Candy wrote:
> Greetings,
>
> Does anyone have experience with PVFS on a cluster
> in the range of 80 processors (40 dual nodes with
> gigE)?
>
> I am considering this over the usual NFS-master
> node setup since we expect multiple users/jobs
> running concurrently.
>
> I am interested to hear any information/horror
> stories, etc.
>
> Thanks,
>
> Jeff
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf