From kragen at pobox.com Sun Dec 1 01:53:18 2002 From: kragen at pobox.com (Kragen Sitaker) Date: Wed Nov 25 01:02:53 2009 Subject: Availability of MPI over inet2 protocol Message-ID: <20021201095318.4912A3F541@panacea.canonical.org> Don Becker writes: > Improve in what way? IPv6 is slower and more complex. IIRC, the designers of IPv6 left out the parts of IPv4 that tended to bottleneck routers, like recalculating header checksums after decrementing the TTL; they intended to make IPv6 faster than IPv4 on the same amount of hardware. I guess you think they failed? -- Kragen Sitaker Edsger Wybe Dijkstra died in August of 2002. The world has lost a great man. See http://advogato.org/person/raph/diary.html?start=252 and http://www.kode-fu.com/geek/2002_08_04_archive.shtml for details. From patrick at myri.com Sun Dec 1 05:10:41 2002 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:02:53 2009 Subject: Availability of MPI over inet2 protocol In-Reply-To: <20021201095318.4912A3F541@panacea.canonical.org> References: <20021201095318.4912A3F541@panacea.canonical.org> Message-ID: <1038748245.634.11.camel@asterix> On Sun, 2002-12-01 at 04:53, Kragen Sitaker wrote: > Don Becker writes: > > Improve in what way? IPv6 is slower and more complex. > > IIRC, the designers of IPv6 left out the parts of IPv4 that tended to > bottleneck routers, like recalculating header checksums after > decrementing the TTL; they intended to make IPv6 faster than IPv4 on > the same amount of hardware. I guess you think they failed? Do you think it will make a bit of a difference at the MPI level ? There are some characteristics of long distance networking that you will never change, like the speed of light. IPv4 or IPv6, don't expect good performance running MPI codes across the Internet, unless there is no communication. When the Grid Computing community will finally realize that... :-) Patrick -- Patrick Geoffray, Phd Myricom, Inc. http://www.myri.com From becker at scyld.com Sun Dec 1 06:12:40 2002 From: becker at scyld.com (Donald Becker) Date: Wed Nov 25 01:02:53 2009 Subject: Availability of MPI over inet2 protocol In-Reply-To: <1038748245.634.11.camel@asterix> Message-ID: On 1 Dec 2002, Patrick Geoffray wrote: > On Sun, 2002-12-01 at 04:53, Kragen Sitaker wrote: > > Don Becker writes: > > > Improve in what way? IPv6 is slower and more complex. > > > > IIRC, the designers of IPv6 left out the parts of IPv4 that tended to > > bottleneck routers, like recalculating header checksums after > > decrementing the TTL; they intended to make IPv6 faster than IPv4 on > > the same amount of hardware. I guess you think they failed? One simplication for routers, compared against the significantly larger and more complex header. The routers now have many more decisions to make based on the address. There were hand-waving arguments that larger addresses would somehow make geographical addressing possible, and that there would somehow be rational allocations of the address space to simplify routing. > Do you think it will make a bit of a difference at the MPI level ? > > There are some characteristics of long distance networking that you will > never change, like the speed of light. IPv4 or IPv6, don't expect good > performance running MPI codes across the Internet, unless there is no > communication. > When the Grid Computing community will finally realize that... 
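
Patrick's speed-of-light point is easy to make concrete. A minimal Python sketch, assuming signals travel in fibre with refractive index ~1.5 and taking a nominal 10 microsecond one-way latency for a local cluster interconnect as the comparison point (both are round illustrative figures, not measurements):

    # Back-of-the-envelope: wide-area round-trip time floor set by the speed
    # of light in fibre, versus a nominal local cluster interconnect latency.
    # Assumed figures: refractive index ~1.5, 10 us one-way cluster latency.

    C_VACUUM = 299_792_458.0        # m/s
    C_FIBER = C_VACUUM / 1.5        # light is roughly a third slower in glass
    CLUSTER_LATENCY_S = 10e-6       # nominal one-way latency of a fast LAN/SAN

    for km in (1, 100, 1000, 5000, 10000):
        one_way = (km * 1000.0) / C_FIBER      # seconds, straight-line fibre
        rtt_ms = 2 * one_way * 1000.0
        ratio = one_way / CLUSTER_LATENCY_S
        print(f"{km:>6} km: RTT >= {rtt_ms:8.2f} ms "
              f"(~{ratio:,.0f}x a {CLUSTER_LATENCY_S * 1e6:.0f} us cluster hop)")

Even before routers, queues and protocol stacks are counted, a transcontinental hop is three to four orders of magnitude slower than a local message, which is the point being made above.
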
:-) It amazes me when I give a talk and am pounded with hostile questions about communication latency, bisection bandwidth, parallel file I/O, the impossibility of converting a specific algorithm, and application availability. The same audience, just a few minutes later, nods in agreements with grid computing, cycle scavenging and "peer-to-peer" (sic) computing claiming to be able to do the same applications. -- Donald Becker becker@scyld.com Scyld Computing Corporation http://www.scyld.com 410 Severn Ave. Suite 210 Scyld Beowulf cluster system Annapolis MD 21403 410-990-9993 From mof at labf.org Sun Dec 1 01:47:34 2002 From: mof at labf.org (Mof) Date: Wed Nov 25 01:02:53 2009 Subject: Java clustering.... Message-ID: <200212012017.34811.mof@labf.org> G'day all, I've been wandering, whether there are any of you that use Java in your clusters. Ifso, how do you find the speed compared to other langauges/environments ? I do almost all of my development in Java, simply because I find it so much faster to prototype in. Mof. From eugen at leitl.org Sun Dec 1 07:42:52 2002 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:02:53 2009 Subject: Availability of MPI over inet2 protocol In-Reply-To: Message-ID: On Sun, 1 Dec 2002, Donald Becker wrote: > One simplication for routers, compared against the significantly > larger and more complex header. The routers now have many more > decisions to make based on the address. There were hand-waving > arguments that larger addresses would somehow make geographical > addressing possible, and that there would somehow be rational True geographic addressing would work with say, labelling the nodes with polar coordinates. 128 bit would be more than sufficient for that (even the NIC raw MAC address space allows for roughly 1 node/m^2 of Earth surface). But I understand one can't just pick up a large IPv6 address subspace and run with it, to say use it for labeling nodes in an ad hoc wireless network. > allocations of the address space to simplify routing. Assuming node ID being derived from, say, WGS 84 in a straightforward notion (latitude|longitude|height) and the network connectivity would not create obstructions to packets (either following a regular mesh, or being a high-dimensional grid) then only a few parallelizable vector lookups (of the 'this link goes to the node closest to target address, so stream packet there') are sufficient for a routing decision. This could be easily accomplished at relativistic speeds, while the header packets are streaming by. Given that at 10 GBps a bit is just <3 cm long a short coil of fiber after the beam splitter would do as a shallow FIFO so it could be done purely photonically. In case of using NIC MACs only trivial modification to switch ASICs would be necessary to allow them to handle true meshes instead of just trees (and limited by the MAC lookup size, too). From waitt at saic.com Sun Dec 1 08:02:51 2002 From: waitt at saic.com (Tim Wait) Date: Wed Nov 25 01:02:53 2009 Subject: Java clustering.... References: <200212012017.34811.mof@labf.org> Message-ID: <3DEA32AB.9040206@saic.com> > I've been wandering, whether there are any of you that use Java in your > clusters. I have a few users who use Java for some EP genetic stuff. I also have a group of users that have a sunblade running an oracle server tied into a Myrinet cluster, doing java/web based data-mining. 
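
Eugen's "stream the packet towards the neighbour closest to the target coordinates" rule above is easy to state in software, even though he imagines it done in switch hardware. A toy Python sketch, assuming node addresses are simply packed (latitude, longitude) pairs and using a flat planar distance; real WGS 84 geometry and the known failure mode of greedy forwarding (getting stuck in a local minimum) are deliberately left out:

    import math
    import struct

    def addr_from_coords(lat_deg: float, lon_deg: float) -> bytes:
        """Pack a (lat, lon) pair into a 16-byte 'address' (toy stand-in for
        a coordinate-derived 128-bit node ID)."""
        return struct.pack("!dd", lat_deg, lon_deg)

    def coords_from_addr(addr: bytes) -> tuple:
        return struct.unpack("!dd", addr)

    def distance(a: bytes, b: bytes) -> float:
        """Planar approximation; enough to illustrate the routing rule."""
        (lat1, lon1), (lat2, lon2) = coords_from_addr(a), coords_from_addr(b)
        return math.hypot(lat1 - lat2, lon1 - lon2)

    def next_hop(here: bytes, neighbours: list, dest: bytes) -> bytes:
        """Greedy geographic forwarding: pick the neighbour nearest the
        target, and only forward if it is strictly closer than we are."""
        best = min(neighbours, key=lambda n: distance(n, dest))
        return best if distance(best, dest) < distance(here, dest) else here

    # Tiny example: route from roughly Annapolis towards roughly Pasadena.
    here = addr_from_coords(38.98, -76.49)
    neighbours = [addr_from_coords(39.0, -77.0), addr_from_coords(40.7, -74.0)]
    dest = addr_from_coords(34.15, -118.14)
    print(coords_from_addr(next_hop(here, neighbours, dest)))

In a dense, obstruction-free mesh the decision per hop is a handful of compare-and-select operations, which is the property the photonic argument above relies on; in sparse or irregular networks a fallback is needed whenever no neighbour is closer to the target.
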
That is, there are a boatload of java server processes running on each node - As far as I'm concerned, this isn't HPC - so I just ignore them for the most part -- they paid for the equipment. The latencies must be horrendous. It is running IP over Myrinet, on the other hand... > Ifso, how do you find the speed compared to other langauges/environments ? > I do almost all of my development in Java, simply because I find it so much > faster to prototype in. I haven't looked at this for a while. Looks like there are a few beta implementations of faster communication libraries or native compilers that purport to have much lower comm latencies than RMI. Anyone else have numbers to bandy about? Tim From mehmet.suzen at bristol.ac.uk Sun Dec 1 09:21:51 2002 From: mehmet.suzen at bristol.ac.uk (Mehmet Suzen) Date: Wed Nov 25 01:02:53 2009 Subject: Java clustering.... In-Reply-To: <200212012017.34811.mof@labf.org> References: <200212012017.34811.mof@labf.org> Message-ID: <1333453.1038763310@pc168.maths.bris.ac.uk> Hi, There are some interfaces to MPI: http://perun.hscs.wmin.ac.uk/JavaMPI/ https://mailer.csit.fsu.edu/mailman/listinfo/java-mpi/ but I can't see any good reason to use java for HPC. Mehmet --On 01 December 2002 8:17pm +1030 Mof wrote: > G'day all, > I've been wandering, whether there are any of you that use Java in your > clusters. > Ifso, how do you find the speed compared to other langauges/environments ? > I do almost all of my development in Java, simply because I find it so > much faster to prototype in. > > Mof. From lindahl at keyresearch.com Sun Dec 1 14:27:28 2002 From: lindahl at keyresearch.com (Greg Lindahl) Date: Wed Nov 25 01:02:53 2009 Subject: Java clustering.... In-Reply-To: <1333453.1038763310@pc168.maths.bris.ac.uk>; from mehmet.suzen@bristol.ac.uk on Sun, Dec 01, 2002 at 05:21:51PM -0000 References: <200212012017.34811.mof@labf.org> <1333453.1038763310@pc168.maths.bris.ac.uk> Message-ID: <20021201142728.A18536@wumpus.attbi.com> On Sun, Dec 01, 2002 at 05:21:51PM -0000, Mehmet Suzen wrote: > but I can't see any good reason to use java for HPC. It used to be the case that HPC meetings had lots of papers on Java, but while it's fairly easy to use, performance and scalability are higher on most people's lists. At one recent meeting I got a big laugh during the end-of-meeting summary by noting that the words "Java" and "supercomputing" had not appeared in the same sentence during the entire meeting. -- greg From schweng at master1.astro.unibas.ch Mon Dec 2 07:09:50 2002 From: schweng at master1.astro.unibas.ch (Hans Schwengeler) Date: Wed Nov 25 01:02:53 2009 Subject: (no subject) Message-ID: <200212021509.QAA29420@master1.astro.unibas.ch> Hello, I'd like to know how to set up two networks on the same cluster. The first network would consist of the currently installed fast ethernet network (100 Mbit/s) and serve for booting and maybe NFS service. The second network would consist of the planned gigabit ethernet network (1000 Mbit/s) and serve the internode communication (MPI, PVM). How should I set up the interfaces, IP numbers and MPI to accommodate this? Any experience with such a situation? I have a scyld bz27-8 system with redhat 6.2. Yours, Hans Schwengeler From shewa at inel.gov Mon Dec 2 08:21:04 2002 From: shewa at inel.gov (Andrew Shewmaker) Date: Wed Nov 25 01:02:53 2009 Subject: Java clustering.... 
In-Reply-To: <200212012017.34811.mof@labf.org> References: <200212012017.34811.mof@labf.org> Message-ID: <20021202092104.5bb6754f.shewa@inel.gov> On Sun, 1 Dec 2002 20:17:34 +1030 Mof wrote: > G'day all, > > I've been wandering, whether there are any of you that use Java in your > clusters. > Ifso, how do you find the speed compared to other langauges/environments ? > I do almost all of my development in Java, simply because I find it so much > faster to prototype in. I haven't used either of these, but I have a group of programmers who were also looking at java-based cluster toolkits and saw these recently. http://www.cs.vu.nl/manta/ "Manta compiles Java source codes to x86 executables. Manta supports the complete Java 1.1 language, including exceptions, garbage collection and dynamic class loading. Manta also supports some Java extentions, such as the JavaParty programming model (the 'remote' keyword), replicated objects (described at JavaGrande 2000), and efficient divide and conquer parallelism (the 'spawn' and 'sync' keywords from cilk). The divide and conquer system is called 'Satin' and was described at Euro-Par 2000 and PPoPP'01. Furthermore, we have built a distributed shared memory (DSM) system on top of Manta, called Jackal" http://vip.6to23.com/jcluster/ "We present Jcluster toolkit, a high-performance Java parallel environment, implemented entirely in Java. Jcluster automatically balances resource load across the large-scale heterogeneous cluster with a transitive random stealing algorithm and provides simple high-performance PVM-like and MPI-like message passing interfaces with multithreaded communications using UDP protocol." Andrew -- Andrew Shewmaker Associate Engineer Phone: 208.526.1415 Fax: 208.526.4017 Idaho National Engineering and Environmental Laboratory 2525 Fremont Ave. Idaho Falls, ID 83415-3605 From patrick at myri.com Tue Dec 3 04:50:08 2002 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:02:53 2009 Subject: [BProc] Master node in Clustermatic (BProc) In-Reply-To: <20021128183126.N67139-100000@hoolan.org> References: <20021128183126.N67139-100000@hoolan.org> Message-ID: <1038919810.576.19.camel@asterix> Hi, On Thu, 2002-11-28 at 05:46, Yung-Sheng Tang wrote: > node doesn't appear in the node list so GM ID of master node can not be > found, mpich-gm that Clustermatic supplied needs GM ID to establish > communication, not IP address. I have fast ethernet and myrinet > interface each in my cluster nodes, master and slaves. Anyway, thank you > for your reply. The newest MPICH-GM (based on MPICH 1.2.4) has a new mpirun.ch_gm script. However, it does not support Bproc yet. It's in my queue, not at the top but quite high, so it will happen. Patrick -- Patrick Geoffray, Phd Myricom, Inc. http://www.myri.com From fant at pobox.com Tue Dec 3 06:57:37 2002 From: fant at pobox.com (Andrew Fant) Date: Wed Nov 25 01:02:53 2009 Subject: Locality and caching in parallel/distributed file systems Message-ID: <20021203094729.O95569-100000@net.bluemoon.net> Morning all, Lately, I have been thinking a lot about parallel filesystems in the most un-rigourous way possible. Knowing that PVFS simply stripes the data across the participating filesystems, I was wondering if anyone had tried to apply caching technology and file migration capacities to a parallel/distributed filesytem in a manner analagous to SGI's ccNuma memory architecture. 
That is, distributing files in the FS to various nodes, keeping track of where the accesses are coming from, and moving the file to another node if that is where some suitable percentage of the reads and/or writes are coming from. Also, potentially allowing blocks from local files to be cached in disk on a local node until a write to those blocks elsewhere invalidates the cache (I know the semantics for this theoretically are in NFS, but NFS doesn't scale, and is dead 8-). I admit that I am not a computer science graduate, nor a semi-professional developer, so I have no idea if this has been or could be done, but it keeps rattling around in my head as an idea, and I would appreciate any feedback that people can give. Please forgive my ignorance if I turn out to have reinvented the edsel. Andy Andrew Fant | This | "If I could walk THAT way... Molecular Geek | Space | I wouldn't need the talcum powder!" fant@pobox.com | For | G. Marx (apropos of Aerosmith) Boston, MA USA | Hire | http://www.pharmawulf.com From erwan at mandrakesoft.com Tue Dec 3 08:11:00 2002 From: erwan at mandrakesoft.com (Erwan Velu) Date: Wed Nov 25 01:02:53 2009 Subject: Locality and caching in parallel/distributed file systems In-Reply-To: <20021203094729.O95569-100000@net.bluemoon.net> References: <20021203094729.O95569-100000@net.bluemoon.net> Message-ID: <1038931860.21717.144.camel@revolution.mandrakesoft.com> On Tue 03/12/2002 at 15:57, Andrew Fant wrote: >I know the semantics for this theoretically are in NFS, but NFS doesn't scale, and is dead 8-). I've heard about a parallel NFS; you may have a look at this project: http://www-id.imag.fr/Laboratoire/Membres/Lombard_Pierre/nfsp/ -- Erwan Velu Linux Cluster Distribution Project Manager MandrakeSoft 43 rue d'aboukir 75002 Paris Phone Number : +33 (0) 1 40 41 17 94 Fax Number : +33 (0) 1 40 41 92 00 Web site : http://www.mandrakesoft.com OpenPGP key : http://www.mandrakesecure.net/cks/ -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20021203/2cc82ae3/attachment.bin From hahn at physics.mcmaster.ca Tue Dec 3 08:37:52 2002 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:02:53 2009 Subject: Locality and caching in parallel/distributed file systems In-Reply-To: <20021203094729.O95569-100000@net.bluemoon.net> Message-ID: > appreciate any feedback that people can give. Please forgive my ignorance > if I turn out to have reinvented the edsel. well, much of what you describe was in AFS/Coda/Intermezzo. some of it is indeed Edsel-ish, though it's interesting to look at machine balance then versus now:

            diskSz  diskSp  netSp    memSz  memSp   CPU
    then    30MB    1MBps   1MBps    1MB    20MBps  4mips
    now     60GB    30MBps  100MBps  1GB    2GBps   4gips
    ratio   2000    30      100      1000   100     1000

anyway, the vast increase in local disk/mem/cpu capacity has, unsurprisingly, made people want to make systems more peer-to-peer. afaict, there hasn't really been an effective/thorough/general-purpose p2p filesystem, though. 
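
Andrew Fant's "move the file to wherever most of the accesses come from" idea is simple to sketch. A toy Python version, assuming a per-file sliding window of recent accesses and a hypothetical migrate callback that the real storage layer would have to provide; cache invalidation on remote writes, the hard part, is not modelled here:

    from collections import Counter, deque

    class MigrationPolicy:
        """Toy ccNUMA-style placement policy for a distributed file system.

        Tracks which node each access to a file comes from and proposes
        moving the file once some node dominates recent traffic.
        'migrate_cb' is a hypothetical hook the storage layer would provide.
        """

        def __init__(self, migrate_cb, window=1000, threshold=0.6):
            self.migrate_cb = migrate_cb    # e.g. lambda path, node: ...
            self.window = window            # how many recent accesses to keep
            self.threshold = threshold      # fraction needed to trigger a move
            self.home = {}                  # path -> node currently holding it
            self.recent = {}                # path -> deque of accessing nodes

        def record_access(self, path, node, home=None):
            if home is not None:
                self.home.setdefault(path, home)
            hist = self.recent.setdefault(path, deque(maxlen=self.window))
            hist.append(node)
            self._maybe_migrate(path)

        def _maybe_migrate(self, path):
            hist = self.recent[path]
            if len(hist) < self.window // 10:   # ignore a handful of samples
                return
            top_node, count = Counter(hist).most_common(1)[0]
            if top_node != self.home.get(path) and count / len(hist) >= self.threshold:
                self.migrate_cb(path, top_node)
                self.home[path] = top_node
                hist.clear()                    # measure afresh after a move

    # Example: node7 dominates reads of /data/genome.db, so the file moves.
    policy = MigrationPolicy(lambda p, n: print(f"migrate {p} -> {n}"), window=100)
    for _ in range(90):
        policy.record_access("/data/genome.db", "node7", home="node0")

The home map kept here is the kind of shared metadata that the follow-ups below single out as the scaling problem.
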
From landman at scalableinformatics.com Tue Dec 3 03:46:29 2002 From: landman at scalableinformatics.com (Joseph Landman) Date: Wed Nov 25 01:02:53 2009 Subject: Locality and caching in parallel/distributed file systems In-Reply-To: <20021203094729.O95569-100000@net.bluemoon.net> References: <20021203094729.O95569-100000@net.bluemoon.net> Message-ID: <1038915989.1526.21.camel@ome.ctaalliance.org> On Tue, 2002-12-03 at 19:57, Andrew Fant wrote: > Morning all, > Lately, I have been thinking a lot about parallel filesystems in > the most un-rigourous way possible. Knowing that PVFS simply stripes the > data across the participating filesystems, I was wondering if anyone had > tried to apply caching technology and file migration capacities to a > parallel/distributed filesytem in a manner analagous to SGI's ccNuma > memory architecture. That is, distributing files in the FS to various > nodes, keeping track of where the accesses are coming from, and moving > the file to another node if that is where some suitable percentage of the cough cough cough cough... Distributed parallel file systems require distributed data and local speed access to make any sense. I am sure others may disagree, but any file system that you need to shuttle metadata about will generally not scale well (unless you have a NUMAlink like speed/latency, which pushes the scaling wall way out, but it is still there). Cluster file systems have been the rage in the past as one of the next great things. I guess I advocate waiting and seeing for this, as I have not yet seen a scalable distributed file system (and if someone knows of one, which is not too painful, please let me know). My definition of a scalable distributed file system is, BTW, one that connects to every compute node, and gives local I/O speed to simultaneous reads and writes (to the same/different files) across the single namespace. This def may not be in line with others, but it is what I use to understand the issues. The idea in building any scalable resource (net, computing, disk, etc) is to avoid single points of information flow. Maintaining metadata for file systems represents exactly that. You get hot-spot formation, and start having to do interesting gymnastics to overcome it (if it is at all possible to overcome). Data motion is rapidly becoming one of the hardest issues to deal with. Good thread start there Andy! -- Joseph Landman, Ph.D. Scalable Informatics LLC email: landman@scalableinformatics.com web: http://scalableinformatics.com voice: +1 734 612 4615 fax: +1 734 398 5774 From hahn at physics.mcmaster.ca Tue Dec 3 10:33:52 2002 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:02:53 2009 Subject: Locality and caching in parallel/distributed file systems In-Reply-To: <1038915989.1526.21.camel@ome.ctaalliance.org> Message-ID: > node, and gives local I/O speed to simultaneous reads and writes (to the > same/different files) across the single namespace. This def may not be concurrent writes are always nontrivial; would you be happy assuming that the app always knows what it's doing, so the OS doesn't have to? for instance, if you write at offset 14k in a file, do you need to keep in mind that it falls within the third page-sized block (the fundamental pagecache unit on ia32 systems), which might correspond to, say, the second 8K filesystem block? > The idea in building any scalable resource (net, computing, disk, etc) > is to avoid single points of information flow. Maintaining metadata for > file systems represents exactly that. 
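
Mark Hahn's offset-14k example above is plain integer arithmetic, but it is the bookkeeping a block-caching layer has to get right. A small Python sketch, assuming the 4 KB page and 8 KB filesystem block sizes from his example, with indices counted from zero:

    PAGE_SIZE = 4096          # pagecache unit on ia32, per the example above
    FS_BLOCK_SIZE = 8192      # assumed filesystem block size from the example

    def locate(offset: int) -> dict:
        """Map a byte offset in a file to the pagecache page and the
        filesystem block that contain it (0-based indices)."""
        page = offset // PAGE_SIZE
        block = offset // FS_BLOCK_SIZE
        return {
            "offset": offset,
            "page_index": page,
            "page_range": (page * PAGE_SIZE, page * PAGE_SIZE + PAGE_SIZE - 1),
            "fs_block_index": block,
            "fs_block_range": (block * FS_BLOCK_SIZE,
                               block * FS_BLOCK_SIZE + FS_BLOCK_SIZE - 1),
        }

    # A write at offset "14k" touches page 3 (bytes 12288-16383) and
    # filesystem block 1 (bytes 8192-16383); a cross-node invalidation
    # protocol has to pick which of these granularities it tracks.
    print(locate(14 * 1024))

Whether the consistency protocol tracks pages or whole filesystem blocks decides how much false sharing one small write causes, which seems to be the trade-off being pointed at above.
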
You get hot-spot formation, and > start having to do interesting gymnastics to overcome it (if it is at > all possible to overcome). even strict consistency doesn't imply there is necessarily a bottleneck, since, for instance, your filesystem will probably not be one big, flat directory. Coda (and probably others) have examined weaker consistency. From mas at ucla.edu Tue Dec 3 22:24:37 2002 From: mas at ucla.edu (Michael Stein) Date: Wed Nov 25 01:02:53 2009 Subject: Locality and caching in parallel/distributed file systems In-Reply-To: ; from hahn@physics.mcmaster.ca on Tue, Dec 03, 2002 at 01:33:52PM -0500 References: <1038915989.1526.21.camel@ome.ctaalliance.org> Message-ID: <20021203222437.A22113@mas1.ats.ucla.edu> > concurrent writes are always nontrivial; Yes, nontrivial. I'm not even sure what "concurrent" means in a distributed environment. What's the definition of "first" if things aren't happening at the same place? From mathog at mendel.bio.caltech.edu Wed Dec 4 10:34:53 2002 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:02:53 2009 Subject: Noise abatement for a rack Message-ID: Anybody here ever try noise insulating a rack??? I have to share a room with our new 20 node cluster and it would be nice to be able to do so without having to wear earplugs all the time. The 20 2U single Athlon systems are mounted in an open frame 4 post rack. Using a Radio Shack sound level meter (catalog #33-2055) set to dBA, fast weighting the following values were obtained for (front,left side, back, right side): room ambient (no power) <50dBA (below detection limit) 1 node at 6" from side, front panel open: 66,59,67,63 1 node at 6" from side, front panel closed: 63,58,67,62 1 node at 48" from side, f.p. closed: 53,- ,- ,55 20 nodes at 6" from side, f.p. closed: 72,70,76,72 20 nodes at 48" from side, f.p. closed: 66,- ,- ,66 As you'd expect - it's all fan noise. The left side is quieter than the right because there are fewer ventilation holes on that side and the dual internal fans are mounted closer to the right side of the case. My goal is to drop the dBA at 48" down to no more than 53 dBA with all nodes operating. Sound measurements next to an Antec SX-630 case (no sound insulation, just sheet metal) were 65 dBA with the side off, 57 with the side on. So adding sides to the open rack may help a little and shouldn't be much of a problem for ventilation. But how to treat the front and back of the case??? If I close them off with a solid sheet of sound absorbing material (lead would work but something a bit less expensive and toxic would be better) the system will become really quiet - because it's going to overheat and die. Some sort of sound absorbant coated louvers maybe? Or for the back, since it's close to a wall, coat the wall with sound absorbing tile? Hopefully somebody has already dealt with this. But if the solution is out there on the web I've not found it. All the racks I saw were just sheet metal and the front/back doors, if any, were either plastic or metal grillwork - good for maybe a 3-4 dBA sound reduction. 
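
A quick consistency check on the measurements above: uncorrelated sources add as powers, so N identical machines are about 10*log10(N) dB louder than one. A short Python sketch using the front readings at 48 inches (idealised; room reflections and A-weighting interactions are ignored):

    import math

    def combine_incoherent(levels_db):
        """Total level of several uncorrelated sources, in dB."""
        return 10 * math.log10(sum(10 ** (l / 10) for l in levels_db))

    one_node_at_48in = 53.0                  # dBA, front reading, single node
    twenty_nodes = combine_incoherent([one_node_at_48in] * 20)
    print(f'20 nodes at 48": {twenty_nodes:.1f} dBA (measured: 66 dBA)')

    # Working backwards: to get 20 nodes down to 53 dBA total, each node's
    # contribution at the listening position must drop by about 13 dB.
    required_per_node = 53.0 - 10 * math.log10(20)
    print(f"per-node target: {required_per_node:.1f} dBA")

The roughly 13 dB of attenuation this asks for is far more than the 3-4 dB estimated above for plain metal or grillwork doors, so sealed, acoustically lined panels with forced airflow would probably be needed.
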
Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From lindahl at keyresearch.com Wed Dec 4 11:08:41 2002 From: lindahl at keyresearch.com (Greg Lindahl) Date: Wed Nov 25 01:02:53 2009 Subject: Noise abatement for a rack In-Reply-To: ; from mathog@mendel.bio.caltech.edu on Wed, Dec 04, 2002 at 10:34:53AM -0800 References: Message-ID: <20021204110841.A1686@wumpus.internal.keyresearch.com> On Wed, Dec 04, 2002 at 10:34:53AM -0800, David Mathog wrote: > Anybody here ever try noise insulating a rack??? You should thank your lucky stars that they're 2U cases, as 1U fans are significantly louder than 2U fans. -- greg From joelja at darkwing.uoregon.edu Wed Dec 4 12:01:28 2002 From: joelja at darkwing.uoregon.edu (Joel Jaeggli) Date: Wed Nov 25 01:02:53 2009 Subject: Noise abatement for a rack In-Reply-To: Message-ID: our chatsworth megaframe cabinets are fairly quiet relative to open racks but require directed ac, blown in under the floor and exhausted out the ceiling... This also has the effect of isolating the enviroment in the cabinets from the rest of the room somewhat, which is desireable when you have a pair of half rack sized routers drawing 1.5kw each in one cabinet. I would except that most reasonable acoustic materials are likely to cause your airflow requirements to go up given that you'll be blocking most of surface area that is currently radiating heat with something that's also thermal insulation and that you aren't currently blowing air into something that's an open rack. your temperature monitoring requierments will also go up since likelyhood of hardware failure goes way up if the airflow stops. joelja On Wed, 4 Dec 2002, David Mathog wrote: > Anybody here ever try noise insulating a rack??? > > I have to share a room with our new 20 node cluster > and it would be nice to be able to do so without having > to wear earplugs all the time. The 20 2U single Athlon > systems are mounted in an open frame 4 post rack. Using a > Radio Shack sound level meter (catalog #33-2055) set to > dBA, fast weighting the following values were obtained > for (front,left side, back, right side): > > room ambient (no power) <50dBA (below detection limit) > 1 node at 6" from side, front panel open: 66,59,67,63 > 1 node at 6" from side, front panel closed: 63,58,67,62 > 1 node at 48" from side, f.p. closed: 53,- ,- ,55 > 20 nodes at 6" from side, f.p. closed: 72,70,76,72 > 20 nodes at 48" from side, f.p. closed: 66,- ,- ,66 > > As you'd expect - it's all fan noise. The left side > is quieter than the right because there are fewer ventilation > holes on that side and the dual internal fans are mounted > closer to the right side of the case. My goal is to drop > the dBA at 48" down to no more than 53 dBA with all nodes > operating. > > Sound measurements next to an Antec SX-630 case (no > sound insulation, just sheet metal) were 65 dBA with > the side off, 57 with the side on. So adding sides to > the open rack may help a little and shouldn't be much > of a problem for ventilation. But how to treat the front > and back of the case??? If I close them off with a > solid sheet of sound absorbing material (lead would work > but something a bit less expensive and toxic > would be better) the system will become really quiet > - because it's going to overheat and die. Some sort of > sound absorbant coated louvers maybe? Or for the back, > since it's close to a wall, coat the wall with sound > absorbing tile? > > Hopefully somebody has already dealt with this. 
But > if the solution is out there on the web I've not found it. > All the racks I saw were just sheet metal and the front/back > doors, if any, were either plastic or metal grillwork - good > for maybe a 3-4 dBA sound reduction. > > Regards, > > David Mathog > mathog@caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- -------------------------------------------------------------------------- Joel Jaeggli Academic User Services joelja@darkwing.uoregon.edu -- PGP Key Fingerprint: 1DE9 8FCA 51FB 4195 B42A 9C32 A30D 121E -- In Dr. Johnson's famous dictionary patriotism is defined as the last resort of the scoundrel. With all due respect to an enlightened but inferior lexicographer I beg to submit that it is the first. -- Ambrose Bierce, "The Devil's Dictionary" From dtj at uberh4x0r.org Wed Dec 4 11:57:28 2002 From: dtj at uberh4x0r.org (Dean Johnson) Date: Wed Nov 25 01:02:53 2009 Subject: Noise abatement for a rack In-Reply-To: <20021204110841.A1686@wumpus.internal.keyresearch.com> References: <20021204110841.A1686@wumpus.internal.keyresearch.com> Message-ID: <1039031849.26626.60.camel@samer> On Wed, 2002-12-04 at 13:08, Greg Lindahl wrote: > On Wed, Dec 04, 2002 at 10:34:53AM -0800, David Mathog wrote: > > > Anybody here ever try noise insulating a rack??? > > You should thank your lucky stars that they're 2U cases, as 1U fans > are significantly louder than 2U fans. > And just not louder, the 1U fans are much more of a whine than the bigger 2U fans. I had an SGI Origin 200 (big 4U rackmount) and an SGI 1100 (1U) in my home office and it was the 1100 that drove me nuts. But it was a "type 1" problem, meaning that I would just turn up the stereo sufficiently and the problem went away. ;-) -Dean From rgb at phy.duke.edu Wed Dec 4 12:28:16 2002 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:02:53 2009 Subject: Noise abatement for a rack In-Reply-To: <20021204110841.A1686@wumpus.internal.keyresearch.com> Message-ID: On Wed, 4 Dec 2002, Greg Lindahl wrote: > On Wed, Dec 04, 2002 at 10:34:53AM -0800, David Mathog wrote: > > > Anybody here ever try noise insulating a rack??? > > You should thank your lucky stars that they're 2U cases, as 1U fans > are significantly louder than 2U fans. And note that noise is remarkably similar to heat (kinda very long wavelength heat) so that noise insulation <=> heat insulation. One doensn't USUALLY want to heat insulate the systems, and getting cold air in and hot air out while keeping noise in makes things very complicated, where complexity can equal expense and loss of stability. However, there IS a simple solution. Our server room, between the racks and the SUV-sized AC heat exchanger/blower at one end, is probably somewhere in the 60-80 dB range. Not quite painful, but probably loud enough to damage your hearing in the long run:-). So we don't go in there. Thick walls and its location in a sub-basement provide perfect sound insulation while letting the AC do its thing. If we MUST go in there for more than a while, we sound-insulate our heads, not the racks as it is much cheaper and keeps our ears warm. Just wearing headphones works wonders (and lets you listen to music instead of systems). 
Heck, for less than $100 you can get sound-cancellation headphones that reduce ambient noise AND let you listen to music without much bleed-through jet-engine background. Not too great if you want to talk to somebody, but hey, they can't hear you in there anyway unless you shout. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From James.P.Lux at jpl.nasa.gov Wed Dec 4 13:01:43 2002 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:02:53 2009 Subject: Noise abatement for a rack In-Reply-To: Message-ID: <5.1.0.14.2.20021204125802.01931b58@mailhost4.jpl.nasa.gov> Hie thee to a library or bookstore and get a book on audio recording studio design (there are a bunch aimed at amateur/garage type operations). They have a lot of very useful and practical information on noise reduction, which is truly an art. Two things to worry about: conducted through a solid object - panels, racks, etc - and then reradiated via a panel, etc. - mass and soft help (lead, sand, etc. -- rigid is bad) conducted through the air - through ducts, etc. - torturous path (length attenuates), acoustically dead (soft, massy) and nonreflective. You need to know a bit about the spectral characteristics of your noise.. LF is a different problem with different solutions than HF. Maybe a microphone on a laptop and one of the freeware spectrogram progams? You're not looking for quantitative analysis to the nearest 0.001 dB here... just a general guide to where the problem is... At 10:34 AM 12/4/2002 -0800, you wrote: >Anybody here ever try noise insulating a rack??? > >I have to share a room with our new 20 node cluster >and it would be nice to be able to do so without having >to wear earplugs all the time. The 20 2U single Athlon >systems are mounted in an open frame 4 post rack. Using a >Radio Shack sound level meter (catalog #33-2055) set to >dBA, fast weighting the following values were obtained >for (front,left side, back, right side): From lothar at triumf.ca Wed Dec 4 15:20:05 2002 From: lothar at triumf.ca (lothar@triumf.ca) Date: Wed Nov 25 01:02:53 2009 Subject: Beowulf digest, Vol 1 #1136 - 3 msgs References: <5.1.1.6.2.20021130114743.03a72090@mail.harddata.com> Message-ID: <3DEE8DA5.9020507@triumf.ca> Thank you very much for your help... So I installed a bproc system on top of RedHat 7.3 and the master seems to run just fine. I created a bootfloppy with beoboot -1 and the smallest kernel arround and an upload ramdisk with beoboot -2. It is in /var/beowulf/boot.img. The floppy in the diskless node just runs fine..gets the Rarp request, obviously connects and the messages then go: request /var/beowulf/boot.img connect: Invalid argument bootimage download error A fatal error has occurred. So, what's wrong? I tried several kernels for the boot.img. Lothar Maurice Hilarius wrote: > With regards to your message at 10:01 AM 11/30/02, > beowulf-request@beowulf.org. Where you stated: > >> Date: Fri, 29 Nov 2002 12:08:05 -0800 >> From: lothar@triumf.ca >> Organization: TRIUMF >> To: beowulf@beowulf.org >> Subject: diskless kernel compilation for RedHat 7.3 >> >> Hi, >> I am upgrading a cluster which contains 20 diskless nodes from >> Redhat 6.1 to 7.3. For the diskless nodes I need new kernels >> compiled for use in nfs-root. However, the 7.3 distribution of >> redhat does not allow to configure the compilation instructions >> accordingly (no nfs-root option). 
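
Jim Lux's laptop-microphone suggestion above takes very little code. A rough Python sketch, assuming a mono 16-bit WAV recording of the rack (the file name is made up) and using NumPy for the FFT; it reports only relative energy per band, which is enough to tell a low-frequency rumble problem from a high-frequency fan whine:

    import wave
    import numpy as np

    # Assumed input: a mono, 16-bit recording made near the rack.
    w = wave.open("rack_noise.wav", "rb")
    rate = w.getframerate()
    data = w.readframes(w.getnframes())
    w.close()

    samples = np.frombuffer(data, dtype=np.int16).astype(np.float64)
    spectrum = np.abs(np.fft.rfft(samples * np.hanning(len(samples))))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate)

    # Relative energy in coarse bands; absolute SPL would need calibration.
    bands = [(20, 200), (200, 1000), (1000, 4000), (4000, 12000)]
    total = np.sum(spectrum ** 2)
    for lo, hi in bands:
        mask = (freqs >= lo) & (freqs < hi)
        frac = np.sum(spectrum[mask] ** 2) / total
        print(f"{lo:>5}-{hi:<5} Hz: {100 * frac:5.1f}% of energy")

Calibrated SPL would need a reference tone, but for choosing between mass, damping and duct treatment the relative picture is usually what matters.
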
Any idea why and how this >> can be remedied? >> >> Thanx >> >> Lothar > > > Hi Lothar. > > You might look at running a bproc based cluster environment for this. > It would make your job a lot simpler. > > Places to look: > http://www.clustermatic.org > > We have a complete cluster installation toolset and installation, > RH7.3 based on our ftp site. > This uses a modern kernel as well. > > ftp://ftp.harddata.com/pub/hddcs/3.2/2002-10-19/ > > Help yourself! > > > With our best regards, > > Maurice W. Hilarius Telephone: 01-780-456-9771 > Hard Data Ltd. FAX: 01-780-456-9772 > 11060 - 166 Avenue mailto:maurice@harddata.com > Edmonton, AB, Canada http://www.harddata.com/ > T5X 1Y3 > > Ask me about NAS and near-line storage > From russell.b.kegley at lmco.com Wed Dec 4 12:33:30 2002 From: russell.b.kegley at lmco.com (Kegley, Russell B) Date: Wed Nov 25 01:02:53 2009 Subject: Noise abatement for a rack Message-ID: <812B30319782D4119D8600508BE32864A1DF50@emss07m06.lmtas.lmco.com> I haven't ever tried this, but what about active noise cancellation? I'm thinking the relatively cheap ($100 or so) headphones, but the idea of damping the noise in the whole room is intriguing. One URL I found that I'm going to look into further: http://users.erols.com/ruckman/ancfaq.htm On Wed, 4 Dec 2002, David Mathog wrote: > Anybody here ever try noise insulating a rack??? -- snip -- > As you'd expect - it's all fan noise. Russell Kegley Lockheed Martin Aeronautics Company Fort Worth, TX Russell.B.Kegley@lmco.com From rgb at phy.duke.edu Wed Dec 4 16:41:35 2002 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:02:53 2009 Subject: Noise abatement for a rack In-Reply-To: <812B30319782D4119D8600508BE32864A1DF50@emss07m06.lmtas.lmco.com> Message-ID: On Wed, 4 Dec 2002, Kegley, Russell B wrote: > I haven't ever tried this, but what about active noise cancellation? I'm > thinking the relatively cheap ($100 or so) headphones, but the idea of > damping the noise in the whole room is intriguing. One URL I found that > I'm going to look into further: > > http://users.erols.com/ruckman/ancfaq.htm In noise cancellation headphones, the wave path for the sound being cancelled is very short -- a cm or two -- and the sound being cancelled can be measured and phase inverted on the way in through the headphones. Trying to cancel the noise "in the whole room" with something like an array of wall mounted cancellation units is a mind-bogglingly difficult problem in wave mechanics to begin with, and (in my opinion) just can't be made to work without something like defense department resources (the problem isn't TOO dissimilar to cancellation of e.g. reflected radar images, just mind-bogglingly harder). First of all, it is noise, with little to work with in the way of coherence length or signal persistence. Second it is broad spectrum noise, with frequencies all over the place and with wavelength ranging from tens of meters (order of the size of the room) to a few centimeters (order of the size of small features in the room) bouncing and interfering all over the place. Finally, it is broad spectrum noise with wavelengths ranging etc with widely distributed independent sources. I imagine that the e.g. pressure wave profile in any plane running through the room looks something like the surface of an incredibly choppy sea. 
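
The short-wave-path argument above can be put in numbers: for a single tone, anti-noise that is inverted correctly but arrives late by dt leaves a residual of 2*|sin(pi*f*dt)| relative to the original. A small Python sketch (idealised: one frequency, no amplitude error):

    import math

    def residual_db(freq_hz, delay_s):
        """Level of (noise + delayed, inverted anti-noise) relative to the
        noise alone, for a single tone.  0 dB means no reduction; positive
        means the 'cancellation' actually made things louder."""
        r = 2 * abs(math.sin(math.pi * freq_hz * delay_s))
        return 20 * math.log10(r) if r > 0 else float("-inf")

    # A 30 cm path difference is roughly 0.9 ms of acoustic delay (343 m/s).
    for delay_us in (10, 100, 900):
        row = ", ".join(f"{f} Hz: {residual_db(f, delay_us / 1e6):+6.1f} dB"
                        for f in (100, 500, 2000, 5000))
        print(f"delay {delay_us:>4} us -> {row}")

A path mismatch of about 30 cm (roughly 0.9 ms) already makes the kilohertz range louder rather than quieter, which is the coherent-addition "hot spot" effect described below, and it is why the centimetre paths inside headphones are the easy case.
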
Cancelling the waves propagating into a single local channel (your ear) is possible because you DON'T need to sample or know about the waves anywhere but right at the entrance to the channel, and there you can easily generate a full spectrum of sound at higher amplitude than the incoming waves. Cancelling the waves on the ocean surface itself? In three dimensions? With cancellation units "far" from the sources? Not impossible, maybe (maybe it is -- have to ask a mathematician and wait a year or two:-) but at the very least you'd need to sample the waves all over the room, solve a pretty nasty mathematical problem in real time, and generate a counterwave from the array of emitters. Anywhere you fail to achieve cancellation you're likely to get phase coherent addition and a manifold INCREASE in sound intensity -- sound "hot spots". Lo, the website/FAQ above points this out (in less detail:-): Controlling a spatially complicated sound field is beyond today's technology. The sound field surrounding your house when the neighbor's kid plays his electric guitar is hopelessly complex because of the high frequencies involved and the complicated geometry of the house and its surroundings. On the other hand, it is somewhat easier to control noise in an enclosed space such as a vehicle cabin at low frequencies where the wavelength is similar to (or longer than) one or more of the cabin dimensions. Easier still is controlling low-frequency noise in a duct, where two dimensions of the enclosed space are small with respect to wavelength. The extreme case would be low-frequency noise in a small box, where the enclosed space appears small in all directions compared to the acoustic wavelength. The latter two cases, with control space small in all directions compared to nearly all the wavelengths being controlled, corresponds to noise-cancelling headphones, which I'm sure work just fine. As for the larger space, well, using the beowulf itself we could probably sample the sound each nodes is producing very close to the source. We could probably install sound cards and drive little minispeakers inside the chassis that generate cancellation antiwaves in the primary sound emitting apertures, and maybe even handle resonant acoustical waves given off by the case surfaces. By cancelling the waves within centimeters of the sources we should be able to eliminate most of the outward going wave, at each node, one node at a time, in embarrassingly parallel. That would do it for the nodes I think. To do the AC system would be harder -- we'd probably need a few nodes to do that as well, with distributed sensors and counterwave emitters close to the primary emission points (and would probably have to wrap the ductwork and so forth to actively damp it). To put it another way, cancelling the kids electric guitar wave is easy -- if you can do it with several speakers "right next to" (surrounding) the emitting speaker, or better yet, with another speaker right on top of the emitting speaker. So, an autonoise-cancellation beowulf. Might not have a lot of leftover horsepower (all the computations required would need to be pretty much in realtime). However, one could try to prototype this way, create an ASIC coprocessor for a sound card, and equip the nodes with active noise cancellation as a "feature", putting a small speaker "on top of" each fan. 
This, in all seriousness, is probably possible and mass produceable for $100 or so (so cheap because it IS a hack of existing sound cards, although you'd probably need several independent channels). Shall we write a grant proposal? Anybody want to fund a patent? rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From raysonlogin at yahoo.com Wed Dec 4 20:52:50 2002 From: raysonlogin at yahoo.com (Rayson Ho) Date: Wed Nov 25 01:02:53 2009 Subject: Fwd: new biocluster pics online Message-ID: <20021205045250.11560.qmail@web11405.mail.yahoo.com> > For the people like me who are total infrastructure geeks... > > I've put some more pictures up in the > http://bioteam.net/gallery/bioclusters gallery: > > (1) http://bioteam.net/gallery/SunLX50 > > (Not a cluster!) just images of the internal/external layout of Sun's > > new Linux-on-Intel box that bioteam is eval'ing. Also shots of 'Sun > Linux 5.0' installing. All in all not super exciting unless one has a > > particular interest in the LX50 hardware although I'd love to know if > > people can figure out which contract manufacturer is stamping these > out > > for Sun as the chassis looks super familiar :) > > (2) http://bioteam.net/gallery/HarvardStats > > 20-cpu Linux compute farm using Dell hardware. All nodes attached via > > GigE thanks to onboard 1000-TX nics on the PowerEdge 1650 nodes. > Built > > over the last 2 days for the Department of Statistics at Harvard > University. Running Sun GridEngine 5.3p2 right now and will mostly be > > used for running internally-developed software plus some common > informatics applications. The compute nodes have dual 80gig IDE > drives > and we will be experimenting with Linux software raid mirroring on a > partition or two so as to get maximum IO on local disk. This is > because > > the head node + Dell's general reputation having a really slow > performing line of PERC RAID controllers means that the head node is > going to be a performance bottleneck if it gets thrashed with NFSv3 > fileservices duties. > > > Regards, > Chris > > __________________________________________________ Do you Yahoo!? Yahoo! Mail Plus - Powerful. Affordable. Sign up now. http://mailplus.yahoo.com From eugen at leitl.org Thu Dec 5 02:47:59 2002 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:02:53 2009 Subject: IEEE 1394 Message-ID: After I've posted a link to the Oracle library for clustering over 1394 a while back Apple mumbled something about RFC 2734 (IP over 1394): http://developer.apple.com/firewire/IP_over_FireWire.html All Apple computers sold today include one or more FireWire ports. Because FireWire can transfer data at up to 400 megabits/second, it is suitable for networking and clustering solutions, as well as temporary connections to the internet using Internet Sharing. Now the IP over FireWire Preview Release adds support for using the Internet Protocol - commonly known as TCP/IP - over FireWire. With this software installed, Macintosh computers and other devices can use existing IP protocols and services over FireWire, including AFP, HTTP, FTP, SSH, etc. In all cases, Rendezvous can be used if desired for configuration, name resolution, and discovery. The preview release adds a new Kernel Extension that hooks into the existing network services architecture. 
Using the existing Network Preferences Pane, users can add FireWire as their IP network node to connect and communicate between two machines. Now developers interested in using the Internet Protocol (IP) over FireWire may download the IP over FireWire Preview Release. [the document is in some proprietary .dmg format] Since I've never seen real numbers for IEEE 1394 latency I did some websearches, and finally found some meat: "The IEEE 1394 bus has a minimum latency of a few hundred microseconds and a worst-case delay of a few milliseconds. For large data blocks, this bus uses direct memory access (DMA) similar to PCI bus mastering that reduces the influence of software protocol overhead on the transfer rate. The 400-Mb/s top data rate supports consumer digital video equipment and data acquisition devices requiring relatively fast data transfer. Bus latencies are compared in Figure 1 and bus throughput in Figure 2." Latency: http://www.evaluationengineering.com/archive/articles/0801pcbased2.gif Throughput: http://www.evaluationengineering.com/archive/articles/0801pcbased1.gif The article is at http://www.evaluationengineering.com/archive/articles/0801pcbased.htm Relax and Take the Bus by Tom Lecklider, Technical Editor Depending on your experience, the phrase PC-based instrumentation conjures up a variety of images. You may be contemplating using PC-based test for the first time, you may have used it in the past, or you may think it.s not suitable for your present application. If you.re in the last group, make sure you know how both PCs and instruments have changed before you make a final decision. On the PC side, two new buses have addressed the time when PCs will have no internal expansion slots. And, instruments have become smarter and leaner. Recently developed low-power semiconductor technology has provided high-speed digital signal processors (DSPs) and high-precision analog-to-digital converters (ADCs) that can operate from only a few watts. Instrument design has made better trade-offs between essential hardware functions and those things that can be done well by a Pentium-class host. The end result is smaller, lower cost instrumentation in a convenient and easy-to-use form. When they were introduced, the universal serial bus (USB) and the IEEE 1394 bus (aka FireWire from its Apple Computer roots) were easily distinguished by their relative speeds. USB version 1.1 (USB1) peripherals include mice, keyboards, floppy disk drives, and other devices with bit rates below 12 Mb/s. In fact, the USB1 specification established both a low-speed 1.5-Mb/s mode and a high-speed 12-Mb/s mode. However, neither was very impressive compared to FireWire.s 400-Mb/s rate. On the other hand, whether due to better PR, FireWire license fees, or the very low cost of USB peripherals, USB has clearly eclipsed FireWire at this time. Virtually all PCs ship with built-in USB ports, but only a few computer manufacturers include FireWire. So, one major USB advantage is ubiquity. Another is the recent version 2.0 (USB2) specification, calling for a top rate of 480 Mb/s. Of course, nothing comes for free, and comparable speed doesn.t necessarily equate to equivalent performance. The Buses Compared Both USB and FireWire are serial buses with hot-plug-and-play capabilities. This feature allows you to safely add or remove devices from the bus while the PC and any connected bus hubs are powered. 
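
For cluster use, the latency quoted above matters more than the headline 400 Mb/s. A rough Python model, taking 300 microseconds as a stand-in for the "few hundred microseconds" above and a nominal 10 microsecond, 2 Gb/s cluster interconnect as the comparison; all four numbers are illustrative assumptions, not measurements:

    def transfer_time_us(msg_bytes, latency_us, bandwidth_mbit_s):
        """First-order model: time = per-message latency + size/bandwidth."""
        return latency_us + msg_bytes * 8 / bandwidth_mbit_s  # bits/(Mbit/s) = us

    links = {
        "IEEE 1394 (assumed 300 us, 400 Mb/s)": (300.0, 400.0),
        "cluster interconnect (assumed 10 us, 2000 Mb/s)": (10.0, 2000.0),
    }

    for size in (64, 1024, 65536, 1 << 20):
        line = "  ".join(
            f"{name.split(' (')[0]}: {transfer_time_us(size, lat, bw) / 1000:8.3f} ms"
            for name, (lat, bw) in links.items())
        print(f"{size:>8} B  {line}")

On those assumptions a message must be roughly 15 KB (latency times bandwidth) before the 1394 link reaches even half its nominal bandwidth, so small-message cluster traffic would be almost entirely latency-bound.
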
There is one major difference between the buses: USB always requires a PC master while the 1394 bus provides peer-to-peer communications without PC intervention. USB1 transfers a 1,500-B frame every millisecond, and the frame is shared by all connected USB devices.up to a maximum of 127. This means that actual data transfer for any one device could be as slow as one data point per two or three frames, although the useful composite rate is about 1.16 MB/s. Passing all communications through the central PC makes possible very low-cost USB peripherals because they require minimal intelligence. The downside is increased transfer latency. It.s quite variable but as high as 8 or 9 ms for USB1. For that reason, USB1 is not a good bus choice for single-point transfers because its high latency limits you to low-speed monitoring and slowly changing temperature or pressure measurements. USB2.s 480-Mb/s top rate improves burst transfer speed greatly. Its latency also improves because of the new 125-?s microframes rather than USB1.s 1-ms frames. But, as described by Andy Purcell, a software design engineer at Agilent Technologies, .USB2 still is a master-slave architecture and will have an inherent fixed latency. The latency occurs because a USB slave cannot just send data when it is available. It must wait to send the data until asked for it. The latency is independent of CPU speed.. The IEEE 1394 bus has a minimum latency of a few hundred microseconds and a worst-case delay of a few milliseconds. For large data blocks, this bus uses direct memory access (DMA) similar to PCI bus mastering that reduces the influence of software protocol overhead on the transfer rate. The 400-Mb/s top data rate supports consumer digital video equipment and data acquisition devices requiring relatively fast data transfer. Bus latencies are compared in Figure 1 and bus throughput in Figure 2. Just as the USB specification has been upgraded, so too is there a 1394b version that will supersede the present 1394a. The proposed changes extend the top 400-Mb/s rate to 800, 1,600, and ultimately 3,200 Mb/s. However, although USB2 retains common protocol and operation with USB1, 1394b may not be entirely backward compatible with 1394a. Until the dust settles, manufacturers haven.t committed to 1394b silicon, preferring to back an unambiguous USB2. To extend the USB realm of addressable applications further, the USB Implementers Forum (USB-IF) has proposed a USB On-The-Go subset of USB2. This specification enhancement would allow USB peripherals to exchange information directly, without the need for an intervening PC. So, you could download images from your digital camera directly to your printer without having to go through a PC between the two devices. On the other hand, a PC would be required if, for example, you wanted to crop, enhance, or otherwise edit the image prior to printing or if you needed to archive it on disk. According to a recent article by Jeanne Graham, .USB is not necessarily a better technology than Bluetooth or 1394, but it has deployed better marketing campaigns.. Ms. Graham also quoted Bert McComas, an analyst at InQuest Market Research: .A consumer product manufacturer will say, .Give me one good reason to go with USB.. Well, one good reason is that every PC in the world has a USB port..1 The Industrial Case The USB.s advantages for consumer applications seem to be equally valid for industrial users. 
Ease of use, low cost, and worldwide independence from AC supply considerations influenced Herb Figel.s decision to purchase a Dactron Photon Spectrum Analyzer. Mr. Figel, the director of quality assurance at Hunter Fan, already had some experience with USB peripherals, having previously bought a digitizing pad and a device to synchronize his Palm organizer with his PC. He commented that the USB spectrum analyzer provided similar measurements to an older, large bench instrument, but that its user interface was much superior. The Photon instrument has an upper frequency limit of 21 kHz and is entirely powered by the USB connection. It was a good fit to Mr. Figel.s ceiling-fan noise-measurement application with frequencies in the 100-Hz to 1-kHz range. Had he needed multimegahertz speeds, he wouldn.t have found an instrument that operated within the USB.s meager 2.5-W power limit, although fast PCI-bus cards are readily available. So, he could have retained a PC-based test system, but it wouldn.t have been as simple and convenient as that made possible by USB. As an example, the Gage Applied CompuScope 14100 is a dual-channel, 100-MS/s, 14-bit resolution PCI card. It achieves sustained 100-MB/s data transfer rates via PCI bus mastering under single-tasking operating systems. On-board memory ranges from 1 MS to 1 GS, and the card draws from 25 to 35 W. For Gage.s customers, a high sustained data-transfer rate is important. .High bus transfer speed, while almost irrelevant in one-shot applications like explosion testing, is essential in the acquisition of repetitive signals,. explained Andrew Dawson, the company.s product manager of board-level products and advanced measurement systems. .Examples of these applications include radar, lidar, ultrasonic imaging, and manufacturing test systems. A typical requirement is to capture 1,000 point acquisitions at a repetition rate of over 10 kHz without missing a single event.. Also shunning USB and FireWire for the moment is Mark Cejer, the test and measurement marketing manager at Keithley Instruments. .Until an instrument comes along that offers unique features only available with USB or FireWire, there probably will be little incentive for users to buy them. Large production ATE racks consist of multiple types of instruments. What good will it do to have a USB DMM, for example, if all the other instruments are GPIB?. Balancing this view is one that considers the need to connect new PCs to existing GPIB and RS-232 instruments. National Instruments. solution consists of the GPIB-USB-A and the GPIB-1394 controllers that transform any computer with a USB or FireWire port into a plug-and-play GPIB controller that can handle up to 14 instruments. For Dewetron, system simplification is a theme that runs parallel with the development of the DEWE-BOOK. Grant Smith, the company president, said, .A DEWE-BOOK is an eight- or 16-channel signal-conditioning front end with a built-in ADC that precedes our DAQ and PAD series of modules. Previously, we offered an internal ADC board that had to be connected to the PC.s printer port, but as well as tying up the printer port, it limited throughput to 20 kHz. Today, we get a very consistent 100-kHz throughput with each USB-connected DEWE-BOOK. .In addition, it is plug-and-play, making it easier for our customers to install and get running, and USB is well supported by available software. Also,. he continued, .we had to use a separate COM port to control the settings on our DAQ and PAD modules. 
This meant that both a parallel port and a COM port on the customer.s PC were tied up. Now, we handle all control as well via USB. As a final benefit, by adding a hub, up to four DEWE-BOOKs can be used simultaneously with the same PC.. The Broader Picture Of course, USB and FireWire are just two of many instrumentation and computer buses available today. Agilent.s Mr. Purcell said, .USB1 block transfer performance is similar to GPIB, so the primary benefit for USB is the ease with which customers can connect instruments to PCs. FireWire performance is quite good, and we are seeing block transfer rates of 15 MB/s. This is 20 times better than GPIB and more than 1,000 times better than RS-232. .USB2, with its Intel backing, may become as pervasive as USB1 is today. Its higher bit rate should enable 10 times the performance of GPIB,. he continued. .In anticipation of this, there is a growing international group of companies that has started work on a standard USB protocol for test and measurement devices.. As a step toward industry-wide software compatibility, the VXIplug&play Systems Alliance has developed a specification for I/O software called Virtual Instrument Software Architecture (VISA). VISA provides a common foundation for the development, delivery, and interoperability of high-level multivendor system software components, such as instrument drivers, soft front panels, and application software. VISA not only allows test engineers to combine different I/O buses into one system, but also provides the necessary abstraction layer to make the transition to new buses transparent to the user. .Although VISA solves the mixed I/O problem on the host side,. commented Vanessa Trujillo, an instrument connectivity product manager at National Instruments, .a similar architecture is needed on the device side to make the integration of new bus types seamless for the instrument manufacturer. For instrument manufacturers to embrace and adopt new buses while at the same time to support their many customers who still use one of today.s buses, they need an architecture that allows them to easily adapt the firmware they have written for one bus type to another.. Sharing Mr. Purcell.s USB emphasis, Nick Turner, sales and marketing manager at Cytec, said .The advantage we saw with USB was the ability to daisy-chain devices to run from one PC port. RS-232 is a one-to-one bus, and GPIB is limited to 16 devices, so we thought there probably would be interest in USB. However, we.ve not done anything with FireWire because it requires licensing.. Mr. Turner cited the USB.s 5-m length restriction as a disadvantage in his company.s automated test business. Although greater expense and complexity accrue, hubs can be stacked to a maximum of 30 m. The 5-m length limit also applies to each FireWire hop, but the 1394b version promises to span 100-m or greater distances by matching media and speed. For example, 100 m could be achieved at high speed via fiber-optic cable, where copper would be appropriate at lower speeds. The continuing adoption of Ethernet for instrument communications goes on in the background as USB and FireWire vie for position. Mr. Turner commented, .The biggest trend we currently are seeing is more and more people wanting devices with a network interface. There is a plethora of software and hardware support for 100Base-Tx and even Gigabit Ethernet, and people are becoming increasingly accustomed to working with them.. Agilent.s Mr. Purcell agreed, but added a cautionary note. 
"Ethernet connections allow instruments to communicate using http, RMI, DCOM, and RPC. Instruments can act as web servers, and users can use familiar browsers to control and view collected data. There seems to be a lot of enthusiasm for connecting instruments to Ethernet, but enhancements are necessary so that it's easy to configure and then discover attached devices." Going beyond bus-tethered instruments, Gage Applied's Dr. Dawson foresees high-speed wireless links that will transform PC-based test and measurement. "Within a few years, the primary human interface to the PC will not be the traditional mouse, keyboard, and monitor. Instead, users will interface to a portable personal digital assistant (PDA) that, in turn, will communicate through a wireless interface to a faceless connected instrument (FCI). Communications via the PDA will allow greater data sharing and free the user to control equipment remotely in a manner unavailable today," he explained. Reference 1. Graham, J., "Approval Expected for USB On-The-Go," Electronic Buyers' News, www.ebnews.com, March 8, 2001. From scheinin at crs4.it Thu Dec 5 02:59:57 2002 From: scheinin at crs4.it (Alan Scheinine) Date: Wed Nov 25 01:02:53 2009 Subject: Noise abatement for a rack Message-ID: <200212051059.LAA05393@dylandog.crs4.it> "David Mathog" wrote: > Anybody here ever try noise insulating a rack??? A non-technical solution would be to invite a member of OSHA to your office after the machines are installed. From becker at scyld.com Thu Dec 5 06:02:51 2002 From: becker at scyld.com (Donald Becker) Date: Wed Nov 25 01:02:53 2009 Subject: IEEE 1394 In-Reply-To: Message-ID: On Thu, 5 Dec 2002, Eugen Leitl wrote: > After I've posted a link to the Oracle library for clustering over 1394 > a while back Apple mumbled something about RFC 2734 (IP over 1394): > > http://developer.apple.com/firewire/IP_over_FireWire.html > > All Apple computers sold today include one or more FireWire ports. > Because FireWire can transfer data at up to 400 megabits/second, it is > suitable for networking and clustering solutions, as well as temporary > connections to the internet using Internet Sharing. By "clustering" they mean fail-over. "N-way clustering, where N<=2". If you want scalable performance and matrix availability, you will need an IEEE1394 switch. Pretty much every vendor with a 10Gb Ethernet switch also makes IEEE1394 switches ;-> (N=0) > Since I've never seen real numbers for IEEE 1394 latency I did some > websearches, and finally found some meat: We never see them because (!!!) > "The IEEE 1394 bus has a minimum latency of a few hundred microseconds and > a worst-case delay of a few milliseconds. I think that they are very optimistic with that worst-case delay. > For large data blocks, this bus > uses direct memory access (DMA) similar to PCI bus mastering that reduces > the influence of software protocol overhead on the transfer rate. This is great for fixed-sized repeated frames such as video or contiguous disk block reads, but adds additional overhead for other communication. And most cluster communication is "other". -- Donald Becker becker@scyld.com Scyld Computing Corporation http://www.scyld.com 410 Severn Ave. 
Suite 210 Scyld Beowulf cluster system Annapolis MD 21403 410-990-9993 From eozk at bicom-inc.com Wed Dec 4 23:20:34 2002 From: eozk at bicom-inc.com (Eray Ozkural) Date: Wed Nov 25 01:02:53 2009 Subject: IEEE 1394 In-Reply-To: References: Message-ID: <200212050920.34768.eozk@bicom-inc.com> A while ago I had asked whether there were any existing clusters using firewire IIRC. I had also found a similar query on this list, asked some time before me, but I don't have the link right now. I had even developed a design, unfortunately no professors had shown interest in it at Bilkent. There are interface cards containing 3 firewire ports with aforementioned bandwidth/latency characteristics which makes them excellent point-to-point connection devices. With a suitably high performance kernel router, this would make the construction of high performance static-network distributed memory machines an ordinary feat. Each node would have 2 of those interface cards, totaling to 6 firewire ports. 64 nodes can be connected in hybercube topology resulting in a high performance supercomputer. If anybody wants me to come and help build it, just send me a job offer :) Thanks, -- Eray Ozkural Software Engineer, BICOM Inc. GPG public key fingerprint: 360C 852F 88B0 A745 F31B EA0F 7C07 AE16 874D 539C From jhalpern at howard.edu Thu Dec 5 06:40:11 2002 From: jhalpern at howard.edu (Joshua Halpern) Date: Wed Nov 25 01:02:53 2009 Subject: IEEE 1394 References: Message-ID: <3DEF654B.4050401@howard.edu> Eugen Leitl wrote: SNIP.... > "The IEEE 1394 bus has a minimum latency of a few hundred microseconds and > a worst-case delay of a few milliseconds. For large data blocks, this bus > uses direct memory access (DMA) similar to PCI bus mastering that reduces > the influence of software protocol overhead on the transfer rate. The > 400-Mb/s top data rate supports consumer digital video equipment and data > acquisition devices requiring relatively fast data transfer. Bus latencies > are compared in Figure 1 and bus throughput in Figure 2." Although it is not relevant to Beowulfs, firewire is being touted here as a replacement of IEEE-488. They may have missed the boat, the movement in the measurement community is to USB-2.0 People who do real time data acquisition have less trouble with IEEE-488 than with the insane way it is implemented in various devices, the belt and suspenders handshaking required to even start talking to devices (remember the design was for an electrically noisy environment) and the fact that the device hardware implementation often has more than a lot to be desired. The bus itself may only have a ms of latency much of which is due to the handshaking, but the time to rip something out of a device may be a tenth of a second or more in the worst (Tektronix) case. You can sometimes be indirect and clever, but not always and the cost in your time is humongous. None of these issues are really addressed by the new buses as you find out when you talk to the manufacturers and ask how many measurements can I do per second and they tell you what the transfer rate is when it finally squirts out of the narrow end of the device. Josh Halpern From eugen at leitl.org Thu Dec 5 07:43:30 2002 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:02:53 2009 Subject: IEEE 1394 In-Reply-To: <200212050920.34768.eozk@bicom-inc.com> Message-ID: On Thu, 5 Dec 2002, Eray Ozkural wrote: > A while ago I had asked whether there were any existing clusters using > firewire IIRC. 
I had also found a similar query on this list, asked some time > before me, but I don't have the link right now. Have you seen http://www.ultraviolet.org/mail-archives/beowulf.2002/2977.html ? > I had even developed a design, unfortunately no professors had shown interest > in it at Bilkent. There are interface cards containing 3 firewire ports with It is most interesting to use with motherboards with onboard IEEE 1394. > aforementioned bandwidth/latency characteristics which makes them excellent > point-to-point connection devices. With a suitably high performance kernel I've found another bit of info after posting to the list, which however looks proprietary. They claim "Asynchronous packet round trip, real-time thread to real-time thread and back is 110 microseconds worst case." http://www.fsmlabs.com/about/news_item.htm?record_id=48 Real-Time IEEE 1394 Driver from FSMLabs Applications to industrial and machine control and clusters November 12 2002, Socorro, NM. FSMLabs announces the immediate availability of a full function OHCI IEEE 1394 driver for the RTLinux/Pro Operating System. The driver supports asynchronous and isochronous modes and bus configuration and is available with FSMLabs Lnet networking package that also support Ethernet. The zero copy variant of the UNIX standard socket interface allows application code to have full access to the packets and build application stacks without forcing packet copy. Asynchronous packet round trip, real-time thread to real-time thread and back is 110 microseconds worst case. The driver is currently being used by FSMLabs customers who employ 1394 as an instrument control bus, but real-time 1394 has applications in fields such as multimedia, robotics, and enterprise (where it can be used for fault tolerance). As an example, United Technologies uses the RTLinux 1394 support to bridge control systems and VME/shared memory systems, taking advantage of the high data movement rates of the 1394 bus to synchronize with shared memory on PCI control systems. FSMLabs Network Architect, Justin Weaver said: "The driver exposes the flexibility of 1394, which can provide both very low latency packet transmission and high data rates at the same time." Driver functions include: * * Asynchronous requests and responses * * Isochronous stream packets with ability to tune contexts to specific or multiple channels. * * Asynchronous stream packets * * Up to 32 isochronous receive contexts and same number of transmit contexts. * * Cycle master capability. * * IRM capability and Bus Manager topology map control. * * Up to 63 nodes per bus and up to 16 ports per node. About RTLinux/Pro and RTCore RTLinux/Pro provides FSMLabs RTCore POSIX PS51 robust "hard" real-time kernel with a full embedded Linux development system. RTCore employs a patented dual kernel technique to run Linux or BSD Unix as applications. Hard real-time software runs at hardware speeds while the full power of an open-source UNIX is available to non-real-time components. RTLinux/Pro is used for everything from satellite controllers, telescopes, and jet engine test stands to routers and computer graphics. RTLinux/Pro runs on a wide range of platforms from high end clusters of multiprocessor P4s/Athlons to low power devices like the MPC860, Elan 520, and ARM7. > router, this would make the construction of high performance static-network > distributed memory machines an ordinary feat. > > Each node would have 2 of those interface cards, totaling to 6 firewire ports. 
> 64 nodes can be connected in hybercube topology resulting in a high > performance supercomputer. > > If anybody wants me to come and help build it, just send me a job offer :) There are solutions like http://www.disi.unige.it/project/gamma/mpigamma/ Hardware requirements A pool of uniprocessor PCs with Intel Pentium, AMD K6, or superior CPU models. Each PC should have a Fast Ethernet or Gigabit Ethernet NIC supported by GAMMA. Currently supported Fast Ethernet NICs are: 3COM 3c905[rev.B, B, C], any adapter equipped with the DEC DS21143 / Intel DS21145 ``tulip'' chipsets and clones, Intel EtherExpress Pro/100. Currently supported Gigabit Ethernet NICs are: Alteon AceNIC and its clones (3COM 3c985, Netgear GA620), Netgear GA621 (and possibly GA622). You should also connect all PCs by a Fast Ethernet or Gigabit Ethernet switch, or by a Fast Ethernet repeater hub, (or by a simple cross-over cable, for a minimal cluster of two PCs). They claim 35 to 10.5 us userland latency: http://www.disi.unige.it/project/gamma/mpigamma/#GE Given how cheap GBit Ethernet switches are getting, there's really no point in going IEEE 1394 on a large scale, unless your motherboard happens to have it for free along with the Ethernet ports, and your cluster is small. From eozk at bicom-inc.com Thu Dec 5 05:07:31 2002 From: eozk at bicom-inc.com (Eray Ozkural) Date: Wed Nov 25 01:02:53 2009 Subject: Locality and caching in parallel/distributed file systems In-Reply-To: <20021203094729.O95569-100000@net.bluemoon.net> References: <20021203094729.O95569-100000@net.bluemoon.net> Message-ID: <200212051507.31679.eozk@bicom-inc.com> On Tuesday 03 December 2002 04:57 pm, Andrew Fant wrote: > I admit that I am not a computer science graduate, nor a > semi-professional developer, so I have no idea if this has been or could > be done, but it keeps rattling around in my head as an idea, and I would > appreciate any feedback that people can give. Please forgive my ignorance Your idea makes sense since there is always a better distribution than random. I have to meditate about the required model. Regards, -- Eray Ozkural Software Engineer, BICOM Inc. GPG public key fingerprint: 360C 852F 88B0 A745 F31B EA0F 7C07 AE16 874D 539C From eozk at bicom-inc.com Thu Dec 5 05:14:51 2002 From: eozk at bicom-inc.com (Eray Ozkural) Date: Wed Nov 25 01:02:53 2009 Subject: Locality and caching in parallel/distributed file systems In-Reply-To: <20021203094729.O95569-100000@net.bluemoon.net> References: <20021203094729.O95569-100000@net.bluemoon.net> Message-ID: <200212051514.51947.eozk@bicom-inc.com> On Tuesday 03 December 2002 04:57 pm, Andrew Fant wrote: > Morning all, > Lately, I have been thinking a lot about parallel filesystems in > the most un-rigourous way possible. Knowing that PVFS simply stripes the > data across the participating filesystems, I was wondering if anyone had > tried to apply caching technology and file migration capacities to a > parallel/distributed filesytem in a manner analagous to SGI's ccNuma > memory architecture. That is, distributing files in the FS to various > nodes, keeping track of where the accesses are coming from, and moving > the file to another node if that is where some suitable percentage of the > reads and/or writes are coming from. Also, potentially allowing blocks > from local files to be cached in disk on a local node until a write to > those blocks elsewhere invalidates the cache (I know the semantics for > this theoretically are in NFS, but NFS doesn't scale, and is dead 8-). 
In particular, this problem seems to be similar to a problem in Information Retrieval which we had given some thought to. It is likely to have a combinatorial solution since we have many nodes and many files :> I must meditate further. Thanks, -- Eray Ozkural Software Engineer, BICOM Inc. GPG public key fingerprint: 360C 852F 88B0 A745 F31B EA0F 7C07 AE16 874D 539C From Robin.Laing at drdc-rddc.gc.ca Thu Dec 5 13:37:00 2002 From: Robin.Laing at drdc-rddc.gc.ca (Robin Laing) Date: Wed Nov 25 01:02:53 2009 Subject: Newbie - help. Cluster specifications for ls-dyna. Message-ID: <3DEFC6FC.1060903@drdc-rddc.gc.ca> Hello, I have been working on specifications for some additional servers for our small (4 node) cluster. We are using gigabit ethernet (copper) switches to join the nodes. Our cluster will be running ls-dyna and is presently based on Scyld Beowulf software. I am looking at dual Xeon or Athlon processors. The Athlons are a lower cost option but I have concerns about the Floating point differences between each processor. I have searched the web and found that each has it's own benefits and can depend on the type of application. I have used linux since 1994 so I am not totally lost. As I have no experience with clusters or ls-dyna, I need some help and all help will be appreciated. -- Robin Laing Instrumentation Technologist Military Engineering Section Defence R&D Canada - Suffield P. O. Box 4000 Medicine Hat, AB, T1A 8K6 Canada Voice: 1-403-544-4762 FAX: 1-403-544-4704 Email: Robin.Laing@DRDC-RDDC.GC.CA WWW: http://www.suffield.drdc-rddc.gc.ca From eozk at bicom-inc.com Thu Dec 5 06:25:38 2002 From: eozk at bicom-inc.com (Eray Ozkural) Date: Wed Nov 25 01:02:53 2009 Subject: IEEE 1394 In-Reply-To: References: Message-ID: <200212051625.38593.eozk@bicom-inc.com> On Thursday 05 December 2002 05:43 pm, Eugen Leitl wrote: > On Thu, 5 Dec 2002, Eray Ozkural wrote: > > A while ago I had asked whether there were any existing clusters using > > firewire IIRC. I had also found a similar query on this list, asked some > > time before me, but I don't have the link right now. > > Have you seen > http://www.ultraviolet.org/mail-archives/beowulf.2002/2977.html > ? > This has a very different purpose from my design, I don't think it's much related. > > I had even developed a design, unfortunately no professors had shown > > interest in it at Bilkent. There are interface cards containing 3 > > firewire ports with > > It is most interesting to use with motherboards with onboard IEEE 1394. > For different purposes, maybe. What my design needs is n firewire ii ports for a hypercube of nth degree. I don't think there is any motherboard that gives you that. > > aforementioned bandwidth/latency characteristics which makes them > > excellent point-to-point connection devices. With a suitably high > > performance kernel > > I've found another bit of info after posting to the list, which however > looks proprietary. They claim "Asynchronous packet round trip, real-time > thread to real-time thread and back is 110 microseconds worst case." > > http://www.fsmlabs.com/about/news_item.htm?record_id=48 [snip] This is interesting since OS support is ultimately needed :) > > > router, this would make the construction of high performance > > static-network distributed memory machines an ordinary feat. > > > > Each node would have 2 of those interface cards, totaling to 6 firewire > > ports. 64 nodes can be connected in hybercube topology resulting in a > > high performance supercomputer. 
> > > > If anybody wants me to come and help build it, just send me a job offer > > :) > > There are solutions like http://www.disi.unige.it/project/gamma/mpigamma/ > I'm aware of these messaging protocol projects. I investigated in the performance and the gains did not seem to be wildly different from stock 2.4.x so I opted for that kernel. Again, this isn't very related to the static architecture I am mainly interested in. In my design there is no need for routers. We probably can't get decent CT-routing but even then it should have a blazing performance at an incredibly cheap price. Better than any gigabit ethernet network can offer. We had talked about gigabit ethernet performance many months ago and no implementations seemed to approach theoretical limits even in point-to-point connections. The cards I'm talking about are products such as: http://www.expansys.com/product.asp?code=BF-520 This one's got cards that go for $10-$20, pretty cheap... http://store.yahoo.com/akidacomputer/ieee1adcar.html Thanks, -- Eray Ozkural Software Engineer, BICOM Inc. GPG public key fingerprint: 360C 852F 88B0 A745 F31B EA0F 7C07 AE16 874D 539C From jlb17 at duke.edu Fri Dec 6 06:55:54 2002 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Wed Nov 25 01:02:53 2009 Subject: Newbie - help. Cluster specifications for ls-dyna. In-Reply-To: <3DEFC6FC.1060903@drdc-rddc.gc.ca> Message-ID: On Thu, 5 Dec 2002 at 2:37pm, Robin Laing wrote > Our cluster will be running ls-dyna and is presently based on Scyld > Beowulf software. > > I am looking at dual Xeon or Athlon processors. The Athlons are a > lower cost option but I have concerns about the Floating point > differences between each processor. I have searched the web and found > that each has it's own benefits and can depend on the type of application. ISTR an article comparing Xeons and Athlons with DYNA... ah: http://www.aceshardware.com/read_news.jsp?id=55000490 By that report, the Athlon 1800 beat out a 2GHz Xeon by a decent margin. Of course, we don't know what type of sim that was -- if memory bandwidth comes into play, the Xeon may take the lead again. Your best bet may be to contact LSTC and/or find some systems to benchmark. Note that I only have experience with DYNA on single nodes -- I haven't played with mpp-dyna yet. -- Joshua Baker-LePain Department of Biomedical Engineering Duke University From m at pavis.lamel.bo.cnr.it Fri Dec 6 08:47:16 2002 From: m at pavis.lamel.bo.cnr.it (m@pavis.lamel.bo.cnr.it) Date: Wed Nov 25 01:02:53 2009 Subject: Neural Network applications using Beowulf In-Reply-To: <20021121104209.A10170@synapse.neuralscape.com> References: <200211201149.59715.eozk@bicom-inc.com> <20021121104209.A10170@synapse.neuralscape.com> Message-ID: <20021206164716.GA28090@spartacus.biodec.com> * Karen Shaeffer (shaeffer@neuralscape.com) [021122 06:54]: > > May I suggest you read these books: > > > Ralf Herbrich; Learning Kernel Classifiers, Theory and Algorithms; MIT > Press, 2002 > I'd like to point out this paper too # X. Yao, ``Evolving artificial neural networks,'' Proceedings of the IEEE, 87(9):1423-1447, September 1999. (yao_ie3proc_online.ps.gz), which you may download from: http://www.cs.bham.ac.uk/~xin/research/eann.html it is the second of the list I've found it very valuable, either for the exposition and for the bibliography -- .*. 
finelli /V\ (/ \) -------------------------------------------------------------- ( ) Linux: Friends dont let friends use Piccolosoffice ^^-^^ -------------------------------------------------------------- I was appalled by this story of the destruction of a member of a valued endangered species. It's all very well to celebrate the practicality of pigs by ennobling the porcine sibling who constructed his home out of bricks and mortar. But to wantonly destroy a wolf, even one with an excessive taste for porkers, is unconscionable in these ecologically critical times when both man and his domestic beasts continue to maraud the earth. Sylvia Kamerman, "Book Reviewing" From mof at labf.org Fri Dec 6 17:13:22 2002 From: mof at labf.org (Mof) Date: Wed Nov 25 01:02:53 2009 Subject: Neural Network applications using Beowulf Message-ID: <200212071143.22222.mof@labf.org> On Sat, 7 Dec 2002 03:17 am, m@pavis.lamel.bo.cnr.it wrote: > * Karen Shaeffer (shaeffer@neuralscape.com) [021122 06:54]: > > May I suggest you read these books: > > > > > Ralf Herbrich; Learning Kernel Classifiers, Theory and Algorithms; MIT > > Press, 2002 > > I'd like to point out this paper too # X. Yao, ``Evolving artificial > neural networks,'' Proceedings of the IEEE, 87(9):1423-1447, September > 1999. (yao_ie3proc_online.ps.gz), which you may download from: > > http://www.cs.bham.ac.uk/~xin/research/eann.html > > it is the second of the list > > I've found it very valuable, either for the exposition and for the > bibliography I was just reading it, and found a little paragraph that would be of interest for those that didn't think that a EANN is not a real EANN if it doesn't implement crossover : "It is shown that evolutionary algorithms relying on crossover operators do not perform very well in searching for a near-optimal ANN architecture." So I guess unless you have your architecture already worked out, you shouldn't use cross-over.... Or alternatively the paper by Xin Yao is wrong..... I personally think that Xin is right. Mof. ------------------------------------------------------- From lindahl at keyresearch.com Mon Dec 9 10:23:25 2002 From: lindahl at keyresearch.com (Greg Lindahl) Date: Wed Nov 25 01:02:53 2009 Subject: infiniband pricing Message-ID: <20021209102325.A1643@wumpus.internal.keyresearch.com> For a while, people have been thinking that Infiniband prices are going to be better than Myrinet. Now they've been announced: JNI's retail price for dual 4X PCI cards is $1700, and small 8-port 4X switches are $8000. These prices are substantially above Myricom's list prices. Behold the power of the free market: If too much venture capital is chasing too small of a market, everyone prices high. -- greg From lothar at triumf.ca Mon Dec 9 15:11:06 2002 From: lothar at triumf.ca (lothar@triumf.ca) Date: Wed Nov 25 01:02:53 2009 Subject: bproc help request References: <5.1.1.6.2.20021204230819.03a26910@mail.harddata.com> <20021205095306.A25223@mail.harddata.com> <3DF11D0B.6090705@triumf.ca> <20021206171310.A10580@mail.harddata.com> Message-ID: <3DF5230A.8090301@triumf.ca> Michal Jaegermann wrote: >>aftpd start up does not change anything either. >> >> > >This usually runs from xinetd so starting that up should fail with >"port busy" or similar. > >If you will type 'chkconfig --list' you see close to the bottom >'tftp: on' (that name can be slightly different). > > tftp is 'on' > > >>I am sure how you do this. That's the file. 
>> >> >># default: off >># description: The tftp server serves files using the trivial file >>transfer \ >># protocol. The tftp protocol is often used to boot diskless \ >># workstations, download configuration files to network-aware printers, \ >># and to start the installation process for some operating systems. >>service tftp >>{ >>disable = no >>socket_type = dgram >>protocol = udp >>wait = yes >>user = nobody >>group = nobody >>server = /usr/sbin/in.tftpd >>server_args = /tftpboot >>} >> >> > >Looks sane. '/tftpboot' should be a default. Files you want to >serve from /tftpboot/ directory are readable by nobody.nobody, I >presume? > > > >>>Or maybe you are blocking it with your >>>ipchains or iptables setup or through tcp_wrappers (files >>>/etc/hosts.allow and /etc/hosts.deny)? >>> >>> >>> >>Don't think so, but these are the files: >> >> > >Would be nice if you could use a mailer able not mangle this by >wrapping lines. > > > >>in.tftpd: 192.168.0.0/24 >> >> > >Without knowing on which network are really clients trying to reach >your tftp server I cannot tell if this is correct or not. > > > >>As of ipchains: >> >>ipchains -L >>ipchains: Incompatible with this kernel >> >> > >Well, this really means that a corresponding module is not loaded >(likely because it was not configured and or started). Until this >modules is absent things are "incompatible". :-) >You cannot load support for and run ipchains and iptables at the >same time. One or another. > > > >>The kernel is from a bproc distribution. >> >> > >??? 'uname -a'? > > kernel 2.4.18-hddcs.17.2.smp > > >>iptables -L >> >> >.... > >Really no rules; so nothing blocks here. > >There are chances that after a failed attempt your >/var/log/messages, or maybe /var/log/secure, recorded some useful >information. > >A trival way to check a sanity of a setup and a contact it to >write a shell script like that > >#!/bin/sh >echo "Hello, I am a net service" >( echo -n "Got a contact: " ; date ) >> /tmp/junk.log >exit > >make it executable, write an xinetd config file like the one you >have above but for 'protocol = tcp' (different name, of course) and >restart xinetd service. If things are ok then after >'telnet tftp' from a prospective client it should talk back >to you and record that event in /tmp/junk.log. > > Michal > > > Well, I made a tcpdump (-i eth1), when starting up teh slave node , and the only dialog is: rarp who-is .1 tell .1 rarp reply .1 at .-1 Lothar From hahn at physics.mcmaster.ca Mon Dec 9 16:18:06 2002 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:02:53 2009 Subject: infiniband pricing In-Reply-To: <20021209102325.A1643@wumpus.internal.keyresearch.com> Message-ID: > list prices. Behold the power of the free market: If too much venture > capital is chasing too small of a market, everyone prices high. yuck, bring on the 10G ethernet ;) seriously, though, does anyone have experience with IB? whenever I look at IB docs, I get stopped cold by the nasty overlayering. does 4x have enough thrust to make such things fly in the <10 us latency market? From araman at cs.ucsb.edu Mon Dec 9 23:07:57 2002 From: araman at cs.ucsb.edu (Amit Raman) Date: Wed Nov 25 01:02:53 2009 Subject: Students interested in building a system In-Reply-To: Message-ID: Hey all, Is there anyone in the Bay Area willing to work with us this December to build our own cluster? Email me back at: araman@cs.ucsb.edu. Later all! 
Amit Raman araman@cs.ucsb.edu From jcownie at etnus.com Tue Dec 10 01:52:04 2002 From: jcownie at etnus.com (James Cownie) Date: Wed Nov 25 01:02:53 2009 Subject: infiniband pricing (and performance) In-Reply-To: Message from Mark Hahn of "Mon, 09 Dec 2002 19:18:06 EST." Message-ID: <18Lh3c-2Z0-00@etnus.com> Mark Hahn wrote :- > seriously, though, does anyone have experience with IB? whenever I look > at IB docs, I get stopped cold by the nasty overlayering. does 4x have > enough thrust to make such things fly in the <10 us latency market? At SC last month there were some folks from Ohio State who had an IB cluster on the show floor. They're claiming some relatively good numbers (the high long-message bandwidth helps, of course, but their printed handout also claims ~8uS latency, though the web site below says 11.6 :-( ). http://nowlab.cis.ohio-state.edu/projects/mpi-iba/ -- Jim James Cownie Etnus, LLC. +44 117 9071438 http://www.etnus.com From lelouarn at eso.org Tue Dec 10 05:56:19 2002 From: lelouarn at eso.org (Miska Le Louarn) Date: Wed Nov 25 01:02:53 2009 Subject: Running two MPI jobs simultaneously Message-ID: <3DF5F283.1050305@eso.org> Dear all, I am facing a strange problem (or "feature") related to either Linux or MPI or maybe their interaction. I have written two programs, in C, which both use MPI. I run these programs on a 6 node cluster of PCs, each PC running Linux. More precise hardware / software description at the end of this mail. When I run one program (any of the two - with the command mpirun), everything goes fine, the program doesn't crash and provides the right result. All PCs work happily and everything seems to be ok. I should say the two programs are completely independant (different executables and so on, I don't make any communication between the two...). BUT when I try to run these two programs at the same time, one of them hangs. It just stops doing anything and sits there without crashing until the other program is completed. Then it starts to work again. I am surprised by this behavior. I would have expected that both programs run independantly, slower (because they share resources like network and CPU) but still run. Now this one program hogs all resources and the other one just sits there doing nothing. I have also tried to run two copies of the first ("hog") process. Now one of the copies also freezes completely (but doesn't seem to restart once the hog process is finished). What I am doing now to avoid the problem is to run the programs sequencially. It just would be conveniant sometimes to have the progs run at the same time - although slower. I haven't tried running the two programms as two different users. I should maybe try that. So does anybody have any idea why this is ? Is it a Linux scheduler "feature" related to the network communication between the nodes (if I launch 2 non-MPI jobs, I get the standard slow-down) ? Or maybe interference inside MPI between the two processes ? Any tests I could do to see what is going on ? Thanks in advance, Miska Cluster: 5 Nodes are Pentium IV, 1.8 GHz, with 1 GB of RAM, running Linux RH 7.3 (stock kernel) Master: Pentium Xeon 2 CPU, 1GB of RAM, RH 7.3 (stock SMP kernel) All machines have a Gigabit network card, and we have a gigabit switch. Software: MPICH 1.2.3 Programs written in C, compiled with gcc 2.96-112 (have also tried gcc 3.2 without any change). The programs perform quite of lot of different operations, various computations, MPI communications, disk access on the local disk etc... 
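A later reply in this thread suggests reproducing the hang with a trivial test before digging into the MPI implementation itself. The sketch below is one way to do that, assuming nothing beyond a working MPICH installation and the C compiler already in use on the cluster: two ranks bounce a fixed-size dummy message back and forth and print periodic progress, so two copies started from separate mpirun invocations should both keep advancing if the MPI layer and the scheduler are behaving. The file name, iteration count, and message size are arbitrary choices, not anything taken from the original posts.

/* pingpong.c -- minimal two-rank message bounce for reproducing the
 * "two simultaneous MPI jobs" hang described above.
 * Build with mpicc and start two copies from separate shells, e.g.:
 *   mpicc -o pingpong pingpong.c
 *   mpirun -np 2 ./pingpong
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    char buf[1024];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0)
            fprintf(stderr, "need at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }
    memset(buf, 0, sizeof(buf));

    for (i = 0; i < 10000; i++) {
        if (rank == 0) {
            /* rank 0 sends first, then waits for the echo */
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
            if (i % 1000 == 0)
                printf("rank 0: %d round trips\n", i);   /* progress marker */
        } else if (rank == 1) {
            /* rank 1 echoes whatever arrives */
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}

If two copies of this trivial test also stall when run together, the problem sits in the MPI device layer or its use of the network rather than in the applications; if both keep making progress, the original programs' communication pattern deserves a closer look.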
From eugen at leitl.org Tue Dec 10 06:01:49 2002 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:02:53 2009 Subject: /. Computers' Shelf Life Gets Livelier Message-ID: http://www.washingtonpost.com/wp-dyn/articles/A32701-2002Dec9.html Computers' Shelf Life Gets Livelier Gateway to Sell Harnessed Power of the PCs in Its Stores Advertisement UMUC Chief Technology Officer Bob Burnett hit on the networking plan while seeking ways to leverage Gateway's resources. (Val Hoeppner For The Washington Post) By Mike Musgrove Washington Post Staff Writer Tuesday, December 10, 2002; Page E01 The next time you tap on a keyboard at a Gateway Country store, you might just be touching a piece of one of the world's most powerful supercomputers. Gateway Inc. plans to announce today that it has linked up the computers on display in its retail stores across the country to sell the combined processing power to corporate customers in need of some extra computing punch. Shoppers shouldn't notice any difference. The floor models will continue to run demos of spreadsheets, games and digital music programs. But in the background, if all goes according to plan, the machines will be grinding away at tasks such as drug design or geoscience research. Gateway is the latest company to experiment with a concept called grid computing, in which processing power is bought and sold just like electricity and natural gas. Technology companies big and small are betting that more and more computing services will be delivered this way in the future, as businesses and organizations seek to make more efficient use of their computer resources. Major tech players, such as Hewlett-Packard Co. and Sun Microsystems Inc., believe the technology is on the the verge of becoming a viable business. International Business Machines Corp. announced plans in October to spend $10 billion on what it calls "computing on demand." The basic attraction to many companies is the idea that collections of $1,500 workstations can be converted into virtual supercomputers. Jaguar Cars Ltd. has experimented with the concept to create new car and driving simulations. Chipmaker Advanced Micro Devices Inc. uses it to design its next generations of processors. For struggling computer maker Gateway Inc., it could be a new way to generate revenue. For the concept to reach its potential, there must be greater standardization among computing systems, experts said. Proponents must also overcome concerns about the security of data. "There's a huge confidence level you have to reach with customers before they're willing to turn their technological life over to you," said Barry Jaruzelski, head of Booz Allen Hamilton Inc.'s global computers and electronics practice. At the moment, companies typically buy enough desktop computers and servers to take care of the heaviest computing jobs during peak periods. In off-peak times, much of that power is wasted -- the conventional wisdom is that companies typically use only about 25 percent of their total computing resources. "If a computer is idle and you don't use it, the computing power you generated is lost -- just like if you generate electrical power and you don't use it, it's gone," said Dan Reed, director of the National Center for Supercomputing Applications at the University of Illinois. Organizations that managed to link their computers report noticeable improvement. 
In Alexandria, for example, the American Diabetes Association has used grid software to increase the speed of a computer program called Archimedes, developed by Kaiser Permanente, designed to test the effect of different levels of health care and services in a community. According to the association's chief scientist and chief medical officer, Richard Kahn, it once took four to five days for the program to run. This year, the organization linked the 250 computers in its Alexandria office together with grid computing technology and reduced that time to a couple of hours For the time being, analysts are undecided about whether Gateway, or its partner United Devices Inc., which makes grid computing software, will attract many customers. "I am pretty much withholding judgment," said Christopher Willard, analyst at IDC. "There is currently not enough data one way or another to tell if this will be a revenue generator." In its favor, Willard said, is the fact that Gateway already has thousands of computers sitting largely idle on display shelves (the program would not affect PCs sold to consumers). "They have nothing to lose and may have something to gain," he said. Gateway has 272 Gateway Country stores. With 7,800 floor model PCs, each with an average processing power of 2 gigahertz, Gateway says it has about 14 teraflops of computing power (a teraflop is 1 trillion operations per second). By comparison, the 10 most powerful computers in the world range from three to about 36 teraflops. The advantage, for customers, is the price. For an introductory price of 15 cents per computer hour, plus set-up fees, Gateway is making the power of supercomputing available to companies that might not be able to afford it otherwise. The computer maker has not said how much it will charge when that introductory period is over, other than that it will cost about what a company would spend to maintain such a network -- without having to buy all the hardware. "Gateway Processing on Demand," the computer maker's name for the new program, is the brainchild of Gateway's chief technology officer, Bob Burnett. About a year ago, Burnett downloaded a couple of software programs onto his personal computers that seek to tap surplus computing power: SETI@Home and the United Devices Cancer Research Project. The two programs use concepts of grid computing to allow people with Web-connected computers to donate their computer's spare processing power to causes they are interested in. These two projects involve scanning telescope readouts for possible extraterrestrial contact and seeking cures for cancer. Burnett was trying to figure out ways to leverage Gateway's existing resources when the idea of networking that power in a similar manner -- and charge for it -- came to him. "The stores are basically closed from nine at night to nine in the morning," he said. Even during the day, computer utilization is "essentially zero as well. The things you're doing at retail don't tax a processor very hard." Gateway's service does not have any customers yet, although a London-based drug research firm, Inpharmatica Ltd., participated in a trial version of the program earlier this year. The company requires high-performance hardware to search chemical combinations for potential new drugs. "We are a drug discovery company, not an IT shop. We would much rather employ people to do innovative analysis of the data than spend time building computers," said Pat Leach, Inpharmatica's chief information officer. 
Leach said his company was impressed with the results of the trial and indicated that Inpharmatica might become a customer if the company's computing needs grow beyond the equipment it has already purchased. "When it comes to it, we will do a simple commercial cost-benefit analysis," he said. "If the Gateway service is cheaper than owning the kit and competitive with other offerings we will go with it." -- -- Eugen* Leitl leitl ______________________________________________________________ ICBMTO: N48 04'14.8'' E11 36'41.2'' http://eugen.leitl.org 83E5CA02: EDE4 7193 0833 A96B 07A7 1A88 AA58 0E89 83E5 CA02 http://moleculardevices.org http://nanomachines.net From eozk at bicom-inc.com Tue Dec 10 01:10:49 2002 From: eozk at bicom-inc.com (Eray Ozkural) Date: Wed Nov 25 01:02:53 2009 Subject: Running two MPI jobs simultaneously In-Reply-To: <3DF5F283.1050305@eso.org> References: <3DF5F283.1050305@eso.org> Message-ID: <200212101110.49698.eozk@bicom-inc.com> On Tuesday 10 December 2002 03:56 pm, Miska Le Louarn wrote: > > BUT when I try to run these two programs at the same time, one of them > hangs. It just stops doing anything and sits there without crashing > until the other program is completed. Then it starts to work again. Is this LAM? I'm not sure but I think I saw this behavior a couple of times. [snip] > So does anybody have any idea why this is ? Is it a Linux scheduler > "feature" related to the network communication between the nodes (if I > launch 2 non-MPI jobs, I get the standard slow-down) ? Or maybe > interference inside MPI between the two processes ? > > Any tests I could do to see what is going on ? > I guess you could try to turn on the debugging features of the MPI implementation you're using and try to see what really happens. I would also write a very simple program to replicate the bug in a simpler environment such as sending a dummy message back and forth many times. Then you should check if multiple incarnations of this application has similar behavior. Thanks, -- Eray Ozkural Software Engineer, BICOM Inc. GPG public key fingerprint: 360C 852F 88B0 A745 F31B EA0F 7C07 AE16 874D 539C From mehmet.suzen at bristol.ac.uk Tue Dec 10 08:24:02 2002 From: mehmet.suzen at bristol.ac.uk (Mehmet Suzen) Date: Wed Nov 25 01:02:53 2009 Subject: Running two MPI jobs simultaneously In-Reply-To: <3DF5F283.1050305@eso.org> References: <3DF5F283.1050305@eso.org> Message-ID: <370992703.1039537442@pc168.maths.bris.ac.uk> Maybe you are using queue system? It sounds like your second program waiting in the queue?. If you don't have a queue system, you should have it. --On 10 December 2002 2:56pm +0100 Miska Le Louarn wrote: > Dear all, > > I am facing a strange problem (or "feature") related to either Linux or > MPI or maybe their interaction. > > I have written two programs, in C, which both use MPI. > > I run these programs on a 6 node cluster of PCs, each PC running Linux. > More precise hardware / software description at the end of this mail. > > When I run one program (any of the two - with the command mpirun), > everything goes fine, the program doesn't crash and provides the right > result. All PCs work happily and everything seems to be ok. > > I should say the two programs are completely independant (different > executables and so on, I don't make any communication between the two...). > > BUT when I try to run these two programs at the same time, one of them > hangs. It just stops doing anything and sits there without crashing until > the other program is completed. 
Then it starts to work again. > > I am surprised by this behavior. I would have expected that both programs > run independantly, slower (because they share resources like network and > CPU) but still run. Now this one program hogs all resources and the other > one just sits there doing nothing. > > I have also tried to run two copies of the first ("hog") process. Now one > of the copies also freezes completely (but doesn't seem to restart once > the hog process is finished). > > What I am doing now to avoid the problem is to run the programs > sequencially. It just would be conveniant sometimes to have the progs run > at the same time - although slower. > > I haven't tried running the two programms as two different users. I > should maybe try that. > > So does anybody have any idea why this is ? Is it a Linux scheduler > "feature" related to the network communication between the nodes (if I > launch 2 non-MPI jobs, I get the standard slow-down) ? Or maybe > interference inside MPI between the two processes ? > > Any tests I could do to see what is going on ? > > Thanks in advance, > > Miska > > Cluster: > > 5 Nodes are Pentium IV, 1.8 GHz, with 1 GB of RAM, running Linux RH 7.3 > (stock kernel) Master: Pentium Xeon 2 CPU, 1GB of RAM, RH 7.3 (stock SMP > kernel) All machines have a Gigabit network card, and we have a gigabit > switch. > > Software: > MPICH 1.2.3 > Programs written in C, compiled with gcc 2.96-112 (have also tried gcc > 3.2 without any change). The programs perform quite of lot of different > operations, various computations, MPI communications, disk access on the > local disk etc... > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From milkman50 at mail.com Tue Dec 10 08:59:36 2002 From: milkman50 at mail.com (chay hext) Date: Wed Nov 25 01:02:54 2009 Subject: Asking for help Message-ID: <20021210165937.20620.qmail@mail.com> Dear sir/madam, I am a student currently studying an AVCE In ICT in the United Kingdom and have been asked to research network drivers or any other drivers I was wondering if you could send any network drivers as an attachment as this is necessary for my course and have not had much luck so far and this would be very much appriaciated yours sincerly Chay Hext -- __________________________________________________________ Sign-up for your own FREE Personalized E-mail at Mail.com http://www.mail.com/?sr=signup One click access to the Top Search Engines http://www.exactsearchbar.com/mailcom From wathey at euler.snl.salk.edu Tue Dec 10 11:29:13 2002 From: wathey at euler.snl.salk.edu (Jack Wathey) Date: Wed Nov 25 01:02:54 2009 Subject: Need boot ROM for Asus A7M266-D motherboard Message-ID: <20021210112410.S34587-100000@euler.salk.edu> A friend and I are hoping to build a diskless cluster of dual Athlons using Asus A7M266-D motherboards. This board is sometimes sold as model A7M266-DL, which is just the A7M266-D bundled with a 3Com 100 Mbps NIC, model PCI-L3C920. According to 3com, the chipset on this little NIC is intended for use as an integrated LAN interface on the motherboard. Asus chose to put it on a tiny PCI card instead. We need to boot these machines over the net using PXE boot. Unfortunately, the A7M266-DL bundle does not include a boot ROM, although the NIC has a socket for one, and the manual says that one exists as an option for network booting. 
exists as an option for network booting.
The manual does not list a part number for this ROM chip, however, and, after many calls to 3com and Asus, I have not yet found anyone who can tell me the part number and how to get the boot ROM chip for this NIC. We would consider an alternative NIC, but it would have to be physically small, like the PCI-L3C920, because we will not be putting these boards into conventional enclosures. The grip of the PCI slot will be the only mechanical connection supporting the card. Also, it would need to be able to do wake-on-LAN through the PCI bus, as the PCI-L3C920 does. The A7M266-D has no connector for an external WOL signal. Our preference would be to use the PCI-L3C920, IF we can find the boot ROM. Has anyone out there succeeded in getting the boot ROM for the PCI-L3C920? If so, please tell me the part number and where you bought it. Any suggestions for alternative NICs that might work for us? We would also welcome suggestions for alternative dual Athlon boards that support WOL and have on-board 100 Mbps LAN (other than Tyan and Gigabyte, which we have already tried and found inadequate for our purposes). Thanks in advance, Jack Wathey From rgb at phy.duke.edu Tue Dec 10 12:20:51 2002 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:02:54 2009 Subject: Asking for help In-Reply-To: <20021210165937.20620.qmail@mail.com> Message-ID: On Tue, 10 Dec 2002, chay hext wrote: > Dear sir/madam, I am a student currently studying an AVCE In ICT in > the United Kingdom and have been asked to research network drivers or > any other drivers I was wondering if you could send any network drivers > as an attachment as this is necessary for my course and have not had > much luck so far and this would be very much appriaciated The entire linux kernel, as well as the entire e.g. freebsd kernel, is chock full of network (and other) device drivers in open source form and are openly and freely available at (e.g.) www.kernel.org, www.freebsd.org, and in all the linux distributions. There is also at least one book that I've familiar with on writing (linux) kernel drivers (Linux Device Drivers, by A. Rubini, O'Reilly and Assoc.) which may be updated at this point and have accumulated more authors, don't remember, and finally www.scyld.com (and Don Becker, who lives there) has whole network driver pages as he wrote a large chunk of the network drivers in use in linux today. Finally, there are numerous mailing lists devoted to network drivers for specific devices. It is a bit difficult to completely detach a study of "network drivers" from the kernel and hardware architecture they are intended to function in, since they have to do things like deal with asynchronous interrupts, possible MP and multitasking, buffering, DMA, parallel threads of operation and more, so you're probably going to want to study the entire kernel and not just "network drivers" anyway. Even if your assignment is really up a couple of ISO/OSI levels and intended to study e.g. the TCP stack this is still true -- you can learn a lot about TCP and IP from RFC's and books and online documents on the subject, but to understand "drivers" it helps to go into the kernel where the drivers function, AFTER reading the aforementioned documentation and learning how TCP/IP packets are formulated, routed, and so forth. Hope this helps. rgb (Finally out from under yet another five day power outage, this time from an unreal ice storm. I've been "Franned". This time we like to froze our butts off as indoor temperatures dropped into the 40's... 
I like calling North Carolina home...:-) > yours sincerly > > Chay Hext > -- > __________________________________________________________ > Sign-up for your own FREE Personalized E-mail at Mail.com > http://www.mail.com/?sr=signup > > One click access to the Top Search Engines > http://www.exactsearchbar.com/mailcom > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From bob at drzyzgula.org Tue Dec 10 13:02:01 2002 From: bob at drzyzgula.org (Bob Drzyzgula) Date: Wed Nov 25 01:02:54 2009 Subject: Need boot ROM for Asus A7M266-D motherboard In-Reply-To: <20021210112410.S34587-100000@euler.salk.edu> References: <20021210112410.S34587-100000@euler.salk.edu> Message-ID: <20021210160201.H16037@www2> You might be able to get etherboot to work for you. Read up at http://www.etherboot.org, and searching http://www.google.com/search?q=etherboot+3c920&filter=0 It seems that the 3C920 is essentially an embedded version of the 3C905, and that only minor changes to the code might be required. --Bob On Tue, Dec 10, 2002 at 11:29:13AM -0800, Jack Wathey wrote: > > ... > > Has anyone out there succeeded in getting the boot ROM for the > PCI-L3C920? If so, please tell me the part number and where you > bought it. > > ... From daniel at labplan.ufsc.br Tue Dec 10 14:43:19 2002 From: daniel at labplan.ufsc.br (Daniel Dotta) Date: Wed Nov 25 01:02:54 2009 Subject: Cluster with 845GERG2L motherboard Message-ID: <008f01c2a09d$89a162c0$8c13a296@micro40> Hi, I started to work with GNU/Linux clusters in 1998. Since then I built four different clusters (here we work with Energy and Weather Forecast parallel problems). In the last cluster that I installed, in final of 2001, I used 16 PC's, Asus A7A266 motherboards ( I had very good price) and Athlons 1.2 GHz in a Diskless System using Linux Mandrake 8.0. In this case I had many problems with heat and stability problems with the motherboard A7A266, the result was that I spent much time to get a good stability to this system. Now I will build another cluster and I have a very good price (again!) to Intel 845GERG2L motherboard. Do you have any comments about this motherboard? Thanks, Daniel Dotta LabPlan - Power Systems Planning Lab Federal University at Santa Catarina Florian?polis/SC, Brazil From lothar at triumf.ca Tue Dec 10 16:23:08 2002 From: lothar at triumf.ca (lothar@triumf.ca) Date: Wed Nov 25 01:02:54 2009 Subject: bproc help request References: <5.1.1.6.2.20021204230819.03a26910@mail.harddata.com> <20021205095306.A25223@mail.harddata.com> <3DF11D0B.6090705@triumf.ca> <20021206171310.A10580@mail.harddata.com> <3DF5230A.8090301@triumf.ca> <20021210103206.A25166@mail.harddata.com> Message-ID: <3DF6856C.2050505@triumf.ca> Mmmm, so far everything seems to be independent of how tftpd is configured, even being switched off. However, I made some progress by changing the /etc/beowulf/config file (the node number and iprange seemed to be in conflict). Now the slave obviously gets to /var/beowulf/boot.img. However, it does not seem capable of loading it. A constant stream of missed block nnn with nnn being a number between 400 and 1400. 
Occasionally it seems to try to reload...then again missed blocks (goes over miunutes probably forever). tcpdump shows a constant stream of udp packages moving with occasional request from the slave for "n-1" which is answered ny the ethernet card no. of the master. boot.img was done with beoboot -2 -n (also with explicitly naming the kernel with no diff.). Lothar Michal Jaegermann wrote: >Are you running your tftp server from xinetd? If yes then please >reconfigure it to make it a standalone daemon (in a xinetd config >file for that service make it 'disable=yes', restart xinetd, and >start your tftp server in a daemon mode - somewhere in your startup >is a good idea for the future). > >I hear that tftp server has troubles with multicast when running >through xinetd. This is apparently not the case for a standalone >mode. > > Michal > > > From becker at scyld.com Tue Dec 10 19:32:31 2002 From: becker at scyld.com (Donald Becker) Date: Wed Nov 25 01:02:54 2009 Subject: Need boot ROM for Asus A7M266-D motherboard In-Reply-To: <20021210112410.S34587-100000@euler.salk.edu> Message-ID: On Tue, 10 Dec 2002, Jack Wathey wrote: > A friend and I are hoping to build a diskless cluster of dual Athlons > using Asus A7M266-D motherboards. This board is sometimes sold as > model A7M266-DL, which is just the A7M266-D bundled with a 3Com 100 > Mbps NIC, model PCI-L3C920. According to 3com, the chipset on this > little NIC is intended for use as an integrated LAN interface on > the motherboard. Asus chose to put it on a tiny PCI card instead. Pay attention to the warning never to use this card on another motherboard... > We need to boot these machines over the net using PXE boot. > Unfortunately, the A7M266-DL bundle does not include a boot ROM, > although the NIC has a socket for one, and the manual says that one Get a 29 series Flash chip of the right pin count. A likely part is the Atmel AT29c010 128Kx8, which was the recommended part for earlier boards. > 3com and Asus, I have not yet found anyone who can tell me the part > number and how to get the boot ROM chip for this NIC. Very few people do know... > Any suggestions for alternative NICs that might work for us? We > would also welcome suggestions for alternative dual Athlon boards > that support WOL and have on-board 100 Mbps LAN (other than Tyan and > Gigabyte, which we have already tried and found inadequate for our > purposes). I'm curious: those are the two main players, what was lacking? -- Donald Becker becker@scyld.com Scyld Computing Corporation http://www.scyld.com 410 Severn Ave. Suite 210 Scyld Beowulf cluster system Annapolis MD 21403 410-990-9993 From russell.builta-eds at eds.com Wed Dec 11 10:21:34 2002 From: russell.builta-eds at eds.com (Builta, Russell D (syzygy)) Date: Wed Nov 25 01:02:54 2009 Subject: Beowulf Questions Message-ID: I am new to clustering but have been working with redhat for some years now. I just need to either be pointed to where I can get the information that I need or what ever. I am looking for the instructions and/or the software where I can install and play with the cluster. Any help would be greatly appreciated. Thanks. Russell Builta Solaris System Administrator Syzygy-Tech "Unix is like having a teepee, NO WINDOWS, NO GATES, and Apache in house!" 
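For the tftp symptoms in the bproc thread above, it can help to confirm, independently of beoboot and the PXE client, that the server answers a read request at all. The rough sketch below speaks just enough TFTP (RFC 1350) to send one read request and report whether any DATA or ERROR packet comes back; the server address and file name on the command line are placeholders for whatever actually sits in /tftpboot, and nothing here is specific to bproc or Scyld.

/* tftpprobe.c -- send a single TFTP read request (RFC 1350) and report
 * whether the server answers with DATA or ERROR.  Example (hypothetical
 * address and file name, adjust to your own setup):
 *   cc -o tftpprobe tftpprobe.c
 *   ./tftpprobe 192.168.0.1 boot.img
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    int s, n;
    struct sockaddr_in srv, from;
    socklen_t fromlen = sizeof(from);
    struct timeval tv = { 5, 0 };           /* 5-second reply timeout */
    unsigned char req[516], reply[516];

    if (argc != 3) {
        fprintf(stderr, "usage: %s server-ip filename\n", argv[0]);
        return 1;
    }

    /* Build the read request: opcode 1, filename, NUL, "octet", NUL */
    memset(req, 0, sizeof(req));
    req[0] = 0; req[1] = 1;
    n = 2;
    n += sprintf((char *)req + n, "%s", argv[2]) + 1;
    n += sprintf((char *)req + n, "octet") + 1;

    s = socket(AF_INET, SOCK_DGRAM, 0);
    setsockopt(s, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(69);               /* well-known TFTP port */
    inet_pton(AF_INET, argv[1], &srv.sin_addr);

    sendto(s, req, n, 0, (struct sockaddr *)&srv, sizeof(srv));

    n = recvfrom(s, reply, sizeof(reply), 0, (struct sockaddr *)&from, &fromlen);
    if (n < 4)
        printf("no reply -- nothing answering on port 69?\n");
    else if (reply[1] == 3)
        printf("got DATA block %d (%d bytes): the file is being served\n",
               (reply[2] << 8) | reply[3], n - 4);
    else if (reply[1] == 5)
        printf("got ERROR: %s\n", (char *)reply + 4);
    else
        printf("unexpected opcode %d\n", reply[1]);
    close(s);
    return 0;
}

A reply here only establishes that the tftp service is reachable and willing to hand out the file; it says nothing about the multicast behaviour Michal mentions, so missed blocks can still occur even when this probe succeeds.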
From bartol at salk.edu Wed Dec 11 13:08:13 2002 From: bartol at salk.edu (Tom Bartol) Date: Wed Nov 25 01:02:54 2009 Subject: Need boot ROM for Asus A7M266-D motherboard In-Reply-To: Message-ID: <20021211111500.Y40397-100000@euler.salk.edu> Dear Donald, Thanks for your quick and useful reply to our query. I've inserted specific responses and follow-up questions in-line below: On Tue, 10 Dec 2002, Donald Becker wrote: > On Tue, 10 Dec 2002, Jack Wathey wrote: > > > A friend and I are hoping to build a diskless cluster of dual Athlons > > using Asus A7M266-D motherboards. This board is sometimes sold as > > model A7M266-DL, which is just the A7M266-D bundled with a 3Com 100 > > Mbps NIC, model PCI-L3C920. According to 3com, the chipset on this > > little NIC is intended for use as an integrated LAN interface on > > the motherboard. Asus chose to put it on a tiny PCI card instead. > > Pay attention to the warning never to use this card on another > motherboard... Thanks for the heads-up -- I think I missed that warning in the manual. > > > We need to boot these machines over the net using PXE boot. > > Unfortunately, the A7M266-DL bundle does not include a boot ROM, > > although the NIC has a socket for one, and the manual says that one > > Get a 29 series Flash chip of the right pin count. > A likely part is the Atmel AT29c010 128Kx8, which was the recommended > part for earlier boards. I am assuming that this part will come as a blank ROM and that we would have to flash it with the appropriate boot code. We're pretty computer savvy but have never programmed or flashed our own chips. Several questions now arise: 1) Is there a vendor that supplies these ROMs already flashed with PXE boot or other suitable boot code (e.g Etherboot). 2) if not then I assume we'd have to obtain necessary hardware and software to flash our own chips. Can you recommend to us the hardware and software for this process? I've found some information on this at the Etherboot web site but your specific recommendations for the 3C920 and Atmel AT29c010 128Kx8 would be most welcome. 3) Etherboot doesn't specifically mention support for the 3C920. This would most likely mean making some (hopefully small) modifications to one of the 3C90x drivers in Etherboot. But making these mods would require detailed manuals of the 3C920. Do you think 3Com would supply us with the necessary info and technical support? > > > 3com and Asus, I have not yet found anyone who can tell me the part > > number and how to get the boot ROM chip for this NIC. > > Very few people do know... > > > Any suggestions for alternative NICs that might work for us? We > > would also welcome suggestions for alternative dual Athlon boards > > that support WOL and have on-board 100 Mbps LAN (other than Tyan and > > Gigabyte, which we have already tried and found inadequate for our > > purposes). > > I'm curious: those are the two main players, what was lacking? Both the Tyan and Gigabyte boards came very close to satisfying our needs but fell short in the following ways: 1) The Tyan board (model S2466) will not do WOL, and does not properly ignore keyboard absence. It does, however, boot reliably via PXE boot and will search repeatedly (even forever) for a boot server until one is found (useful for times when the boot server is slow to respond to the boot request). 2) The Gigabyte board (model 7DPXDW-P) does do WOL and PXE boot and does properly ignore an absent keyboard. 
However WOL works only in soft-poweroff mode and the board will not go into soft-poweroff mode after resumption of power after a power failure (e.g. disconnection of power-supply from line voltage via a power strip). Also the PXE boot sequence gives up after 5 tries and requires hitting the reset switch to initiate another boot attempt (not useful for when the boot server is too slow to respond during the 5 tries). Of course, if you know of any work-arounds for the above short-comings please let us know. In the meantime, however, we are now pursuing the ASUS board mentioned above as solution. WOL behavior and handling of keyboard absence meet our requirements so the remaining issue with the ASUS board is network booting (as mention above). We greatly appreciate any advice you can give. Tom From lindahl at keyresearch.com Wed Dec 11 16:30:54 2002 From: lindahl at keyresearch.com (Greg Lindahl) Date: Wed Nov 25 01:02:54 2009 Subject: infiniband pricing (and performance) In-Reply-To: <18Lh3c-2Z0-00@etnus.com>; from jcownie@etnus.com on Tue, Dec 10, 2002 at 09:52:04AM +0000 References: <18Lh3c-2Z0-00@etnus.com> Message-ID: <20021211163054.B2324@wumpus.internal.keyresearch.com> While we're at it, IBM just killed their Infiniband silicon: http://www.byteandswitch.com/document.asp?doc_id=25633 -- greg From bob at drzyzgula.org Wed Dec 11 16:49:58 2002 From: bob at drzyzgula.org (Bob Drzyzgula) Date: Wed Nov 25 01:02:54 2009 Subject: Need boot ROM for Asus A7M266-D motherboard In-Reply-To: <20021211111500.Y40397-100000@euler.salk.edu> References: <20021211111500.Y40397-100000@euler.salk.edu> Message-ID: <20021211194958.B1291@www2> On Wed, Dec 11, 2002 at 01:08:13PM -0800, Tom Bartol wrote: > > I am assuming that this part will come as a blank ROM and that we would > have to flash it with the appropriate boot code. We're pretty computer > savvy but have never programmed or flashed our own chips. Several > questions now arise: > 1) Is there a vendor that supplies these ROMs already flashed with PXE > boot or other suitable boot code (e.g Etherboot). You may not want this even if they are available. A lot of the configurability of etherboot happens at build time. > 2) if not then I assume we'd have to obtain necessary hardware and > software to flash our own chips. Can you recommend to us the > hardware and software for this process? I've found some information > on this at the Etherboot web site but your specific recommendations > for the 3C920 and Atmel AT29c010 128Kx8 would be most welcome. I know that (at least some of) the 3c905 cards can flash the EEPROM themselves, see the 3c90x.txt file in the etherboot source code. In the etherboot contrib directory, there's a linux-based utility to do this. The onboard 3c920 may or may not be able to do this, but if not, and the sockets for the two boards are the same (does the 3c920 card from ASUS have a normal DIP socket? I can't tell from the blurry picture on ASUS's website) you might be able to use a real 3c905 card to burn the EEPROM, and then transfer it over. Alternatively, you can get an EEPROM burner device yourself. Probably the gold standard is DataIO [1, the "chipwriter" would probably be sufficient], but those are expensive and require maintenance contracts to obtain algorithm updates. Many people have good luck with the Nedhams [2] devices, e.g. the EMP-10. I use a Hi-Lo systems All-11P2 from Tribal Micro [3], which works well, although it's overkill for this task and the software is a bit lame. 
Most commercial device programmers require a Windows system to operate. :-( [1] http://www.dataio.com/ [2] http://www.needhams.com/ [3] http://www.tribalmicro.com/ > 3) Etherboot doesn't specifically mention support for the 3C920. This > would most likely mean making some (hopefully small) modifications to > one of the 3C90x drivers in Etherboot. But making these mods would > require detailed manuals of the 3C920. Do you think 3Com would supply > us with the necessary info and technical support? If no one here knows, I'd highly recommend asking on the etherboot list -- someone may have already done the work. --Bob From becker at scyld.com Wed Dec 11 17:53:53 2002 From: becker at scyld.com (Donald Becker) Date: Wed Nov 25 01:02:54 2009 Subject: Need boot ROM for Asus A7M266-D motherboard In-Reply-To: <20021211111500.Y40397-100000@euler.salk.edu> Message-ID: On Wed, 11 Dec 2002, Tom Bartol wrote: > On Tue, 10 Dec 2002, Donald Becker wrote: > > On Tue, 10 Dec 2002, Jack Wathey wrote: > > > model A7M266-DL, which is just the A7M266-D bundled with a 3Com 100 > > > Mbps NIC, model PCI-L3C920. According to 3com, the chipset on this .. > > > Unfortunately, the A7M266-DL bundle does not include a boot ROM, > > > although the NIC has a socket for one, and the manual says that one > > > > Get a 29 series Flash chip of the right pin count. > > A likely part is the Atmel AT29c010 128Kx8, which was the recommended > > part for earlier boards. > > I am assuming that this part will come as a blank ROM and that we would > have to flash it with the appropriate boot code. Yes. > We're pretty computer > savvy but have never programmed or flashed our own chips. Several > questions now arise: > 1) Is there a vendor that supplies these ROMs already flashed with PXE > boot or other suitable boot code (e.g Etherboot). There likely is a one-person operation somewhere, but I don't know of one. > 2) if not then I assume we'd have to obtain necessary hardware and > software to flash our own chips. I you are running Scyld Beowulf, you already have the software that we have written to do this -- 'vortex-diag'. If you are running some other system, you can get the source code at http://www.scyld.com/diag/index.html > 3) Etherboot doesn't specifically mention support for the 3C920. It very likely works -- the 920 is just one of the 905c chip with slightly different EEPROM programming. > > I'm curious: those are the two main players, what was lacking? > > Both the Tyan and Gigabyte boards came very close to satisfying our > needs but fell short in the following ways: > > 1) The Tyan board (model S2466) will not do WOL, and does not > properly ignore keyboard absence. Those are both likely fixed with BIOS updates. > It does, however, boot reliably > via PXE boot and will search repeatedly (even forever) for a boot > server until one is found That is very useful. It's much better than what other, earlier PXE implementations do: terminate and leave the machine in a useless state. [[ The PXE spec forces all of the server parts to be modified. And to correctly implement the spec, the PXE server parts really have to built together as one program to communication with each other. But the stuff the PXE authors controlled, the BIOS, isn't changed when they add PXE on top. ]] > (useful for times when the boot server is slow to respond to the boot > request). Another reason that we had to write our own PXE server... > 2) The Gigabyte board (model 7DPXDW-P) does do WOL and PXE boot and does > properly ignore an absent keyboard. 
However WOL works only in > soft-poweroff mode Do you mean that the OS has to initialize the NIC first? What NIC does it use? > of power-supply from line voltage via a power strip). Also the PXE > boot sequence gives up after 5 tries and requires hitting the reset > switch to initiate another boot attempt (not useful for when the boot > server is too slow to respond during the 5 tries). More precisely, 4 timeout-retransmits, before that single-threaded, single-interface PXE client exits and different single-threaded PXE tries the next interface. After trying the interfaces, one-by-one, it the BIOS tries the next boot media. If there isn't one, the BIOS just goes into a busy-loop. A proper implementation would send the PXE requests on all interfaces at once, and then select from among the replies. But the standard doesn't say that explicitly, so... -- Donald Becker becker@scyld.com Scyld Computing Corporation http://www.scyld.com 410 Severn Ave. Suite 210 Scyld Beowulf cluster system Annapolis MD 21403 410-990-9993 From c10c at rediffmail.com Thu Dec 12 05:07:35 2002 From: c10c at rediffmail.com (Chandani, Chirag) Date: Wed Nov 25 01:02:54 2009 Subject: Students interested in building a system In India Message-ID: <20021212130735.5295.qmail@mailweb34.rediffmail.com> Hey i m willing to build a cluster down here in india (delhi ) this goes to everyone who is interested in india email back at : c10c@rediffmail.com From timm at fnal.gov Thu Dec 12 09:30:11 2002 From: timm at fnal.gov (Steven Timm) Date: Wed Nov 25 01:02:54 2009 Subject: 3c59x driver, "eth0: Too much work in interrupt, status e401" Message-ID: We have a cluster of machines running Redhat 7.1 with errata kernel 2.4.9-31smp. These are Tyan 2466 boards which have one on-board 3com network controller. dmesg output is: 3c59x: Donald Becker and others. www.scyld.com/network/vortex.html 02:08.0: 3Com PCI 3c905C Tornado at 0x3000. Vers LK1.1.16 WE have also seen the same error on Tyan 2468 nodes running Redhat 7.3 with 2.4.18-18smp kernel, the latest errata kernel available from Redhat. dmesg from these are: 3c59x: Donald Becker and others. www.scyld.com/network/vortex.html 02:08.0: 3Com PCI 3c982 Dual Port Server Cyclone at 0x2400. Vers LK1.1.18-ac divert: allocating divert_blk for eth0 02:09.0: 3Com PCI 3c982 Dual Port Server Cyclone at 0x2480. Vers LK1.1.18-ac divert: allocating divert_blk for eth1 eth0: Setting half-duplex based on MII #24 link partner capability of 0000. eth0: Setting full-duplex based on MII #24 link partner capability of 41e1. ------------------------------------------------- I have two questions: 1) Is the message "too much work in interrupt, status E401" dangerous.. I seem to remember getting it in 6.1 all the time and things still seem to work. But every once in a while we are seeing a system hang within an hour of the time one of these errors happen on the console 2) Is the version string as given in the dmesg the correct way to determine the version number of the 3com driver we are running? Are any newer versions available, and will they help this problem? Thanks for any help you can provide. Steve Timm ------------------------------------------------------------------ Steven C. Timm (630) 840-8525 timm@fnal.gov http://home.fnal.gov/~timm/ Fermilab Computing Division/Core Support Services Dept. 
Assistant Group Leader, Scientific Computing Support Group Lead of Computing Farms Team From lothar at triumf.ca Thu Dec 12 09:22:43 2002 From: lothar at triumf.ca (lothar@triumf.ca) Date: Wed Nov 25 01:02:54 2009 Subject: clustermatic installation/loading of boot kernel Message-ID: <3DF8C5E3.3040404@triumf.ca> Hi, I am trying to set up a clustermatic/bproc based system. I installed your latest version of the CDrom. I made bootimages for a floppy and latter loading. /etc/rc.d/init.d/beowulf starts without problems. I have put one of the diskless-floppy only slaves on a videocard and monitor. ethernet card on slave and master or both 3com905. When I boot up, the floppy installation seems to run flawless till it makes contact to the master. When loading /var/beowulf/boot.img it starts to spill out messages of the the nature missed block nnnn (eg.. 1340) for a while, later it reverses into some rcv /var/beowulf/boot.img followed occassionally by missed block messagess. For being busy with other things I left it for two days. Amazingly enough this morning it had successfully booted. What is going on? Lothar From lothar at triumf.ca Thu Dec 12 10:10:49 2002 From: lothar at triumf.ca (lothar@triumf.ca) Date: Wed Nov 25 01:02:54 2009 Subject: clustermatic installation/loading of boot kernel References: <3DF8C5E3.3040404@triumf.ca> <20021212111119.A18428@lanl.gov> Message-ID: <3DF8D129.7020002@triumf.ca> Erik A. Hendriks wrote: >On Thu, Dec 12, 2002 at 09:22:43AM -0800, lothar@triumf.ca wrote: > > >>Hi, >>I am trying to set up a clustermatic/bproc based system. I installed >>your latest version >>of the CDrom. I made bootimages for a floppy and latter loading. >>/etc/rc.d/init.d/beowulf >>starts without problems. I have put one of the diskless-floppy only >>slaves on a videocard >>and monitor. ethernet card on slave and master or both 3com905. When I >>boot up, the >>floppy installation seems to run flawless till it makes contact to the >>master. When loading >>/var/beowulf/boot.img it starts to spill out messages of the the nature >>missed block nnnn (eg.. 1340) >>for a while, later it reverses into some >>rcv /var/beowulf/boot.img >>followed occassionally by missed block messagess. >> >>For being busy with other things I left it for two days. >>Amazingly enough this morning it had successfully booted. >> >>What is going on? >> >> > >Most likely your network switch is dropping a lot of the multicast >traffic. In my experience pretty much all managed switches can't >handle even a few megabits per second of multicast traffic. > >The solution to this is to switch it to using broadcast instead of >multicast. Put the following in /etc/beowulf/config: > >mcastbcast ethX # switch image service on ethX to broadcast > >You can also throttle the boot image transmits like this: > >mcastthrottle ethX NN # throttle multicast/broadcast on ethX to NN megabits/sec. > >- Erik > > > I put these commands into /etc/beowuld/config with throtteling to 1 megabit/s. I restarted /etc/rc.d/init.d/beowulf. Unfortunately the same messages appear. Lothar From becker at scyld.com Thu Dec 12 12:11:39 2002 From: becker at scyld.com (Donald Becker) Date: Wed Nov 25 01:02:54 2009 Subject: 3c59x driver, "eth0: Too much work in interrupt, status e401" In-Reply-To: Message-ID: On Thu, 12 Dec 2002, Steven Timm wrote: > Subject: 3c59x driver, "eth0: Too much work in interrupt, status e401" ... > I have two questions: > > 1) Is the message "too much work in interrupt, status E401" > dangerous.. 
I seem to remember getting it in 6.1 all the time > and things still seem to work. But every once in a while > we are seeing a system hang within an hour of the time one of > these errors happen on the console It's reporting a problem. All of the IRQ-driven, latency sensitive devices on the system should be reporting the problem, but my network drivers are the pretty much the only kernel modules that check, respond and report this type of problem. > 2) Is the version string as given in the dmesg the correct > way to determine the version number of the 3com driver we are running? > Are any newer versions available, and will they help this problem? Yes, newer driver versions are available from Scyld. There have been recent changes to dynamically adapt to high latency bus master grants. But the problem you are seeing is the driver reporting a problem with some _other_ part of the kernel, not a problem with the driver itself. We can help you track down the problem on professional-services basis, but I'm guessing that you are asking here because you were hoping for an easy, one-size-fits-all answer. The only easy answer is one that hides the problem (e.g. changing the max_interrupt_work module parameter, or using a driver that doesn't put out messages) rather than fixing it. -- Donald Becker becker@scyld.com Scyld Computing Corporation http://www.scyld.com 410 Severn Ave. Suite 210 Scyld Beowulf cluster system Annapolis MD 21403 410-990-9993 From hcxckwk at hkucc.hku.hk Thu Dec 12 22:55:09 2002 From: hcxckwk at hkucc.hku.hk (Kwan Wing Keung) Date: Wed Nov 25 01:02:54 2009 Subject: Availablity of Graph plotting routine in PGI Fortran Message-ID: Dear All, Perhaps this may be slightly off the topic. We are currently porting a big finite element program from NT Digital Visual Fortran (DVF) to Linux PGI Fortran. The portion on number crunching is fairly straight forward, and in fact we are trying to parallelize it at the moment. However on the computer graphics portion, we get stuck. The program allows plotting of the 3D objects in different sections, and it is not sensible to generate the massive plotting data into files and plot there outside the fortran program. My question is therefore whether there is any PGI Fortran callable graphics libraries directly compatible with Microsoft DVF Quickwin routines? Thanks in advance. W.K. Kwan Computer Centre University of Hongkong From beowulf at hoolan.org Sat Dec 14 16:34:31 2002 From: beowulf at hoolan.org (Yung-Sheng Tang) Date: Wed Nov 25 01:02:54 2009 Subject: What's the different between "MPP" and "constellation" ? Message-ID: <20021215081606.H59904-100000@hoolan.org> First, I am sorry this is somewhat off-topic. I was reading TOP500 stuffs then, and found two classifications of "MPP" and "constellation". What's the difference between these two categories? Actually, there is a SGI Origin 2000 in my office. According to the TOP500 database, it belongs to Constellation; On the other hand, IBM pSeries 690 is a typical MPP. But to me, their architectures seem to be quite similar. Thanks for any response to my stupid question. Tang From pu at ku.ac.th Sun Dec 15 10:05:38 2002 From: pu at ku.ac.th (pu@ku.ac.th) Date: Wed Nov 25 01:02:54 2009 Subject: What's the different between "MPP" and "constellation" ? 
In-Reply-To: <20021215081606.H59904-100000@hoolan.org> References: <20021215081606.H59904-100000@hoolan.org> Message-ID: <1039975538.3dfcc47206066@webmail.ku.ac.th> Hi, From Todd_Henderson at isl-3com.com Mon Dec 16 05:38:22 2002 From: Todd_Henderson at isl-3com.com (Henderson, TL Todd) Date: Wed Nov 25 01:02:54 2009 Subject: Good, inexpensive fiber GigE card? Message-ID: <78EF214FF6753F4C9078A43E8C86EB4E174754@gvlmail01.gvl.l-3com.com> Anyone have any recommendations for a good, inexpensive, well supported RedHat fiber GigE card? thanks, Todd From dtj at uberh4x0r.org Mon Dec 16 06:13:29 2002 From: dtj at uberh4x0r.org (Dean Johnson) Date: Wed Nov 25 01:02:54 2009 Subject: Good, inexpensive fiber GigE card? In-Reply-To: <78EF214FF6753F4C9078A43E8C86EB4E174754@gvlmail01.gvl.l-3com.com> References: <78EF214FF6753F4C9078A43E8C86EB4E174754@gvlmail01.gvl.l-3com.com> Message-ID: <1040048009.28606.29.camel@terra> On Mon, 2002-12-16 at 07:38, Henderson, TL Todd wrote: > Anyone have any recommendations for a good, inexpensive, well supported > RedHat fiber GigE card? A word of warning, you may get cheap gig cards, but the cables will still kill you. Every time I order them I have the profound sense that I am in the wrong line of work. Just my $.02 worth. -- -Dean From rauch at inf.ethz.ch Mon Dec 16 07:28:41 2002 From: rauch at inf.ethz.ch (Felix Rauch) Date: Wed Nov 25 01:02:54 2009 Subject: Good, inexpensive fiber GigE card? In-Reply-To: <78EF214FF6753F4C9078A43E8C86EB4E174754@gvlmail01.gvl.l-3com.com> Message-ID: On Mon, 16 Dec 2002, Henderson, TL Todd wrote: > Anyone have any recommendations for a good, inexpensive, well supported > RedHat fiber GigE card? I can only give you a warning: Stay away from DP83820 based cards (like e.g. "Asante Friendlynet"). It's been a while since a student of mine worked with the card, but back then it had bad performance and hardware-bugs. Regards, Felix -- Felix Rauch | Email: rauch@inf.ethz.ch Institute for Computer Systems | Homepage: http://www.cs.inf.ethz.ch/~rauch/ ETH Zentrum / RZ H18 | Phone: +41 1 632 7489 CH - 8092 Zuerich / Switzerland | Fax: +41 1 632 1307 From mgalicki at us.ibm.com Sun Dec 15 17:05:57 2002 From: mgalicki at us.ibm.com (Mike S Galicki) Date: Wed Nov 25 01:02:54 2009 Subject: RSH scaling problems... Message-ID: Can't seem to get rsh to scale past like 63 nodes with mpi jobs. SSH scales much higher, but the performance is a lot worse in customer benchmark tests. I'm guessing that I'm running out of pty's or tty's or something on the headnode. Anyone have some ideas? I believe the default pty's in 2.4.20 is 1024, but when I list /dev/pty I only see 256 entries. MAKEDEV -m 1024 didn't seem to do anything past 256. Mike Galicki Technical Consultant Linux Services Team San Francisco, CA Internet ID: mgalicki@us.ibm.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20021215/940ac810/attachment.html From jcmcknny at uiuc.edu Mon Dec 16 02:46:12 2002 From: jcmcknny at uiuc.edu (jon) Date: Wed Nov 25 01:02:54 2009 Subject: low-latency high-bandwidth OS bypass user-level messaging for commodity(linux) clusters with commodity NICs(<$200), HELP! (GAMMA/EMP/M-VIA/etc.) Message-ID: <001601c2a4f0$5cfc9a80$3027fea9@jonxp> Hi Donald Becker, master of all that is networking! And anyone else that can help :) Perhaps this isn't the best way to get ahold of you, but I've also sent this to the Beowulf list. I've noted your comments on OS Bypass drivers in the past. 
But isn't there some room for non-TCP/IP related traffic, such as with computing clusters? We don't need no stinking TCP! No associated revenue? You could replace Myrinet in the thousands of nodes we have here ALONE at NCSA. We (UIUC theoretical astrophysics group) are in the midst of purchasing a $50K cluster (I know, small, but big for us! :)) and I'm done all the research as to what we should be getting. We ended up going with a Intel Desktop gigabit board and P4, but have found the tests to be very poor. We only have 4 nodes right now because we worried about this very thing. Anyways, our problem is we are willing to pay for a commercial product, but not at any cost, perhaps upto $200 per board. Basically, we see all these solutions such as: Giganet using VIA ServernetII using VIA InfiniBand U-Net AM II LPC PM FM GigaE-PM BIP EMP GAMMA M-VIA Half of these are seemingly dead, those that seem relatively alive are: M-VIA: http://www.nersc.gov/research/ftg/via/ Only support a few devices, and only 1 expensive ($500) gigabit board that's still available (the SysKonnect). GAMMA: http://www.disi.unige.it/project/gamma/index.html Depending on what part of their website you are at, they support different devices. The Alteon TIGON-II results seem to suck for latency, which is our biggest problem. The Netgear GA621 looks great! But we already bought a $5000 copper gigabit switch! We are stuck with it! (HP Procurve 5308xl). Whether they support the GA622 is kinda open or at least untested according to the website. No luck getting in touch with driver writer about that. Besides, EMP guy says the GA622 sucks! EMP: http://www.osc.edu/~pw/emp/ Seems to be interesting, although the available 3Com 3C996, of which we have 3 to test, is only said to be "maybe" supported since it's Tigon 3 and not Tigon 2. And will it such in latency just like the Tigon 2? EMP guys says the GA-622T sucks with it's ns chipset and that was one option with GAMMA, assuming he really did write the driver for both 621 and 622 (their website isn't clear about this, and no emails from the guys there), since the 622 "was" an option. Basically after all my testing (about 2 months of light testing and last 2 weeks of hard-core 24-hour a day testing) I realized TCP sucks and I need an OS Bypass or user-level communication driver. My questions are: 1) Is there a commercial product for a not so expensive board that provides what GAMMA/EMP/M-VIA provide? Any other OS-bypass driver/MPI layer I don't know about? 2) Is there a solution I'm missing? Has to be copper gigabit for linux, OS-bypass like GAMMA, MPI on top of that GAMMA-like. No dead boards, etc. Why are there no commercial products? MPI/Pro is just a funny MPI still on top of TCP, no? I basically want 20us latency for 0 message size and peak bandwidth, for $100-$200 per board on gigabit. Not too much to ask? :) I know it's certainly possible. Currently with Myrinet on P4 I get 17us latency and 80MB/sec bandwidth, gigabit on P4 gets 70us latency and 80MB/sec. On Xeon's I get 50us latency and 95MB/sec bandwidth with latest 3com bcm5700 or latest intel e1000 driver. I'm going to try EMP with my 3c996, but honestly his instructions are damn vague and confusing (i.e. WHAT snapshot of gcc/binutils should I use, what the heck do I do?, etc.) Might try GAMMA too since EMP says it may work as a Tigon processor. GAMMA seems a bit less crazy. Honestly, I can't really figure out what Scyld does. Is it just a linux distribution? Does it actually have OS-bypass networking? Does anything? 
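For reference, latency and bandwidth figures like the ones quoted above usually come from a simple ping-pong microbenchmark. A minimal sketch using only standard MPI-1 calls (so it should build against MPICH, LAM, or a vendor MPI); half of the averaged round-trip time approximates the one-way latency for the chosen message size:

/* pingpong.c - minimal MPI ping-pong sketch.  Run with exactly two
 * active ranks; half the average round-trip time approximates the
 * one-way latency for the given message size. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, i, iters = 1000;
    int size = (argc > 1) ? atoi(argv[1]) : 0;  /* message size in bytes */
    char *buf = malloc(size ? size : 1);
    double t0, t1;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("size %d bytes: %.1f us one-way (avg over %d round trips)\n",
               size, (t1 - t0) / iters / 2.0 * 1e6, iters);

    free(buf);
    MPI_Finalize();
    return 0;
}

Built with mpicc and run with two ranks (e.g. mpirun -np 2 ./pingpong 8192), this gives a quick way to compare NICs, drivers, or interconnects on the same pair of nodes.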
Why is the OS-bypass so hard? If wanting no TCP support, isn't it easier than writing standard linux driver? (like you've done a lot!) Thanks! Jonathan McKinney University of Illinois at Urbana-Champaign Center for Theoretical Astrophysics NCSA From svel at ipr.res.in Mon Dec 16 17:00:54 2002 From: svel at ipr.res.in (N Sakthivel) Date: Wed Nov 25 01:02:54 2009 Subject: MPICH-1.2.0 installation problem in RH7.3 Message-ID: Dear Lists, While installing MPICH-1.2.0 in RH7.3 machine, I am getting the following error in config.log and make.log below as given below. Does any one encountered this problem MPICH-1.2.0 gcc-2.96-110 config.log ------------------ /tmp/ccAGVgM5.o: In function `t': /tmp/ccAGVgM5.o(.text+0x11): undefined reference to `__Argc' ------------------ make.log ----------------- make[4]: *** [slog_irec_write.o] Error 1 make[3]: *** [sloglib] Error 2 make[2]: *** [/packages/mpich-1.2.0/lib/libslog.a] Error 2 make[1]: [mpelib] Error 2 (ignored) ---------------- I would appreciate it if any one help me to solve the above problem. With Advance thanks N. Sakthivel From landman at scalableinformatics.com Mon Dec 16 08:37:41 2002 From: landman at scalableinformatics.com (Joseph Landman) Date: Wed Nov 25 01:02:54 2009 Subject: RSH scaling problems... In-Reply-To: References: Message-ID: <1040056661.1489.14.camel@ome.ctaalliance.org> Hi Mike: Look in your log files Luke... You might find the relevant error message at the tail end of /var/log/message. Look for rshd or in.rshd errors. Some thoughts that might help if RSH is really the issue: In later linuxes (linicies?) rsh spawning is done by xinetd. You want to make sure xinetd can spawn enough processes. Look at the xinetd man page, and the -limit option. Adjust the /etc/xinetd.conf file to reflect the limit. One one system I had to bump this pretty high to allow all the connections to daemons. If you are still using /etc/inetd.conf, you can tell it how many servers it may spawn by including a .nservers at the appropriate part of the line (though my memory is unclear as to which part) You might also be running out of network bandwidth. Try running vmstat 1 netstat -cav and see what your machine is doing network-wise. Try grabbing the atop program from freshmeat, and using that to summarize the net utilization (or use ntop, or any of the others). Joe On Sun, 2002-12-15 at 20:05, Mike S Galicki wrote: > Can't seem to get rsh to scale past like 63 nodes with mpi jobs. SSH > scales much higher, but the performance is a lot worse in customer > benchmark tests. I'm guessing that I'm running out of pty's or tty's > or something on the headnode. Anyone have some ideas? I believe the > default pty's in 2.4.20 is 1024, but when I list /dev/pty I only see > 256 entries. MAKEDEV -m 1024 didn't seem to do anything past 256. > > Mike Galicki > Technical Consultant > Linux Services Team > San Francisco, CA > Internet ID: mgalicki@us.ibm.com -- Joseph Landman, Ph.D. Scalable Informatics LLC email: landman@scalableinformatics.com web: http://scalableinformatics.com voice: +1 734 612 4615 fax: +1 734 398 5774 From math at velocet.ca Mon Dec 16 09:32:39 2002 From: math at velocet.ca (Ken Chase) Date: Wed Nov 25 01:02:54 2009 Subject: Good, inexpensive fiber GigE card? In-Reply-To: ; from rauch@inf.ethz.ch on Mon, Dec 16, 2002 at 04:28:41PM +0100 References: <78EF214FF6753F4C9078A43E8C86EB4E174754@gvlmail01.gvl.l-3com.com> Message-ID: <20021216123239.I88140@velocet.ca> On Mon, Dec 16, 2002 at 04:28:41PM +0100, Felix Rauch's all... 
> On Mon, 16 Dec 2002, Henderson, TL Todd wrote: > > Anyone have any recommendations for a good, inexpensive, well supported > > RedHat fiber GigE card? > > I can only give you a warning: Stay away from DP83820 based cards > (like e.g. "Asante Friendlynet"). It's been a while since a student of > mine worked with the card, but back then it had bad performance and > hardware-bugs. We've had relatively few problems with our ARK 83820 cards. (They're an exact copy (down to the diodes) of the DLINK 83820 cards (apparently which follow the reference layout). bcrl@redhat has gotten ~ 85MB/s across a pair of these cards directly connected. They're not great for latency but they're not bad for throughput with jumbo frames. We bought them because they were $35 at the time and we needed a boatload for what we were doing. The intel cards were $250 odd or something. However, the intels e1000's are now like $50-55 and quite nice. Heard good things about tigon 3 based cards, however. Anyone used some yet? /kc > > Regards, > Felix > -- > Felix Rauch | Email: rauch@inf.ethz.ch > Institute for Computer Systems | Homepage: http://www.cs.inf.ethz.ch/~rauch/ > ETH Zentrum / RZ H18 | Phone: +41 1 632 7489 > CH - 8092 Zuerich / Switzerland | Fax: +41 1 632 1307 > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Ken Chase, math@velocet.ca * Velocet Communications Inc. * Toronto, CANADA From joachim at ccrl-nece.de Mon Dec 16 09:29:43 2002 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Wed Nov 25 01:02:54 2009 Subject: MPICH-1.2.0 installation problem in RH7.3 In-Reply-To: References: Message-ID: <200212161829.43325.joachim@ccrl-nece.de> N Sakthivel: > make.log > ----------------- > make[4]: *** [slog_irec_write.o] Error 1 > make[3]: *** [sloglib] Error 2 > make[2]: *** [/packages/mpich-1.2.0/lib/libslog.a] Error 2 > make[1]: [mpelib] Error 2 (ignored) > ---------------- Strange... what exactly is the error? - workaround: configure with -nompe (or --disable-mpe) - possible solution 1: fix the source file - possible solution 2: use original MPICH archive (latest version is 1.2.4) Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From siegert at sfu.ca Mon Dec 16 10:56:29 2002 From: siegert at sfu.ca (Martin Siegert) Date: Wed Nov 25 01:02:54 2009 Subject: MPICH-1.2.0 installation problem in RH7.3 In-Reply-To: <200212161829.43325.joachim@ccrl-nece.de> References: <200212161829.43325.joachim@ccrl-nece.de> Message-ID: <20021216185629.GA13227@stikine.ucs.sfu.ca> On Mon, Dec 16, 2002 at 06:29:43PM +0100, Joachim Worringen wrote: > N Sakthivel: > > make.log > > ----------------- > > make[4]: *** [slog_irec_write.o] Error 1 > > make[3]: *** [sloglib] Error 2 > > make[2]: *** [/packages/mpich-1.2.0/lib/libslog.a] Error 2 > > make[1]: [mpelib] Error 2 (ignored) > > ---------------- > > Strange... what exactly is the error? > - workaround: configure with -nompe (or --disable-mpe) > - possible solution 1: fix the source file > - possible solution 2: use original MPICH archive (latest version is 1.2.4) Actually, I would start with solution 2 (install 1.2.4 including all patches). There are significant performance improvements in mpich since 1.2.2. If that still does not compile, send an email to mpi-bugs@mcs.anl.gov. 
The folks at Argonne have been very helpful with all kinds of problems when I asked them. Cheers, Martin ======================================================================== Martin Siegert Academic Computing Services phone: (604) 291-4691 Simon Fraser University fax: (604) 291-4242 Burnaby, British Columbia email: siegert@sfu.ca Canada V5A 1S6 ======================================================================== From jbecker at fryed.net Mon Dec 16 11:29:17 2002 From: jbecker at fryed.net (Jesse Becker) Date: Wed Nov 25 01:02:54 2009 Subject: RSH scaling problems... In-Reply-To: References: Message-ID: On Sun, 15 Dec 2002, Mike S Galicki wrote: > I believe the default pty's in 2.4.20 is 1024, but when I list /dev/pty > I only see 256 entries. MAKEDEV -m 1024 didn't seem to do anything past > 256. The default number of ptys is 254 in 2.4.x Linux kernels. This is hardcoded, and you need a kernel recompile if you need more. --Jesse From mathog at mendel.bio.caltech.edu Mon Dec 16 12:47:02 2002 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:02:54 2009 Subject: Minimal, fast single node queue software Message-ID: I have an application which is very simple and efficient so that the parallel compute node part can run in much less than 1 second. And therein lies the rub - nothing I've tried so far can handle job queues across 20 nodes nearly that fast. Essentially all it needs to do is: 1. start the same script on all nodes for job K in queue=NAME where each queue is a simple first in, first out scheduler. 2. Notify the master (run a script there) when all nodes complete job K 3. Allow job K+1 to start on a node when K completes (even though K might still be running on other nodes). There is no load balancing required, nor time limits. SGE had an overhead of approximately 10 seconds to handle all the setup/tear down, for the compute and and clean up jobs. I messed with the SGE code but couldn't drop it down below that, mostly because some of the time functions used in the code only kept time to the second. PVM had about 2.5 seconds of overhead plus it required a major kludge involving lock files and usleep in the scripts to restrict them to run sequentially on the remote nodes. Although it was a kludge it let job K+1 start within .02 second of job K terminating - at the cost of a large number of processes spinning in usleep and checking and rechecking the lock files. Surely there's an extant queue program around that can handle this many very short jobs well? If so, what it is it? The best I've been able to do so far to speed up job launch is to use a hacked version of the linux rsh command where rsh -z node1,node2...nodeN command runs the same command on all listed nodes and ignores all IO. This launches commands at a rate of about 1 node/.011 second. That's better than anything else I've tried so far - it's limited by rcmd() and the network, and not much else. For a cluster with a lot more nodes than ours it would probably be worth it to modify rsh further so that it could automatically split the target list and cause a binary tree distribution of the command. For our 20 nodes the .055 vs. .220 seconds isn't very much, But for 200 nodes it would be .088 vs 2.2, which is a pretty big difference. The fast rsh only starts the jobs quickly so I still need at least a local queue system on each node where: submit -q name command will serialize the jobs. The down side of the -z flag is that there's no way to tell when the remote job has completed. 
So a job termination integrator is also needed. Preferably something a bit more elegant than having all the jobs end with rsh -z master.node touch >/tmp/jobname/`hostname`_done and having a shell script sleep / count the jobs until the expected number report termination. Thanks, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From jbecker at fryed.net Mon Dec 16 12:54:31 2002 From: jbecker at fryed.net (Jesse Becker) Date: Wed Nov 25 01:02:54 2009 Subject: RSH scaling problems... In-Reply-To: References: Message-ID: On Mon, 16 Dec 2002, Jesse Becker wrote: > The default number of ptys is 254 in 2.4.x Linux kernels. This is > hardcoded, and you need a kernel recompile if you need more. Bah, that should be 256. Mea culpa. --Jesse From lusk at mcs.anl.gov Mon Dec 16 13:26:37 2002 From: lusk at mcs.anl.gov (Rusty Lusk) Date: Wed Nov 25 01:02:54 2009 Subject: MPICH-1.2.0 installation problem in RH7.3 In-Reply-To: References: Message-ID: <20021216.152637.25133062.lusk@localhost> From: N Sakthivel Subject: MPICH-1.2.0 installation problem in RH7.3 Date: Mon, 16 Dec 2002 20:00:54 -0500 (GMT) > > Dear Lists, > While installing MPICH-1.2.0 in RH7.3 machine, I am getting the > following error in config.log and make.log below as given below. > Does any one encountered this problem You should start by getting the current version of MPICH, which is 1.2.4. 1.2.0 is pretty old. You can get it from http://www.mcs.anl.gov/mpi/mpich. From patrick at myri.com Mon Dec 16 13:17:25 2002 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:02:54 2009 Subject: Minimal, fast single node queue software In-Reply-To: References: Message-ID: <1040073447.422.11.camel@asterix> On Mon, 2002-12-16 at 15:47, David Mathog wrote: > The best I've been able to do so far to speed up job launch > is to use a hacked version of the linux rsh command where Have you tried MPICH's MPD ? It's fast and it scales. Patrick -- Patrick Geoffray, Phd Myricom, Inc. http://www.myri.com From lindahl at keyresearch.com Mon Dec 16 13:21:35 2002 From: lindahl at keyresearch.com (Greg Lindahl) Date: Wed Nov 25 01:02:54 2009 Subject: Minimal, fast single node queue software In-Reply-To: ; from mathog@mendel.bio.caltech.edu on Mon, Dec 16, 2002 at 12:47:02PM -0800 References: Message-ID: <20021216132135.C1509@wumpus.internal.keyresearch.com> On Mon, Dec 16, 2002 at 12:47:02PM -0800, David Mathog wrote: > I have an application which is very simple and efficient > so that the parallel compute node part can run in much less > than 1 second. And therein lies the rub - nothing I've tried > so far can handle job queues across 20 nodes nearly that fast. "Doctor, it hurts when I do this." What you want to do instead is write a "worker" which asks some server for work units. You aren't going to get there with anything that opens lots of sockets, generates lots of new processes, etc. -- greg From becker at scyld.com Mon Dec 16 16:31:40 2002 From: becker at scyld.com (Donald Becker) Date: Wed Nov 25 01:02:54 2009 Subject: low-latency high-bandwidth OS bypass user-level messaging for commodity(linux) clusters with commodity NICs(<$200), HELP! (GAMMA/EMP/M-VIA/etc.) In-Reply-To: <001601c2a4f0$5cfc9a80$3027fea9@jonxp> Message-ID: On Mon, 16 Dec 2002, jon wrote: > Perhaps this isn't the best way to get ahold of you, but I've also sent > this to the Beowulf list. I've noted your comments on OS Bypass drivers > in the past. 
But isn't there some room for non-TCP/IP related traffic, > such as with computing clusters? We don't need no stinking TCP! No > associated revenue? You could replace Myrinet in the thousands of nodes > we have here ALONE at NCSA. Very likely not. Myrinet has both a cost and increasing performance advantage over gigabit Ethernet when the switch is larger than about 96 ports. > We (UIUC theoretical astrophysics group) are in the midst of purchasing > a $50K cluster (I know, small, but big for us! :)) and I'm done all the > research as to what we should be getting. We ended up going with a > Intel Desktop gigabit board and P4, but have found the tests to be very > poor. We only have 4 nodes right now because we worried about this very > thing. Latency or bandwidth? And what are you using to test? > InfiniBand Hardware is just now appearing, after a rapid committee-driven complexity increase. The initial price is well above Myrinet to capture the value to the "must-have" crowd. The question is if there is the motivation to travel down the price-volume curve > Giganet using VIA > ServernetII using VIA Both effectively dead hardware products, although Giganet is still shipping is current hardware. > U-Net Dead software effort, pre-dated VIA protocol. > BIP Magic protocol using custom Myrinet firmware. Grumble: early performance numbers were not reproducible (I got exactly 50% of tech report numbers on same hardware and software). > PM, FM The other Myrinet custom protocols? Dead. With a communication processor to do the work at the other end, you can do magic application-specific things. And when you write the paper, the programming effort was mimimal and the performance astonishing. > EMP The only thing I knew of by this name was a predecessor to IPMI. > LPC A google search shows only a low-speed serial communication project. > Half of these are seemingly dead, those that seem relatively alive are: Only half? Do you have the same number of fingers on both hands? A general guideline is that building a safe, reliable, general purpose communication protocol is always much more difficult than getting something that only works in the perfect conditions. You must implement checksums, sequence numbers, and recovery for failed endpoints. If you are directly writing to a remote process memory space, you have to take into account VM page table tracking and cache coherency. These can quickly erase any performance advantage of "zero copy". > M-VIA: http://www.nersc.gov/research/ftg/via/ > Only support a few devices, and only 1 expensive ($500) gigabit board > that's still available (the SysKonnect). > GAMMA: http://www.disi.unige.it/project/gamma/index.html The top project for current support. > Depending on what part of their website you are at, they support > different devices. The Alteon TIGON-II results seem to suck for > latency, which is our biggest problem. The Netgear GA621 looks great! > But we already bought a $5000 copper gigabit switch! We are stuck with > it! (HP Procurve 5308xl). Whether they support the GA622 is kinda open > or at least untested according to the website. No luck getting in touch > with driver writer about that. Besides, EMP guy says the GA622 sucks! While there are better Gigabit chips than the DP83820, most of its bad reputation comes from the poor performance of the other drivers out there. We get quite reasonable performance from it with the Scyld ns820.c driver. Others have reported a 2.5-3X performance improvement over the driver written by Red Hat. 
> EMP: http://www.osc.edu/~pw/emp/ > Seems to be interesting, although the available 3Com 3C996, of which we > have 3 to test, is only said to be "maybe" supported since it's Tigon 3 > and not Tigon 2. And will it such in latency just like the Tigon 2? > EMP guys says the GA-622T sucks with it's ns chipset and that was one > option with GAMMA, assuming he really did write the driver for both 621 > and 622 (their website isn't clear about this, and no emails from the > guys there), since the 622 "was" an option. ... > My questions are: > > 1) Is there a commercial product for a not so expensive board that > provides what GAMMA/EMP/M-VIA provide? Any other OS-bypass driver/MPI > layer I don't know about? No commercial company is likely to support a communication protocol unless they can pay for it (and have a hope of it working!) by bundling it with expensive hardware. We would support something on a best-effort or time-and-materials basis. > 2) Is there a solution I'm missing? Has to be copper gigabit for linux, > OS-bypass like GAMMA, MPI on top of that GAMMA-like. No dead boards, > etc. Why are there no commercial products? MPI/Pro is just a funny MPI > still on top of TCP, no? With custom versions available -- whatever you are willing to pay for. > Honestly, I can't really figure out what Scyld does. Is it just a linux > distribution? Does it actually have OS-bypass networking? Does > anything? We are a Linux distribution specifically designed for clustering. We have various modification for higher network performance, but that's actually used against us! Our competitors say "Look, Scyld modifies the kernel while we ship you a completely standard system." Then, when things don't work (as so often happen with complex systems), they get to say "that's the standard Linux behavior, it's not our problem". > Why is the OS-bypass so hard? If wanting no TCP support, isn't it > easier than writing standard linux driver? (like you've done a lot!) It took 20 years to get TCP right... -- Donald Becker becker@scyld.com Scyld Computing Corporation http://www.scyld.com 410 Severn Ave. Suite 210 Scyld Beowulf cluster system Annapolis MD 21403 410-990-9993 From gropp at mcs.anl.gov Mon Dec 16 16:26:28 2002 From: gropp at mcs.anl.gov (William Gropp) Date: Wed Nov 25 01:02:54 2009 Subject: MPICH-1.2.0 installation problem in RH7.3 In-Reply-To: Message-ID: <5.1.1.6.2.20021216182539.035e79f0@localhost> At 08:00 PM 12/16/2002 -0500, N Sakthivel wrote: >Dear Lists, > While installing MPICH-1.2.0 in RH7.3 machine, I am getting the >following error in config.log and make.log below as given below. >Does any one encountered this problem > >MPICH-1.2.0 The current version of MPICH is 1.2.4, with 1.2.5 soon to be released. I recommend getting the current version and seeing if that works first. Bill From mprinkey at aeolusresearch.com Mon Dec 16 16:37:22 2002 From: mprinkey at aeolusresearch.com (Michael T. Prinkey) Date: Wed Nov 25 01:02:54 2009 Subject: low-latency high-bandwidth OS bypass user-level messaging for commodity(linux) clusters with commodity NICs(<$200), HELP! (GAMMA/EMP/M-VIA/etc.) In-Reply-To: <001601c2a4f0$5cfc9a80$3027fea9@jonxp> References: <001601c2a4f0$5cfc9a80$3027fea9@jonxp> Message-ID: <1746.192.168.1.4.1040085442.squirrel@ra.aeolustec.com> Hi Jon, I began investigating the same issue a few months ago as Cu gigabit cards and switches became widely available. The situation is much the same now as it was then. No one is actively developing bypass support for the new networking hardware. 
The state of MVIA in particular has been in limbo for at least 18 months. GAMMA is interesting and seems to be more actively developed, but when I last checked there was no support for SMP. The MVIA web site was promising a second-generation of the core code which intended to make driver development more simple. To my knowledge, that version 2 has not seen the light of day. I think that many people would love to take advantage of newer cheap gigabit hardware with OS bypass, but as of yet, no one is really taking the lead in pulling the drivers together. In my mind, this is the next big hurdle for the Beowulf community to overcome if we intend to really use commodity hardware for network interconnects. Mike Prinkey Aeolus Research, Inc. > Hi Donald Becker, master of all that is networking! And anyone else > that can help :) > > Perhaps this isn't the best way to get ahold of you, but I've also sent > this to the Beowulf list. I've noted your comments on OS Bypass > drivers in the past. But isn't there some room for non-TCP/IP related > traffic, such as with computing clusters? We don't need no stinking > TCP! No associated revenue? You could replace Myrinet in the > thousands of nodes we have here ALONE at NCSA. > > We (UIUC theoretical astrophysics group) are in the midst of purchasing > a $50K cluster (I know, small, but big for us! :)) and I'm done all the > research as to what we should be getting. We ended up going with a > Intel Desktop gigabit board and P4, but have found the tests to be very > poor. We only have 4 nodes right now because we worried about this > very thing. > > Anyways, our problem is we are willing to pay for a commercial product, > but not at any cost, perhaps upto $200 per board. Basically, we see > all these solutions such as: > > Giganet using VIA > ServernetII using VIA > InfiniBand > U-Net > AM II > LPC > PM > FM > GigaE-PM > BIP > EMP > GAMMA > M-VIA > > Half of these are seemingly dead, those that seem relatively alive are: > > M-VIA: http://www.nersc.gov/research/ftg/via/ > Only support a few devices, and only 1 expensive ($500) gigabit board > that's still available (the SysKonnect). > > GAMMA: http://www.disi.unige.it/project/gamma/index.html > Depending on what part of their website you are at, they support > different devices. The Alteon TIGON-II results seem to suck for > latency, which is our biggest problem. The Netgear GA621 looks great! > But we already bought a $5000 copper gigabit switch! We are stuck with > it! (HP Procurve 5308xl). Whether they support the GA622 is kinda open > or at least untested according to the website. No luck getting in > touch with driver writer about that. Besides, EMP guy says the GA622 > sucks! > > > EMP: http://www.osc.edu/~pw/emp/ > Seems to be interesting, although the available 3Com 3C996, of which we > have 3 to test, is only said to be "maybe" supported since it's Tigon 3 > and not Tigon 2. And will it such in latency just like the Tigon 2? > EMP guys says the GA-622T sucks with it's ns chipset and that was one > option with GAMMA, assuming he really did write the driver for both 621 > and 622 (their website isn't clear about this, and no emails from the > guys there), since the 622 "was" an option. > > Basically after all my testing (about 2 months of light testing and > last 2 weeks of hard-core 24-hour a day testing) I realized TCP sucks > and I need an OS Bypass or user-level communication driver. 
> > My questions are: > > 1) Is there a commercial product for a not so expensive board that > provides what GAMMA/EMP/M-VIA provide? Any other OS-bypass driver/MPI > layer I don't know about? > > 2) Is there a solution I'm missing? Has to be copper gigabit for > linux, OS-bypass like GAMMA, MPI on top of that GAMMA-like. No dead > boards, etc. Why are there no commercial products? MPI/Pro is just a > funny MPI still on top of TCP, no? > > I basically want 20us latency for 0 message size and peak bandwidth, > for $100-$200 per board on gigabit. Not too much to ask? :) I know > it's certainly possible. > > Currently with Myrinet on P4 I get 17us latency and 80MB/sec bandwidth, > gigabit on P4 gets 70us latency and 80MB/sec. On Xeon's I get 50us > latency and 95MB/sec bandwidth with latest 3com bcm5700 or latest intel > e1000 driver. > > I'm going to try EMP with my 3c996, but honestly his instructions are > damn vague and confusing (i.e. WHAT snapshot of gcc/binutils should I > use, what the heck do I do?, etc.) Might try GAMMA too since EMP says > it may work as a Tigon processor. GAMMA seems a bit less crazy. > > Honestly, I can't really figure out what Scyld does. Is it just a > linux distribution? Does it actually have OS-bypass networking? Does > anything? > > Why is the OS-bypass so hard? If wanting no TCP support, isn't it > easier than writing standard linux driver? (like you've done a lot!) > > Thanks! > Jonathan McKinney > University of Illinois at Urbana-Champaign > Center for Theoretical Astrophysics > NCSA > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From jcmcknny at uiuc.edu Mon Dec 16 22:08:32 2002 From: jcmcknny at uiuc.edu (jon) Date: Wed Nov 25 01:02:54 2009 Subject: low-latency high-bandwidth OS bypass user-level messaging for commodity(linux) clusters with commodity NICs(<$200), HELP! (GAMMA/EMP/M-VIA/etc.) In-Reply-To: <1746.192.168.1.4.1040085442.squirrel@ra.aeolustec.com> Message-ID: <004a01c2a592$bd05aff0$3027fea9@jonxp> What's the deal with SCore? http://pdswww.rwcp.or.jp/ The guys at GAMMA mentioned them, as well as some friends on mine recently. It seems to do quite well with Intel gigabit chips (30us latency and 120MB/sec peak bandwidth). Why have I never heard of it? How do they achieve this? Not a total OS-bypass linux distribution? They seem to be an entire linux dist. Why can't Scyld do this? -Jon > -----Original Message----- > From: Michael T. Prinkey [mailto:mprinkey@aeolusresearch.com] > Sent: Monday, December 16, 2002 6:37 PM > To: jcmcknny@uiuc.edu > Cc: beowulf@beowulf.org; becker@scyld.com > Subject: Re: low-latency high-bandwidth OS bypass user-level messaging for > commodity(linux) clusters with commodity NICs(<$200), HELP! (GAMMA/EMP/M-VIA/etc.) > > Hi Jon, > > I began investigating the same issue a few months ago as Cu gigabit cards > and switches became widely available. The situation is much the same now as > it was then. No one is actively developing bypass support for the new > networking hardware. The state of MVIA in particular has been in limbo for > at least 18 months. GAMMA is interesting and seems to be more actively > developed, but when I last checked there was no support for SMP. The MVIA > web site was promising a second-generation of the core code which intended > to make driver development more simple. To my knowledge, that version 2 has > not seen the light of day. 
> > I think that many people would love to take advantage of newer cheap gigabit > hardware with OS bypass, but as of yet, no one is really taking the lead in > pulling the drivers together. In my mind, this is the next big hurdle for > the Beowulf community to overcome if we intend to really use commodity > hardware for network interconnects. > > Mike Prinkey > Aeolus Research, Inc. > > > Hi Donald Becker, master of all that is networking! And anyone else > > that can help :) > > > > Perhaps this isn't the best way to get ahold of you, but I've also sent > > this to the Beowulf list. I've noted your comments on OS Bypass > > drivers in the past. But isn't there some room for non-TCP/IP related > > traffic, such as with computing clusters? We don't need no stinking > > TCP! No associated revenue? You could replace Myrinet in the > > thousands of nodes we have here ALONE at NCSA. > > > > We (UIUC theoretical astrophysics group) are in the midst of purchasing > > a $50K cluster (I know, small, but big for us! :)) and I'm done all the > > research as to what we should be getting. We ended up going with a > > Intel Desktop gigabit board and P4, but have found the tests to be very > > poor. We only have 4 nodes right now because we worried about this > > very thing. > > > > Anyways, our problem is we are willing to pay for a commercial product, > > but not at any cost, perhaps upto $200 per board. Basically, we see > > all these solutions such as: > > > > Giganet using VIA > > ServernetII using VIA > > InfiniBand > > U-Net > > AM II > > LPC > > PM > > FM > > GigaE-PM > > BIP > > EMP > > GAMMA > > M-VIA > > > > Half of these are seemingly dead, those that seem relatively alive are: > > > > M-VIA: http://www.nersc.gov/research/ftg/via/ > > Only support a few devices, and only 1 expensive ($500) gigabit board > > that's still available (the SysKonnect). > > > > GAMMA: http://www.disi.unige.it/project/gamma/index.html > > Depending on what part of their website you are at, they support > > different devices. The Alteon TIGON-II results seem to suck for > > latency, which is our biggest problem. The Netgear GA621 looks great! > > But we already bought a $5000 copper gigabit switch! We are stuck with > > it! (HP Procurve 5308xl). Whether they support the GA622 is kinda open > > or at least untested according to the website. No luck getting in > > touch with driver writer about that. Besides, EMP guy says the GA622 > > sucks! > > > > > > EMP: http://www.osc.edu/~pw/emp/ > > Seems to be interesting, although the available 3Com 3C996, of which we > > have 3 to test, is only said to be "maybe" supported since it's Tigon 3 > > and not Tigon 2. And will it such in latency just like the Tigon 2? > > EMP guys says the GA-622T sucks with it's ns chipset and that was one > > option with GAMMA, assuming he really did write the driver for both 621 > > and 622 (their website isn't clear about this, and no emails from the > > guys there), since the 622 "was" an option. > > > > Basically after all my testing (about 2 months of light testing and > > last 2 weeks of hard-core 24-hour a day testing) I realized TCP sucks > > and I need an OS Bypass or user-level communication driver. > > > > My questions are: > > > > 1) Is there a commercial product for a not so expensive board that > > provides what GAMMA/EMP/M-VIA provide? Any other OS-bypass driver/MPI > > layer I don't know about? > > > > 2) Is there a solution I'm missing? Has to be copper gigabit for > > linux, OS-bypass like GAMMA, MPI on top of that GAMMA-like. 
No dead > > boards, etc. Why are there no commercial products? MPI/Pro is just a > > funny MPI still on top of TCP, no? > > > > I basically want 20us latency for 0 message size and peak bandwidth, > > for $100-$200 per board on gigabit. Not too much to ask? :) I know > > it's certainly possible. > > > > Currently with Myrinet on P4 I get 17us latency and 80MB/sec bandwidth, > > gigabit on P4 gets 70us latency and 80MB/sec. On Xeon's I get 50us > > latency and 95MB/sec bandwidth with latest 3com bcm5700 or latest intel > > e1000 driver. > > > > I'm going to try EMP with my 3c996, but honestly his instructions are > > damn vague and confusing (i.e. WHAT snapshot of gcc/binutils should I > > use, what the heck do I do?, etc.) Might try GAMMA too since EMP says > > it may work as a Tigon processor. GAMMA seems a bit less crazy. > > > > Honestly, I can't really figure out what Scyld does. Is it just a > > linux distribution? Does it actually have OS-bypass networking? Does > > anything? > > > > Why is the OS-bypass so hard? If wanting no TCP support, isn't it > > easier than writing standard linux driver? (like you've done a lot!) > > > > Thanks! > > Jonathan McKinney > > University of Illinois at Urbana-Champaign > > Center for Theoretical Astrophysics > > NCSA > > > > > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > From jcmcknny at uiuc.edu Mon Dec 16 22:12:30 2002 From: jcmcknny at uiuc.edu (jon) Date: Wed Nov 25 01:02:54 2009 Subject: low-latency high-bandwidth OS bypass user-level messaging for commodity(linux) clusters with commodity NICs(<$200), HELP! (GAMMA/EMP/M-VIA/etc.) In-Reply-To: Message-ID: <004b01c2a593$48432250$3027fea9@jonxp> > -----Original Message----- > From: Donald Becker [mailto:becker@scyld.com] > Sent: Monday, December 16, 2002 6:32 PM > To: jon > Cc: beowulf@beowulf.org > Subject: Re: low-latency high-bandwidth OS bypass user-level messaging for > commodity(linux) clusters with commodity NICs(<$200), HELP! (GAMMA/EMP/M-VIA/etc.) > > On Mon, 16 Dec 2002, jon wrote: > > > Perhaps this isn't the best way to get ahold of you, but I've also sent > > this to the Beowulf list. I've noted your comments on OS Bypass drivers > > in the past. But isn't there some room for non-TCP/IP related traffic, > > such as with computing clusters? We don't need no stinking TCP! No > > associated revenue? You could replace Myrinet in the thousands of nodes > > we have here ALONE at NCSA. > > Very likely not. Myrinet has both a cost and increasing performance > advantage over gigabit Ethernet when the switch is larger than about 96 > ports. [jon] I see. > > > We (UIUC theoretical astrophysics group) are in the midst of purchasing > > a $50K cluster (I know, small, but big for us! :)) and I'm done all the > > research as to what we should be getting. We ended up going with a > > Intel Desktop gigabit board and P4, but have found the tests to be very > > poor. We only have 4 nodes right now because we worried about this very > > thing. > > Latency or bandwidth? And what are you using to test? [jon] We need (projected 0 message size) latencies to be about 30-40us for our applications. We end up with message sizes ranging from 256bytes to 8KB for different applications. A 2.4Ghz P4 cluster has an idling CPU due to the latency. 
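Plugging jon's numbers into a crude first-order model t(n) = latency + n/bandwidth (ignoring overlap, protocol overhead, and contention): with 17 us / 80 MB/s for Myrinet and 70 us / 80 MB/s for gigabit TCP, a 256-byte message costs roughly 17 + 3.2 = 20 us versus 70 + 3.2 = 73 us, while an 8 KB message costs roughly 17 + 102 = 119 us versus 70 + 102 = 172 us. At the small end of his 256 B - 8 KB range the latency gap is essentially the whole difference, which is why OS-bypass rather than more wire bandwidth is what he is after.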
> While there are better Gigabit chips than the DP83820, most of its bad > reputation comes from the poor performance of the other drivers out > there. We get quite reasonable performance from it with the Scyld > ns820.c driver. Others have reported a 2.5-3X performance improvement > over the driver written by Red Hat. [jon] I see, what kinda of 0message latency and peak bandwidth do you get on 64-bit 66Mhz bus? [jon] Thanks! -Jon From patrick at myri.com Mon Dec 16 23:38:26 2002 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:02:54 2009 Subject: low-latency high-bandwidth OS bypass user-level messaging for commodity(linux) clusters with commodity NICs(<$200), HELP! (GAMMA/EMP/M-VIA/etc.) In-Reply-To: <001601c2a4f0$5cfc9a80$3027fea9@jonxp> References: <001601c2a4f0$5cfc9a80$3027fea9@jonxp> Message-ID: <1040110709.480.255.camel@asterix> Hi Jon, On Mon, 2002-12-16 at 05:46, jon wrote: > Currently with Myrinet on P4 I get 17us latency and 80MB/sec bandwidth, Throw your machine to the garbage and buy a P4 with a decent chipset, IMHO. > Why is the OS-bypass so hard? If wanting no TCP support, isn't it > easier than writing standard linux driver? (like you've done a lot!) The last time I talked to Pete Wyckoff (SC02), he was not working on EMP anymore. It would be interesting to get his view on the problem (Pete ?). My 2 cents. Patrick -- Patrick Geoffray, Phd Myricom, Inc. http://www.myri.com From patrick at myri.com Mon Dec 16 23:58:21 2002 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:02:54 2009 Subject: low-latency high-bandwidth OS bypass user-level messaging for commodity(linux) clusters with commodity NICs(<$200), HELP! (GAMMA/EMP/M-VIA/etc.) In-Reply-To: References: Message-ID: <1040111903.480.276.camel@asterix> Hi Don, On Mon, 2002-12-16 at 19:31, Donald Becker wrote: > > BIP > Magic protocol using custom Myrinet firmware. Grumble: early > performance numbers were not reproducible (I got exactly 50% of tech report > numbers on same hardware and software). This is weird. I was involved in BIP back when I was a student in France, and Loic (BIP's author and Myrinet guru at large) was very careful about publishing real and reproducible numbers. I have myself confirmed these numbers many times. Try to get a hand on 2 recent NICs and test the latest BIP (0.99u) from http://www.ens-lyon.fr/LIP/BIP/. I am sure Loic would help you if you cannot reproduce good numbers. Last time I heard about it, BIP was getting <4 us on L9 (not reliable though. Reliability was planned but never implemented, the curse of all academics projects). > No commercial company is likely to support a communication protocol > unless they can pay for it (and have a hope of it working!) by > bundling it with expensive hardware. I totally agree with you. It would be very hard to generate enough income selling only software to support the manpower needed for this type of development. That's why, when selling a product (hardware + software), the software end up almost all of the time being free. In the case of Scyld, the value is not in the software, it's in the service (not that the software has no value, but people accept to pay for service, not anymore for software). My 2 cents. Patrick -- Patrick Geoffray, Phd Myricom, Inc. 
http://www.myri.com From patrick at myri.com Tue Dec 17 00:06:08 2002 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:02:54 2009 Subject: low-latency high-bandwidth OS bypass user-level messaging forcommodity(linux) clusters with commodity NICs(<$200), HELP!(GAMMA/EMP/M-VIA/etc.) In-Reply-To: <005101c2a5a0$ea603840$3027fea9@jonxp> References: <005101c2a5a0$ea603840$3027fea9@jonxp> Message-ID: <1040112369.480.284.camel@asterix> Jon, On Tue, 2002-12-17 at 02:50, jon wrote: > Perhaps the P4 isn't that great, but it's an Intel 850E chipset, not > sure that it sucks :) > > http://usa.asus.com/mb/socket478/p4t533-c/overview.htm is the > motherboard. > It's a PCI 32/33. Nowadays, you can find plenty of P4-based motherboards with correct PCI 64/66, for almost the same price. > Myrinet even says that you can't expect anymore than 100MB/sec-130MB/sec > out of myrinet on 32-bit 33Mhz with the best chipset. Well, you cannot expect more than 130 MB/s out of a PCI bus 32 bits/33 MHz. As the PCI is on the path... > They also say system dependencies can provide latencies from 6.5us to > 16us. PCI is one of them. > Anyways, ya, I don't think any P4 system is a great myrinet host. The hot machines today are P4-based, using the Serverworks GrandChampion (GC) or Intel E7500 (Plumas) chipsets. They provide PCI 64/66 (even PCI-X) and are available from many vendors. All of the Intel i8x0 chipset are just **** regarding PCI performance. All the machines we bought recently were Supermicro motherboards with Serverworks GC-LE chipsets. Works great. Patrick -- Patrick Geoffray, Phd Myricom, Inc. http://www.myri.com From c00dcw00 at nchc.gov.tw Tue Dec 17 01:29:06 2002 From: c00dcw00 at nchc.gov.tw (David Wan) Date: Wed Nov 25 01:02:54 2009 Subject: 12/17/2002, 17:28 - Need info ! References: <004b01c2a593$48432250$3027fea9@jonxp> Message-ID: <005c01c2a5ae$c0ff8920$673c6e8c@dt.nchc.gov.tw> To whom it may concern/know : As regarding "Myrinet has both a cost and increasing performance advantage over gigabit Ethernet when the switch is larger than about 96 ports" . Can you supply more detailed info ! TKS ! Regards, David C. Wan 12/17/2002, 17:28 ----- Original Message ----- From: "jon" To: "'Donald Becker'" Cc: Sent: Tuesday, December 17, 2002 2:12 PM Subject: RE: low-latency high-bandwidth OS bypass user-level messaging for commodity(linux) clusters with commodity NICs(<$200), HELP! (GAMMA/EMP/M-VIA/etc.) > > -----Original Message----- > > From: Donald Becker [mailto:becker@scyld.com] > > Sent: Monday, December 16, 2002 6:32 PM > > To: jon > > Cc: beowulf@beowulf.org > > Subject: Re: low-latency high-bandwidth OS bypass user-level messaging > for > > commodity(linux) clusters with commodity NICs(<$200), HELP! > (GAMMA/EMP/M-VIA/etc.) > > > > On Mon, 16 Dec 2002, jon wrote: > > > > > Perhaps this isn't the best way to get ahold of you, but I've also > sent > > > this to the Beowulf list. I've noted your comments on OS Bypass > drivers > > > in the past. But isn't there some room for non-TCP/IP related > traffic, > > > such as with computing clusters? We don't need no stinking TCP! No > > > associated revenue? You could replace Myrinet in the thousands of > nodes > > > we have here ALONE at NCSA. > > > > Very likely not. Myrinet has both a cost and increasing performance > > advantage over gigabit Ethernet when the switch is larger than about > 96 > > ports. > > [jon] I see. 
> > > > > > We (UIUC theoretical astrophysics group) are in the midst of > purchasing > > > a $50K cluster (I know, small, but big for us! :)) and I'm done all > the > > > research as to what we should be getting. We ended up going with a > > > Intel Desktop gigabit board and P4, but have found the tests to be > very > > > poor. We only have 4 nodes right now because we worried about this > very > > > thing. > > > > Latency or bandwidth? And what are you using to test? > > [jon] We need (projected 0 message size) latencies to be about 30-40us > for our applications. We end up with message sizes ranging from > 256bytes to 8KB for different applications. A 2.4Ghz P4 cluster has an > idling CPU due to the latency. > > > While there are better Gigabit chips than the DP83820, most of its bad > > reputation comes from the poor performance of the other drivers out > > there. We get quite reasonable performance from it with the Scyld > > ns820.c driver. Others have reported a 2.5-3X performance improvement > > over the driver written by Red Hat. > > [jon] I see, what kinda of 0message latency and peak bandwidth do you > get on 64-bit 66Mhz bus? > > [jon] Thanks! -Jon > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From wrp at alpha0.bioch.virginia.edu Tue Dec 17 05:54:19 2002 From: wrp at alpha0.bioch.virginia.edu (William R. Pearson) Date: Wed Nov 25 01:02:54 2009 Subject: Minimal, fast single node queue software In-Reply-To: <200212170514.gBH5E3G25320@blueraja.scyld.com> (beowulf-request@beowulf.org) References: <200212170514.gBH5E3G25320@blueraja.scyld.com> Message-ID: <200212171354.IAA00524@alpha0.bioch.virginia.edu> You might try disperse: Clifford R, Mackey AJ. Disperse: a simple and efficient approach to parallel database searching. Bioinformatics. 2000 Jun;16(6):564-5. PMID: 10980156 [PubMed - indexed for MEDLINE] It is available from: www.people.virginia.edu/~ajm6q/disperse Bill Pearson From becker at scyld.com Tue Dec 17 07:13:37 2002 From: becker at scyld.com (Donald Becker) Date: Wed Nov 25 01:02:55 2009 Subject: low-latency high-bandwidth OS bypass user-level messaging for commodity(linux) clusters with commodity NICs(<$200), HELP! (GAMMA/EMP/M-VIA/etc.) In-Reply-To: <1040111903.480.276.camel@asterix> Message-ID: On 17 Dec 2002, Patrick Geoffray wrote: > On Mon, 2002-12-16 at 19:31, Donald Becker wrote: > > > BIP > > Magic protocol using custom Myrinet firmware. Grumble: early > > performance numbers were not reproducible (I got exactly 50% of tech report > > numbers on same hardware and software). > > This is weird. I was involved in BIP back when I was a student in > France, and Loic (BIP's author and Myrinet guru at large) was very > careful about publishing real and reproducible numbers. I have myself > confirmed these numbers many times. > Try to get a hand on 2 recent NICs and test the latest BIP (0.99u) from > http://www.ens-lyon.fr/LIP/BIP/. As I said, "early numbers", probably 4-5 years ago. This was a comment that reflected a few days work back then. We had exactly the same Pentium Pro PR440FX motherboards, Myrinet cards and used the same reported kernel version. I'm fairly certain that we created a near-duplicate test environment, and we were trying hard to reproduce, not refute, the numbers. Today it would be much more difficult to reproduce the exact environment. Just take the 2.4 kernel. 
It - modifies the PCI bridge and bus master parameters based on many input variable, - configures and uses the IOAPIC based on BIOS tables, - is usually patched by the distribution, with many modification being to the PCI quirks table, - may be compiled with widely varying GCC verisons 2.96, 2.96, 3.0, 3.2 And unlike 2.2 and earlier kernels, the performance under heavy I/O load can depend heavily on the initial pattern of interrupts to the APIC. I didn't mean for this posting to be about BIP. I do feel it was fair to put in a short note reflecting our experience. > I am sure Loic would help you if you > cannot reproduce good numbers. Last time I heard about it, BIP was > getting <4 us on L9 (not reliable though. Reliability was planned but > never implemented, the curse of all academics projects). Implementing reliability is essential to understanding the effectiveness of the approach. There is a big gap between "put the next packet into this memory location, call it done, and assume everything goes according to plan" and a TCP/IP socket. -- Donald Becker becker@scyld.com Scyld Computing Corporation http://www.scyld.com 410 Severn Ave. Suite 210 Scyld Beowulf cluster system Annapolis MD 21403 410-990-9993 From jcmcknny at uiuc.edu Mon Dec 16 23:50:06 2002 From: jcmcknny at uiuc.edu (jon) Date: Wed Nov 25 01:02:55 2009 Subject: low-latency high-bandwidth OS bypass user-level messaging forcommodity(linux) clusters with commodity NICs(<$200), HELP!(GAMMA/EMP/M-VIA/etc.) In-Reply-To: <1040110709.480.255.camel@asterix> Message-ID: <005101c2a5a0$ea603840$3027fea9@jonxp> Hi, Perhaps the P4 isn't that great, but it's an Intel 850E chipset, not sure that it sucks :) http://usa.asus.com/mb/socket478/p4t533-c/overview.htm is the motherboard. Myrinet even says that you can't expect anymore than 100MB/sec-130MB/sec out of myrinet on 32-bit 33Mhz with the best chipset. http://www.myrinet.com/myrinet/performance/index.html They also say system dependencies can provide latencies from 6.5us to 16us. Anyways, ya, I don't think any P4 system is a great myrinet host. Thanks, Jon > -----Original Message----- > From: Patrick Geoffray [mailto:patrick@myri.com] > Sent: Tuesday, December 17, 2002 1:38 AM > To: jon > Cc: Beowulf cluster mailing list; becker@scyld.com > Subject: Re: low-latency high-bandwidth OS bypass user-level messaging > forcommodity(linux) clusters with commodity NICs(<$200), HELP!(GAMMA/EMP/M-VIA/etc.) > > Hi Jon, > > On Mon, 2002-12-16 at 05:46, jon wrote: > > > Currently with Myrinet on P4 I get 17us latency and 80MB/sec bandwidth, > > Throw your machine to the garbage and buy a P4 with a decent chipset, > IMHO. > > > Why is the OS-bypass so hard? If wanting no TCP support, isn't it > > easier than writing standard linux driver? (like you've done a lot!) > > The last time I talked to Pete Wyckoff (SC02), he was not working on EMP > anymore. It would be interesting to get his view on the problem (Pete > ?). > > My 2 cents. > > Patrick > -- > > Patrick Geoffray, Phd > Myricom, Inc. > http://www.myri.com From jcmcknny at uiuc.edu Tue Dec 17 00:39:29 2002 From: jcmcknny at uiuc.edu (jon) Date: Wed Nov 25 01:02:55 2009 Subject: low-latency high-bandwidth OS bypass user-level messagingforcommodity(linux) clusters with commodity NICs(<$200),HELP!(GAMMA/EMP/M-VIA/etc.) 
In-Reply-To: <1040112369.480.284.camel@asterix> Message-ID: <005201c2a5a7$d3ac6f90$3027fea9@jonxp> Hi, > -----Original Message----- > From: Patrick Geoffray [mailto:patrick@myri.com] > Sent: Tuesday, December 17, 2002 2:06 AM > To: jon > Cc: 'Beowulf cluster mailing list' > Subject: RE: low-latency high-bandwidth OS bypass user-level messagingforcommodity(linux) > clusters with commodity NICs(<$200),HELP!(GAMMA/EMP/M-VIA/etc.) > > Jon, > > On Tue, 2002-12-17 at 02:50, jon wrote: > > Perhaps the P4 isn't that great, but it's an Intel 850E chipset, not > > sure that it sucks :) > > > > http://usa.asus.com/mb/socket478/p4t533-c/overview.htm is the > > motherboard. > > > > It's a PCI 32/33. Nowadays, you can find plenty of P4-based motherboards > with correct PCI 64/66, for almost the same price. [jon] Honestly I couldn't find ANY P4 MB's with 64-bit/66Mhz, let alone PCI-X. If you could point one out, that would be fine :) The Grand Champion's from SuperMicro with 64-bit/66Mhz are all dual-Xeons. There's a single-Xeon with 64-bit/33Mhz, and the rest are P4's with 1 CPU and are all 32-bit 33Mhz. The point is the Intel 845 and Intel 850 are for use with P4, least SuperMicro's, even the E7205 based P4 is 32-bit 33Mhz with the P4. > > Anyways, ya, I don't think any P4 system is a great myrinet host. > > The hot machines today are P4-based, using the Serverworks GrandChampion > (GC) or Intel E7500 (Plumas) chipsets. They provide PCI 64/66 (even > PCI-X) and are available from many vendors. All of the Intel i8x0 > chipset are just **** regarding PCI performance. > > All the machines we bought recently were Supermicro motherboards with > Serverworks GC-LE chipsets. Works great. [jon] And they have P4 Xeons, right? :) Not Pure P4's. [jon] From jeffrey.b.layton at lmco.com Tue Dec 17 03:09:07 2002 From: jeffrey.b.layton at lmco.com (Jeff Layton) Date: Wed Nov 25 01:02:55 2009 Subject: RSH scaling problems... References: Message-ID: <3DFF05D3.654BAA8B@lmco.com> Jesse Becker wrote: > On Sun, 15 Dec 2002, Mike S Galicki wrote: > > > I believe the default pty's in 2.4.20 is 1024, but when I list /dev/pty > > I only see 256 entries. MAKEDEV -m 1024 didn't seem to do anything past > > 256. > > The default number of ptys is 254 in 2.4.x Linux kernels. This is > hardcoded, and you need a kernel recompile if you need more. The way it was explained to me is that the function rcmd(), which is invoked by rsh, attempts to gobble up two ports between 512 and 1024. Simple math: you can only EVER have 256 rshs running on a machine at the same time. It usually is a lot less than this since other programs are gobbling up some of these ports. (Courtesy of Dan Nurmi of Argonne). So, even if you patch the kernel to give you more than 256 ptys, you also need to patch rcmd() to use a wider range of ports (at least in theory). Any comments? Jeff > > > --Jesse > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Jeff Layton Senior Engineer Lockheed-Martin Aeronautical Company - Marietta Aerodynamics & CFD "Is it possible to overclock a cattle prod?" - Irv Mullins This email may contain confidential information. If you have received this email in error, please delete it immediately, and inform me of the mistake by return email. Any form of reproduction, or further dissemination of this email is strictly prohibited. 
Also, please note that opinions expressed in this email are those of the author, and are not necessarily those of the Lockheed-Martin Corporation. From lindahl at keyresearch.com Tue Dec 17 08:42:35 2002 From: lindahl at keyresearch.com (Greg Lindahl) Date: Wed Nov 25 01:02:55 2009 Subject: RSH scaling problems... In-Reply-To: <3DFF05D3.654BAA8B@lmco.com>; from jeffrey.b.layton@lmco.com on Tue, Dec 17, 2002 at 06:09:07AM -0500 References: <3DFF05D3.654BAA8B@lmco.com> Message-ID: <20021217084235.A2795@wumpus.attbi.com> On Tue, Dec 17, 2002 at 06:09:07AM -0500, Jeff Layton wrote: > So, even if you patch the kernel to give you more than 256 ptys, > you also need to patch rcmd() to use a wider range of ports (at least > in theory). > Any comments? rcmd must use low ports, for security reasons. ssh has a mode in which it can use a high port. That's how the 370-node FSL system did it in 1999. This technique also worked OK on Cplant at 1200 nodes. -- greg From patrick at myri.com Tue Dec 17 09:19:55 2002 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:02:55 2009 Subject: low-latency high-bandwidth OS bypass user-level messagingforcommodity(linux) clusters with commodity NICs(<$200),HELP!(GAMMA/EMP/M-VIA/etc.) In-Reply-To: <005201c2a5a7$d3ac6f90$3027fea9@jonxp> References: <005201c2a5a7$d3ac6f90$3027fea9@jonxp> Message-ID: <1040145598.480.15.camel@asterix> On Tue, 2002-12-17 at 03:39, jon wrote: > [jon] And they have P4 Xeons, right? :) Not Pure P4's. Yes, you are absolutely right. I didn't make the distinction between P4 and P4 Xeon. All of the good chipsets are for P4 Xeon, not for P4. It's hard to justify a good PCI for the market targeted by the P4 (not Xeon). Patrick -- Patrick Geoffray, Phd Myricom, Inc. http://www.myri.com From lindahl at keyresearch.com Tue Dec 17 09:36:54 2002 From: lindahl at keyresearch.com (Greg Lindahl) Date: Wed Nov 25 01:02:55 2009 Subject: RSH scaling problems... In-Reply-To: <3DFF5BC4.E62A7801@lmco.com>; from jeffrey.b.layton@lmco.com on Tue, Dec 17, 2002 at 12:15:48PM -0500 References: <3DFF05D3.654BAA8B@lmco.com> <20021217084235.A2795@wumpus.attbi.com> <3DFF5BC4.E62A7801@lmco.com> Message-ID: <20021217093654.B2795@wumpus.attbi.com> > > ssh has a mode in which it can use a high port. That's how the > > 370-node FSL system did it in 1999. This technique also worked OK on > > Cplant at 1200 nodes. > > Greg, > > Our users who have tested ssh have told us that the wall clock > time is much larger than if they use rsh (I'm sorry I don't have any > numbers :) This are MPI jobs running over about 12-24 hours. > Have you seen this? I haven't seen it. If your jobs run that long, just about any method of starting will be efficient enough to not increase the runtime. -- greg From patrick at myri.com Tue Dec 17 09:41:33 2002 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:02:55 2009 Subject: low-latency high-bandwidth OS bypass user-level messaging for commodity(linux) clusters with commodity NICs(<$200), HELP! (GAMMA/EMP/M-VIA/etc.) In-Reply-To: References: Message-ID: <1040146895.497.36.camel@asterix> On Tue, 2002-12-17 at 10:13, Donald Becker wrote: > Today it would be much more difficult to reproduce the exact > environment. Tell me about it :-) It really takes some time to realize the number of parameters that involved in the big equation. Neglect one component and part of the performance goes to the toilettes. This is a fragile environment. 
Still, a 50% difference is unusually large, especially with code that is OS-bypass, does not use interrupts, and where the larger part of the critical path runs on hardware independent of the host.

To come back to the original post: the fact that an OS-bypass/zero-copy/whatever communication layer is often tied to specific hardware is not only a way to secure an effective source of revenue. It's also aimed at making the life of the software developers much easier: having a common, closed hardware environment reduces the dependencies on the host.

One big question when designing a non-TCP layer for Ethernet is how deep to go in trying to exploit the hardware. If you try to be generic, you will quickly realize that the existing driver architecture does a very decent job. If you start to use hardware-specific functionality, you will either lock yourself into a few hardware solutions (scary when you do not control the future of the hardware line) or the amount of work needed to support a large set of GigE NICs at such a low level explodes. Add to that the fact that GigE chipsets have a quite short life cycle and that vendors are reluctant to provide details about their chips, and I understand why there is no such product today.

Patrick
--
Patrick Geoffray, Phd
Myricom, Inc.
http://www.myri.com

From mathog at mendel.bio.caltech.edu Tue Dec 17 10:08:07 2002
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Wed Nov 25 01:02:55 2009
Subject: RSH scaling problems...
Message-ID:

Here's a variant of rsh I've been experimenting with.

ftp://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/rsh.c

It allows multiple target nodes on the command line, or multiple target nodes from a name file, and runs the same command on each node (one after the other, not in parallel). It also has a -z flag which disables all IO. That mode can be used to fire off jobs on remote nodes quickly in those cases where no IO for that job must go through stdin/stdout/stderr.

I expect that the -z mode would not suffer (much) from the port limitation Greg Lindahl points out, since rsh -z doesn't leave any ports open once it starts the remote command, and it typically completes in about .011 seconds (100baseT, RH 7.3 on Athlon 2200). I expect though that mpi probably does use stdout/stderr and maybe stdin, so -z likely won't resolve the problem at hand. Still, it's a handy tool for tasks like:

rsh -zf allnodes.txt killall pvmd3 \; rm -f /tmp/pvm\*

and the like. And it starts jobs at least 2x faster than any other tool I've tried so far.

Regards

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From jeffrey.b.layton at lmco.com Tue Dec 17 09:15:48 2002
From: jeffrey.b.layton at lmco.com (Jeff Layton)
Date: Wed Nov 25 01:02:55 2009
Subject: RSH scaling problems...
References: <3DFF05D3.654BAA8B@lmco.com> <20021217084235.A2795@wumpus.attbi.com>
Message-ID: <3DFF5BC4.E62A7801@lmco.com>

Greg Lindahl wrote:

> On Tue, Dec 17, 2002 at 06:09:07AM -0500, Jeff Layton wrote:
>
> > So, even if you patch the kernel to give you more than 256 ptys,
> > you also need to patch rcmd() to use a wider range of ports (at least
> > in theory).
> > Any comments?
>
> rcmd must use low ports, for security reasons.
>
> ssh has a mode in which it can use a high port. That's how the
> 370-node FSL system did it in 1999. This technique also worked OK on
> Cplant at 1200 nodes.
Greg, Our users who have tested ssh have told us that the wall clock time is much larger than if they use rsh (I'm sorry I don't have any numbers :) This are MPI jobs running over about 12-24 hours. Have you seen this? Thanks! Jeff > > > -- greg > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Jeff Layton Senior Engineer Lockheed-Martin Aeronautical Company - Marietta Aerodynamics & CFD "Is it possible to overclock a cattle prod?" - Irv Mullins This email may contain confidential information. If you have received this email in error, please delete it immediately, and inform me of the mistake by return email. Any form of reproduction, or further dissemination of this email is strictly prohibited. Also, please note that opinions expressed in this email are those of the author, and are not necessarily those of the Lockheed-Martin Corporation. From Daniel.Kidger at quadrics.com Tue Dec 17 09:58:11 2002 From: Daniel.Kidger at quadrics.com (Daniel Kidger) Date: Wed Nov 25 01:02:55 2009 Subject: RSH scaling problems... Message-ID: <010C86D15E4D1247B9A5DD312B7F5AA78DDD95@stegosaurus.bristol.quadrics.com> >> > ssh has a mode in which it can use a high port. That's how the >> > 370-node FSL system did it in 1999. This technique also worked OK on >> > Cplant at 1200 nodes. >> >> Greg, >> >> Our users who have tested ssh have told us that the wall clock >> time is much larger than if they use rsh (I'm sorry I don't have any >> numbers :) This are MPI jobs running over about 12-24 hours. >> Have you seen this? > >I haven't seen it. If your jobs run that long, just about any method >of starting will be efficient enough to not increase the runtime. Dont forget that by default ssh encripts all traffic. If you have a lot of stdout then this will slow things down I guess. There are options to ssh to change or turn off this encription. Yours, Daniel. -------------------------------------------------------------- Dr. Dan Kidger, Quadrics Ltd. daniel.kidger@quadrics.com One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 ----------------------- www.quadrics.com -------------------- From lindahl at keyresearch.com Tue Dec 17 10:33:37 2002 From: lindahl at keyresearch.com (Greg Lindahl) Date: Wed Nov 25 01:02:55 2009 Subject: RSH scaling problems... In-Reply-To: ; from mathog@mendel.bio.caltech.edu on Tue, Dec 17, 2002 at 10:08:07AM -0800 References: Message-ID: <20021217103337.A3112@wumpus.attbi.com> On Tue, Dec 17, 2002 at 10:08:07AM -0800, David Mathog wrote: > I expect that the -z mode would not suffer (much) from the port > limitation Greg Lindahl points out since rsh -z doesn't leave > any ports open once it starts the remote command, and it typically > completes in about .011 seconds (100baseT, RH 7.3 on Athlon 2200). Low ports can't be reused until TIME_WAIT time has passed. greg From becker at scyld.com Tue Dec 17 11:10:29 2002 From: becker at scyld.com (Donald Becker) Date: Wed Nov 25 01:02:55 2009 Subject: low-latency high-bandwidth OS bypass user-level messaging for commodity(linux) clusters with commodity NICs(<$200), HELP! (GAMMA/EMP/M-VIA/etc.) 
In-Reply-To: <1040146895.497.36.camel@asterix> Message-ID: On 17 Dec 2002, Patrick Geoffray wrote: > On Tue, 2002-12-17 at 10:13, Donald Becker wrote: > To come back on the original post, the fact that > Os-bypass/zero-copy/whatever communication layer is often tight to a > specific hardware is not only a way to secure an effective source of > revenue. It's also aimed at make the life of the software developers > much easier: having a common closed hardware environment reduces the > dependencies on the host. I would go further -- the only way OS-bypass is efficient enough to use, especially on SMP machines where the page tables and cache must be consistent, is to exploit special features of the hardware. The SMP issue is a very big one. To convert a standard driver to SMP is figuring out what serialization assumptions have been and can be made. To convert an OS-bypass driver to SMP requires redesigning the structure to something much more complex. > One big question when designing a non-TCP layer for Ethernet is how deep > to go trying to exploit the hardware ? If you try to be generic, you > will quickly realize that the existing driver architecture makes a very > decent job. If you starts to use hardware specific functionalities, you > will either lock yourself in a few hardware solutions (scary when you do > not control the future of the hardware line) or the amount of work > needed to support a large set of GigE NICs at a ssuch low level is > exploding. Add to that the fact that GigE chipsets have a quite short > life cycle and the vendors are reluctant to provide details about their > chips, I understand why there is no such product today. A good example is easy to find: the Intel GigE NIC series has added about a half dozen new PCI IDs in the past year. Not just revision numbers, which we don't track, but the major ID number. Some of those new versions appear to have significant changes in the feature set and in the way they handle small packets to reduce latency. That's a life cycle of about three months, which is shorter than the time to decide a device driver is stable. -- Donald Becker becker@scyld.com Scyld Computing Corporation http://www.scyld.com 410 Severn Ave. Suite 210 Scyld Beowulf cluster system Annapolis MD 21403 410-990-9993 From xyzzy at speakeasy.org Tue Dec 17 11:37:09 2002 From: xyzzy at speakeasy.org (Trent Piepho) Date: Wed Nov 25 01:02:55 2009 Subject: RSH scaling problems... In-Reply-To: <20021217093654.B2795@wumpus.attbi.com> Message-ID: On Tue, 17 Dec 2002, Greg Lindahl wrote: > > Our users who have tested ssh have told us that the wall clock > > time is much larger than if they use rsh (I'm sorry I don't have any > > numbers :) This are MPI jobs running over about 12-24 hours. > > Have you seen this? > > I haven't seen it. If your jobs run that long, just about any method > of starting will be efficient enough to not increase the runtime. I know with LAM-MPI, any console I/O from MPI processes is sent back over the rsh or ssh connection. If you use ssh instead of rsh, that means all your output to stdout and stderr, and any input to stdin, will get encrypted and decrypted. If you have a lot of i/o this way, that could add up to a few cycles. I wish there was some kind of ssh option to not encrypt content for ssh and especially scp. It's one thing to encrypt interactive sessions where you might type in passwords, but it's slow and unnecessary to when copying terabytes of atmospheric data from one machine to another. 
From lindahl at keyresearch.com Tue Dec 17 13:49:18 2002 From: lindahl at keyresearch.com (Greg Lindahl) Date: Wed Nov 25 01:02:55 2009 Subject: RSH scaling problems... In-Reply-To: ; from xyzzy@speakeasy.org on Tue, Dec 17, 2002 at 11:37:09AM -0800 References: <20021217093654.B2795@wumpus.attbi.com> Message-ID: <20021217134918.A1807@wumpus.internal.keyresearch.com> On Tue, Dec 17, 2002 at 11:37:09AM -0800, Trent Piepho wrote: > I wish there was some kind of ssh option to not encrypt content for ssh and > especially scp. There is a switch, but you have to recompile the demon, and last I looked, you can't exert fine-grain control such as "only allow unencrypted data from systems inside the cluster". By the way, stderr/stdout from process 0 generally doesn't travel through ssh, and most programs that print out a lot of output do it from process 0. -- greg From xyzzy at speakeasy.org Tue Dec 17 16:05:54 2002 From: xyzzy at speakeasy.org (Trent Piepho) Date: Wed Nov 25 01:02:55 2009 Subject: RSH scaling problems... In-Reply-To: <20021217134918.A1807@wumpus.internal.keyresearch.com> Message-ID: On Tue, 17 Dec 2002, Greg Lindahl wrote: > On Tue, Dec 17, 2002 at 11:37:09AM -0800, Trent Piepho wrote: > > > I wish there was some kind of ssh option to not encrypt content for ssh and > > especially scp. > > There is a switch, but you have to recompile the demon, and last I > looked, you can't exert fine-grain control such as "only allow > unencrypted data from systems inside the cluster". I thought openssh removed the "none" cipher totally, so you can't even turn it on with a compile time switch. And when you could use it with ssh1, it wasn't really ideal, as passwords are then sent plaintext. I looked into this to do tape backups to a remote tape drive. In order to keep the tape drive streaming, it needs data at a certain minimum rate. ssh with encryption wasn't fast enough, but rsh was. It was less trouble to temporarily enable rsh than to get ssh working without encryption. From jcmcknny at uiuc.edu Tue Dec 17 17:41:46 2002 From: jcmcknny at uiuc.edu (jon) Date: Wed Nov 25 01:02:55 2009 Subject: low-latency high-bandwidth OS bypass user-level messagingforcommodity(linux) clusters with commodity NICs(<$200),HELP!(GAMMA/EMP/M-VIA/etc.) In-Reply-To: Message-ID: <008701c2a636$a06cb9c0$3027fea9@jonxp> Hey! Thanks for the info! I really tried hard looking for such a thing and never found it. I'll take it into consideration. Thanks, Jon > -----Original Message----- > From: Iwao Makino [mailto:iwao@rickey-net.com] > Sent: Tuesday, December 17, 2002 10:16 AM > To: jon > Subject: RE: low-latency high-bandwidth OS bypass user-level messagingforcommodity(linux) > clusters with commodity NICs(<$200),HELP!(GAMMA/EMP/M-VIA/etc.) > > Jon, > > I don't know about SuperMicro, but Tyan has what you might be interested at... > > S2707 > > > There's also single Xeon board out there(CPU more expensive not much point > going this way) > > At 2:39 -0600 on 17.12.2002, jon wrote regarding RE: low-latency > high-bandwidth OS bypass user-level mes: > >Hi, > > > >> -----Original Message----- > >> From: Patrick Geoffray [mailto:patrick@myri.com] > >> Sent: Tuesday, December 17, 2002 2:06 AM > >> To: jon > >> Cc: 'Beowulf cluster mailing list' > >> Subject: RE: low-latency high-bandwidth OS bypass user-level > >messagingforcommodity(linux) > >> clusters with commodity NICs(<$200),HELP!(GAMMA/EMP/M-VIA/etc.) 
> >> > >> Jon, > >> > >> On Tue, 2002-12-17 at 02:50, jon wrote: > >> > Perhaps the P4 isn't that great, but it's an Intel 850E chipset, not > >> > sure that it sucks :) > >> > > >> > http://usa.asus.com/mb/socket478/p4t533-c/overview.htm is the > >> > motherboard. > >> > > >> > >> It's a PCI 32/33. Nowadays, you can find plenty of P4-based > >motherboards > >> with correct PCI 64/66, for almost the same price. > > > >[jon] Honestly I couldn't find ANY P4 MB's with 64-bit/66Mhz, let alone > >PCI-X. > > > >If you could point one out, that would be fine :) > > > >The Grand Champion's from SuperMicro with 64-bit/66Mhz are all > >dual-Xeons. There's a single-Xeon with 64-bit/33Mhz, and the rest are > >P4's with 1 CPU and are all 32-bit 33Mhz. > > > >The point is the Intel 845 and Intel 850 are for use with P4, least > >SuperMicro's, even the E7205 based P4 is 32-bit 33Mhz with the P4. > > > >> > Anyways, ya, I don't think any P4 system is a great myrinet host. > >> > >> The hot machines today are P4-based, using the Serverworks > >GrandChampion > >> (GC) or Intel E7500 (Plumas) chipsets. They provide PCI 64/66 (even > >> PCI-X) and are available from many vendors. All of the Intel i8x0 > >> chipset are just **** regarding PCI performance. > >> > >> All the machines we bought recently were Supermicro motherboards with > >> Serverworks GC-LE chipsets. Works great. > > > >[jon] And they have P4 Xeons, right? :) Not Pure P4's. > >[jon] > > > > > > > > > >_______________________________________________ > >Beowulf mailing list, Beowulf@beowulf.org > >To change your subscription (digest mode or unsubscribe) visit > >http://www.beowulf.org/mailman/listinfo/beowulf From blamere at diversa.com Tue Dec 17 18:43:18 2002 From: blamere at diversa.com (Brian LaMere) Date: Wed Nov 25 01:02:55 2009 Subject: memory leak Message-ID: <81D14648D6BD694CBDB4F45536E81CBC016839@aquarius.diversa.com> Having not gotten very far, I thought I'd ask you all for some advice... I have a memory leak ... somewhere. It *appears* to be an nfs file caching issue. The nodes pick up 750Mb files from an NFS server, and crunch on them. They're only doing this to one at a time. After about a week, they've used up all the memory on the systems. I say all.../almost/ all. Never all. Never is the swap used, and of the 2Gb ram on each node, there's always 16-40Mb free minimum. Problem is that the nodes lose the ability to cache those 750Mb files, and have to then start going out and grabbing it after each and every run. Since a job takes only a couple seconds, having to grab it each time is terribly inefficient. The master (which has 3.25Gb ram) exhibits the same behavior - for whatever reason, almost all of its memory is used up too. Nothing short of (blech) rebooting the systems will clear out the memory. Then, after about a week, they're all full again. The particular file they're caching right now has been the same since Sep24, and absolutely nothing non-data related has changed on the cluster (ie, OS files, modules, settings, whatnot) since the beginning of September. There have been some external changes, but... This problem has only been occurring for about a month. Top doesn't report anything using a lot of memory (a few 0.2's, a few 0.1's, then 0.0's percentage-wise for master), and when sorted by memory usage nothing over 1% is listed. 
"free" doesn't even show an exorbitant amount being used for cacheing, I'm just led to believe its that due to the fact that the memory never gets 100% used (I can increase the load, even when there is only 16M free of memory, and swap won't be touched). Instead of going through and telling everyone all the things I've tried for the last month (would take a long while), I'd rather just see what sort of suggestions people might have. Where does one find a memory hole? Malicious code is, I suppose, theoretically possible. Unfortunately (damn it) I don't have tripwire up, but I'm not terribly sure that would matter. It just doesn't feel like that's the right direction. Suggestions? Thoughts? Advice? I'm open for anything. Rebooting the cluster once a week hurts...have to though, cause it takes down the nfs server otherwise (what with constant requests for 750Mb files). I can't think of anywhere else to look other than where I have already. I've stared at the proc tree so long I'm goin crazy. basic info: Nodes are dual 1ghz p3's, with 2Gb ram and a 18gb local disk. Master is dual 1ghz p3 with 3.25Gb ram and mirrored 36 Gb disk. Running Scyld 28cz4rc3 Running nfs version 3 /proc/sys/vm/bdflush (not changed, at default): 40 0 0 0 500 3000 60 0 0 (hopeful) pre-emptive thanks, Brian LaMere Diversa Corp From sp at scali.com Wed Dec 18 01:13:43 2002 From: sp at scali.com (Steffen Persvold) Date: Wed Nov 25 01:02:55 2009 Subject: low-latency high-bandwidth OS bypass user-level messagingforcommodity(linux) clusters with commodity NICs(<$200),HELP!(GAMMA/EMP/M-VIA/etc.) In-Reply-To: <008701c2a636$a06cb9c0$3027fea9@jonxp> Message-ID: On Tue, 17 Dec 2002, jon wrote: > Hey! Thanks for the info! I really tried hard looking for such a thing > and never found it. > > I'll take it into consideration. > The Tyan S2707 also has a decent onboadrd GbE NIC, the Broadcom BCM5701 chip which uses the tg3 Linux driver (the 2.4.20 kernel has a stable working version). Regards, -- Steffen Persvold | Scali AS mailto:sp@scali.com | http://www.scali.com Tel: (+47) 2262 8950 | Olaf Helsets vei 6 Fax: (+47) 2262 8951 | N0621 Oslo, NORWAY From John.Hearns at cern.ch Wed Dec 18 02:44:40 2002 From: John.Hearns at cern.ch (John Hearns) Date: Wed Nov 25 01:02:55 2009 Subject: memory leak In-Reply-To: <81D14648D6BD694CBDB4F45536E81CBC016839@aquarius.diversa.com> Message-ID: Brian, please forgive me if I am insulting your intelligence. But are you sure that you are not just noticing the disk buffering behaviour of Linux? The Linux kernel will use up spare memory as disk buffers - leading an (apparently) lack of free memory. This is not really the case - as the memory will be released again when needed. (Ahem. Was caught out by this too the first time I saw it...) Anyway, if this isn't the problem, maybe you could send us some of the stats from your system? Maybe use nfsstat? From walkev at presearch.com Wed Dec 18 05:08:49 2002 From: walkev at presearch.com (Vann H. Walke) Date: Wed Nov 25 01:02:55 2009 Subject: memory leak In-Reply-To: References: Message-ID: <1040216929.2334.5.camel@mobwalke.domain> I agree that this seems like the most likely source of your problem. Unfortunately with the current Linux kernels obtaining clear memory usage information is difficult. The only way I know to really test memory usage is to write a simple program that attempts to malloc large areas of memory - If the malloc succeeds, the memory wasn't really being used. 
If you read back through the archives you should find some other messages relating to the problem and perhaps even some test code. Good luck, Vann On Wed, 2002-12-18 at 05:44, John Hearns wrote: > Brian, please forgive me if I am insulting your intelligence. > > But are you sure that you are not just noticing the disk > buffering behaviour of Linux? > The Linux kernel will use up spare memory as disk buffers - > leading an (apparently) lack of free memory. > This is not really the case - as the memory will be released > again when needed. > (Ahem. Was caught out by this too the first time I saw it...) > > > Anyway, if this isn't the problem, maybe you could send > us some of the stats from your system? > Maybe use nfsstat? > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From jeffrey.b.layton at lmco.com Wed Dec 18 03:14:44 2002 From: jeffrey.b.layton at lmco.com (Jeff Layton) Date: Wed Nov 25 01:02:55 2009 Subject: low-latency high-bandwidth OS bypass user-level messaging for commodity(linux) clusters with commodityNICs(<$200), HELP! (GAMMA/EMP/M-VIA/etc.) References: Message-ID: <3E0058A4.A95ABA40@lmco.com> > > > A good example is easy to find: the Intel GigE NIC series has added > about a half dozen new PCI IDs in the past year. Not just revision > numbers, which we don't track, but the major ID number. Some of those > new versions appear to have significant changes in the feature set and > in the way they handle small packets to reduce latency. I'm not sure what NICS these are, but I just found some good numbers for GigE NICs: http://www.plogic.com/ll-gige.html They look pretty darn good! Jeff > > > That's a life cycle of about three months, which is shorter than the > time to decide a device driver is stable. > > -- > Donald Becker becker@scyld.com > Scyld Computing Corporation http://www.scyld.com > 410 Severn Ave. Suite 210 Scyld Beowulf cluster system > Annapolis MD 21403 410-990-9993 > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Jeff Layton Senior Engineer Lockheed-Martin Aeronautical Company - Marietta Aerodynamics & CFD "Is it possible to overclock a cattle prod?" - Irv Mullins This email may contain confidential information. If you have received this email in error, please delete it immediately, and inform me of the mistake by return email. Any form of reproduction, or further dissemination of this email is strictly prohibited. Also, please note that opinions expressed in this email are those of the author, and are not necessarily those of the Lockheed-Martin Corporation. From blamere at diversa.com Wed Dec 18 07:18:21 2002 From: blamere at diversa.com (Brian LaMere) Date: Wed Nov 25 01:02:55 2009 Subject: memory leak Message-ID: <81D14648D6BD694CBDB4F45536E81CBC01683C@aquarius.diversa.com> Yes, I am aware. It was not until a month ago that performance started becoming an issue, however. And it was not until yesterday that the cluster almost crippled the NFS server. The file in particular they were hitting when this occurred has been the same since Sep24. I am fully aware that the memory is still available. The problem is that the buffers are not - and as such, it grabs the file *each and every time*, as I said. 
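(One way to see whether the cache really is being thrown away between runs, rather than just being summarized confusingly by top, is to snapshot the kernel's own counters before and after a job and compare the Buffers and Cached figures. A minimal sketch that pulls the relevant lines out of /proc/meminfo, using the standard 2.4-era field names, values in kB:)

/* meminfo snapshot: prints the same numbers that free(1) summarizes. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[256], key[64];
    unsigned long kb;

    if (f == NULL) {
        perror("/proc/meminfo");
        return 1;
    }
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "%63[^:]: %lu", key, &kb) == 2 &&
            (strcmp(key, "MemTotal") == 0 || strcmp(key, "MemFree") == 0 ||
             strcmp(key, "Buffers")  == 0 || strcmp(key, "Cached")  == 0 ||
             strcmp(key, "SwapFree") == 0))
            printf("%-10s %10lu kB\n", key, kb);
    }
    fclose(f);
    return 0;
}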
If I reboot them, they do not grab the file each and every time. I would love it if the buffers would get released, but they're not. I thought I said this before, however?

Jobs get completed about 2 a minute when the memory is listed as "full." They get completed about 200 a minute when the memory isn't. I'm really not sure how to paint a clearer picture than that. What /should/ occur in theory is not, in fact, occurring. A node can sit there unused for as much as 24 hours, and still exhibit the problem.

When memory is not listed as full, the nodes slam the NFS server for a moment, just long enough to grab whatever flatfile database the current tool is running against. Then there is almost no network traffic at all for hours, esp to the NFS server. This behavior changes after about a week. This behavior never changed before.

I'm repeating myself simply because I obviously wasn't clear before. Yes, I know buffers are held until something else is needed to be buffered, based on a retention policy. The only insult to my intelligence is in my lack of clarity in the description of what is going on.

-----Original Message-----
From: John Hearns [mailto:John.Hearns@cern.ch]
Sent: Wednesday, December 18, 2002 2:45 AM
To: Brian LaMere
Cc: beowulf@beowulf.org
Subject: Re: memory leak

Brian, please forgive me if I am insulting your intelligence.

But are you sure that you are not just noticing the disk buffering behaviour of Linux? The Linux kernel will use up spare memory as disk buffers - leading an (apparently) lack of free memory. This is not really the case - as the memory will be released again when needed. (Ahem. Was caught out by this too the first time I saw it...)

Anyway, if this isn't the problem, maybe you could send us some of the stats from your system? Maybe use nfsstat?

From blamere at diversa.com Wed Dec 18 07:26:04 2002
From: blamere at diversa.com (Brian LaMere)
Date: Wed Nov 25 01:02:55 2009
Subject: memory leak
Message-ID: <81D14648D6BD694CBDB4F45536E81CBC01683D@aquarius.diversa.com>

Problem is that it's not a matter of whether the memory is really being used. It's a matter of what is happening to the file caching. I don't care at all what free says, other than the fact that I wish I knew bdflush better, or such.

The important factor is that this behavior did not occur until a month ago, nothing seemingly important changed around that time, and the only way to fix it is to reboot the cluster. Once rebooted, they perform literally 100 times better. If getting 100 times better performance when you reboot after a week's time sounds normal to this list, then I guess I'll join another :/

For the life of me, I can't tell where else to look. I've got stock (default) settings for the kernel's use of memory. I've dug around for a while now, unable to make heads or tails of anything. I'm merely asking for suggestions as to what to look at, etc. I'm going to set up some ethereal monitoring of the traffic, and verify a few things on that end that seem relatively obvious. Beyond that... ?

Brian

-----Original Message-----
From: Vann H. Walke [mailto:walkev@presearch.com]
Sent: Wednesday, December 18, 2002 5:09 AM
To: John Hearns
Cc: Brian LaMere; beowulf@beowulf.org
Subject: Re: memory leak

I agree that this seems like the most likely source of your problem. Unfortunately with the current Linux kernels obtaining clear memory usage information is difficult.
The only way I know to really test memory usage is to write a simple program that attempts to malloc large areas of memory - If the malloc succeeds, the memory wasn't really being used. If you read back through the archives you should find some other messages relating to the problem and perhaps even some test code. Good luck, Vann On Wed, 2002-12-18 at 05:44, John Hearns wrote: > Brian, please forgive me if I am insulting your intelligence. > > But are you sure that you are not just noticing the disk > buffering behaviour of Linux? > The Linux kernel will use up spare memory as disk buffers - > leading an (apparently) lack of free memory. > This is not really the case - as the memory will be released > again when needed. > (Ahem. Was caught out by this too the first time I saw it...) > > > Anyway, if this isn't the problem, maybe you could send > us some of the stats from your system? > Maybe use nfsstat? > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From eforensics at hotmail.com Wed Dec 18 07:50:11 2002 From: eforensics at hotmail.com (Jason Fuller) Date: Wed Nov 25 01:02:55 2009 Subject: Implementing a Beowulf into the Computer Forensics Process? Message-ID: I am interested in a question posted early in one of our forensics emailing lists. I want to implement a beowulf designed network into the computer forensics examination process. Does anyone currently use a beowulf for examinations and if so, how are you incorporating it into the process. For example, if you are using a windows based app such as Access Data or Guidance Software how are you "merging" the examinations into the beowulf network. Or on the other hand, if you are using @Stake's software (some linux-based app), how could this be implemented. I believe the earlier post was presented from a college student researching this possibility for a research paper. Thanks, Jason Fuller ABI _________________________________________________________________ Protect your PC - get McAfee.com VirusScan Online http://clinic.mcafee.com/clinic/ibuy/campaign.asp?cid=3963 From rgb at phy.duke.edu Wed Dec 18 08:58:15 2002 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:02:55 2009 Subject: memory leak In-Reply-To: <1040216929.2334.5.camel@mobwalke.domain> Message-ID: On 18 Dec 2002, Vann H. Walke wrote: > I agree that this seems like the most likely source of your problem. > Unfortunately with the current Linux kernels obtaining clear memory > usage information is difficult. The only way I know to really test > memory usage is to write a simple program that attempts to malloc large > areas of memory - If the malloc succeeds, the memory wasn't really being > used. If you read back through the archives you should find some other > messages relating to the problem and perhaps even some test code. It is perfectly simple to obtain memory usage in linux, unless I'm either going mad or somehow missed something. The "free" command up above is one way: rgb@ganesh|T:103>free total used free shared buffers cached Mem: 772908 726700 46208 0 141740 243964 -/+ buffers/cache: 340996 431912 Swap: 2104432 59148 2045284 To translate: This system (ganesh) has 768 MB total memory. It is "using" nearly all of it. 
It has a relatively small chunk it is holding "free" to allocate >>quickly<< to running processes that ask for it -- this number floats around a lot as processes start and stop and whack on memory, but is generally at least a few tens of MB on a system with a decent amount of memory relative to the actual memory being used. It has a big chunk (140 MB) allocated to various buffers -- it can get this back if it ever needs it for a malloc, at the cost of course of flushing the buffers and potentially slowing the kernel somewhat. It has 240 MB or so allocated to various cached entities -- files, dynamic libraries, memory pages -- anything and everything it ever uses that it doesn't need right now but MIGHT need in the future is cached if possible. All of this tends to be pretty damn well tuned, so that linux is very efficient and usable interactively even when fairly heavily loaded and doesn't waste memory by leaving it idle when it could be put to work as what amounts to a dynamic and fully automatic ramdisk that you don't ever have to worry about or set up. If all these numbers annoy you, the "-/+ buffers/cache" line tells you the real "summary" story. The system is REALLY using 340 MB. It REALLY has as much as 430 MB available to allocate (without swapping). All of the raw numbers, BTW, are available in /proc/meminfo as well (along with even more "uncooked" information that might be of use if you read enough of the kernel source to learn what they mean). HTH, rgb > > Good luck, > Vann > > On Wed, 2002-12-18 at 05:44, John Hearns wrote: > > Brian, please forgive me if I am insulting your intelligence. > > > > But are you sure that you are not just noticing the disk > > buffering behaviour of Linux? > > The Linux kernel will use up spare memory as disk buffers - > > leading an (apparently) lack of free memory. > > This is not really the case - as the memory will be released > > again when needed. > > (Ahem. Was caught out by this too the first time I saw it...) > > > > > > Anyway, if this isn't the problem, maybe you could send > > us some of the stats from your system? > > Maybe use nfsstat? > > > > > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Wed Dec 18 08:08:32 2002 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:02:55 2009 Subject: RSH scaling problems... In-Reply-To: <3DFF05D3.654BAA8B@lmco.com> Message-ID: On Tue, 17 Dec 2002, Jeff Layton wrote: > Jesse Becker wrote: > > > On Sun, 15 Dec 2002, Mike S Galicki wrote: > > > > > I believe the default pty's in 2.4.20 is 1024, but when I list /dev/pty > > > I only see 256 entries. MAKEDEV -m 1024 didn't seem to do anything past > > > 256. > > > > The default number of ptys is 254 in 2.4.x Linux kernels. This is > > hardcoded, and you need a kernel recompile if you need more. > > The way it was explained to me is that the function rcmd(), which > is invoked by rsh, attempts to gobble up two ports between 512 and > 1024. 
Simple math: you can only EVER have 256 rshs running on a > machine at the same time. It usually is a lot less than this since other > programs are gobbling up some of these ports. (Courtesy of Dan Nurmi > of Argonne). > So, even if you patch the kernel to give you more than 256 ptys, > you also need to patch rcmd() to use a wider range of ports (at least > in theory). > Any comments? My standard comment is that everyone in the computing universe should simply stop using rsh, period, ever, for anything, and start using its nextgen replacement ssh instead. It is difficult to convey in a short note all of the advantages of ssh relative to rsh -- authentication, encryption, port management, resource managment, X forwarding, environment support and more. So read the man page(s) instead. It is marginally more "expensive" than rsh in system resources and latencies associated with making a connection, but we're talking tenths of seconds here, from my direct measurements, and that was some years ago on slower machines AND included the use of bidirectional traffic encryption. On a sandbox cluster LAN, one can of course NOT use encryption in ssh and still realize all its benefits. Many Universities and similar organizations, not being complete fools or insensitive to the security risks associated with easily spoofed, easily snooped protocols like telnet and rcp, have come to more or less require ssh now that RSA patent issues seem to have disappeared, and have turned off all telnet access throughout the organization. The only significant bitch that I have with ssh these days is that the openssh designers viewed a feature of rsh -- the ability to remotely initiate a process and then disconnect, leaving the backgrounded process still running with no tty -- as a "bug", and have made it much more difficult to do this with ssh without the liberal use of .~ to forcably disconnect sessions. A relatively small price to pay, though, for its many features. rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Wed Dec 18 09:39:44 2002 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:02:55 2009 Subject: memory leak In-Reply-To: <81D14648D6BD694CBDB4F45536E81CBC01683C@aquarius.diversa.com> Message-ID: On Wed, 18 Dec 2002, Brian LaMere wrote: > When memory is not listed as full, the nodes slam the NFS server for a > moment, just long enough to grab whatever flatfile database the current tool > is running against. Then there is almost no network traffic at all for > hours, esp to the NFS server. This behavior changes after about a week. > This behavior never changed before. > > I'm repeating myself simply because I obviously wasn't clear before. Yes, I > know buffers are held until something else is needed to be buffered, based > on a retention policy. The only insult to my intelligence is in my lack of > clarity in the description of what is going on. Can you rearrange your program(s) to not use NFS (which sucks in oh, so many ways)? E.g. rsync the DB to the nodes? Have you tried tuning the nfsd, e.g. increasing the number of threads? Have you tried tuning the NFS mounts themselves (rsize, wsize)? Have you considered that the problem could be with file locking -- if you have lots of nodes trying to open and read, but especially to write, to the same file, there could be a all sorts of queues and problems being created with file locking (rpc.lockd). 
Have you tried to resolve this by (perhaps) maintaining several copies of the files in contention and spreading the open/close load around? Have you considered the problems associated with plain old latency -- e.g., suppose that application a on node A opens file X on the server, reads a bunch of stuff from it, and then writes a bit onto the end, and closes it. In the meantime, application b on node B is trying to open it. I >>think<< that NFS is required to flush the modified image through to disk before it can reissue the image to another request (part of its being a "reliable" protocol, so that application b doesn't see the "wrong image" of the file). This can take anything from hundredths of seconds to seconds, depending on file size and server load, so you might not see any problem at all as long as demand is lower than some threshhold and then "suddenly" start seeing it as you start to encounter "collision storms". This used to happen a lot on shared 10 Mbps ethernet, especially thinwire when the lengths were borderline too long and to heavily laden with hosts (so the probability of collisions was relatively high) -- an entire network could be nonlinearly brought to its knees by a single host inching the total network traffic up over a critical level, causing error recoveries and retransmissions to start to pile up with positive feedback (re: "packet storm"). Of course nobody can tell you which of these problems is the critical one in your particular situation, but maybe the list of the above will help you debug it. The key thing to do is to try to learn about the particular subsystem(s) associated with the delays. Sure, maybe it's just "a kernel bug" (and the kernel list may be the right place to seek help:-). OTOH, it could very easily be something that is your "fault" in that you have pushed your network out of the regime where stable operation can ever be realistically expected for your particular task architecture. In that case, you'll both have to debug it yourself (figure out what is failing) and figure out how to re-architect it so that it no longer is a problem. Not easy, actually -- takes a lot of trial and effort and can even end up being something REALLY trivial like a bad network cable or bad switch port so that errors you thought were "broken kernel" or even "broken software" were really "bad hardware" and impossible to EVER fix without replacing the bad hardware. nfsstat, vmstat, cat /proc/stat, plain old stat, netstat, and perhaps tools like wulfstat/xmlsysd (available at www.phy.duke.edu/brahma/xmlsysd.html) are your friends. Try clever experiments. Try to isolate the proximate cause of the problem or the precise conditions where it occurs. HTH, rgb > > > -----Original Message----- > From: John Hearns [mailto:John.Hearns@cern.ch] > Sent: Wednesday, December 18, 2002 2:45 AM > To: Brian LaMere > Cc: beowulf@beowulf.org > Subject: Re: memory leak > > Brian, please forgive me if I am insulting your intelligence. > > But are you sure that you are not just noticing the disk > buffering behaviour of Linux? > The Linux kernel will use up spare memory as disk buffers - > leading an (apparently) lack of free memory. > This is not really the case - as the memory will be released > again when needed. > (Ahem. Was caught out by this too the first time I saw it...) > > > Anyway, if this isn't the problem, maybe you could send > us some of the stats from your system? > Maybe use nfsstat? 
> > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From HJSTEIN at BLOOMBERG.COM Wed Dec 18 09:13:25 2002 From: HJSTEIN at BLOOMBERG.COM (Harvey J. Stein) Date: Wed Nov 25 01:02:55 2009 Subject: memory leak In-Reply-To: "Brian LaMere"'s message of "Wed, 18 Dec 2002 07:18:21 -0800" References: <81D14648D6BD694CBDB4F45536E81CBC01683C@aquarius.diversa.com> Message-ID: "Brian LaMere" writes: > It was not until a month ago that performance started becoming an issue, > however. And it was not until yesterday that the cluster almost crippled > the NFS server. Given that this started a month ago, I'm going to ask an obvious question, which presumably you've already checked, but you didn't mention it in your messages. Has any software or hardware on the machines changed in this period? This includes kernels, libs, apps, configs, network cards, etc. on the NFS server, the cluster machines, the routers/hubs, DNS, etc. Has the mix of jobs or the jobs themselves running on the cluster changed? I'd do "find / -mtime -90" & especially check NFS configs. -- Harvey Stein Bloomberg LP hjstein@bloomberg.com From mathog at mendel.bio.caltech.edu Wed Dec 18 09:26:19 2002 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:02:55 2009 Subject: RSH scaling problems Message-ID: Greg Lindahl wrote: > Low ports can't be reused until TIME_WAIT time has passed. True. Let's see what kind of a limit that imposes on rsh command rates on a typical system - RedHat 7.3 with a few servers and SGE running, over 100baseT. I put 100 copies of a target node's name in a file and then did: time rsh -zf manycopies.txt hostname That blew up. But initially it was because of the default cps setting in xinetd.d for rsh, which picked up the default cps = 25 30 So I added cps = 250 10 to the /etc/xinetd.d/rsh, restarted xinetd, and tried it again, whereupon it completed in 2.196 seconds real time. Running this 3 times quickly failed in the third one, and netstat on the target showed all the ports used up. On the node running rsh netstat showed no TIME_WAIT connections. I think that means the target was closing the connection before the source. After a while (TIME_WAIT, presumably) these all dropped out of netstat and rsh to the target started working again. Then I changed the target file so that it listed 50 copies of target1 and 50 copies of target2. That variation failed in the 6th iteration, further supporting the conjecture that the limit is on the target end. So the rate for outgoing rsh from a given node seems not to be limited (at least by this effect) but the incoming rate to a node is limited. It jams up when about 290 ports are stuck in TIME_WAIT. TIME_WAIT on linux is 60 seconds (I think). So the average sustainable rate of incoming rsh (or rlogin, or rcp) commands is about 290/60, or just less than 5 per second. cps set to 250 is overly optimistic as well, if all rsh come from one source, since the fastest that rsh can send them (my modified version, which basically runs rcmd() in a loop), is only about 50/second. This was over 100baseT, maybe you can go higher with Myrinet. 
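A crude way to watch this limit being hit is to count TIME_WAIT sockets
on the target while firing connections at it from another machine.
This is only a sketch -- node names are placeholders and the exact
numbers will vary from system to system:

    # On the target node, count TCP sockets stuck in TIME_WAIT:
    watch -n 1 'netstat -tn | grep -c TIME_WAIT'

    # On the source node, fire off connections as fast as possible:
    for i in `seq 1 100`; do rsh target-node true & done; wait

    # With roughly 290 usable low ports and a 60 second TIME_WAIT, the
    # sustainable incoming rate is about 290/60, i.e. just under 5/second.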
Which means, I suppose that if you want to fire a lot of commands from one machine to another putting rsh inside a loop is a bad idea. Better to start up one rsh, leave it running, and pipe the commands through it to some target process which runs them on the other end without dropping the connection between commands. ANYWAY, going back to the original post by Mike Galicki, he should check that the xinetd cps value (or equivalent, if it isn't linux) isn't setting the upper limit. Possibly he can get more throughput by raising it. Failing that, perhaps one of the other mpi devices keeps a line open all the time and so bypasses this limit entirely? Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From markus at markus-fischer.de Wed Dec 18 10:37:56 2002 From: markus at markus-fischer.de (Markus Fischer) Date: Wed Nov 25 01:02:55 2009 Subject: low-latency high-bandwidth OS bypass user-level messaging for commodity(linux) clusters with commodityNICs(<$200), HELP! (GAMMA/EMP/M-VIA/etc.) References: <3E0058A4.A95ABA40@lmco.com> Message-ID: <3E00C084.1020607@markus-fischer.de> unfortunately the web site does not mention whether it was streaming or round trip measurements. One can assume the streaming (-s) option, though. The latency is pretty good. Typically only achieved when running in direct mode over GigEth without the stack. Jeff Layton wrote: >>A good example is easy to find: the Intel GigE NIC series has added >>about a half dozen new PCI IDs in the past year. Not just revision >>numbers, which we don't track, but the major ID number. Some of those >>new versions appear to have significant changes in the feature set and >>in the way they handle small packets to reduce latency. >> >> > >I'm not sure what NICS these are, but I just found some good >numbers for GigE NICs: > >http://www.plogic.com/ll-gige.html > >They look pretty darn good! > > >Jeff > > > > >>That's a life cycle of about three months, which is shorter than the >>time to decide a device driver is stable. >> >>-- >>Donald Becker becker@scyld.com >>Scyld Computing Corporation http://www.scyld.com >>410 Severn Ave. Suite 210 Scyld Beowulf cluster system >>Annapolis MD 21403 410-990-9993 >> >>_______________________________________________ >>Beowulf mailing list, Beowulf@beowulf.org >>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> >> > >-- > >Jeff Layton >Senior Engineer >Lockheed-Martin Aeronautical Company - Marietta >Aerodynamics & CFD > >"Is it possible to overclock a cattle prod?" - Irv Mullins > >This email may contain confidential information. If you have received this >email in error, please delete it immediately, and inform me of the mistake by >return email. Any form of reproduction, or further dissemination of this >email is strictly prohibited. Also, please note that opinions expressed in >this email are those of the author, and are not necessarily those of the >Lockheed-Martin Corporation. 
> > > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > From siegert at sfu.ca Wed Dec 18 10:51:20 2002 From: siegert at sfu.ca (Martin Siegert) Date: Wed Nov 25 01:02:55 2009 Subject: memory leak In-Reply-To: <81D14648D6BD694CBDB4F45536E81CBC01683C@aquarius.diversa.com> References: <81D14648D6BD694CBDB4F45536E81CBC01683C@aquarius.diversa.com> Message-ID: <20021218185120.GB9838@stikine.ucs.sfu.ca> On Wed, Dec 18, 2002 at 07:18:21AM -0800, Brian LaMere wrote: > Yes, I am aware. > > It was not until a month ago that performance started becoming an issue, > however. And it was not until yesterday that the cluster almost crippled > the NFS server. > > The file in particular they were hitting when this occurred has been the > same since Sep24. > > I am fully aware that the memory is still available. The problem is that > the buffers are not - and as such, it grabs the file *each and every time*, > as I said. If I reboot them, they do not grab the file each and every time. > > I would love it if the buffers would get released, but they're not. I > thought I said this before, however? Jobs get completed about 2 a minute > when the memory is listed as "full." They get completed about 200 a minute > when the memory isn't. I'm really not sure how to paint a clearer picture > than that. What /should/ occur in theory is not, in fact, occurring. A > node can sit there unused for as much as 24 hours, and still exhibit the > problem. > > When memory is not listed as full, the nodes slam the NFS server for a > moment, just long enough to grab whatever flatfile database the current tool > is running against. Then there is almost no network traffic at all for > hours, esp to the NFS server. This behavior changes after about a week. > This behavior never changed before. > > I'm repeating myself simply because I obviously wasn't clear before. Yes, I > know buffers are held until something else is needed to be buffered, based > on a retention policy. The only insult to my intelligence is in my lack of > clarity in the description of what is going on. We were experiencing a similar symptom (Linux clients crippling a NFS server) with our Netapp filers. In that case the cause was a bug Linux's NFS implementation that leads to a flooding the reassembly queue of the NFS server with the same effect that you were describing: basically no traffic to the NFS server, but the server 100% busy. I better description of that bug can be found on the Linux NFS mailing list: http://marc.theaimsgroup.com/?l=linux-nfs&m=102515480929805&w=2 To my knowledge the bug is fixed in 2.4.20 but present in all previous 2.4.x kernels. For kernels 2.4.x, x < 20, the solution is to switch to NFS over tcp and/or limiting rsize, wsize to 8k (note the RedHat by default uses NFS over udp with 32k rsize, wsize therefore triggering the bug by default). We switched to tcp and the problem disappeared. I do not know whether this has anything to do with your problem, but switching NFS to tcp may be worth a try. 
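For reference, the corresponding client-side mount looks something like
the following in /etc/fstab (the server and mount point names here are
made up, not anything from the original setup):

    # NFS over tcp with 8k rsize/wsize, instead of the RedHat default
    # of udp with 32k rsize/wsize:
    nfsserver:/export/db  /mnt/db  nfs  tcp,rsize=8192,wsize=8192,hard,intr  0 0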
Martin ======================================================================== Martin Siegert Academic Computing Services phone: (604) 291-4691 Simon Fraser University fax: (604) 291-4242 Burnaby, British Columbia email: siegert@sfu.ca Canada V5A 1S6 ======================================================================== From lindahl at keyresearch.com Wed Dec 18 11:47:29 2002 From: lindahl at keyresearch.com (Greg Lindahl) Date: Wed Nov 25 01:02:55 2009 Subject: RSH scaling problems In-Reply-To: ; from mathog@mendel.bio.caltech.edu on Wed, Dec 18, 2002 at 09:26:19AM -0800 References: Message-ID: <20021218114729.B2217@wumpus.internal.keyresearch.com> On Wed, Dec 18, 2002 at 09:26:19AM -0800, David Mathog wrote: > That variation failed in the 6th iteration, > further supporting the conjecture that the limit is on the target > end. But it isn't: the low port limit only applies on the sending side. Sure, the recipient has an inetd or xinetd limit, but in this case we're talking about one host sending rsh sessions to many hosts in a cluster. If you run PBS you do have to worry about rcp incoming to your front-end. -- greg From deadline at plogic.com Wed Dec 18 14:30:34 2002 From: deadline at plogic.com (Douglas Eadline) Date: Wed Nov 25 01:02:55 2009 Subject: low-latency high-bandwidth OS bypass user-level messaging for commodity(linux) clusters with commodityNICs(<$200), HELP! (GAMMA/EMP/M-VIA/etc.) In-Reply-To: <3E00C084.1020607@markus-fischer.de> Message-ID: On Wed, 18 Dec 2002, Markus Fischer wrote: > unfortunately the web site does not mention whether > it was streaming or round trip measurements. good point, fixed that on the page, for your reference: We used NPtcp -t -h receive_node -o netpipe.data -O on the sender and NPtcp -r on the receiver." Doug > > One can assume the streaming (-s) option, though. > The latency is pretty good. Typically only achieved > when running in direct mode over GigEth without > the stack. > > Jeff Layton wrote: > > >>A good example is easy to find: the Intel GigE NIC series has added > >>about a half dozen new PCI IDs in the past year. Not just revision > >>numbers, which we don't track, but the major ID number. Some of those > >>new versions appear to have significant changes in the feature set and > >>in the way they handle small packets to reduce latency. > >> > >> > > > >I'm not sure what NICS these are, but I just found some good > >numbers for GigE NICs: > > > >http://www.plogic.com/ll-gige.html > > > >They look pretty darn good! > > > > > >Jeff > > > > > > > > > >>That's a life cycle of about three months, which is shorter than the > >>time to decide a device driver is stable. > >> > >>-- > >>Donald Becker becker@scyld.com > >>Scyld Computing Corporation http://www.scyld.com > >>410 Severn Ave. Suite 210 Scyld Beowulf cluster system > >>Annapolis MD 21403 410-990-9993 > >> > >>_______________________________________________ > >>Beowulf mailing list, Beowulf@beowulf.org > >>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > >> > >> > > > >-- > > > >Jeff Layton > >Senior Engineer > >Lockheed-Martin Aeronautical Company - Marietta > >Aerodynamics & CFD > > > >"Is it possible to overclock a cattle prod?" - Irv Mullins > > > >This email may contain confidential information. If you have received this > >email in error, please delete it immediately, and inform me of the mistake by > >return email. Any form of reproduction, or further dissemination of this > >email is strictly prohibited. 
Also, please note that opinions expressed in > >this email are those of the author, and are not necessarily those of the > >Lockheed-Martin Corporation. > > > > > > > >_______________________________________________ > >Beowulf mailing list, Beowulf@beowulf.org > >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > > > > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- ------------------------------------------------------------------- Paralogic, Inc. | PEAK | Voice:+610.814.2800 130 Webster Street | PARALLEL | Fax:+610.814.5844 Bethlehem, PA 18015 USA | PERFORMANCE | http://www.plogic.com ------------------------------------------------------------------- From becker at scyld.com Wed Dec 18 12:34:53 2002 From: becker at scyld.com (Donald Becker) Date: Wed Nov 25 01:02:55 2009 Subject: RSH scaling problems... In-Reply-To: <3DFF05D3.654BAA8B@lmco.com> Message-ID: The rate-limiting aspect of using 'rsh' has been covered by other postings, but not the underlying reason... On Tue, 17 Dec 2002, Jeff Layton wrote: > Jesse Becker wrote: > > On Sun, 15 Dec 2002, Mike S Galicki wrote: > > > > > I believe the default pty's in 2.4.20 is 1024, but when I list /dev/pty > > > I only see 256 entries. MAKEDEV -m 1024 didn't seem to do anything past > > > 256. > > > > The default number of ptys is 254 in 2.4.x Linux kernels. This is > > hardcoded, and you need a kernel recompile if you need more. > > The way it was explained to me is that the function rcmd(), which > is invoked by rsh, attempts to gobble up two ports between 512 and > 1024. The key value is IPPORT_RESERVED /usr/include/netinet/in.h: IPPORT_RESERVED = 1024, This a well-known constant. Changing the value is almost impossible. > So, even if you patch the kernel to give you more than 256 ptys, > you also need to patch rcmd() to use a wider range of ports (at least > in theory). It's not nearly as simple as changing the value and recompiling your local kernel. You must also recompile the applications that depend on 1024, and edit those that don't use the symbolic names. But who cares about the local machine -- it's the remote machine that you need to impress with your credentials! So you have to change the whole world, and that's not going to happen. Overall, using 'port < 1024' as a security mechanism is pretty weak. Single-domain clusters are one of the few cases where it _is_ useful. -- Donald Becker becker@scyld.com Scyld Computing Corporation http://www.scyld.com 410 Severn Ave. Suite 210 Scyld Beowulf cluster system Annapolis MD 21403 410-990-9993 From HJSTEIN at bloomberg.com Wed Dec 18 12:48:59 2002 From: HJSTEIN at bloomberg.com (Harvey J. Stein) Date: Wed Nov 25 01:02:55 2009 Subject: memory leak In-Reply-To: "Brian LaMere"'s message of "Wed, 18 Dec 2002 12:29:59 -0800" References: <81D14648D6BD694CBDB4F45536E81CBC016843@aquarius.diversa.com> Message-ID: "Brian LaMere" writes: > I have updated absolutely nothing on the cluster since Sept24, nor > have I made a single parameter change. Nothing is different other > than the files on the nfs server, and possibly some settings on the > nfs server itself (though I can't figure out how any server > settings would cause deterioration over time, instead of relatively > initial issues). 
I don't know either, but given the whole thing's a mystery, I'd put the settings back & see if it has any effect. -- Harvey Stein Bloomberg LP hjstein@bloomberg.com From cjereme at ucla.edu Wed Dec 18 13:18:15 2002 From: cjereme at ucla.edu (cjereme@ucla.edu) Date: Wed Nov 25 01:02:55 2009 Subject: bzImage/kernel comp prob Message-ID: <200212182118.gBILIF905581@webmail.my.ucla.edu> Hello I am currently setting up a beowulf that is running on SuSe 8.1 kernel 2.4.19. Since the kernel seem to have most of the stuff on it, I decided to keep the kernel that came with it and make a bzimage for the compute nodes. I plan to eventually download mknbi later on. But when I execute the ff commands make clean make dep make -j 16 bzImage I get get errors getting an .o file for trap.c ...has anyone come across this problem? I know this is really a kernel question... So it basically would not ocmpile for me. if it will help: I am working on a diskless beowulf and planning to do network boot using netgear fa310tx cards. Thanks, Christine From jbecker at fryed.net Wed Dec 18 13:32:18 2002 From: jbecker at fryed.net (Jesse Becker) Date: Wed Nov 25 01:02:55 2009 Subject: RSH scaling problems... In-Reply-To: References: Message-ID: On Wed, 18 Dec 2002, Robert G. Brown wrote: > My standard comment is that everyone in the computing universe should > simply stop using rsh, period, ever, for anything, and start using its > nextgen replacement ssh instead. No argument there. > It is marginally more "expensive" than rsh in system resources and > latencies associated with making a connection, but we're talking tenths > of seconds here, from my direct measurements, and that was some years > ago on slower machines AND included the use of bidirectional traffic > encryption. On a sandbox cluster LAN, one can of course NOT use > encryption in ssh and still realize all its benefits. It's worth noting two things: 1) SSH supports different ciphers 2) Different ciphers are faster to compute than others. I did some testing on a local system, and come up with some times for transfering a file via SSH using various ciphers. The file in question was generated via this command (which generated 525 megs of data): dd if=/dev/urandom of=./stuff bs=1M count=500 I used all of the ciphers below, repeated each test 10 times, and recorded the results. The systems were otherwise unused. Both boxes are dual Xeon 2.4GHz boxes, 2GB of RAM, and a single UATA 40GB IBM Deskstar 120GXP drive. Both systems run Redhat 7.3, with 2.4.18-5 smp kernels. There is an Intel 1000/PRO 82544GC Gigabit NIC in each box (using the eepro100.c: $Revision: 1.36 $ modules), and the switch is (IIRC) a Netgear GS516T 16 port gigabit switch. Both boxes use the latest offical release of OpenSSH 3.1p1 from Redhat. Cipher Transfer time (sec) Avg.(sec) Rate ------------- ----------------------------- ---- ---- aes128-cbc: 20 18 21 20 21 20 18 20 20 19 ==> 19.7 26.6 aes192-cbc: 22 21 22 22 22 21 22 21 19 20 ==> 21.2 24.7 aes256-cbc: 23 23 22 24 23 23 23 22 23 23 ==> 22.9 22.9 blowfish-cbc: 19 18 19 17 16 19 16 17 18 19 ==> 17.8 29.4 cast128-cbc: 18 20 20 18 19 20 20 20 20 20 ==> 19.5 26.9 arcfour: 19 19 19 17 19 18 16 19 18 18 ==> 18.2 28.8 3des-cbc: 44 46 45 45 46 44 45 43 45 45 ==> 44.8 11.7 I can rerun the test with a 2GB file (to ensure that nothing gets cached) if people are interested. 
Times for the same test using a 2.1GB file: Cipher Trans time (sec) Avg.(sec) Rate ------------- ------------------- ---- ---- aes128-cbc: 81 80 77 74 80 ==> 78.4 27.4 aes192-cbc: 77 80 82 84 79 ==> 80.4 26.7 aes256-cbc: 85 83 90 83 87 ==> 85.6 25.1 blowfish-cbc: 69 70 75 69 70 ==> 70.6 30.4 cast128-cbc: 70 72 76 80 73 ==> 74.2 28.9 arcfour: 64 70 66 71 67 ==> 67.6 31.8 3des-cbc: 170 175 177 172 170 ==> 172.8 12.4 The default order is aes128-cbc, 3des-cbc, blowfish-cbc, cast128-cbc, arcfour, aes192-cbc, aes256-cbc, so you should use one of the faster ciphers if you don't specify anything. I don't have rsh enabled anywhere, but copying the same files over an NFS (v3) mount took about 11.7 seconds for 525MB, and 54 seconds for 2.1GB. These are averages of 5 transfers, and it works out to about 44.8 and 39.8 Mb/sec respectively. > The only significant bitch that I have with ssh these days is that the > openssh designers viewed a feature of rsh -- the ability to remotely > initiate a process and then disconnect, leaving the backgrounded process Have you tried using screen(1)? It works very well with SSH, although it's not as useful for the "fire-n-forget" method of running things. --Jesse From jeffrey.b.layton at lmco.com Wed Dec 18 10:00:11 2002 From: jeffrey.b.layton at lmco.com (Jeff Layton) Date: Wed Nov 25 01:02:55 2009 Subject: Broadcom NIC supports jumbo frames? Message-ID: <3E00B7AB.7891D1EB@lmco.com> Hello, We've got a cluster with GigE NICS with Broadcom chips. ifconfig says, Broadcom BCM5703 and the driver that is loaded is 'bcm5700'. I assume this is the Tigon3 stuff? I looked at the tg3 driver in the 2.4.20 kernel and I see that it supports jumbo frames. I want to make sure this is true. Is anybody using NICs with the same chipset? Are you running jumbo frames? Just to be sure, if we are running jobs on this cluster, ALL of the nodes have to have a MTU of 9000 (I found that wasn't set correctly). Anything thing else I should check? (we're having MPI jobs fail after we switched all of the nodes to jumbo frames). TIA! Jeff -- Jeff Layton Senior Engineer Lockheed-Martin Aeronautical Company - Marietta Aerodynamics & CFD "Is it possible to overclock a cattle prod?" - Irv Mullins This email may contain confidential information. If you have received this email in error, please delete it immediately, and inform me of the mistake by return email. Any form of reproduction, or further dissemination of this email is strictly prohibited. Also, please note that opinions expressed in this email are those of the author, and are not necessarily those of the Lockheed-Martin Corporation. From blamere at diversa.com Wed Dec 18 12:29:59 2002 From: blamere at diversa.com (Brian LaMere) Date: Wed Nov 25 01:02:55 2009 Subject: memory leak Message-ID: <81D14648D6BD694CBDB4F45536E81CBC016843@aquarius.diversa.com> I've been running the same version of Scyld's distribution since September. I know there has to have been a change somewhere, but that's what I'm having such a hard time tracking down. The scripts handling the queuing system are altered on a regular basis, but I can't for the life of me figure out why a perl script would alter file caching - I, like everyone else am aware that the memory isn't really being used. Its purely a buffer thing - has to be. That, or either malicious code that is confusing the memory manager, or a leak (only reason I think leak is because the gradual growth of it). I have updated absolutely nothing on the cluster since Sept24, nor have I made a single parameter change. 
Nothing is different other than the files on the nfs server, and possibly some settings on the nfs server itself (though I can't figure out how any server settings would cause deterioration over time, instead of relatively initial issues). If I had changed anything, it would be really easy for me to point at and figure out, and I wouldn't be frustrated with it. I have changelogs that I've reviewed, On the cluster, there have been no updates, no installs, no changes. Whenever I install something, it is by one of two ways. Either I install from source, or I install from rpm. The last rpm I installed, according to the rpm database, was on September 10. The newest makefile in either of the src directories is from September 9. According to "ls -alrtR|grep Nov" and "ls -altrR|grep Dec" (and again for Oct) run from inside the /etc directory, nothing has changed (sans the things that get changed at reboot, like mtab) since Sept 24. It really just doesn't make sense to me that a perl script, or even updating a perl binary on an nfs server (and then using it on the cluster), could be causing this problem. The queue system has run nearly flawlessly for a year and a half with this equiptment. If that's possible though - how would I determine why the simple perl script is doing it? The queue system pulls jobs from an outside mysql server...it was updated on the 6th (after this problem started). I'm just looking for suggestions as to what to do other than reboot the systems once a week. If perl can. in fact, cause these types of problems (how?) then hey...I can work with that. But it truely seems to be a buffer problem. I can throw 500M into ram without it complaining, and without swap ever being hit. It just won't cache these files anymore, until I reboot the systems. Does that make sense? Vmstat output below. Wiglaf is the master, and it was last rebooted on the 6th. Node 40 and 42 were both rebooted yesterday. Node 42 was taken out of the queue, and is just sitting there 100% idle (control-ish box). Node 40 is currently idle, but is part of the queue. There are no jobs (right now) in the queue. Despite the cache already showing high on 40, its performance is still good....for now. In a week, I'll need to reboot it, along with everyone else. If *I* knew the answer, I wouldn't be asking the question :) I apologize if I seem frustrated - I am. [root@wiglaf etc]# vmstat procs memory swap io system cpu r b w swpd free buff cache si so bi bo in cs us sy id 0 0 0 0 544580 127004 2533160 0 0 0 2 10 2 1 1 19 [root@wiglaf etc]# bpsh 42 vmstat procs memory swap io system cpu r b w swpd free buff cache si so bi bo in cs us sy id 0 0 0 0 2044748 884 5180 0 0 0 0 52 2 0 0 100 [root@wiglaf etc]# bpsh 40 vmstat procs memory swap io system cpu r b w swpd free buff cache si so bi bo in cs us sy id 0 0 0 0 19020 884 2024524 0 0 0 0 111 73 29 3 68 [root@wiglaf etc]# -----Original Message----- From: Harvey J. Stein [mailto:HJSTEIN@bloomberg.com] Sent: Wed 12/18/2002 9:13 AM To: Brian LaMere Cc: beowulf@beowulf.org Subject: Re: memory leak "Brian LaMere" writes: > It was not until a month ago that performance started becoming an issue, > however. And it was not until yesterday that the cluster almost crippled > the NFS server. Given that this started a month ago, I'm going to ask an obvious question, which presumably you've already checked, but you didn't mention it in your messages. Has any software or hardware on the machines changed in this period? This includes kernels, libs, apps, configs, network cards, etc. 
on the NFS server, the cluster machines, the routers/hubs, DNS, etc. Has the mix of jobs or the jobs themselves running on the cluster changed? I'd do "find / -mtime -90" & especially check NFS configs. -- Harvey Stein Bloomberg LP hjstein@bloomberg.com From lindahl at keyresearch.com Wed Dec 18 14:28:30 2002 From: lindahl at keyresearch.com (Greg Lindahl) Date: Wed Nov 25 01:02:55 2009 Subject: Broadcom NIC supports jumbo frames? In-Reply-To: <3E00B7AB.7891D1EB@lmco.com>; from jeffrey.b.layton@lmco.com on Wed, Dec 18, 2002 at 01:00:11PM -0500 References: <3E00B7AB.7891D1EB@lmco.com> Message-ID: <20021218142830.A2764@wumpus.internal.keyresearch.com> On Wed, Dec 18, 2002 at 01:00:11PM -0500, Jeff Layton wrote: > Just to be sure, if we are running jobs on this cluster, ALL of > the nodes have to have a MTU of 9000 (I found that wasn't set > correctly). Anything thing else I should check? (we're having > MPI jobs fail after we switched all of the nodes to jumbo > frames). Try ping with an 8000 byte packet. If both sides don't agree on the MTU, you'll see 100% lossage. greg From blamere at diversa.com Wed Dec 18 13:15:27 2002 From: blamere at diversa.com (Brian LaMere) Date: Wed Nov 25 01:02:55 2009 Subject: memory leak Message-ID: <81D14648D6BD694CBDB4F45536E81CBC016845@aquarius.diversa.com> The NFS server is running the proprietary OS from EMC named "dart" they use on their Celeras (and possibly other things). It had a firmware-ish update during November to the NAS code for fix a user mapping bug, but that's about it. The Celera is a cabinet that does nothing other than nfs and cifs. While I didn't cripple the whole cabinet, I did cripple a datamover inside it (the primary datamover for the filesystems I was accessing). I just checked, and there have been no configuration changes on there in the last couple months Brian -----Original Message----- From: Harvey J. Stein [mailto:HJSTEIN@bloomberg.com] Sent: Wed 12/18/2002 12:48 PM To: Brian LaMere Cc: beowulf@beowulf.org Subject: Re: memory leak "Brian LaMere" writes: > I have updated absolutely nothing on the cluster since Sept24, nor > have I made a single parameter change. Nothing is different other > than the files on the nfs server, and possibly some settings on the > nfs server itself (though I can't figure out how any server > settings would cause deterioration over time, instead of relatively > initial issues). I don't know either, but given the whole thing's a mystery, I'd put the settings back & see if it has any effect. -- Harvey Stein Bloomberg LP hjstein@bloomberg.com From blamere at diversa.com Wed Dec 18 12:37:52 2002 From: blamere at diversa.com (Brian LaMere) Date: Wed Nov 25 01:02:55 2009 Subject: memory leak Message-ID: <81D14648D6BD694CBDB4F45536E81CBC016844@aquarius.diversa.com> for testing purposes, I can put most of the datafiles on the nodes, yes. However, these files are smaller than they were before (I ran into a sorta similar problem in the past when the files became simply too large to be cached). However, they get updated nightly, and for some of the datafiles, they get updated as soon as the workers are finished running the first phase of a job, and they then need access to the same updated datafile to run the second phase. And unfortuantely, there is no real way to pause between very well. Over the course of a day, 60K-120K jobs will be run, using a total of generally 3-5 large (500-900M) datafiles (though there are more that are potential). 
Latency doesn't /seem/ to be an issue, since the queue has worked near flawlessly in its current setup (with only major change being scyld upgrade from previous version). Also, it runs perfectly for a few days after a reboot. Its not until there is a buildup - somewhere - that performance starts taking a hit. Performance also doesn't take a graduate hit, generally. Its normally a fairly sharp change in the performance curve (though it can change due to the varying size of the files being requested). I'm going to be walking around and inspecting cables today, and seeing about testing the routers. -----Original Message----- From: Robert G. Brown [mailto:rgb@phy.duke.edu] Sent: Wed 12/18/2002 9:39 AM To: Brian LaMere Cc: beowulf@beowulf.org Subject: RE: memory leak On Wed, 18 Dec 2002, Brian LaMere wrote: > When memory is not listed as full, the nodes slam the NFS server for a > moment, just long enough to grab whatever flatfile database the current tool > is running against. Then there is almost no network traffic at all for > hours, esp to the NFS server. This behavior changes after about a week. > This behavior never changed before. > > I'm repeating myself simply because I obviously wasn't clear before. Yes, I > know buffers are held until something else is needed to be buffered, based > on a retention policy. The only insult to my intelligence is in my lack of > clarity in the description of what is going on. Can you rearrange your program(s) to not use NFS (which sucks in oh, so many ways)? E.g. rsync the DB to the nodes? Have you tried tuning the nfsd, e.g. increasing the number of threads? Have you tried tuning the NFS mounts themselves (rsize, wsize)? Have you considered that the problem could be with file locking -- if you have lots of nodes trying to open and read, but especially to write, to the same file, there could be a all sorts of queues and problems being created with file locking (rpc.lockd). Have you tried to resolve this by (perhaps) maintaining several copies of the files in contention and spreading the open/close load around? Have you considered the problems associated with plain old latency -- e.g., suppose that application a on node A opens file X on the server, reads a bunch of stuff from it, and then writes a bit onto the end, and closes it. In the meantime, application b on node B is trying to open it. I >>think<< that NFS is required to flush the modified image through to disk before it can reissue the image to another request (part of its being a "reliable" protocol, so that application b doesn't see the "wrong image" of the file). This can take anything from hundredths of seconds to seconds, depending on file size and server load, so you might not see any problem at all as long as demand is lower than some threshhold and then "suddenly" start seeing it as you start to encounter "collision storms". This used to happen a lot on shared 10 Mbps ethernet, especially thinwire when the lengths were borderline too long and to heavily laden with hosts (so the probability of collisions was relatively high) -- an entire network could be nonlinearly brought to its knees by a single host inching the total network traffic up over a critical level, causing error recoveries and retransmissions to start to pile up with positive feedback (re: "packet storm"). Of course nobody can tell you which of these problems is the critical one in your particular situation, but maybe the list of the above will help you debug it. 
The key thing to do is to try to learn about the particular subsystem(s) associated with the delays. Sure, maybe it's just "a kernel bug" (and the kernel list may be the right place to seek help:-). OTOH, it could very easily be something that is your "fault" in that you have pushed your network out of the regime where stable operation can ever be realistically expected for your particular task architecture. In that case, you'll both have to debug it yourself (figure out what is failing) and figure out how to re-architect it so that it no longer is a problem. Not easy, actually -- takes a lot of trial and effort and can even end up being something REALLY trivial like a bad network cable or bad switch port so that errors you thought were "broken kernel" or even "broken software" were really "bad hardware" and impossible to EVER fix without replacing the bad hardware. nfsstat, vmstat, cat /proc/stat, plain old stat, netstat, and perhaps tools like wulfstat/xmlsysd (available at www.phy.duke.edu/brahma/xmlsysd.html) are your friends. Try clever experiments. Try to isolate the proximate cause of the problem or the precise conditions where it occurs. HTH, rgb > > > -----Original Message----- > From: John Hearns [mailto:John.Hearns@cern.ch] > Sent: Wednesday, December 18, 2002 2:45 AM > To: Brian LaMere > Cc: beowulf@beowulf.org > Subject: Re: memory leak > > Brian, please forgive me if I am insulting your intelligence. > > But are you sure that you are not just noticing the disk > buffering behaviour of Linux? > The Linux kernel will use up spare memory as disk buffers - > leading an (apparently) lack of free memory. > This is not really the case - as the memory will be released > again when needed. > (Ahem. Was caught out by this too the first time I saw it...) > > > Anyway, if this isn't the problem, maybe you could send > us some of the stats from your system? > Maybe use nfsstat? > > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From ctierney at hpti.com Wed Dec 18 15:19:14 2002 From: ctierney at hpti.com (Craig Tierney) Date: Wed Nov 25 01:02:55 2009 Subject: Broadcom NIC supports jumbo frames? In-Reply-To: <3E00B7AB.7891D1EB@lmco.com> References: <3E00B7AB.7891D1EB@lmco.com> Message-ID: <20021218231914.GD16024@hpti.com> On Wed, Dec 18, 2002 at 01:00:11PM -0500, Jeff Layton wrote: > Hello, > > We've got a cluster with GigE NICS with Broadcom chips. > ifconfig says, > > Broadcom BCM5703 > > and the driver that is loaded is 'bcm5700'. I assume this is the > Tigon3 stuff? > I looked at the tg3 driver in the 2.4.20 kernel and I see that > it supports jumbo frames. I want to make sure this is true. Is > anybody using NICs with the same chipset? Are you running > jumbo frames? > Just to be sure, if we are running jobs on this cluster, ALL of > the nodes have to have a MTU of 9000 (I found that wasn't set > correctly). Anything thing else I should check? (we're having > MPI jobs fail after we switched all of the nodes to jumbo > frames). So are you currently using the bcm5700 driver or the tg3 driver? The bcm5700 is a bad driver. You should be using the tg3 driver. 
I don't know if it supports Jumbo Frames, but it has got to be better than the bcm driver even without it. Craig > > TIA! > > Jeff > > > -- > > Jeff Layton > Senior Engineer > Lockheed-Martin Aeronautical Company - Marietta > Aerodynamics & CFD > > "Is it possible to overclock a cattle prod?" - Irv Mullins > > This email may contain confidential information. If you have received this > email in error, please delete it immediately, and inform me of the mistake by > return email. Any form of reproduction, or further dissemination of this > email is strictly prohibited. Also, please note that opinions expressed in > this email are those of the author, and are not necessarily those of the > Lockheed-Martin Corporation. > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > -- Craig Tierney (ctierney@hpti.com) From walkev at presearch.com Wed Dec 18 15:40:57 2002 From: walkev at presearch.com (Vann H. Walke) Date: Wed Nov 25 01:02:55 2009 Subject: Broadcom NIC supports jumbo frames? In-Reply-To: <20021218142830.A2764@wumpus.internal.keyresearch.com> References: <3E00B7AB.7891D1EB@lmco.com> <20021218142830.A2764@wumpus.internal.keyresearch.com> Message-ID: <1040254857.2334.61.camel@mobwalke.domain> The broadcom cards should support it. Make sure your switches support jumbo frames as well though. Many don't. Vann On Wed, 2002-12-18 at 17:28, Greg Lindahl wrote: > On Wed, Dec 18, 2002 at 01:00:11PM -0500, Jeff Layton wrote: > > > Just to be sure, if we are running jobs on this cluster, ALL of > > the nodes have to have a MTU of 9000 (I found that wasn't set > > correctly). Anything thing else I should check? (we're having > > MPI jobs fail after we switched all of the nodes to jumbo > > frames). > > Try ping with an 8000 byte packet. If both sides don't agree on the > MTU, you'll see 100% lossage. > > greg > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From laytonjb at bellsouth.net Wed Dec 18 15:07:13 2002 From: laytonjb at bellsouth.net (Jeffrey B. Layton) Date: Wed Nov 25 01:02:55 2009 Subject: Broadcom NIC supports jumbo frames? References: <3E00B7AB.7891D1EB@lmco.com> <20021218142830.A2764@wumpus.internal.keyresearch.com> Message-ID: <3E00FFA1.70301@bellsouth.net> Greg Lindahl wrote: >On Wed, Dec 18, 2002 at 01:00:11PM -0500, Jeff Layton wrote: > > > >> Just to be sure, if we are running jobs on this cluster, ALL of >>the nodes have to have a MTU of 9000 (I found that wasn't set >>correctly). Anything thing else I should check? (we're having >>MPI jobs fail after we switched all of the nodes to jumbo >>frames). >> >> > >Try ping with an 8000 byte packet. If both sides don't agree on the >MTU, you'll see 100% lossage. > Slick. I never thought about one before. I'll try it tomorrow. Thanks! 
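For the record, the check Greg suggests looks something like this
(interface and host names are placeholders):

    # Set the larger MTU on both ends (the switch must pass jumbo frames too):
    ifconfig eth0 mtu 9000

    # Then ping with a payload far bigger than a standard 1500-byte frame;
    # if the other end or the switch can't handle the jumbo MTU, the
    # oversized frames are dropped and you see 100% loss:
    ping -c 3 -s 8000 othernode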
Jeff

>
>greg
>_______________________________________________
>Beowulf mailing list, Beowulf@beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
>
>

From fmuldoo at alpha2.eng.lsu.edu  Mon Dec 23 10:08:53 2002
From: fmuldoo at alpha2.eng.lsu.edu (Frank Muldoon)
Date: Wed Nov 25 01:02:55 2009
Subject: multiple processes on laptop
Message-ID: <3E075135.753A31C@me.lsu.edu>

I have a Fortran 95/MPI parallel program that I run on my laptop for
testing.  If I have the ethernet card plugged into a live cable I can
run 32 jobs on the laptop easily.  If there is no cable or a dead cable
plugged in then my program runs very slow.  It takes for ever for the
32 processes to spawn.  Rebuilding mpich (version 1.2.4) with the
shared memory device does not help.  Anyone know how to fix this
problem?

Thanks,
Frank

--
Frank Muldoon
Computational Fluid Dynamics Research Group
Louisiana State University
Baton Rouge, LA 70803
225-205-6601 (cell phone)
225-578-5217 (w)

From fmuldoo at alpha2.eng.lsu.edu  Thu Dec 19 15:46:52 2002
From: fmuldoo at alpha2.eng.lsu.edu (Frank Muldoon)
Date: Wed Nov 25 01:02:55 2009
Subject: multiple processes on laptop
Message-ID: <3E025A6C.F85E57A@me.lsu.edu>

I have a Fortran 95/MPI parallel program that I run on my laptop for
testing.  If I have the ethernet card plugged into a live cable I can
run 32 jobs on the laptop easily.  If there is no cable or a dead cable
plugged in then my program runs very slow.  It takes for ever for the
32 processes to spawn.  Rebuilding mpich (version 1.2.4) with the
shared memory device does not help.  Anyone know how to fix this
problem?

Thanks,
Frank

--
Frank Muldoon
Computational Fluid Dynamics Research Group
Louisiana State University
Baton Rouge, LA 70803
225-205-6601 (cell phone)
225-578-5217 (w)

From eugen at leitl.org  Fri Dec 20 05:22:25 2002
From: eugen at leitl.org (Eugen Leitl)
Date: Wed Nov 25 01:02:55 2009
Subject: [kraut] Sun goes InfiniBand
Message-ID:

http://heise.de/newsticker/data/ciw-20.12.02-001/

Sun bets on InfiniBand  [20.12.2002 13:03 ]

In the future, Sun[1] intends to ship servers that can be coupled into
larger machines over fast InfiniBand[2] links. Sun is one of the
founding members of the InfiniBand Trade Association[3], which jointly
develops this interconnect technology, initially driven above all by
Intel[4]. Sun plans to make coming versions of its Solaris operating
system InfiniBand-capable and thereby support the communication and
management of servers and storage systems over InfiniBand host
adapters and switches. As a concrete product, Sun announced the
compact blade servers expected in 2004. Later, enterprise servers are
also to be equipped with InfiniBand.

As advantages of the interconnect, Sun cites its scalability from 1x
links with data transfer rates of 2.5 Gbit/s up to 12x multi-links at
30 Gbit/s, short latencies, and direct attachment to the memory of
remote servers. According to Sun, InfiniBand is designed to work
efficiently alongside other technologies such as Ethernet and
FibreChannel.

Originally[5] conceived as a fast, universal I/O technology that was
also meant to replace the established PCI bus, InfiniBand has
meanwhile found its primary application in the external coupling of
computers. Recently, for instance, the US company Paceline introduced
a switch for coupling cluster nodes[6].
Standardized InfiniBand backplanes for blade servers could in the
future allow blades from different manufacturers to be used together
in one server. IBM and Intel[7] are developing powerful blade servers
that are later also supposed to be linked into clusters via
InfiniBand. (ciw[8]/c't)

URL of this article:
  http://www.heise.de/newsticker/data/ciw-20.12.02-001/

Links in this article:
  [1] http://www.sun.com/
  [2] http://www.heise.de/newsticker/data/jk-25.10.00-005/
  [3] http://www.infinibandta.org/
  [4] http://www.fabric-io.com/
  [5] http://www.heise.de/newsticker/data/jk-22.08.00-001/
  [6] http://www.heise.de/newsticker/data/ciw-22.11.02-000/
  [7] http://www.heise.de/newsticker/data/ciw-17.09.02-001/
  [8] mailto:ciw@ct.heise.de

From rgb at phy.duke.edu  Thu Dec 19 07:11:58 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed Nov 25 01:02:55 2009
Subject: memory leak
In-Reply-To: <81D14648D6BD694CBDB4F45536E81CBC016845@aquarius.diversa.com>
Message-ID:

On Wed, 18 Dec 2002, Brian LaMere wrote:

> The NFS server is running the proprietary OS from EMC named "dart" they use
> on their Celeras (and possibly other things).  It had a firmware-ish update
> during November to the NAS code for fix a user mapping bug, but that's about
> it.  The Celera is a cabinet that does nothing other than nfs and cifs.
> While I didn't cripple the whole cabinet, I did cripple a datamover inside
> it (the primary datamover for the filesystems I was accessing).
>
> I just checked, and there have been no configuration changes on there in the
> last couple months

Just as a matter of extreme humorosity, you might try picking a box
with enough disk to hold the file(s) you are serving, copy them over,
export them, and redirect all your nodes to mount from it instead
(which probably wouldn't take as long as it sounds -- it is pretty
trivial to set up an NFS server and push a hacked /etc/fstab to the
nodes).  Run it that way for a day or eight and see if it matters
(problem resurfaces).

BTW, you've just revealed (if I understand you correctly) that you're
using a proprietary OS, closed source, black box NFS server.
Ordinarily, I'd say that this is a bad idea, for precisely the reasons
you are now in trouble:  it may be IMPOSSIBLE for you to positively
determine where your problem lies, unless you are fortunate enough to
find a trivial problem and fix it.  If it is a very slow memory leak
or other "deep bug", or even a disagreement/incompatibility between
your server and the mounting clients, how could >>you<< ever tell?
How could you fix it?  How can you even convince the vendor/mfr that
the bug exists and is their fault so THEY can fix it?  The answer is
pretty universally not without a lot of work and finger pointing on
everybody's part.

One thing I would NO LONGER suggest that you do is take the problem to
the kernel list.  They tend to be a tiny bit intolerant of bug reports
involving proprietary interfaces (hardware, software, peripheral)
because of the obvious difficulties in determining who owns the
problem and where it has to be fixed.  Sometimes they'll listen, but
sometimes they just don't want to waste their time.  Sigh.

It is going to be very difficult to debug this if it isn't (your)
hardware.  With a black box, it will be very difficult to debug if it
IS hardware -- inside the black box.  With a black box, you'll never
debug it if it is a bug in the black box software -- at best you'll be
able to convince yourself that it isn't a problem with your nodes per
se and find a workaround (e.g.
build your own NFS server, which is cheap'n'easy enough, and give your BB server to the poor) or MAYBE convince the company that they own the problem and stimulate a fix. Open source vs closed source, hmmm....;-) rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Thu Dec 19 06:06:58 2002 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:02:55 2009 Subject: RSH scaling problems... In-Reply-To: Message-ID: On Wed, 18 Dec 2002, Jesse Becker wrote: > I did some testing on a local system, and come up with some times for > transfering a file via SSH using various ciphers. The file in question > was generated via this command (which generated 525 megs of data): > dd if=/dev/urandom of=./stuff bs=1M count=500 > > I used all of the ciphers below, repeated each test 10 times, and recorded > the results. The systems were otherwise unused. Both boxes are dual Xeon > 2.4GHz boxes, 2GB of RAM, and a single UATA 40GB IBM Deskstar 120GXP > drive. Both systems run Redhat 7.3, with 2.4.18-5 smp kernels. There is > an Intel 1000/PRO 82544GC Gigabit NIC in each box (using the eepro100.c: > $Revision: 1.36 $ modules), and the switch is (IIRC) a Netgear GS516T 16 > port gigabit switch. > > Both boxes use the latest offical release of OpenSSH 3.1p1 from Redhat. > > Cipher Transfer time (sec) Avg.(sec) Rate > ------------- ----------------------------- ---- ---- > aes128-cbc: 20 18 21 20 21 20 18 20 20 19 ==> 19.7 26.6 > aes192-cbc: 22 21 22 22 22 21 22 21 19 20 ==> 21.2 24.7 > aes256-cbc: 23 23 22 24 23 23 23 22 23 23 ==> 22.9 22.9 > blowfish-cbc: 19 18 19 17 16 19 16 17 18 19 ==> 17.8 29.4 > cast128-cbc: 18 20 20 18 19 20 20 20 20 20 ==> 19.5 26.9 > arcfour: 19 19 19 17 19 18 16 19 18 18 ==> 18.2 28.8 > 3des-cbc: 44 46 45 45 46 44 45 43 45 45 ==> 44.8 11.7 Very useful. In a way, it is a shame that the openssh folks completely eliminated the -c none option. If they made it something that had to be explicitly enabled in ssh[,d]_config and that was officially discouraged, it would have permitted one to authenticate passwords with a proper secure handshake but still use ssh for rsh-comparable file transfers in a cluster sandbox (presumed firewalled and internally secure). Still... > I don't have rsh enabled anywhere, but copying the same files over an NFS > (v3) mount took about 11.7 seconds for 525MB, and 54 seconds for 2.1GB. > These are averages of 5 transfers, and it works out to about 44.8 and 39.8 > Mb/sec respectively. ...so you've got wirespeed limitations of perhaps 100-100 MB/sec, and disk speed limitations that are likely less than that. So the speed penalty is 2-3 in total speed for using ssh to transfer huge files. This isn't particularly terrible for most file transfer applications, unless all you are doing all of the time is file transfer and it is significantly rate-limiting. The ssh latency (time to make and break connections) is more significantly greater -- as much as 10x the time required to make an rsh connection, if I recall correctly from my earlier tests -- but you're talking 10x a very small absolute number, leading to bigger but still small number, negligible in MOST cases (and where it isn't negligible, you probably shouldn't be using rsh OR ssh to manage the connection). 
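Measuring that per-connection overhead for yourself is a one-liner
(the host name here is a placeholder, and the rsh line only works
where rsh is still enabled):

    # Time a no-op remote command over each transport; the difference is
    # essentially the cost of setting up and tearing down the connection.
    time ssh node01 true
    time rsh node01 true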
> > The only significant bitch that I have with ssh these days is that the > > openssh designers viewed a feature of rsh -- the ability to remotely > > initiate a process and then disconnect, leaving the backgrounded process > > Have you tried using screen(1)? It works very well with SSH, although > it's not as useful for the "fire-n-forget" method of running things. I first used screen back in the late 80' or early 90's, IIRC, to give me primitive VT100 "windowing" through a kermit/modem connection. I haven't used it much since I got linux first running at home, since a small stack of xterms and rsh (now ssh) was so much more efficient and easy to rotate and manipulate. I also got addicted to the rsh feature that a named symlink to the binary acted as a de facto command, e.g. ln -s /usr/bin/rsh ganesh ganesh would execute "/usr/bin/rsh ganesh", making it REALLY easy to execute commands on hosts -- create a directory with a list of hostname links on your path, and every hostname becomes a remote command. So I could just type ganesh worklikehell & and make ganesh work like hell in the background;-) Alas, the openssh folks in their wisdom doubly trashed this -- even when I wrote a notty wrapper that explicitly disconnects from the control terminal, forks, and execs to background a process, ssh won't allow you to disconnect until all children of the initiating instance of the shell are dead OR you bop it on the head with interactively typed escaped disconnects, AND they eliminated the perfectly lovely and harmless symlink feature. The first is just plain fascist of the designers and counterproductive to a huge fraction of the work ssh might be put to doing -- in order to work around it I'd have to write an expect script or really hack the source, the latter most risky in a key security program. I haven't gone the expect route, yet, but I may yet. I have written a near perfect script-based wrapper that reproduces the symlink trick -- even now I have a directory on my path with symlinks (to the wrapper script) so that in order to ssh login to ganesh I just type "ganesh". Silly, but those keystrokes really add up when one does a lot of bopping around... -- I suppose I should probably further hack the script to add the expect feature, but alas it is NOT easy to do -- one has to escape once for EACH LEVEL of ssh one works through. If I'm at home, ssh'd into ganesh (my department desktop) and from there ssh'd into a node (not infrequently true, as the nodes aren't router-accessible from outside the physics domain) I have to escape my disconnect TWICE or I'll disconnect from ganesh and not just the node. Working from ganesh directly I have to escape ONCE. Tres suck. I don't know of any variable that maintains the number of ssh layers from a primary login, so I'd likely have to make one, if possible. So although I love ssh relative to rsh and strongly advocate its use, it isn't perfect and these three things, in particular: a "none" cipher (use at your own deliberate risk), the ability to cleanly background jobs and still automatically disconnect/notty at logout and the hell with std[in,out,err] (again at your own risk, of course), and put back the harmless and lovely symlink feature. Sure, nobody but the terminally elderly probably even know that it ever existed or ran MAKEHOSTS in their lives, but we creaky oldsters actually found it useful and who knows, it might one day be useful again... Alas, this is the difference between BSD and linux... 
linux tends to be a lot freer with features and less slavishly attuned to an abstract vision of an operational stack at the expense of functionality, but openssh is a (primarily) bsd project...:-( I suppose one could possibly hack these features into the linux "portability goop" (their term for the OS-specific layer on top of the universally shared operational core), but I suspect not, or at least not easily. The cipher almost certainly not, and quite possibly not the disconnect either. rgb > > --Jesse > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From JAI_RANGI at SDSTATE.EDU Mon Dec 23 08:19:28 2002 From: JAI_RANGI at SDSTATE.EDU (RANGI, JAI) Date: Wed Nov 25 01:02:55 2009 Subject: Beowulf digest, Vol 1 #1147 - 6 msgs In-Reply-To: <200212121701.gBCH1LG25301@blueraja.scyld.com> Message-ID: <000001c2aa9f$115859c0$b282d889@sdstate.edu> I have a Beowulf cluster of 17 nodes with Suse linux 8.0. What is the best and easiest way to upgrade all the nodes to Suse 8.1. Jai Rangi ------------------------------------------------------- In the world with no fences, why would you need Gates ? - Linux ------------------------------------------------------- From John.Hearns at cern.ch Thu Dec 19 00:40:14 2002 From: John.Hearns at cern.ch (John Hearns) Date: Wed Nov 25 01:02:55 2009 Subject: bzImage/kernel comp prob In-Reply-To: <200212182118.gBILIF905581@webmail.my.ucla.edu> Message-ID: On Wed, 18 Dec 2002 cjereme@ucla.edu wrote: > Hello > > I am currently setting up a beowulf that is running on SuSe 8.1 kernel > 2.4.19. Since > make clean > make dep > make -j 16 bzImage This is probably totally unnecessary, but if you want to use the SuSE supplied configuration, I would also make sure you have the SuSE config file in the source directory, and do a 'make oldconfig' also. I don't have a SuSE system to hand, so can't > > I get get errors getting an .o file for trap.c ...has anyone come across > this problem? Christine, can you copy us what the error message is please? From brian.dobbins at yale.edu Wed Dec 18 18:38:03 2002 From: brian.dobbins at yale.edu (Brian Dobbins) Date: Wed Nov 25 01:02:55 2009 Subject: Q: Memory performance - 7501/7505/GC-LE? Message-ID: Hi guys, Does anyone have any information on the memory system performance of the Intel 7501/7505 dual-Xeon (400/533) chipsets versus the Serverworks GC-LE? I've checked a bit through the archives and on the web, but haven't found much yet. Thanks, - Brian From bob at drzyzgula.org Fri Dec 20 05:51:18 2002 From: bob at drzyzgula.org (Bob Drzyzgula) Date: Wed Nov 25 01:02:55 2009 Subject: SMC 8624T 24-port 10/100/1000 switch In-Reply-To: <20021112203744.M8874@www2> References: <20021112150209.L8874@www2> <20021112203744.M8874@www2> Message-ID: <20021220085118.A14760@www2> Follow-up: We recently received some of the SMC 24-port gigabit switches. I cracked one open and it does in fact have Broadcom chips in it. The only visible ones are the transceivers. There are five BCM5404s for twenty of the copper ports, and four BCM5421s shared between the last four copper ports and the four mini-GBIC slots, i.e., one can only have 24 ports active despite the 28 physical ports on the front. 
The switch fabric chips have heat sinks glued onto them, so I can't give the number for those. Interestingly, as a promotional deal we got ten SMC 32-bit PCI gigabit NICs free with each switch. Those are using Altima AC1002 chips; Altima is a Broadcom subsidiary. We've not yet had an opportunity to test putting traffic through the SMC switches, but we've powered one up and our networking guys were happy to find the software to be an IOS clone. --Bob On Tue, Nov 12, 2002 at 08:37:44PM -0500, Bob Drzyzgula wrote: > > ... > > We do have a definate yes jumbo for the SMC, at 9KB in > fact, so this adds further credence to the speculation. > that they are using the Broadcom. > > I think it would be great if people who have these > boxes to pop them open and look at what chips are > inside, and then let us know... > > --Bob From ivan at sixfold.com Mon Dec 23 18:44:08 2002 From: ivan at sixfold.com (Ivan Pulleyn) Date: Wed Nov 25 01:02:55 2009 Subject: multiple processes on laptop In-Reply-To: <3E075135.753A31C@me.lsu.edu> Message-ID: Do the jobs actually run slowly, or just spawn slowly? It might be DNS problems. Assuming a somewhat standard Linux machine, you could change the hosts line in /etc/nsswitch.conf from hosts: files nisplus dns to: hosts: files Add localhost.localdomain and your laptops hostname to /etc/hosts. Try again while offline to see if it's faster. Ivan... On Mon, 23 Dec 2002, Frank Muldoon wrote: > I have a Fortran 95/MPI parallel program that I run on my laptop for > testing. If I have the ethernet card plugged into a live cable I can > run 32 jobs on the laptop easily. If there is no cable or a dead cable > plugged in then my program runs very slow. It takes for ever for the 32 > processes to spawn. Rebuilding mpich (version 1.2.4) with the shared > memory device does not help. Anyone know how to fix this problem? > > Thanks, > Frank > > -- > Frank Muldoon > Computational Fluid Dynamics Research Group > Louisiana State University > Baton Rouge, LA 70803 > 225-205-6601 (cell phone) > 225-578-5217 (w) > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- ---------------------------------------------------------------------- Ivan Pulleyn Sixfold Technologies, LLC Chicago Technology Park 2201 West Campbell Drive Chicago, IL 60612 email: ivan@sixfold.com voice: (866) 324-5460 x601 fax: (312) 421-0388 ---------------------------------------------------------------------- From wrp at alpha0.bioch.virginia.edu Tue Dec 24 09:33:33 2002 From: wrp at alpha0.bioch.virginia.edu (William R.Pearson) Date: Wed Nov 25 01:02:55 2009 Subject: slow MPI on laptop Message-ID: Your problem almost certainly has to do with the lack of a DNS server when your ethernet is not plugged in. I have seen the same problem under OSX. You need to figure out a way to convince your operating system to determine quickly (without timing out) that localhost is on your machine when it is not on a network. I have found that this can be difficult. Bill Pearson From xyzzy at speakeasy.org Tue Dec 24 18:47:05 2002 From: xyzzy at speakeasy.org (Trent Piepho) Date: Wed Nov 25 01:02:55 2009 Subject: RSH scaling problems... In-Reply-To: Message-ID: On Thu, 19 Dec 2002, Robert G. Brown wrote: > ...so you've got wirespeed limitations of perhaps 100-100 MB/sec, and disk > speed limitations that are likely less than that. 
So the speed penalty > is 2-3 in total speed for using ssh to transfer huge files. This isn't > particularly terrible for most file transfer applications, unless all > you are doing all of the time is file transfer and it is significantly > rate-limiting. The test system here was a 2.4 GHz P4 with a 40GB IDE drive. That's pretty low on the ratio of disk speed to CPU speed. We have a 1GHz P3 with a 3ware 7850 card, which probably has 1/3 the CPU speed and 5 times the disk bandwidth. In this case ssh vs rsh speed is more like a factor of 10. > So although I love ssh relative to rsh and strongly advocate its use, it > isn't perfect, and I'd like to see these three things in particular: a "none" cipher It's even more annoying that these features would be easy to add, or even used to exist and were removed. It's just the fascism of the openssh team that keeps them out. > hack these features into the linux "portability goop" (their term for > the OS-specific layer on top of the universally shared operational > core), but I suspect not, or at least not easily. The cipher almost > certainly not, and quite possibly not the disconnect either. What's really needed is for someone, like say Redhat, to maintain a set of patches to ssh to add features that the ssh people don't like. From rgupta at cse.iitkgp.ernet.in Sat Dec 28 01:42:09 2002 From: rgupta at cse.iitkgp.ernet.in (Rakesh Gupta) Date: Wed Nov 25 01:02:55 2009 Subject: Rlogin without password Message-ID: I am having a linux cluster running redhat 7.3 . I want to rlogin into the clients without password. I changed /etc/hosts.equiv , .rhosts and /etc/pam.d/rlogin but still it asks for the password. Can anyone tell me how to go about it ? Regards Rakesh From aleahy at knox.edu Sat Dec 28 03:04:03 2002 From: aleahy at knox.edu (Andrew Leahy) Date: Wed Nov 25 01:02:55 2009 Subject: Rlogin without password References: Message-ID: <3E0D8523.E2D7D068@knox.edu> Rakesh Gupta wrote: > > I am having a linux cluster running redhat 7.3 . I want to rlogin into the > clients without password. I changed /etc/hosts.equiv , .rhosts and > /etc/pam.d/rlogin but still it asks for the password. Can anyone tell me > how to go about it ? Did you try removing the pam_securetty line from /etc/pam.d/{rsh,rlogin,rexec}? Andrew Leahy From misc at lost.co.nz Sat Dec 28 05:06:58 2002 From: misc at lost.co.nz (Leon) Date: Wed Nov 25 01:02:55 2009 Subject: Rlogin without password In-Reply-To: Message-ID: <000901c2ae72$017c3fe0$9300a8c0@flash> I'd really suggest you use the SSH suite of programs instead - I've put together a mini-howto on how to set up SSH to stop having to type in passwords each time, as well as a shell script to implement the 'Symlink-trick'... http://www.lost.co.nz/main/linux/ssh.html Enjoy! -- Leon -- > -----Original Message----- > From: beowulf-admin@beowulf.org > [mailto:beowulf-admin@beowulf.org] On Behalf Of Rakesh Gupta > Sent: Saturday, 28 December 2002 22:42 > To: beowulf@beowulf.org > Subject: Rlogin without password > > > > I am having a linux cluster running redhat 7.3 . I want to > rlogin into the > clients without password. I changed /etc/hosts.equiv , .rhosts and > /etc/pam.d/rlogin but still it asks for the password. Can > anyone tell me > how to go about it ? 
> > Regards > Rakesh > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) > visit http://www.beowulf.org/mailman/listinfo/beowulf > From summers at stsci.edu Mon Dec 30 08:54:05 2002 From: summers at stsci.edu (Frank Summers) Date: Wed Nov 25 01:02:55 2009 Subject: Rlogin without password In-Reply-To: References: Message-ID: <200212301154.05985.summers@stsci.edu> On Saturday 28 December 2002 04:42 am, Rakesh Gupta wrote: > I am having a linux cluster running redhat 7.3 . I want to rlogin into the > clients without password. I changed /etc/hosts.equiv , .rhosts and > /etc/pam.d/rlogin but still it asks for the password. Can anyone tell me > how to go about it ? I have the same setup. To echo the ssh suggestion in a different way, always remember that rsh tools were developed for use on a trusted network. They should only be used on trusted networks, and you must firewall off any other network connections. Here's what my notes say I did: 1) Add rsh and rlogin to the file /etc/securetty. Just add two lines to the end of the file with "rsh" on one and "rlogin" on the other. 2) Edit the xinetd settings for rlogin and rsh. These are the files /etc/xinetd.d/rlogin and /etc/xinetd.d/rsh. Change the "disable" line from "yes" to "no". 3) Add the cluster machines to /etc/hosts.equiv 3A) Make sure that TCP Wrappers doesn't block the cluster machines. In /etc/hosts.allow, they should be listed with a line like "ALL: 192.168.1. localhost", where 192.168.1.XXX is the private network for the cluster. One should also make sure that /etc/hosts.deny has only one line reading "ALL:ALL". 3B) Make sure your firewall won't block these connections from your cluster network. Check /etc/sysconfig/ipchains or /etc/sysconfig/iptables. 4) If you want root rlogin capability (insert usual danger warnings, etc), then you need an rhosts file for root ( /root/.rhosts ) that lists all the cluster machines. 5) You probably need to restart xinetd with "/etc/init.d/xinetd restart". 6) My notes don't list any changes to /etc/pam.d/rsh or /etc/pam.d/rlogin, but I might have missed writing something down. However, RPM reports that these files are the same as installed. 7) Make changes to all cluster machines (obvious, but easy to forget). If I missed something, let me know. Frank From xyzzy at speakeasy.org Mon Dec 30 10:40:17 2002 From: xyzzy at speakeasy.org (Trent Piepho) Date: Wed Nov 25 01:02:55 2009 Subject: Rlogin without password In-Reply-To: <200212301154.05985.summers@stsci.edu> Message-ID: On Mon, 30 Dec 2002, Frank Summers wrote: > 1) Add rsh and rlogin to the file /etc/securetty. Just add two lines to > the end of the file with "rsh" on one and "rlogin" on the other. According to the securetty(5) and login(1) man pages, you're just supposed to list tty devices from /dev, there's nothing about "rsh" or "rlogin" being valid. I added ttyp[0-5], which is somewhat sub-optimal since root won't be allowed to login if the first six pseudo-ttys are already in use, though in practice that hasn't been a problem. Do you know where you found out about adding "rsh" as a tty? That sounds like a much better way to do it if it really works. > > 3A) Make sure that TCP Wrappers doesn't block the cluster machines. > In /etc/hosts.allow, they should be listed with a line like > "ALL: 192.168.1. localhost", where 192.168.1.XXX is the private > network for the cluster. 
One should also make sure that /etc/hosts.deny > has only one line reading "ALL:ALL". Instead of adding ALL to hosts.allow, add two lines like "in.rshd : 192.168.0." and "in.rlogind : 192.168.0." That way you're only opening up rsh and rlogin ports, not ftp, telnet, daytime or what have you. > 4) If you want root rlogin capability (insert usual danger warnings, > etc), then you need an rhosts file for root ( /root/.rhosts ) that > lists all the cluster machines. Also make sure that the .rhosts file is owned by root and not writable by group or other, or it won't work. You can also omit the hosts.equiv step if you only want users with .rhosts to have rsh without password ability. From summers at stsci.edu Mon Dec 30 11:16:47 2002 From: summers at stsci.edu (Frank Summers) Date: Wed Nov 25 01:02:55 2009 Subject: Rlogin without password In-Reply-To: References: Message-ID: <200212301416.47363.summers@stsci.edu> On Monday 30 December 2002 01:40 pm, Trent Piepho wrote: > Do you know where you found > out about adding "rsh" as a tty? That sounds like a much better way to > do it if it really works. Hmmm ... don't remember where I found this. It probably was in the docs for a parallel program I installed. Or perhaps it came from the great wild Google-verse. 'Twas many months ago. My notes only say that I did that step while installing and testing MPICH. It does work for me. Frank From siegert at sfu.ca Mon Dec 30 11:56:14 2002 From: siegert at sfu.ca (Martin Siegert) Date: Wed Nov 25 01:02:55 2009 Subject: Rlogin without password In-Reply-To: References: <200212301154.05985.summers@stsci.edu> Message-ID: <20021230195614.GA22380@stikine.ucs.sfu.ca> On Mon, Dec 30, 2002 at 10:40:17AM -0800, Trent Piepho wrote: > On Mon, 30 Dec 2002, Frank Summers wrote: > > 1) Add rsh and rlogin to the file /etc/securetty. Just add two lines to > > the end of the file with "rsh" on one and "rlogin" on the other. > > According to the securetty(5) and login(1) man pages, you're just supposed to > list tty devices from /dev, there's nothing about "rsh" or "rlogin" being > valid. I added ttyp[0-5], which is somewhat sub-optimal since root won't be > allowed to login if the first six pseudo-ttys are already in use, though in > practice that hasn't been a problem. Do you know where you found out about > adding "rsh" as a tty? That sounds like a much better way to do it if it > really works.
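Pulling the r-services recipe from this thread together in one place, a minimal sketch looks roughly like the following. The 192.168.0. prefix and the node names are placeholders for your own private cluster network, and as noted above none of this is sane unless that network is firewalled off.

  # /etc/hosts.allow -- open only the r-services to the cluster network
  in.rshd :    192.168.0.
  in.rlogind : 192.168.0.

  # /etc/hosts.deny -- deny everything not explicitly allowed
  ALL: ALL

  # /etc/xinetd.d/rsh and /etc/xinetd.d/rlogin: change "disable = yes"
  # to "disable = no", then restart xinetd
  /etc/init.d/xinetd restart

  # root's .rhosts (node1, node2 are placeholder host names); it must be
  # owned by root and not writable by group or other, or it is ignored
  printf 'node1\nnode2\n' > /root/.rhosts
  chown root:root /root/.rhosts
  chmod 600 /root/.rhosts

Restricting hosts.allow to in.rshd and in.rlogind rather than ALL follows the suggestion above of opening only the r-services instead of every xinetd-managed port.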