From Bogdan.Costescu at iwr.uni-heidelberg.de Wed Jul 1 02:35:10 2009 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Parallel Programming Question In-Reply-To: <4A4AA881.7030806@ldeo.columbia.edu> References: <428810f20906232123x4ba721aye1f4c64edec741b0@mail.gmail.com> <4A427CAA.9050304@ldeo.columbia.edu> <4A4AA881.7030806@ldeo.columbia.edu> Message-ID: On Tue, 30 Jun 2009, Gus Correa wrote: > My answers were given in the context of Amjad's original questions Sorry, I missed somehow the context for the questions. Still, the thoughts about I/O programming are general in nature, so they would apply in any case. > Hence, he may want to follow the path of least resistance, rather > than aim at the fanciest programming paradigm. Heh, I have the impression that most scientific software is started like that and only if it's interesting enough (f.e. survives the first generation of PhD student(s) working on/with it) and gets developed further it has some "fancyness" added to it. ;-) > Nevertheless, what if these 1000 jobs are running on the same > cluster, but doing "brute force" I/O through each of their, say, 100 > processes? Wouldn't file and network contention be larger than if > the jobs were funneling I/O through a single processor? The network connection to the NFS file server or some uplinks in an over-subscribed network would impose the limitation - this would be a hard limitation: it doesn't matter if you divide a 10G link between 100 or 100000 down-links, it will not exceed 10G in any case; in extreme cases, the switch might not take the load and start dropping packets. Similar for a NFS file server: it certainly makes a difference if it needs to serve 1 client or 100 simultaneously, but beyond that point it won't matter too much how many there are (well, that was my experience anyway, I'd be interested to hear about a different experience). > Absolutely, but the emphasis I've seen, at least for small clusters > designed for scientific computations in a small department or > research group is to pay less attention to I/O that I had the chance > to know about. When one gets to the design of the filesystems and > I/O the budget is already completely used up to buy a fast > interconnect for MPI. That's a mistake that I have also done. But one can learn from own mistakes or can learn from the mistakes of others. I'm now trying to help others understand that the cluster is not only about CPU or MPI performance, but about the whole, including storage. So, spread the word :-) >> > [ parallel I/O programs ] always cause a problem when the number of >> > processors is big. > Sorry, but I didn't say parallel I/O programs. No, that was me trying to condense your description in a few words to allow for more clipping - I have obviously failed... > The opposite, however, i.e., writing the program expecting the > cluster to provide a parallel file system, is unlikely to scale well > on a cluster without one, or not? I interpret your words (maybe again mistakenly) as a general remark and I can certainly find cases where the statement is false. If you have a well thought-out network design and a NFS file server that can take the load, a good scaling could still be achieved - please note however that I'm not necessarily referring to a Linux-based NFS file server, an "appliance" (f.e. from NetApp or Isilon) could take that role as well although at a price. 
> If you are on a shoestring budget, and your goal is to do parallel > computing, and your applications are not particularly I/O intensive, > what would you prioritize: a fast interconnect for MPI, or hardware > and software for a parallel file system? A balanced approach :-) It's important to have a clear idea of what "not particularly I/O intensive" actually means and how much value the users give for the various tasks that would run on the cluster. > Hopefully courses like yours will improve this. If I could, I would > love to go to Heidelberg and take your class myself! Just to make this clearer: I wasn't teaching myself; based on Uni Heidelberg regulations, I'd need to hold a degree (like Dr.) to be allowed to do teaching. But no such restrictions are present for the practical work, which is actually the part that I find most interesting because it's more "meaty" and a dialogue with students can take place ;-) -- Bogdan Costescu IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany Phone: +49 6221 54 8240, Fax: +49 6221 54 8850 E-mail: bogdan.costescu@iwr.uni-heidelberg.de From ashley at pittman.co.uk Wed Jul 1 09:10:20 2009 From: ashley at pittman.co.uk (Ashley Pittman) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Parallel Programming Question In-Reply-To: <4A45BC7C.6020908@cse.ucdavis.edu> References: <428810f20906232123x4ba721aye1f4c64edec741b0@mail.gmail.com> <4A45BC7C.6020908@cse.ucdavis.edu> Message-ID: <1246464620.12457.52.camel@alpha> On Fri, 2009-06-26 at 23:30 -0700, Bill Broadley wrote: > Keep in mind that when you say broadcast that many (not all) MPI > implementations do not do a true network layer broadcast... and that > in most > situations network uplinks are distinct from the downlinks (except for > the > ACKs). > > If all clients need all input files you can achieve good performance > by either > using a bit torrent approach (send 1/N of the file to each of N > clients then > have them re-share it), or even just a simple chain. Head -> node A > -> node B > -> node C. This works better than you might think since Node A can > start > uploading immediately and the upload bandwidth doesn't compete with > the > download bandwidth (well not much usually). What you are recommending here is for Amjad to re-implement MPI_Broadcast() in his code which is something I would consider a very bad idea. The collectives are a part of MPI for a reason, it's a lot easier for the library to know about your machine than it is for you to know about it, having users re-code parts of the MPI library inside their application is both a waste of programmers time and is also likely to make the application run slower. Yours, Ashley Pittman. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From dnlombar at ichips.intel.com Wed Jul 1 10:24:18 2009 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Parallel Programming Question In-Reply-To: <1246464620.12457.52.camel@alpha> References: <428810f20906232123x4ba721aye1f4c64edec741b0@mail.gmail.com> <4A45BC7C.6020908@cse.ucdavis.edu> <1246464620.12457.52.camel@alpha> Message-ID: <20090701172418.GA15378@nlxdcldnl2.cl.intel.com> On Wed, Jul 01, 2009 at 09:10:20AM -0700, Ashley Pittman wrote: > On Fri, 2009-06-26 at 23:30 -0700, Bill Broadley wrote: > > Keep in mind that when you say broadcast that many (not all) MPI > > implementations do not do a true network layer broadcast... 
and that > > in most > > situations network uplinks are distinct from the downlinks (except for > > the > > ACKs). A network layer broadcast can be iffy; not all switches are created equal. > > If all clients need all input files you can achieve good performance > > by either > > using a bit torrent approach (send 1/N of the file to each of N > > clients then > > have them re-share it), or even just a simple chain. Head -> node A > > -> node B > > -> node C. This works better than you might think since Node A can > > start > > uploading immediately and the upload bandwidth doesn't compete with > > the > > download bandwidth (well not much usually). GIYF. It will find existing implementations of application-level broadcasts and file transfer pipelines. > What you are recommending here is for Amjad to re-implement > MPI_Broadcast() in his code which is something I would consider a very > bad idea. The collectives are a part of MPI for a reason, it's a lot > easier for the library to know about your machine than it is for you to > know about it, having users re-code parts of the MPI library inside > their application is both a waste of programmers time and is also likely > to make the application run slower. Isn't the use model important here? If the file is only needed for the one run, I completely agree, do it directly in your MPI program using collectives. If persistence of the file on the node has value, e.g., for multiple runs, I'd get the file out to all the nodes using some existing package that implements one of the methods Bill described. I wouldn't code one of those using MPI. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From gus at ldeo.columbia.edu Wed Jul 1 12:41:47 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Parallel Programming Question In-Reply-To: <428810f20906302155v1d2b3b55s65f13be8a8964e42@mail.gmail.com> References: <428810f20906232123x4ba721aye1f4c64edec741b0@mail.gmail.com> <4A427CAA.9050304@ldeo.columbia.edu> <4A4AA881.7030806@ldeo.columbia.edu> <428810f20906302155v1d2b3b55s65f13be8a8964e42@mail.gmail.com> Message-ID: <4A4BBBFB.1030607@ldeo.columbia.edu> Hi Amjad, list amjad ali wrote: > Hi, > Gus--thank you. > You are right. I mainly have to run programs on a small cluster (GiGE) > dedicated for my job only; and sometimes I might get some opportunity to > run my code on a shared cluster with hundreds of nodes. > Thanks for telling. My guess was not very far from true. :) BTW, if you are doing cluster control, I/O and MPI all across the same GigE network, and if your nodes have dual GigE ports, you may consider buying a second switch (or using VLAN on your existing switch, if it has it) to deploy a second GigE network only for MPI. In case you haven't done this yet, of course. The cost is very modest, and the performance should improve. OpenMPI and MPICH can select which network they use, leaving the other one for I/O and control. > My parallel CFD application involves (In its simplest form): > 1) reading of input and mesh data from few files by the master process (I/O) > 2) Sending the full/respective data to all other processes (MPI > Broadcast/Scatter) > 3) Share the values at subdomains (created by Metis) boundaries at the > end of each iteration (MPI Message Passing) > 4) On converge, send/Gather the results from all processes to master > process > 5) Writing the results to files by the master process (I/O). 
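In MPI terms, steps 1-2 and 4-5 of that outline map directly onto the collectives Ashley recommends, so there is no need to hand-roll the fan-out and fan-in with point-to-point sends. A minimal Fortran sketch of the pattern (all file names, sizes and variable names below are made up for illustration, not taken from Amjad's code, and it assumes the global array divides evenly among the ranks):

   program funnel_io_sketch
      use mpi
      implicit none
      integer, parameter :: n = 1024            ! hypothetical global array size
      integer :: ierr, rank, nprocs, nlocal
      double precision :: params(10)            ! run parameters read by rank 0
      double precision, allocatable :: global(:), local(:)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      nlocal = n / nprocs
      allocate(local(nlocal))
      if (rank == 0) then
         allocate(global(n))
      else
         allocate(global(1))                    ! dummy; only rank 0 uses it
      end if

      ! Steps 1-2: the master reads, the collectives distribute
      if (rank == 0) then
         open(10, file='input.dat', status='old')
         read(10,*) params
         read(10,*) global
         close(10)
      end if
      call MPI_Bcast(params, 10, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)
      call MPI_Scatter(global, nlocal, MPI_DOUBLE_PRECISION, &
                       local,  nlocal, MPI_DOUBLE_PRECISION, &
                       0, MPI_COMM_WORLD, ierr)

      ! Step 3 would go here: iterate, exchanging subdomain boundary
      ! values with MPI_Sendrecv (or similar) every time step.

      ! Steps 4-5: gather the results and let the master write them
      call MPI_Gather(local,  nlocal, MPI_DOUBLE_PRECISION, &
                      global, nlocal, MPI_DOUBLE_PRECISION, &
                      0, MPI_COMM_WORLD, ierr)
      if (rank == 0) then
         open(20, file='output.dat')
         write(20,*) global
         close(20)
      end if
      call MPI_Finalize(ierr)
   end program funnel_io_sketch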
> > So I think my program is not I/O intensive; so the Funneling I/O through > the master process is sufficient for me. Right? > This scheme doesn't differ much from most atmosphere/ocean/climate models we run here. (They call the programs "models" in this community.) After all, part of the computations here are also CFD-type, although with reduced forms of the Navier-Stokes equation in a rotating planet. (Other computations are not CFD, e.g. radiative transfer.) We tend to output data every 4000 time steps (order of magnitude), and in extreme, rare, cases every 40 time steps or so. There is a lot of computation on each time step, and the usual exchange of boundary values across subdomains using MPI (i.e. we also use domain decomposition). You may have adaptive meshes though, whereas most of our models use fixed grids. For this pattern of work and this ratio of computation-to-communication-to-I/O, the models that work best are those that funnel I/O through the "master" processor. My guess is that this scheme would work OK for you also, since you seem to output data only "on convergence" (to a steady state perhaps?). I presume this takes many time steps, and involves a lot of computation, and a significant amount of MPI communication, right? > But now I have to parallelize a new serial code, which plots the results > while running (online/live display). Means that it shows the plots > of three/four variables (in small windows) while running and we see it > as video (all progress from initial stage to final stage). I assume that > this time much more I/O is involved. At the end of each iteration result > needs to be gathered from all processes to the master process. And > possibly needs to be written in files as well (I am not sure). Do we > need to write it on some file/s for online display, after each > iteration/time-step? > Do you somehow use the movie results to control the program, change its parameters, or its course of action? If you do, then the "real time" feature is really required. Otherwise, you could process the movie offline after the run ends, although this would spoil the fun of seeing it live, no doubt about it. I suppose a separate program shows the movie, right? Short from a more sophisticated solution using MPI-I/O and perhaps relying on a parallel file system, you could dump the subdomain snapshots to the nodes' local disks, then run a separate program to harvest this data, recompose the frames into the global domain, and exhibit the movie. (Using MPI types may help rebuild the frames on the global domain.) If the snapshots are not dumped very often, funneling them through the master processor using MPI_Gather[v], and letting the master processor output the result, would be OK also. Regardless of how you do it, I doubt you need one snapshot for each time step. It should be much less, as you say below. > I think (as serial code will be displaying result after each > iteration/time step), I should display result online after 100 > iterations/time-steps in my parallel version so less "I/O" and/or > "funneling I/O through master process" will be required. > Any opinion/suggestion? > Yes, you may want to decimate the number of snapshots that you dump to file, to avoid I/O at every time step. How many time steps between snapshots? It depends on how fast the algorithm moves the solution, I would guess. It should be an interval short enough to provide smooth transitions from frame to frame, but long enough to avoid too much I/O. 
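One common shape for that decimation (the names nsave and run.nml below are made up for illustration) is to test the time-step counter against an output interval and only gather and write a frame when it divides evenly:

   program output_interval_sketch
      implicit none
      integer :: istep, nsave
      integer, parameter :: nsteps = 4000   ! hypothetical total number of time steps
      namelist /output_ctl/ nsave           ! run.nml would contain: &output_ctl nsave=100 /

      nsave = 100                           ! default if the namelist omits it
      open(30, file='run.nml', status='old')
      read(30, nml=output_ctl)
      close(30)

      do istep = 1, nsteps
         ! ... compute, exchange subdomain boundary values ...
         if (mod(istep, nsave) == 0) then
            ! gather the fields to the master rank and write one frame here
         end if
      end do
   end program output_interval_sketch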
You may want to leave this number as a (Fortran namelist) parameter that you can choose at runtime. Movies are 24 frames per second (at least before the digital era). Jean-Luc Goddard once said: "Photography is truth. Cinema is truth twenty-four times per second." Of course he also said: "Cinema is the most beautiful fraud in the world.". But you don't need to tell your science buddies or your adviser about that ... :) Good luck! Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- > regards, > Amjad Ali. > > > > On Wed, Jul 1, 2009 at 5:06 AM, Gus Correa > wrote: > > Hi Bogdan, list > > Oh, well, this is definitely a peer reviewed list. > My answers were given in the context of Amjad's original > questions, and the perception, based on Amjad's previous > and current postings, that he is not dealing with a large cluster, > or with many users, and plans to both parallelize and update his > code from F77 to F90, which can be quite an undertaking. > Hence, he may want to follow the path of least resistance, > rather than aim at the fanciest programming paradigm. > > In the edited answer that context was stripped off, > and so was the description of > "brute force" I/O in parallel jobs. > That was the form of concurrent I/O I was referring to. > An I/O mode which doesn't take any precautions > to avoid file and network contention, and unfortunately is more common > than clean, well designed, parallel I/O code (at least on the field > I work). > > That was the form of concurrent I/O I was referring to > (all processors try to do I/O at the same time using standard > open/read/write/close commands provided by Fortran or another language, > not MPI calls). > > Bogdan seems to be talking about programs with well designed > parallel I/O instead. > > > Bogdan Costescu wrote: > > On Wed, 24 Jun 2009, Gus Correa wrote: > > the "master" processor reads... broadcasts parameters that > are used by all "slave" processors, and scatters any data > that will be processed in a distributed fashion by each > "slave" processor. > ... > That always works, there is no file system contention. > > > I beg to disagree. There is no file system contention if this > job is the only one doing the I/O at that time, which could be > the case if a job takes the whole cluster. However, in a more > conventional setup with several jobs running at the same time, > there is I/O done from several nodes (running the MPI rank 0 of > each job) at the same time, which will still look like mostly > random I/O to the storage. > > > Indeed, if there are 1000 jobs running, > even if each one is funneling I/O through > the "master" processor, there will be a large number of competing > requests to the I/O system, hence contention. > However, contention would also happen if all jobs were serial. > Hence, this is not a problem caused by or specific from parallel jobs. > It is an intrinsic limitation of the I/O system. > > Nevertheless, what if these 1000 jobs are running on the same cluster, > but doing "brute force" I/O through > each of their, say, 100 processes? > Wouldn't file and network contention be larger than if the jobs were > funneling I/O through a single processor? > > That is the context in which I made my statement. 
> Funneling I/O through a "master" processor reduces the chances of file > contention because it minimizes the number of processes doing I/O, > or not? > > > Another drawback is that you need to write more code for the > I/O procedure. > > > I also disagree here. The code doing I/O would need to only > happen on MPI rank 0, so no need to think for the other ranks > about race conditions, computing a rank-based position in the > file, etc. > > > >From what you wrote, > you seem to agree with me on this point, not disagree. > > 1) Brute force I/O through all ranks takes little programming effort, > the code is basically the same serial, > and tends to trigger file contention (and often breaks NFS, etc). > 2) Funneling I/O through the master node takes a moderate programming > effort. One needs to gather/scatter data through the "master" > processor, which concentrates the I/O, and reduces contention. > 3) Correct and cautious parallel I/O across all ranks takes a larger > programming effort, > due to the considerations you pointed out above. > > > In addition, MPI is in control of everything, you are less > dependent on NFS quirks. > > > ... or cluster design. I have seen several clusters which were > designed with 2 networks, a HPC one (Myrinet or Infiniband) and > GigE, where the HPC network had full bisection bandwidth, but > the GigE was a heavily over-subscribed one as the design really > thought only about MPI performance and not about I/O > performance. In such an environment, it's rather useless to try > to do I/O simultaneously from several nodes which share the same > uplink, independent whether the storage is a single NFS server > or a parallel FS. Doing I/O from only one node would allow full > utilization of the bandwidth on the chain of uplinks to the > file-server and the data could then be scattered/gathered fast > through the HPC network. Sure, a more hardware-aware application > could have been more efficient (f.e. if it would be possible to > describe the network over-subscription so that as many uplinks > could be used simultaneously as possible), but a more balanced > cluster design would have been even better... > > > Absolutely, but the emphasis I've seen, at least for small clusters > designed for scientific computations in a small department or > research group is to pay less attention to I/O that I had the chance > to know about. > When one gets to the design of the filesystems and I/O the budget is > already completely used up to buy a fast interconnect for MPI. > I/O is then done over Gigabit Ethernet using a single NFS > file server (often times a RAID on the head node itself). > For the scale of a small cluster, with a few tens of nodes or so, > this may work OK, > as long as one writes code that is gentle with NFS > (e.g. by funneling I/O through the head node). > > Obviously the large clusters on our national labs and computer centers > do take into consideration I/O requirements, parallel file systems, > etc. However, that is not my reality here, and I would guess it is > not Amjad's situation either. > > > [ parallel I/O programs ] always cause a problem when the > number of processors is big. > > > > Sorry, but I didn't say parallel I/O programs. > I said brute force I/O by all processors (using standard NFS, > no parallel file system, all processors banging on the file system > with no coordination). > > > I'd also like to disagree here. 
Parallel file systems teach us > that a scalable system is one where the operations are split > between several units that do the work. Applying the same > knowledge to the generation of the data, a scalable application > is one for which the I/O operations are done as much as possible > split between the ranks. > > > Yes. > If you have a parallel file system. > > > > IMHO, the "problem" that you see is actually caused by reaching > the limits of your cluster, IOW this is a local problem of that > particular cluster and not a problem in the application. By > re-writing the application to make it more NFS-friendly (f.e. > like the above "rank 0 does all I/O"), you will most likely kill > scalability for another HPC setup with a distributed/parallel > storage setup. > > Yes, that is true, but may only be critical if the program is I/O > intensive (ours are not). > One may still fare well with funneling I/O through one or a few > processors, if the program is not I/O intensive, > and not compromise scalability. > > The opposite, however, i.e., > writing the program expecting the cluster to > provide a parallel file system, > is unlikely to scale well on a cluster > without one, or not? > > > Often times these codes were developed on big iron machines, > ignoring the hurdles one has to face on a Beowulf. > > > Well, the definition of Beowulf is quite fluid. Nowadays is > sufficiently easy to get a parallel FS running with commodity > hardware that I wouldn't associate it anymore with big iron. > > > That is true, but very budget dependent. > If you are on a shoestring budget, and your goal is to do parallel > computing, and your applications are not particularly I/O intensive, > what would you prioritize: a fast interconnect for MPI, > or hardware and software for a parallel file system? > > > In general they don't use MPI parallel I/O either > > > Being on the teaching side in a recent course+practical work > involving parallel I/O, I've seen computer science and physics > students making quite easily the transition from POSIX I/O done > on a shared file system to MPI-I/O. They get sometimes an index > wrong, but mostly the conversion is painless. After that, my > impression has become that it's mostly lazyness and the attitude > 'POSIX is everywhere anywhere, why should I bother with > something that might be missing' that keeps applications at this > stage. > > > I agree with your considerations about laziness and the POSIX-inertia. > However, there is still a long way to make programs and programmers > at least consider the restrictions imposed by network and file systems, > not to mention to use proper parallel I/O. > Hopefully courses like yours will improve this. > If I could, I would love to go to Heidelberg and take your class myself! 
> > Regards, > Gus Correa > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > From gus at ldeo.columbia.edu Wed Jul 1 14:43:04 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Parallel Programming Question In-Reply-To: References: <428810f20906232123x4ba721aye1f4c64edec741b0@mail.gmail.com> <4A427CAA.9050304@ldeo.columbia.edu> <4A4AA881.7030806@ldeo.columbia.edu> Message-ID: <4A4BD868.5060502@ldeo.columbia.edu> Hi Bogdan, list Bogdan Costescu wrote: > On Tue, 30 Jun 2009, Gus Correa wrote: > >> My answers were given in the context of Amjad's original questions > > Sorry, I missed somehow the context for the questions. Still, the > thoughts about I/O programming are general in nature, so they would > apply in any case. > >> Hence, he may want to follow the path of least resistance, rather than >> aim at the fanciest programming paradigm. > > Heh, I have the impression that most scientific software is started like > that and only if it's interesting enough (f.e. survives the first > generation of PhD student(s) working on/with it) and gets developed > further it has some "fancyness" added to it. ;-) > I can only say something about the codes I know. A bunch of atmosphere, ocean, and climate models have been quite resilient. Not static, they evolved, both in the science and in the programming side, but they kept some basic characteristics, mostly the central algorithms that are used. In some cases the addition of programming "fanciness" was a leap forward: e.g. encapsulating MPI communication in modules and libraries. In others cases not so much: e.g. transitioning from F77 to F90 only to add types for everything (and types of types of types ...), and 10-level overload operators to do trivial things, which did little but to slow down some codes and make them hard to adapt and maintain. >> Nevertheless, what if these 1000 jobs are running on the same cluster, >> but doing "brute force" I/O through each of their, say, 100 processes? >> Wouldn't file and network contention be larger than if the jobs were >> funneling I/O through a single processor? > > The network connection to the NFS file server or some uplinks in an > over-subscribed network would impose the limitation - this would be a > hard limitation: it doesn't matter if you divide a 10G link between 100 > or 100000 down-links, it will not exceed 10G in any case; in extreme > cases, the switch might not take the load and start dropping packets. > Similar for a NFS file server: it certainly makes a difference if it > needs to serve 1 client or 100 simultaneously, but beyond that point it > won't matter too much how many there are (well, that was my experience > anyway, I'd be interested to hear about a different experience). > We are on the low end of small clusters here. Our cluster networks are small single switch type so far. >> Absolutely, but the emphasis I've seen, at least for small clusters >> designed for scientific computations in a small department or research >> group is to pay less attention to I/O that I had the chance to know >> about. When one gets to the design of the filesystems and I/O the >> budget is already completely used up to buy a fast interconnect for MPI. > > That's a mistake that I have also done. 
But one can learn from own > mistakes or can learn from the mistakes of others. I'm now trying to > help others understand that the cluster is not only about CPU or MPI > performance, but about the whole, including storage. So, spread the word > :-) > I advised the purchase of two clusters here, one of which I administer. In both cases the recommendation to buy equipment to support a parallel file system was dropped based on budget constraints. This may explain my bias toward the "funnel all I/O through the master processor" paradigm. I do recognize the need for parallel file systems, and the appropriate use of MPI-I/O (and derivatives of it like parallel HDF5, parallel NetCDF, etc) to explore this capability and avoid I/O bottlenecks. >>> > [ parallel I/O programs ] always cause a problem when the number >>> of > processors is big. >> Sorry, but I didn't say parallel I/O programs. > > No, that was me trying to condense your description in a few words to > allow for more clipping - I have obviously failed... > >> The opposite, however, i.e., writing the program expecting the cluster >> to provide a parallel file system, is unlikely to scale well on a >> cluster without one, or not? > > I interpret your words (maybe again mistakenly) as a general remark and > I can certainly find cases where the statement is false. If you have a > well thought-out network design and a NFS file server that can take the > load, a good scaling could still be achieved - please note however that > I'm not necessarily referring to a Linux-based NFS file server, an > "appliance" (f.e. from NetApp or Isilon) could take that role as well > although at a price. I am sure you can find counter examples. However, for the mainstream barebones NFS/Ethernet file server that many small clusters use, the safest (or less risky) thing to do is to funnel I/O through the master node on a non-I/O-intensive parallel program. Or use local disks, if the I/O is heavy. It is a poor man's approach, admittedly, but an OK one, as it tries to dodge the system's bottlenecks as much as it can. However, it doesn't propose a solution to remove those bottlenecks, which is what parallel file systems and MPI-I/O intend to do, I presume. > >> If you are on a shoestring budget, and your goal is to do parallel >> computing, and your applications are not particularly I/O intensive, >> what would you prioritize: a fast interconnect for MPI, or hardware >> and software for a parallel file system? > > A balanced approach :-) It's important to have a clear idea of what "not > particularly I/O intensive" actually means and how much value the users > give for the various tasks that would run on the cluster. > In coarse terms, I would guess that a program is not particularly I/O intensive if the computation and (non-I/O) MPI communication effort is much larger than the I/O effort. Typically we do I/O every 4000 time steps or so, in rare cases every ~40 time steps. In between there is heavy computation per time step, and MPI exchange of boundary values (a series of 2D arrays) every time step. If the computation is halted once every 4000 steps to funnel I/O through the master processor, it is not a big deal in overall effort. Other parallel applications (scientific or not) may have a very different ratio, I suppose. >> Hopefully courses like yours will improve this. If I could, I would >> love to go to Heidelberg and take your class myself! 
> > Just to make this clearer: I wasn't teaching myself; based on Uni > Heidelberg regulations, I'd need to hold a degree (like Dr.) to be > allowed to do teaching. Too bad that Heidelberg is so scholastic w.r.t. academic titles. They should let you teach. I always read your postings, and learn from them. Your students should subscribe to this list, then! :) > But no such restrictions are present for the > practical work, which is actually the part that I find most interesting > because it's more "meaty" and a dialogue with students can take place ;-) > Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- From a28427 at ua.pt Wed Jul 1 18:47:08 2009 From: a28427 at ua.pt (Tiago Marques) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Torque user quotas Message-ID: Hi all, Being somewhat of a noob in Beowulf-type clusters, I must ask: what do you use to manage user quotas for job queueing with Torque and Maui? Gold Allocation Manager? Or does SGE do something like this? I've been browsing the web but couldn't find much. Our current cluster uses just Maui + Torque. The cluster currently accepts jobs on a FCFS basis, but this behaviour is far from ideal. I would like the jobs to continue running as long as the user wishes, but this usage would be "charged" to his account. The balance would then be used to decide whose waiting job gets to run next when a node is made available. Ideally, quotas would be defined per group, with each group having various users. Each group would be given a specific number of nodes, where the "sum of groups*their_nodes=number_of_nodes". Say we have 8 nodes and three groups; node quotas would then be like this:
- group1 would have 2 nodes
- group2 would have 2 nodes
- group3 would have 4 nodes
So that if the cluster is always full and time usage is the same between groups, then group1 would be using 2 nodes, group2 two other nodes, etc. Now, say that group1 has two times the time usage of group2 and group3 is using triple of what group2 used, or their exact quota:
- group1 would have used 8 days
- group2 would have used 4 days
- group3 would have used 12 days
(I'm oversimplifying and will probably screw up the math somewhere.) Group3 has used 50% of the time, so its quota is fine, but group2 is way behind group1. So, the allocation system should disallow group1 from having jobs allocated to nodes until their usage is the same as group2's again - assuming that group3's usage remains constant and that all the nodes are booked:
- group1 would remain at 8 days
- group2 would reach 8 days of usage
- group3 would now be at 16 days of usage
And normal 2-2-4 quotas would be in place again. Or, ideally, this would be smoothed out over time, like in a 1-3-4 usage, to avoid anyone being unable to perform calculations for a long time just because they used the cluster too much when another group didn't use it for months. The problem here is that we have different groups with different numbers of researchers, and some groups have allocated more research grants than others to the cluster, hence should be entitled to a fair usage scenario. This is likely to remain the case for a good amount of time, and automation of the quotas would be ideal. Is there any kind of solution that provides this sort of behaviour, even if only for users and not groups?
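For reference, Maui's built-in fairshare can approximate the kind of per-group target described above: usage is tracked over a decaying window and jobs from groups that are over their target get deprioritised. A rough maui.cfg sketch (group names and numbers are illustrative only, and the exact syntax is worth checking against the Maui documentation):

   # fairshare section of maui.cfg (illustrative values)
   FSPOLICY        DEDICATEDPS     # charge by dedicated processor-seconds
   FSDEPTH         7               # keep 7 windows of usage history
   FSINTERVAL      24:00:00        # each window covers one day
   FSDECAY         0.80            # older windows count progressively less
   FSWEIGHT        10              # how strongly fairshare affects priority

   GROUPCFG[group1]  FSTARGET=25   # 2 of 8 nodes -> 25% of usage
   GROUPCFG[group2]  FSTARGET=25
   GROUPCFG[group3]  FSTARGET=50

Hard charging against a running balance, as described above, is more the territory of an allocation manager such as the Gold Allocation Manager already mentioned.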
Best regards, Tiago Marques From d.love at liverpool.ac.uk Thu Jul 2 03:49:58 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Re: typical latencies for gigabit ethernet In-Reply-To: <4A4A0EE3.6060900@tamu.edu> (Gerry Creager's message of "Tue, 30 Jun 2009 14:10:59 +0100") References: <87vdmg1sxp.fsf@liv.ac.uk> <4A47971B.3000203@tamu.edu> <87r5x3e4xs.fsf@liv.ac.uk> <4A48C7EC.7090607@tamu.edu> <87my7rcakg.fsf@liv.ac.uk> <4A4A0EE3.6060900@tamu.edu> Message-ID: <87ocs38ill.fsf@liv.ac.uk> Gerry Creager writes: > Anaconda (CentOS2 originally installed a tg3 driver. I guess things have moved on a bit since v2. > Performance sucked (technical term) I've known more technical ones! > with that and we went to a bnx2 driver consistent with > our understanding of the hardware. It sucked less. I checked there's no overlap of PCI ids in recent versions of the two modules, so at least now you can't use either. That's consistent with my understanding from the Broadcom site too, so I guess it was specific to old CentOS. Thanks. From deadline at eadline.org Thu Jul 2 16:48:42 2009 From: deadline at eadline.org (Douglas Eadline) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Sell a U41 rack w/ computers and switches In-Reply-To: <2f39fbb60906291126o1054bf9dg5541a7f88b90b458@mail.gmail.com> References: <2f39fbb60906291126o1054bf9dg5541a7f88b90b458@mail.gmail.com> Message-ID: <48485.68.44.135.226.1246578522.squirrel@mail.eadline.org> > Hello everyone, I helped a friend get his cluster up and running and but > when he found out how much it would cost for electricity and maintenance > he > became fond of the idea of just selling the thing. > > Its a U41 amax rack with 14 blades with 4 1.6ghz processors, 4gb ram, > 250gb > harddisks each. Two heavy duty UPSs, Three switches, everything is > gigabit, and controlled by bladelike monitor/keyboard/mouse. Basically it > is good to go. > > I have written software which pxe boots an operating system on all nodes > except for the first, making maintenance way easy for anyone who knows > what > they are doing.(my website has more details) > > I would put this guy on ebay, but considering we needed a forklift to move > it in the first place it be great to find someone in the San Francisco bay > area to avoid a shipping nightmare. > Please do not take this the wrong way, but how can someone buy an expensive piece of IT hardware and not have an understanding of the power requirements/cost? This intrigues me. As someone who reads and writes quite a bit about HPC I assumed the power and cooling message was well understood. If it is not, maybe there is more to write about! -- Doug > Thanks, > > - Karlan Thomas Mitchell > - http://3dstoneage.com > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Doug From csamuel at vpac.org Thu Jul 2 22:17:17 2009 From: csamuel at vpac.org (Chris Samuel) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Odd AMD quad core SuperMicro power off issues In-Reply-To: <611059374.8547881246596069297.JavaMail.root@mail.vpac.org> Message-ID: <220098836.8548541246598237135.JavaMail.root@mail.vpac.org> ----- "Chris Samuel" wrote: In April I wrote: > Well we've been gradually replacing the Barcelona chips > with Shanghai (same clockspeed) and we are yet to see a > power off on a Shanghai node! 
Since I wrote that we have seen far fewer with 2.3GHz Shanghai (2376, a 75W part), *but* we have some nodes upgraded to the ULP 2.4 GHz Shanghai (2379 HE, a 55W part) which do exhibit this issue very regularly! :-( Gaussian is still a classic for doing this, but we've also been able to trigger it with VASP, Amber and (far less frequently) InterProScan. The compute nodes are using SuperMicro H8DM8-2 based with 32GB of ECC RAM. The boxes are running CentOS 5.3 with mainline kernels (currently 2.6.28.9, though we have demonstrated it with 2.6.30-rc6 and the EDAC patches which catch nothing before it dies). We've seen the same behaviour with the standard CentOS kernels too. This is driving us up the wall! Is nobody else seeing this ? cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From i.n.kozin at googlemail.com Fri Jul 3 08:32:42 2009 From: i.n.kozin at googlemail.com (Igor Kozin) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] seminars and webcasting on 7th July Message-ID: Dear All, As you probably already know there will be three extremely interesting seminars held on the 7th July at Daresbury - Jack Dongarra, U of Tennessee "Multicore and Hybrid Computing for Dense Linear Algebra Computations" - Benoit Raoult, CAPS "HMPP: Leverage Computing Power Simply by Directives" - Timothy Lanfear, NVIDIA "Visual Computing" followed by a reception sponsored by NVIDIA. More details here http://www.cse.scitech.ac.uk/events/GPU_2009/programme.html You are most welcome to attend the seminars but if you can't this is to let you know that we'll be webcasting the event. External people can tune in here http://extrplay.dl.ac.uk/ whereas local folk can watch here http://dlvidserve.dl.ac.uk/ The email address for questions to the speakers is disco_group@dl.ac.uk and it will be active only on the 7th July. Hopefully everything goes well and either see you on Monday or tune in! Regards, Igor From jclinton at advancedclustering.com Mon Jul 6 08:27:02 2009 From: jclinton at advancedclustering.com (Jason Clinton) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Odd AMD quad core SuperMicro power off issues In-Reply-To: <220098836.8548541246598237135.JavaMail.root@mail.vpac.org> References: <611059374.8547881246596069297.JavaMail.root@mail.vpac.org> <220098836.8548541246598237135.JavaMail.root@mail.vpac.org> Message-ID: <588c11220907060827u72370c1ajd7f2575ee288fc15@mail.gmail.com> On Fri, Jul 3, 2009 at 12:17 AM, Chris Samuel wrote: >> Well we've been gradually replacing the Barcelona chips >> with Shanghai (same clockspeed) and we are yet to see a >> power off on a Shanghai node! > > Since I wrote that we have seen far fewer with 2.3GHz > Shanghai (2376, a 75W part), *but* we have some nodes > upgraded to the ULP 2.4 GHz Shanghai (2379 HE, a 55W > part) which do exhibit this issue very regularly! :-( We saw a similar power-off issue on a customer of ours who upgraded from 2220's to Barcelona's on a similar board; it was reproducible at the same failure rate on approximately 160 nodes. After trying just about everything under the sun, we wholesale replaced all the memory in the entire cluster. The power-offs ceased immediately thereafter and have not returned. -- Jason D. 
Clinton, 913-643-0306, http://twitter.com/HPCClusterTech http://www.google.com/profiles/jasondclinton From mathog at caltech.edu Mon Jul 6 10:20:00 2009 From: mathog at caltech.edu (David Mathog) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Re: Odd AMD quad core SuperMicro power off issues Message-ID: Chris Samuel wrote: > > In April I wrote: > > > Well we've been gradually replacing the Barcelona chips > > with Shanghai (same clockspeed) and we are yet to see a > > power off on a Shanghai node! > > Since I wrote that we have seen far fewer with 2.3GHz > Shanghai (2376, a 75W part), *but* we have some some as in: some of the upgraded nodes do this, some do not? > nodes > upgraded to the ULP 2.4 GHz Shanghai (2379 HE, a 55W > part) which do exhibit this issue very regularly! :-( If some of your upgraded nodes do this, and some do not, then this will most likely map to one of: 1. CPU 2. motherboard (all are identical, including BIOS, right?) 3. RAM 4. power supply Start swapping parts between good and bad nodes and pray that it correlates perfectly with the location of one component type. Also keep bugging Supermicro, they should have some idea what is going on. Refresh our memory on this, are you seeing orderly power off (as in a shutdown) or are the nodes just powering down like "boom"? In the latter case I would tend to suspect that the power supply has issues and is triggering an emergency power off to prevent damage from overheating or overload. Swapping the CPUs could make a difference if the newer ones use a bit less power than the older ones. (We had a bunch of PCs which, due to a monster graphics card, were so close to the power supply limit that adding a single fan made the difference between being able to run SpecViewPerf to completion or not - using a lower power CPU would have made the same sort of difference.) Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From cousins at umit.maine.edu Mon Jul 6 14:46:06 2009 From: cousins at umit.maine.edu (Steve Cousins) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Odd AMD quad core SuperMicro power off issues In-Reply-To: <200907031900.n63J0AsW002771@bluewest.scyld.com> References: <200907031900.n63J0AsW002771@bluewest.scyld.com> Message-ID: > ----- "Chris Samuel" wrote: > > The compute nodes are using SuperMicro H8DM8-2 based > with 32GB of ECC RAM. Hi Chris, I had MCE crashes on a Supermicro system (quad Xeon quad-core 2.4 Ghz) that was driving me nuts for quite a while. It would take a couple of months to crash which doesn't sound bad but it was a real pain. I bought the machine from ASL and they worked with Supermicro to fix a microcode issue. The reason I mention this is that at least in this case, the BIOS version was same before I ran the update and after. Here is part of a message I got from ASL: > Note that you will be updating the BIOS from version 1.0b to 1.0b. In > Supermicro wisdom, they released several updates using the same revision > number. After updating my 1.0b BIOS to the new 1.0b the machine has been running solid since Christmas. So, if you have two machines, one that crashes and one that doesn't, check the dates of the BIOS's even if the BIOS versions match. I hope this helps. 
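A quick way to compare those dates across a set of nodes without rebooting into the BIOS setup, assuming dmidecode is installed and run as root, is for example:

   # run on each node, e.g. via ssh or pdsh
   dmidecode -s bios-version
   dmidecode -s bios-release-date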
Steve From csamuel at vpac.org Mon Jul 6 17:53:39 2009 From: csamuel at vpac.org (Chris Samuel) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Re: Odd AMD quad core SuperMicro power off issues In-Reply-To: Message-ID: <55211600.147731246928019525.JavaMail.root@mail.vpac.org> ----- "David Mathog" wrote: Hi David, > Chris Samuel wrote: > > > Since I wrote that we have seen far fewer with 2.3GHz > > Shanghai (2376, a 75W part), *but* we have some > > some as in: some of the upgraded nodes do this, some do not? Some as in any of the ones we've had the chance to isolate and run tests on (the others are running user jobs). > Refresh our memory on this, are you seeing orderly power > off (as in a shutdown) or are the nodes just powering > down like "boom"? Fall down go boom. :-( One second running with power light on, next second dead with no power light. > In the latter case I would tend to suspect that the power > supply has issues and is triggering an emergency power off > to prevent damage from overheating or overload. We've duplicated this with a much higher capacity PSU on the same board. > Swapping the CPUs could make a difference if the newer > ones use a bit less power than the older ones. It's the lower power (55W) parts that power off, not the higher power (75W) ones. I would have thought it would have been the other way around ? cheers! Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From ellis at runnersroll.com Mon Jul 6 19:06:18 2009 From: ellis at runnersroll.com (Ellis Wilson III) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Re: Odd AMD quad core SuperMicro power off issues Message-ID: <1112284347-1246932371-cardhu_decombobulator_blackberry.rim.net-856679655-@bxe1087.bisx.prod.on.blackberry> I'm sure that this has already been examined, but just in case: Can you confirm that the sensors on these boards are playing nice with the bios and the sensors which are typically integrated into the cpu? I.e. If the cpu sensors weren't reporting values the bios expected, it might just cut out and assume that the cpu is either overheating/failing or the sensors are crapping out. Sorry if this is an obvious suggestion, I'm far from a sysadmin (maybe someday though)! Ellis From gerry.creager at tamu.edu Mon Jul 6 22:13:49 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Odd AMD quad core SuperMicro power off issues In-Reply-To: References: <200907031900.n63J0AsW002771@bluewest.scyld.com> Message-ID: <4A52D98D.5050505@tamu.edu> I learned recently that, regardless of the versioing, it's VERY important with SuperMicro to check the BIOS date. Much more important than I'd thought. As in, that's the real release info on the BIOS. gerry Steve Cousins wrote: > > >> ----- "Chris Samuel" wrote: >> >> The compute nodes are using SuperMicro H8DM8-2 based >> with 32GB of ECC RAM. > > Hi Chris, > > I had MCE crashes on a Supermicro system (quad Xeon quad-core 2.4 Ghz) > that was driving me nuts for quite a while. It would take a couple of > months to crash which doesn't sound bad but it was a real pain. I bought > the machine from ASL and they worked with Supermicro to fix a microcode > issue. > > The reason I mention this is that at least in this case, the BIOS > version was same before I ran the update and after. 
> > Here is part of a message I got from ASL: > >> Note that you will be updating the BIOS from version 1.0b to 1.0b. In >> Supermicro wisdom, they released several updates using the same >> revision number. > > After updating my 1.0b BIOS to the new 1.0b the machine has been running > solid since Christmas. > > So, if you have two machines, one that crashes and one that doesn't, > check the dates of the BIOS's even if the BIOS versions match. > > I hope this helps. > > Steve > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From csamuel at vpac.org Mon Jul 6 22:20:36 2009 From: csamuel at vpac.org (Chris Samuel) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Odd AMD quad core SuperMicro power off issues In-Reply-To: <588c11220907060827u72370c1ajd7f2575ee288fc15@mail.gmail.com> Message-ID: <722167562.156891246944036924.JavaMail.root@mail.vpac.org> ----- "Jason Clinton" wrote: Hi Jason, > We saw a similar power-off issue on a customer of ours who upgraded > from 2220's to Barcelona's on a similar board; it was reproducible at > the same failure rate on approximately 160 nodes. After trying just > about everything under the sun, we wholesale replaced all the memory > in the entire cluster. The power-offs ceased immediately thereafter > and have not returned. We saw that with Barcelona's, but instead going to the 2.3GHz (75W) Shanghai's solved the issue for us - we were rather surprised to see it reappear with the 2.4GHz (55W) Shanghai. :-( cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From gerry.creager at tamu.edu Mon Jul 6 22:25:47 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Re: Odd AMD quad core SuperMicro power off issues In-Reply-To: <1112284347-1246932371-cardhu_decombobulator_blackberry.rim.net-856679655-@bxe1087.bisx.prod.on.blackberry> References: <1112284347-1246932371-cardhu_decombobulator_blackberry.rim.net-856679655-@bxe1087.bisx.prod.on.blackberry> Message-ID: <4A52DC5B.2060209@tamu.edu> I'm having some interesting problems, too. I suspect I've now gotto get AMD involved but the system manufacturer and VAR have spent weeks looking at this and had no success I can tell so far. I have no good way save when I boot into the BIOS and see what appears to be normal-looking values on the sensors page. gerry Ellis Wilson III wrote: > I'm sure that this has already been examined, but just in case: > > Can you confirm that the sensors on these boards are playing nice with the bios and the sensors which are typically integrated into the cpu? I.e. If the cpu sensors weren't reporting values the bios expected, it might just cut out and assume that the cpu is either overheating/failing or the sensors are crapping out. > > Sorry if this is an obvious suggestion, I'm far from a sysadmin (maybe someday though)! 
> > Ellis > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From csamuel at vpac.org Mon Jul 6 22:37:46 2009 From: csamuel at vpac.org (Chris Samuel) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Re: Odd AMD quad core SuperMicro power off issues In-Reply-To: <4A52DC5B.2060209@tamu.edu> Message-ID: <763514123.157001246945066589.JavaMail.root@mail.vpac.org> ----- "Gerry Creager" wrote: > I'm having some interesting problems, too. I suspect I've now gotto > get AMD involved but the system manufacturer and VAR have spent weeks > looking at this and had no success I can tell so far. Yeah, our vendor has just got a phone conference set up with them and AMD tomorrow morning about this. I had some hints off list about memory issues, but we pulled half the RAM from each socket, dropped the G03 job down to just use 15872MB rather than 31GB and it still killed the node, just took a little bit longer than before. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From linus at ussarna.se Tue Jul 7 03:46:45 2009 From: linus at ussarna.se (Linus Harling) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Enabling I/OAT and DCA on a live system Message-ID: <4A532795.9000503@ussarna.se> Hi All, though this might be of interest to someone on the list. It is an article about how to enable I/OAT and DCA on a live system or a system that hides the option in the BIOS. According to the post: "So by using I/OAT the network stack in the Linux kernel can offload copy operations to increase throughput. I/OAT also includes a feature called Direct Cache Access (DCA) which can deliver data directly into processor caches. This is particularly cool because when a network interrupt arrives and data is copied to system memory, the CPU which will access this data will not cause a cache-miss on the CPU because DCA has already put the data it needs in the cache. Sick. Measurements from the Linux Foundation project indicate a 10% reduction in CPU usage, while the Myri-10G NIC website claims they?ve measured a 40% reduction in CPU usage. For more information describing the performance benefits of DCA see this incredibly detailed paper: Direct Cache Access for High Bandwidth Network I/O." Regards /Mostly Lurking From csamuel at vpac.org Wed Jul 8 00:40:45 2009 From: csamuel at vpac.org (Chris Samuel) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Enabling I/OAT and DCA on a live system In-Reply-To: <2069463823.221781247038673171.JavaMail.root@mail.vpac.org> Message-ID: <964955624.221801247038845278.JavaMail.root@mail.vpac.org> Hiya, ----- "Linus Harling" wrote: > Hi All, though this might be of interest to someone > on the list. It is an article about how to enable > I/OAT and DCA on a live system or a system that > hides the option in the BIOS. Be careful though, we hit some nasty kernel panics on an IBM system when trying to use 2.6.28.x with I/OAT and DCA due to suspected locking problems in the kernel. 
We're currently stuck on 2.6.28.x due to the fact that NFS export of XFS filesystems is broken to the point of causing a kernel panic in 2.6.29.x and 2.6.30.x at present. :-( http://bugzilla.kernel.org/show_bug.cgi?id=13375 cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From eugen at leitl.org Wed Jul 8 04:02:14 2009 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Odd AMD quad core SuperMicro power off issues In-Reply-To: <4A52D98D.5050505@tamu.edu> References: <200907031900.n63J0AsW002771@bluewest.scyld.com> <4A52D98D.5050505@tamu.edu> Message-ID: <20090708110214.GH23524@leitl.org> On Tue, Jul 07, 2009 at 12:13:49AM -0500, Gerry Creager wrote: > I learned recently that, regardless of the versioing, it's VERY > important with SuperMicro to check the BIOS date. > > Much more important than I'd thought. As in, that's the real release > info on the BIOS. I've had issues with spontaneous lockups on Windows 2003 Server (no BSOD, gray screen with no login, the mouse was responsive, but nothing else) on an Opteron Supermicro (H8SMI-2/BULK). The thing also has Oracle on it with a custom cartridge, and many weird proprietary things on it. Haven't tried it with Linux yet. I've reflashed the BIOS, which hasn't changed anything. The hardware monitoring application Supermicro ships for Windows is surprisingly awful. It also filled the logs with spurious hardware failures (every component failed, according to the log) and broke OLE embedding on that machine. Anecdotally, a reseller mentioned that board has issues. I'll follow up on this in the next week or two, and report if I find something. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From romero619 at hotmail.com Wed Jul 8 11:59:22 2009 From: romero619 at hotmail.com (P.R.) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Some beginner's questions on cluster setup Message-ID: Hi, I'm new to the list & also to cluster technology in general. I'm planning on building a small 20+node cluster, and I have some basic questions. We're planning on running 5-6 motherboards with quad-core AMD 3.0GHz Phenoms, and 4GB of RAM per node. Off the bat, does this sound like a reasonable setup? My first question is about node file & operating systems: I'd like to go with a diskless setup, preferably using an NFS root for each node. However, based on some of the testing I've done, running the nodes off of the NFS share(s) has turned out to be rather slow & quirky. Our master node will be running on a completely different hardware setup than the slaves, so I *believe* it will make it more complicated & tedious to set up & update the nfsroots for all of the nodes (since it's not simply a matter of 'cloning' the master's setup & config). Is there any truth to this, am I way off? Can anyone provide any general advice or feedback on how to best set up a diskless node? The alternative that I was considering was using (4GB?) USB flash drives to drive a full-blown, local OS install on each node... Q: does anyone have experience running a node off of a usb flash drive? If so, what are some of the pros/cons/issues associated with this type of setup? 
My next question(s) is regarding network setup. Each motherboard has an integrated gigabit nic. Q: should we be running 2 gigabit NICs per motherboard instead of one? Is there a 'rule-of-thumb' when it comes to sizing the network requirements? (i.e.,'one NIC per 1-2 processor cores'...) Also, we were planning on plugging EVERYTHING into one big (unmanaged) gigabit switch. However, I read somewhere on the net where another cluster was physically separating NFS & MPI traffic on two separate gigabit switches. Any thoughts as to whether we should implement two switches, or should we be ok with only 1 switch? Notes: The application we'll be running is NOAA's wavewatch3, in case anyone has any experience with it. It will utilize a fair amount of NFS traffic (each node must read a common set of data at periodic intervals), and I *believe* that the MPI traffic is not extremely heavy or constant (i.e., nodes do large amounts of independent processing before sending results back to master). Id appreciate any help or feedback anyone would be willing&able to offer... Thanks, P.Romero From carsten.aulbert at aei.mpg.de Wed Jul 8 22:47:01 2009 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Some beginner's questions on cluster setup In-Reply-To: References: Message-ID: <4A558455.5040208@aei.mpg.de> Hi P.R. wrote: > Im planning on building a small 20+node cluster, and I have some basic > questions. > We're planning on running 5-6 motherboards with quad-core amd 3.0GHz > phenoms, and 4GB of RAM per node. > Off the bat, does this sound like a reasonable setup > I guess that fully depends on what you want to accomplish. If you want to use it as a proof of concept design or as a cluster for smaller tasks I think it looks reasonable. If you wanted to render IceAge4 with it, I think you need more power ;) > My first question is about node file&operating systems: > I'd like to go with a diskless setup, preferably using an NFS root for each > node. > However, based on some of the testing Ive done, running the nodes off of the > NFS share(s) has turned out to be rather slow & quirky. > Our master node will be running on a completely different hardware setup > than the slaves, so I *believe* it will make it more complicated & tedious > to setup&update the nfsroots for all of the nodes (since its not simply a > matter of 'cloning' the master's setup&config). > Is there any truth to this, am I way off? > 5-6 boxes off NFS root should not be a large burden to the server, as long as it has decent disk speeds (small RAID perhaps) and plenty of memory for caching (couple of GB should be sufficient to start with). Try tuning the NFS parameters to suit your needs. > Can anyone provide any general advice or feedback on how to best setup a > diskless node? Not really, we do that only during the installation phase. > > > The alternative that I was considering was using (4GB?) USB flash drives to > drive a full-blown,local OS install on each node... > Q: does anyone have experience running a node off of a usb flash drive? > If so, what are some of the pros/cons/issues associated with this type of > setup? > We do that only rarely for rescue setups as our nodes don't have CD drives and I think USB flash drives are pretty slow still. > > My next question(s) is regarding network setup. > Each motherboard has an integrated gigabit nic. > > Q: should we be running 2 gigabit NICs per motherboard instead of one? 
> Is there a 'rule-of-thumb' when it comes to sizing the network requirements? > (i.e.,'one NIC per 1-2 processor cores'...) > Again, that all depends on your workload and jobs. I think no-one can help you there unless you know what the workload will be. > > Also, we were planning on plugging EVERYTHING into one big (unmanaged) > gigabit switch. > However, I read somewhere on the net where another cluster was physically > separating NFS & MPI traffic on two separate gigabit switches. > Any thoughts as to whether we should implement two switches, or should we be > ok with only 1 switch? > Well again it depends what you need. I'd start first of with a single switch and see of NFS traffic is killing your MPI performance. On larger sites storage and intercommunication networks are often separated as the might interfere with each other too much, but it will boil down to the question how much money you have. 2 8-port GBit switches are cheap enough for testing, 2 1000+ Gbit switches (or Infiniband,...) are not ;) > > Notes: > The application we'll be running is NOAA's wavewatch3, in case anyone has > any experience with it. > It will utilize a fair amount of NFS traffic (each node must read a common > set of data at periodic intervals), > and I *believe* that the MPI traffic is not extremely heavy or constant > (i.e., nodes do large amounts of independent processing before sending > results back to master). > With the small number of machines I would go with 2 8 port switches and see what happens. No idea how wavewatch3 works and what it needs, sorry. > > Id appreciate any help or feedback anyone would be willing&able to offer... > I hope my reply already helps a little bit. Cheers Carsten From jac67 at georgetown.edu Thu Jul 9 08:16:02 2009 From: jac67 at georgetown.edu (Jess Cannata) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Some beginner's questions on cluster setup In-Reply-To: References: Message-ID: <4A5609B2.10103@georgetown.edu> Hi! The diskless provisioning system is definitely the way to go. We use the cluster toolkit called, Jesswulf, which is available at http://hpc.arc.georgetown.edu/mirror/jesswulf/ By default it runs on RedHat/Centos/Fedora systems, though it has been ported to Ubuntu and SuSE without too much trouble. Perseus/Warewulf also work well. We also teach cluster courses, which may be helpful. http://training.arc.georgetown.edu/ To answer some of your questions, I prefer the read-only NFSROOT approach with a small (less than 20 MB ramdisk). We use this on all of our clusters (about 7 clusters) and it works fine. We even use it on heterogeneous systems. One cluster has a mix of P4 Xeons, dual-core Opterons, and quad-core Xeons all using the same NFSROOT so you simply update one directory on the master node and *all* of the compute nodes have the new software. We love it! We simply either compile the kernel or make the initrd with hardware support for all of the nodes. We often use different hardware for the master and compute nodes, without issue. The only thing that we don't mix is 32 and 64-bit. We have a couple of 32-bit clusters and the rest are 64-bit. The main issue that you need to deal with is having a fast enough storage system for parallel jobs that generate a lot of data. We use the local hard drives in the computes nodes for "scratch" space and we have some type of shared file system. On the small clusters, we use NFS, but on the bigger clusters we use Glusterfs with Infiniband, which has proven to be very nice. 
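To make the read-only NFSROOT idea concrete: on the server side it boils down to a single export shared by every node, plus a read-write area for home directories and data. A rough sketch, with /srv/nfsroot and the 10.0.0.0/24 node network standing in for whatever you actually use:

  # /etc/exports on the master
  /srv/nfsroot   10.0.0.0/24(ro,no_root_squash,async)   # one tree, read-only, for all nodes
  /home          10.0.0.0/24(rw,no_root_squash,sync)    # ordinary read-write NFS for users
  # reload the export table after editing
  exportfs -ra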
If you are running MPI jobs with lots of data, you might want to consider adding Infiniband. Even the cheap ($125) Infiniband cards give much better performance than standard Gigabit. And you can always run IP over IB for applications or services that need standard IP. You mention that you don't think that you will have too much MPI traffic, but that you will be copying the results back to the master. This is when we see the highest load on our NFS file systems when all of the compute nodes are writing at the same time, even on small clusters (less than 20 nodes). We've found that a clustered file system like Glusterfs provides very low I/O wait load when copying lots of files compared to NFS. You may consider picking up some of the cheap IB cards ($125) and switches ($750 for 8-ports/$2400 for 24-ports) in order to do some relatively inexpensive testing. Here is one place where you can find them: http://www.colfaxdirect.com/store/pc/viewCategories.asp?pageStyle=m&idCategory=6 I'd be happy to talk to you. My phone number is below and you have my e-mail. Jess -- Jess Cannata Advanced Research Computing & High Performance Computing Training Georgetown University 202-687-3661 P.R. wrote: > Hi, > Im new to the list & also to cluster technology in general. > Im planning on building a small 20+node cluster, and I have some basic > questions. > We're planning on running 5-6 motherboards with quad-core amd 3.0GHz > phenoms, and 4GB of RAM per node. > Off the bat, does this sound like a reasonable setup > > My first question is about node file&operating systems: > I'd like to go with a diskless setup, preferably using an NFS root for each > node. > However, based on some of the testing Ive done, running the nodes off of the > NFS share(s) has turned out to be rather slow & quirky. > Our master node will be running on a completely different hardware setup > than the slaves, so I *believe* it will make it more complicated & tedious > to setup&update the nfsroots for all of the nodes (since its not simply a > matter of 'cloning' the master's setup&config). > Is there any truth to this, am I way off? > > Can anyone provide any general advice or feedback on how to best setup a > diskless node? > > > The alternative that I was considering was using (4GB?) USB flash drives to > drive a full-blown,local OS install on each node... > Q: does anyone have experience running a node off of a usb flash drive? > If so, what are some of the pros/cons/issues associated with this type of > setup? > > > My next question(s) is regarding network setup. > Each motherboard has an integrated gigabit nic. > > Q: should we be running 2 gigabit NICs per motherboard instead of one? > Is there a 'rule-of-thumb' when it comes to sizing the network requirements? > (i.e.,'one NIC per 1-2 processor cores'...) > > > Also, we were planning on plugging EVERYTHING into one big (unmanaged) > gigabit switch. > However, I read somewhere on the net where another cluster was physically > separating NFS & MPI traffic on two separate gigabit switches. > Any thoughts as to whether we should implement two switches, or should we be > ok with only 1 switch? > > > Notes: > The application we'll be running is NOAA's wavewatch3, in case anyone has > any experience with it. 
> It will utilize a fair amount of NFS traffic (each node must read a common > set of data at periodic intervals), > and I *believe* that the MPI traffic is not extremely heavy or constant > (i.e., nodes do large amounts of independent processing before sending > results back to master). > > > Id appreciate any help or feedback anyone would be willing&able to offer... > > Thanks, > P.Romero > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From diep at xs4all.nl Sun Jul 12 06:43:27 2009 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Some beginner's questions on cluster setup In-Reply-To: <4A558455.5040208@aei.mpg.de> References: <4A558455.5040208@aei.mpg.de> Message-ID: On Jul 9, 2009, at 7:47 AM, Carsten Aulbert wrote: > Hi > > P.R. wrote: >> Im planning on building a small 20+node cluster, and I have some >> basic >> questions. >> We're planning on running 5-6 motherboards with quad-core amd 3.0GHz >> phenoms, and 4GB of RAM per node. >> Off the bat, does this sound like a reasonable setup >> > > I guess that fully depends on what you want to accomplish. If you want > to use it as a proof of concept design or as a cluster for smaller > tasks > I think it looks reasonable. If you wanted to render IceAge4 with > it, I > think you need more power ;) Sorry to hop in. Realize how low quality movies are pixelwise seen compared to a highres photocamera or scanner. It is a huge task to create graphics and direct a movie. Really big. Much underestimated part is writing scenario for graphics work. It's not so hard to plug in existing objects, but creating the scenario what it is gonna do and how... ...not easy! Designing a single head of a character and how it is supposed to behave, it already is over a full month fulltime work for a design team. For how it behaves actually there is expensive machines getting used that measure in 3d. This could be just 1 character that you see a fragment of a few seconds from some distance walking by. Note they're very clever in reusing already produced scenario's. However the actual rendering of the movie depends much upon the software you use for it. There is a difference between the actual speed of the commercial software and what it *could* do. Some years ago i wasn't happy with the reflection. Wasn't photorealistic enough IMHO. Took months to fix both in the graphics as well as in the 3d engine. Looked more photorealistic then. As a comparision we rendered a scene in lightwave itself. It took hours to produce that. Just 600KB in MP4 format (and a lot more in other formats). That specific animation is just 2.5 seconds in time. Took an hour to get rendered. It wasn't even same photorealistic quality. Something we render realtime looks better than what commercial software took hours. It's a factor 1000 difference however in rendering speed. Games have to work realtime, rendering software doesn't. I wouldn't say they can't program. Programming for speed is however a science in itself. Anyone here knows about how to sort objects. Sorted lists work faster than unsorted lists where you run from 1 to n to get something done. Square that a number of times and you'll realize the problem. Total trivial for probably vaste majority of subscribers in this list, total not trivial for that type of software. 
It's 1 person who programmed it 1 day in history. It can be done realtime generating that movie at cheapo hardware nowadays. That hardware is so fast for especially this purpose, only that software... But if the software would be doing that, no one can charge a big price for whatever he was doing :) Vincent From dxryder50 at hotmail.com Sun Jul 5 23:50:49 2009 From: dxryder50 at hotmail.com (dave ryder) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Obtaining knowledge Message-ID: Hello, I'm trying learn how to built my first computer cluster using OEM PC and /or PS3. What do I need to do? I've try searching the subject online using Google and youtube. But, there's no clear answer to my question. Can you help me? _________________________________________________________________ Windows Live? SkyDrive?: Get 25 GB of free online storage. http://windowslive.com/online/skydrive?ocid=TXT_TAGLM_WL_SD_25GB_062009 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20090706/0b18fab2/attachment.html From john.hearns at mclaren.com Thu Jul 9 08:50:27 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed Nov 25 01:08:45 2009 Subject: [Beowulf] Some beginner's questions on cluster setup In-Reply-To: References: Message-ID: <68A57CCFD4005646957BD2D18E60667B0C7BDEA3@milexchmb1.mil.tagmclarengroup.com> I echo what the other replies have said on this one. > My next question(s) is regarding network setup. > Each motherboard has an integrated gigabit nic. > > Q: should we be running 2 gigabit NICs per motherboard instead of one? > Is there a 'rule-of-thumb' when it comes to sizing the network > requirements? > (i.e.,'one NIC per 1-2 processor cores'...) > > > Also, we were planning on plugging EVERYTHING into one big (unmanaged) > gigabit switch. > However, I read somewhere on the net where another cluster was > physically > separating NFS & MPI traffic on two separate gigabit switches. This is quite a common configuration - you run the cluster management and NFS traffic on one network, And the MPI traffic on another network. I would personally go for two separate switches. The only slight complication here is that when you run MPI through a batch system you have an MPI machines file assigned by the batch system - you might have to change that file such that the hostnames are the ones associated with the MPI interface. To make that clear, lets say the node name is 'node1' and you might have to change this to 'node1-mpi' Scripts to do this are very easy to write - don't worry. Also I must ask you - have you considered buying a preconfigured, tested and supported system from a cluster company? Most cluster vendors have a 'canned configuration' where they will sell you a system 'off the shelf' which will fit your requirements. They will test the hardware before it is brought to you, assemble it on site, test it again, and will help you get applications running. It really is worth doing this. It is not clear from your email which country you are in, we could recommend some companies. John Hearns The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. 
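The machines-file rewrite described above really is a one-liner. A sketch, assuming a PBS/Torque-style batch system ($PBS_NODEFILE) and the -mpi naming convention from the example; other schedulers expose the node list differently, and my_app is just a placeholder:

  #!/bin/sh
  # map batch hostnames onto the MPI network (node1 -> node1-mpi)
  sed 's/$/-mpi/' "$PBS_NODEFILE" > machines.mpi
  # then point mpirun at the rewritten list
  mpirun -machinefile machines.mpi -np "$(wc -l < machines.mpi)" ./my_app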
From sammeuly at yahoo.ca Thu Jul 9 14:29:28 2009 From: sammeuly at yahoo.ca (Samuel Pichardo) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Re: Some beginner's questions on cluster setup Message-ID: <236823.2122.qm@web65509.mail.ac4.yahoo.com> As pointed by others, if MPI traffic is not a concern (only a "massive" reading at beginning + writing at end), single NIC + NFS should handle it, in our case we have a diskless cluster with 15 slaves+master and rarely we deal with bottlenecks caused by NFS loads and we are not even using RAID. We are using CentOS 5.2 with default values for NFS server, the master is a single Xeon 5410 with 4GB (Supermicro Twin 1U), each slave nodes has 2 x xeon 5410. For the provisioning and node management, I've been using Perceus 1.5. Under CentOS the installation is quite straightforward and simple, just be sure of activating the Perceus modules for name of hosts, group/users, ip address and so on. I'm using NFS for provisioning rather than XGET which for some versions of Perceus is the default provisioning mechanism (XGET was much more slower and sometimes hanged, but I know that Infiniscale has been working hard on this for the latest version, so you can give a try). There is also Caos Linux distribution with the "full package" for clusters ready to go. Perceus and some other cluster-related stuff are included by default in Caos. About the flash usb, same here, we tried it and it was quite slow. Good luck Sam. . D?couvrez les styles qui font sensation sur Yahoo! Qu?bec Avatars. http://cf.avatars.yahoo.com/ -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20090709/acb0b71a/attachment.html From dzaletnev at yandex.ru Thu Jul 9 18:21:12 2009 From: dzaletnev at yandex.ru (Dmitry Zaletnev) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Some beginner's questions on cluster setup In-Reply-To: <4A5609B2.10103@georgetown.edu> References: <4A5609B2.10103@georgetown.edu> Message-ID: <333291247188872@webmail115.yandex.ru> Hi, Jess! I've read your message about Jesswolf from mailing list. And i've a question: i've a small training-purposes installation of PC running Linux and two SonyPS3 running CentOS-base YDL 6.1 or Fedora 11 connected through Gigabit AT unmanaged swith. Is Jesswulf a crossplatform solution? I've encountered difficulties in building OpenFOAM on them from sources - the main trouble is gcc for ppc ver. 4.1 when OpenFOAM wants 4.2 min, and there's something that prevent OpenFOAM from running resulting binaries for ppc platform. I'd like to try on this installation diskless setup, but not sure that PS3 support such features, however it has USB and two network interfaces: GLAN and WLAN. That's my training "clussroom" in cluster technologies and I'm not sure that some CFD-software in which I'm interested for my postgraduate studies support ppc64/cbea architecture. My professor told me that my PS3's and nVidia G80 and Crossplatform .Net are "the day after tommorow", and recommended me work with mass-market solutions and MPI but i'm going on with an idea of heterogenious home cluster. There are two reasons preventing me of using Rapidmind API: they do not support OS's that work with PS3 firmware rev.2.70 and they do not answer on my e-mails, so I cannot knew their pricing. Thanks in advance, Dmitry PS The only distros that i encounted to work with my PS3 firmware rev.2.70 are Fedora 11 ppc, YellowDog 6.1 (new), Ubuntu 8.10 PS3 Server. 
The previous releases do not install, Debian Lennie ppc also. > Hi! > > The diskless provisioning system is definitely the way to go. We use the > cluster toolkit called, Jesswulf, which is available at > > > > http://hpc.arc.georgetown.edu/mirror/jesswulf/ > > By default it runs on RedHat/Centos/Fedora systems, though it has been > ported to Ubuntu and SuSE without too much trouble. Perseus/Warewulf > also work well. We also teach cluster courses, which may be helpful. > > http://training.arc.georgetown.edu/ > > > > To answer some of your questions, I prefer the read-only NFSROOT > approach with a small (less than 20 MB ramdisk). We use this on all of > our clusters (about 7 clusters) and it works fine. We even use it on > heterogeneous systems. One cluster has a mix of P4 Xeons, dual-core > Opterons, and quad-core Xeons all using the same NFSROOT so you simply > update one directory on the master node and *all* of the compute nodes > have the new software. We love it! We simply either compile the kernel > or make the initrd with hardware support for all of the nodes. We often > use different hardware for the master and compute nodes, without issue. > The only thing that we don't mix is 32 and 64-bit. We have a couple of > 32-bit clusters and the rest are 64-bit. > > The main issue that you need to deal with is having a fast enough > storage system for parallel jobs that generate a lot of data. We use the > local hard drives in the computes nodes for "scratch" space and we have > some type of shared file system. On the small clusters, we use NFS, but > on the bigger clusters we use Glusterfs with Infiniband, which has > proven to be very nice. If you are running MPI jobs with lots of data, > you might want to consider adding Infiniband. Even the cheap ($125) > Infiniband cards give much better performance than standard Gigabit. And > you can always run IP over IB for applications or services that need > standard IP. > > You mention that you don't think that you will have too much MPI > traffic, but that you will be copying the results back to the master. > This is when we see the highest load on our NFS file systems when all of > the compute nodes are writing at the same time, even on small clusters > (less than 20 nodes). We've found that a clustered file system like > Glusterfs provides very low I/O wait load when copying lots of files > compared to NFS. You may consider picking up some of the cheap IB cards > ($125) and switches ($750 for 8-ports/$2400 for 24-ports) in order to do > some relatively inexpensive testing. Here is one place where you can > find them: > > http://www.colfaxdirect.com/store/pc/viewCategories.asp?pageStyle=m&idCategory=6 > > I'd be happy to talk to you. My phone number is below and you have my > e-mail. > > Jess > -- > Jess Cannata > Advanced Research Computing & > High Performance Computing Training > Georgetown University > 202-687-3661 > > > > P.R. wrote: > > Hi, > > Im new to the list & also to cluster technology in general. > > Im planning on building a small 20+node cluster, and I have some basic > > questions. > > We're planning on running 5-6 motherboards with quad-core amd 3.0GHz > > phenoms, and 4GB of RAM per node. > > Off the bat, does this sound like a reasonable setup > > > > My first question is about node file&operating systems: > > I'd like to go with a diskless setup, preferably using an NFS root for each > > node. 
> > However, based on some of the testing Ive done, running the nodes off of the > > NFS share(s) has turned out to be rather slow & quirky. > > Our master node will be running on a completely different hardware setup > > than the slaves, so I *believe* it will make it more complicated & tedious > > to setup&update the nfsroots for all of the nodes (since its not simply a > > matter of 'cloning' the master's setup&config). > > Is there any truth to this, am I way off? > > > > Can anyone provide any general advice or feedback on how to best setup a > > diskless node? > > > > > > The alternative that I was considering was using (4GB?) USB flash drives to > > drive a full-blown,local OS install on each node... > > Q: does anyone have experience running a node off of a usb flash drive? > > If so, what are some of the pros/cons/issues associated with this type of > > setup? > > > > > > My next question(s) is regarding network setup. > > Each motherboard has an integrated gigabit nic. > > > > Q: should we be running 2 gigabit NICs per motherboard instead of one? > > Is there a 'rule-of-thumb' when it comes to sizing the network requirements? > > (i.e.,'one NIC per 1-2 processor cores'...) > > > > > > Also, we were planning on plugging EVERYTHING into one big (unmanaged) > > gigabit switch. > > However, I read somewhere on the net where another cluster was physically > > separating NFS & MPI traffic on two separate gigabit switches. > > Any thoughts as to whether we should implement two switches, or should we be > > ok with only 1 switch? > > > > > > Notes: > > The application we'll be running is NOAA's wavewatch3, in case anyone has > > any experience with it. > > It will utilize a fair amount of NFS traffic (each node must read a common > > set of data at periodic intervals), > > and I *believe* that the MPI traffic is not extremely heavy or constant > > (i.e., nodes do large amounts of independent processing before sending > > results back to master). > > > > > > Id appreciate any help or feedback anyone would be willing&able to offer... > > > > Thanks, > > P.Romero > > > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > ????? ????? ?????? http://mail.yandex.ru/promo/neo/welcome/sign From meyerm at in.tum.de Sun Jul 12 10:41:55 2009 From: meyerm at in.tum.de (Marcel Meyer) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] hints for benchmarking on-chip communication Message-ID: <200907121941.55716.meyerm@in.tum.de> Hello list, I want to benchmark on-chip performance of message passing from one process running on one core to another process running on another core (test setup would be OpenMPI 1.3.2 with a 4-socket Dunnington, processes will be pinned to a specific core). I do know about other, more suitable programming models on such a shared memory system, I really just want to have a look at MPI. But I'm a beginner when it comes to benchmarking at that level and wanted to ask you if you could point me to some "first steps"-docs. 
Like how to prevent hardware prefetching getting in the way of measuring the worst-case performance when sending big arrays (force fetching random locations?), how to recognize TLB hits/misses in the results, etc. Currently I'm looking over the source code of the SM-BTL in OpenMPI and will try to get some scheme of the Dunnington to better understand it's architecture (still searching ;-) ). Thank you very much, Marcel From mdidomenico4 at gmail.com Mon Jul 13 11:14:25 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] nvidia card id? Message-ID: Does anyone know off hand if there is a way to pull the exact card information from an nvidia GPU inside a linux server from linux itself? I received an S1070, but linux seems to think it's a S870. The bezel on the front says S1070, so short of opening the unit, I'm not sure which to believe. yes, i emailed nvidia, they have yet to respond. :( thanks From carsten.aulbert at aei.mpg.de Mon Jul 13 11:26:33 2009 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] nvidia card id? In-Reply-To: References: Message-ID: <200907132026.33532.carsten.aulbert@aei.mpg.de> Hi Michaelm On Monday 13 July 2009 20:14:25 Michael Di Domenico wrote: > Does anyone know off hand if there is a way to pull the exact card > information from an nvidia GPU inside a linux server from linux > itself? > Have you tried installing the nvidia driver and the sdk? > I received an S1070, but linux seems to think it's a S870. The bezel > on the front says S1070, so short of opening the unit, I'm not sure > which to believe. gpu06:/usr/local/nvidia/sdk-2.1# bin/linux/release/deviceQuery There are 3 devices supporting CUDA Device 0: "Tesla C1060" Major revision number: 1 Minor revision number: 3 Total amount of global memory: 4294705152 bytes Number of multiprocessors: 30 Number of cores: 240 Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes Total number of registers available per block: 16384 Warp size: 32 Maximum number of threads per block: 512 Maximum sizes of each dimension of a block: 512 x 512 x 64 Maximum sizes of each dimension of a grid: 65535 x 65535 x 1 Maximum memory pitch: 262144 bytes Texture alignment: 256 bytes Clock rate: 1.30 GHz Concurrent copy and execution: Yes [...] Does this help? Cheers Carsten From mdidomenico4 at gmail.com Mon Jul 13 11:40:39 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] nvidia card id? In-Reply-To: <200907132026.33532.carsten.aulbert@aei.mpg.de> References: <200907132026.33532.carsten.aulbert@aei.mpg.de> Message-ID: On Mon, Jul 13, 2009 at 2:26 PM, Carsten Aulbert wrote: > Hi Michaelm > > On Monday 13 July 2009 20:14:25 Michael Di Domenico wrote: >> Does anyone know off hand if there is a way to pull the exact card >> information from an nvidia GPU inside a linux server from linux >> itself? >> > > Have you tried installing the nvidia driver and the sdk? That's where I'm stuck. I'm using the exact package I used on another S1070 install, but this one is telling me that I have an Unsupported Device. From mfatica at gmail.com Mon Jul 13 11:43:50 2009 From: mfatica at gmail.com (Massimiliano Fatica) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] nvidia card id? 
In-Reply-To: References: Message-ID: <8e6393ac0907131143n3c1ff0f6s15fac73a76b414e4@mail.gmail.com> This is the output from an S1070: /sbin/lspci |grep 3D 07:00.0 3D controller: nVidia Corporation: Unknown device 05e7 (rev a1) 09:00.0 3D controller: nVidia Corporation: Unknown device 05e7 (rev a1) Massimiliano On Mon, Jul 13, 2009 at 11:14 AM, Michael Di Domenico < mdidomenico4@gmail.com> wrote: > Does anyone know off hand if there is a way to pull the exact card > information from an nvidia GPU inside a linux server from linux > itself? > > I received an S1070, but linux seems to think it's a S870. The bezel > on the front says S1070, so short of opening the unit, I'm not sure > which to believe. > > yes, i emailed nvidia, they have yet to respond. :( > > thanks > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20090713/2f7f04e5/attachment.html From mdidomenico4 at gmail.com Mon Jul 13 11:48:46 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] nvidia card id? In-Reply-To: <8e6393ac0907131143n3c1ff0f6s15fac73a76b414e4@mail.gmail.com> References: <8e6393ac0907131143n3c1ff0f6s15fac73a76b414e4@mail.gmail.com> Message-ID: On Mon, Jul 13, 2009 at 2:43 PM, Massimiliano Fatica wrote: > This is the output from an S1070: > /sbin/lspci |grep 3D > 07:00.0 3D controller: nVidia Corporation: Unknown device 05e7 (rev a1) > 09:00.0 3D controller: nVidia Corporation: Unknown device 05e7 (rev a1) On the S1070 working that exactly what i see and deviceQuery reports 4 T10 devices. On the non-working unit all I see is ten nVidia Coroporation Tesla S870 (rev a3) devices I'm going to download the S870 package and try to install it and see what happens. From hahn at mcmaster.ca Mon Jul 13 11:49:40 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] nvidia card id? In-Reply-To: References: Message-ID: > Does anyone know off hand if there is a way to pull the exact card > information from an nvidia GPU inside a linux server from linux > itself? well, there's lspci. is that what you meant? it's usually a bit fuzzy how to match the pci-level id (vendor/device/revision) to marketing names. but afaik, the data from lspci is fundamental. From brockp at umich.edu Mon Jul 13 12:06:33 2009 From: brockp at umich.edu (Brock Palen) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] nvidia card id? In-Reply-To: References: Message-ID: <7EBB7B4D-EA91-4861-AC7E-E8AE10CD628C@umich.edu> If you do manage to get a driver installed, look at: /proc/driver/nvidia/cards/* Example our s1070's using [brockp@nyx388 cards]$ cat * Model: Tesla C1060 IRQ: 201 Video BIOS: ??.??.??.??.?? Card Type: PCI-E DMA Size: 40 bits DMA Mask: 0xffffffffff Bus Location: 09.00.0 Model: Tesla C1060 IRQ: 177 Video BIOS: ??.??.??.??.?? Card Type: PCI-E DMA Size: 40 bits DMA Mask: 0xffffffffff Bus Location: 0b.00.0 Model: Tesla C1060 IRQ: 185 Video BIOS: ??.??.??.??.?? Card Type: PCI-E DMA Size: 40 bits DMA Mask: 0xffffffffff Bus Location: 16.00.0 Model: Tesla C1060 IRQ: 201 Video BIOS: ??.??.??.??.?? 
Card Type: PCI-E DMA Size: 40 bits DMA Mask: 0xffffffffff Bus Location: 18.00.0 We just grabbed the normal installer from their website, though we are a few reves back now. Brock Palen www.umich.edu/~brockp Center for Advanced Computing brockp@umich.edu (734)936-1985 > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From deadline at eadline.org Mon Jul 13 13:07:23 2009 From: deadline at eadline.org (Douglas Eadline) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Obtaining knowledge In-Reply-To: References: Message-ID: <53059.192.168.1.213.1247515643.squirrel@mail.eadline.org> You may want to look at ClusterMonkey.net. Specifically: New to Clusters section: http://www.clustermonkey.net//content/view/91/44/ Cluster Links: http://www.clustermonkey.net//component/option,com_weblinks/Itemid,23/ -- Doug > > Hello, > > > I'm trying learn how to built my first computer cluster using OEM PC > and /or PS3. What do I need to do? I've try searching the subject > online using Google and youtube. But, there's no clear answer to my > question. Can you help me? > > _________________________________________________________________ > Windows Live? SkyDrive?: Get 25 GB of free online storage. > http://windowslive.com/online/skydrive?ocid=TXT_TAGLM_WL_SD_25GB_062009_______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Doug From tjrc at sanger.ac.uk Tue Jul 14 01:20:58 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] nvidia card id? In-Reply-To: References: Message-ID: <2351C061-4496-4B67-AF94-08A3A57325C9@sanger.ac.uk> On 13 Jul 2009, at 7:49 pm, Mark Hahn wrote: >> Does anyone know off hand if there is a way to pull the exact card >> information from an nvidia GPU inside a linux server from linux >> itself? > > well, there's lspci. is that what you meant? it's usually a bit > fuzzy how to match the pci-level id (vendor/device/revision) to > marketing names. but afaik, the data from lspci is fundamental. On Debian-derived systems, such as Ubuntu, there's a utility to download the latest PCI ID database, to give you a fighting chance: /usr/bin/update-pciids It's probably present on other distros too, or something similar at any rate. Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From hahn at mcmaster.ca Tue Jul 14 07:24:13 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] nvidia card id? In-Reply-To: <2351C061-4496-4B67-AF94-08A3A57325C9@sanger.ac.uk> References: <2351C061-4496-4B67-AF94-08A3A57325C9@sanger.ac.uk> Message-ID: > /usr/bin/update-pciids /sbin/update-pciids on RH-ish systems (package pciutils). From krzywicki.pawel at googlemail.com Mon Jul 13 14:45:01 2009 From: krzywicki.pawel at googlemail.com (Pawel Krzywicki) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] nvidia card id? 
In-Reply-To: References: Message-ID: <200907132245.02095.krzywicki.pawel@gmail.com> Monday 13 July 2009 19:14:25 Michael Di Domenico napisa?(a): yes lspci is fine but I thik more detailed info you will get when you have this card properly installed and use glxinfo as root > Does anyone know off hand if there is a way to pull the exact card > information from an nvidia GPU inside a linux server from linux > itself? > > I received an S1070, but linux seems to think it's a S870. The bezel > on the front says S1070, so short of opening the unit, I'm not sure > which to believe. > > yes, i emailed nvidia, they have yet to respond. :( > > thanks > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Pawel Krzywicki From john.hearns at mclaren.com Tue Jul 14 02:00:52 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] nvidia card id? In-Reply-To: <2351C061-4496-4B67-AF94-08A3A57325C9@sanger.ac.uk> References: <2351C061-4496-4B67-AF94-08A3A57325C9@sanger.ac.uk> Message-ID: <68A57CCFD4005646957BD2D18E60667B0C865B2E@milexchmb1.mil.tagmclarengroup.com> > On Debian-derived systems, such as Ubuntu, there's a utility to > download the latest PCI ID database, to give you a fighting chance: > > /usr/bin/update-pciids > > It's probably present on other distros too, or something similar at > any rate On a couple of occasions, when working with very recent Nvidia graphics cards, I have had to download the list direct from the site which gathers this information: http://pciids.sourceforge.net/ I now feel a bit of a fool - the SuSE utility for doing this is /sbin/update-pciids My tips for anyone working with very recent graphics cards - update the pciids, And if you are on SuSE update to the very latest SaX version. You should be able to Use the SuSE One-Click install for the Nvidia drivers. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From jpaugh at advancedclustering.com Mon Jul 6 09:29:15 2009 From: jpaugh at advancedclustering.com (Jim Paugh) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] *FREE* SC09 Education Program's Summer Tutorial Workshops (Ark, OK) In-Reply-To: <20090701224143.DE004E8224@scorpion.oscer.ou.edu> References: <20090701224143.DE004E8224@scorpion.oscer.ou.edu> Message-ID: Came across this announcement and thought I'd pass it along: SUMMARY: *FREE* *FREE* *FREE* *FREE* *FREE* *FREE* *FREE* *FREE* *FREE* *FREE* SC09 Education Program's Summer Tutorial Workshops on Computational Science & Engineering and Parallel Computing http://sc09.sc-education.org/ Please apply to register as soon as possible. And please feel free to share this information with anyone you think might be interested and eligible. DETAILS: The Supercomputing 2009 Education Program's summer tutorial workshop series is hosting a series of *FREE* weeklong summer tutorial workshops this summer, across the US. 
There will be *TWO* workshops held in the southern Great Plains region this August: Aug 2- 8: U Arkansas: Introduction to Computational Thinking Aug 9-15: U Oklahoma: Parallel Programming & Cluster Computing These tutorial workshops will cover not only how to do various kinds of computational science & engineering as well as parallel computing, but also how to teach these topics, and how to use computational methods in your teaching. The tutorial workshops are *FREE* (except you have to pay your own transportation costs to and from the workshop institution), and we'll feed you and house you at no charge. Each tutorial workshop will require a $150 FULLY REFUNDABLE DEPOSIT. (To get your refund, you'll need to attend the workshop EVERY SINGLE DAY, and submit the daily surveys EVERY SINGLE DAY, plus the pre-survey and the post-survey.) The website for these *FREE* tutorial workshops is: http://sc09.sc-education.org/ and then follow the links to the workshop registration page. *BOTH* workshops are *ALMOST FULL!* So if you want to apply to register, it's *VERY IMPORTANT* that you do so *AS SOON AS POSSIBLE.* We would prefer that you apply *RIGHT AWAY* if at all possible. Please bear in mind that you are *applying* for registration, and that applying doesn't guarantee acceptance, although of course we'll try to accept as many people as we have room and budget for. Preference will be given to teaching faculty (or soon-to-be-faculty) who expect to use the workshop content in their own teaching, although historically we have accepted a limited number of others (students, staff etc). Please feel free to forward this e-mail to any relevant faculty, staff etc, not just locally but nationwide. --- Henry Neeman (hneeman@ou.edu) Director, OU Supercomputing Center for Education & Research (OSCER) University of Oklahoma Information Technology Adjunct Assistant Professor, School of Computer Science --- * Ranked 10th in PC Magazine's 2008 Top 20 Wired Campuses * Computerworld 2006 100 Best Places to Work in IT --- -- Jim Paugh Advanced Clustering Technologies 866.802.8222 x310 (Office) 913.643.0299 (Fax) Twitter: @clusteringjim www.advancedclustering.com Ask about the great offers available with nVidia's Tesla products. See what our unique 1U GPU Server can do for you: http://www.advancedclustering.com/products/1x5gpu2.html Make purchasing easier by taking advantage of our newly awarded GSA contract #GS35F0443U. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20090706/5d8e123f/attachment.html From orion at cora.nwra.com Wed Jul 15 09:43:44 2009 From: orion at cora.nwra.com (Orion Poplawski) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] storage server hardware considerations Message-ID: <4A5E0740.7080209@cora.nwra.com> I'm thinking about building a moderately fast network storage server using a striped array of SSD disks. I'm not sure how much such a device would benefit from being dual processor and whether the latest Nehalem chips would be better than the older in the tooth (and less expensive) AMD Opteron offerings. I'm almost certainly going to be limited at the moment by using GigE, but would like to move to Infiniband in future. Thoughts? 
-- Orion Poplawski Technical Manager 303-415-9701 x222 NWRA/CoRA Division FAX: 303-415-9702 3380 Mitchell Lane orion@cora.nwra.com Boulder, CO 80301 http://www.cora.nwra.com From jan.heichler at gmx.net Wed Jul 15 09:48:32 2009 From: jan.heichler at gmx.net (Jan Heichler) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] storage server hardware considerations In-Reply-To: <4A5E0740.7080209@cora.nwra.com> References: <4A5E0740.7080209@cora.nwra.com> Message-ID: <14810385286.20090715184832@gmx.net> Hallo Orion, Mittwoch, 15. Juli 2009, meintest Du: OP> I'm thinking about building a moderately fast network storage server OP> using a striped array of SSD disks. I'm not sure how much such a device OP> would benefit from being dual processor and whether the latest Nehalem OP> chips would be better than the older in the tooth (and less expensive) OP> AMD Opteron offerings. I'm almost certainly going to be limited at the OP> moment by using GigE, but would like to move to Infiniband in future. OP> Thoughts? Can you give a bit more Info? Network Storage Server = NFS? Why SSDs? What speed do you want to reach? In general: RAM does help, CPU power is normally not necessary. Jan -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20090715/773c31c5/attachment.html From landman at scalableinformatics.com Wed Jul 15 10:23:06 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] storage server hardware considerations In-Reply-To: <4A5E0740.7080209@cora.nwra.com> References: <4A5E0740.7080209@cora.nwra.com> Message-ID: <4A5E107A.4050804@scalableinformatics.com> Orion Poplawski wrote: > I'm thinking about building a moderately fast network storage server > using a striped array of SSD disks. I'm not sure how much such a device > would benefit from being dual processor and whether the latest Nehalem > chips would be better than the older in the tooth (and less expensive) > AMD Opteron offerings. I'm almost certainly going to be limited at the > moment by using GigE, but would like to move to Infiniband in future. > Thoughts? > What specifically are you trying to do with this? You can indeed build or buy SSD based servers, but understand they have specific use cases (mostly IOPS/seek bound loads). If this is what your access pattern is like (many short reads/writes) SSDs could be good. If on the other hand, all you wish to do is to push out fast NFS, you really don't need SSDs for that. Some of the folks here on this list can provide NFS servers that provide 1GB/s (achievable/measured) to network clients (c.f. http://scalability.org/?p=1708 ). Since you are rate limited by GbE, this might not be as much of an issue. Rather than looking at the solution, could you describe the problem you need to solve? 
Regards, Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, email: landman@scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From orion at cora.nwra.com Wed Jul 15 10:42:39 2009 From: orion at cora.nwra.com (Orion Poplawski) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] storage server hardware considerations In-Reply-To: <14810385286.20090715184832@gmx.net> References: <4A5E0740.7080209@cora.nwra.com> <14810385286.20090715184832@gmx.net> Message-ID: <4A5E150F.6070206@cora.nwra.com> On 07/15/2009 10:48 AM, Jan Heichler wrote: > Hallo Orion, > > > Mittwoch, 15. Juli 2009, meintest Du: > > > OP> I'm thinking about building a moderately fast network storage server > > OP> using a striped array of SSD disks. I'm not sure how much such a device > > OP> would benefit from being dual processor and whether the latest Nehalem > > OP> chips would be better than the older in the tooth (and less expensive) > > OP> AMD Opteron offerings. I'm almost certainly going to be limited at the > > OP> moment by using GigE, but would like to move to Infiniband in future. > > OP> Thoughts? > > > Can you give a bit more Info? > > > Network Storage Server = NFS? NFS, though I'm open for trying other options. > Why SSDs? > Performance per watt, $? I'm happy to be corrected. > What speed do you want to reach? I'd be happy with saturating 1 GigE ( ~ 100MB/s ?) to start, with some headroom. Don't want to break the bank though for the last bit of performance. > In general: RAM does help, CPU power is normally not necessary. That's my feeling, but wondering how low on the cpu to go. Is a dual (physical) processor system a help? -- Orion Poplawski Technical Manager 303-415-9701 x222 NWRA/CoRA Division FAX: 303-415-9702 3380 Mitchell Lane orion@cora.nwra.com Boulder, CO 80301 http://www.cora.nwra.com From orion at cora.nwra.com Wed Jul 15 10:49:47 2009 From: orion at cora.nwra.com (Orion Poplawski) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] storage server hardware considerations In-Reply-To: <4A5E107A.4050804@scalableinformatics.com> References: <4A5E0740.7080209@cora.nwra.com> <4A5E107A.4050804@scalableinformatics.com> Message-ID: <4A5E16BB.9090900@cora.nwra.com> On 07/15/2009 11:23 AM, Joe Landman wrote: > Orion Poplawski wrote: >> I'm thinking about building a moderately fast network storage server >> using a striped array of SSD disks. I'm not sure how much such a >> device would benefit from being dual processor and whether the latest >> Nehalem chips would be better than the older in the tooth (and less >> expensive) AMD Opteron offerings. I'm almost certainly going to be >> limited at the moment by using GigE, but would like to move to >> Infiniband in future. Thoughts? >> > > What specifically are you trying to do with this? > > Rather than looking at the solution, could you describe the problem you > need to solve? I've been building cheap, big, slow storage servers for our use. However, some of our users are starting to need higher performance IO. We still don't have a lot of money, but I'd like to provide something with a modest amount of storage (at least 500GB) that can at least handle full GigE, perhaps up to DDR Infiniband for future expansion, without breaking the bank. We have only about 10 compute machines so we're pretty small. 
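Before spending anything it is probably worth measuring what the current servers already deliver over a single GigE link. A rough check from one client (the hostname, mount point and file name below are placeholders, and the numbers are only indicative):

  # raw TCP throughput to the server (run "iperf -s" on the server first)
  iperf -c storage-server
  # streaming write over the NFS mount; conv=fdatasync makes dd wait for the data to reach the server
  dd if=/dev/zero of=/mnt/nfs/ddtest bs=1M count=4096 conv=fdatasync
  # remount to drop the client cache, then time a streaming read back
  umount /mnt/nfs && mount /mnt/nfs
  dd if=/mnt/nfs/ddtest of=/dev/null bs=1M
  rm /mnt/nfs/ddtest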
-- Orion Poplawski Technical Manager 303-415-9701 x222 NWRA/CoRA Division FAX: 303-415-9702 3380 Mitchell Lane orion@cora.nwra.com Boulder, CO 80301 http://www.cora.nwra.com From landman at scalableinformatics.com Wed Jul 15 11:02:24 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] storage server hardware considerations In-Reply-To: <4A5E16BB.9090900@cora.nwra.com> References: <4A5E0740.7080209@cora.nwra.com> <4A5E107A.4050804@scalableinformatics.com> <4A5E16BB.9090900@cora.nwra.com> Message-ID: <4A5E19B0.3050409@scalableinformatics.com> Orion Poplawski wrote: >> Rather than looking at the solution, could you describe the problem you >> need to solve? > > I've been building cheap, big, slow storage servers for our use. > However, some of our users are starting to need higher performance IO. > We still don't have a lot of money, but I'd like to provide something > with a modest amount of storage (at least 500GB) that can at least > handle full GigE, perhaps up to DDR Infiniband for future expansion, > without breaking the bank. You can saturate GbE fairly easily (we can). A good RAID10 could get you pretty far here. 500GB is quite small though. Likely you will spend more in chassis and other stuff than in disk. Using our DeltaV as a guide, a 2U boxen with 6TB (12x 500 GB drives) and 4 GbE ports in a RAID10 provides about 250-350 MB/s for NFS. > We have only about 10 compute machines so we're pretty small. One thing you can do is channel bond 2-4 GbE ports on the server, and then you can at least reduce network port contention there. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, email: landman@scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From mmuratet at hudsonalpha.org Tue Jul 14 12:07:00 2009 From: mmuratet at hudsonalpha.org (Michael Muratet) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Missing pvmd for Rpvm In-Reply-To: <24485352.post@talk.nabble.com> References: <24485352.post@talk.nabble.com> Message-ID: Greetings I am trying to install the R rpvm package to run R on a scyld/beowulf cluster. > install.packages('rpvm',dependencies=TRUE) also installing the dependency ?rsprng? trying URL 'http://mira.sunsite.utk.edu/CRAN/src/contrib/rsprng_0.4.tar.gz' Content type 'application/x-gzip' length 36044 bytes (35 Kb) opened URL ================================================== downloaded 35 Kb trying URL 'http://mira.sunsite.utk.edu/CRAN/src/contrib/rpvm_1.0.2.tar.gz' Content type 'application/x-gzip' length 62737 bytes (61 Kb) opened URL ================================================== downloaded 61 Kb * Installing *source* package ?rsprng? ... checking for gcc... gcc checking for C compiler default output file name... a.out checking whether the C compiler works... yes checking whether we are cross compiling... no checking for suffix of executables... checking for suffix of object files... o checking whether we are using the GNU C compiler... yes checking whether gcc accepts -g... yes checking for gcc option to accept ANSI C... none needed Try to find sprng.h ... checking how to run the C preprocessor... gcc -E checking for egrep... grep -E checking for ANSI C header files... yes checking for sys/types.h... yes checking for sys/stat.h... yes checking for stdlib.h... yes checking for string.h... yes checking for memory.h... yes checking for strings.h... 
yes checking for inttypes.h... yes checking for stdint.h... yes checking for unistd.h... yes checking sprng.h usability... no checking sprng.h presence... no checking for sprng.h... no Cannot find sprng 2.0 header file. ERROR: configuration failed for package ?rsprng? * Removing ?/usr/local/lib64/R/library/rsprng? * Installing *source* package ?rpvm? ... checking for gcc... gcc checking for C compiler default output file name... a.out checking whether the C compiler works... yes checking whether we are cross compiling... no checking for suffix of executables... checking for suffix of object files... o checking whether we are using the GNU C compiler... yes checking whether gcc accepts -g... yes checking for gcc option to accept ANSI C... none needed Check if PVM_ROOT is defined... /usr/share/pvm3 Found pvm: /usr/share/pvm3 PVM_ROOT is /usr/share/pvm3 PVM_ARCH is BEOSCYLD checking how to run the C preprocessor... gcc -E checking for egrep... grep -E checking for ANSI C header files... yes checking for sys/types.h... yes checking for sys/stat.h... yes checking for stdlib.h... yes checking for string.h... yes checking for memory.h... yes checking for strings.h... yes checking for inttypes.h... yes checking for stdint.h... yes checking for unistd.h... yes checking pvm3.h usability... no checking pvm3.h presence... no checking for pvm3.h... no Try to find pvm3.h ... Found in /usr/share/pvm3/include checking for main in -lpvm3... no Try to find libpvm3 ... Found in /usr/share/pvm3/lib/BEOSCYLD checking for pvmd... no Cannot find pvmd executable Include it in your path or check your pvm installation. ERROR: configuration failed for package ?rpvm? * Removing ?/usr/local/lib64/R/library/rpvm? The downloaded packages are in ?/tmp/RtmpKJmN3z/downloaded_packages? Updating HTML index of packages in '.Library' Warning messages: 1: In install.packages("rpvm", dependencies = TRUE) : installation of package 'rsprng' had non-zero exit status 2: In install.packages("rpvm", dependencies = TRUE) : installation of package 'rpvm' had non-zero exit status I believe this to be a problem on the cluster and not with R. There are messages in the archives going back a few years regarding problems starting pvm but nothing seems to apply to my case. I've been to the pvm webpage but I don't see anything that's beowulf specific. The pvm libraries are there but the daemon apparently doesn't run unless you start it. Before I start it, can anyone point me to any R- or pvm-specific documentation? Thanks Mike From john.hearns at mclaren.com Wed Jul 15 04:56:42 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Cluster Networking In-Reply-To: <60361246034164@webmail18.yandex.ru> References: <60361246034164@webmail18.yandex.ru> Message-ID: <68A57CCFD4005646957BD2D18E60667B0C866424@milexchmb1.mil.tagmclarengroup.com> > 1. Is there any influence on performance of a NFS-server from the usage > of x32 CPU and OS instead of x64, if all other characteristics of the > system, i.e. amount of RAM, soft-SATA-II RAID 0, Realtek GLAN NIC are > the same? I would say no - but I don't run any 32 bit machines as NFS servers! To be honest, in the HPC world 32 bit CPUs are dead and gone - just get the 64 bit one. Being even more honest, I would look at this Realtek NIC and consider putting in An Intel E1000 based card. Also look closely at network tuning parameters for NFS, Rather than looking at the choice between 32 and 64 bit CPUs. 
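The usual starting points are the number of nfsd threads on the server and the mount options on the clients; the values below are illustrative defaults to experiment from, not recommendations:

  # server side: raise the nfsd thread count from the default 8, e.g. RPCNFSDCOUNT=64
  # (set in /etc/sysconfig/nfs on Red Hat-style systems, /etc/default/nfs-kernel-server on Debian/Ubuntu)
  # client side: larger transfer sizes over TCP
  mount -t nfs -o rsize=32768,wsize=32768,tcp,hard,intr server:/export /mnt/export
  # watch for retransmissions and server-side statistics while testing
  nfsstat -c
  cat /proc/net/rpc/nfsd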
The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From jan.heichler at gmx.net Wed Jul 15 12:31:50 2009 From: jan.heichler at gmx.net (Jan Heichler) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] storage server hardware considerations In-Reply-To: <4A5E150F.6070206@cora.nwra.com> References: <4A5E0740.7080209@cora.nwra.com> <14810385286.20090715184832@gmx.net> <4A5E150F.6070206@cora.nwra.com> Message-ID: <1021421572.20090715213150@gmx.net> Hallo Orion, Mittwoch, 15. Juli 2009, meintest Du: >> Can you give a bit more Info? >> Network Storage Server = NFS? OP> NFS, though I'm open for trying other options. Well...NFS is a good start... if 100 MB/s is enough for your purposes. NFS does not work very well on higher speeds - the protocoll seems to have quite some overhead. >> Why SSDs? >> OP> Performance per watt, $? I'm happy to be corrected. Well. SSD is very expensive - and not necessarily more energy efficient. I would vote for 2.5" SATA drives. With the Supermicro SC216 for example you can have 24 disks. Add a 3ware 9650SE-24 to build a raidset. >> What speed do you want to reach? OP> I'd be happy with saturating 1 GigE ( ~ 100MB/s ?) to start, with some OP> headroom. Don't want to break the bank though for the last bit of OP> performance. To reach 100 MB/s you can probably start with 8+2 disks in a RAID6. That should give you enough to saturate a single 1GE link. >> In general: RAM does help, CPU power is normally not necessary. OP> That's my feeling, but wondering how low on the cpu to go. Is a dual OP> (physical) processor system a help? Well... single socket, dualcore is perfectly fine to saturate a 1GE Link. Jan From jan.heichler at gmx.net Wed Jul 15 12:34:17 2009 From: jan.heichler at gmx.net (Jan Heichler) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] storage server hardware considerations In-Reply-To: <4A5E16BB.9090900@cora.nwra.com> References: <4A5E0740.7080209@cora.nwra.com> <4A5E107A.4050804@scalableinformatics.com> <4A5E16BB.9090900@cora.nwra.com> Message-ID: <15210630244.20090715213417@gmx.net> Hallo Orion, Mittwoch, 15. Juli 2009, meintest Du: OP> On 07/15/2009 11:23 AM, Joe Landman wrote: >> Orion Poplawski wrote: >>> I'm thinking about building a moderately fast network storage server >>> using a striped array of SSD disks. I'm not sure how much such a >>> device would benefit from being dual processor and whether the latest >>> Nehalem chips would be better than the older in the tooth (and less >>> expensive) AMD Opteron offerings. I'm almost certainly going to be >>> limited at the moment by using GigE, but would like to move to >>> Infiniband in future. Thoughts? >>> >> >> What specifically are you trying to do with this? >> >> Rather than looking at the solution, could you describe the problem you >> need to solve? OP> I've been building cheap, big, slow storage servers for our use. OP> However, some of our users are starting to need higher performance IO. OP> We still don't have a lot of money, but I'd like to provide something OP> with a modest amount of storage (at least 500GB) that can at least OP> handle full GigE, perhaps up to DDR Infiniband for future expansion, OP> without breaking the bank. -> see my other mail. 24 3.5" disks on a 3ware 24-port reach up to 700 MB/s write speed. With 2.5" you are probably a bit slower. 
This is local access on the filesystem. Exporting this via NFS over TCP/IP (IP-over-IB for example) should give you about 300 to 400 MB/s. I have no experience with NFS-over-RDMA... maybe somebody else can say something about the performance of that... Jan From hahn at mcmaster.ca Wed Jul 15 18:57:56 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] storage server hardware considerations In-Reply-To: <4A5E16BB.9090900@cora.nwra.com> References: <4A5E0740.7080209@cora.nwra.com> <4A5E107A.4050804@scalableinformatics.com> <4A5E16BB.9090900@cora.nwra.com> Message-ID: >> Rather than looking at the solution, could you describe the problem you >> need to solve? > > I've been building cheap, big, slow storage servers for our use. However, > some of our users are starting to need higher performance IO. We still don't > have a lot of money, but I'd like to provide something with a modest amount > of storage (at least 500GB) that can at least handle full GigE, perhaps up to > DDR Infiniband for future expansion, without breaking the bank. if you can't saturate gigabit with very modest raids, you're doing something wrong. a single ultra-cheap disk these days (seagate 7200.12 500G, $60 or so) will hit close to 135 MB/s on outer tracks and average > 105 over the whole disk. I'm guessing that you're losing performance due to either bad controllers (avoid HW raid on anything that's not fairly recent) or a combination of raid6 and a write-heavy workload... bandwidth is easy; latency is hard. the only time I'd consider flash for storage is if the workload was astonishingly write-dominated. or for a pretty large cluster (for instance, the metadata write load for a a few thousand jobs can _flatten_ an indifferently configured Lustre system, even though the bandwidth is fairly low. From hahn at mcmaster.ca Wed Jul 15 19:16:54 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Cluster Networking In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0C866424@milexchmb1.mil.tagmclarengroup.com> References: <60361246034164@webmail18.yandex.ru> <68A57CCFD4005646957BD2D18E60667B0C866424@milexchmb1.mil.tagmclarengroup.com> Message-ID: >> 1. Is there any influence on performance of a NFS-server from the > usage >> of x32 CPU and OS instead of x64, if all other characteristics of the 64bitness is, at this level, purely a matter of register width, number and address-space size. it might very well be that an NFS server would show an advantage to 64b if configured with a very large pagecache and a read friendly workload. >> system, i.e. amount of RAM, soft-SATA-II RAID 0, Realtek GLAN NIC are >> the same? > > I would say no - but I don't run any 32 bit machines as NFS servers! I can't think of any reason to prefer 64b for NFS servers, except that 32b-only machines tend to be older (and thus limited in memory and IO bandwidth). > Being even more honest, I would look at this Realtek NIC and consider > putting in > An Intel E1000 based card. Also look closely at network tuning I'd be pretty surprised if any Gb nic couldn't manage wire speed. From gerry.creager at tamu.edu Wed Jul 15 20:40:21 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Cluster Networking In-Reply-To: References: <60361246034164@webmail18.yandex.ru> <68A57CCFD4005646957BD2D18E60667B0C866424@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4A5EA125.7060101@tamu.edu> Mark Hahn wrote: >>> 1. 
Is there any influence on performance of a NFS-server from the >> usage >>> of x32 CPU and OS instead of x64, if all other characteristics of the > > 64bitness is, at this level, purely a matter of register width, number and > address-space size. it might very well be that an NFS server would show > an advantage to 64b if configured with a very large pagecache and a read > friendly workload. > >>> system, i.e. amount of RAM, soft-SATA-II RAID 0, Realtek GLAN NIC are >>> the same? >> >> I would say no - but I don't run any 32 bit machines as NFS servers! > > I can't think of any reason to prefer 64b for NFS servers, except that > 32b-only machines tend to be older (and thus limited in memory and IO > bandwidth). > >> Being even more honest, I would look at this Realtek NIC and consider >> putting in >> An Intel E1000 based card. Also look closely at network tuning > > I'd be pretty surprised if any Gb nic couldn't manage wire speed. Then you haven't had my horrible experience with Broadcom 5708's. -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From eugen at leitl.org Wed Jul 15 23:36:58 2009 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] storage server hardware considerations In-Reply-To: <4A5E16BB.9090900@cora.nwra.com> References: <4A5E0740.7080209@cora.nwra.com> <4A5E107A.4050804@scalableinformatics.com> <4A5E16BB.9090900@cora.nwra.com> Message-ID: <20090716063658.GM23524@leitl.org> On Wed, Jul 15, 2009 at 11:49:47AM -0600, Orion Poplawski wrote: > I've been building cheap, big, slow storage servers for our use. > However, some of our users are starting to need higher performance IO. > We still don't have a lot of money, but I'd like to provide something > with a modest amount of storage (at least 500GB) that can at least > handle full GigE, perhaps up to DDR Infiniband for future expansion, > without breaking the bank. If it has to be cheap I'd take e.g. a Sun with 8x 2.5" SATA drives, using 300 GByte WD VelociRaptors as a stripe over mirrors, or 15 krpm SAS drives. Right now you can populate a SuperMicro chassis with 16x SATA 3.5" achieving e.g. a 24 GByte dual-socket Nehalem with 32 TByte raw storage (WD RE4; 4.8 TByte with WD VelociRaptor) for about 6.2 kEUR sans VAT. > We have only about 10 compute machines so we're pretty small. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From prentice at ias.edu Thu Jul 16 07:42:10 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Missing pvmd for Rpvm In-Reply-To: References: <24485352.post@talk.nabble.com> Message-ID: <4A5F3C42.3000900@ias.edu> This is a problem for the rpvm mailing list (if one exists) or the rpvm maintainer. You don't need to start pvmd. The configure script can't find pvmd or libpvm. Try putting the path to pvmd in your PATH and specifying the path to libpvm in LDFLAGS first. -- prentice Michael Muratet wrote: > Greetings > > I am trying to install the R rpvm package to run R on a scyld/beowulf > cluster. > >> install.packages('rpvm',dependencies=TRUE) > also installing the dependency ?rsprng? 
> > trying URL 'http://mira.sunsite.utk.edu/CRAN/src/contrib/rsprng_0.4.tar.gz' > Content type 'application/x-gzip' length 36044 bytes (35 Kb) > opened URL > ================================================== > downloaded 35 Kb > > trying URL 'http://mira.sunsite.utk.edu/CRAN/src/contrib/rpvm_1.0.2.tar.gz' > Content type 'application/x-gzip' length 62737 bytes (61 Kb) > opened URL > ================================================== > downloaded 61 Kb > > * Installing *source* package ?rsprng? ... > checking for gcc... gcc > checking for C compiler default output file name... a.out > checking whether the C compiler works... yes > checking whether we are cross compiling... no > checking for suffix of executables... > checking for suffix of object files... o > checking whether we are using the GNU C compiler... yes > checking whether gcc accepts -g... yes > checking for gcc option to accept ANSI C... none needed > Try to find sprng.h ... > checking how to run the C preprocessor... gcc -E > checking for egrep... grep -E > checking for ANSI C header files... yes > checking for sys/types.h... yes > checking for sys/stat.h... yes > checking for stdlib.h... yes > checking for string.h... yes > checking for memory.h... yes > checking for strings.h... yes > checking for inttypes.h... yes > checking for stdint.h... yes > checking for unistd.h... yes > checking sprng.h usability... no > checking sprng.h presence... no > checking for sprng.h... no > Cannot find sprng 2.0 header file. > ERROR: configuration failed for package ?rsprng? > * Removing ?/usr/local/lib64/R/library/rsprng? > * Installing *source* package ?rpvm? ... > checking for gcc... gcc > checking for C compiler default output file name... a.out > checking whether the C compiler works... yes > checking whether we are cross compiling... no > checking for suffix of executables... > checking for suffix of object files... o > checking whether we are using the GNU C compiler... yes > checking whether gcc accepts -g... yes > checking for gcc option to accept ANSI C... none needed > Check if PVM_ROOT is defined... /usr/share/pvm3 > Found pvm: /usr/share/pvm3 > PVM_ROOT is /usr/share/pvm3 > PVM_ARCH is BEOSCYLD > checking how to run the C preprocessor... gcc -E > checking for egrep... grep -E > checking for ANSI C header files... yes > checking for sys/types.h... yes > checking for sys/stat.h... yes > checking for stdlib.h... yes > checking for string.h... yes > checking for memory.h... yes > checking for strings.h... yes > checking for inttypes.h... yes > checking for stdint.h... yes > checking for unistd.h... yes > checking pvm3.h usability... no > checking pvm3.h presence... no > checking for pvm3.h... no > Try to find pvm3.h ... > Found in /usr/share/pvm3/include > checking for main in -lpvm3... no > Try to find libpvm3 ... > Found in /usr/share/pvm3/lib/BEOSCYLD > checking for pvmd... no > Cannot find pvmd executable > Include it in your path or check your pvm installation. > ERROR: configuration failed for package ?rpvm? > * Removing ?/usr/local/lib64/R/library/rpvm? > > The downloaded packages are in > ?/tmp/RtmpKJmN3z/downloaded_packages? > Updating HTML index of packages in '.Library' > Warning messages: > 1: In install.packages("rpvm", dependencies = TRUE) : > installation of package 'rsprng' had non-zero exit status > 2: In install.packages("rpvm", dependencies = TRUE) : > installation of package 'rpvm' had non-zero exit status > > I believe this to be a problem on the cluster and not with R. 
> > There are messages in the archives going back a few years regarding > problems starting pvm but nothing seems to apply to my case. I've been > to the pvm webpage but I don't see anything that's beowulf specific. The > pvm libraries are there but the daemon apparently doesn't run unless you > start it. Before I start it, can anyone point me to any R- or > pvm-specific documentation? > > Thanks > > Mike > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From jlforrest at berkeley.edu Tue Jul 21 11:56:23 2009 From: jlforrest at berkeley.edu (Jon Forrest) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Approach For Diagnosing Heat Related Failure? Message-ID: <4A660F57.4030108@berkeley.edu> I have a rack full of identical compute nodes. One of them has become heat sensitive. When it's in the warm computer room it crashes. I can't even run memtest from the CentOS DVD for 2 seconds. However, when this node is in my much cooler office everything works fine. All the other nodes are working fine in the computer room. I'm not convinced the problem is actually the memory. Other than opening the node to spray cooling liquid when it's in the warm room, what approach would you use to figure out which component(s) is(are) failing? Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest@berkeley.edu From lindahl at pbm.com Tue Jul 21 13:35:09 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Approach For Diagnosing Heat Related Failure? In-Reply-To: <4A660F57.4030108@berkeley.edu> References: <4A660F57.4030108@berkeley.edu> Message-ID: <20090721203509.GD31988@bx9.net> On Tue, Jul 21, 2009 at 11:56:23AM -0700, Jon Forrest wrote: > Other than opening the node > to spray cooling liquid when it's in the warm > room, what approach would you use to figure out which > component(s) is(are) failing? Swap them with good ones until it doesn't fail anymore? -- g From bill at cse.ucdavis.edu Tue Jul 21 13:42:02 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Approach For Diagnosing Heat Related Failure? In-Reply-To: <4A660F57.4030108@berkeley.edu> References: <4A660F57.4030108@berkeley.edu> Message-ID: <4A66281A.40407@cse.ucdavis.edu> I'd suggest doing a visual inspection. Make sure all fans are not blocked by cables, are spinning. If that looks normal pull the CPU heat sinks and make sure they have good coverage with the heat sink goo, but not so much that it leaks over the edge of the chip. When you put the heat sink back on make sure the heat sink mount works as intended, especially on the (mostly intel?) 4 post system where an unclicked post can result in unevent heat sink pressure. Be careful, fans moving != spinning. I've seen some that just vibrate enough to look like they are spinning at a casual glance and are actually not moving much air and are contributing a fair bit of heat to the system (I.e. very hot to the touch). If that looks normal then I'd start swapping parts till you find the heat sensitive one. From mathog at caltech.edu Tue Jul 21 14:02:41 2009 From: mathog at caltech.edu (David Mathog) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] RE: Approach For Diagnosing Heat Related Failure? 
Message-ID: Jon Forrest wrote: > I have a rack full of identical compute > nodes. One of them has become heat sensitive. > > When it's in the warm computer room it crashes. > I can't even run memtest from the CentOS DVD > for 2 seconds. However, when this node is > in my much cooler office everything works > fine. All the other nodes are working fine > in the computer room. Presumably you have already blown the dust out of it and reseated all the obvious suspect components. If the motherboard has a "shutdown on overheat" option that may now have a value set low enough that it stops the machine in the warmer room. If you didn't explicitly set it to that value then suspect the motherboard battery - change it, reset the BIOS, and all should be well. If the machine has a hardware status monitor in the BIOS check there too for out of range temperatures. (Odd that your machine room is much hotter than your office.) Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From billycrook at gmail.com Tue Jul 21 14:33:46 2009 From: billycrook at gmail.com (Billy Crook) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Approach For Diagnosing Heat Related Failure? In-Reply-To: <4A66281A.40407@cse.ucdavis.edu> References: <4A660F57.4030108@berkeley.edu> <4A66281A.40407@cse.ucdavis.edu> Message-ID: On Tue, Jul 21, 2009 at 15:42, Bill Broadley wrote: > > I'd suggest doing a visual inspection. ?Make sure all fans are not blocked by > cables, are spinning. ?If that looks normal pull the CPU heat sinks and make > sure they have good coverage with the heat sink goo, but not so much that it > leaks over the edge of the chip. ?When you put the heat sink back on make sure > the heat sink mount works as intended, especially on the (mostly intel?) 4 > post system where an unclicked post can result in unevent heat sink pressure. > > Be careful, fans moving != spinning. ?I've seen some that just vibrate enough > to look like they are spinning at a casual glance and are actually not moving > much air and are contributing a fair bit of heat to the system (I.e. very hot > to the touch). Use the thin end of a zip tie to slowly interrupt and stop each fan while it is spinning. The pitch of the sound it makes will make a (very) rough comparison of the RPM, even in a noisy room. It will be obvious if it's turning normally or not. You might find one blowing backwards. Don't forget about double-rotor fans. > If that looks normal then I'd start swapping parts till you find the heat > sensitive one. He might swap his desk with that overheating node to help balance out the heat load... Or use something more intense than Memtest in your office. Try ACT Breakin. Once it's booted all the way, a machine with a heatsink ajar is usually powered off from thermal protection in < 5 seconds. Even in an ice cold room Try swapping it's power supply with another node that doesn't power off. P.S. And please do not spray liquid spray air upside down at hot computer components. From vgregorio at penguincomputing.com Tue Jul 21 13:56:00 2009 From: vgregorio at penguincomputing.com (Victor Gregorio) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Approach For Diagnosing Heat Related Failure? 
In-Reply-To: <4A660F57.4030108@berkeley.edu> References: <4A660F57.4030108@berkeley.edu> Message-ID: <20090721205600.GC32735@olive.penguincomputing.com> Hello Jon, If your system has temperature and fan sensors, you might be able to use lm_sensors to display component temperatures and diagnose fan failures. [root@tesla ~]# sensors-detect # answer all defaults [root@tesla ~]# /etc/init.d/lm_sensors start # load kernel modules [root@tesla ~]# sensors # check sensor stats Hope this helps. Regards, -- Victor Gregorio Penguin Computing On Tue, Jul 21, 2009 at 11:56:23AM -0700, Jon Forrest wrote: > I have a rack full of identical compute > nodes. One of them has become heat sensitive. > > When it's in the warm computer room it crashes. > I can't even run memtest from the CentOS DVD > for 2 seconds. However, when this node is > in my much cooler office everything works > fine. All the other nodes are working fine > in the computer room. > > I'm not convinced the problem is actually > the memory. Other than opening the node > to spray cooling liquid when it's in the warm > room, what approach would you use to figure out which > component(s) is(are) failing? > > Cordially, > -- > Jon Forrest > Research Computing Support > College of Chemistry > 173 Tan Hall > University of California Berkeley > Berkeley, CA > 94720-1460 > 510-643-1032 > jlforrest@berkeley.edu > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From joshua_mora at usa.net Tue Jul 21 15:42:30 2009 From: joshua_mora at usa.net (Joshua mora acosta) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Approach For Diagnosing Heat Related Failure? Message-ID: <222NguwpE0892S01.1248216150@cmsweb01.cms.usa.net> You can run HPL bound to a specific socket maximizing also the memory associated to that socket in order to try to shutdown it because of reaching the "hardware thermal control" due to lack of cooling. On BIOS you can also have HW monitoring to tell you speed of fans and perhaps detect the diff of rpms. You can also force the system to run at "low power" rather than "dynamic" or "performance" and rerun tests and see if it passes. Good luck! Joshua ------ Original Message ------ Received: 04:41 PM CDT, 07/21/2009 From: Billy Crook To: Bill Broadley Cc: Beowulf Mailing List Subject: Re: [Beowulf] Approach For Diagnosing Heat Related Failure? > On Tue, Jul 21, 2009 at 15:42, Bill Broadley wrote: > > > > I'd suggest doing a visual inspection. ?Make sure all fans are not blocked by > > cables, are spinning. ?If that looks normal pull the CPU heat sinks and make > > sure they have good coverage with the heat sink goo, but not so much that it > > leaks over the edge of the chip. ?When you put the heat sink back on make sure > > the heat sink mount works as intended, especially on the (mostly intel?) 4 > > post system where an unclicked post can result in unevent heat sink pressure. > > > > Be careful, fans moving != spinning. ?I've seen some that just vibrate enough > > to look like they are spinning at a casual glance and are actually not moving > > much air and are contributing a fair bit of heat to the system (I.e. very hot > > to the touch). > > Use the thin end of a zip tie to slowly interrupt and stop each fan > while it is spinning. The pitch of the sound it makes will make a > (very) rough comparison of the RPM, even in a noisy room. 
It will be > obvious if it's turning normally or not. You might find one blowing > backwards. Don't forget about double-rotor fans. > > > If that looks normal then I'd start swapping parts till you find the heat > > sensitive one. > > He might swap his desk with that overheating node to help balance out > the heat load... > > Or use something more intense than Memtest in your office. Try ACT > Breakin. Once it's booted all the way, a machine with a heatsink ajar > is usually powered off from thermal protection in < 5 seconds. Even > in an ice cold room > > Try swapping it's power supply with another node that doesn't power off. > > P.S. And please do not spray liquid spray air upside down at hot > computer components. > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From dzaletnev at yandex.ru Tue Jul 21 16:04:28 2009 From: dzaletnev at yandex.ru (Dmitry Zaletnev) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Approach For Diagnosing Heat Related Failure? In-Reply-To: <4A660F57.4030108@berkeley.edu> References: <4A660F57.4030108@berkeley.edu> Message-ID: <279011248217468@webmail42.yandex.ru> Jon, > I have a rack full of identical compute > nodes. One of them has become heat sensitive. > > When it's in the warm computer room it crashes. > I can't even run memtest from the CentOS DVD > for 2 seconds. However, when this node is > in my much cooler office everything works > fine. All the other nodes are working fine > in the computer room. I'd such a problem when the plastic clip wich mount the base ring of CPU cooler was broken and CPU cooler was mounted by the rest 3 clips. When I started to save Virtual Machine compiling OpenFOAM from sources, Ubuntu made shutdown on overheat. > > I'm not convinced the problem is actually > the memory. Other than opening the node > to spray cooling liquid when it's in the warm > room, what approach would you use to figure out which > component(s) is(are) failing? > > Cordially, > -- > Jon Forrest > Research Computing Support > College of Chemistry > 173 Tan Hall > University of California Berkeley > Berkeley, CA > 94720-1460 > 510-643-1032 > jlforrest@berkeley.edu > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Sincerely, Dmitry ??????.?????. ??????? ???? ???-?????? ??? http://mail.yandex.ru/nospam/sign From mdidomenico4 at gmail.com Mon Jul 27 05:14:24 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] solaris disk naming under linux Message-ID: Does anyone happen to have the udev rules for converting 0:0:0:0 disk identifiers under udev into solaris like disk/c0t0d0s0? I could have sworn I saw it on the net somewhere a while back, but I'm unable to google it again... From gmpc at sanger.ac.uk Mon Jul 27 06:23:31 2009 From: gmpc at sanger.ac.uk (Guy Coates) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] solaris disk naming under linux In-Reply-To: References: Message-ID: <4A6DAA53.9000801@sanger.ac.uk> Michael Di Domenico wrote: > Does anyone happen to have the udev rules for converting 0:0:0:0 disk > identifiers under udev into solaris like disk/c0t0d0s0? 
I could have > sworn I saw it on the net somewhere a while back, but I'm unable to > google it again... > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf This one? http://lists.lustre.org/pipermail/linux_hpc_swstack/2008-June/000036.html Cheers, Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From giusy.venezia at gmail.com Sun Jul 26 12:11:56 2009 From: giusy.venezia at gmail.com (Giuseppina Venezia) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Problem with environment variable Message-ID: <39f5f2680907261211p55f85dc0j8bcb88109de4843@mail.gmail.com> Hello, I'am using PVM with two nodes: a master node with Fedora Core 8 called "pippo" and a slave node with ubuntu 8.10 called "pluto". I've installed PVM on both "pippo" and "pluto", and set those environment variables: PVM_ROOT, PVM_ARCH, PVM_RSH and PVM_TMP. in .bashrc and /etc/profile Tha problem in that when I add another host from the master node I got the following error: pvm> add pluto add pluto 0 successful HOST DTID pluto Can't start pvmd Auto-Diagnosing Failed Hosts... pluto... Verifying Local Path to "rsh"... Rsh found in /usr/bin/rsh - O.K. Testing Rsh/Rhosts Access to Host "pluto"... Rsh/Rhosts Access is O.K. Checking O.S. Type (Unix test) on Host "pluto"... Host pluto is Unix-based. Checking $PVM_ROOT on Host "pluto"... The value of the $PVM_ROOT environment variable on pluto is invalid (""). Use the absolute path to the pvm3/ directory. pvm> However, if I try echo $PVM_ROOT on slave node ("pluto") I got /usr/lib/pvm3 Could you help me? Thank you in advance From reuti at staff.uni-marburg.de Mon Jul 27 12:44:00 2009 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Problem with environment variable In-Reply-To: <39f5f2680907261211p55f85dc0j8bcb88109de4843@mail.gmail.com> References: <39f5f2680907261211p55f85dc0j8bcb88109de4843@mail.gmail.com> Message-ID: Hi, Am 26.07.2009 um 21:11 schrieb Giuseppina Venezia: > Hello, > > I'am using PVM with two nodes: a master node with Fedora Core 8 called > "pippo" and a slave node with ubuntu 8.10 called "pluto". > > I've installed PVM on both "pippo" and "pluto", and set those > environment variables: > > PVM_ROOT, PVM_ARCH, PVM_RSH and PVM_TMP. > > in .bashrc and /etc/profile > > Tha problem in that when I add another host from the master node I got > the following error: > > pvm> add pluto > add pluto > 0 successful > HOST DTID > pluto Can't start pvmd > > Auto-Diagnosing Failed Hosts... > pluto... > Verifying Local Path to "rsh"... > Rsh found in /usr/bin/rsh - O.K. > Testing Rsh/Rhosts Access to Host "pluto"... > Rsh/Rhosts Access is O.K. > Checking O.S. Type (Unix test) on Host "pluto"... > Host pluto is Unix-based. > Checking $PVM_ROOT on Host "pluto"... > > The value of the $PVM_ROOT environment > variable on pluto is invalid (""). > Use the absolute path to the pvm3/ directory. 
> > pvm> > > However, if I try echo $PVM_ROOT on slave node ("pluto") I got > > /usr/lib/pvm3 did you add the PATH specification in .bash_profile and/or .profile? This will only work for an interactive login. AFAIR you must add it to .bashrc which will be sourced for a non-interactive login, i.e. during the pvm startup. -- Reuti > Could you help me? > Thank you in advance > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Mon Jul 27 12:47:05 2009 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Problem with environment variable In-Reply-To: <39f5f2680907261211p55f85dc0j8bcb88109de4843@mail.gmail.com> References: <39f5f2680907261211p55f85dc0j8bcb88109de4843@mail.gmail.com> Message-ID: On Sun, 26 Jul 2009, Giuseppina Venezia wrote: > Hello, > > I'am using PVM with two nodes: a master node with Fedora Core 8 called > "pippo" and a slave node with ubuntu 8.10 called "pluto". I've encountered this problem before, but it was a rather long time ago and I don't remember the solution. I think it had to do with the shell and environment passing. If you read man bash (down where it describes logging in vs shell commands) you will note that whether or not a given bash executes /etc/profile depends on whether or not the shell is a "login shell" (yes) or "command shell" (no) or a "shell command" (very much no). rsh hostname command (IIRC) thus bypasses /etc/profile. Passing an environment variable from your source shell is also very painful (that is, almost impossible). These are all reasons I abandoned rsh in favor of ssh -- ssh permits one to pass environment variables if one so desires. Anyway, I'd suggest experimenting with the placement of the setting of the PVM values and/or fixing up your shell so that no matter what kind of shell you run (interactive or non-interactive) they get set. Since I got irritated at the differential execution pathways, I eventually created this as my .bashrc: if [ -f $HOME/.bash_profile ] then . $HOME/.bash_profile fi and put all of the interesting stuff in .bash_profile. That way, at the expense of setting e.g. the prompt when it isn't strictly necessary, I get EXACTLY the same environment for remote shell commands that I have for remote login sessions, right down to the ability to use aliases in the remote commands. Naturally, in my .bash_profile I do have the following stanza: # Critical PVM environment variables PVM_ROOT=/usr/share/pvm3 PVM_RSH=/usr/bin/ssh XPVM_ROOT=/usr/share/pvm3/xpvm export PVM_ROOT PVM_RSH XPVM_ROOT See if this sort of thing helps. If it is, as I suspect, the fact that it is somehow bypassing the expected shell initialization pathway, that should fix it. rgb > > I've installed PVM on both "pippo" and "pluto", and set those > environment variables: > > PVM_ROOT, PVM_ARCH, PVM_RSH and PVM_TMP. > > in .bashrc and /etc/profile > > Tha problem in that when I add another host from the master node I got > the following error: > > pvm> add pluto > add pluto > 0 successful > HOST DTID > pluto Can't start pvmd > > Auto-Diagnosing Failed Hosts... > pluto... > Verifying Local Path to "rsh"... > Rsh found in /usr/bin/rsh - O.K. > Testing Rsh/Rhosts Access to Host "pluto"... > Rsh/Rhosts Access is O.K. > Checking O.S. Type (Unix test) on Host "pluto"... > Host pluto is Unix-based. 
> Checking $PVM_ROOT on Host "pluto"... > > The value of the $PVM_ROOT environment > variable on pluto is invalid (""). > Use the absolute path to the pvm3/ directory. > > pvm> > > However, if I try echo $PVM_ROOT on slave node ("pluto") I got > > /usr/lib/pvm3 > > Could you help me? > Thank you in advance > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From reuti at staff.uni-marburg.de Mon Jul 27 12:47:44 2009 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Problem with environment variable [update] In-Reply-To: <39f5f2680907261211p55f85dc0j8bcb88109de4843@mail.gmail.com> References: <39f5f2680907261211p55f85dc0j8bcb88109de4843@mail.gmail.com> Message-ID: <66AE302F-8399-4B72-A4A5-597A83C6E27C@staff.uni-marburg.de> Hi, Am 26.07.2009 um 21:11 schrieb Giuseppina Venezia: > Hello, > > I'am using PVM with two nodes: a master node with Fedora Core 8 called > "pippo" and a slave node with ubuntu 8.10 called "pluto". > > I've installed PVM on both "pippo" and "pluto", and set those > environment variables: > > PVM_ROOT, PVM_ARCH, PVM_RSH and PVM_TMP. > > in .bashrc and /etc/profile > > Tha problem in that when I add another host from the master node I got > the following error: > > pvm> add pluto > add pluto > 0 successful > HOST DTID > pluto Can't start pvmd > > Auto-Diagnosing Failed Hosts... > pluto... > Verifying Local Path to "rsh"... > Rsh found in /usr/bin/rsh - O.K. > Testing Rsh/Rhosts Access to Host "pluto"... > Rsh/Rhosts Access is O.K. > Checking O.S. Type (Unix test) on Host "pluto"... > Host pluto is Unix-based. > Checking $PVM_ROOT on Host "pluto"... > > The value of the $PVM_ROOT environment > variable on pluto is invalid (""). > Use the absolute path to the pvm3/ directory. > > pvm> > > However, if I try echo $PVM_ROOT on slave node ("pluto") I got > > /usr/lib/pvm3 [update] I meant something like: export PATH=~/pvm3/lib:$PATH === did you add the PATH specification in .bash_profile and/or .profile? This will only work for an interactive login. AFAIR you must add it to .bashrc which will be sourced for a non-interactive login, i.e. during the pvm startup. -- Reuti > Could you help me? > Thank you in advance > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From reuti at staff.uni-marburg.de Tue Jul 28 02:49:53 2009 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Problem with environment variable [update] In-Reply-To: <39f5f2680907261211p55f85dc0j8bcb88109de4843@mail.gmail.com> References: <39f5f2680907261211p55f85dc0j8bcb88109de4843@mail.gmail.com> Message-ID: <2206840B-79A3-4A42-B759-A30AF2670B11@staff.uni-marburg.de> Hi, Am 26.07.2009 um 21:11 schrieb Giuseppina Venezia: > Hello, > > I'am using PVM with two nodes: a master node with Fedora Core 8 called > "pippo" and a slave node with ubuntu 8.10 called "pluto". 
> > I've installed PVM on both "pippo" and "pluto", and set those > environment variables: > > PVM_ROOT, PVM_ARCH, PVM_RSH and PVM_TMP. > > in .bashrc and /etc/profile > > Tha problem in that when I add another host from the master node I got > the following error: > > pvm> add pluto > add pluto > 0 successful > HOST DTID > pluto Can't start pvmd > > Auto-Diagnosing Failed Hosts... > pluto... > Verifying Local Path to "rsh"... > Rsh found in /usr/bin/rsh - O.K. > Testing Rsh/Rhosts Access to Host "pluto"... > Rsh/Rhosts Access is O.K. > Checking O.S. Type (Unix test) on Host "pluto"... > Host pluto is Unix-based. > Checking $PVM_ROOT on Host "pluto"... > > The value of the $PVM_ROOT environment > variable on pluto is invalid (""). > Use the absolute path to the pvm3/ directory. > > pvm> > > However, if I try echo $PVM_ROOT on slave node ("pluto") I got > > /usr/lib/pvm3 [update] I meant something like: export PATH=~/pvm3/lib:$PATH === did you add the PATH specification in .bash_profile and/or .profile? This will only work for an interactive login. AFAIR you must add it to .bashrc which will be sourced for a non-interactive login, i.e. during the pvm startup. -- Reuti > Could you help me? > Thank you in advance > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From jlforrest at berkeley.edu Tue Jul 28 16:54:56 2009 From: jlforrest at berkeley.edu (Jon Forrest) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Resolved - Approach For Diagnosing Heat Related Failure? In-Reply-To: <4A660F57.4030108@berkeley.edu> References: <4A660F57.4030108@berkeley.edu> Message-ID: <4A6F8FD0.4010101@berkeley.edu> It turned out that the cause of my heat related failure that I posted about a couple of weeks ago was indeed bad memory. I did try all the suggestions about making sure the fans and the heat sinks were working properly. The BIOS showed that all temperatures were well within the proper range. This was a strange failure in that memtest was of no use because it itself crashed without showing any errors. The fact that memtest couldn't run wasn't in itself a sign that the problem was due to memory since there could be many reasons why this happens. To the commenter who mentioned the fact that my office is cooler than our computer room - this is a sad fact about the financial state of the Univ. of Calif. these days. Anyway, thanks for all the suggestions. Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest@berkeley.edu From mdidomenico4 at gmail.com Thu Jul 30 11:04:16 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Xorg question Message-ID: Does anyone know if there is a way to get 'glxinfo' to run against the running local Xserver, if i ssh in to a remote workstation? When i SSH in to the workstation i have ForwardX turn on in ssh, so i get the glxinfo from my local workstation. If I unset DISPLAY and reset it to DISPLAY=:0, I get 'cannot connect to DISPLAY' Even if I switch to root, I still get denied. I'm trying to avoid having to walk around to a bunch of workstations just to see if the Accelerated X drivers are installed and running instead of the Mesa drivers. 
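One way to script that check across machines, roughly combining the suggestions in the replies that follow. This is a sketch only: it assumes ssh keys to each workstation, X running on display :0, and that the console user's ~/.Xauthority cookie is readable from your login; the host names are placeholders.

    #!/bin/bash
    # Report which GLX renderer each workstation's console X server is using.
    hosts="ws01 ws02 ws03"

    for h in $hosts; do
        printf '%s: ' "$h"
        ssh "$h" '
            u=$(who | awk "/:0/ { print \$1; exit }")           # console user, if any
            export DISPLAY=:0 XAUTHORITY=/home/$u/.Xauthority   # reuse their X cookie
            glxinfo 2>/dev/null | grep "OpenGL renderer" \
                || echo "no GLX info (X not running or access denied)"
        '
    done

Anything that reports a Mesa or software rasterizer string instead of the vendor driver is a workstation worth visiting.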
From landman at scalableinformatics.com Thu Jul 30 11:21:40 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Xorg question In-Reply-To: References: Message-ID: <4A71E4B4.1020008@scalableinformatics.com> Michael Di Domenico wrote: > I'm trying to avoid having to walk around to a bunch of workstations > just to see if the Accelerated X drivers are installed and running > instead of the Mesa drivers. dmesg | grep -i nvidia on each machine. Then xdpyinfo | grep -i GLX landman@pgda-100:~$ dmesg | grep -i nvidia [ 74.357971] nvidia: module license 'NVIDIA' taints kernel. [ 74.632591] nvidia 0000:06:00.0: PCI INT A -> Link[LNEC] -> GSI 19 (level, low) -> IRQ 19 [ 74.632599] nvidia 0000:06:00.0: setting latency timer to 64 [ 74.658694] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 180.44 Tue Mar 24 05:46:32 PST 2009 landman@pgda-100:~$ xdpyinfo | grep -i GLX GLX NV-GLX -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman@scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From octopos at comp.ufu.br Wed Jul 29 16:09:31 2009 From: octopos at comp.ufu.br (Otavio Augusto) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] MPI can determine which CPU to send a process? In-Reply-To: References: Message-ID: <9b4d0a0a0907291609s17e7c8b0i76f433d090ca8138@mail.gmail.com> Hello, I'm search the best Message Passing implementation to use in the University Beowulf Cluster, and I was wandering if MPI can determine which CPU to send a process. With MPI I can determine the host and the number of process, and with -npernode , the number of process per nodes, but that guaranty that if I put 4 quad cores in some host list, and use -npernode 3, it will execute exactly 1 process per CPU in which host? And I can determine the CPU to send a process, like CPU1 ou CPU0 ? Best Regards. Otavio Augusto. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20090729/d0fd6287/attachment.html From brs at admin.usf.edu Thu Jul 30 08:18:50 2009 From: brs at admin.usf.edu (Smith, Brian) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Fabric design consideration Message-ID: <3E9B990982B6404CAC116FE42F1FC97240F7BCE1@USFMB1.forest.usf.edu> Hi, All, I've been re-evaluating our existing InfiniBand fabric design for our HPC systems since I've been tasked with determining how we will add more systems in the future as more and more researchers opt to add capacity to our central system. We've already gotten to the point where we've used up all available ports on the 144 port SilverStorm 9120 chassis that we have and we need to expand capacity. One option that we've been floating around -- that I'm not particularly fond of, btw -- is to purchase a second chassis and link them together over 24 ports, two per spline. While a good deal of our workload would be ok with 5:1 blocking and 6 hops (3 across each chassis), I've determined that, for the money, we're definitely not getting the best solution. The plan that I've put together involves using the SilverStorm as the core in a spine-leaf design. We'll go ahead and purchase a batch of 24 port QDR switches, two for each rack, to connect our 156 existing nodes (with up to 50 additional on the way). Each leaf will have 6 links back to the spine for 3:1 blocking and 5 hops (2 for the leafs, 3 for the spine). 
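For reference, the 3:1 and 432-node figures fall straight out of the port arithmetic, assuming each 24-port leaf is split 18 ports down and 6 ports up against the 144-port chassis. A sketch of the calculation, not a statement about the actual cabling:

    # Back-of-the-envelope for the proposed spine-leaf fabric
    leaf_ports=24; uplinks=6
    hosts_per_leaf=$((leaf_ports - uplinks))              # 18 host ports per leaf
    echo "blocking: ${hosts_per_leaf}:${uplinks} = $((hosts_per_leaf / uplinks)):1"
    spine_ports=144
    leaves=$((spine_ports / uplinks))                     # 24 leaves off one spine
    echo "max hosts on one spine: $((leaves * hosts_per_leaf))"   # 432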
This will allow us to scale the fabric out to 432 total nodes before having to purchase another spine switch. At that point, half of the six uplinks will go to the first spine, half to the second. In theory, it looks like we can scale this design -- with future plans to migrate to a 288 port chassis -- to quite a large number of nodes. Also, just to address this up front, we have a very generic workload, with a mix of md, abinitio, cfd, fem, blast, rf, etc. If the good folks on this list would be kind enough to give me your input regarding these options or possibly propose a third (or forth) option, I'd very much appreciate it. Thanks in advance, Brian Smith Sr. HPC Systems Administrator IT Research Computing, University of South Florida 4202 E. Fowler Ave. ENB308 Office Phone: +1 813 974-1467 Organization URL: http://rc.usf.edu From trainor at divination.biz Thu Jul 30 08:12:50 2009 From: trainor at divination.biz (Douglas J. Trainor) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] en masse BIOS upgrade issue extremely relevant again Message-ID: <268B1A99-A039-4085-A83D-F9A06A3B6544@divination.biz> "Intel Server Boards in the S3000, S3200, S5000 series, S5400 series, and S5500 series": http://www.theregister.co.uk/2009/07/30/intel_bios_security_bug/ From hahn at mcmaster.ca Thu Jul 30 11:23:42 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Xorg question In-Reply-To: References: Message-ID: > If I unset DISPLAY and reset it to DISPLAY=:0, I get 'cannot connect > to DISPLAY' that just means that the running X server is doing access control. you have to start X without access control or hack it. I'd recommend not having X running on cluster nodes normally (why would you!?!). instead, I start X-requiring programs something like this: xinit /usr/X11R6/bin/xdpyinfo it starts X for nothing but that command, and cleanly shuts down. a more practical example might be: xinit $command -- /usr/X11R6/bin/Xorg -dpms -s off -config xorg.conf > I'm trying to avoid having to walk around to a bunch of workstations > just to see if the Accelerated X drivers are installed and running > instead of the Mesa drivers. surely you can find that out from the X startup logs... From mm at yuhu.biz Fri Jul 31 02:08:47 2009 From: mm at yuhu.biz (Marian Marinov) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] BProc Message-ID: <200907311208.48169.mm@yuhu.biz> Hello list, do you know if this project is still alive ? Or replaced/renamed ? http://bproc.sourceforge.net/ -- Best regards, Marian Marinov From krzywicki.pawel at googlemail.com Thu Jul 30 12:23:53 2009 From: krzywicki.pawel at googlemail.com (Pawel Krzywicki) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Xorg question In-Reply-To: References: Message-ID: <200907302023.53219.krzywicki.pawel@gmail.com> Thursday 30 July 2009 19:23:42 Mark Hahn napisa?(a): if you have configured ssh keys that should be easy. ssh remoteaddress "glxinfo" >> file; cat file|more > > If I unset DISPLAY and reset it to DISPLAY=:0, I get 'cannot connect > > to DISPLAY' > > that just means that the running X server is doing access control. > you have to start X without access control or hack it. I'd recommend > not having X running on cluster nodes normally (why would you!?!). > > instead, I start X-requiring programs something like this: > > xinit /usr/X11R6/bin/xdpyinfo > > it starts X for nothing but that command, and cleanly shuts down. 
> a more practical example might be: > > xinit $command -- /usr/X11R6/bin/Xorg -dpms -s off -config xorg.conf > > > I'm trying to avoid having to walk around to a bunch of workstations > > just to see if the Accelerated X drivers are installed and running > > instead of the Mesa drivers. > > surely you can find that out from the X startup logs... > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Pawel Krzywicki From abhijeetscience at gmail.com Fri Jul 31 09:17:21 2009 From: abhijeetscience at gmail.com (abhijeet panwar) Date: Wed Nov 25 01:08:46 2009 Subject: [Beowulf] Regarding a cluster setup Message-ID: Dear sir/mam, I am making a hybrid cluster using colinux fedora core 10 and an opensource clustering software.I am not able to find a software that might fulfill the needs 1)it must autodiscover the systems i connect to it later without a reboot and add them to the cluster after a minimal client installation on them 2)it must give all nodes to make programs and run rather than a head node and client nodes topology.. all node user must be able to run programs on the cluster being a part of the cluster themselves Please guide me in this matter This is strictly for educational purpose only as I am a student in a college I too wanted to make a software myself but do not know where to begin with and how if there is any reference or guide then please inform .. I will be highly obliged Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20090731/6536a982/attachment.html From becker at scyld.com Fri Jul 31 13:34:05 2009 From: becker at scyld.com (Donald Becker) Date: Wed Nov 25 01:08:47 2009 Subject: [Beowulf] BProc In-Reply-To: <200907311208.48169.mm@yuhu.biz> Message-ID: On Fri, 31 Jul 2009, Marian Marinov wrote: > Hello list, > do you know if this project is still alive ? Or replaced/renamed ? > > http://bproc.sourceforge.net/ That's actually a long-dead branch of BProc. Even when it was current, it had significant flaws, frequently changed interfaces, and never worked reliably with x86_64 clusters. It was started by a former employee who made a complete copy of our internal development servers to his home machine in the hours before he quit without notice, and then used the unpublished development tree and build system to compete against us. He used the justification that we released almost all of our production code under Open Sources licenses soon after releasing our commercial product. While this was a self-serving rationalization on his part, it was the major reason that we stopped doing open publication of our source code as soon as the commercial version was released. Alas, that means you won't be able to download the current BProc from a public web page. Nor many of the other innovative tools that we developed and used to publish. Now most of published contributions are part of other projects. Of course we continue use and improve BProc. Or more accurately a BProc kernel interface, as the code has been re-written several times to match newer kernels and add features. Over time we have made it more scalable, and have added features such as multiple and fail-over masters. 
Our customers still have access to the current source code, but over several web site re-writes even the old web pages for BProc (as well as our other innovative subsystems) have been moved and become unreachable. I really, really wish the situation were different. The people that have worked on BProc in the past nine year since have done a great job in keeping it working in the face of kernel re-writes, using new kernel facilities to simplify its code, and making it reliable with large-scale installations. All while keeping the interface the same so that the much larger infrastructure around it would continue to work. They have done the hard work, even while much more attention and money was spent on LANL's failed-and-abandoned attempt to build clusters around the stolen source code. To end on a positive and technical note, while BProc was a cornerstone in our efficient, single-system-image cluster system, it's not the only way to do things. You can get many of its benefits by without re-implementing it. BProc is based around directed process migration -- a more efficient technique than the common transparent process migration. You can do many cool things with process migration, but with experience we found that the costly parts weren't really the valuable ones. What you really want is the guarantee that running a program *over there* returns the expected results -- the same results as running it *here*. That means more than knowing the command line. You want the same executable, linked with exactly the same library versions in the same order, with the same environment and parameters. You can get that consistency without implementing transparent migration. And if you are willing to give up single-process-space monitoring and control, without even doing migration and thus being dependent on kernel features. You just need to send the right information when you start a remote job. That means finding the current executable on the host system, looking at the link information (essentially running 'ldd' but occasionally doing a partial link) to find the initial libraries, and making sure that those exact versions are installed or cached on the remote system. When you start up the process on the remote system, using the copied environment and command line, you get most of the consistency that BProc offers. People often give "BProc" the credit for light-weight, quick-booting nodes. In reality BProc has little to do with that -- it's role is only process creation, monitoring and control. The real innovation was the ability to dynamically cache, and update when needed, just the elements needed to run a process. (You also need services such as BeoNSS and access to a reference master... the devil is in the details.) That lets you start with almost nothing and incrementally build an environment to support the programs that are to be run. As you can extrapolate from above, "a cornerstone" doesn't mean "the only way to do it". There is much more I could write about benefits, trade-offs, and implementation details. Is there a specific area that you wanted to know about? -- Donald Becker becker@scyld.com Penguin Computing / Scyld Software www.penguincomputing.com www.scyld.com Annapolis MD and San Francisco CA From trainor at presciencetrust.org Fri Jul 31 03:37:02 2009 From: trainor at presciencetrust.org (Douglas J. 
Trainor) Date: Wed Nov 25 01:08:47 2009 Subject: [Beowulf] BProc Message-ID: <423757513.4982.1249036622685.JavaMail.mail@webmail04> XCPU is one: http://xcpu.org or http://xcpu.sourceforge.net/ and on portability, Ionkov & Mirtchovski say: Anything with a socket :) http://mirtchovski.com/p9/xcpu-talk.pdf XCPU presentation from a Plan 9 workshop: http://lsub.org/iwp9/cready/xcpu-madrid.pdf Here is an abstract introduction: XCPU: a new, 9p-based, process management system for clusters and grids Minnich, R. Mirtchovski, A. Los Alamos Nat. Lab, NM; This paper appears in: Cluster Computing, 2006 IEEE International Conference on Publication Date: 25-28 Sept. 2006 On page(s): 1-10 Location: Barcelona, ISSN: 1552-5244 ISBN: 1-4244-0327-8 INSPEC Accession Number: 9464866 Digital Object Identifier: 10.1109/CLUSTR.2006.311843 Current Version Published: 2007-02-20 Abstract Xcpu is a new process management system that is equally at home on clusters and grids. Xcpu provides a process execution service visible to client nodes as a 9p server. It can be presented to users as a file system if that functionality is desired. The Xcpu service builds on our earlier work with the Bproc system. Xcpu differs from traditional remote execution services in several key ways, one of the most important being its use of a push rather than a pull model, in which the binaries are pushed to the nodes by the job starter, rather than pulled from a remote file system such as NFS. Bproc used a proprietary protocol; a process migration model; and a set of kernel modifications to achieve its goals. In contrast, Xcpu uses a well-understood protocol, namely 9p; uses a non-migration model for moving the process to the remote node; and uses totally standard kernels on various operating systems such as Plan 9 and Linux to start, and MacOS and others in development. In this paper, we describe our clustering model; how Bproc implements it and how Xcpu implements a similar, but not identical model. We describe in some detail the structure of the various Xcpu components. Finally, we close with a discussion of Xcpu performance, as measured on several clusters at LANL, including the 1024-node Pink cluster, and the 256-node Blue Steel InfiniBand cluster EXCERPTED FROM :: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?tp=&arnumber=4100349&isnumber=4100333 ==================================== Jul 31, 2009 05:19:36 AM, Marian Marinov wrote: Hello list, do you know if this project is still alive ? Or replaced/renamed ? http://bproc.sourceforge.net/
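Both the staging scheme Becker describes (ship exactly the same executable and libraries, then run with the copied command line) and the Xcpu push model can be crudely approximated with stock tools. A deliberately naive sketch, assuming passwordless ssh to the node and ignoring argument quoting; it is not how either BProc or Xcpu actually works, and all names are placeholders:

    #!/bin/bash
    # Naive "push the binary plus its exact libraries" staging: ldd + rsync + ssh.
    # Usage: pushrun.sh <node> <command> [args...]
    node=$1; shift
    prog=$(command -v "$1"); shift
    stage=/tmp/stage.$$

    # Collect the libraries the local dynamic linker would load right now.
    libs=$(ldd "$prog" | awk '/=> \// { print $3 } $1 ~ /^\// { print $1 }')

    ssh "$node" mkdir -p "$stage"
    rsync -a "$prog" $libs "$node:$stage/"

    # Run the staged copy against the staged libraries
    # (copying the full environment is left out of this sketch).
    ssh "$node" "cd $stage && LD_LIBRARY_PATH=$stage ./$(basename "$prog") $*"

The real systems add caching, consistency checks and scalable process control on top of this, which is where the hard work lies.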