From bb101_at at yahoo.com Fri Oct 1 05:18:45 2004 From: bb101_at at yahoo.com (Brady Bonty) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] HPC Survey Message-ID: <20041001121845.63443.qmail@web20621.mail.yahoo.com> Salutations, My name is Brady Black, I am currently a student enrolled in a High Performance Computing curriculum. As part of my school work and internship, I am gathering information relating to current High Performance Computer installations and their operations. I plan on using this information to better understand the hurdles currently facing High Performance Computing and to provide some insight into solutions of common problems. I would be very appreciative if you would take 5 – 10 minutes out of your busy schedule to answer this 20 question survey. http://www.unc.edu/~bradyb/hpcSurvey.html Please be assured that all information gathered from this survey will remain anonymous unless specific consent is provided. My plan is to use the aggregate data to provide an overview of challenges currently faced by the High Performance Computing industry. If you would like a copy of the aggregate data, please let me know. Thank you, Brady Black bradyb[at]unc[dot]edu __________________________________ Do you Yahoo!? Yahoo! Mail Address AutoComplete - You start. We finish. http://promotions.yahoo.com/new_mail From 050675 at student.unife.it Fri Oct 1 00:59:25 2004 From: 050675 at student.unife.it (050675@student.unife.it) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] raw results Message-ID: <51685.192.84.144.228.1096617565.squirrel@student.unife.it> Hi all, someone in this list (Robert J. Brown, probably), a few months ago asked me to post one or twice a month if I had interesting results with raw ethernet on which I'm making my first level degree. Now I've some interesting results: packets loss have been decreased (even if under 300 bytes payload I experience so many loss, but probably the problem is in the used 32 bits architecture or in the fact that I use a 2.4.x kernel without NAPI, for the moment). For any other payload value (in a range between 300 and 1500 bytes) I've no losses, even with jumbo frames (achieved throughput 111 MB/sec in a point-to-point connection with Gbit ethernet cards). I've also experienced NAPI, but with a kernel 2.4, not 2.6 (next goal). Thank you for your attention and help in these months. Simone Saravalli From rgb at phy.duke.edu Fri Oct 1 06:42:01 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] raw results In-Reply-To: <51685.192.84.144.228.1096617565.squirrel@student.unife.it> Message-ID: On Fri, 1 Oct 2004 050675@student.unife.it wrote: > Hi all, > someone in this list (Robert J. Brown, probably), a few months ago > asked me to post one or twice a month if I had interesting results with > raw ethernet on which I'm making my first level degree. > Now I've some interesting results: packets loss have been decreased (even > if under 300 bytes payload I experience so many loss, but probably the > problem is in the used 32 bits architecture or in the fact that I use a > 2.4.x kernel without NAPI, for the moment). > For any other payload value (in a range between 300 and 1500 bytes) I've > no losses, even with jumbo frames (achieved throughput 111 MB/sec in a > point-to-point connection with Gbit ethernet cards). > I've also experienced NAPI, but with a kernel 2.4, not 2.6 (next goal). > Thank you for your attention and help in these months. Cool! robert >>G<< brown (rgb:-) -- Robert G. 
Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From gerry.creager at tamu.edu Fri Oct 1 07:44:49 2004 From: gerry.creager at tamu.edu (Gerry Creager n5jxs) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] Somewhat OT, but still...: Has anyone seen... Message-ID: <415D6D61.2040707@tamu.edu> problems on Supermicro dual Xeon motherboards/systems with the 2.6 kernels, especially with interrupts and keyboard controllers? I've got a system that will lock up using FC2, and the latest updates for 2.6.8-1.521smp, run fine in the uniproc mode, boot but not allow local keyboard access in 2.6.5-1.358smp and work fine in uniprocessor. I'm thinking it's hardware, but I'm askin' if anyone else has seen something similar... Thanks, Gerry -- Gerry Creager -- gerry.creager@tamu.edu Network Engineering -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578 Pager: 979.228.0173 Office: 903A Eller Bldg, TAMU, College Station, TX 77843 From gerry.creager at tamu.edu Fri Oct 1 07:23:34 2004 From: gerry.creager at tamu.edu (Gerry Creager n5jxs) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] raw results In-Reply-To: References: Message-ID: <415D6866.6090309@tamu.edu> One note: Please review RFC2544 regarding packet loss and mitigation for small packets. We tested a number of switches about 18 months ago using an Anritsu MD1230 automated tester, and saw a lot of packet loss in switches that were not crafted for small packets. Something to consider. There are manufacturers who have engineered for both jumbo and small packets. The tuning for this is not trivial: the problems for small packets are not well translated to jumbo frames. As packet size decreases, overhead increases and the packet transmission rate goes 'way up. Jumbo frames may adversely impact switch fabric memory if you're testing store-and-forward devices not designed with sufficient memory for jumbos originally. Option 'B' is fragmentation, removing the benefit of jumbos immediately. Thanks, all the same, for posting your results. We're always interested in independent reports and independent methods! Regards, Gerry Robert G. Brown wrote: > On Fri, 1 Oct 2004 050675@student.unife.it wrote: > > >>Hi all, >> someone in this list (Robert J. Brown, probably), a few months ago >>asked me to post one or twice a month if I had interesting results with >>raw ethernet on which I'm making my first level degree. >>Now I've some interesting results: packets loss have been decreased (even >>if under 300 bytes payload I experience so many loss, but probably the >>problem is in the used 32 bits architecture or in the fact that I use a >>2.4.x kernel without NAPI, for the moment). >>For any other payload value (in a range between 300 and 1500 bytes) I've >>no losses, even with jumbo frames (achieved throughput 111 MB/sec in a >>point-to-point connection with Gbit ethernet cards). >>I've also experienced NAPI, but with a kernel 2.4, not 2.6 (next goal). >>Thank you for your attention and help in these months. > > > Cool! 
> > robert >>G<< brown (rgb:-) > -- Gerry Creager -- gerry.creager@tamu.edu Network Engineering -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578 Pager: 979.228.0173 Office: 903A Eller Bldg, TAMU, College Station, TX 77843 From jrajiv at hclinsys.com Mon Oct 4 04:43:33 2004 From: jrajiv at hclinsys.com (Rajiv) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] Dual Boot in Master and Client Message-ID: <04b001c4aa07$60aab630$39140897@PMORND> Dear All, I would like to have dual boot - Windows and Linux in master and all clients. In which beowulf package this is possible? Regards, Rajiv -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041004/3a0d3bb9/attachment.html From mphil39 at hotmail.com Mon Oct 4 12:23:38 2004 From: mphil39 at hotmail.com (Matt Phillips) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] How to find a swapped out, runnable process? Message-ID: I am running RH9 (2.4.20-9SGI_XFS_1.2.0smp) on a 16-node cluster. I noticed the load on the I/O node to be consistently high after one of the clients crashed during rsync. I did vmstat and found that 1 or more process are always in the runnable but swapped out queue.. Here's a sample output of vmstat procs memory swap io system cpu r b w swpd free buff cache si so bi bo in cs us sy id 2 0 1 30120 9508 176 1609060 0 0 1 1 0 2 0 0 0 0 0 1 30120 9508 176 1609060 0 0 0 0 118 237 0 3 97 0 0 2 30120 9508 176 1609060 0 0 0 1064 252 670 0 1 99 0 0 1 30120 9508 176 1609060 0 0 0 36 156 330 0 0 100 0 0 1 30120 9508 176 1609060 0 0 0 268 157 305 0 0 100 0 0 1 30120 9508 176 1609060 0 0 0 8 129 245 0 0 100 As you can see, there is always one process in procs/w queue.. How do I find which process is this? I tried various combos of ps (like looking at wchan, stat outputs etc, variations of top).. but ps/top only show 1-2 process in the runnable queue and doesnt indicated if they are swapped. Maybe I am reading the man pages incorrectly. Anyone has ideas how I can catch this errant process? TIA, Matt _________________________________________________________________ On the road to retirement? Check out MSN Life Events for advice on how to get there! http://lifeevents.msn.com/category.aspx?cid=Retirement From mikee at mikee.ath.cx Mon Oct 4 13:40:42 2004 From: mikee at mikee.ath.cx (Mike) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] OT: effective amount of data through gigabit ether? Message-ID: <20041004204042.GQ7153@mikee.ath.cx> I know this is off topic, but I've not found an answer anywhere. On one IBM doc it says the effective throughput for 10Mb/s is 5.7GB/hour, 100Mb/s is 17.6GB/hour, but only lists TBD for 1000MB/s. Does anyone know what this effective number is? This is for calculating how long backups should take through my backup network. (I'm not interested in how long it takes to read/write the disk, just the network throughput.) Mike From agrajag at dragaera.net Mon Oct 4 14:23:05 2004 From: agrajag at dragaera.net (Sean Dilda) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] OT: effective amount of data through gigabit ether? In-Reply-To: <20041004204042.GQ7153@mikee.ath.cx> References: <20041004204042.GQ7153@mikee.ath.cx> Message-ID: <1096924985.4303.37.camel@pel> On Mon, 2004-10-04 at 16:40, Mike wrote: > I know this is off topic, but I've not found an answer anywhere. > On one IBM doc it says the effective throughput for 10Mb/s is > 5.7GB/hour, 100Mb/s is 17.6GB/hour, but only lists TBD for 1000MB/s. 
> Does anyone know what this effective number is? This is for > calculating how long backups should take through my backup network. > Those are interesting numbers. I calculate the peak numbers to be: 10MB/s - 4.19GB/hour (less than your IBM number) 100MB/s - 41.9GB/hour (way more than your IBM number) 1000MB/s - 419GB/hour In reality, the answer depends on your hardware. With my setup I've pushed 114MB/s over gigabit for an excess of ten minutes, which tends to average out all the bursts. If I took that out, it'd come to about 400GB/hour. > (I'm not interested in how long it takes to read/write the disk, > just the network throughput.) See now, that's the trick. Gigabit maxes out around 119MB/s. I've not tended to see disks that can actually preform that well (maybe with sequential data, but not with random data). From mwill at penguincomputing.com Mon Oct 4 14:13:57 2004 From: mwill at penguincomputing.com (Michael Will) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] OT: effective amount of data through gigabit ether? In-Reply-To: <20041004204042.GQ7153@mikee.ath.cx> References: <20041004204042.GQ7153@mikee.ath.cx> Message-ID: <200410041413.57525.mwill@penguincomputing.com> On Monday 04 October 2004 01:40 pm, Mike wrote: > I know this is off topic, but I've not found an answer anywhere. > On one IBM doc it says the effective throughput for 10Mb/s is > 5.7GB/hour, 100Mb/s is 17.6GB/hour, but only lists TBD for 1000MB/s. I would assume 900Mb/s as an optimistic best case throughput for the GigE, which would be about 395GB/hour. 17.6GB/hour seems like a really low estimate, that would be only about 40Mb/s effective transfer rate over an 100Mb/s link? Maybe that number is really measuring the tape writing speed instead? Michael Will > Does anyone know what this effective number is? This is for > calculating how long backups should take through my backup network. > > (I'm not interested in how long it takes to read/write the disk, > just the network throughput.) > > Mike > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Michael Will, Linux Sales Engineer NEWS: We have moved to a larger iceberg :-) NEWS: 300 California St., San Francisco, CA. Tel: 415-954-2822 Toll Free: 888-PENGUIN Fax: 415-954-2899 www.penguincomputing.com From pa_bosje at yahoo.co.uk Mon Oct 4 13:47:09 2004 From: pa_bosje at yahoo.co.uk (Patricia) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] myrinet (scali) or ethernet Message-ID: <20041004204709.89858.qmail@web25402.mail.ukl.yahoo.com> Hi People, I am user of two clusters: One runs under myrinet and the other under scali. In both cases I installed my software to run under each of them (but not ethernet). All I want to know is how to check whether my parallel jobs are indeed running under myrinet (scali) or ethernet. I have this question because I have observed a strong decay in the performance after a power outage. thanks for any input! Patricia ___________________________________________________________ALL-NEW Yahoo! Messenger - all new features - even more fun! 
http://uk.messenger.yahoo.com From nican at nsc.liu.se Mon Oct 4 13:45:33 2004 From: nican at nsc.liu.se (Niclas Andersson) Date: Wed Nov 25 01:03:26 2009 Subject: [Beowulf] Call for Participation - LCSC and NGN Message-ID: CALL FOR PARTICIPATION National Supercomputer Centre in Sweden (NSC) and Norwegian High Performance Computing Consortium (NOTUR) welcome you to participate in 5th Annual Workshop on Linux Clusters For Super Computing (LCSC) and workshop on Nordic Grid Neighbourhood (NGN) 18-20 October, 2004 Hosted by National Supercomputer Centre Linkoping University, SWEDEN http://www.nsc.liu.se/lcsc The LCSC workshop are brimful of knowledgeable speakers giving exciting talks about Linux clusters and distributed applications requiring vast computational resources. Just a few samples: - LCSC Keynote: Cluster Computing - You've come a long way in a short time Jack Dongarra, University of Tennessee - Application Performance on High-End and Commodity-class Computers Martyn Guest, CLRC Daresbury Laboratory - The BlueGene/L Supercomputer and LOFAR/LOIS Bruce Elmegreen, IBM Watson Research Center - MPI Micro-benchmarks: Misleading and Dangerous Greg Lindahl, Pathscale Inc. and many more. In the NGN workshop we gather speakers from the Nordic countries, Baltic states and northwest Russia to talk about Grids and efforts made in the field of Grid technology. There will be presentations of applications, Grid middleware and national initiatives as well as industrial solutions. NGN is supported by the Nordplus Neighbour program of the Research Council of Norway. Additionally, during these days there will be valuable vendor presentations, exciting exhibits, instructive tutorials, seminars and several other meetings. For more information and registration: http://www.nsc.liu.se/lcsc From wathey at salk.edu Mon Oct 4 15:19:37 2004 From: wathey at salk.edu (Jack Wathey) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] ammonite In-Reply-To: <20040927235904.GA21014@piskorski.com> References: <20040927182134.GA23662@piskorski.com> <20040927235904.GA21014@piskorski.com> Message-ID: The ammonite cluster can now be seen in pictures and words at http://jessen.ch/ammonite/ thanks to the generosity and skill of Per Jessen, and thanks to Tom Bartol, who not only took the photos, but also helped bring ammonite to life. Ammonite is a 200-processor cluster of bare diskless motherboards. Best wishes, Jack From James.P.Lux at jpl.nasa.gov Mon Oct 4 16:11:30 2004 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] ammonite In-Reply-To: References: <20040927235904.GA21014@piskorski.com> <20040927182134.GA23662@piskorski.com> <20040927235904.GA21014@piskorski.com> Message-ID: <5.2.0.9.2.20041004160726.018615d0@mail.jpl.nasa.gov> At 03:19 PM 10/4/2004 -0700, Jack Wathey wrote: >http://jessen.ch/ammonite/ Very nice.. particularly the efforts you put into the cooling air. The comment about flow against a pressure drop is very sound. I realize it's a time thing, but had you considered removing the fans from the power supplies? What's a rough order of magnitude cost on the blower/VFD? Does this VFD have a "servo" input that could be used to automatically change blower speed in response to temperature? (some VFDs have an analog input that can be used to set up a speed = linear function of input voltage, and by cleverly setting the gain and offset, you can do quite nicely) James Lux, P.E. 
Spacecraft Radio Frequency Subsystems Flight Telecommunications Systems Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From brett at nssl.noaa.gov Mon Oct 4 16:24:48 2004 From: brett at nssl.noaa.gov (Brett Morrow) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Oklahoma Supercomputing Symposium 2004 Message-ID: <4161DBC0.8050403@nssl.noaa.gov> Anyone else out there attending this event? ------------------------------------------------------------------------------------------------------------------------------------------ Join us for the Oklahoma Supercomputing Symposium 2004, Wed Oct 6 - Thu Oct 7, here at OU (CCE Forum). To register for the Symposium, go to http://symposium2004.oscer.ou.edu/ and follow the links. Some 400 people have registered for the Symposium, from 38 academic institutions, 40 companies, 18 government agencies and 2 non- governmental organizations, in 17 states and Canadian provinces. The Symposium is free, with meals provided, and it's a great way to meet leaders, potential collaborators, colleagues, and potential future employers and employees, from academia, government and industry. Our speaker list includes: * Sangtae Kim, new Division Director, Shared Cyberinfrastructure Division, Director for Computer & Information Science & Engineering, National Science Foundation * S. Ramakrishnan, Director, Center for Development of Advanced Computing, India * Stephen Wheat, Principal Scientist, Intel Corp * Joerg Schwartz, Senior Program Manager, Sun Labs * Steve Modica, Principal Engineer, SGI * Ian Lumb, Grid Solutions Manager, Platform Computing Inc. * Kurt Snodgrass, Vice Chancellor, Information Technology and Telecommunications, Oklahoma State Regents for Higher Education * Mark Musser, Senior Solutions Architect, Oracle Corporation * Viswa Sharma, Chief Technical Officer, CorEdge Networks * Anil Srivastava, Executive Chairman & Chief Strategic Officer, AcrossWorld Communications * Krzysztof Kuczera, Associate Professor, Department of Chemistry, University of Kansas * Ed Seidel, Director, Center for Computation & Technology, Louisiana State University * Mary Fran Yafchak, IT Program Coordinator, Southeastern Universities Research Association * Art Vandenberg, Director of Advanced Campus Services, Georgia State University * Dennis Aebersold, Chief Information Officer, University of Oklahoma * Amy Apon, Associate Professor of Computer Science, University of Arkansas * Richard Braley, Professor & Chair, Department of Technology, Cameron University * Paul Gray, Assistant Professor, Department of Computer Science, University of Northern Iowa * John Matrow, System Administrator/Trainer, High Performance Computing Center, Wichita State University We'll also have a vendor exposition, where you'll have an opportunity to learn about existing and emerging HPC technologies. Also, if you know of any students -- grad and undergrad -- who might be interested in the Symposium, this is a great opportunity to introduce them to conferences, especially because it's free. Our academic sponsors include Oklahoma EPSCoR, the Oklahoma Chamber of Commerce, and the OU Department of Information Technology, the Ou Vice President for Research, and the OU Supercomputing Center for Education & Research (OSCER). And if there are colleagues or students that you think might be interested, please forward this note to them. 
-- Brett Morrow, NSSL/SPC Alternate Program Manager INDUS Corporation National Severe Storms Laboratory (405) 366-0515 Brett.Morrow@noaa.gov http://www.induscorp.com From redboots at ufl.edu Mon Oct 4 15:31:31 2004 From: redboots at ufl.edu (JOHNSON,PAUL C) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] ethernet switch, dhcp question Message-ID: <265814864.1096929091119.JavaMail.osg@osgjas04.cns.ufl.edu> All: Im fairly new to beowulf clusters so please excuse the question if it is trivial. Ive installed mpich on several computers and have run several programs but the performance seems a little slow. All the computers in my lab are connected directly to the campus network. Would I see an increase in performance if I instead had slaves connected through a switch in my room connected to a master computer using dhcp to assign ip's? Thanks for any help, Paul -- JOHNSON,PAUL C From tmattox at gmail.com Mon Oct 4 16:32:11 2004 From: tmattox at gmail.com (Tim Mattox) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Dual Boot in Master and Client In-Reply-To: <04b001c4aa07$60aab630$39140897@PMORND> References: <04b001c4aa07$60aab630$39140897@PMORND> Message-ID: Hi Rajiv, I would think you could do this with Warewulf. http://warewulf-cluster.org/ Just make the BIOS on each node first attempt to boot with PXE, and upon PXE failure, boot from a locally installed Windows on the node's hard drive. To switch from Linux to Windows, turn off the dhcpd server on the master, and reboot the nodes. They should then come up in Windows. To switch back, you would turn on the dhcpd server on the master, and then using some "unknown-to-me" windows utility to remotely reboot the nodes, which should then come back up into Linux via the PXE+ramdisk booting with Warewulf. As for making the master dual boot, that is up to your local Linux guru to configure LILO or Grub for dual boot. I don't do Windows, so I can't help you there. Other Beowulf methods for diskless nodes should also work similarly. ----- Original Message ----- From: Rajiv Date: Mon, 4 Oct 2004 17:13:33 +0530 Subject: [Beowulf] Dual Boot in Master and Client To: beowulf@beowulf.org Dear All, I would like to have dual boot - Windows and Linux in master and all clients. In which beowulf package this is possible? Regards, Rajiv -- Tim Mattox - tmattox@gmail.com - http://homepage.mac.com/tmattox/ From wathey at salk.edu Mon Oct 4 16:37:15 2004 From: wathey at salk.edu (Jack Wathey) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] ammonite In-Reply-To: <5.2.0.9.2.20041004160726.018615d0@mail.jpl.nasa.gov> References: <20040927235904.GA21014@piskorski.com> <20040927182134.GA23662@piskorski.com> <20040927235904.GA21014@piskorski.com> <5.2.0.9.2.20041004160726.018615d0@mail.jpl.nasa.gov> Message-ID: On Mon, 4 Oct 2004, Jim Lux wrote: > At 03:19 PM 10/4/2004 -0700, Jack Wathey wrote: >> http://jessen.ch/ammonite/ > > > Very nice.. particularly the efforts you put into the cooling air. The > comment about flow against a pressure drop is very sound. > > I realize it's a time thing, but had you considered removing the fans from > the power supplies? The linear flow rate through the whole rack is typically about 120 to 200 fpm, which is sigificantly less than the linear flow rate through a PS. If the PS fan was there *only* to cool the mid-tower box it was meant to go into, then removing it would do no harm. But a PS generates a fair amount of heat of its own, so I left the fans. And as you say, it would have taken time. 
> > What's a rough order of magnitude cost on the blower/VFD? Does this VFD have > a "servo" input that could be used to automatically change blower speed in > response to temperature? (some VFDs have an analog input that can be used to > set up a speed = linear function of input voltage, and by cleverly setting > the gain and offset, you can do quite nicely) The blower was about $1300, the Teco inverter about $1000. The inverter has an rs232 interface for computer control, which I don't use at present, but hope to some day. It also has the analog control feature you describe. From hahn at physics.mcmaster.ca Mon Oct 4 21:53:04 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] ethernet switch, dhcp question In-Reply-To: <265814864.1096929091119.JavaMail.osg@osgjas04.cns.ufl.edu> Message-ID: > run several programs but the performance seems a little slow. All > the computers in my lab are connected directly to the campus > network. Would I see an increase in performance if I instead had > slaves connected through a switch in my room connected to a master > computer using dhcp to assign ip's? possibly - I think you should ssh to your slaves and look at /proc/loadavg while running the MPI program. (actually, I usually run "vmstat 1" on slaves, since it aggregates lots of other potentially valuable information.) if your network is a bottleneck, slaves will be not-fully-busy. if your campus network is 10 or 100bT or not full-duplex, that's very likely the case. if your campus net is gigabit, then I would be surprised to see much improvement by using a local switch (assuming your lab machines are plugged into the same campus-owned switch). if your lab machines are not all equivalent in speed, or if your MPI problem is not well-balanced, I'd expect to see some nodes busy and others not. similarly, if there are pesky users running netscape on some nodes, that's probably going to hurt (assuming your code is fairly tight-coupled.) regards, mark hahn. From hahn at physics.mcmaster.ca Mon Oct 4 21:56:38 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Dual Boot in Master and Client In-Reply-To: Message-ID: > I would think you could do this with Warewulf. I didn't look, but I suspect warewulf uses all the usual open-source tools. > To switch from Linux to Windows, turn off the dhcpd server on if warewulf uses pxelinux, you can much more nicely configure particular nodes to boot with particular default configs, including different kernels, windows, etc by providing per-node pxelinux.conf/ files. > To switch back, you would turn on the dhcpd server on the master, > and then using some "unknown-to-me" windows utility to remotely I suspect that cygwin and ssh could do this nicely. but being the blunt-object sort of guy, I'd rather reset the windows machines remotely via IPMI-over-lan ;) From eugen at leitl.org Tue Oct 5 08:04:50 2004 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Cray XD1 out Message-ID: <20041005150450.GW1457@leitl.org> http://investors.cray.com/phoenix.zhtml?c=98390&p=irol-newsArticle&ID=622736&highlight= Cray Announces General Availability of the Cray XD1 Opteron/Linux-based Supercomputer SEATTLE--(BUSINESS WIRE)--Oct. 4, 2004--Global supercomputer leader Cray Inc. (Nasdaq:CRAY) today announced the general availability of the new Cray XD1(TM) supercomputer, an Opteron/Linux-based system priced from under $100,000 to about $2 million (U.S. 
list) that handily outperforms similarly priced Linux clusters. The company also announced the United States Department of Agriculture Forest Service is a Cray XD1 customer, which adds to an impressive list of early users, including the Ohio Supercomputer Center, the Pacific Northwest National Laboratory (PNNL), Germany's Helmut Schmidt University and the SAHA Institute of Nuclear Physics (Calcutta, India). "Tracking the evolving chemical composition of a smoke plume produces a task so computationally intense that we assumed we would not be able to afford any computer capable of performing it," said Bryce Nordgren, Physical Scientist with the Forest Service's Fire Science Lab. "Reviewing the test case results from Cray restored our hope that we would be able to perform a scientifically meaningful simulation on our budget. We were particularly impressed with the Cray XD1's awesome scalability on this challenging interdisciplinary problem." The Cray XD1 supercomputer is ideal for the special needs of high-performance computing (HPC) applications used by government and academia, as well as computer-aided engineering (CAE) in the aerospace, automotive and marine industries; weather forecasting and climate modeling; petroleum exploration; financial modeling; and life sciences research. "We evaluated many proposals from leading IT companies and decided on Cray because of the Cray XD1 system's excellent price-to-performance ratio," said Professor Hendrik Rothe, chair of Helmut Schmidt University's Laboratory for Measurement and Information Technology. According to Rich Partridge, Enterprise Systems analyst with D.H. Brown Associates, "With the XD1, Cray leverages its strong heritage to bring highly parallel, affordable supercomputing to a broad market of industrial, government and academic users. The Cray XD1 is not merely an Opteron/Linux parallel system; it is a 'Cray,' and that makes all the difference. This is a true supercomputer, with balanced performance that commodity designs just cannot achieve." About the Cray XD1 Supercomputer The Cray XD1 features the direct connect processor (DCP) architecture, which removes PCI bottlenecks and memory contention to deliver superior sustained performance. According to the HPC Challenge benchmarks, the Cray XD1 has the lowest latency of any HPC system, with MPI latency of 1.8 microseconds and random ring latency of 1.3 microseconds. Tests conducted by the Ohio Supercomputer Center show that the Cray XD1 ships messages with four times lower MPI latency than common cluster interconnects such as Infiniband, Quadrics or Myrinet, and 30 times lower than Gigabit Ethernet employed in lowest-cost clusters. The Cray XD1's interconnect delivers twice the bandwidth of 4X Infiniband for messages up to 1 KB and 60 percent higher throughput for very large messages. The Linux/Opteron system runs x86 32/64 bit codes. Field programmable gate arrays (FPGAs) are available to accelerate applications, and the Active Manager subsystem provides single system command and control and high availability features. A 3VU (5.25") chassis provides 12 compute processors, 58 peak gigaflops, 96 GB/second aggregate switching capacity, 1.8-microsecond MPI interprocessor latency, 84 GB maximum memory and 1.5 TB maximum disk storage. A 12-chassis rack provides 144 compute processors, 691 peak gigaflops, 1TB/second aggregate switching capacity, 2 microsecond MPI interprocessor latency, 922 GB/second aggregate memory bandwidth, 1 TB maximum memory and 18 TB maximum disk storage. 
About Cray Inc. The world's leading supercomputer company, Cray Inc. pioneered high-performance computing with the introduction of the Cray-1 in 1976. The only company dedicated to meeting the specific needs of HPC users, Cray designs and manufactures supercomputers used by government, industry and academia worldwide for applications ranging from scientific research to product design, testing to manufacturing. Cray's diverse product portfolio delivers superior performance, scalability and reliability to the entire HPC market, from the high-end capability user to the department workgroup. For more information, go to www.cray.com. Safe Harbor Statement This press release contains forward-looking statements. There are certain factors that could cause Cray's execution plans to differ materially from those anticipated by the statements above. These include the successful porting of application programs to Cray systems and general economic and market conditions. For a discussion of these and other risks, see "Factors That Could Affect Future Results" in Cray's most recent Quarterly Report on Form 10-Q filed with the SEC. Cray is a registered trademark, and Cray XD1 is a trademark, of Cray Inc. All other trademarks are the property of their respective owners. CONTACT: Cray Inc. Victor Chynoweth, 206-701-2280 (Investors) victorc@cray.com or Steve Conway, 651-592-7441 (Media) sttico@aol.com SOURCE: Cray Inc. "Safe Harbor" Statement under the Private Securities Litigation Reform Act of 1995: Statements in this press release regarding Cray Inc.'s business which are not historical facts are "forward-looking statements" that involve risks and uncertainties. For a discussion of such risks and uncertainties, which could cause actual results to differ from those contained in the forward-looking statements, see "Risk Factors" in the Company's Annual Report or Form 10-K for the most recently ended fiscal year. -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041005/dffedbee/attachment.bin From kc7rad at radstream.com Mon Oct 4 20:04:45 2004 From: kc7rad at radstream.com (Ken Linder (kc7rad)) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] ethernet switch, dhcp question References: <265814864.1096929091119.JavaMail.osg@osgjas04.cns.ufl.edu> Message-ID: <00a101c4aa88$11843e10$4f32a8c0@kc7rad> Paul, I don't think the DHCP has much to do with mpich performance... I think the only additional overhead it adds is when the node computer first comes on-line. It just sends a request to the DHCP server for an IP address (and gateway, netmask, etc... I think :-) Now, what WILL affect your performance is using your campus network. You are relying on the cable and infrastructure of that network. Your cluster is also competing for resources with all other network users, at the switch level. One could argue that this delay is negligable but if you add all the delays that may exist in a small public network, it could be considerable. I recommend you spend a little money and get yourself a switch. I just saw an 8-port on e-bay for $13. I suggest you make your own cables. 
For me anyway, it is an almost cathartic release for me to build my own cables. :-) With your own switch, you control the traffic. If you do this and response time is still slow, at least you have your own network to analyze. Ken www.radstream.com ----- Original Message ----- From: "JOHNSON,PAUL C" To: Sent: Monday, October 04, 2004 4:31 PM Subject: [Beowulf] ethernet switch, dhcp question > All: > > Im fairly new to beowulf clusters so please excuse the question if > it is trivial. Ive installed mpich on several computers and have > run several programs but the performance seems a little slow. All > the computers in my lab are connected directly to the campus > network. Would I see an increase in performance if I instead had > slaves connected through a switch in my room connected to a master > computer using dhcp to assign ip's? > Thanks for any help, > Paul > > -- > JOHNSON,PAUL C From joachim at ccrl-nece.de Tue Oct 5 00:46:07 2004 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] myrinet (scali) or ethernet In-Reply-To: <20041004204709.89858.qmail@web25402.mail.ukl.yahoo.com> References: <20041004204709.89858.qmail@web25402.mail.ukl.yahoo.com> Message-ID: <4162513F.2070701@ccrl-nece.de> Patricia wrote: > Hi People, > > I am user of two clusters: One runs under myrinet and > the other under scali. In both cases I installed my > software to run under each of them (but not ethernet). > All I want to know is how to check whether my parallel > jobs are indeed running under myrinet (scali) or > ethernet. For Myrinet, you can check if an MPI-programm linked against the MPICH-GM library runs correctly. With Scali ("MPI Connect"), it's more complicated as it can fallback to Ethernet if the other interconnect (SCI?) does not work. Just run a ping-pong benchmark to measure latency (there's one included with Scali MPI), and if you get < 10us latency, you are not using Ethernet. Next to this, there should also be diagnostic tools included. Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From patrick at myri.com Tue Oct 5 05:05:29 2004 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] myrinet (scali) or ethernet In-Reply-To: <20041004204709.89858.qmail@web25402.mail.ukl.yahoo.com> References: <20041004204709.89858.qmail@web25402.mail.ukl.yahoo.com> Message-ID: <41628E09.1020809@myri.com> Hi Patricia, Patricia wrote: > I am user of two clusters: One runs under myrinet and > the other under scali. In both cases I installed my Myrinet is hardware and Scali makes software. Do you run Scali's software on Myrinet ? > jobs are indeed running under myrinet (scali) or > ethernet. Do you link with Scali's software or MPICH-GM ? If it's MPICH-GM, binaries will only run on Myrinet. If it's Scali, I don't know, I guess it chooses Myrinet or Ethernet at runtime. In this case, you can look at the output of gm_board_info on a node where your job is running, and see if any of the PIDs and the command lines of programs using a GM port matches your application process. It may still be possible that Scali opens a GM port without using it. Another solution would be to unplug a few nodes but Scali may be able to use Ethernet only for the nodes where Myrinet has been unplugged. You can also look at the GM counters (with gm_counters) and see if the number of packets sent/received goes up. 
However, you would not be sure if another process is using Myrinet at that time or if IP/Myrinet is up and running too. I guess there should be a way with Scali to know which device is used at runtime, but I really don't know how. Is it the same problem than the Myricom Help ticket #30885 ? Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From john.hearns at clustervision.com Tue Oct 5 07:22:32 2004 From: john.hearns at clustervision.com (John Hearns) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Dual Boot in Master and Client In-Reply-To: References: <04b001c4aa07$60aab630$39140897@PMORND> Message-ID: <1096986151.17462.21.camel@vigor12> On Tue, 2004-10-05 at 00:32, Tim Mattox wrote: > Hi Rajiv, > I would think you could do this with Warewulf. > http://warewulf-cluster.org/ > Just make the BIOS on each node first attempt to boot with PXE, > and upon PXE failure, boot from a locally installed Windows on > the node's hard drive. That's a good idea. In addition to that, I saw this project for Windows installs: http://unattended.sourceforge.net/ Depends if you want to re-install, or quickly boot an already installed setup. Your suggestion is probably better. > To switch from Linux to Windows, turn off the dhcpd server on > the master, and reboot the nodes. They should then come up > in Windows. > To switch back, you would turn on the dhcpd server on the master, > and then using some "unknown-to-me" windows utility to remotely > reboot the nodes, It seems possible to use Samba to do this, using the "net rpc shutdown" command. So assuming you run Samba on your head node you could probably reboot your master mode from Windows to Linux then issue this command. From tmattox at gmail.com Tue Oct 5 07:32:29 2004 From: tmattox at gmail.com (Tim Mattox) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Dual Boot in Master and Client In-Reply-To: References: Message-ID: Hi Mark, On Tue, 5 Oct 2004 00:56:38 -0400 (EDT), Mark Hahn wrote: > > I would think you could do this with Warewulf. > > I didn't look, but I suspect warewulf uses all the usual > open-source tools. Not sure what you mean exactly, but Warewulf itself is GPL'ed, and it leverages things like yum, rpm, pxelinux or Etherboot, rsync, etc. You can install whatever Beowulf tools you want such as LAM MPI, SGE, and pdsh for a small list of examples. If you have an RPM of whatever package, it's easy to install for the nodes. If you have a SRPM it takes just a few more steps. > > To switch from Linux to Windows, turn off the dhcpd server on > > if warewulf uses pxelinux, you can much more nicely configure > particular nodes to boot with particular default configs, > including different kernels, windows, etc by providing per-node > pxelinux.conf/ files. For now, Warewulf automatically creates those config files for pxelinux to do it's thing... so your custom configs would get clobbered by Warewulf when it generates it's own. Similarly, Warewulf rebuilds the dhcpd.conf file based it's node "database" and config files. I don't foresee putting any effort myself into making a dual boot into Window's a config option for Warewulf directly. However, if someone really needs this functionality, I doubt Greg (gmk to his friends) or I would reject their contributed code, as long as it was general enough to support dual/multi-booting into a locally installed Linux on nodes as well. 
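(For readers who have not run into the per-node pxelinux.cfg mechanism mentioned above, the sketch below shows roughly what "providing per-node files" amounts to. It is not Warewulf code, and as noted Warewulf would clobber hand-written files like these when it regenerates its own configs; the TFTP path, node names, IPs and kernel names are made-up placeholders. pxelinux looks for a config file named after the booting client's IP address in uppercase hex, and LOCALBOOT 0 hands control back to the local disk, e.g. an installed Windows.)

#!/usr/bin/env python
# Illustrative sketch only: write one pxelinux.cfg file per node.
# Paths, node names, IPs and kernel names below are hypothetical.
import os
import socket
import struct

TFTP_DIR = "/tftpboot/pxelinux.cfg"   # assumed TFTP layout, adjust to taste

# node name -> (IP address, boot target): "linux" = network-boot a kernel,
# "local" = fall through to whatever is installed on the node's disk
NODES = {
    "node01": ("10.0.0.11", "linux"),
    "node02": ("10.0.0.12", "local"),
}

LINUX_CFG = """DEFAULT linux
LABEL linux
  KERNEL vmlinuz
  APPEND initrd=initrd.img root=/dev/ram0
"""

LOCAL_CFG = """DEFAULT local
LABEL local
  LOCALBOOT 0
"""

def hex_ip(ip):
    """pxelinux looks for a file named after the client IP in uppercase hex,
    e.g. 10.0.0.11 -> 0A00000B."""
    return "%08X" % struct.unpack("!I", socket.inet_aton(ip))[0]

def main():
    for name, (ip, target) in sorted(NODES.items()):
        cfg = LINUX_CFG if target == "linux" else LOCAL_CFG
        path = os.path.join(TFTP_DIR, hex_ip(ip))
        with open(path, "w") as f:
            f.write(cfg)
        print("%s (%s) -> %s [%s]" % (name, ip, path, target))

if __name__ == "__main__":
    main()

Rewriting a node's one small file and rebooting it is then all it takes to flip that node between network-booted Linux and the locally installed OS.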
-- Tim Mattox - tmattox@gmail.com - http://homepage.mac.com/tmattox/ From jakob at unthought.net Tue Oct 5 07:56:53 2004 From: jakob at unthought.net (Jakob Oestergaard) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] ammonite In-Reply-To: References: <20040927182134.GA23662@piskorski.com> <20040927235904.GA21014@piskorski.com> Message-ID: <20041005145653.GR18307@unthought.net> On Mon, Oct 04, 2004 at 03:19:37PM -0700, Jack Wathey wrote: > > The ammonite cluster can now be seen in pictures and words at > > http://jessen.ch/ammonite/ Cool! On the SMP/UP problem: Do you have ACPI support in your kernel? Newer kernels can use ACPI for parsing SMP information from the motherboard, rather than guess-working on the old MP tables at magic locations in memory. This *could* be worth a shot I think. I am 100% sure that your SMP/UP problem has *nothing* to do, what so ever, with NFS server contention - either your kernel loads and boots, or it doesn't load and boot. The kernel does not use NFS (or local disk) during the early stages of boot, where the processors are set up. NFS problems would result in a failed boot, not a missing CPU. Alternatively, I'd try out 2.4.27 (or 2.6.8.1 if you're feeling lucky), your 2.4.20 kernel is *really* dated. Cheers, -- / jakob From rgb at phy.duke.edu Tue Oct 5 09:17:29 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] OT: effective amount of data through gigabit ether? In-Reply-To: <200410041413.57525.mwill@penguincomputing.com> Message-ID: On Mon, 4 Oct 2004, Michael Will wrote: > On Monday 04 October 2004 01:40 pm, Mike wrote: > > I know this is off topic, but I've not found an answer anywhere. > > On one IBM doc it says the effective throughput for 10Mb/s is > > 5.7GB/hour, 100Mb/s is 17.6GB/hour, but only lists TBD for 1000MB/s. > > I would assume 900Mb/s as an optimistic best case throughput for the GigE, > which would be about 395GB/hour. > > 17.6GB/hour seems like a really low estimate, that would be only > about 40Mb/s effective transfer rate over an 100Mb/s link? Maybe > that number is really measuring the tape writing speed instead? I agree with Michael (and Sean) here -- it is pretty straightforward to compute a theoretical peak bandwidth -- just convert the Mbps into MBps by dividing by 8. 10 Mbps ethernet can thus manage 1.25 MBps, 100 Mbps -> 12.5 MBps, 1000 Mbps -> 125 MBps. Most people don't consider the packet headers to be part of the "throughput" -- with a standard ethernet MTU of 1500 bytes + 18 bytes ethernet header, take away 64 bytes for TCP/IP header, one has at most 1436/1518 or 94.6% of peak This leaves a STILL theoretical peak of 1.18 x 10^{n-1} MBps (where n is the log_10 of the raw BW in Mbps). One reason people like to use switches and NICs that support oversize packets is that doing so reduces both this 5.4% chunk of header-based overhead and the ordinarily "invisible" mandatory pause between packets, per packet, letting you get a bit closer to peak. On top of this, switches and so forth will typically add a small bit of latency per packet on top of the minimum interpacket interval, and the TCP stack on both ends of the connection will add another slice of latency per packet. These are typically of order 50-200 microseconds (which end of this wide range you see depending on lots of things like switch quality and load, NIC quality and type, CPU speed and load, OS revision, and probably whether or not it is a Tuesday). 
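Putting the arithmetic above into a short script makes it easy to re-run for other MTUs or efficiency guesses. A minimal sketch (Python; it reuses the 1518/1436-byte figures from above, and the 0.90 factor is just the rough real-world efficiency being discussed here, not a measurement):

#!/usr/bin/env python
# Back-of-the-envelope wire speed -> payload MB/s and GB/hour, using the
# same numbers as the discussion above (1500-byte MTU, 18 bytes of Ethernet
# framing, 64 bytes of TCP/IP headers per packet).

ETH_OVERHEAD = 18      # Ethernet header + CRC, bytes
IP_TCP_HDRS  = 64      # TCP/IP headers per packet, bytes (figure used above)

def effective_rate(link_mbps, mtu=1500, efficiency=0.90):
    """Return (MB/s, GB/hour) of useful payload for a given link speed."""
    wire_bytes_per_sec = link_mbps * 1e6 / 8.0           # raw bytes on the wire
    payload_fraction   = (mtu - IP_TCP_HDRS) / float(mtu + ETH_OVERHEAD)
    mb_per_sec = wire_bytes_per_sec * payload_fraction * efficiency / 1e6
    return mb_per_sec, mb_per_sec * 3600.0 / 1000.0      # 3600 s/hour, 1000 MB/GB

if __name__ == "__main__":
    for mbps in (10, 100, 1000):
        mbs, gbh = effective_rate(mbps)
        print("%4d Mbps: ~%6.1f MB/s, ~%6.1f GB/hour" % (mbps, mbs, gbh))

For GbE this prints roughly 106 MB/sec and about 383 GB/hour, in the same ballpark as the other estimates in this thread.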
By the time all is said and done, one usually ends up with somewhere between 80% and 94% of theoretical peak, or 1.0 x 10^{n-1} and 1.174 x 10^{n-1} MBps, although for particularly poor NICs I've done worse than 80% in years past. Higher numbers (closer to theoretical peak, as noted) for giant packets. This is likely to be the relevant estimate for moving large data files. For moving SMALL data files or messages, one moves from bandwidth-dominated communications to latency-dominated communications bottlenecks. For small messages (say, less than 1K for the sake of argument) the "bandwidth" increasingly is simply the size of the data portion of the packet times the number of packets per second your interface can manage. To convert into GBph is trivial: there are 3600 seconds/hour, and 1000 MB in one GB, so multiplying the numbers above by 3.6 seems in order. This gives a theoretical peak in the ballpark of 4.25 x 10^{n - 1} GBph for a standard MTU (higher with large packets), a probable real world peak more like 4.05 x 10^{n - 1} GBph at 90% of wirespeed. FWIW, I think that GbE tends to perform closer to its theoretical peak than do the older 10 or 100 BT. This is both because it is much more likely that the interfaces and switches will handle large frames and because the hardware tends to be more expensive and better built, with more attention paid to details like how things are cached and DMA that can make a big difference in overall performance efficiency. Hope this helps, although as has already been noted (and will likely be noted again:-) the network isn't necessarily going to be the rate limiting bottleneck for backup. rgb > > Michael Will > > Does anyone know what this effective number is? This is for > > calculating how long backups should take through my backup network. > > > > (I'm not interested in how long it takes to read/write the disk, > > just the network throughput.) > > > > Mike > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Tue Oct 5 09:42:23 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] ethernet switch, dhcp question In-Reply-To: <265814864.1096929091119.JavaMail.osg@osgjas04.cns.ufl.edu> Message-ID: On Mon, 4 Oct 2004, JOHNSON,PAUL C wrote: > All: > > Im fairly new to beowulf clusters so please excuse the question if > it is trivial. Ive installed mpich on several computers and have > run several programs but the performance seems a little slow. All > the computers in my lab are connected directly to the campus > network. Would I see an increase in performance if I instead had > slaves connected through a switch in my room connected to a master > computer using dhcp to assign ip's? Quite probably, depending on how your campus is networked. A local switch is the preferred method of building a cluster. If you are using a 100 BT network and only have a small cluster, a 100BT switch is so cheap it is almost a non-issue. Even if you DO leave them connected to the campus network, if you do this by interconnecting your switch and the campus network you'll likely see a performance increase.
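Before rewiring, it is also worth confirming that the network (and not the nodes themselves) really is the bottleneck. As Mark suggested earlier in this thread, watching the per-node load while a job runs usually settles it; a minimal sketch (assuming passwordless ssh from the head node and Python; the hostnames are placeholders):

#!/usr/bin/env python
# Poll the 1-minute load average on each slave while an MPI job is running.
import subprocess

NODES = ["node01", "node02", "node03", "node04"]   # placeholder hostnames

def loadavg(host):
    """Return the 1-minute load average reported by host, or None on error."""
    try:
        out = subprocess.check_output(["ssh", host, "cat", "/proc/loadavg"],
                                      universal_newlines=True, timeout=10)
        return float(out.split()[0])
    except Exception:
        return None

if __name__ == "__main__":
    for host in NODES:
        load = loadavg(host)
        print("%-10s 1-min load: %s" %
              (host, "unreachable" if load is None else "%.2f" % load))

If the load sits well below one per CPU while your job runs, the slaves are waiting on communication and a dedicated switch is likely to help.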
Putting them on a private network gives you an even quieter networking environment and better control over the network, but yes, it will make you learn a whole bunch of things (e.g. DHCP/PXE and more) to get it right. If you go this route, I'd strongly urge that you get PXE-capable network cards and set up fully automated installation and booting at the same time. You'll spend a month learning all sorts of complicated networking, but at the end of it you'll REALLY save time on installation, operation, and so forth and your cluster will be upwardly scalable in size with very little additional effort. rgb > Thanks for any help, > Paul > > -- > JOHNSON,PAUL C > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From wathey at salk.edu Tue Oct 5 14:40:24 2004 From: wathey at salk.edu (Jack Wathey) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] ammonite In-Reply-To: <20041005145653.GR18307@unthought.net> References: <20040927182134.GA23662@piskorski.com> <20040927235904.GA21014@piskorski.com> <20041005145653.GR18307@unthought.net> Message-ID: On Tue, 5 Oct 2004, Jakob Oestergaard wrote: > On the SMP/UP problem: > > Do you have ACPI support in your kernel? Newer kernels can use ACPI for > parsing SMP information from the motherboard, rather than guess-working > on the old MP tables at magic locations in memory. This *could* be > worth a shot I think. Here are what I suspect are the relevant lines from the .config file: CONFIG_ACPI=y CONFIG_ACPI_DEBUG=y CONFIG_ACPI_BUSMGR=y CONFIG_ACPI_SYS=y CONFIG_ACPI_CPU=y CONFIG_ACPI_BUTTON=y CONFIG_ACPI_AC=y CONFIG_ACPI_EC=y CONFIG_ACPI_CMBATT=y CONFIG_ACPI_THERMAL=y So I guess the answer is 'yes'. > I am 100% sure that your SMP/UP problem has *nothing* to do, what so > ever, with NFS server contention - either your kernel loads and boots, > or it doesn't load and boot. The kernel does not use NFS (or local > disk) during the early stages of boot, where the processors are set up. > NFS problems would result in a failed boot, not a missing CPU. > > Alternatively, I'd try out 2.4.27 (or 2.6.8.1 if you're feeling lucky), > your 2.4.20 kernel is *really* dated. I'll try updating the kernel when I get a chance. It is, as you say, rather old now. Thanks, Jack From haavardw at ifi.uio.no Tue Oct 5 23:52:53 2004 From: haavardw at ifi.uio.no (=?ISO-8859-1?Q?H=E5vard_Wall?=) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] myrinet (scali) or ethernet In-Reply-To: <4162513F.2070701@ccrl-nece.de> References: <20041004204709.89858.qmail@web25402.mail.ukl.yahoo.com> <4162513F.2070701@ccrl-nece.de> Message-ID: <41639645.3090306@ifi.uio.no> Joachim Worringen wrote: > Patricia wrote: > With Scali ("MPI Connect"), it's more complicated as it can fallback to > Ethernet if the other interconnect (SCI?) does not work. Just run a > ping-pong benchmark to measure latency (there's one included with Scali > MPI), and if you get < 10us latency, you are not using Ethernet. Next to > this, there should also be diagnostic tools included. > With Scampi MPI Connect, you can check which interconnects are in use by setting the environment variable SCAMPI_NETWORKS_VERBOSE=2. 
It is true that scampi will try to fallback to another interconnect if the primary fails. The interconnects used is listed in /opt/scali/etc/ScaMPI.conf. You may override this by setting the environment variable SCAMPI_NETWORKS (or use the -net switch with mpimon). For example SCAMPI_NETWORKS="smp,sci,tcp" will first try communication through shared memory, then SCI, and at last (tcp) if this fails. -- hw From Hakon.Bugge at scali.com Tue Oct 5 23:54:34 2004 From: Hakon.Bugge at scali.com (=?iso-8859-1?Q?H=E5kon?= Bugge) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] myrinet (scali) or ethernet In-Reply-To: <200410051551.i95FofLX020150@bluewest.scyld.com> References: <200410051551.i95FofLX020150@bluewest.scyld.com> Message-ID: <6.1.2.0.0.20041006083354.03f4a6f0@elin.scali.no> At 05:51 PM 10/5/04, Patrick Geoffray wrote: >Patricia wrote: > > I am user of two clusters: One runs under myrinet and > > the other under scali. In both cases I installed my > >Myrinet is hardware and Scali makes software. Do you run Scali's >software on Myrinet ? Patricia, would be nice to know if you run Scali MPI Connect (SMC) or some older ancient versions. If you run SMC, it would be nice to know if you run: o) Gbe through the TCP/IP stack (-net tcp) o) Gbe through the Direct Ethernet Transport (-net det0) o) Myrinet (-net gm0) o) Infiniband (-net ib0) o) 10Gbe through 3rd party DAPLs (network name will vary) I assume you do not run combinations of the above although that is possible. I quick sanity check of your system could be to run sample benchmarks to assess the system(s) performance (latency and bandwidth). I usually use bandwidth and all2all, both located in /opt/scali/examples/bin. Another obvious check is to see if the system is idle when it is supposed to be idle. You could use the utility scatop for that purpose. Another way out is of course mailto:support _AT_ scali.com >[snip] > >I guess there should be a way with Scali to know which device is used at >runtime, but I really don't know how. # SCAMPI_NETWORKS_VERBOSE=2 mpimon -net smp,gm0,tcp /opt/scali/examples/bin/hello -- `scahosts` Hakon Bugge Hakon.Bugge _ AT_ scali.com From anandv at singnet.com.sg Wed Oct 6 03:41:50 2004 From: anandv at singnet.com.sg (Anand Vaidya) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Another bare motherboard cluster in a box Message-ID: <200410061841.51046.anandv@singnet.com.sg> Another bare motherboard based cluster, with Via CPU. Found this one on via arena "I found myself deeply interested (disturbed?) in the latest clustering software called HPC (High Performance Clustering). Most of the software is Linux based therefore free to download but I still needed an actual cluster to run the stuff on. Rather than standing in line somewhere like Stanford for a brief encounter with a cluster I went about building my own." http://www.slipperyskip.com/page10.html From Umesh.Chaurasia at siemens.com Wed Oct 6 04:38:24 2004 From: Umesh.Chaurasia at siemens.com (Chaurasia Umesh) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Linux memory leak? Message-ID: Hello, I am Umesh Chaurasia working in Siemens. I got your mail id from Linux forum . We have developed application on Linux 7.2 Kernel 2.4. System H/W configuration is 1 GB RAM, P-4, 2.4 GHZ. When we are putting our system on load after whole night we found only 5 MB memory left whereas in start it was 800 MB. Is there any patch or special configuration required to save the memory. Your input will really help us to build our system for Linux plateform. 
Regards, Umesh Chaurasia From rgb at phy.duke.edu Wed Oct 6 07:42:46 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage Message-ID: Dear List, I'm turning to you for some top quality advice as I have so often in the past. I'm helping assemble a grant proposal that involves a grid-style cluster with very large scale storage requirements. Specifically, it needs to be able to scale into the 100's of TB in "central disk store" (whatever that means:-) in addition to commensurate amounts of tape backup. The tape backup is relatively straightforward -- there is a 100 TB library available to the project already that will hold 200 TB after an LTO1->LTO2 upgrade, and while tapes aren't exactly cheap, they are vastly cheaper than disk in these quantities. The disk is a real problem. Raw disk these days is less than $1/GB for SATA in 200-300 GB sizes, a bit more for 400 GB sizes, so a TB of disk per se costs in the ballpark of $1000. However, HOUSING the disk in reliable (dual power, hot swap) enclosures is not cheap, adding RAID is not cheap, and building a scalable arrangement of servers to provide access with some controllable degree of latency and bandwidth for access is also not cheap. Management requirements include 3 year onsite service for the primary server array -- same day for critical components, next day at the latest for e.g. disks or power supplies that we can shelve and deal with ourselves in the short run. The solution we adopt will also need to be scalable as far as administration is concerned -- we are not interested in "DIY" solutions where we just buy an enclosure and hang it on an over the counter server and run MD raid, not because this isn't reliable and workable for a departmental or even a cluster RAID in the 1-8 TB range (a couple of servers) it isn't at all clear how it will scale to the 10-80 TB range, when 10's of servers would be required. Management of the actual spaces thus provided is not trivial -- there are certain TB-scale limits in linux to cope with (likely to soon be resolved if they aren't already in the latest kernels, but there in many of the working versions of linux still in use) and with an array of partitions and servers to deal with, just being able to index, store and retrieve files generated by the compute component of the grid will be a major issue. SO, what I want to know is: a) What are listvolken who have 10+ TB requirements doing to satisfy them? b) What did their solution(s) cost, both to set up as a base system (in the case of e.g. a network appliance) and c) incremental costs (e.g. filled racks)? d) How does their solution scale, both costwise (partly answered in b and c) and in terms of management and performance? e) What software tools are required to make their solution work, and are they open source or proprietary? f) Along the same lines, to what extent is the hardware base of their solution commodity (defined here as having a choice of multiple vendors for a component at a point of standardized attachment such as a fiber channel port or SCSI port) or proprietary (defined as if you buy this solution THIS part will always need to be purchased from the original vendor at a price "above market" as the solution is scaled up). Rules: Vendors reply directly to me only, not the list. I'm in the market for this, most of the list is not. 
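As a back-of-the-envelope aid to the scaling question, the sketch below turns a target usable capacity into box, drive and raw-disk-cost counts. Every number in it (drive size, $/GB, drives per enclosure, parity/spare overhead) is an illustrative assumption loosely based on the figures quoted above, not a quote from anyone; the point it makes is that raw disk is the cheap part, and the enclosures, servers and administration it deliberately omits are where questions a) through f) bite.

#!/usr/bin/env python
# Rough sizing of a disk store from commodity parts; all inputs are
# illustrative assumptions, not vendor figures.
import math

TARGET_TB      = 100     # usable capacity wanted
DRIVE_GB       = 300     # assumed SATA drive size
DOLLARS_PER_GB = 1.0     # assumed raw disk price (as above)
DRIVES_PER_BOX = 16      # assumed drives per server/enclosure
RAID_PARITY    = 2       # parity drives per box (e.g. RAID-6), plus...
HOT_SPARES     = 1       # ...hot spares per box

def plan(target_tb=TARGET_TB):
    usable_per_box_gb = (DRIVES_PER_BOX - RAID_PARITY - HOT_SPARES) * DRIVE_GB
    boxes  = int(math.ceil(target_tb * 1000.0 / usable_per_box_gb))
    drives = boxes * DRIVES_PER_BOX
    raw_disk_cost = drives * DRIVE_GB * DOLLARS_PER_GB
    return boxes, drives, usable_per_box_gb, raw_disk_cost

if __name__ == "__main__":
    boxes, drives, per_box, cost = plan()
    print("usable per box : %.1f TB" % (per_box / 1000.0))
    print("boxes needed   : %d" % boxes)
    print("drives needed  : %d" % drives)
    print("raw disk cost  : ~$%.0f (excludes enclosures, RAID, servers)" % cost)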
Note also that I've already gotten a decent picture of at least two or three solutions offered by tier 1 cluster vendors or dedicated network storage vendors although I'm happy to get more. However, I think that beowulf administrators, engineers, and users should likely answer on list as the real-world experiences are likely to be of interest to lots of people and therefore would be of value in the archives. I'm hoping that some of you bioinformatics people have experience here, as well as maybe even people like movie makers. FWIW, the actual application is likely to be Monte Carlo used to generate huge data sets (per node) and cook them down to smaller (but still multiGB) data sets, and hand them back to the central disk store for aggregation and indexed/retrievable intermediate term storage, with migration to the tape store on some as yet undetermined criterion for frequency of access and so forth. Other uses will likely emerge, but this is what we know for now. I'd guess that bioinformatics and movie generation (especially the latter) are VERY similar in the actual data flow component and also require multiTB central stores and am hoping that you have useful information to share. Thanks in advance, rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From john.hearns at clustervision.com Wed Oct 6 09:14:38 2004 From: john.hearns at clustervision.com (John Hearns) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Linux memory leak? In-Reply-To: Message-ID: On Wed, 6 Oct 2004, Chaurasia Umesh wrote: > Hello, > > I am Umesh Chaurasia working in Siemens. I got your mail id from Linux forum > . > We have developed application on Linux 7.2 Kernel 2.4. System H/W > configuration is 1 GB RAM, P-4, 2.4 GHZ. > When we are putting our system on load after whole night we found only 5 MB > memory left whereas in start it was 800 MB. Are you SURE that you are not counting the buffer memory as used? Linux uses free memory as disk buffer, which is released on demand. Please send us your output from 'free' and 'vmstat' Also, and I hate to say this, I guess you mean Redhat 7.2 which is very long in the tooth. Redhat 9 is end-of-life. You should consider Redhat Enterprise or Fedora Core 1 for 2.4 series kernels. (And yes, From alvin at Mail.Linux-Consulting.com Thu Oct 7 00:42:35 2004 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage - housing 100TB In-Reply-To: Message-ID: hi ya robert our solution is scalable using off the shelf commodity parts and open source software - we also recommend a duplicate system for "live backups" - we can customize our products ( hardware solutions ) to fit the clients requirements and budget - example large 100TB disk-subsystem on 4 disks per blade ........ 1.2TB per blade with 300GB disks 10 blades per 4U chassis .... 12TB per 4U chassis 10 4U chassis per rack ...... 
120TB per 42U rack http://www.itx-blades.net/Dwg/BLADE.jpg - model shown holds 4 disks, but we can fit 8-disks in it http://www.itx-blades.net/Dwg/4U-BLADE.jpg - cooling ( front to back or top to bottom ) is our main concerns that we try to solve with one solution - system runs on +12V dc input - 2x 600W 2U powersupply is enough power for driving the system - i'd be more than happy to send a demo chassis and blades, no charge if we can get feedback that you've used it and built it out as you needed - hopefully you can provide the disks, mb, cpu, memory, - we can provide the "system assembly and testing time" at "evaluation" costs ( all fees credited toward the purchase ) - you keep the 4U chassis afterward ( no charge ) On Wed, 6 Oct 2004, Robert G. Brown wrote: > I'm helping assemble a grant proposal that involves a grid-style cluster > with very large scale storage requirements. Specifically, it needs to > be able to scale into the 100's of TB in "central disk store" (whatever > that means:-) in addition to commensurate amounts of tape backup. The good .. sounds like fun > tape backup is relatively straightforward -- there is a 100 TB library > available to the project already that will hold 200 TB after an > LTO1->LTO2 upgrade, and while tapes aren't exactly cheap, they are > vastly cheaper than disk in these quantities. - tape backups are not cheap ... - tape backups are not reliable ( to save the tapes and restore ) - dirty heads, tapes that need to be swapped, .. - tape backups are too slow ( to restore ) > The disk is a real problem. Raw disk these days is less than $1/GB for > SATA in 200-300 GB sizes, a bit more for 400 GB sizes, so a TB of disk > per se costs in the ballpark of $1000. yup.. good ball park > However, HOUSING the disk in > reliable (dual power, hot swap) enclosures is not cheap, adding RAID is > not cheap, it can be ... does it need to be dual-hot-swap power supplies ?? - no problem... we can provide that (though not a pretty "case" ) raid is cheap ... but why use raid ... there is no benefit to using software or hardware raid at this size data ... - time is better spent in optimizing data and backup of the data to a 2nd system - it is NOT trivial to backup 20TB - 100TB of data - raid'ing reduces the overall reliability ( more things to fail ) and increases the system admin costs ( more testing ) > and building a scalable arrangement of servers to provide > access with some controllable degree of latency and bandwidth for access > is also not cheap. not sure what the issues are .. - it'd depend on the switch/hub, and "disk subsystem/infrastructure" > Management requirements include 3 year onsite > service for the primary server array -- same day for critical > components, we'd be using a duplicate "hot swap backup system" > next day at the latest for e.g. disks or power supplies that > we can shelve and deal with ourselves in the short run. most everything we use is off the shelf and be kept on the shelf for emergencies power supplies, disks, motherboards, cpu, memory, fans > The solution we > adopt will also need to be scalable as far as administration is > concerned -- scaling is easy in our case ... 
> we are not interested in "DIY" solutions where we just buy > an enclosure and hang it on an over the counter server and run MD raid, we can build and test for you ( onsite if needed ) > not because this isn't reliable and workable for a departmental or even > a cluster RAID in the 1-8 TB range (a couple of servers) it isn't at all > clear how it will scale to the 10-80 TB range, when 10's of servers > would be required. we don't forecast any issues with sw raid ... on 4 disks per blade ........ 1.2TB per blade with 300GB disks 10 blades per 4U chassis .... 12TB per 4U chassis 10 4U chassis per rack ...... 120TB per 42U rack > Management of the actual spaces thus provided is not trivial actual data to save would be a bigger issue than the saving of it onto the disk subsystems > -- there > are certain TB-scale limits in linux to cope with (likely to soon be > resolved if they aren't already in the latest kernels, but there in many > of the working versions of linux still in use) and with an array of individual file size issues would limit the raw data one can save other way around it is to use custom device drivers like oracle that uses their own "raw data" drivers to get around file size limiations > partitions and servers to deal with, just being able to index, store and > retrieve files generated by the compute component of the grid will be a > major issue. that depends on how the data is created and stored ??? - we dont think it as a major issue, as long as each "TB-sized files" can be indexed properly at the time of its creation > SO, what I want to know is: > > a) What are listvolken who have 10+ TB requirements doing to satisfy > them? we prefer non-raided systems ... and duplicate disk-systems for backup > b) What did their solution(s) cost, both to set up as a base system > (in the case of e.g. a network appliance) and raw components is roughly $25K per 12TB in one 4U chassis http://www.itx-blades.net/Systems/ - add marketing/sales/admin/contract/onsite costs to it ( $250K for fully Managed - 3yr contracts w/ 2nd backup system ) http://www.itx-blades.net/Systems/ > c) incremental costs (e.g. filled racks)? the system is expandable as needed per 1.2TB blade or 12TB ( 4U chassis ) additional costs to intall additional blades into the disk-subsystem is incremental for the time needed to add its config to the existing config files for the disk subsystem ( fairly simple, since the rest of the system has already been tested and operational ) > d) How does their solution scale, both costwise (partly answered in b > and c) and in terms of management and performance? partly and answered above scalable solutions is accomplished with modular blades and blade chassis > e) What software tools are required to make their solution work, and > are they open source or proprietary? 
just the standard linux software raid tools in the kernel everything is open source > f) Along the same lines, to what extent is the hardware base of their > solution commodity (defined here as having a choice of multiple vendors everything is off-the-shelf we have the proprietory 4U blade chassis for "holding the blades" in place along with the power supply ( the system can be changed per customer requirements > for a component at a point of standardized attachment such as a fiber > channel port or SCSI port) fiber channel cards may be used if needed, but it'd require some reconfigurations - fiber channel PCI cards are expensive and it is unclear if its required or not > or proprietary (defined as if you buy this > solution THIS part will always need to be purchased from the original > vendor at a price "above market" as the solution is scaled up). everything is off-the-shelf > Rules: Vendors reply directly to me only, not the list. i was wondering why nobody replied publicly :-) > I'm in the > market for this, most of the list is not. Note also that I've already > gotten a decent picture of at least two or three solutions offered by > tier 1 cluster vendors or dedicated network storage vendors although I'm > happy to get more. i hope "name brand" is not the primary evaluation consideration > However, I think that beowulf administrators, engineers, and users > should likely answer on list as the real-world experiences are likely to > be of interest to lots of people and therefore would be of value in the > archives. I'm hoping that some of you bioinformatics people have > experience here, as well as maybe even people like movie makers. we've been indirectly selling small systems to the movie industry ( by the hundred's of systems ) .. its just a simple mpeg player :-) > FWIW, the actual application is likely to be Monte Carlo used to > generate huge data sets (per node) and cook them down to smaller (but > still multiGB) data sets, and hand them back to the central disk store > for aggregation and indexed/retrievable intermediate term storage, with good ... > migration to the tape store on some as yet undetermined criterion for > frequency of access and so forth. Other uses will likely emerge, but i'd avoid tape storage due to costs and index/restore/uptime issues > this is what we know for now. I'd guess that bioinformatics and movie > generation (especially the latter) are VERY similar in the actual data > flow component and also require multiTB central stores and am hoping > that you have useful information to share. have fun alvin From hahn at physics.mcmaster.ca Thu Oct 7 15:08:27 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage In-Reply-To: Message-ID: > that means:-) in addition to commensurate amounts of tape backup. The ick! our big-storage plans very, very much hope to eliminate tape. > tape backup is relatively straightforward -- there is a 100 TB library > available to the project already that will hold 200 TB after an > LTO1->LTO2 upgrade, and while tapes aren't exactly cheap, they are > vastly cheaper than disk in these quantities. hmm, LTO2 is $0.25/GB; disks are about double that. considering the issues of tape reliability, access time and migration, I think disk is worth it. from what I hear in the storage industry, this is a growing consensus among, for instance, hospitals - they don't want to spend their time reading tapes to see whether the media is failing and content needs to be migrated. 
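In the per-TB units used elsewhere in this thread, the media-only figures quoted above work out roughly as follows (a back-of-envelope sketch using only the numbers given here; drive housings, servers and tape robotics are extra):

    # media cost only, per TB, from the $/GB figures quoted above
    echo "LTO2 tape: \$$(( 25 * 1000 / 100 )) per TB"   # at ~$0.25/GB
    echo "raw SATA:  \$$(( 50 * 1000 / 100 )) per TB"   # at roughly double that

Those media numbers are an order of magnitude below the delivered cost of served, RAID-protected disk that comes up later in the thread, which is really what the comparison hinges on.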
migrating content that's online is ah, easier. in the $ world, online data is attractive in part so its lifetime can be more explicitly managed (ie, deleted!) > The disk is a real problem. Raw disk these days is less than $1/GB for > SATA in 200-300 GB sizes, a bit more for 400 GB sizes, so a TB of disk > per se costs in the ballpark of $1000. However, HOUSING the disk in > reliable (dual power, hot swap) enclosures is not cheap, adding RAID is > not cheap, and building a scalable arrangement of servers to provide > access with some controllable degree of latency and bandwidth for access > is also not cheap. no insult intended, but have you looked closely, recently? I did some quick web-pricing this weekend, and concluded: vendor capacity size $Cad list per TB density dell/emc 12x250 3U $7500 1.0 TB/U apple 14x250 3U $4000 1.166 hp/msa1500cs 12x250x4 10U $3850 1.2 (divide $Cad by 1.25 or so to get $US.) all three plug into FC. the HP goes up to 8 shelves per controller or 24 TB per FC port, though. > Management requirements include 3 year onsite > service for the primary server array -- same day for critical > components, next day at the latest for e.g. disks or power supplies that > we can shelve and deal with ourselves in the short run. The solution we pretty standard policies. > adopt will also need to be scalable as far as administration is > concerned -- we are not interested in "DIY" solutions where we just buy > an enclosure and hang it on an over the counter server and run MD raid, > not because this isn't reliable and workable for a departmental or even > a cluster RAID in the 1-8 TB range (a couple of servers) it isn't at all > clear how it will scale to the 10-80 TB range, when 10's of servers > would be required. Robert, are you claiming that 10's of servers are unmanagable on a *cluster* mailing list!?! or are you thinking of the number of moving parts? > Management of the actual spaces thus provided is not trivial -- there > are certain TB-scale limits in linux to cope with (likely to soon be > resolved if they aren't already in the latest kernels, but there in many > of the working versions of linux still in use) and with an array of I can understand and even emphathize with some people's desire to stick to old and well-understood kernels. but big storage is a very good reason to kick them out of this complacency - the old kernel are justifiable only on not-broke-don't-fix grounds... > partitions and servers to deal with, just being able to index, store and > retrieve files generated by the compute component of the grid will be a > major issue. how so? I find that people still use sensible hierarchical organization, even if the files are larger and more numerous than in the past. > a) What are listvolken who have 10+ TB requirements doing to satisfy > them? we're acquiring somewhere between .2 and 2 PB, and are planning machinrooms around the obvious kinds of building blocks: lots of servers that are in the say 4-20 TB range, preferably connected by some fast fabric (IB seems attractive, since it's got mediocre latency but good bandwidth.) > b) What did their solution(s) cost, both to set up as a base system > (in the case of e.g. a network appliance) and I'm fairly certain that if I were making all the decisions here, I'd go for fairly smallish modular servers plugged into IB. > c) incremental costs (e.g. filled racks)? ? > d) How does their solution scale, both costwise (partly answered in b > and c) and in terms of management and performance? 
my only real concern with management is MTBF: if we had a hypothetical collection 2PB of 250G SATA disks with 1Mhour MTBF, we'd go 5 days between disk replacements. to me, this motivates toward designs that have fairly large numbers of disks that can share a hot spare (or maybe raid6?) > e) What software tools are required to make their solution work, and > are they open source or proprietary? I'd be interested in knowing what the problem is that you're asking to be solved. just that you don't want to run "find / -name whatever" on a filesystem of 20 TB? or that you don't want 10 separate 2TB filesystems? > f) Along the same lines, to what extent is the hardware base of their > solution commodity (defined here as having a choice of multiple vendors > for a component at a point of standardized attachment such as a fiber > channel port or SCSI port) or proprietary (defined as if you buy this > solution THIS part will always need to be purchased from the original > vendor at a price "above market" as the solution is scaled up). as far as I can see, the big vendors are somehow oblivious of the fact that customers *HATE* the proprietary, single-source attitude. oh, you can plug any FC devices you want into your san, as long as they're all our products and we've "qualified" them. > Rules: Vendors reply directly to me only, not the list. I'm in the > market for this, most of the list is not. Note also that I've already I think you'd be surprised at how many, many people are buying multi-TB systems for isolated labs. there are good reasons that this kind of scattershot approach is not wise in, say, a university setting, where a shared resource pool can respond better to burstiness, consistent maintenance, stable environment, etc. regards, mark hahn. From rgb at phy.duke.edu Thu Oct 7 16:59:59 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage In-Reply-To: References: Message-ID: On Thu, 7 Oct 2004, Mark Hahn wrote: > > that means:-) in addition to commensurate amounts of tape backup. The > > ick! our big-storage plans very, very much hope to eliminate tape. > > > tape backup is relatively straightforward -- there is a 100 TB library > > available to the project already that will hold 200 TB after an > > LTO1->LTO2 upgrade, and while tapes aren't exactly cheap, they are > > vastly cheaper than disk in these quantities. > > hmm, LTO2 is $0.25/GB; disks are about double that. considering the > issues of tape reliability, access time and migration, I think > disk is worth it. from what I hear in the storage industry, this > is a growing consensus among, for instance, hospitals - they don't > want to spend their time reading tapes to see whether the media is > failing and content needs to be migrated. migrating content that's > online is ah, easier. in the $ world, online data is attractive in part > so its lifetime can be more explicitly managed (ie, deleted!) It isn't the media, it's the way it is served. Tape is ballpark of $250/TB, but once you've invested in a general shell -- a tape library of whatever size you want to pay for -- cost scales linearly, and it (tape) is easy and relatively safe to transport. Disk, by the time you wrap it up, serve it, connect it to this and that, and provide it with this and that costs much more. Otherwise I agree with most of what you say, but remember, I didn't write the RFP specs. Besides, today they decided to drop the 60 TB of tape spec. Oops! 
We'll still meet or exceed it anyway, as we have a big tape library that is conveniently underutilized handy, so we REALLY just pay for the media (plus maybe kick in a drive or two). > > The disk is a real problem. Raw disk these days is less than $1/GB for > > SATA in 200-300 GB sizes, a bit more for 400 GB sizes, so a TB of disk > > per se costs in the ballpark of $1000. However, HOUSING the disk in > > reliable (dual power, hot swap) enclosures is not cheap, adding RAID is > > not cheap, and building a scalable arrangement of servers to provide > > access with some controllable degree of latency and bandwidth for access > > is also not cheap. > > no insult intended, but have you looked closely, recently? I did some > quick web-pricing this weekend, and concluded: > > vendor capacity size $Cad list per TB density > dell/emc 12x250 3U $7500 1.0 TB/U > apple 14x250 3U $4000 1.166 > hp/msa1500cs 12x250x4 10U $3850 1.2 > > (divide $Cad by 1.25 or so to get $US.) all three plug into FC. > the HP goes up to 8 shelves per controller or 24 TB per FC port, though. So you add FC switch and server(s) and end up at a minimum of around $5K/TB. The maximum prices I'm seeing reported by respondants and that we've seen in quotes or prices of actual systems are well over $10K/TB, some as high as $30K/TB. Price depends on how fast and scalable you want it to be, which in turn depends on how proprietary it is. But I'll summarize all of this when I get through the proposal and can breathe again. The cheapest solutions are those you build yourself, BTW -- as one might expect -- followed by ones that a vendor assembles for you, followed in order by proprietary/named solutions that require special software or special software and special hardware. Some of the solutions out there use basically "no" commodity parts that you can replace through anybody but the vendor -- they even wrap up the disks themselves in their own custom packaging and firmware and double the price in the process. > > Management requirements include 3 year onsite > > service for the primary server array -- same day for critical > > components, next day at the latest for e.g. disks or power supplies that > > we can shelve and deal with ourselves in the short run. The solution we > > pretty standard policies. > > > adopt will also need to be scalable as far as administration is > > concerned -- we are not interested in "DIY" solutions where we just buy > > an enclosure and hang it on an over the counter server and run MD raid, > > not because this isn't reliable and workable for a departmental or even > > a cluster RAID in the 1-8 TB range (a couple of servers) it isn't at all > > clear how it will scale to the 10-80 TB range, when 10's of servers > > would be required. > > Robert, are you claiming that 10's of servers are unmanagable > on a *cluster* mailing list!?! or are you thinking of the number > of moving parts? I'm thinking of scalability of management at all levels, and performance at all levels. I don't >>think<< that I'm crazy in thinking that this is an issue in large scale storage design -- at least one respondant so far suggested that I wasn't radical enough and that off-the shelf or homemade SAN solutions are doomed to nasty failure at very large (100+ TB) sizes. I'm not certain that I believe him (I had several people describe their off-the-shelf solutions that scale to 100+ TB sizes, and was directed to e.g. 
http://www.archive.org/web/petabox.php) but think of me as being hypercautious in my already admitted ignorance;-) That is, if there are no issues and people are running stacks of 6.4 TB enclosures hanging off of OTC linux boxes and managing the volumes and data issues transparently and they scale to 100's of TB, sure, I'd love to hear about it. Now I have, although there are issues, there are issues. As I said, I'll summarize (and maybe start some lovely arguments:-) when I'm done but I'm still DIGESTING all the data I've gotten from vendors and list-friends (all of whom I profoundly thank!). > > Management of the actual spaces thus provided is not trivial -- there > > are certain TB-scale limits in linux to cope with (likely to soon be > > resolved if they aren't already in the latest kernels, but there in many > > of the working versions of linux still in use) and with an array of > > I can understand and even emphathize with some people's desire to > stick to old and well-understood kernels. but big storage is a very > good reason to kick them out of this complacency - the old kernel are > justifiable only on not-broke-don't-fix grounds... Again, agreed, but one wants to be very conservative in a project proposal, especially when we HAVE NO CHOICE as to the actual kernel or OS distribution -- we will have to just "install the grid" with a package developed elsewhere by people that you or I might or might not agree with. Historically, in fact, I think that there is no chance that either one of us would do things the way they have done them so far, and maybe we will ultimately influence the design, but when writing the proposal we have to assume that we'll be using their linux. Where at least we've talked them up from some -- shall we say old? obsolete? non-x64 supporting? versions of linux and the associated kernels and libraries as a base... (you get the idea). > > partitions and servers to deal with, just being able to index, store and > > retrieve files generated by the compute component of the grid will be a > > major issue. > > how so? I find that people still use sensible hierarchical organization, > even if the files are larger and more numerous than in the past. It's a grid, and we're trying to avoid direct NFS mounts on all the nodes for a variety of reasons (like performance, reliability, security) and because in this kind of grid people will need to use fully automated schema for data storage, retrieval, and archival migration on and off the main data store. Honestly, I personally think that the data management issue and toolset is MORE important than the hardware. As you note, we can build arrays of disk servers or arrays of disk and associated servers or network appliances and arrays of disk a variety of ways, including DIY with a fairly obvious design. In order for people to be able to direct a node to run for a week and drop its results, properly indexed and crossreferenced by user/group/program/parameters in a database, somewhere into the data store where it will be transparently migrated onto and off of an attached tape archive as needed AND possibly resync'd back to a project CENTRAL store AND possibly sync'd back to the home LAN and store of the grid user for local processing --- it is doable, sure, but I wouldn't call it trivial or necessarily doable without some hacking or involvement in OS projects addressing this issue or purchase of proprietary software ditto. 
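To make the shape of that deposit-and-index step concrete, here is a minimal node-side sketch. The store host, directory layout and metadata fields are all invented for illustration; none of this comes from an existing grid package discussed here.

    #!/bin/sh
    # deposit.sh -- hypothetical sketch of the node-side "hand results back,
    # with metadata" step.  Store host, paths and field names are made up.
    # usage: deposit.sh <result-file> <program> "<parameters>"
    set -e
    FILE=$1; PROGRAM=$2; PARAMS=$3
    STORE=store.example.org           # hypothetical central-store host
    DEST=/export/grid/incoming        # hypothetical drop directory
    SUM=`md5sum "$FILE" | awk '{print $1}'`
    META="$FILE.meta"
    cat > "$META" <<EOF
    file=`basename "$FILE"`
    md5=$SUM
    user=`id -un`
    group=`id -gn`
    node=`hostname`
    program=$PROGRAM
    params=$PARAMS
    date=`date -u +%Y%m%dT%H%M%SZ`
    EOF
    # data first, metadata last, so an indexer on the store side only ever
    # sees complete pairs and can load the .meta records into its database
    rsync -a "$FILE" "$META" "$STORE:$DEST/"

The hard part, as the surrounding discussion makes clear, is everything downstream of this: the store-side sweep that loads the metadata into a searchable database, schedules tape migration, and resyncs results back to the owner's home LAN.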
If it is trivial, and there is a simple package that does all this and eats your meatloaf for you out to a PB of data, please enlighten me...:-) > > a) What are listvolken who have 10+ TB requirements doing to satisfy > > them? > > we're acquiring somewhere between .2 and 2 PB, and are planning machinrooms > around the obvious kinds of building blocks: lots of servers that are in > the say 4-20 TB range, preferably connected by some fast fabric (IB seems > attractive, since it's got mediocre latency but good bandwidth.) Ya. > > b) What did their solution(s) cost, both to set up as a base system > > (in the case of e.g. a network appliance) and > > I'm fairly certain that if I were making all the decisions here, I'd > go for fairly smallish modular servers plugged into IB. Any idea of what that would cost? > > > c) incremental costs (e.g. filled racks)? I meant "cost of additional filled disk enclosures" once you've bought in. Some solutions involve network appliances with a large capital investment before you buy your first disk enclosure, and then scale linearly with filled enclosures to some point, where you buy another appliance. Some solutions already specify an appliance interconnect so that the whole thing is transparent to your cluster. Some solutions are expensive, expensive. I'm just trying to figure out HOW expensive, and how far we can go for what we can afford with the different alternatives. I'm happy for anyone to tell me the virtues of the expensive systems (the benefits) as long as I have the costs in hand as well, so I can ultimately do the old fashioned CBA. > > d) How does their solution scale, both costwise (partly answered in b > > and c) and in terms of management and performance? > > my only real concern with management is MTBF: if we had a hypothetical > collection 2PB of 250G SATA disks with 1Mhour MTBF, we'd go 5 days between > disk replacements. to me, this motivates toward designs that have fairly > large numbers of disks that can share a hot spare (or maybe raid6?) Right, but if your hypothetical array of disks also involved a stack of over the counter servers, network switches (of any sort, eg IB or FC or GE), and so on, there isn't just the disk to worry about -- in fact, in a good RAID enclosure it is relatively straightforward to deal with the disk (and hot swap power and hot swap fan) failures. Dealing with intrinsic server failures, e.g. toasted memory, CPU, CPU fans, CPU power supply (maybe, unless server has dual power) and sundry networking or peripheral card failures takes a lot more time and expertise, and can take down whole blocks of disks if the disk is provided only via direct connections to specific servers. Both human effort and expertise required and projected downtime depend a lot on how you build and set things up. Or rather, I >>expect<< it to, and am seeking war-stories (stories of profound failures where some design was FUBAR and ultimately abandoned for cause, especially) so I can figure out which designs to avoid because they DON'T scale in management. Performance scaling is also important, but we're not looking for the fastest possible solution or truly superior performance scaling (the kinds of solutions that cost the $10K+/TB sorts of prices). Unless of course all the other solutions simply choke to death at some e.g. 80 TB scale. I don't "expect" them too, sure, but if I knew the answer, why'd I ask? > > > e) What software tools are required to make their solution work, and > > are they open source or proprietary? 
> > I'd be interested in knowing what the problem is that you're asking to be > solved. just that you don't want to run "find / -name whatever" on > a filesystem of 20 TB? or that you don't want 10 separate 2TB filesystems? Partially described above. The dataflow we are expecting isn't unique to our problem, BTW. One respondant with almost exactly the same needs described a tool they are developing that is designed fairly specifically to manage the dataflow and archival/migration issues transparently. I'm waiting to hear whether it interfaces with any sort of indexing schema or toolset -- if so, it would simply solve the problem. Solve it for the cheapest possible (hardware reliable, COTS component) data stack -- a pile of OTS multiTB servers -- as well! In case the above wasn't clear, think: a) Run 1 day to 1 week, generate some 100+ GB per CPU on node local storage; b) Run hours to days, reduce the data to "interesting" and compressed form, occupying maybe 10% of this space. How the actual data is originally created (one big file or many little files, e.g.) I haven't a clue yet. How it is aggregated ditto. At some point, though; c) condensed data (be it in one 10 GB file or 10 1 GB files or larger or smaller fragments) is sent in to the central store, where it has to be saved in a way that is transparent to the user, indexed by the generating program, its parameters, the generating/owning group, various node and timestamp metadata, all in a DB that is searchable by the large community that wants to SHARE this data. So "find" is clearly out, even find with really long filenames. Find is REALLY out if you think about its performance scaling as you fill the store with lots of inodes. d) Once on the central store, the data has to be able to stay there (if it is being used), be backed up to tape (regardless), be MIGRATED to tape to free space on the central store for other data that IS being used, be retreiveable from backup or archive, be downloadable by the generating user to a home faraway for local processing, be downloadable by OTHER groups/users to THEIR homes faraway, and be uploadable to a PB-scale toplevel store and centralized archive in a higher tier of the grid. e) and maybe other stuff. The RFP wasn't horribly detailed (it wasn't at ALL detailed) and the material we've obtained from grid prototype sites isn't very helpful at the design phase. So we may NEED to export NFS space to the nodes or use XFS and some fancy toolsets or the like, but we're hoping to avoid this if the actual workflow permits it. On a grid, it "should", since grid tasks should all use "grid functions" to accomplish macroscopic tasks, not Unix/linux/posix functions or tools. > > f) Along the same lines, to what extent is the hardware base of their > > solution commodity (defined here as having a choice of multiple vendors > > for a component at a point of standardized attachment such as a fiber > > channel port or SCSI port) or proprietary (defined as if you buy this > > solution THIS part will always need to be purchased from the original > > vendor at a price "above market" as the solution is scaled up). > > as far as I can see, the big vendors are somehow oblivious of the fact > that customers *HATE* the proprietary, single-source attitude. > oh, you can plug any FC devices you want into your san, > as long as they're all our products and we've "qualified" them. You are now the third or fourth person to make THAT observation. "Standards? We don' care about no stinkin' standards..." 
(apologies to Mel Brooks and Blazing Saddles...;-) > > > Rules: Vendors reply directly to me only, not the list. I'm in the > > market for this, most of the list is not. Note also that I've already > > I think you'd be surprised at how many, many people are buying > multi-TB systems for isolated labs. there are good reasons that > this kind of scattershot approach is not wise in, say, a university > setting, where a shared resource pool can respond better to burstiness, > consistent maintenance, stable environment, etc. I agree again. Hell, I maintain a 3x80 GB disk IDE RAID in my HOME server these days, and the only thing special about the "80" is the age of the disks -- next time I upgrade it I'll likely make it close to a TB just because I can. So TB-scale storage is to be expected in most departmental size computing efforts at $1/GB plus housing and server. 100 TB-scale storage is a different beast. One is really engineering a storage "cluster" and like all cluster engineering, the optimal result depends on the application mix and expected usage; a "recipe" based solution might work or it might lead to disaster and effectively unusuable resources due to bottlenecks, contention, or management issues. Cluster engineering I have a reasonable understanding of; storage cluster engineering at this scale is way beyond my ken, although I'm learning fast. If only I had a couple of hundred thousand dollars, now, I'd build and buy a bunch of prototypes and really learn it the right way...;-) Thanks enormously for the response, rgb > > regards, mark hahn. > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Thu Oct 7 17:27:13 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage In-Reply-To: References: Message-ID: On Thu, 7 Oct 2004, Robert G. Brown wrote: > In case the above wasn't clear, think: > > a) Run 1 day to 1 week, generate some 100+ GB per CPU on node local > storage; I hate to reply to myself, but I meant "per node" -- on a hundred node dual CPU cluster, generate as much as 2 TB of raw data a week, which reduces to maybe 0.2 TB of data a week in hundreds of files. Multiply by 50 and we'll fill 10+ TB in a year, in tens of thousands of files (or more). And this is the lower-bound estimate, likely off by a factor of 2-4 and certain to be off by even more as the cluster scales up in size over the next few years to as many as 500 nodes sustained, all cranking out data according to this prescription but amplified by Moore's Law by exponentially increasing factors. This is why I'm worried about scaling so much. Even the genomics people have some sort of linear bounds on their data production rate. This has exponential growth in productivity matching (hopefully) expected growth in storage, so it might not get relatively easier... and if the exponents mismatch, it could get a lot worse. rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From csamuel at vpac.org Thu Oct 7 18:54:06 2004 From: csamuel at vpac.org (Chris Samuel) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Strange NFS corruption on Linux cluster to AIX 5.2 NFS server Message-ID: <200410081154.09031.csamuel@vpac.org> Hi folks, (system details at the end) I'm having a real hard time trying to track down a really bizzare NFS related issue on some clusters we're helping out on and I'm wondering if anyone here quickly knows the answer to this question before I go off trawling through the kernel sources. I have a 72K assembler file (the results of a day of narrowing down the problem) that when I do: as -o /tmp/file.o file.s generates a valid .o file, but when I do: as -o /some/nfs/directory/file.o file.s creates a corrupted object file (and in the original case leads to a link error due to the corrupted ELF format). However, cp'ing or cat'ing the object file from /tmp to the NFS filesystem is fine, it's just the assemblers output that is corrupted. I thought that this was just an NFS probem until I used strace to dump out the entire contents of the file descriptors that 'as' reads and writes to for the assembler file and for the object file, and then diff'd them. The only significant differences is that the write(2)'s to the object files are not the same, which I find extremely puzzling, I can see no way that the assembler can generate different output depending on whether the file it's just open()'d is on NFS or local disk. :-( My only thought is that strace (which uses ptrace(2)) is reading the data from the kernel at some point after it has been corrupted, presumably at some point in the NFS parts of the kernel. The problem with this file goes away (MD5 matches that of the one in created in /tmp) if I change rsize & wsize from 8192 to 4096, but then other object files get corrupted instead. :-( We've tried this out on three nodes in the cluster, and they all corrupt the output file, so it's unlikely to be a particular hardware problem. What is hurting my brain is that there is a mirror of this cluster both in OS installs (identical RPMs of the OS, especially kernel, gcc, assembler & libraries were used) and in firmware (BIOS and firmware updates were from the same CD) where this problem does not occur at all. In both situations the NFS server is an AIX 5.2 box, it is possible that there are minor differences there, but I cannot see how a difference in the NFS server could affect the output of the assembler on the Linux box before it goes anywhere near hitting the wire, let alone making it to the NFS server. The mount options are identical (we've checked both /etc/fstab and /proc/mounts) and rpm -Va doesn't show any unusual discrepancies between the two clusters. OS: RHEL3 Kernel: kernel-smp-2.4.21-15.EL Binutils: binutils-2.14.90.0.4-35 NFS-utils: nfs-utils-1.0.6-21EL Hardware: IBM x335 and IBM x345 dual Xeons. cheers! Chris -- Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041008/1e5e9377/attachment.bin From srgadmin at cs.hku.hk Fri Oct 8 00:56:49 2004 From: srgadmin at cs.hku.hk (SRG Administrator) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] NPC2004: Call For Participation Message-ID: <41664841.5020303@cs.hku.hk> Call For Participation 2004 IFIP Internation Conference on Network and Parallel Computing (NPC 2004) http://grid.hust.edu.cn/npc04 **************************************************************************** INTRODUCTION: The goal of NPC 2004 is to establish an international forum for engineers and scientists to present their excellent ideas and experiences in all system fields of network and parallel computing. NPC 2004, hosted by Huazhong University of Science and Technology, will be held at Oct 18 - 20, 2004 in Wuhan, China. All accepted papers will be published by Springer-Verlag in the Lecture Notes in Computer Science Series (cited by SCI). There are many scenic spots and historical sites in Wuhan including the Yellow Crane Tower with the 1,700 years history, one of the three famous towers in South China, and the East Lake whose natural beauty rivals that of the West Lake in Hangzhou. The main topics of interest include, but not limited to: Parallel & Distributed Architectures Network Security Multimedia Streaming Services Performance Modeling/Evaluation Network Storage Middleware Frameworks and Toolkits Network & Interconnect Architecture Parallel Programming Environments and Tools Parallel & Distributed Applications/Algorithms Advanced Web and Proxy Services Peer-to-peer Computing Cluster & Grid Computing ************************************************************************** KEYNOTE SPEAKERS: Prof. Kai Hwang Director of Internet and Grid Computing Laboratory University of Southern California Topic: Secure Grid Computing with Trusted Resources and Internet Datamining Dr. Thomas Sterling Faculty Associate Center for Advanced Computing Research California Institute of Technology Topic: Towards Memory Oriented Scalable Computer Architecture and High Efficiency Petaflops Computing Prof. Jose A.B. Fortes Director of Advanced Computing and Information Systems (ACIS) Laboratory University of Florida Topic: In-VIGO: Making the grid virtually yours Dr. Robert Kuhn Intel Americas, Inc Topic: Productivity in HPC Clusters Dr. Mootaz Elnozahy IBM Topic: PERCS: IBM Effort in HPCS ************************************************************************ REGISTRATION FEE (ON-SITE FEES) Regular : Euro 400 or US$ 480 Student : Euro 200 or US $240 Accompany : Euro 150 or US $180 EXTRA ROCEEDINGS : EURO 100 (or US $ 120) FOR EVERY EXTRA PROCEEDINGS The registration form can be downloaded from the following address. http://grid.hust.edu.cn/npc04/download/registration-form-npc04.pdf ************************************************************************ VENUE The conference will be held in Wuhan Lake View Garden Hotel that is the only traditional ancient style hotel in Wuhan City to meet the international five- star hotel standard. It is located in the beautiful East Lake scenery site ,and the East Lake High Technological Development Zone. The nice environment and the convenient traffic make it the ideal accommodation for travelers. 200 all kinds of well-equipped rooms are delightfully decorated and offer an array of comforts and amenities. 
http://www.lakeviewgarden.com/english/ ************************************************************************ PROGRAM The detailed program can be found at http://grid.hust.edu.cn/npc04/program.htm ************************************************************************ TOURISM Wuhan, the capital of Hubei Province, is the largest city in Central China, with a population of over 7 million and an area of 8,467 square kilometers. It lies at the confluence of the Yangtze and Han rivers and is comprised of three towns--Wuchang, Hankou, and Hanyang--that face each other across the rivers and are linked by two bridges. A major junction of traffic and communication, it is the center of economy, culture and politics in Central China and is proud of metallurgy, automobiles, machinery and high-tech industries. A core of national air, water and land transportation it offers great potential for further development and foreign investment. Wuhan is rich in culture and history. Its civilization began about 3,500 years ago, and is of great importance in Chinese culture, military, economy and politics. It shares the same culture of Chu, formed since the ancient Kingdom of Chu more than 2,000 years ago. Numerous natural and artificial attractions and scenic spots are scattered around. Famous scenic spots in Wuhan include Yellow Crane Tower, Guiyuan Temple, East Lake, and Hubei Provincial Museum with the famous chimes playing the music of different styles. Yellow Crane Tower is the symbol of Wuhan. It is located on the Snake Hill in Wuchang, at the south bank of Yangtze River; it is called one of the three most famous towers in southern China, together with Yueyang Tower in Hunan Province and Tengwang Tower in Jiangxi Province. The East Lake is one of the first state scenic spots in the east of Wuhan. The lake covers an area of 33 square Km, and is the largest lake of a city throughout China. In 1999, it was granted by the State as National Civilized Scenic Spot Model Site. Moshan Hill located in the East Lake scenic spot, it is surrounded by the lake from the East, the West and the North. From East to West, it is 2,200 meter long; from North to South, about 500 meter broad. Moshan Hill has six peaks, with Chu culture as its subject, such as Chu Bazaar, Chu Heaven Platform, Chu Talents Park, are full of antique flavor and classic beauty of Chu culture. ************************************************************************ For more information, please contact the program vice-chair, Dr. Hai Jin Tel:+86-27-87543529 Fax:+86-27-87557354 Email:hjin@hust.edu.cn From janfrode at parallab.uib.no Fri Oct 8 12:51:32 2004 From: janfrode at parallab.uib.no (Jan-Frode Myklebust) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage In-Reply-To: References: Message-ID: <20041008195132.GC32425@ii.uib.no> On Thu, Oct 07, 2004 at 07:59:59PM -0400, Robert G. Brown wrote: > If it is trivial, and there is a simple package that does all this and > eats your meatloaf for you out to a PB of data, please enlighten > me...:-) Have you had a look at SRB -> http://www.npaci.edu/DICE/SRB/ ? Sounds to me like it fullfills all your requirements (except for the meatloaf part, but I could be wrong). -jf From jonathan.hujsak at baesystems.com Fri Oct 8 13:15:26 2004 From: jonathan.hujsak at baesystems.com (Hujsak, Jonathan T (US SSA)) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] 64bit comparisons Message-ID: <436A69503F99214D9EBC27415F66C40C3288E1@blums0008.sd.gd.com> Hi! 
We're looking at implementing a large G5 cluster here at BAE Systems. Have you gained any new 'lessons learned' since the communication below? Can you recommend a good version of MPI to use for these? We've been looking at MPICH, MPIPro and also the Apple xgrid... Thanks! Jonathan Hujsak BAE Systems San Diego Bill Broadley bill at cse.ucdavis.edu Fri May 14 11:48:21 PDT 2004 * Previous message: [Beowulf] 64bit comparisons * Next message: [Beowulf] 64bit comparisons * Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] _____ On Fri, May 14, 2004 at 09:44:01AM -0700, Robert B Heckendorn wrote: > One of the options we are strongly considering for our next cluster is > going with Apple X-servers. There performance is purported to be good Careful to benchmark both processors at the same time if that is your intended usage pattern. Are the dual-g5's shipping yet? Last I heard yield problems were resulting in only uniprocessor shipments. My main concern that despite the marketing blurb of 2 10GB/sec CPU interfaces or similar that there is a shared 6.4 GB/sec memory bus. > and their power consumption is small. Has anyone measured a dual g5 xserv with a kill-a-watt or similar? > Can people comment on any comparisons betwee Apple and (Athlon64 > or Opteron)? Personally I've had problems, I need to spend more time resolving them, things like: * Need to tweak /etc/rc to allow Mpich to use shared memory * Latency between two mpich processes on the same node is 10-20 times the linux latency. I've yet to try LAM. * Differences in semaphores requires a rewrite for some linux code I had * Difference in the IBM fortran compiler required a rewrite compared to code that ran on Intel's, portland group's, and GNU's fortran compiler. Given all that I'm still interested to see what the G5 is good at and under what workloads the G5 wins perf/price or perf/watt. -- Bill Broadley Computational Science and Engineering UC Davis -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041008/19c8e119/attachment.html From rgb at phy.duke.edu Fri Oct 8 14:16:28 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage In-Reply-To: <20041008195132.GC32425@ii.uib.no> References: <20041008195132.GC32425@ii.uib.no> Message-ID: On Fri, 8 Oct 2004, Jan-Frode Myklebust wrote: > On Thu, Oct 07, 2004 at 07:59:59PM -0400, Robert G. Brown wrote: > > > If it is trivial, and there is a simple package that does all this and > > eats your meatloaf for you out to a PB of data, please enlighten > > me...:-) > > Have you had a look at SRB -> http://www.npaci.edu/DICE/SRB/ ? Sounds > to me like it fullfills all your requirements (except for the > meatloaf part, but I could be wrong). Ah, but read: http://www.npaci.edu/dice/srb/srbOpenSource.html where it is clear that the answer to the question "is it GPL-level open source" (essentially free softare) is "no". Worse, it is one of those really evil packages that requires that you contact a University's "Technology Transfer staff" for anything but carefully prescribed kinds of usage (by Universities, basically). This kinds of licensing drives me somewhat wild, especially since in this particular project we/Duke (an academic institution) will be partnering with MCNC (a state funded center) and other area schools and universities. Oops. State funded centers have to dicker for the right to use the toolset. 
In spite of its impressive list of projects (and features), this makes it, as you say, a package that does NOT eat your meatloaf for you. This is the general idea of the project's data management package tool as well (and some others folks have pointed out) and I appreciate the reference. I just wish that Universities would stop taking software developed (generally) with generous support from federal and state grants and putting these silly "we want to make money from this" licenses. Just GPL them and do things right... Condor used to drive me nuts the same way. SGE ditto. PBS even more so. Tools like this need to be REAL open source, free like air, especially when it is almost dead certain that they began with all sorts of ideas and possibly code contributed by a free source community, built on top of free tools contributed by that community. rgb > > > -jf > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From lindahl at pathscale.com Fri Oct 8 14:47:12 2004 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] 64bit comparisons In-Reply-To: <436A69503F99214D9EBC27415F66C40C3288E1@blums0008.sd.gd.com> References: <436A69503F99214D9EBC27415F66C40C3288E1@blums0008.sd.gd.com> Message-ID: <20041008214712.GA3602@greglaptop.internal.keyresearch.com> On Fri, Oct 08, 2004 at 01:15:26PM -0700, Hujsak, Jonathan T (US SSA) wrote: > We're looking at implementing a large G5 cluster here at BAE Systems. > > Have you gained any new 'lessons learned' since the communication > below? Can you recommend a good version of MPI to use for these? Jonathan, The most important lesson learned for large clusters is that you should gain your own experience -- buy one of each potential node and run your apps on it. As for MPI implementations, it usually depends on the interconnect that you're planning on using. -- greg From landman at scalableinformatics.com Fri Oct 8 14:59:36 2004 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage In-Reply-To: References: <20041008195132.GC32425@ii.uib.no> Message-ID: On Fri, 8 Oct 2004, Robert G. Brown wrote: > This is the general idea of the project's data management package tool > as well (and some others folks have pointed out) and I appreciate the > reference. I just wish that Universities would stop taking software > developed (generally) with generous support from federal and state > grants and putting these silly "we want to make money from this" > licenses. Just GPL them and do things right... Unless you negotiate this as part of your employment package (and my understanding is that few universities are willing to give up their Bayh-Dole based rights to your work), that this probably won't happen. Notice the intense resistance from certain interested groups to the NIH-NCRR policy of requesting software developed with federal money to be open-source. University tech transfer folks were among the interested parties. I think what needs to evolve is a two pronged model ala mysql. If you are going to spin it out and turn it into a profit center, then by all means, pay for a license. 
If you are going to use it in research (not for products or derivative works), then GPL it (or similar). > Condor used to drive me nuts the same way. SGE ditto. PBS even more so. For some reason, Condor has not released their code. I find this odd. I thought they had. > > Tools like this need to be REAL open source, free like air, especially > when it is almost dead certain that they began with all sorts of ideas > and possibly code contributed by a free source community, built on top > of free tools contributed by that community. > Remember, the poor starving universities need to eat too... :( There are valid reasons to ask for money for software. There are valid reasons not to distribute everything gratis (GPL is *not* a business plan) and to constrain redistribution. These reasons make sense for businesses. Universities generally have a different mission than businesses (though arguably, Bayh-Dole has blurred this significantly). As with other employers, they own in most cases, everything you do. If you want to build a company based upon what you have done in your lab, you have to negotiate with the tech transfer office. Joe From rgb at phy.duke.edu Fri Oct 8 17:24:43 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage In-Reply-To: References: <20041008195132.GC32425@ii.uib.no> Message-ID: On Fri, 8 Oct 2004, Joe Landman wrote: > Remember, the poor starving universities need to eat too... :( Oh, I agree, I just dislike it immensely when University's start to resemble occult, tax protected venture capital investment firms with a built-in money laundering business in the form of "open and free" teaching and research. Supposedly they exist to teach and to do research in the most philosophical of senses, but for far too many of them the moment there is a sniff of money in a development they are all over it. For "patentable" technology, one can just barely justify this, at least historically. For software, which typically lives on a copyright, the university has no more business getting involved than it has trying to co-opt a publication to a learned journal and force the author to republish it with copyright belonging to the University for money. In practice, I think even the patent co-opting is middling Evil. Rather than even partnering with the actual developer who likely had the idea, did all the groundwork, got the grant, so that the University MADE MONEY (likely money exceeding the developer's salary) from every step of the process at basically no risk to themselves, they just assert, sorry, this belongs to us now and we'll give back some tiny fraction of anything we make from it to your research program, if you are fortunate enough to get tenure and keep your grants and still be working here when we do. But for software, especially software developed by academics and grant-paid employees in association with federally funded projects, this kind of nonsense is just unforgiveable. One thing I like about Duke is that they understand the clear benefit to open source software, and more or less insist that stuff developed by systems staff that is reusable be GPL or equivalent. But then you run into a place that doesn't and is grasping mine mine mine... while freely using the pieces WE contribute back to the GPL pool. > There are valid reasons to ask for money for software. There are valid > reasons not to distribute everything gratis (GPL is *not* a business > plan) and to constrain redistribution. 
These reasons make sense for > businesses. Universities generally have a different mission than > businesses (though arguably, Bayh-Dole has blurred this significantly). One would hope. > > As with other employers, they own in most cases, everything you do. If > you want to build a company based upon what you have done in your lab, you > have to negotiate with the tech transfer office. "Negotiate" isn't exactly the word -- generally it is laid out pretty clearly in the faculty and staff bylaws. Any negotiations had better start before you even start the project, and to keep something you may have to formally leave the University before you start or risk their just taking it no matter when or how you finish. Greed is a universal human trait, I guess, even in the Ivory Tower. Doesn't mean I have to like it. rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From alvin at Mail.Linux-Consulting.com Fri Oct 8 22:20:10 2004 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage - vendors In-Reply-To: Message-ID: hi ya On Thu, 7 Oct 2004, Robert G. Brown wrote: > On Thu, 7 Oct 2004, Mark Hahn wrote: ... > > vendor capacity size $Cad list per TB density > > dell/emc 12x250 3U $7500 1.0TB/U > > apple 14x250 3U $4000 1.166 > > hp/msa1500cs 12x250x4 10U $3850 1.2 1Us w/ 8 drives 3 * 8 * 250 3U $ 9K /6TB 2.0TB/U Blades w/ 4drive 10 * 4 * 250 4U $20K /10TB 2.5TB/U * plain old 1Us with 8 drives per 1U allows ( 2.0TB per 1U ) http://linux-1u.net/Dwg/jpg.sm/c2610.jpg * 10 mini-itx blades per 4U chassis w/ 4 disks ( 1TB per blade, 10 blades per 4U chassis ) http://itx-blades.net * adding FC cards will increase the system costs :-) - the FC/SAN market is a fairly tight market and very expensive > > (divide $Cad by 1.25 or so to get $US.) all three plug into FC. > > the HP goes up to 8 shelves per controller or 24 TB per FC port, though. ... > The cheapest solutions are those you build yourself, BTW -- as one might > expect -- followed by ones that a vendor assembles for you, followed in > order by proprietary/named solutions that require special software or > special software and special hardware. "costs" are usually based on "vendor name recognition" compared to the raw costs of parts and the closed market of competitors selling their widgets ( the cost of parts is minimal compared to their retail pricing ) > describe their off-the-shelf solutions that scale to 100+ TB sizes, and > was directed to e.g. http://www.archive.org/web/petabox.php) but think > of me as being hypercautious in my already admitted ignorance;-) their design is also based on their ability to use rs232 to log into the adjacent box if it goes down for some reason, but rs232 might not work if the power failed or the machine didn't boot to get to the init level to turn on agetty/uugetty ---- it'd be good to have a 2nd 100TB backup subsystem ... as it's not trivial to backup and restore ( from bare metal ) that amount of data and you want to be certain you don't lose yesterday's or last week's data due to today's faulty backup c ya alvin From hanzl at noel.feld.cvut.cz Fri Oct 8 14:33:11 2004 From: hanzl at noel.feld.cvut.cz (hanzl@noel.feld.cvut.cz) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage and cachefs on nodes?
Message-ID: <20041008233311G.hanzl@unknown-domain> Anybody using cachefs(-alike) and local disks on nodes for reboot-persistent cache of huge central storage? (I periodically and obsessively repeat this poll, with a negative answer so far; obviously I am the only person with data storage needs perverted this way. Given the recent interest in storage, I dare to ask again...) Thanks Vaclav Hanzl From laurence at scalablesystems.com Fri Oct 8 18:20:02 2004 From: laurence at scalablesystems.com (Laurence Liew) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] 64bit comparisons In-Reply-To: <436A69503F99214D9EBC27415F66C40C3288E1@blums0008.sd.gd.com> References: <436A69503F99214D9EBC27415F66C40C3288E1@blums0008.sd.gd.com> Message-ID: <41673CC2.5060605@scalablesystems.com> Hi Jonathan 1) Buy 1 or 2 nodes of each platform and test with your apps to see - do they work on G5/Linux or G5/Mac OS - which gives best price/performance 2) MPI - depends on your budget for the interconnect - Quadrics, Myrinet, Infiniband are all candidates - your performance requirements and budget will determine which one suits best 3) IO - notice you did not mention anything about IO - spend some time thinking about IO - depending on your needs you may need a parallel filesystem or a simple NAS Hope this helps. Cheers! Laurence Hujsak, Jonathan T (US SSA) wrote: > Hi! > > We're looking at implementing a large G5 cluster here at BAE Systems. > > Have you gained any new 'lessons learned' since the communication > below? Can you recommend a good version of MPI to use for these? > > We've been looking at MPICH, MPIPro and also the Apple xgrid? > > Thanks! > > Jonathan Hujsak > BAE Systems > San Diego > > Bill Broadley bill at cse.ucdavis.edu > Fri May 14 11:48:21 PDT 2004 > > On Fri, May 14, 2004 at 09:44:01AM -0700, Robert B Heckendorn wrote: > >> One of the options we are strongly considering for our next cluster is >> going with Apple X-servers. Their performance is purported to be good > > > > Careful to benchmark both processors at the same time if that is your > > intended usage pattern. Are the dual-g5's shipping yet? Last I heard > > yield problems were resulting in only uniprocessor shipments. My main > > concern is that despite the marketing blurb of 2 10GB/sec CPU interfaces > > or similar that there is a shared 6.4 GB/sec memory bus. > > >> and their power consumption is small. > > > > Has anyone measured a dual g5 xserv with a kill-a-watt or similar? > > >> Can people comment on any comparisons between Apple and (Athlon64 >> or Opteron)? > > > > Personally I've had problems, I need to spend more time resolving them, > > things like: > > * Need to tweak /etc/rc to allow Mpich to use shared memory > > * Latency between two mpich processes on the same node is 10-20 times the > > linux latency. I've yet to try LAM. > > * Differences in semaphores required a rewrite for some linux code I had > > * Difference in the IBM fortran compiler required a rewrite compared to code > > that ran on Intel's, portland group's, and GNU's fortran compiler.
> > > > Given all that I'm still interested to see what the G5 is good at and under > > what workloads the G5 wins perf/price or perf/watt. > > > > -- > > Bill Broadley > > Computational Science and Engineering > > UC Davis > > > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- ========================================== Visit us at Supercomputing2004. Booth #400 ========================================== Laurence Liew, CTO Email: laurence@scalablesystems.com Scalable Systems Pte Ltd Web : http://www.scalablesystems.com (Reg. No: 200310328D) 7 Bedok South Road Tel : 65 6827 3953 Singapore 469272 Fax : 65 6827 3922 From jrajiv at hclinsys.com Fri Oct 8 21:55:34 2004 From: jrajiv at hclinsys.com (Rajiv) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] HPC in Windows Message-ID: <01ee01c4adbc$3618d330$39140897@PMORND> Dear All, Are there any Beowulf packages for windows? Regards, Rajiv -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041009/eb3fc4ef/attachment.html From jrajiv at hclinsys.com Fri Oct 8 21:54:34 2004 From: jrajiv at hclinsys.com (Rajiv) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Application Deployment Message-ID: <01df01c4adbc$1229c240$39140897@PMORND> Dear All, Is there any software available for application deployment- both linux and windows. I would like to install packages from master to all the clients through a management console. Regards, Rajiv -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041009/315c2948/attachment.html From clwang at cs.hku.hk Fri Oct 8 22:57:52 2004 From: clwang at cs.hku.hk (Cho Li Wang) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] CFP: CCGrid2005 (Cardiff, UK) Message-ID: <41677DE0.5070006@csis.hku.hk> CLUSTER COMPUTING AND GRID (CCGrid 2005) http://www.cs.cf.ac.uk/ccgrid2005/ 9-12 May 2005 Cardiff, UK ****************************************************************** IMPORTANT DATE: Paper submission: November 15, 2004 ****************************************************************** SCOPE ===== Commodity-based clusters and Grid computing technologies are rapidly developing, and are key components in the emergence of a novel service-based fabric for high capability computing. Cluster-powered Grids not only provide access to cost-effective problem-solving power, but also promise to enable a more collaborative approach to the use of distributed resources, and new economic products and services. CCGrid2005, sponsored by the IEEE Computer Society (final approval pending), is designed to bring together international leaders who are pioneering researchers, developers, and users of clusters, networks, and Grid architectures and applications. The symposium will also serve as a forum to present the latest work, and highlight related activities from around the world. 
CCGrid2005 is interested in topics including, but not limited to: o Hardware and Software (based on PCs, Workstations, SMPs or Supercomputers) o Middleware for Clusters and Grids o Dynamic Optical Network Architectures for Grid Computing o Parallel File Systems, including wide area file systems, and Parallel I/O o Scheduling and Load Balancing o Programming Models, Tools, and Environments o Performance Evaluation and Modeling o Resource Management and Scheduling o Computational, Data, and Information Grid Architectures and Systems o Grid Economies, Service Architectures, and Resource Exchange Architectures o Grid-based Problem Solving Environments o Scientific, Engineering, and Commercial Grid Applications o Portal Computing / Science Portals TECHNICAL PAPER SUBMISSION ========================== Authors are invited to submit papers of not more than 8 pages of double column text using single spaced 10 point size type on 8.5 x 11 inch pages, as per IEEE 8.5 x 11 manuscript guidelines, see http://www.computer.org/cspress/instruct.htm. Authors should submit a PostScript (level 2) or PDF file that will print on a PostScript printer. Paper submission instructions will be placed on this webpage (http://www.cs.cf.ac.uk/ccgrid2005). It is expected that the proceedings will be published by the IEEE Computer Society Press, USA. POSTER SUBMISSION ================= Authors may also submit short papers, of no more than 4 pages, of double column text using single spaced 10 point size type on 8.5 x 11 inch pages, as per IEEE 8.5 x 11 manuscript guidelines, see http://www.computer.org/cspress/instruct.htm. Authors should submit a PostScript (level 2) or PDF file that will print on a PostScript printer. Paper submission instructions will be placed on this webpage (http://www.cs.cf.ac.uk/ccgrid2005). Please contact the Poster's chair -- Dr Yan (Coral) Huang -- if you have queries. Dr Huang can be reached at: yan.huang@cs.cardiff.ac.uk IMPORTANT DATES =============== Paper Submission November 15, 2004 Notification January 10, 2005 Final (Camera Ready) February 9, 2005 Version SPECIAL EVENTS ============== Those wishing to organize workshops, present tutorials on emerging topics or participate in the industry track are invited to send the following information to: Workshops: workshops-ccgrid2005@cs.cf.ac.uk, Tutorials: tutorials-ccgrid2005@cs.cf.ac.uk, or Industry Track: industrytrack-ccgrid2005@cs.cf.ac.uk. COMMITTEES ========== Honorary Chair -------------- Tony Hey, EPSRC, UK Conference Chairs ----------------- David W. Walker, Cardiff University, UK Carl Kesselman, USC/ISI, US Programme Committee Chair ------------------------- Omer F. Rana, Cardiff University, UK Programme Committee Vice-Chairs ------------------------------- Jack Dongarra, University of Tenneesee, US Luc Moreau, University of Southampton, UK Sven Graupner, HP Labs, US Peter Sloot, University of Amsterdam, The Netherlands Craig Lee, The Aerospace Corporation, US Publications Chair ------------------ Rajkumar Buyya, University of Melbourne, Australia Workshops Chair --------------- Craig Lee, Aerospace Corporation, US Publicity Chairs ---------------- Vladimir Getov, University of Westminster, UK (Europe) Marcin Paprzycki, Oaklahoma State University, US (Europe) C. L. 
Wang, University of Hong Kong (Asia Pacific) Ken Hawick, Massey University, New Zealand (Asia Pacific) Manish Parashar, Rutgers University, US (America) Tutorials Chair --------------- Michael Gerndt, TU Munich, Germany Industry Track Chair -------------------- Alistair Dunlop, OMII, UK Exhibits Chair -------------- Steven Newhouse, OMII, UK Posters Chair ------------- Yan Huang, Cardiff University, UK Finance Chair ------------- John Oliver, Welsh eScience Centre, UK Registration Chair ------------------ Tracey Lavis, Cardiff University, UK Local Arrangements Chair ------------------------ Linda Wilson, Welsh eScience Centre, UK PROGRAMME COMMITTEE ------------------- Seif Haridi, KTH Stockholm, Sweden Bruno Schulze, Laboratsrio Nacional de Computagco Cientmfica, Brazil David Abramson, Monash University, Australia Steven Willmott, Universitat Polithcnica de Catalunya, Spain Xian-He Sun, Illinois Institute of Technology, US Yun-Heh (Jessica) Chen-Burger, University of Edinburgh, UK Thilo Kielmann, Vrije Universiteit, The Netherlands Brian Matthews, RAL/CCLRC and Oxford Brookes University, UK Maozhen Li, Brunel University, UK Greg Astfalk, HP Labs, US Marty Humphrey, University of Virginia, US Geoffrey Fox, University of Indiana, US Martin Berzins, University of Leeds, UK Hai Jin, Huazhong University of Science and Technology, China Giovanni Chiola, Universita' di Genova, Italy Domenico Talia, Universita' della Calabria/ICAR-CNR, Italy Josi Cunha, Universidade Nova de Lisboa, Portugal Ron Perrott, Queens University Belfast, UK Ewa Deelman, ISI/USC, US Stephen Jarvis, Warwick University, UK Niclas Andersson, Linkvping University, Sweden Putchong Uthayopas, Kasetsart University, Thailand John Morrison, University College Cork, Ireland Stephen Scott, Oak Ridge National Lab, US Luciano Serafini, ITC-IRST, Italy David A. Bader, University of New Mexico, US Mark Baker, University of Portsmouth, UK Emilio Luque, Universitat Autrnoma de Barcelona, Spain Akhil Sahai, HP Labs, US Gregor von Laszewski, Argonne National Lab, US Fethi Rabhi, University of New South Wales, Sydney, Australia Fabrizio Petrini, Los Alamos National Lab, US Kate Keahey, Argonne National Lab, US Sergei Gorlatch, Universitdt M|nster, Germany Brian Tierney, Lawrence Berkeley National Lab, US Rauf Izmailov, NEC Labs, US Stephen J. Turner, Nanyang Technological University, Singapore Savas Parastatidis, University of Newcastle, UK Elias Houstis, University of Thessaly, Greece -- and Purdue University, US Karl Aberer, EPFL, Switzerland Rolf Hempel, DLR, Germany Anne Elster, NTNU, Norway Artur Andrzejak, Zuse Institute Berlin, Germany Jennifer Schopf, Argonne National Laboratory, US John Gurd, University of Manchester, UK Domenico Laforenza, ISTI/CNR, Italy Wolfgang Rehm, TU Chemnitz, Germany Gabriel Antoniu, IRISA, France From janfrode at parallab.uib.no Sat Oct 9 02:08:54 2004 From: janfrode at parallab.uib.no (Jan-Frode Myklebust) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage In-Reply-To: References: <20041008195132.GC32425@ii.uib.no> Message-ID: <20041009090854.GB21880@ii.uib.no> On Fri, Oct 08, 2004 at 05:16:28PM -0400, Robert G. Brown wrote: > > Ah, but read: > > http://www.npaci.edu/dice/srb/srbOpenSource.html Ouch! Thanks for pointing this out. 
-jf From gustavo at martinelli.etc.br Sat Oct 9 12:49:10 2004 From: gustavo at martinelli.etc.br (Gustavo Gobi Martinelli) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] rsh don't see the real variables Message-ID: <1097351349.416840b60272e@www.martinelli.etc.br> I'm trying to make pvm 3.4.5 work, but I'm having a problem with "rsh". If I execute the command: # rsh 192.168.0.2 'set' I see a list of variables that is not the one defined in the root user's .bash_profile. But if I execute this: # rsh 192.168.0.1 the login occurs and I can execute # set Now I can see the variable that I need. What is happening? Where can I declare the variables so that they appear with the " rsh 192.168.0.1 'set' " command? Because of this, PVM doesn't work: it needs to see the $PVM_ROOT variable, which exists in .bash_profile but not in the "rsh" session. Does anyone know anything about this? I'm using Fedora Core 2 with kernel 2.6.7. -- Atenciosamente, Gustavo Gobi Martinelli Linux User# 270627 From alvin at Mail.Linux-Consulting.com Sat Oct 9 14:43:26 2004 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage - storage cache In-Reply-To: Message-ID: hi ya i deleted the email and decided to reply to the prev post about disk cache - have you checked into webcache/file cache apps ?? - those $5K - $15K apps will cache your files on their hw ... your local clients would fetch their data from the local disk cache, 2x - 100x faster than going across to the far away colo on the internet ( it's intended for making your far away colo look like it's in your local lan ) - it's sorta like a fancy "file" proxy or fancy version control that moves data around behind the scenes - its capacity is limited to the disk space in its cache c ya alvin file cache apps... riverbed.com actona.com ( now cisco ) tacitnetworks.com From rgb at phy.duke.edu Sat Oct 9 15:11:01 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] HPC in Windows In-Reply-To: <01ee01c4adbc$3618d330$39140897@PMORND> References: <01ee01c4adbc$3618d330$39140897@PMORND> Message-ID: On Sat, 9 Oct 2004, Rajiv wrote: > Dear All, > Are there any Beowulf packages for windows? Not that I know of. In fact, the whole concept seems a bit oxymoronic, as the definition of a beowulf is a cluster supercomputer running an open source operating system. However, there are windows based clusters, and there are parallel libraries that will compile and work on windows based LANs. A lot of things will be more difficult, as Windows is missing a few million moving parts that are standard everyday fare under an *nix OS (like xterms, shells, secure remote logins) UNLESS you pay for them or build them yourself where open source versions exist. Back when I still used WinXX, one could find a small suite of *nix-alike tools, but back then WinXX was still based on DOS. So, you can certainly use windows machines in a cluster (or a grid) if you can manage the hassle of paying for all the operating systems, compilers, associated tools to facilitate remote login and shell operations. Or, you can just use linux on all those systems, at worst on a dual boot basis. Run Windows by day, a linux cluster by night. These days, Open Office (and a few other packages) renders most linux boxes so copacetic that they can coexist in a Windows environment and a WinXX user can learn to do pretty much everything they need to do under Linux (and in a GUI) in a day or two. rgb > > Regards, > Rajiv Robert G.
Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Sat Oct 9 15:24:32 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Application Deployment In-Reply-To: <01df01c4adbc$1229c240$39140897@PMORND> References: <01df01c4adbc$1229c240$39140897@PMORND> Message-ID: On Sat, 9 Oct 2004, Rajiv wrote: > Dear All, > Is there any software available for application deployment- both linux > and windows. I would like to install packages from master to all the > clients through a management console. Again, I don't know about Windows -- most software running on a WinXX box will be proprietary, and simply cannot be installed in the way you describe without either a lot of knowledge and/or Windows-specific tools. After all, you've got all those CD codes and serial numbers and other proprietary bullshit to manage, and you're at serious risk of lawsuit if you fail to manage them perfectly. Windows does remote install these days, as I understand it, although I doubt that it remotely approaches kickstart in its ease of use and transparency. In linux, there are a variety of solutions, depending on whether you use RPMs or Debian. With Red Hat and descendents (Fedora, Centos) you can use kickstart, which is a lovely tool for installing clusters. Kickstart run on top of PXE and DHCP makes installing most systems a matter of turning them on (after making a single host specific MAC address entry in a table or two, and even this can be automated). It makes reinstalling non-servers at any time just as easy. Servers, of course, require knowledge, experience, wisdom, and time to do right, which is why sysadmins get paid and are worth a very decent salary. To install packages from a master to clients, there are both shrink wrapped tools and general approaches. For fairly obvious reasons, I'd suggest yum and possibly a mix of rsync and any of the packages that let you execute an ordinary shell command on a list of hosts). This is both because yum manages dependencies for you and because once installed it also manages automatic updates and even upgrades from your repository. Altering a client configuration is often just a matter of e.g. making an entry in a table that is read in by a script that calls yum update whatever, that is itself pushed out in an installed rpm that yum updates. Add an entry to the table and wait a day, or use the shell distribution tool if you are in a hurry. I don't know about "management consoles", though. Again, this sounds WinXX-ish -- you're hoping for something to hide all the detail of several distributions, packaging systems, software installation tools, and operating systems and still make them all work transparently for you without your needing to know what they are doing. Not in this Universe, at least not unless you pay a real expert a lot of money (for their software) and are willing to live with something that doesn't work horribly well at best anyway. Your best bet is to learn the specific systems you're working with well enough to make them dance through hoops, and not to rely on interfaces that are very expensive to maintain (and which to my experience NEVER work anyway). rgb > Regards, > Rajiv Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From landman at scalableinformatics.com Sat Oct 9 15:24:14 2004 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Storage and cachefs on nodes? In-Reply-To: <20041008233311G.hanzl@unknown-domain> References: <20041008233311G.hanzl@unknown-domain> Message-ID: <4168650E.8050800@scalableinformatics.com> hanzl@noel.feld.cvut.cz wrote: >Anybody using cachefs(-alike) and local disks on nodes for >reboot-persistent cache of huge central storage? > > Not really, though we have a package under development that might address some aspect of this. Contact me offline if you want to hear more. We are in early stages of this work, so it will be a while before it is ready. > >(I periodically and obsessively repeat this poll with negative answer >ever so far, obviously I am the only person with data storage needs >perverted this way. Given the recent interest in storage, I dare to >ask again...) > >Thanks > >Vaclav Hanzl >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From rgb at phy.duke.edu Sat Oct 9 15:32:14 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] rsh don't see the real variables In-Reply-To: <1097351349.416840b60272e@www.martinelli.etc.br> References: <1097351349.416840b60272e@www.martinelli.etc.br> Message-ID: On Sat, 9 Oct 2004, Gustavo Gobi Martinelli wrote: > > I'm trying to make the pvm 3.4.5 work, but I'm having a problem with "rsh". > > if I execute the command: > > # rsh 192.168.0.2 'set' > > I will see a list of variables that isn't in the .bash_profile of root user. > > But, if I execute this: > > # rsh 192.168.0.1 > > the login occurs and I can execute > > # set > > Now, I can see the variable that I need. > > What did happen? How can I find the local where I can declare the variables that > will be appear with " rsh 192.168.0.1 'set' " command? > > Because this, the PVM doesn't work. It needs see $PVM_ROOT variable that exists > in .bash_profile but not in "rsh" session. > > Someone know anything about it? I'm using the Fedora Core 2 with kernel 2.6.7. Use ssh, and look into the environment commands. rsh has many flaws, one of which is a failure to pass environment variables at all sanely from the calling host. So this is one way to proceed. Also note (from man bash): When bash is invoked as an interactive login shell, or as a non-inter- active shell with the --login option, it first reads and executes com- mands from the file /etc/profile, if that file exists. After reading that file, it looks for ~/.bash_profile, ~/.bash_login, and ~/.profile, in that order, and reads and executes commands from the first one that exists and is readable. The --noprofile option may be used when the shell is started to inhibit this behavior. In other words, there is a difference between the behavior of an interactive shell (what you get when you execute rsh hostname to log in) and a non-interactive shell -- they actually read and execute different .??* files in a different order. In fact, you can control the order to some extent with the call syntax. 
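For example (a minimal sketch only -- I believe the Fedora pvm rpm puts things under /usr/share/pvm3, but check your own install): bash, at least as built on Red Hat-ish systems, tries to detect when it has been started by rshd or sshd to run a remote command and reads ~/.bashrc in that case, so that is usually the right file for PVM_ROOT and friends:

  # in ~/.bashrc on every node -- read by the non-interactive shell that
  # rshd/sshd starts for "rsh host 'command'", unlike ~/.bash_profile
  export PVM_ROOT=/usr/share/pvm3
  export PVM_ARCH=LINUX
  export PATH=$PATH:$PVM_ROOT/lib:$PVM_ROOT/bin/$PVM_ARCH

  # then, from the master, verify that a non-interactive shell sees it:
  rsh 192.168.0.2 'echo $PVM_ROOT'     # should print /usr/share/pvm3
  rsh 192.168.0.2 'set | grep PVM'     # compare with a full interactive login

ssh behaves the same way for "ssh host command", so the same ~/.bashrc entry covers both.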
Assuming that you're using bash, you might read the man page carefully and experiment -- if you put the requisite environment variable definitions in the right place you should still be able to have them initialized even over rsh, as long as you don't have to pass them via the remote shell itself. If you do, you'll NEED to look into ssh in more detail. HTH, rgb > > -- > Atenciosamente, > Gustavo Gobi Martinelli > Linux User# 270627 > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Sat Oct 9 15:39:06 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] rsh don't see the real variables In-Reply-To: <1097351349.416840b60272e@www.martinelli.etc.br> References: <1097351349.416840b60272e@www.martinelli.etc.br> Message-ID: On Sat, 9 Oct 2004, Gustavo Gobi Martinelli wrote: > > I'm trying to make the pvm 3.4.5 work, but I'm having a problem with "rsh". One more comment. pvm under Red Hat * (and now under Fedora Core 2) IS a shell script (in /usr/bin/pvm). It should set PVM_ROOT correctly for you, automagically, whether or not it is set in your original or nodes shells, as PVM is started somewhere and used to build a virtual cluster IF you invoke PVM by name on the default path. This can probably still screw up with some ways you might use PVM, but there SHOULD be ways to do your project that don't require a PVM_ROOT to be spelled out in your node shells. rgb > > if I execute the command: > > # rsh 192.168.0.2 'set' > > I will see a list of variables that isn't in the .bash_profile of root user. > > But, if I execute this: > > # rsh 192.168.0.1 > > the login occurs and I can execute > > # set > > Now, I can see the variable that I need. > > What did happen? How can I find the local where I can declare the variables that > will be appear with " rsh 192.168.0.1 'set' " command? > > Because this, the PVM doesn't work. It needs see $PVM_ROOT variable that exists > in .bash_profile but not in "rsh" session. > > Someone know anything about it? I'm using the Fedora Core 2 with kernel 2.6.7. > > -- > Atenciosamente, > Gustavo Gobi Martinelli > Linux User# 270627 > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From james.p.lux at jpl.nasa.gov Sat Oct 9 16:56:22 2004 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] Application Deployment References: <01df01c4adbc$1229c240$39140897@PMORND> Message-ID: <000801c4ae5b$957423d0$33a8a8c0@LAPTOP152422> ----- Original Message ----- From: "Robert G. Brown" To: "Rajiv" Cc: Sent: Saturday, October 09, 2004 3:24 PM Subject: Re: [Beowulf] Application Deployment > On Sat, 9 Oct 2004, Rajiv wrote: > > > Dear All, > > > Is there any software available for application deployment- both linux > > and windows. 
I would like to install packages from master to all the > > clients through a management console. > > Again, I don't know about Windows -- most software running on a WinXX > box will be proprietary, and simply cannot be installed in the way you > describe without either a lot of knowledge and/or Windows-specific > tools. After all, you've got all those CD codes and serial numbers and > other proprietary bullshit to manage, and you're at serious risk of > lawsuit if you fail to manage them perfectly. Windows does remote > install these days, as I understand it, although I doubt that it > remotely approaches kickstart in its ease of use and transparency. Actually, Windows DOES provide a central management capability, with fairly good control of the client images, management of those pesky licensing issues, etc. It's called SMS (Systems Management Server), and it's been around for about 10 years (at least), and in its current form is a godsend for people who have to manage all those thousands of WinXX desktops in big companies. With the huge volume of patches required to keep a Windows environment working, you'd have to have something like it. Ever since the earliest Windows versions that supported networking, there have been ways to do centralized network installs (I can't remember if Windows for Workgroups did it, but certainly, the first versions of Win NT did) and fairly automated rollouts of new software versions. Typically, the documentation came in the "resource kits", and lately, would be in things like "enterprise version back office resource kits". Often, if you forked out the kilobuck for a Visual whatever development kit, it would include all that stuff (along with the driver development kit, the SDK, etc., ) I might note that part of the incentive behind the ".NET" initiative is to simplify the whole configuration management across an enterprise scale installation. Part of it is a "late binding" to the services/components that your application needs, and the ability for your application to be insensitive to just how that component becomes available. Not incidentally, of course, the architecture includes ways to manage charging for the use of a component, both on a traditional licensing model, and on a per use model, and probably all manner of complex ways in between. Let's see, I want to watch a HDTV movie, so I have a monthly license for the decompression engine, a per use/per view license for the movie, a per day license for the ability to stop/start/rewind the movie, all with complex cross costing among the various and sundry providers (Now, for 3 days only, watch Star Trek XXI without paying decompression license fees, with the purchase of a Nokia cellphone (available only in areas served by Adelphia, within 6 miles of a qualifying retail outlet, 3 year contract required, mail-in rebate may take 6-8 months to process, void where prohibited, see etc.etc. for details.) You're just not going to get it at your local Comp-USA or download it off the web. And, it's not free or even particularly cheap (although, compared to the cost of all those licenses for the desktops, it's quite reasonable.) And, in a somewhat limited version, it's fairly inexpensive (so that developers can readily develop their software to fit within the MS Windows deployment model). Heck, you can probably even go and test your application for free on a Windows cluster at Microsoft. They used to provide the Jolt cola while you were at the facility testing for free as well, and may still well do. 
Qualitatively, managing 1000 computers (particularly with identical configurations) under Windows is probably not much more difficult than doing it under Linux. In both cases, you're going to need some training and/or experience to make it work. > > which is why sysadmins get paid and are worth a very decent salary. In both the Windows and *nix world, this is true. > > I don't know about "management consoles", though. Again, this sounds > WinXX-ish -- you're hoping for something to hide all the detail of > several distributions, packaging systems, software installation tools, > and operating systems and still make them all work transparently for you > without your needing to know what they are doing. > > Not in this Universe, at least not unless you pay a real expert a lot of > money (for their software) and are willing to live with something that > doesn't work horribly well at best anyway. Your best bet is to learn > the specific systems you're working with well enough to make them dance > through hoops, and not to rely on interfaces that are very expensive to > maintain (and which to my experience NEVER work anyway). Modern, enterprise scale Windows installations do this just as well as Linux, to a first order. Don't judge the large scale management capabilities of Windows by the consumer Windows experience. And, as far as cost goes, I suspect that configuration and software management costs for large Windows installations are not much different than for *nix. Both require training, expertise, etc. Sure, for Linux, the actual software is free, but that's a small fraction of the $100K/yr you're paying to folks to use the software. Microsoft is VERY aware of where their bread is buttered, and has worked VERY hard to make sure that managing 1000 Windows desktops (or server farms) in a corporate enviroment isn't a whole lot more difficult or expensive than managing 1000 Linux boxes. The last thing MS wants to hear from a Fortune 500 CIO is that they are dumping Windows for Linux because of the management costs. As you point out, the desktop "office productivity" applications are typically just as good under Linux as under Windows. It's a very different model from the consumer world, where once you've bought the heavily discounted box with the manufacturer OEM install of the OS, you're really on your own. In the cluster arena, things are different yet. Typically, people want pedal to the metal speed, they don't give a rat's fuzzy behind for the office productivity tools, and they're going to write all their production code themselves, so they want something that is conceptually simple to work with (OS interface wise), they generally have no need for sophisticated digital rights management and revenue schemes, etc. > > rgb > > > Regards, > > Rajiv > > Robert G. Brown http://www.phy.duke.edu/~rgb/ > Duke University Dept. of Physics, Box 90305 > Durham, N.C. 
27708-0305 > Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gustavo at martinelli.etc.br Sat Oct 9 18:39:44 2004 From: gustavo at martinelli.etc.br (Gustavo Gobi Martinelli) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] rsh don't see the real variables In-Reply-To: References: <1097351349.416840b60272e@www.martinelli.etc.br> Message-ID: <1097372384.416892e05302f@www.martinelli.etc.br> Robert, > Use ssh, and look into the environment commands. rsh has many flaws, > one of which is a failure to pass environment variables at all sanely > from the calling host. So this is one way to proceed. rsh doesn't pass environment variables; it picks up the variables on the remote host. And I need to know where I can declare these variables, because they are different from the ones in .bash_profile and /etc/profile. > > Also note (from man bash): > > When bash is invoked as an interactive login shell, or as a > non-inter- > active shell with the --login option, it first reads and executes > com- > mands from the file /etc/profile, if that file exists. After > reading > that file, it looks for ~/.bash_profile, ~/.bash_login, and > ~/.profile, > in that order, and reads and executes commands from the first one > that > exists and is readable. The --noprofile option may be used when > the > shell is started to inhibit this behavior. > > In other words, there is a difference between the behavior of an > interactive shell (what you get when you execute rsh hostname to log in) > and a non-interactive shell -- they actually read and execute different > .??* files in a different order. I will study it. > In fact, you can control the order to some extent with the call syntax. PVM executes its own commands; I don't have any control over that, so I can't change the syntax. > Assuming that you're using bash, you > might read the man page carefully and experiment -- if you put the > requisite environment variable definitions in the right place you should > still be able to have them initialized even over rsh, I will create the other files and test them. > as long as you > don't have to pass them via the remote shell itself. I don't have to pass variables over the rsh session; it executes the remote code using the variables on that host. > If you do, you'll > NEED to look into ssh in more detail. I will look into that too. > HTH, > rgb Thanks for the help. From gustavo at martinelli.etc.br Sat Oct 9 18:57:04 2004 From: gustavo at martinelli.etc.br (Gustavo Gobi Martinelli) Date: Wed Nov 25 01:03:27 2009 Subject: [Beowulf] rsh don't see the real variables In-Reply-To: References: <1097351349.416840b60272e@www.martinelli.etc.br> Message-ID: <1097373424.416896f038293@www.martinelli.etc.br> Robert, > One more comment. pvm under Red Hat * (and now under Fedora Core 2) IS > a shell script (in /usr/bin/pvm). It should set PVM_ROOT correctly for > you, automagically, whether or not it is set in your original or nodes > shells, as PVM is started somewhere and used to build a virtual cluster > IF you invoke PVM by name on the default path. Yes, I didn't include the pvm path in the default path. But I remember that the PATH I saw in the rsh session was incomplete compared with the original. I need to know where these variables are declared.
> This can probably still screw up with some ways you might use PVM, but > there SHOULD be ways to do your project that don't require a PVM_ROOT to > be spelled out in your node shells. I agree, but I have to make PVM work quickly. I will try other ways. Thanks again Gustavo Gobi Martinelli From mark.westwood at ohmsurveys.com Sun Oct 10 05:52:18 2004 From: mark.westwood at ohmsurveys.com (mark.westwood@ohmsurveys.com) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Re: HPC in Windows In-Reply-To: <01ee01c4adbc$3618d330$39140897@PMORND> References: <01ee01c4adbc$3618d330$39140897@PMORND> Message-ID: Hi Rajiv I don't know about Windows-based clusters, but you might want to check out Beowulf Cluster Computing with Windows edited by Thomas Sterling MIT Press, 2001 The book runs to 488 pages so must have something to say on the topic. I have the companion volume Beowulf Cluster Computing with Linux and would recommend that as a good introduction to the topic. Regards Mark Westwood OHM Ltd Rajiv writes: > Dear All, > Are there any Beowulf packages for windows? > > Regards, > Rajiv From hahn at physics.mcmaster.ca Sun Oct 10 08:43:11 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Application Deployment In-Reply-To: Message-ID: > RPMs or Debian. With Red Hat and descendents (Fedora, Centos) you can > use kickstart, which is a lovely tool for installing clusters. > Kickstart run on top of PXE and DHCP makes installing most systems a > matter of turning them on (after making a single host specific MAC don't forget the zero-install approach - nothing installed on nodes. just export the nodes' root filesystem from a fileserver, and you never have to do anything per-node. yum and rpm both let you install within a separate tree, so the fileserver doesn't need to be running the same config as the nodes. obviously, this results in a certain amount of NFS traffic, as opposed to having those files installed on the node's disk. issues: - diskless nodes are very attractive in many contexts: reliability, price, maintainability, etc. - running NFS-root is a way of tolerating local disk faults; lack of swap may or may not be a problem. - NFS can easily be faster than local disk IO. - in aggregate, a bunch of diskless nodes will, in the worst case, create much more traffic than your net and fileserver can handle. - my experience so far with 50-100-node clusters is that a single NFS-connected fileserver is actually pretty good. (our nodes have a local disk used for things like checkpoints of big parallel applications.) - for big MPI clusters, it's extremely attractive to put fileservers directly onto the MPI fabric. suddenly, gigabit is no longer a limiter for file IO and systems like Lustre can give some pretty impressive data rates. - this scheme is probably optimal for very heterogeneous datacenters as well, where you might boot a node in some random OS purely for a particular user/app. that kind of thing seems very dubious to me, but it would only take a few minutes of perl scripting to write a web frontend to select things like IP, distro, kernel, server, etc for a particular node, and propagate the changes. I think that for a small cluster, I'd consider having the nodes with full installs on them. for anything larger than say 4 nodes, I definitely prefer the root-on-fileserver approach with "ephemeral" nodes. it's also pretty sexy to take a node out of the box, plug it in and have it accept jobs in a minute or so with no manual intervention.
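to make that concrete, here is roughly what the server-side plumbing looks like for one such node -- every MAC, address and path below is invented for illustration, the exact syntax depends on your dhcpd/pxelinux versions, and the kernel needs NFS-root support built in:

  # /etc/dhcpd.conf on the boot server: hand the node an IP and point it
  # at pxelinux
  subnet 192.168.1.0 netmask 255.255.255.0 {
    next-server 192.168.1.1;                # tftp server
    filename "pxelinux.0";
    host node01 {
      hardware ethernet 00:11:22:33:44:55;
      fixed-address 192.168.1.101;
    }
  }

  # /tftpboot/pxelinux.cfg/default: boot a kernel whose root lives on NFS
  DEFAULT linux
  LABEL linux
    KERNEL vmlinuz
    APPEND initrd=initrd.img ip=dhcp root=/dev/nfs nfsroot=192.168.1.1:/export/nodes/node01 ro

  # /etc/exports: one root tree per node (or a mostly read-only shared one)
  /export/nodes/node01  192.168.1.101(rw,no_root_squash,sync)

  # maintain the node's tree from the fileserver, without the fileserver
  # having to run the node's configuration itself:
  yum --installroot=/export/nodes/node01 -y install openssh-server

the nice property is that "installing" a node is then just a MAC entry plus a directory on the fileserver; the box itself stays generic and replaceable.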
> course, require knowledge, experience, wisdom, and time to do right, > which is why sysadmins get paid and are worth a very decent salary. hmm. anyone for a cluster-admin salary survey? regards, mark hahn. From hahn at physics.mcmaster.ca Sun Oct 10 08:55:54 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Application Deployment In-Reply-To: <000801c4ae5b$957423d0$33a8a8c0@LAPTOP152422> Message-ID: > require training, expertise, etc. Sure, for Linux, the actual software is > free, but that's a small fraction of the $100K/yr you're paying to folks to > use the software. very interesting. one structural disadvantage that the windows ecosystem does labor under is that it must stick to the OS-really-installed-on-desktop model. that is, msft is not quite ready to go to a ephemeral-client model, where desktops just PXE-boot and mount everything of consequence across the net. (not just thin-client, where clients are all hard-installed, but use only a thin app like a browser for whatever the user needs.) with lan-ipmi and pxe, it's almost reasonable to claim that support doesn't scale with increasing nodes. there are still costs that scale with number of users, number of apps. hardware maintenance always scales with number of moving parts, but for ephemeral clients, it's far easier to have spares. the infrastructure to support 1K clients all booting monday morning would be nontrivial, but very tractable. no doubt the lack of nazi DRM (uncontrolled and dangerous network!) is why the msft community hasn't taken this approach. From james.p.lux at jpl.nasa.gov Sun Oct 10 10:26:50 2004 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Application Deployment References: Message-ID: <000601c4aeee$545a73c0$33a8a8c0@LAPTOP152422> ----- Original Message ----- From: "Mark Hahn" To: Sent: Sunday, October 10, 2004 8:55 AM Subject: Re: [Beowulf] Application Deployment > > require training, expertise, etc. Sure, for Linux, the actual software is > > free, but that's a small fraction of the $100K/yr you're paying to folks to > > use the software. > > very interesting. one structural disadvantage that the windows ecosystem > does labor under is that it must stick to the OS-really-installed-on-desktop > model. that is, msft is not quite ready to go to a ephemeral-client model, > where desktops just PXE-boot and mount everything of consequence across > the net. (not just thin-client, where clients are all hard-installed, but > use only a thin app like a browser for whatever the user needs.) I'm not sure, but MS is certainly heading towards the ephemeral client (with local cacheing) model, since it enables such things as revenue based on a per use of a component basis. Say you're a small software developer and you've developed a really nifty piechart algorithm for Excel. MS wants to give you a way that you could generate revenue from each use of this component, and that sort of implies that the component is fetched from some repository on the fly. Same for "notepad" or you name it. I think they're desperately trying to get away from the "transfer of tangible property" for software, because sooner or later, shrink wrap licenses are going to get hammered in court (if it looks, walks, and talks like a sale, then it IS a sale, and you should be able to resell, etc., freely). On the other hand, if each and every time you use the component (be it "MS Word", that clever Excel chart, etc.) 
you are separately engaging in a revenue transaction, then you don't get into those sticky areas. > > with lan-ipmi and pxe, it's almost reasonable to claim that support > doesn't scale with increasing nodes. there are still costs that scale > with number of users, number of apps. hardware maintenance always > scales with number of moving parts, but for ephemeral clients, it's > far easier to have spares. the infrastructure to support 1K clients > all booting monday morning would be nontrivial, but very tractable. > > no doubt the lack of nazi DRM (uncontrolled and dangerous network!) > is why the msft community hasn't taken this approach. Precisely so... MS, and the legions of developers who develop for the Windows environment, generally want some mechanism to be paid for their work. Per use revenue is a nice way of getting around the "bootleg copy" problem. Who cares if you copy it, if every time it runs, you have to hit a license server and pay your little micropayment. In fact, bootlegs are great... they cost the originator of the software nothing. It's the whole compatibility, configuration managment thing that is a big problem (all those components have to be compatible with all the other components, etc.). MS, for all of their faults, doesn't have stupid people working for it. If they could find a better way to "sell" software (or, more properly, the added value provided by the software/content/what-have-you) that doesn't rely on copyright (which everyone admits is poorly suited to such things), they'd love it. From landman at scalableinformatics.com Sun Oct 10 11:22:32 2004 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Application Deployment In-Reply-To: <000601c4aeee$545a73c0$33a8a8c0@LAPTOP152422> References: <000601c4aeee$545a73c0$33a8a8c0@LAPTOP152422> Message-ID: <41697DE8.2090301@scalableinformatics.com> Jim Lux wrote: >Precisely so... > >MS, and the legions of developers who develop for the Windows environment, >generally want some mechanism to be paid for their work. > (please note: not intended to be flame bait or trolling) hrm... so do those of us who try to survive and grow in companies that specialize in the Linux environment. As the owner of 1.5 such companies, I really would like to see them thrive and grow, and this requires getting paid for our work. I have had some customers wish to freely share our work with others, which I take as an indication of the value of the stuff; it wouldn't be shared if it did not have value/merit, but as I said before, I have to pay the developers. Can't pay them with a check that reads "3000 goodwill dollars, not redeemable at your mortgage company, or at the food store, but you sure made some of our `customers' happy". As I have said privately to others, the GPL is not a business model. Moreover, the ivory tower concept of giving away the source to sell the consulting seems to work for very few groups, if any (I can think of one, mySQL AB). As it is unlikely that there will be millions of linux clusters out there, the MySQL model of leveraging the needs of a huge installed user base will not work here. Most of us who are betting their families well being and livelihood on this, would like to be able to earn a living from this. There is nothing wrong with asking for money for the value you provide. Most of the consumers of the OSS stuff do so for varied reasons. One of the larger components is the "Free as in beer"* approach to cost containment. 
The control is nice, but as I see it, control is not what is driving the adoption of Linux based systems. Its TCO. If it costs you $2500 for that RHAT install, it doesn't cost you per client connecting to it. This immediately puts it at a (tremendous) cost advantage over any MSFT based solution. Add to this that this is a real market, with several competing vendors with mostly overlapping offerings. Hence there is real competition. Prices are held in check (to a degree). (* I know, I know, not too much free beer out there ...) MSFT has (at last check) the wrong licensing model for their software for clusters. They would need to change it in order to make it sensible from a financial perspective, to deploy such clusters. As this is a tiny market compared to their major market, I think that the needed changes will not happen. I could be wrong (and I think the MSFT HPC folks read this stuff in stealth mode, so feel free to correct me on/offline). >Per use revenue is >a nice way of getting around the "bootleg copy" problem. Who cares if you >copy it, if every time it runs, you have to hit a license server and pay >your little micropayment. In fact, bootlegs are great... they cost the >originator of the software nothing. It's the whole compatibility, >configuration managment thing that is a big problem (all those components >have to be compatible with all the other components, etc.). > > The per use view presumes you are using a consumable resource in some sense, and then attaching a value to it. It reminds me of the innumerable requests for registration going on now, coupled with the "would you like to view this article? only 1.95$USD right now..." I see linked from various news sites. Of course, apart from electrical power, it is hard to understand what resource you are consuming when this occurs. I suspect that consumer backlash against this will likely halt this march. I do see a subscription model becoming far more likely, whereby for a fixed (continuous) fee, you get access to content (much like a magazine, but software content in this case). I think people generally would be more accepting of this model than a micropayment per click. >MS, for all of their faults, doesn't have stupid people working for it. If >they could find a better way to "sell" software (or, more properly, the >added value provided by the software/content/what-have-you) that doesn't >rely on copyright (which everyone admits is poorly suited to such things), >they'd love it. > > Winston Churchill had something to say about this being the worst possible model, apart from the others. Of course the context was different, but generally the idea is correct. Software companies have a model that forces continuous "innovation" in order to maintain an upgrade cycle, and therefore get revenues flowing "continuously". The problem for them is convincing you to upgrade. Why upgrade if it works well enough? So they need to even out their revenue, get it more predictable, and break the upgrade cycle. Oddly, by breaking the upgrade cycle, you can spend more time fixing stuff, and less time inventing new broken stuff. Similar to what OSS gives you (to a degree). The important part (for business types) is that your revenue is now much more predictable. Now do something interesting (which I have not seen done yet by MSFT, though I expect in in short order). Change the acquisition model to that of a subscription. So instead of paying $500 for an install with a free set of patches, pay $50 to acquire the base + $100/year of subscription. 
Roll the next "versions" out in phases, with inter-function dependencies rather than entire version dependencies. The software becomes the platform. Talk about lock-in. There will be no upgrade cycle to contend with. Changes can be made quite modular. New features better tested and rolled in in an evolutionary manner. Brand new functionality could be created into different subscription paths. Copying and "pirating" would be encouraged (you need that subscription after all) as each machine would require its own subscription to function. Rough guess, but I would bet on something much like this emerging. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 612 4615 From james.p.lux at jpl.nasa.gov Sun Oct 10 12:02:44 2004 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Application Deployment References: <000601c4aeee$545a73c0$33a8a8c0@LAPTOP152422> <41697DE8.2090301@scalableinformatics.com> Message-ID: <000601c4aefb$c8fe7480$33a8a8c0@LAPTOP152422> ----- Original Message ----- From: "Joe Landman" To: "Jim Lux" Cc: "Mark Hahn" ; Sent: Sunday, October 10, 2004 11:22 AM Subject: Re: [Beowulf] Application Deployment > > > Jim Lux wrote: > > >Precisely so... > > > >MS, and the legions of developers who develop for the Windows environment, > >generally want some mechanism to be paid for their work. > > > (please note: not intended to be flame bait or trolling) > > > > hrm... so do those of us who try to survive and grow in companies that > specialize in the Linux environment. Indeed, this IS true as well. I suppose it comes down to what you actually want to pay for. In the generalized (oversimplified) Linux/GPL/free as in beer model, the income stream comes from providing support, etc. while the software development is provided altruistically (or, as advertising, good will garnering). The U.S.Gov't and various EU institutions (ESA) pay for development of all sorts of software, some of which has general usefulness. USGS map data might be a good example here. > > As I have said privately to others, the GPL is not a business model. > Moreover, the ivory tower concept of giving away the source to sell the > consulting seems to work for very few groups, if any (I can think of > one, mySQL AB). As it is unlikely that there will be millions of linux > clusters out there, the MySQL model of leveraging the needs of a huge > installed user base will not work here. Very true, I think. Fortunately, there are people who are willing to work with stars in their eyes. > > Most of us who are betting their families well being and livelihood on > this, would like to be able to earn a living from this. There is > nothing wrong with asking for money for the value you provide. Most of > the consumers of the OSS stuff do so for varied reasons. One of the > larger components is the "Free as in beer"* approach to cost > containment. The control is nice, but as I see it, control is not what > is driving the adoption of Linux based systems. Its TCO. If it costs > you $2500 for that RHAT install, it doesn't cost you per client > connecting to it. This immediately puts it at a (tremendous) cost > advantage over any MSFT based solution. Add to this that this is a real > market, with several competing vendors with mostly overlapping > offerings. Hence there is real competition. Prices are held in check > (to a degree). 
However, in the enterprise market, I think that MSFT is holding their own (yes, partly by anticompetitive practices, I suspect) in the TCO area. I don't know the details of how MS licenses thousand unit installations, but I'll bet it's not a "per copy, per year" basis. More likely, it's a "you have X thousand computers, so you fit in our 3000-6000 unit bulk rate". The last thing MS wants to have to do is count desktops and audit the numbers. This is how MS deals with OEM consumer mfrs.. you shipped X number of computers (a publically available number), so send us a check for Y*X dollars. We don't really care if you installed Win on them or not. > > (* I know, I know, not too much free beer out there ...) > > MSFT has (at last check) the wrong licensing model for their software > for clusters. They would need to change it in order to make it sensible > from a financial perspective, to deploy such clusters. As this is a > tiny market compared to their major market, I think that the needed > changes will not happen. I think you're right. The relatively small number of cases involved could be handled by special deals with PR tie-in. I could be wrong (and I think the MSFT HPC > folks read this stuff in stealth mode, so feel free to correct me > on/offline). > > > > >Per use revenue is > >a nice way of getting around the "bootleg copy" problem. Who cares if you > >copy it, if every time it runs, you have to hit a license server and pay > >your little micropayment. In fact, bootlegs are great... they cost the > >originator of the software nothing. It's the whole compatibility, > >configuration managment thing that is a big problem (all those components > >have to be compatible with all the other components, etc.). > > > > > > The per use view presumes you are using a consumable resource in some > sense, and then attaching a value to it. It reminds me of the > innumerable requests for registration going on now, coupled with the > "would you like to view this article? only 1.95$USD right now..." I see > linked from various news sites. I would never maintain that the price someone is willing to pay for a commodity is related to its intrinsic value. Something is worth what someone is willing to pay for it. I pay (grumpily) $7 to watch a movie at the theater. I don't imagine that the marginal cost to show the movie to me is even close to $7. > > Of course, apart from electrical power, it is hard to understand what > resource you are consuming when this occurs. I suspect that consumer > backlash against this will likely halt this march. I do see a > subscription model becoming far more likely, whereby for a fixed > (continuous) fee, you get access to content (much like a magazine, but > software content in this case). I think people generally would be more > accepting of this model than a micropayment per click. But people are willing to pay "per click" for things like SMS messages and phone calls. (There are various "bulk purchase" schemes for both...500 free minutes, etc., but I think those have more importance from a marketing standpoint than from a revenue standpoint. Industries with "cost reimbursement" models (most gov't contractors, gov't agencies, health care, etc.) are very attracted to a "dollars per click" model, because it allows easy allocation of costs to multiple accounts. > > Now do something interesting (which I have not seen done yet by MSFT, > though I expect in in short order). Change the acquisition model to > that of a subscription. 
So instead of paying $500 for an install with a > free set of patches, pay $50 to acquire the base + $100/year of > subscription. Roll the next "versions" out in phases, with > inter-function dependencies rather than entire version dependencies. > The software becomes the platform. > > Talk about lock-in. There will be no upgrade cycle to contend with. > Changes can be made quite modular. New features better tested and > rolled in in an evolutionary manner. Brand new functionality could be > created into different subscription paths. Copying and "pirating" would > be encouraged (you need that subscription after all) as each machine > would require its own subscription to function. > > Rough guess, but I would bet on something much like this emerging. This IS really the .NET model... Get away from software being "WinXX compatible" and to "generalized Windows platform compatible", with a steady revenue stream, just like the gas company. From hvidal at tesseract-tech.com Sun Oct 10 09:01:37 2004 From: hvidal at tesseract-tech.com (H.Vidal, Jr.) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] SATA vs SCSI drives Message-ID: <41695CE1.80507@tesseract-tech.com> Hello all. We are building some Network Area Storage gear around some high-end imaging and data acq. systems. Reliability for storage of this data is a big time must. To date, we have built all of this lab's gear around SCSI drives because it has been our research and experience that SCSI drives are better built than IDE drives. However, when looking at these drive arrays and NAS appliances, it is very clear that SATA drives are really driving large scale storage. What has been the general experience on this list of SATA vs SCSI in terms of performance, reliability, quoted as well as real-world failure rates, etc? Which SATA drives are considered 'the best' the way, say Seagate drives are held in high esteem for SCSI? And, if anybody likes any particular RAID and/or NAS system, let's hear your stories. About 1.4-1.7 Terabyte raw space. Thanks for your collective help and attention. Hernando Vidal, Jr. Tesseract Technology From kallio at ebi.ac.uk Sun Oct 10 13:10:01 2004 From: kallio at ebi.ac.uk (Kimmo Kallio) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Storage and cachefs on nodes? In-Reply-To: <20041008233311G.hanzl@unknown-domain> Message-ID: I can guarantee you are not the only one interested in this... we've been even semi-seriously thinking of implementing this ourself, but there is never enough time as usual. I've been toying on the idea of extending Linux (memory) buffer cache to utilise local disk as raw block device instead of going through a filesystem. This wouldn't be reboot-persistent, but wouldn't suffer from filesystem corruption or such meaning that there would be no extra (human) management overhead on individual machines. However, I haven't gotten as far as finding out if this is realistically doable or not. As for the storage in general we use NetApp filers and their DNFS (NetCache) caching devices. It's reliable but doesn't come cheap. We are looking for distributed filesystems (Lustre, Terragrid, ...) to complement the existing setup. Regards, Kimmo Kallio, Europen Bioinformatics Institute On Fri, 8 Oct 2004 hanzl@noel.feld.cvut.cz wrote: > Anybody using cachefs(-alike) and local disks on nodes for > reboot-persistent cache of huge central storage? 
> > > (I periodically and obsessively repeat this poll with negative answer > ever so far, obviously I am the only person with data storage needs > perverted this way. Given the recent interest in storage, I dare to > ask again...) > > Thanks > > Vaclav Hanzl > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From landman at scalableinformatics.com Sun Oct 10 15:26:59 2004 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Application Deployment In-Reply-To: <000601c4aefb$c8fe7480$33a8a8c0@LAPTOP152422> References: <000601c4aeee$545a73c0$33a8a8c0@LAPTOP152422> <41697DE8.2090301@scalableinformatics.com> <000601c4aefb$c8fe7480$33a8a8c0@LAPTOP152422> Message-ID: <4169B733.7040303@scalableinformatics.com> Jim Lux wrote: >----- Original Message ----- > > > >>As I have said privately to others, the GPL is not a business model. >>Moreover, the ivory tower concept of giving away the source to sell the >>consulting seems to work for very few groups, if any (I can think of >>one, mySQL AB). As it is unlikely that there will be millions of linux >>clusters out there, the MySQL model of leveraging the needs of a huge >>installed user base will not work here. >> >> > >Very true, I think. Fortunately, there are people who are willing to work >with stars in their eyes. > > Stars don't pay (my) bills. :( [...] >>> >>> >>The per use view presumes you are using a consumable resource in some >>sense, and then attaching a value to it. It reminds me of the >>innumerable requests for registration going on now, coupled with the >>"would you like to view this article? only 1.95$USD right now..." I see >>linked from various news sites. >> >> > >I would never maintain that the price someone is willing to pay for a >commodity is related to its intrinsic value. Something is worth what >someone is willing to pay for it. I pay (grumpily) $7 to watch a movie at >the theater. I don't imagine that the marginal cost to show the movie to me >is even close to $7. > > Value in this case != marginal cost. Value is that difficult to define aspect of something. The marginal cost of viewing the movie has to be quite low, far below the $7 you pay. The "value" in the case is "entertainment" (though I leave that to other threads) from the movie. That is, you make a judgment in your mind before parting with the money that the thing you are buying serves some need that you ascribe a "value" to. In the case of movie tickets, it is a simple binary system; either it has value or it does not (e.g. your need to be entertained). If you have a problem similar to Robert's storage issue, what is the value to you (not the marginal cost) of solving it? E.g. will other projects be delayed? Value includes opportunity cost, and many other softer calculcations/guesstimates. > > >>Of course, apart from electrical power, it is hard to understand what >>resource you are consuming when this occurs. I suspect that consumer >>backlash against this will likely halt this march. I do see a >>subscription model becoming far more likely, whereby for a fixed >>(continuous) fee, you get access to content (much like a magazine, but >>software content in this case). I think people generally would be more >>accepting of this model than a micropayment per click. >> >> > >But people are willing to pay "per click" for things like SMS messages and >phone calls. 
(There are various "bulk purchase" schemes for both...500 free > > I am subscribing to a service in such a way so as not to pay per minute. I don't SMS (I simply do not see the value in it, and would welcome someone explaining this to me (offline)). I am not talking about basic paging or blackberry stuff (the latter being quite cool). I don't mind this recurring cost. >minutes, etc., but I think those have more importance from a marketing >standpoint than from a revenue standpoint. Industries with "cost >reimbursement" models (most gov't contractors, gov't agencies, health care, >etc.) are very attracted to a "dollars per click" model, because it allows >easy allocation of costs to multiple accounts. > > > ok... I anthropomorphised. "There are more business models in the economy Joseph, than are dreamt of in your philosophy." I stand ... er ... sit... corrected. > > > >>Now do something interesting (which I have not seen done yet by MSFT, >>though I expect in in short order). Change the acquisition model to >>that of a subscription. So instead of paying $500 for an install with a >>free set of patches, pay $50 to acquire the base + $100/year of >>subscription. Roll the next "versions" out in phases, with >>inter-function dependencies rather than entire version dependencies. >>The software becomes the platform. >> >>Talk about lock-in. There will be no upgrade cycle to contend with. >>Changes can be made quite modular. New features better tested and >>rolled in in an evolutionary manner. Brand new functionality could be >>created into different subscription paths. Copying and "pirating" would >>be encouraged (you need that subscription after all) as each machine >>would require its own subscription to function. >> >>Rough guess, but I would bet on something much like this emerging. >> >> > >This IS really the .NET model... >Get away from software being "WinXX compatible" and to "generalized Windows >platform compatible", with a steady revenue stream, just like the gas >company. > > > > > Yup. I started looking at Mono to see if it made sense to start targetting commercial apps for it. Still not sure, but it is getting there. Not HPC apps, but user interfaces and other things. What I remember from a committee I sat on about a decade ago about how to handle cost distribution/sharing was "chargeback kills usage". Per usage fees were not appealing to most end users we spoke with (academe). This of course suggests adaptive business micro-models for software in specific market contexts (government, academe, industry). -- Joe From bropers at cct.lsu.edu Sun Oct 10 15:38:46 2004 From: bropers at cct.lsu.edu (Brian D. Ropers-Huilman) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Application Deployment In-Reply-To: References: Message-ID: <4169B9F6.3050101@cct.lsu.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Mark Hahn said the following on 2004-10-10 10:43: | | - NFS can easily be faster than local disk IO. | How so? Under what configurations, versions, etc.? - -- Brian D. Ropers-Huilman .::. Manager .::. HPC and Computation Center for Computation & Technology (CCT) bropers@cct.lsu.edu Johnston Hall, Rm. 
350 +1 225.578.3272 (V) Louisiana State University +1 225.578.5362 (F) Baton Rouge, LA 70803-1900 USA http://www.cct.lsu.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) iD8DBQFBabn2wRr6eFHB5lgRAot3AJ42WJ3vN3rGXrf01BTTkcmwur6lAACcC29H HY22RxHrcAzXU7c/LmiwwY8= =YRfV -----END PGP SIGNATURE----- From george at galis.org Sun Oct 10 16:44:14 2004 From: george at galis.org (George Georgalis) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] SATA vs SCSI drives In-Reply-To: <41695CE1.80507@tesseract-tech.com> References: <41695CE1.80507@tesseract-tech.com> Message-ID: <20041010234414.GB777@trot.local> On Sun, Oct 10, 2004 at 12:01:37PM -0400, H.Vidal, Jr. wrote: >Which SATA drives are considered 'the best' the way, say Seagate drives are >held in high esteem for SCSI? > >And, if anybody likes any particular RAID and/or NAS system, let's hear >your stories. About 1.4-1.7 Terabyte raw space. I've heard these are a good value http://www.winsys.com/products/flata.php If you build your own, the 3com controllers can be had under $400 and are said to be quite good. I'm booting SATA with a $35 addonics controller on a workstation -- which I consider as reliable, faster and cheaper than ATA. But that setup wasn't without difficulty setting up. // George -- George Georgalis, systems architect, administrator Linux BSD IXOYE http://galis.org/george/ cell:646-331-2027 mailto:george@galis.org From hanzl at noel.feld.cvut.cz Sun Oct 10 15:10:46 2004 From: hanzl at noel.feld.cvut.cz (hanzl@noel.feld.cvut.cz) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Storage and cachefs on nodes? In-Reply-To: References: <20041008233311G.hanzl@unknown-domain> Message-ID: <20041011001046L.hanzl@unknown-domain> >> Anybody using cachefs(-alike) and local disks on nodes for >> reboot-persistent cache of huge central storage? > > I can guarantee you are not the only one interested in this... > ... > ... Europen Bioinformatics Institute Great, thanks. I always believed that this data access pattern must appear in bioinformatics. > ... we've been even semi-seriously thinking of implementing this > ourself, but there is never enough time as usual. People start to implement this again and again but none of the small nice projects seems to survive in long term. > We are looking for distributed filesystems (Lustre, Terragrid, ...) Problem with most huge projects going this way is that they involve special server while many users could be quite happy with just a special client (NFS client with local filesystem cache and certain degree of filesystem semantics screwup). Most discussions on this topic end by "It can be done, if you need it, just implement it". But the real question is how to implement it and let it survive in long term - across changing kernel versions etc. I think persistent file caching should be as independent as it can get, using standard commodity server and being careful to minimize dependencies in client. Solaris cachefs looked good from this point. I am not sure how much can I expect from linux cachefs as seen in e.g. 2.6.9-rc3-mm3 - if I got it right, it is a kernel subsystem with intra-kernel API, being now tested with AFS and intended as usable for NFS. It is however "low" on NFS team priority list. So linux cachefs might provide cleaner solutions than Solaris cachefs - if it ever provides them. 
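Just to make the client-side idea concrete, the Solaris cachefs setup I have in mind is only a couple of commands (a sketch; the export fileserver:/data and the cache directory /local/cachedir are just example names):

    cfsadmin -c /local/cachedir                 # create the on-disk cache (example path)
    mount -F cachefs -o backfstype=nfs,cachedir=/local/cachedir \
          fileserver:/data /data                # fileserver:/data is a placeholder export

Something that simple on the client - persistent across reboots and needing nothing special on the server - is what I would hope the Linux cachefs work eventually offers NFS users.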
Regards Vaclav Hanzl From daniel.kidger at quadrics.com Mon Oct 11 00:12:34 2004 From: daniel.kidger at quadrics.com (Dan Kidger) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Application Deployment In-Reply-To: <4169B9F6.3050101@cct.lsu.edu> References: <4169B9F6.3050101@cct.lsu.edu> Message-ID: <200410110812.34738.daniel.kidger@quadrics.com> On Sunday 10 October 2004 11:38 pm, Brian D. Ropers-Huilman wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Mark Hahn said the following on 2004-10-10 10:43: > | - NFS can easily be faster than local disk IO. > > How so? Under what configurations, versions, etc.? easy - Fileserver is RAID and/or Lustre and you have a high bandwidth network. This can easily outstrip the IO available from a single local disk. This point also crops up when people do ftp (or scp) performance tests across their high BW network (like our QsNet). Unfortunately they end up measuring and hence reporting the bandwidth of the disks at either end. :-( Daniel. -------------------------------------------------------------- Dr. Dan Kidger, Quadrics Ltd. daniel.kidger@quadrics.com One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 ----------------------- www.quadrics.com -------------------- From hahn at physics.mcmaster.ca Mon Oct 11 10:23:15 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] SATA vs SCSI drives In-Reply-To: <41695CE1.80507@tesseract-tech.com> Message-ID: > imaging and data acq. systems. Reliability for storage of this data is > a big time must. good - reliability is easy. the ubiquity of raid has made inherent drive reliability less of a critical factor. > To date, we have built all of this lab's gear around SCSI drives because it > has been our research and experience that SCSI drives are better built > than IDE drives. research = web opinions? there are some differences between traditional scsi products and everything else. I don't think most customers, even ones who've tried to inform themselves, understand what the differences really are. basically, SCSI is and has always been driven by "enterprise" database needs. for instance, higher RPM is not a way to get higher bandwidth - indeed, higher density, lower-RPM disks often deliver higher bandwidth, and in any case, bandwidth is easily scaled by striping. the real differences have more to do with expected duty cycle and lifespan. again, the enterprise DB market expects to be able to do max seeks/second for the full service life, 24x365.2425x5. your use almost certainly does not involve constant, maxed-out activity. as such, your solution should not use parts designed and priced for that. > However, when looking at these drive arrays and NAS > appliances, it is very clear that SATA drives are really driving large scale > storage. enterprise DB's are not driven by density, whereas the rest of the market is. there are good technical reasons for this divergence (more heads means slower seeks; greater density means slower seeks). > What has been the general experience on this list of SATA vs SCSI in terms > of performance, reliability, quoted as well as real-world failure rates, > etc? somewhat higher infant mortality due to a lower-quality supply chain for low-end (*ata) disks. in use, failure rates more or less in keeping with the drive's mtbf or warranty. > Which SATA drives are considered 'the best' the way, say Seagate drives are > held in high esteem for SCSI?
your statement about Seagate is pure aesthetics, and I very much doubt that there's a clear taste preference for Seagate. (for instance Fujitsu, HGST and Maxtor all make drives of equal quality/reliability/performance.) > And, if anybody likes any particular RAID and/or NAS system, let's hear > your stories. About 1.4-1.7 Terabyte raw space. small servers like this are no longer much of a challenge or interest. just slap 8x250G sata disks into a box, raid5 with one hot spare, and relax. personally, I'm fond of Promise s150tx4 controllers since they're cheap and effective. any 3-5yr warranty disk from seagate/maxtor/hgst/wd will work perfectly fine. yes, of course the box should have decent airflow, and a hefty power supply, but none of that is hard anymore (it also helps that disks themselves have become cooler.) regards, mark hahn. From andrew at ceruleansystems.com Mon Oct 11 11:08:09 2004 From: andrew at ceruleansystems.com (J. Andrew Rogers) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] SATA vs SCSI drives Message-ID: <1097518089.32071@whirlwind.he.net> I purchase many, many terabytes of disk array every year, and use a mixture of SCSI and (S)ATA depending on the specific application. My observations and experiences with the current crop of technology: SCSI attached to a decent RAID controller (e.g. LSI MegaRAID) will generally outperform a roughly equivalent SATA array for many purposes, and if you have money to burn you can build significantly faster arrays. This is due to a combination of physically faster drives and mature drive and controller implementations that work very well together. That said, for single-process access, streaming, and similar, the performance is largely similar. A 10k SATA array will perform about as well as a 10k SCSI array in most cases. For applications that are bound by access/seek times (e.g. databases), SCSI still seems to have substantially more throughput in practice. The bandwidth issue is almost a non-issue in my experience, as you'll run into access/seek limitations first for most apps. So to summarize, they are mostly differentiated by the effective access/seek throughput; SATA is the cheaper choice if you aren't significantly bound by this parameter. And as SATA firmware in both the drives and controllers improves, and fast SCSI drive hardware is adapted to SATA interfaces, I expect this gap to close. It hasn't closed yet, but in a couple years I expect it will be. j. andrew rogers From jlb17 at duke.edu Mon Oct 11 06:06:18 2004 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] SATA vs SCSI drives In-Reply-To: <20041010234414.GB777@trot.local> References: <41695CE1.80507@tesseract-tech.com> <20041010234414.GB777@trot.local> Message-ID: On Sun, 10 Oct 2004 at 7:44pm, George Georgalis wrote > On Sun, Oct 10, 2004 at 12:01:37PM -0400, H.Vidal, Jr. wrote: > >Which SATA drives are considered 'the best' the way, say Seagate drives are > >held in high esteem for SCSI? > > > >And, if anybody likes any particular RAID and/or NAS system, let's hear > >your stories. About 1.4-1.7 Terabyte raw space. > > I've heard these are a good value > http://www.winsys.com/products/flata.php > > If you build your own, the 3com controllers can be had under $400 and are ^^^^ I think you meant "3ware" there... > said to be quite good. I'm booting SATA with a $35 addonics controller on > a workstation -- which I consider as reliable, faster and cheaper than > ATA. But that setup wasn't without difficulty setting up. 
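For anyone who wants to try Mark's "8x250G SATA, RAID5 with one hot spare" recipe using Linux software RAID instead of a hardware controller, a minimal sketch (assuming the eight drives appear as /dev/sdb through /dev/sdi; device names and filesystem choice are just examples):

    mdadm --create /dev/md0 --level=5 --raid-devices=7 --spare-devices=1 \
          /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi
    mkfs -t ext3 /dev/md0        # xfs or reiserfs would do just as well
    mount /dev/md0 /export       # /export is an example mount point

Seven active 250 GB spindles in RAID5 give roughly 1.5 TB usable, which lands in the 1.4-1.7 TB range Hernando asked about.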
-- Joshua Baker-LePain Department of Biomedical Engineering Duke University From laurenceliew at yahoo.com.sg Sat Oct 9 23:00:09 2004 From: laurenceliew at yahoo.com.sg (Laurence Liew) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] HPC in Windows In-Reply-To: <01ee01c4adbc$3618d330$39140897@PMORND> References: <01ee01c4adbc$3618d330$39140897@PMORND> Message-ID: <4168CFE9.7010903@yahoo.com.sg> Yes.. You can download the Windows HPC package.. go to Microsoft and look for it. If you are a Microsoft partner... you can register and attend the MS Windows HPC training classes held at Cornell University. Speak with your Microsoft rep and they will be able to arrange it. Laurence Rajiv wrote: > Dear All, > Are there any Beowulf packages for windows? > > Regards, > Rajiv > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- A non-text attachment was scrubbed... Name: laurenceliew.vcf Type: text/x-vcard Size: 150 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041010/8ead9717/laurenceliew.vcf From iwao at rickey-net.com Sun Oct 10 09:59:42 2004 From: iwao at rickey-net.com (Iwao Makino) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] HPC in Windows In-Reply-To: <01ee01c4adbc$3618d330$39140897@PMORND> References: <01ee01c4adbc$3618d330$39140897@PMORND> Message-ID: Rajiv, You may want to check this web site; they have plenty of resources there for you to study and get started. They even had a starter demo kit (not sure about its current status). However, I am not sure if it was Beowulf, but it was certainly an HPC cluster. At 10:25 AM +0530 04.10.9, Rajiv wrote: >Dear All, > Are there any Beowulf packages for windows? > >Regards, >Rajiv > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041011/b3374e9e/attachment.html From matej.ciesko at stud.uni-erlangen.de Sun Oct 10 11:25:32 2004 From: matej.ciesko at stud.uni-erlangen.de (Matej Ciesko) Date: Wed Nov 25 01:03:28 2009 Subject: RE: [Beowulf] HPC in Windows In-Reply-To: <01ee01c4adbc$3618d330$39140897@PMORND> Message-ID: Hi, In fact there is everything on the market that you need to build Windows based computational clusters as easily as with other operating systems. OS: The Windows 2003 operating system provides functionality for everything you need to deploy, run and maintain computational clusters. For starters, go look for the free Microsoft Computational Clustering Technical Preview Toolkit (CCTP). It includes some useful tools to get you started. More info is available at the MS HPC web page: http://www.microsoft.com/windowsserver2003/hpc/default.mspx If you are looking for references, here's a tip: The Cornell Theory Center (CTC) is and has been using large scale Windows based computational clusters for quite some time now. They also have "best practice" documentation available on how to build them on their web site. Middleware: Most common middleware packages used for HPC clustering are ported to the windows platform.
Look for MPI/PRO for example (or NT-MPICH if you go for free stuff). Google for more :-) Beyond that, most of these (commercial) middleware packages work well together with Microsoft development environments (Visual Studio), which makes development very comfortable. AND, if you or your research facility are part of a university or other educational institution, the chance is high that you can get all of the MS products for free for the purpose of your research by applying to one of their many academic alliance programs. (Go look for MSDN AA). http://msdn.microsoft.com/academic/ (at the bottom of the page). Best Regards, Matej Ciesko. _____ From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org] On Behalf Of Rajiv Sent: Saturday, October 09, 2004 6:56 AM To: beowulf@beowulf.org Subject: [Beowulf] HPC in Windows Dear All, Are there any Beowulf packages for windows? Regards, Rajiv --- Incoming mail is certified Virus Free. Checked by AVG anti-virus system (http://www.grisoft.com). Version: 6.0.775 / Virus Database: 522 - Release Date: 10/8/2004 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041010/6c8ae8e9/attachment.html From john.hearns at clustervision.com Sun Oct 10 23:47:22 2004 From: john.hearns at clustervision.com (John Hearns) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Re: HPC in Windows In-Reply-To: References: <01ee01c4adbc$3618d330$39140897@PMORND> Message-ID: <1097477241.1977.3.camel@vigor12> On Sun, 2004-10-10 at 13:52, mark.westwood@ohmsurveys.com wrote: > Hi Rajiv > > I don't know about Windows-based clusters, but you might want to check out > > Beowulf Cluster Computing with Windows > edited by Thomas Sterling > MIT Press, 2001 There's also an HPC edition of Windows Server 2003 http://www.microsoft.com/windowsserver2003/hpc/default.mspx Can't comment further as I have never used it, but that page has lots of links. From pjs at eurotux.com Mon Oct 11 03:49:07 2004 From: pjs at eurotux.com (Paulo Silva) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] MPI problem Message-ID: <1097491747.6316.21.camel@valen> Hello, I'm testing a 12 node Beowulf cluster using torque (an OpenPBS based program) and mpich with rsh/nfs. When I submitted a program that generates a big output file at the end of execution, I got this error: /opt/mpich/bin/mpirun: line 1: 31098 File size limit exceeded /home/xpto/QCD/su3_ora -p4pg /home/xpto/QCD/PI30888 - p4wd /home/xpto/QCD The output file stops at 2.0 GB. Since this error only occurs when I use MPI programs, I suspect this is some issue related to mpich. Does anyone know what the problem is? Thanks for any help -- Paulo Silva Eurotux Informática, SA -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part Url : http://www.scyld.com/pipermail/beowulf/attachments/20041011/db2f1bec/attachment.bin From epaulson at cs.wisc.edu Mon Oct 11 10:16:50 2004 From: epaulson at cs.wisc.edu (Erik Paulson) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] HPC in Windows In-Reply-To: References: <01ee01c4adbc$3618d330$39140897@PMORND> Message-ID: <20041011171650.GE15038@cobalt.cs.wisc.edu> On Sat, Oct 09, 2004 at 06:11:01PM -0400, Robert G. Brown wrote: > On Sat, 9 Oct 2004, Rajiv wrote: > > > Dear All, > > Are there any Beowulf packages for windows? > > Not that I know of.
In fact, the whole concept seems a bit oxymoronic, > as the definition of a beowulf is a cluster supercomputer running an > open source operating system. > It's really time that we gave up on trying to hold a strong definition to "beowulf". It's like kleenex or hacker/cracker. The world doesn't care. Clusters of x86 PCs doing "HPC" = beowulf. And on the Beowulf on Windows bit - http://www.amazon.com/exec/obidos/tg/detail/-/0262692759/qid=1097514164/sr=8-1/ref=sr_8_xs_ap_i1_xgl14/104-7091285-1915902?v=glance&s=books&n=507846 "Beowulf Cluster Computing with Windows (Scientific and Engineering Computation) by Thomas Sterling" - If Tom says that you can build a beowulf on Windows, I think you can. -Erik ps - define "supercomputer" :) From michael.fitzmaurice at ngc.com Mon Oct 11 09:31:14 2004 From: michael.fitzmaurice at ngc.com (Fitzmaurice, Michael) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] bwbug: BWBUG meeting tomorrow at 3:00 PM in McLean Virginia Message-ID: <1C0477C28F2A16489E5765E92BCD9A04029418CB@xcgva008.northgrum.com> The next BWBUG meeting will be held October 12, 2004 at Northrop Grumman Corporation, 7575 Colshire Drive, McLean, Virginia 22102, from 3:00 PM to 5:00 PM. There will be two presentations. One on Global File Systems, a key component of large, reliable data storage systems. The second talk will be on a comprehensive study that provides a unique insight into the Linux HPC market from the users' perspective. Join us for two great talks this Tuesday. Speaker: Sudhir Srinivasan, Ph.D., CTO and VP Engineering for IBRIX, a leader in Global File Systems Description: As the market coalesces around cluster-based computing, one of the primary impediments to scalability and performance is the file system. As such, technology vendors have developed distributed parallel file systems to overcome these I/O challenges. This session will offer a brief overview of how today's parallel file system offerings take advantage of clustered environments to get the best performance and scalability possible. Further, it will endeavor to explain how design and architectural elements such as segmentation, metadata algorithms, and non-hierarchical architectures make large clustered file systems more scalable and practical. We believe attendees will come away with a better understanding of the elements that make file system solutions appropriate given the cluster environment and applications users are running. John L Payne is president of JLP Associates, a consulting company specializing in computer and communication technologies. HPC Clusters - The best technology buy! The first independent study of HPC clusters and HPC industry growth. Based on comprehensive interviews with more than 40 users. The talk will provide insights into users' operational experience, reliability and performance. It will also look at their views on COTS versus Blades. Future design options and requirements will be discussed. T.
Michael Fitzmaurice Coordinator of the BWBUG 8110 Gatehouse Road 400W Falls Church, Virginia 22042 Office 703-205-3132 Cell 703-625-9054 http://www.it.northropgrumman.com/index.asp http://www.bwbug.org _______________________________________________ bwbug mailing list bwbug@pbm.com http://www.pbm.com/mailman/listinfo/bwbug From eugen at leitl.org Mon Oct 11 14:43:31 2004 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] HPC in Windows In-Reply-To: <20041011171650.GE15038@cobalt.cs.wisc.edu> References: <01ee01c4adbc$3618d330$39140897@PMORND> <20041011171650.GE15038@cobalt.cs.wisc.edu> Message-ID: <20041011214331.GV1457@leitl.org> On Mon, Oct 11, 2004 at 12:16:50PM -0500, Erik Paulson wrote: > It's really time that gave up on trying to hold a strong definition > to "beowulf". It's like kleenex or hacker/cracker. The world doesn't > care. Clusters of x86 PCs doing "HPC" = beowulf The term "Beowulf" is completely unknown outside of a small community. > And on the Beowulf on Windows bit - > http://www.amazon.com/exec/obidos/tg/detail/-/0262692759/qid=1097514164/sr=8-1/ref=sr_8_xs_ap_i1_xgl14/104-7091285-1915902?v=glance&s=books&n=507846 > > "Beowulf Cluster Computing with Windows (Scientific and Engineering Computation) > by Thomas Sterling" - If Tom says that you can build a beowulf on > Windows, I think you can. Yeah, if your node licenses are subsidized, and you don't care for worst-case message passing latency, and lack of tools, I guess you can... Sorry, but you seem to subscribe to a very peculiar definition of COTS. -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041011/6def4452/attachment.bin From hunting at ix.netcom.com Mon Oct 11 17:25:52 2004 From: hunting at ix.netcom.com (Michael Huntingdon) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] SATA vs SCSI drives In-Reply-To: <41695CE1.80507@tesseract-tech.com> References: <41695CE1.80507@tesseract-tech.com> Message-ID: <6.1.2.0.2.20041011170124.01d64600@popd.ix.netcom.com> Hernando Though slightly dated, I hope the attachment is helpful....btw....I didn't do an exhaustive search, but found the 10K SATA drives only offered at 72GB's and under. The higher cap drives are 7200RPM. cheers michael At 09:01 AM 10/10/2004, H.Vidal, Jr. wrote: >Hello all. > >We are building some Network Area Storage gear around some high-end >imaging and data acq. systems. Reliability for storage of this data is >a big time must. > >To date, we have built all of this lab's gear around SCSI drives because it >has been our research and experience that SCSI drives are better built >than IDE drives. However, when looking at these drive arrays and NAS >appliances, it is very clear that SATA drives are really driving large scale >storage. > >What has been the general experience on this list of SATA vs SCSI in terms >of performance, reliability, quoted as well as real-world failure rates, etc? >Which SATA drives are considered 'the best' the way, say Seagate drives are >held in high esteem for SCSI? > >And, if anybody likes any particular RAID and/or NAS system, let's hear >your stories. 
About 1.4-1.7 Terabyte raw space. > >Thanks for your collective help and attention. > >Hernando Vidal, Jr. >Tesseract Technology > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- A non-text attachment was scrubbed... Name: SATA vs. SAS disk technology.pdf Type: application/pdf Size: 393316 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041011/00b7ca2c/SATAvs.SASdisktechnology.pdf -------------- next part -------------- ********************************************************************* Systems Performance Consultants Michael Huntingdon Higher Education Technology Office (408) 294-6811 131-A Stony Circle, Suite 500 Cell (707) 478-0226 Santa Rosa, CA 95401 fax (707) 577-7419 Web: <http://www.spcnet.com> hunting@ix.netcom.com ********************************************************************* From george at galis.org Mon Oct 11 16:44:21 2004 From: george at galis.org (George Georgalis) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] SATA vs SCSI drives In-Reply-To: References: <41695CE1.80507@tesseract-tech.com> <20041010234414.GB777@trot.local> Message-ID: <20041011234421.GD9260@trot.local> On Mon, Oct 11, 2004 at 09:06:18AM -0400, Joshua Baker-LePain wrote: >On Sun, 10 Oct 2004 at 7:44pm, George Georgalis wrote > >> If you build your own, the 3com controllers can be had under $400 and are > ^^^^ >I think you meant "3ware" there... Indeed. Thanks. // George -- George Georgalis, systems architect, administrator Linux BSD IXOYE http://galis.org/george/ cell:646-331-2027 mailto:george@galis.org From mark.westwood at ohmsurveys.com Tue Oct 12 01:11:39 2004 From: mark.westwood at ohmsurveys.com (Mark Westwood) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Grid Engine question Message-ID: <416B91BB.8070102@ohmsurveys.com> Hi All We use the open source Grid Engine, Enterprise Edition v5.3, here to manage job submission to our 70 processor Beowulf. I'm rather new to managing Grid Engine and my users have me baffled with a question of priorities. The scenario is this: - suppose that there is a job running on 40 processors, leaving 30 free; - a high priority job, requesting 64 processors, is submitted; - a low priority, but long, job, requesting 24 processors is submitted. Currently, with our configuration, the low priority job would be run immediately, since there are more than 24 processors available. However, my users want to hold that job until the high priority job has run. Can we configure Grid Engine so that the low priority job is not started until after the high priority job, even though there are resources available for the low priority job when it is submitted ? Thanks for any help you can provide. PS Yes I have RTFMed and am not much the wiser on this specific question. -- Mark Westwood Software Engineer OHM Ltd The Technology Centre Offshore Technology Park Claymore Drive Aberdeen AB23 8GD United Kingdom +44 (0)870 429 6586 www.ohmsurveys.com From rgb at phy.duke.edu Tue Oct 12 08:35:05 2004 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] HPC in Windows In-Reply-To: <20041011171650.GE15038@cobalt.cs.wisc.edu> References: <01ee01c4adbc$3618d330$39140897@PMORND> <20041011171650.GE15038@cobalt.cs.wisc.edu> Message-ID: On Mon, 11 Oct 2004, Erik Paulson wrote: > On Sat, Oct 09, 2004 at 06:11:01PM -0400, Robert G. Brown wrote: > > On Sat, 9 Oct 2004, Rajiv wrote: > > > > > Dear All, > > > Are there any Beowulf packages for windows? > > > > Not that I know of. In fact, the whole concept seems a bit oxymoronic, > > as the definition of a beowulf is a cluster supercomputer running an > > open source operating system. > > > > It's really time that gave up on trying to hold a strong definition > to "beowulf". It's like kleenex or hacker/cracker. The world doesn't > care. Clusters of x86 PCs doing "HPC" = beowulf Now look what you did. Used up my whole morning, just about. The easily bored can skip the rant below. This (what's a beowulf?) is list discussion #389, actually. Or maybe it is that the discussion has occurred 389 times, I can't remember. I do remember that the first time I participated in it was around seven or eight years ago, that I advanced the point of view that you espouse here -- and that I changed my mind. The definition of beowulf as OPPOSED to "just" a cluster of systems (nuttin' in the definition about them being "PC"s, just COTS systems) was given by the members of the original beowulf project with explicit reasons for each component. Note well that cluster supercomputing was at the time not new -- I'd been doing it myself by then for years (on COTS systems, for that matter, if Unix workstations can be considered off the shelf), and I was far, far from the first. At that time, there were already NOWs, COWs, PoPs and more. See Pfister's "In Search of Clusters" for a lovely, balanced, and not terribly beowulf-centric historical review. Two things differentiated the beowulf from earlier cluster efforts. a) Custom software designed to present a view of the cluster as "a supercomputer" in the same sense (precisely) that e.g. an SP2 or SP3 is "a supercomputer" -- a single "head" that is identified as being "the computer", specialized communications channels to augment the speed of communications (then quite slow on 10 Mbps ethernet), stuff like bproc designed to support the member computers being "processors" in a multiprocessor machine rather than standalone computers. Note that this idea was NOT totally original to the beowulf project, as PVM already had incorporated much of this vision years earlier. b) The fact that the beowulf utilized an open source operating system and was built on top of open source software. The reasons for this at the time were manifest, and really haven't changed. In order to realize their design goals that >>extended<< the concepts already in place in PVM, they had to write numerous kernel drivers (hard to do without the kernel source) as well as a variety of support packages. Don Becker wrote (IIRC) something like -- would that be all of the linux kernel's network drivers at the time or just 80% of them? -- hard to remember at this point, but a grep on Becker in /usr/src/linux/drivers/net is STILL pretty revealing. Now look for Sterling and Becker's contributions to the WinXX networking stack. Hmmmm.... 
The insistence on COTS hardware, actually, is what I'd consider the "weakest" component of the original definition, as it is the one component that was readily bent by the community in order to better realize the design goal of a parallel supercomputer capable of running fine grained parallel code competitively with "big iron" supercomputers. The beowulf community readily embraced non-commodity networks when they appeared. Note that I consider "commodity" as meaning multisourced with real competition holding down prices and generally built on an "open" standard, e.g. ethernet is open and has many vendors, myrinet is not open and is available only from Myricom (although at all points there has been at least some generic competition at least between high end proprietary networks). Myrinet historically was perhaps >>the<< key component that permitted beowulves to reach and even exceed the performance of so-called big iron supercomputers for precisely the kind of fine grained numerical problems that the supercomputers had historically dominated. I remember well Greg Lindahl, for example, showing graphs of Alpha/Myrinet speedup scaling compared to e.g. SP-series systems and others, with the beowulf model actually winning (at less than 1/3 the price, even using the relatively expensive hardware involved). > And on the Beowulf on Windows bit - > http://www.amazon.com/exec/obidos/tg/detail/-/0262692759/qid=1097514164/sr=8-1/ref=sr_8_xs_ap_i1_xgl14/104-7091285-1915902?v=glance&s=books&n=507846 > > "Beowulf Cluster Computing with Windows (Scientific and Engineering Computation) > by Thomas Sterling" - If Tom says that you can build a beowulf on > Windows, I think you can. I can only reply with: http://www.beowulf.org/community/column2.html by Don Becker, in which he points out that when they first met, Sterling was "obsessed with writing open source network drivers". Or if you prefer, Question Number One of the beowulf FAQ: 1. What's a Beowulf? Beowulf Clusters are scalable performance clusters based on commodity hardware, on a private system network, with open source software (Linux) infrastructure. Each consists of a cluster of PCs or workstations dedicated to running high-performance computing tasks. The nodes in the cluster don't sit on people's desks; they are dedicated to running cluster jobs. It is usually connected to the outside world through only a single node. Some Linux clusters are built for reliability instead of speed. These are not Beowulfs. Or check out my "snapshot" of the original beowulf website, preserved in electronic amber (so to speak) from back when I ran a mirror: http://www.phy.duke.edu/resources/computing/brahma/Resources/beowulf/ The introduction and overview contains a number of lovely tidbits concerning the beowulf design and how it differs from a NOW. It makes it pretty clear that the only way a pile of WinXX boxes could be "a beowulf" (as opposed to a NOW) would be if Microsoft Made it So -- the WinXX kernels and networking stack and job scheduling and management are essentially inaccessible to developers in an open community, which is why WinXX clusters like Cornell's (however well they work) stand alone, supported only to the extent that MS or Cornell pay for it with little community synergy. Nobody would argue, of course, that one can't build a NOW based on WinXX boxes. A number exist. 
WinXX boxes run PVM or MPI (and have been able to for many years, probably even predating the beowulf project although I'm too lazy to check the mod dates of the WinXX ifdefs in PVM). One can also obviously build a grid with WinXX boxes in it, probably more easily than one can build a true parallel cluster. Grid-style clusters (a.k.a. "compute farms") predate even virtual supercomputers in cluster taxonomy, for all that they have a new name and a relatively new set of high-level support software (just as the beowulf has, in the form of bproc implemented in clustermatic and scyld). Those of use who used to "roll our own" gridware to permit the use of entire LANs of workstations on embarrassingly parallel problems find this (toplevel support software) a welcome development, and it has indeed blurred the lines between beowulfs and other NOWs to some degree, but if anything it is DIMINISHING the identification of all clusters as "beowulfs". Look at all the Grid projects in the universe -- BioGRID, the smallpox grid, ATLAS grid, PatriotGrid -- grids are proliferating like crazy, but they aren't considered or referred to as beowulfs. In most cases "beowulf" isn't even mentioned in their toplevel documentation. One of the fundamental reasons for differentiation is this very list. Few people who have been on the list for a long time and who have worked with beowulfs and other kinds of open source clusters for a long time have any particular interest in providing community support to cluster computing under Windows. For one thing, it is nearly impossible -- it requires somebody with trans-MCSE knowledge of Windows' kernels, libraries, drivers, networking stack, and tools including the various WinXX ports of key cluster software where it exists. For another, people who work in that community who DO have that level of expertise don't seem to want to share -- they want to sell. One has to pay to become a MCSE; one then expects a high rate of consultative return on the investment. One cannot easily obtain access to WinXX source code, and open or not, access to kernel-level source code turns out to be essential to getting maximal performance out of a true beowulf or even advanced non-beowulf style cluster. Besides, nearly all the tools involved (beyond userspace stuff like PVM or MPI in certain flavors) are SOLD and supported by Microsoft (only) or other Microsoft-connected commercial developers and the only "benefit" we get back in the community from providing support for them is to increase their profits and to encourage them to turn around and resell us our own developments and ideas at a high cost. So let THEM provide the consultation and expertise and "intellectual property" they prize so highly; I will not contribute. Contrast that with the really rather unbelieveable level of support freely offered via this list to (yes) general cluster computer users and builders (not just "beowulf" builders by the strict definition). This support is predicated on the fundamental notions of open source software -- that effort expended on it comes back to you amplified tenfold as the COMMUNITY is strengthened in the open and free exchange of ideas. Consider the many tools and products that support beowulfery (or generalized cluster computer operation) that would simply be impossible to develop in a closed source proprietary model. 
People who participate in this sort of development have no desire to do all the work to create new tools and products only to have Microsoft and its software lackeys do its usual job of co-opting the tool, branding it, shifting the core standard from open to proprietary, and then squeezing out the original inventors (extended rant available on request:-). For all of these reasons, I think that it is worthwhile to maintain the moderately strict definition of "a beowulf" as a particular isolated network arrangement of COTS systems running open source software and functioning as a cluster capable of running anything from fine grained parallel problems down to distributed single tasks with a single "view" of task ID space. This is a fairly open and embracing definition -- people on the list run "beowulfs" with a single head, multiple heads, many operating systems other than Linux (most of them open source -- WinXX users are subjected to fairly merciless teasing if nothing else ...hotter:-). It is differentiated from (recently emerging) definitions of Grid-style clusters, from my much older definition of a "distributed parallel supercomputer" (built largely of dual use workstations that function as desktop machines in a LAN while still permitting long-running numerical tasks to be run in the background), from MUCH older definitions of NOWs, COWs, Piles of PCs. So, if somebody says they've "built a beowulf" out of a bunch of WinXX boxes, yes, I know what they mean, even though what they say is almost certainly not correct. The list is fairly tolerant of pretty much anybody doing any kind of cluster computing, even Windows based NOWs or Grids. "Extreme Linux" as a more general vehicle for linux cluster development never quite took off, and www.extremelinux.org continues to be a blank page as it has been for years now. As I said above, I personally don't even DO "real" beowulf computing and never have -- my clusters tend to be NOWs, although we're gradually shifting more towards a Grid model as improved software makes this the easy path support-wise. As a final note, I personally view the original PVM team as the "inventors of commodity cluster computing" even more than Sterling and Becker (much as I revere their contributions). If a "beowulf" is a network of computers running e.g. PVM on top of proprietary software, Dongarra et. al. beat Sterling and Becker to the punch by years. This isn't a crazy idea -- PVM already contains "out of the box" many of the design goals of the beowulf project -- a unified process id space (tids), a single control head that supports the "virtual machine" model, the ability to run on commodity hardware. It just does it in userspace, and hence has limits on what can be accomplished performance-wise, and has the usual PVM vs MPI problems with the older supercomputer programmers (who all used MPI, for interesting historical reasons). 
(Interestingly, "old hands" in the beowulf/cluster business nearly all tell me that they used to use and still prefer PVM, while MPI is still the "commercially salable" parallel library that better favors the traditional big iron supercomputing model;-) To what PVM already provided, Sterling and Becker contributed the notions of >>network isolation<< to achieve predictable network latency, >>channel bonding<< of network channels, built on top of open source network drivers, to improve network bandwidth (an accomplishment somewhat overshadowed by the rapid development faster networks and low-latency networks), and eventually >>kernel-level modifications<< that truly converted a cluster of PCs into a "single machine" the components of which could no longer stand alone but were merely "processors" in a massively parallel system with a single user-level kernel interface. So how in the world can Sterling argue that this >>beowulf<< software, developed by the original beowulf team, is available for Windows? Did I miss something? Network isolation, fine, that's a matter of trivial network arrangement that anybody with $50 for an OTC router/firewall can now accomplish, but channel bonded networks? Unified process id spaces? Kernel modifications that make nodes into virtual processors in a single "machine"? Not that I know of, anyway, and obviously impossible without fairly open access to Windows source code in any event. At a guess, it would require such a violent modification even to the more modern and POSIX compliant WinXX's that the result could be called "Windows" only in the sense that linux running a windowing system can be called "Windows" -- pretty much a complete rewrite and de-integration of the GUI from the OS kernel would be required (something that Microsoft has argued in court is impossible, amusingly enough, as they have sought to convince an ignorant public that Internet Explorer -- a userspace program if ever there was one -- cannot be be de-integrated from Windows:-). Asserting that there are truly Windows-based beowulfs does not make it so, and coopting the term "beowulf" to apply to generic computing models and tools that preceded the project by years is a kind of newspeak. I'll have to just go on thinking of the idea as an oxymoronic one, at least until Microsoft opens its source code or somebody succeeds in rewriting history and the original definition and goals of the beowulf project. > ps - define "supercomputer" :) AT THE TIME of the beowulf project, the definition was actually pretty clear, if only by example. I'd say it is still pretty clear, actually. At that time (and still today, mostly) the generic term "computer" embraced: a) Mainframes (the oldest example of "computer", still annoyingly common in business, industry and academe). b) Minicomputers (e.g. PDP's, Vaxes, Harris's). Basically cheaper/smaller versions of mainframes that generally stood alone although of course a number of them were used as the core servers for Unix-based workstation LANs. c) Workstations (e.g. Suns, SGIs). Typically desktop-sized computers in a client-server arrangement on a LAN. Server-class Suns and SGIs were sometimes refrigerator-sized units that were de facto minicomputers, blurring the lines between b) and c) in the case where both were running Unix flavors (or at least real multitasking operating systems). d) Personal computers. 
A "personal" computer was always a desktop sized unit, and the term "PC" generally applied to x86-family examples, although clearly Apples were (and continue to be) PCs as well. Note that PCs were sometimes as capable, hardware-wise, as workstations and had been networkable for years, so networking or hardware per se had nothing to do with being a PC vs a workstation. A PC really was differentiated from being a workstation by a key feature of its operating system -- the INability to login to the system remotely over a network. To use a PC, you had to sit at the PC's actual interface. (Note that aftermarket tools like "PC anywhere" did not a PC a workstation make). e) Supercomputers. A supercomputer was (and continues to be) a generic term for a "computer" capable of doing numerical (HPC) computations much faster than the CURRENT GENERATION of a-d computers. Obviously a moving target, given Moore's Law. From the "first" so-called supercomputer, the 12 MFLOP Cray-1, through to today's top 500 list, the differentiating feature is obviously RELATIVE performance, as the Palm Tungsten C in my pocket (with its 400 MHz CPU) is faster than the Cray 1. f) Today there is a weak association between "supercomputer" and "single task" HPC (so Grids and compute farms of various sorts are somewhat excluded, probably BECAUSE of the top500 list and its insistence on parallel linpack-y sorts of stuff as the relevant measure of supercomputer performance). So Grids have emerged as a kind of cluster in their own right that isn't ordinarily viewed as a supercomputer although a Grid is essentially unbounded from above in terms of aggregate floating point capacity in a way that supercomputers are not. One could make a grid of all the top500 supercomputers, in fact... Note that historically supercomputers are differentiated from other a-d class computers not by being "mainframe" or not, not by being vector processor based vs interconnected parallel multiprocessor based, not by its operating system, not even by its underlying computational paradigm (e.g. shared memory vs message passing), certainly not by its ABSOLUTE performance, but strictly by relative numerical performance. My Palm a decade ago would have been an export-restricted munition supercomputer, usable by rogue nations to simulate nuclear blasts and build WMD. Today it is a casual tool used by businessmen to check the web and email and remind them of appointments, while other munitions-quality systems are now toys, used by my kids to race virtual motorcycles around hyperrealistically rendered city streets. Talk about swords into plowshares...;-) The exact multiplier between "ordinary computer" performance and supercomputer performance is of course not terribly sharp. Over the years, a factor of order ten has often sufficed. In the original beowulf project, aggregating 16 80486DX processors (at best a few hundred aggregate FLOPS, again, my Palm probably would beat it at a walk) was enough. Nowadays perhaps we are jaded, and only clusters consisting of hundreds or thousands of CPUs, instead of tens, are in the running. Maybe only the top500 systems are "supercomputers. Maybe the term itself is really obsolete, as fewer and fewer systems that are anything BUT a beowulf style cluster (even if it is assembled and sold as a big iron "single system" with its internal cluster CPUs and IPC network and memory model hidden by a custom designed operating system) appear in the HPC marketplace. Still, I think most people still know what "supercomputer" means. 
In fact, when one looks over the current top500, it appears that it has >>almost<< become synonymous with the term "beowulf";-) But not (note well!) with the term "grid", as grids aren't architected to excell at linpack, and a grid is very definitely not a beowulf. As far as I can tell, just about 100% of the top500 are clusters (COTS or otherwise) architected along the lines laid out by the beowulf project, with 95% of them having lots scalar processors and the remaining 5% having lots of vector processors. Unfortunately, the top500 (which I continue to think of as being almost totally useless for anything but advertising) doesn't present us with a clear picture of the operating systems or systems software architectures in place on most of the clusters. In fact, it provides remarkably little useful information except the name of the cluster manufacturer/integrator/reseller (imagine that;-). Two clusters on the list (#146 at Cornell and #233 in Korea) are explicitly indicated as running Windows. Looking over the general cluster hardware architectures and manufacturer/integrator/resellers, I would guess that linux is overwhelmingly dominant, followed by freebsd and other (proprietary) flavors of Unix, with WinXX quite possibly dead last. Open source development is an evolutionary model, capable of paradigm shifts, far jumps in parametric space, and N^3 advantage in searching high dimensional spaces. Proprietary software development is by its nature a gradient search process, prone to optimizing in perpetuity around a slowly evolving local minimum, making long jumps only when it steals fully developed memetic patterns (such as the Internet, cluster computing, and many more) more often than not produced by evolutionary communities. To be fair, new patterns are sometimes introduced a priori by brilliant individuals without clear roots in open communities (e.g. "Turbo" compilers), although that is less common in recent years as the open source development process has itself evolved. The individuals only RARELY work for major corporations any more, and the corporations that are famous as idea factories -- e.g. Bell Labs -- created internal "open" communities of their very own where the new ideas were incubated and exchanged and kicked around. It's just a matter of mathematics, you see. Linux = mammal (sorry, Tux:-) Evolving at a stupendous speed (compare everything from kernel to complete distributions over the last decade) WinXX = Great White Shark Evolutionarily frozen, remarkably efficient at what it does, immensely yet curiously vulnerable... Well, that's enough rant for the day. I've GOT to get some actual work done... rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From craig at craigsplanet.force9.co.uk Tue Oct 12 01:18:30 2004 From: craig at craigsplanet.force9.co.uk (Craig Robertson) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Re: HPC in Windows Message-ID: <416B9356.40400@craigsplanet.force9.co.uk> All, Not wishing to be pedantic, but as RGB points out, the definition of Beowulf means that there isn't such a thing as a MS based Beowulf. Yes, you can use COTS hardware but Windows is a non-free OS (in both senses of the word). Perhaps this was a disguised advertisement of some kind ;0) A likely scenario with an MS based cluster would be that a problem would present itself and there really would be no way of fixing it. 
Time, effort and money would then have to be expended on a kludge since you've already paid out thousands of dollars on licensing fees. -- Craig. --------------------------------------------------------- Dr. C. Robertson Craig's Planet Ltd. tel/fax: +44 1383 411123 fax2email: +44 870 7050992 mobile: +44 7890 565695 email: craig@craigspla.net http://www.craigspla.net --------------------------------------------------------- The information contained within this e-mail is confidential and may be privileged. It is intended for the addressee only. If you have received this e-mail in error please inform the sender and delete this e-mail and any attachments immediately. The contents of this e-mail must not be disclosed or copied without the sender's consent. Statements made in email are binding in honour only. --- Litigous people force us to put this statement here --- From john.hearns at clustervision.com Tue Oct 12 07:42:30 2004 From: john.hearns at clustervision.com (John Hearns) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Grid Engine question In-Reply-To: <416B91BB.8070102@ohmsurveys.com> Message-ID: On Tue, 12 Oct 2004, Mark Westwood wrote: > Hi All > > We use the open source Grid Engine, Enterprise Edition v5.3, here to > manage job submission to our 70 processor Beowulf. I'm rather new to > managing Grid Engine and my users have me baffled with a question of > priorities. Mark, you would be better off asking this on the Gridengine mailing list. And if you don't mind me being a little forward, Gridengine version 6.0u1 is now available. > > The scenario is this: > > - suppose that there is a job running on 40 processors, leaving 30 free; > - a high priority job, requesting 64 processors, is submitted; > - a low priority, but long, job, requesting 24 processors is submitted. > > Currently, with our configuration, the low priority job would be run > immediately, since there are more than 24 processors available. > However, my users want to hold that job until the high priority job has run. > > Can we configure Grid Engine so that the low priority job is not started > until after the high priority job, even though there are resources > available for the low priority job when it is submitted ? > I'm not sure of the exact answer here. But SGE 6 does have advance reservations - so a hold could be put on processors till 64 become free. From hahn at physics.mcmaster.ca Tue Oct 12 08:46:07 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] SATA vs SCSI drives In-Reply-To: <6.1.2.0.2.20041011170124.01d64600@popd.ix.netcom.com> Message-ID: > Though slightly dated, I hope the attachment is helpful....btw....I didn't > do an exhaustive search, but found the 10K SATA drives only offered at > 72GB's and under. The higher cap drives are 7200RPM. that's correct. but remember - RPM is mainly for latency, not bandwidth. if your workload is not incredibly seeky, then you don't want to pay for latency, since higher density leads to lower cost, bigger disks, higher bandwidth and slower seeks. in summary: - meet your reliability requirements using raid. it's insane to think about relying on a single disk in any non-ephemeral setting anyway. raid lets you achieve pretty much any reliability you want (as well as offering a broad spectrum of performance.) - meet your seek-rate requirements using RPM. 
I find very, very few applications are really seek-limited - really it's only very databases with uniform-random distribution of reads of tiny data from monumentally large tables. in particular, if there's any data locality or reuse at all, spend money on RAM not RPM. - for anything large, get MTBF specs for prospective disks. this lets you calculate how often you'll be replacing hardware, physically. your raid has taken care of data robustness; this is purely a maintenance issue. there's no dramatic difference in any of the families of disks available (well, avoid 1yr warranties, of course!). consider, for instance, that you can easily build raids based on 300G SATA disks that have half as many moving parts as with 147G SCSI disks. even if the MTBF's differ by 50% (guess 1.0 and 1.5 Mhours respectively) SATA is more reliabile. it'll probably also be 1/4 the price and sometimes actually faster. regards, mark hahn. From James.P.Lux at jpl.nasa.gov Tue Oct 12 11:49:07 2004 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] SATA vs SCSI drives References: <6.1.2.0.2.20041011170124.01d64600@popd.ix.netcom.com> Message-ID: <5.2.0.9.2.20041012113818.017dca28@mail.jpl.nasa.gov> At 11:46 AM 10/12/2004 -0400, Mark Hahn wrote: > > Though slightly dated, I hope the attachment is helpful....btw....I didn't > > do an exhaustive search, but found the 10K SATA drives only offered at > > 72GB's and under. The higher cap drives are 7200RPM. > >that's correct. but remember - RPM is mainly for latency, not bandwidth. >if your workload is not incredibly seeky, then you don't want to pay >for latency, since higher density leads to lower cost, bigger disks, higher >bandwidth and slower seeks. > >in summary: > - meet your reliability requirements using raid. it's insane > to think about relying on a single disk in any non-ephemeral > setting anyway. raid lets you achieve pretty much any reliability > you want (as well as offering a broad spectrum of performance.) > > - meet your seek-rate requirements using RPM. I find very, very > few applications are really seek-limited - really it's only very > databases with uniform-random distribution of reads of tiny data > from monumentally large tables. in particular, if there's any > data locality or reuse at all, spend money on RAM not RPM. > > - for anything large, get MTBF specs for prospective disks. > this lets you calculate how often you'll be replacing hardware, > physically. your raid has taken care of data robustness; > this is purely a maintenance issue. > >there's no dramatic difference in any of the families of disks available >(well, avoid 1yr warranties, of course!). consider, for instance, that >you can easily build raids based on 300G SATA disks that have half as >many moving parts as with 147G SCSI disks. even if the MTBF's differ >by 50% (guess 1.0 and 1.5 Mhours respectively) SATA is more reliabile. >it'll probably also be 1/4 the price and sometimes actually faster. Read those MTBF specs carefully... Typically they'll have some sort of usage tied to it (so many seeks per second, number of power up/power down cycles). ALso check the temperature effects on MTBF. It's not unheard of for mfrs to specify MTBF assuming a 20C drive temperature, which is unrealistically cold. Typically MTBF halves for each 10C rise in temperature. All the big mfrs have fairly decent descriptions of how they rate MTBF for their various drive classes. Note well that the assumptions of use for drives intended for, e.g. 
consumer PCs, are very different from those intended for server duty, and this is primarily determined by how they are positioned in the market. James Lux, P.E. Spacecraft Radio Frequency Subsystems Flight Telecommunications Systems Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From cnsidero at syr.edu Tue Oct 12 12:59:27 2004 From: cnsidero at syr.edu (Chris Sideroff) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect Message-ID: <1097611167.28704.104.camel@syru212-207.syr.edu> I'm sure posing this may raise more questions than answers, but which high-speed interconnect would offer the best 'bang for the buck': 1) myrinet 2) quadrics qsnet 3) mellanox infiniband Currently, our 30 node dual Opteron (MSI K8D Master-FT boards) cluster uses Gig/E and we are looking to upgrade to a faster network. As well, what components would one need for each setup? The reason I ask is, for example, the Myrinet switches accept different line cards and I am not sure which one to use. Sorry if this is a bit of a newbie question but I have no experience with any of this kind of hardware. I am reading the docs for each but thought your feedback would be good. Thanks Chris Sideroff From mathog at mendel.bio.caltech.edu Tue Oct 12 11:38:10 2004 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Tyan 2466 crashes, no obvious reason why Message-ID: Just thought I'd share the final outcome of this. After much swapping around of components and days of running memtest86, the problem was moving with the power supply. Swapping in the spare PS fixed it and that node has not so much as hiccupped in the month since. Note in particular that all of the voltages seen by the motherboard were always in range. My working hypothesis is that the PS either passes too much noise or just glitches occasionally (for instance, an intermittent internal short). The PS was a Zippy power supply with a power cord that attached via spades to the socket at the back of the 2U case. model AX2-5300FB-2S P/N 6AX2-300B055 ser no: T21905564M1A977732 Big EMACS logo, tiny www.zippy.com.tw down at the bottom. It was still under Zippy's warranty and the good folks at PSSC handled the exchange promptly. A day (!) after the replacement unit came in, a second node started doing the exact same thing - unexplained crashes and lock ups with nothing in the log file. Logging lm_sensors every 2 minutes showed nothing untoward up through the last entry. Crashes were every few hours. This time I just swapped the PS first thing and it has been ok now for over 4 days. Same type of power supply inside, this one with Serial No. T21905562M1A977732, which differs by only one digit from the first one that failed. Could be a coincidence but I'm beginning to suspect that there may be a bad component in this lot of power supplies, in which case an unpleasant series of node failures can probably be expected in the not too distant future.
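As a rough cross-check on the MTBF comparison earlier in the thread (the guessed 1.0 vs 1.5 Mhour figures for 300G SATA vs 147G SCSI, and the point about temperature derating), here is a back-of-the-envelope sketch. It assumes a constant failure rate, so expected replacements per year is roughly N * 8766 / MTBF; the disk counts are hypothetical arrays of similar raw capacity, chosen only to show the arithmetic, not anyone's real configuration:

/* mtbf_est.c -- rough expected-replacement arithmetic for a disk farm.
 * Uses the guessed MTBFs from this thread (1.0 Mhr SATA, 1.5 Mhr SCSI),
 * not vendor data, and a crude constant-failure-rate model.
 */
#include <stdio.h>

static double fails_per_year(int ndisks, double mtbf_hours)
{
    const double hours_per_year = 8766.0;   /* 365.25 days */
    return ndisks * hours_per_year / mtbf_hours;
}

int main(void)
{
    /* hypothetical arrays of ~4.2 TB raw capacity either way */
    int n_sata = 14;  double mtbf_sata = 1.0e6;   /* 14 x 300 GB */
    int n_scsi = 29;  double mtbf_scsi = 1.5e6;   /* 29 x 147 GB */

    printf("SATA array: %.2f expected replacements/year\n",
           fails_per_year(n_sata, mtbf_sata));
    printf("SCSI array: %.2f expected replacements/year\n",
           fails_per_year(n_scsi, mtbf_scsi));

    /* temperature caveat: if MTBF roughly halves per +10C, a drive rated
     * at 20C but run at 40C looks about 4x worse on paper */
    printf("SATA at +20C over spec: %.2f expected replacements/year\n",
           fails_per_year(n_sata, mtbf_sata / 4.0));
    return 0;
}

With these numbers the SATA array comes out ahead simply because it has half as many spindles, which is exactly the "fewer moving parts" argument made above; the temperature line shows how quickly that advantage can be eaten by a hot chassis.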
Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From agrajag at dragaera.net Tue Oct 12 13:25:32 2004 From: agrajag at dragaera.net (Sean Dilda) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] Grid Engine question In-Reply-To: <416B91BB.8070102@ohmsurveys.com> References: <416B91BB.8070102@ohmsurveys.com> Message-ID: <1097612732.18951.14.camel@pel> On Tue, 2004-10-12 at 04:11, Mark Westwood wrote: > Hi All > > We use the open source Grid Engine, Enterprise Edition v5.3, here to > manage job submission to our 70 processor Beowulf. I'm rather new to > managing Grid Engine and my users have me baffled with a question of > priorities. > > The scenario is this: > > - suppose that there is a job running on 40 processors, leaving 30 free; > - a high priority job, requesting 64 processors, is submitted; > - a low priority, but long, job, requesting 24 processors is submitted. > > Currently, with our configuration, the low priority job would be run > immediately, since there are more than 24 processors available. > However, my users want to hold that job until the high priority job has run. > > Can we configure Grid Engine so that the low priority job is not started > until after the high priority job, even though there are resources > available for the low priority job when it is submitted ? In SGE 6.0 they added a feature they call 'advanced reservations'. Its not really advanced, and its not what I consider 'reservations' to be, but it is exactly what you want. When reservations are enabled on the cluster, and the job is submitted with '-R y', the mutli-processor job will be able to 'hold' available resources until it has enough to run, and thus keep lower priority jobs from using them. However, to do this you need to upgrade to at least version 6.0. However, 6.0 also has cluster queues which I find makes administration much easier (it allows you to create one queue setup and assign it to multiple hosts instead of doing a separate setup for each compute host). From landman at scalableinformatics.com Tue Oct 12 13:48:57 2004 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097611167.28704.104.camel@syru212-207.syr.edu> References: <1097611167.28704.104.camel@syru212-207.syr.edu> Message-ID: <416C4339.3040309@scalableinformatics.com> First questions first: Why do you think you need a faster network, and what aspect of fast do you think you need? Low latency? High bandwidth? Then... What codes are you running? Across how many CPUS? Have you done a performance analysis on your system to observe "slow" runs in progress, and are you convinced that the network is the issue? We have done lots of tuning bits for customers where the issues wound up being something else than what they had thought. It is worth at least looking into for your code/problems, and identifying the bottleneck (if you haven't already done so). That said, all the below require an external "switch" fabric. All range from $500-$2000 per HBA, and about $1000 or more per switch port. Varies a bit. Performance is comparible in most cases, with IB seeming to have a higher ceiling than the others. 
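Multiplying out those rough per-HBA and per-switch-port ranges for a cluster the size in question gives a quick feel for the total outlay. The sketch below is just that arithmetic -- the $500-$2000 per HBA and ~$1000 per port figures quoted above, not vendor pricing, and the node count is the 30-node cluster from the original question:

/* net_cost.c -- ballpark interconnect cost from the rough ranges quoted
 * in this thread ($500-$2000 per HBA, ~$1000 per switch port).
 * Pure arithmetic, not a quote from any vendor.
 */
#include <stdio.h>

int main(void)
{
    int nodes = 30;                          /* dual-Opteron cluster in question */
    double hba_lo = 500.0, hba_hi = 2000.0;  /* per-node adapter */
    double port   = 1000.0;                  /* per switch port, lower bound */

    double lo = nodes * (hba_lo + port);
    double hi = nodes * (hba_hi + port);
    printf("30 nodes: roughly $%.0fk to $%.0fk total, i.e. $%.0f-$%.0f per node\n",
           lo / 1000.0, hi / 1000.0, lo / nodes, hi / nodes);
    return 0;
}

That $45k-$90k spread is a sizeable fraction of what the nodes themselves cost, which is why the "profile first" advice keeps coming up.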
Joe Chris Sideroff wrote: >I'm sure posing this may raise more questions than answer but which >high-speed interconnect would offer the best 'bang for the buck': > >1) myrinet >2) quardics qsnet >3) mellanox infiniband > >Currently, our 30 node dual Opteron (MSI K8D Master-FT boards) cluster >uses Gig/E and are looking to upgrade to a faster network. > >As well, what are the components would one need for each setup? The >reason I ask is for example the Myrinet switches accept different line >cards and am not sure which one to use. Sorry if this a bit of a newbie >question but I have no experience with any of this kind of hardware. I >am reading the docs for each but thought your feedback would be good. > >Thanks > >Chris Sideroff > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 612 4615 From hahn at physics.mcmaster.ca Tue Oct 12 14:06:41 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097611167.28704.104.camel@syru212-207.syr.edu> Message-ID: > I'm sure posing this may raise more questions than answer but which > high-speed interconnect would offer the best 'bang for the buck': > > 1) myrinet > 2) quardics qsnet > 3) mellanox infiniband at least in the last cluster I bought, Myrinet and IB had similar overall costs and MPI latency. so far at least, I haven't found any users who are bandwidth-limited, and so no reason there to prefer IB. (Myri can match the others in bandwidth if you go dual-port; that approximately doubles the Myri cost, though, making it clearly more expensive than IB.) quadrics is more expensive, but also much faster in latency, and competitive with IB in bandwidth. (there are only three interconnects that can claim <2 us latency: quadrics elan4, SGI's numalink and the cray xd1/octigabay.) IB vendors swear up and down that they're cheaper than Myri, lower-latency, higher bandwidth and taste great with iced cream. I must admit to some skepticism in spite of lacking any IB experience ;) it does seem clear that upcoming PCI-e systems will let IB vendors drop a few more chips off their nic, and theoretically come down to the $2-300/nic range. as far as I know, switches are staying more or less at the same price. and it's worth remembering that IB still doesn't have *that* much field-proof (questions regarding whether IB will continue to be a sole-source ecosystem, issues of integrating with Linux, rumors of sticking points regarding pinned memory, qpair scaling in large clusters, handling congestion, etc.) > Currently, our 30 node dual Opteron (MSI K8D Master-FT boards) cluster > uses Gig/E and are looking to upgrade to a faster network. why? how have you evaluated your need for faster networking? do you know whether by "faster" you mean latency or bandwidth? offhand, I'd be a little surprised if a 30-node cluster made a lot of sense with quadrics, since you're unlikley to *need* the superior latency. (ie, it seems like people jones for low-lat mainly when they have frequent, large collective operations. where large means "hundreds" of MPI workers...) > As well, what are the components would one need for each setup? 
The > reason I ask is for example the Myrinet switches accept different line > cards and am not sure which one to use. Sorry if this a bit of a newbie > question but I have no experience with any of this kind of hardware. I > am reading the docs for each but thought your feedback would be good. hmm, myrinet's pages aren't stunningly clear, but also not *that* hard, since they do describe some sample configs. for instance, you can see the "small switches" section of http://www.myrinet.com/myrinet/product_list.html and notice that it's all based on a single 3U enclosure, one or two 8-way cards (M3-SW16-8F) and an optional monitoring card (M3-M). for a 32-node cluster, you'd need 32 nics, a 5-slot cab, 4x M3-SW16-8F's, either a monitoring card or a blanking panel, and 32 cables. if you have fairly firm and short-term plans for adding more nodes, consider getting a bigger chassis. if you have any reason to do IO over myrinet (speed!), consider giving the fileserver(s) dual-port access... configuring other networks is not drastically different, though they often have different terminology, etc. for instance, quadrics switches can be configured with "slim" fat-trees (partially populated with spine/switching cards.) configuration beyond a single switch cab also tends to be interesting ;) regards, mark hahn. From landman at scalableinformatics.com Tue Oct 12 14:23:53 2004 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097615581.28704.126.camel@syru212-207.syr.edu> References: <1097611167.28704.104.camel@syru212-207.syr.edu> <416C4339.3040309@scalableinformatics.com> <1097615581.28704.126.camel@syru212-207.syr.edu> Message-ID: <416C4B69.1020509@scalableinformatics.com> Chris Sideroff wrote: >On Tue, 2004-10-12 at 16:48, Joe Landman wrote: > > >>First questions first: >> >> Why do you think you need a faster network, and what aspect of fast do >>you think you need? Low latency? High bandwidth? >> >> > > To tell you the truth I can't answer that with more than, "I have a >gut feeling". I am in the process of profiling the performance of our >current cluster with our programs. Any suggestions ??? > > Yes, measure the performance as a function of number of CPUs, and then trying this on another similar cluster with the faster interconnect. Do this for "real" runs. Contact me offline if you would like to discuss. > > >>Then... >> >> What codes are you running? Across how many CPUS? Have you done a >>performance analysis on your system to observe "slow" runs in progress, >>and are you convinced that the network is the issue? >> >> > > We run exclusively computation fluid dynamics on it. One program is >Fluent the other is an in-house turbo-machinery code. My experiences so >far have led me to believe Fluent is much more sensitive to the >network's performance than the in-house program. Thus my inquiry into a >higher performance network. > > I haven't run fluent in the last few months, but it is a latency sensitive code. Would be worth exploring your models performance on a faster (e.g. lower latency) net. > > >>We have done lots of tuning bits for customers where the issues wound up >>being something else than what they had thought. It is worth at least >>looking into for your code/problems, and identifying the bottleneck (if >>you haven't already done so). >> >> > > Do you have more information on this 'tuning for customers'. I am >interested in your results. 
Again any suggestions on how to go about >this are welcomed. > > Get atop (http://freshmeat.net/projects/atop/), it is your friend. Profile your code with the profile tools. If you see lots of time spent in "do_writ" and similar, as well as high IO percentages in run times from sar, atop, and other tools, you might want to look at IO tuning. The important aspect of this is to gather real data about where your program spends its time. That is invaluable in deciding how to speed it up. Joe >Thanks, Chris > > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 612 4615 From cnsidero at syr.edu Tue Oct 12 14:13:46 2004 From: cnsidero at syr.edu (Chris Sideroff) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <416C4339.3040309@scalableinformatics.com> References: <1097611167.28704.104.camel@syru212-207.syr.edu> <416C4339.3040309@scalableinformatics.com> Message-ID: <1097615626.28704.129.camel@syru212-207.syr.edu> On Tue, 2004-10-12 at 16:48, Joe Landman wrote: > First questions first: > > Why do you think you need a faster network, and what aspect of fast do > you think you need? Low latency? High bandwidth? To tell you the truth I can't answer that with more than, "I have a gut feeling". I am in the process of profiling the performance of our current cluster with our programs. Any suggestions ??? > Then... > > What codes are you running? Across how many CPUS? Have you done a > performance analysis on your system to observe "slow" runs in progress, > and are you convinced that the network is the issue? We run exclusively computation fluid dynamics on it. One program is Fluent the other is an in-house turbo-machinery code. My experiences so far have led me to believe Fluent is much more sensitive to the network's performance than the in-house program. Thus my inquiry into a higher performance network. > We have done lots of tuning bits for customers where the issues wound up > being something else than what they had thought. It is worth at least > looking into for your code/problems, and identifying the bottleneck (if > you haven't already done so). Do you have more information on this 'tuning for customers'. I am interested in your results. Again any suggestions on how to go about this are welcomed. Thanks, Chris From laytonjb at charter.net Tue Oct 12 14:43:37 2004 From: laytonjb at charter.net (Jeffrey B. Layton) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097611167.28704.104.camel@syru212-207.syr.edu> References: <1097611167.28704.104.camel@syru212-207.syr.edu> Message-ID: <416C5009.7070902@charter.net> Chris Sideroff wrote: >I'm sure posing this may raise more questions than answer but which >high-speed interconnect would offer the best 'bang for the buck': > >1) myrinet >2) quardics qsnet >3) mellanox infiniband > > Just as a data point, I've recently seen IB prices as low as $600 a port including HBA's, cables, software, etc. To also had a little fuel to the fire, if you are using your own codes, try a different MPI. There are a couple of MPI's with really good performance over GigE. Another option is to look at a RDMA NIC. For example, Ammasso has a low-latency GigE NIC. I don't know prices, but be sure to do some testing on these NICs vs. IB and Myrinet. Then you can make a better decision. Good Luck! 
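Before spending on new hardware, it is worth measuring what the current GigE (or any loaner fabric) actually delivers to MPI. A bare-bones ping-pong like the sketch below -- standard MPI-1 calls only, run with exactly two ranks, one per node -- reports half-round-trip latency and bandwidth per message size, which is usually enough to compare MPI stacks or NICs. The numbers it prints are whatever your network and MPI give you; nothing here is vendor data:

/* pingpong.c -- minimal MPI ping-pong for comparing interconnects and MPI stacks.
 * Run with exactly two ranks, one per node, e.g.:  mpirun -np 2 ./pingpong
 * Prints half-round-trip latency and bandwidth for a range of message sizes.
 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, reps, i;
    double t0, half_rtt;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (size = 1; size <= (1 << 20); size *= 4) {
        char *buf = malloc(size);
        reps = (size > 65536) ? 100 : 1000;       /* fewer reps for big messages */

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        half_rtt = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way estimate */
        if (rank == 0)
            printf("%8d bytes  %10.2f us  %10.2f MB/s\n",
                   size, half_rtt * 1e6, (size / half_rtt) / 1e6);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}

Running the same binary over the stock TCP MPI, an alternative GigE-tuned MPI, and a borrowed Myrinet or IB setup makes the latency-vs-bandwidth tradeoff very concrete.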
Jeff From mprinkey at aeolusresearch.com Tue Oct 12 14:05:27 2004 From: mprinkey at aeolusresearch.com (Michael T. Prinkey) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <416C4339.3040309@scalableinformatics.com> Message-ID: This won't help with your Opteron systems as they probably have broadcom (tg3) NICs, but GAMMA has just released an update that supports Intel (e1000) gigabit cards: http://www.disi.unige.it/project/gamma/index.html They have an MPI implementation as well: http://www.disi.unige.it/project/gamma/mpigamma/index.html They claim vastly improved latency and incrementally improved bandwidth on gigabit hardware relative to TCP/IP. We are planning to test it with the new Xeon cluster we will be building next month. It will be interesting to see how it fairs with LINPACK and the MFIX CFD code. Anyone given GAMMA a try? Mike On Tue, 12 Oct 2004, Joe Landman wrote: > First questions first: > > Why do you think you need a faster network, and what aspect of fast do > you think you need? Low latency? High bandwidth? > > Then... > > What codes are you running? Across how many CPUS? Have you done a > performance analysis on your system to observe "slow" runs in progress, > and are you convinced that the network is the issue? > > We have done lots of tuning bits for customers where the issues wound up > being something else than what they had thought. It is worth at least > looking into for your code/problems, and identifying the bottleneck (if > you haven't already done so). > > That said, all the below require an external "switch" fabric. All range > from $500-$2000 per HBA, and about $1000 or more per switch port. > Varies a bit. Performance is comparible in most cases, with IB seeming > to have a higher ceiling than the others. > > Joe > > Chris Sideroff wrote: > > >I'm sure posing this may raise more questions than answer but which > >high-speed interconnect would offer the best 'bang for the buck': > > > >1) myrinet > >2) quardics qsnet > >3) mellanox infiniband > > > >Currently, our 30 node dual Opteron (MSI K8D Master-FT boards) cluster > >uses Gig/E and are looking to upgrade to a faster network. > > > >As well, what are the components would one need for each setup? The > >reason I ask is for example the Myrinet switches accept different line > >cards and am not sure which one to use. Sorry if this a bit of a newbie > >question but I have no experience with any of this kind of hardware. I > >am reading the docs for each but thought your feedback would be good. > > > >Thanks > > > >Chris Sideroff > > > >_______________________________________________ > >Beowulf mailing list, Beowulf@beowulf.org > >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > > > > From rgb at phy.duke.edu Tue Oct 12 15:39:19 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097615626.28704.129.camel@syru212-207.syr.edu> References: <1097611167.28704.104.camel@syru212-207.syr.edu> <416C4339.3040309@scalableinformatics.com> <1097615626.28704.129.camel@syru212-207.syr.edu> Message-ID: On Tue, 12 Oct 2004, Chris Sideroff wrote: > On Tue, 2004-10-12 at 16:48, Joe Landman wrote: > > First questions first: > > > > Why do you think you need a faster network, and what aspect of fast do > > you think you need? Low latency? High bandwidth? 
> > To tell you the truth I can't answer that with more than, "I have a > gut feeling". I am in the process of profiling the performance of our > current cluster with our programs. Any suggestions ??? Analyze the applications, preferrably at the code level. If they exchange a few, big messages then they are likely bandwidth limited. If they exchange many, small messages then they are likely latency limited. If you don't have access to the code, then run a tool such as xmlsysd/wulfstat that lets you watch the (ether)net on a whole cluster at once as it runs your applications and take note on e.g. packet counts per second per node, net data throughput per second per node. Joe's question is dead on the money. Until you do this, you cannot be sure that your application is choking due to a network that is "slow" in any dimension. Even if it IS slow due the network, it may not be slow in a sense that can be substantively fixed by changing networks, if you're already using gigE. gigE's latency isn't great, but its bandwidth should be at least comparable (within a factor of 1-3) of the faster networks. Sometimes, also, the problem is the network but not at the physical layer; rather in the way the code itself is organized and uses the network. If the code is YOUR code, then a trip through e.g. Ian Foster's book on parallel programming and algorithms (there are several others with good reputations) is indicated before investing a LOT of money in a new network. If the code is somebody else's code, then the list is a great place to get actual feedback on what the essential bottlenecks are and to learn of actual clusters that are successful designs. It sounds (below) like you have a bit of both -- good luck finding Fluent users or a Fluent-savvy consultant on the list (both seem pretty likely). Before departing, I'd suggest working with vendors to arrange a loaner network and prototyping it with your programs before finally buying it. These networks are a substantial investment, as the companies that sell them well know. The companies are quite competitive and want your business. They are usually pretty willing to let their hardware "speak for itself" so you aren't investing $1-2K/node only to learn afterwards that it doesn't speed your code up at all. That is an outcome that benefits nobody, really, not even the network vendor (as you'll doubtless later poison their reputation in this very competitive and reputation-sensitive marketplace). rgb > > > Then... > > > > What codes are you running? Across how many CPUS? Have you done a > > performance analysis on your system to observe "slow" runs in progress, > > and are you convinced that the network is the issue? > > We run exclusively computation fluid dynamics on it. One program is > Fluent the other is an in-house turbo-machinery code. My experiences so > far have led me to believe Fluent is much more sensitive to the > network's performance than the in-house program. Thus my inquiry into a > higher performance network. > > > We have done lots of tuning bits for customers where the issues wound up > > being something else than what they had thought. It is worth at least > > looking into for your code/problems, and identifying the bottleneck (if > > you haven't already done so). > > Do you have more information on this 'tuning for customers'. I am > interested in your results. Again any suggestions on how to go about > this are welcomed. 
> > Thanks, Chris > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Tue Oct 12 16:38:51 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: References: <1097611167.28704.104.camel@syru212-207.syr.edu> <416C4339.3040309@scalableinformatics.com> <1097615626.28704.129.camel@syru212-207.syr.edu> Message-ID: On Tue, 12 Oct 2004, Robert G. Brown wrote: > you're already using gigE. gigE's latency isn't great, but its > bandwidth should be at least comparable (within a factor of 1-3) of the > faster networks. Correction (as Greg pointed out offline, very kindly): My gross generalization is grossly incorrect -- you CAN get nearly an order of magnitude improvement in bandwidth with Quadrics and IB. I stand humbly corrected. But you still should verify that bandwidth is your problem before investing in more of it. rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From daniel.kidger at quadrics.com Tue Oct 12 16:44:28 2004 From: daniel.kidger at quadrics.com (Dan Kidger) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097611167.28704.104.camel@syru212-207.syr.edu> References: <1097611167.28704.104.camel@syru212-207.syr.edu> Message-ID: <200410130044.28260.daniel.kidger@quadrics.com> Chris, > I'm sure posing this may raise more questions than answer but which > high-speed interconnect would offer the best 'bang for the buck': > > 1) myrinet > 2) quadrics qsnet > 3) mellanox infiniband > > Currently, our 30 node dual Opteron (MSI K8D Master-FT boards) cluster > uses Gig/E and are looking to upgrade to a faster network. Well, I am from one of the vendors that you cite, so perhaps my reply is biased. But hopefully I can reply without it seeming like a sales pitch. Our QsNetII interconnect sells for around $1700 per node (card=$999, rest is cable and share of the switch). A 4U high 32-way switch would be the nearest match in terms of size for a 30-node cluster. (c $14K iirc) MPI bandwidth is 875MB/s on Opteron (higher on say IA64/Nocona but the AMD PCI-X bridge limits us), MPI latency is 1.5us on Opteron - only slightly better than the Cray/Octigabay Opteron product (usually quoted as 1.7us.) Infiniband bandwidth is only a little less than ours, and latency not much worse than twice ours. Myrinet lags a fair bit currently but they do have a new faster product soon to hit the market which you should look out for. All vendors have a variety of switch sizes - either as a fixed size configuration - or as a chassis that takes one or more line cards that can be upgraded if your cluster gets expanded. Some solutions such as Myrinet revE cards need two switch ports per node but otherwise you just need a switch big enough for your node count and allowing for possible future expansion. Very large clusters have multiple switch cabinets arranged as node-level switches which have links to the nodes and top-level 'spine' switch cabinets that interconnect the node-level cabinets.
If you have the same number of links to the spine switches as you do to the actual nodes then you should have 'full bisectionall bandwidth'. However you can save money by cutting back on the amount of spine switching you buy. Many interconnect vendors offer a choice of copper or fibre cabling. The former is often cheaper (no expensive lasers) but the latter can be used for longer cable runs and is often easier to physically manage particularly when installing very large clusters. What to buy depends very much on your application. Maybe you haven't proved that your GigE is the limiting factor. I do have figures for Fluent on ours and other interconnects but the Beowulf list is not the correct place to post these. As Robert pointed out, most vendors will loan equipment for a month or so and indeed many can provide external access to clusters for benchmarking purposes. Also for example the AMD Developer Center has large Myrient and Infiniband clusters that you can ask to get access to. Hope this helps, Daniel -------------------------------------------------------------- Dr. Dan Kidger, Quadrics Ltd. daniel.kidger@quadrics.com One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 ----------------------- www.quadrics.com -------------------- From evan.cull at duke.edu Tue Oct 12 18:47:29 2004 From: evan.cull at duke.edu (Evan Cull) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] a cluster to drive a wall of monitors Message-ID: <416C8931.4030805@duke.edu> Hi all, I was told this list would be a good place to ask for advice on the following project. (I've tried to search through list archives for related info, but I haven't managed to spot anything so far.) I'm helping with a project that want's to drive a wall of about 50 LCD panels with a linux cluster running Syzygy: http://www.isl.uiuc.edu/syzygy.htm I was considering a cluster of either 50 single processor nodes or 25 dual processor + dual output graphics card nodes. I suppose 50 dual processor nodes would be nice, but I'm pretty sure that's well out of my budget range. I'm betting that the 50 single processor nodes would easily have twice the graphics performance of the 25 dual nodes because they have 2x as many video cards. The tradeoff here is that the dual processor nodes might be more useful for other more general computing tasks we could run on them. Does anyone here have experience buying rackmountable cluster nodes *with graphics cards* who can point me to a vendor? For that matter, have any of you built a similar system & have any suggestions / comments? thanks, Evan Cull From mlleinin at hpcn.ca.sandia.gov Tue Oct 12 20:42:48 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt L. Leininger) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: References: Message-ID: <1097638968.8496.184.camel@trinity> On Tue, 2004-10-12 at 17:06 -0400, Mark Hahn wrote: > > IB vendors swear up and down that they're cheaper than Myri, > lower-latency, higher bandwidth and taste great with iced cream. > I must admit to some skepticism in spite of lacking any IB experience ;) > it does seem clear that upcoming PCI-e systems will let IB vendors > drop a few more chips off their nic, and theoretically come down to > the $2-300/nic range. as far as I know, switches are staying more or > less at the same price. 
and it's worth remembering that IB still > doesn't have *that* much field-proof (questions regarding whether IB > will continue to be a sole-source ecosystem, issues of integrating > with Linux, rumors of sticking points regarding pinned memory, qpair > scaling in large clusters, handling congestion, etc.) > There are multiple 128 node (and greater) IB systems that are stable and are being used for production apps. The #7 top500 machine from RIKEN is using IB and has been in production for over six months. My cluster at Sandia (about 128 nodes) is being used for IB R&D and production science runs. The science runs have produce many papers over the last 9 months. We've purchased other IB clusters ranging from 64 to >300 nodes that are for production use. All run great under Linux, and you have multiple IB vendors to choose from (Voltaire, Topspin, InfiniCon, and Mellanox). Almost all of the IB software development is done under Linux first and then ported to other OSes. QP scaling isn't as critical an issue if the MPI implementation sets up the connections as needed (kinda of a lazy connection setup). Why set up an all-to-all QP connectivity if the MPI implements an all-to-all or collectives as tree based pt2pt algorithms. Network congestion on larger clusters can be reduced by using source based adaptive (multipath) routing instead of the standard IB static routing. Also remember that IB has a lot more field experience than the latest Myricom hardware and MX software stack. - Matt From Thomas_Hoeffel at chiron.com Tue Oct 12 14:54:09 2004 From: Thomas_Hoeffel at chiron.com (Hoeffel, Thomas) Date: Wed Nov 25 01:03:28 2009 Subject: [Beowulf] torque vs openpbs? Message-ID: <1D07750058CEAC4396F1FAB701900301028C7E21@emvosiris.chiron.com> our cluster environment is beginning to tax our openpbs installation. It runs fine on our old cluster (PIII's/10/100 switch) but is a bit quirky on the newer opterons (gig switches, more mem...etc.) Pricing for PBSPro is, well, a bit outrageous and I'm considering Torque/Maui combo. Any thoughts/feedback on the size of the torque community, it's life expectancy..etc. SGE is currently not an option as we have 3rd party code which interfaces well w/ PBS but not SGE. Thomas J. Hoeffel Computational Chemistry Chiron Corp. MS 4.2 4560 Horton St. Emeryville, CA 94608 From alvin at Mail.Linux-Consulting.com Tue Oct 12 22:05:40 2004 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] a cluster to drive a wall of monitors In-Reply-To: <416C8931.4030805@duke.edu> Message-ID: hi ya evan On Tue, 12 Oct 2004, Evan Cull wrote: > I'm helping with a project that want's to drive a wall of about 50 LCD > panels with a linux cluster running Syzygy: > http://www.isl.uiuc.edu/syzygy.htm i didn't see 4, 16, 50 monitors at that site :-) but maybe i didnt look in the right places or with the right eyeballs for 2x3 or more monitors .. 
http://www.linux-1u.net/X11/Quad/ - a wall of 16 monitors http://www.linux-1u.net/X11/Quad/gstreamer.net/vw1.png http://www.linux-1u.net/X11/Quad/gstreamer.net/vw2.png http://www.linux-1u.net/X11/Quad/gstreamer.net/video-wall-howto.html the trick is to divide out the one pic into 1/4 pics each and the bracket between each adjacent lcd to be minimal and non-distracting fromt eh whole image displayed on 4 or more monitors lots of XF86Config editing and tweeking doing that with *.jpg is almost trivial doing that with *.mpeg with mplayer/zine becomes a fun project > I was considering a cluster of either 50 single processor nodes or 25 > dual processor + dual output graphics card nodes. I suppose 50 dual an itty bitty P3-800 equivalent cpu can trivially play an mpeg file ( you dont need horsepower to play mpegs ) if you are encoding ... that might be trickier .. and that you'd need to keep the video and audio in sync ( not trivial ) - lots of rejected *.mpegs due to sound and video being out of sync ( even on the fastest pcs ) > Does anyone here have experience buying rackmountable cluster nodes > *with graphics cards* who can point me to a vendor? we sell those puppises, which is half the fun .. :-) > For that matter, have any of you built a similar system & have any > suggestions / comments? depending on where the movies are being played, remote admin or not and if they "hit reset" or powerfailures will be yur biggest problem - we have 100 systems in 100 cities across this itty-bitty-land c ya alvin From hahn at physics.mcmaster.ca Tue Oct 12 22:40:03 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097638968.8496.184.camel@trinity> Message-ID: > There are multiple 128 node (and greater) IB systems that are stable > and are being used for production apps. The #7 top500 machine from I thank you for this street-level information! it's frustrating to only know a technology based on marketing... > RIKEN is using IB and has been in production for over six months. My > cluster at Sandia (about 128 nodes) is being used for IB R&D and still, 128 nodes is fairly small these days. would you characterize your applications as fairly bandwidth-intensive? I know that many of the apps that run on really big weapons-related labs tend to emphasize latency to an extreme degree, but perhaps your codes are not like that? > >300 nodes that are for production use. All run great under Linux, and > you have multiple IB vendors to choose from (Voltaire, Topspin, > InfiniCon, and Mellanox). well, aren't all of those just minor modifications of the same mellanox chip? that's what I meant by "not-really-multi-vendor". the IB world would like to compare itself to the eth world, but it's a very, very long way away from being really vendor-independent. > Almost all of the IB software development is > done under Linux first and then ported to other OSes. very interesting! do you mean user-level IB software and middleware? I had the impression (circa OLS in July) that there was no real unification of linux IB stacks, and significant problems with windows-centricness of the code. > QP scaling isn't as critical an issue if the MPI implementation sets > up the connections as needed (kinda of a lazy connection setup). Why > set up an all-to-all QP connectivity if the MPI implements an all-to-all > or collectives as tree based pt2pt algorithms. that sounds reasonable, but does it work out well? 
I guess it would depend mainly on whether the actual collective groups change frequently and are reused. > Network congestion on > larger clusters can be reduced by using source based adaptive > (multipath) routing instead of the standard IB static routing. interesting, again! in the most recent visit by S&M people from an IB vendor, they claimed that there was no problem and that any reasonably smart switch would have a routing manager smart enough to prevent the non-problem. > Also remember that IB has a lot more field experience than the latest > Myricom hardware and MX software stack. to me, "recent myricom" means e-cards, which I, perhaps naively, think are more of a known quantity than anything IB. and I haven't managed to lay hands on MX yet . I'm really glad to hear early adopters of IB speak up; I still claim that they actually are early adopters, though ;) regards, mark hahn. From nixon at nsc.liu.se Wed Oct 13 07:15:11 2004 From: nixon at nsc.liu.se (Leif Nixon) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] Grid Engine question In-Reply-To: <1097612732.18951.14.camel@pel> (Sean Dilda's message of "Tue, 12 Oct 2004 16:25:32 -0400") References: <416B91BB.8070102@ohmsurveys.com> <1097612732.18951.14.camel@pel> Message-ID: <873c0iwvps.fsf@nsc.liu.se> Sean Dilda writes: > In SGE 6.0 they added a feature they call 'advanced reservations'. Its > not really advanced, and its not what I consider 'reservations' to be, > but it is exactly what you want. That's "advance reservations", not "advanced reservations". -- Leif Nixon Systems expert ------------------------------------------------------------ National Supercomputer Centre Linkoping University ------------------------------------------------------------ From brian at cypher.acomp.usf.edu Wed Oct 13 09:05:57 2004 From: brian at cypher.acomp.usf.edu (Brian R Smith) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] a cluster to drive a wall of monitors In-Reply-To: <416C8931.4030805@duke.edu> References: <416C8931.4030805@duke.edu> Message-ID: <1097683557.23424.11.camel@cypher.acomp.usf.edu> Evan, We just built a smaller version of that, a 4x3 display wall. We went the dual-output video card route and deeply regret it. The performance is rather lackluster and getting the screens to align correctly e.g. not displaying broken chunks of images, is damn near impossible with DMX/Chromium. We originally configured it so that each video card ran two frame buffers, one for each screen, allowing us to configure each displays bounderies manually. The problem was that running two frame buffers on each card resulted in further degrading each machine's performance. If at all possible, make sure you have one machine for each screen. Of course, YMMV, but this has been our experience in the matter. Good luck. Brian Smith On Tue, 2004-10-12 at 21:47, Evan Cull wrote: > Hi all, > > I was told this list would be a good place to ask for advice on the > following project. (I've tried to search through list archives for > related info, but I haven't managed to spot anything so far.) > > I'm helping with a project that want's to drive a wall of about 50 LCD > panels with a linux cluster running Syzygy: > http://www.isl.uiuc.edu/syzygy.htm > > I was considering a cluster of either 50 single processor nodes or 25 > dual processor + dual output graphics card nodes. I suppose 50 dual > processor nodes would be nice, but I'm pretty sure that's well out of my > budget range. 
I'm betting that the 50 single processor nodes would > easily have twice the graphics performance of the 25 dual nodes because > they have 2x as many video cards. The tradeoff here is that the dual > processor nodes might be more useful for other more general computing > tasks we could run on them. > > Does anyone here have experience buying rackmountable cluster nodes > *with graphics cards* who can point me to a vendor? > > For that matter, have any of you built a similar system & have any > suggestions / comments? > > thanks, > Evan Cull > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From chliaskos at yahoo.gr Wed Oct 13 00:10:53 2004 From: chliaskos at yahoo.gr (Chris LS) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] Optimal Number of nodes? Message-ID: <20041013071053.82519.qmail@web86906.mail.ukl.yahoo.com> Hello, I'm an electrical engineering student. Can anyone help me on the following subject? I have to theoretically design a clustered server, and although most parts of the procedure are complete, I can't find a way to calculate or estimate the optimal number of nodes needed. I've spent countless hours searching but I can't find anything but very general advice, while I'm interested in the exact procedure - if it exists. Can anyone give me any related links or any other info on this? Thanks in advance! Chris chliaskos@yahoo.gr -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041013/f27d219d/attachment.html From cnsidero at syr.edu Wed Oct 13 07:55:05 2004 From: cnsidero at syr.edu (Chris Sideroff) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097611167.28704.104.camel@syru212-207.syr.edu> References: <1097611167.28704.104.camel@syru212-207.syr.edu> Message-ID: <1097679300.28704.160.camel@syru212-207.syr.edu> Thanks for all the excellent replies. The consensus seems to be to carry out some performance profiling with the current hardware and compare with a high-speed network cluster. The former I am in the process of doing and the latter I will try to do. More specifically, I believe the most important testing for our cluster will be Fluent's scalability and sensitivity to the network. The reason I say this is because there are multiple users (~6-8) running large Fluent jobs (1-10 million cells) with various solvers, which have different CPU and memory requirements. While the in-house code is run by one person, a lot less frequently and (for other reasons) currently does not use more than 8 processors. Following rgb's suggestions, it will be difficult to analyze Fluent at the code level since we don't have the code, but it has some built-in monitors that I can use combined with some Linux tools. When the time comes to scale the in-house code to >8 procs we will have greater flexibility tuning it. BTW, thanks for the parallel programming reference. I did find some benchmarks on Fluent's website which indicate that Myrinet definitely scales better than GigE but I still want to carry out my own benchmarking. If anyone has any experience benchmarking their clusters using Fluent feel free to supply your thoughts.
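For the scalability side, the usual first pass is to run the same representative case at 1, 2, 4, 8, 16 CPUs and reduce the wall-clock times to speedup and parallel efficiency before pointing a finger at the network. A tiny sketch of that bookkeeping is below; the times[] entries are made-up placeholders to show the arithmetic, not measurements from any cluster or from Fluent:

/* scaling.c -- turn wall-clock times for the same job at different CPU
 * counts into speedup and parallel efficiency.  The times[] values below
 * are placeholders; substitute measured wall-clock seconds from real runs.
 */
#include <stdio.h>

int main(void)
{
    int    ncpu[]  = { 1, 2, 4, 8, 16 };
    double times[] = { 3600.0, 1850.0, 980.0, 560.0, 390.0 };   /* hypothetical */
    int    n = sizeof(ncpu) / sizeof(ncpu[0]);
    int    i;

    printf("%6s %10s %8s %11s\n", "CPUs", "wall (s)", "speedup", "efficiency");
    for (i = 0; i < n; i++) {
        double speedup = times[0] / times[i];
        double eff     = speedup / ncpu[i];
        printf("%6d %10.1f %8.2f %10.0f%%\n", ncpu[i], times[i], speedup, 100.0 * eff);
    }
    /* Rule of thumb: if efficiency stays high (say above 80%) out to the CPU
     * counts you actually use, a faster interconnect is unlikely to buy much;
     * if it collapses, find out why (latency, bandwidth, I/O) before buying. */
    return 0;
}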
From joachim at ccrl-nece.de Wed Oct 13 07:37:18 2004 From: joachim at ccrl-nece.de (joachim@ccrl-nece.de) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097638968.8496.184.camel@trinity> References: <1097638968.8496.184.camel@trinity> Message-ID: <3836.221.114.211.251.1097678238.squirrel@postman.ccrl-nece.de> > QP scaling isn't as critical an issue if the MPI implementation sets > up the connections as needed (kinda of a lazy connection setup). Why > set up an all-to-all QP connectivity if the MPI implements an all-to-all > or collectives as tree based pt2pt algorithms. Network congestion on Good MPI collectives often are not tree based, but need more connectivity. Of course, the best collectives are optimized for a certain interconnect. Joachim From john.hearns at clustervision.com Wed Oct 13 00:26:35 2004 From: john.hearns at clustervision.com (John Hearns) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] torque vs openpbs? In-Reply-To: <1D07750058CEAC4396F1FAB701900301028C7E21@emvosiris.chiron.com> References: <1D07750058CEAC4396F1FAB701900301028C7E21@emvosiris.chiron.com> Message-ID: <1097652395.8836.8.camel@vigor12> On Tue, 2004-10-12 at 22:54, Hoeffel, Thomas wrote: > our cluster environment is beginning to tax our openpbs installation. > It runs fine on our old cluster (PIII's/10/100 switch) but is a bit quirky > on the newer opterons (gig switches, more mem...etc.) > Pricing for PBSPro is, well, a bit outrageous and I'm considering > Torque/Maui combo. > > Any thoughts/feedback on the size of the torque community, it's life > expectancy..etc. > SGE is currently not an option as we have 3rd party code which interfaces > well w/ PBS but not SGE. Is this a package called Materials Studio by any chance? I posted something a few weeks ago to the Gridengine list about that. Looking at the interface, it didn't look too hard to port to Gridengine. From laurenceliew at yahoo.com.sg Wed Oct 13 06:04:21 2004 From: laurenceliew at yahoo.com.sg (Laurence Liew) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] a cluster to drive a wall of monitors In-Reply-To: <416C8931.4030805@duke.edu> References: <416C8931.4030805@duke.edu> Message-ID: <416D27D5.4020802@yahoo.com.sg> Hi, Check out the Visualization Roll over in www.rocksclusters.org... The SDCS Rocks guys have got such a system setup.... 3x3 panels... driven by a Rocks cluster with 9 compute nodes (Shuttle XPC + NVidia cards) and 1 frontend.... Contact me offline if there is more interest... or post to the Rocks mailing list. Cheers! laurence Evan Cull wrote: > Hi all, > > I was told this list would be a good place to ask for advice on the > following project. (I've tried to search through list archives for > related info, but I haven't managed to spot anything so far.) > I'm helping with a project that want's to drive a wall of about 50 LCD > panels with a linux cluster running Syzygy: > http://www.isl.uiuc.edu/syzygy.htm > > I was considering a cluster of either 50 single processor nodes or 25 > dual processor + dual output graphics card nodes. I suppose 50 dual > processor nodes would be nice, but I'm pretty sure that's well out of my > budget range. I'm betting that the 50 single processor nodes would > easily have twice the graphics performance of the 25 dual nodes because > they have 2x as many video cards. The tradeoff here is that the dual > processor nodes might be more useful for other more general computing > tasks we could run on them. 
> Does anyone here have experience buying rackmountable cluster nodes > *with graphics cards* who can point me to a vendor? > > For that matter, have any of you built a similar system & have any > suggestions / comments? > > thanks, > Evan Cull > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- A non-text attachment was scrubbed... Name: laurenceliew.vcf Type: text/x-vcard Size: 150 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041013/432b1a47/laurenceliew.vcf From oliviacal at earthlink.net Tue Oct 12 21:55:24 2004 From: oliviacal at earthlink.net (Olivia) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] Beowulf Illustrated Message-ID: <410-220041031345524410@earthlink.net> Is There a picture of Beowulf? I have to draw it on a poster board for a project. Olivia Calzada oliviacal@earthlink.net Why Wait? Move to EarthLink. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041012/65bc212c/attachment.html From bill at cse.ucdavis.edu Wed Oct 13 13:31:39 2004 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097611167.28704.104.camel@syru212-207.syr.edu> References: <1097611167.28704.104.camel@syru212-207.syr.edu> Message-ID: <20041013203139.GB23216@cse.ucdavis.edu> I'm glossing over many details, but in general I've found the below mentioned strategies a good first order approximation. I'd suggest taking several representative production runs and try graph the performance on 1,2,4,8,16,32 processors or whatever is feasible for your jobs and cluster. If you see good scaling I.e. each jump gets almost twice as much work done you very likely will not benefit from a faster interconnect. If you do not see good scaling then you might be bottlenecked by latency or bandwidth, or possibly other factors like a faster than linear increase in work with extra nodes, and disk I/O performance among others. Hope for the first, it will save you money, time and effort. If it's the later then it would be worth your while to try to find out exactly why your code isn't scaling. Even the simplest measures can help, for instance recording the before and after packet counts as reported by ifconfig. Graphing how the number of packets increases with N and how the performance scales with N might provide valuable insight. Another dirty hack can be to force your interfaces to 100 Mbit and see how the performance changes. If it's minimal it's likely to be either latency (100 Mbit and GigE usually don't vary by much) or not bandwidth constrained. Also something like ganglia can provide you with a significant amount of additional info for a run, so you can watch memory, network, load, memory used, buffers used etc. See how these variables change with the timestep and with the number of nodes can be very helpful for getting a general idea of how your job is behaving. One particular job I was running had network traffic increasing with each iteration, above a certain point the wall clock time per timestep increased. Calculations showed I was getting 30% of peak GigE performance, it is likely that between the packet overhead and MPI overhead that was as fast as I was likely to see. 
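A minimal sketch of the before/after counter idea, reading the counters straight from /proc/net/dev on Linux instead of parsing ifconfig output (the script name and the wrapped command are only examples):

# netdelta.py (hypothetical name): snapshot per-interface byte/packet
# counters from /proc/net/dev, run the command given on the command line,
# then print the deltas. Linux only.
import subprocess, sys

def read_counters():
    counters = {}
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:            # skip the two header lines
            iface, data = line.split(":", 1)
            fields = data.split()
            counters[iface.strip()] = {
                "rx_bytes": int(fields[0]), "rx_packets": int(fields[1]),
                "tx_bytes": int(fields[8]), "tx_packets": int(fields[9]),
            }
    return counters

before = read_counters()
subprocess.call(sys.argv[1:])        # e.g. python netdelta.py mpirun -np 4 ./a.out
after = read_counters()

for iface in sorted(after):
    if iface in before:
        delta = dict((k, after[iface][k] - before[iface][k]) for k in after[iface])
        if delta["rx_packets"] or delta["tx_packets"]:
            print(iface, delta)

Dividing the byte deltas by the wall-clock time of the run gives the average bandwidth actually achieved, which can then be compared against the theoretical peak of the link.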
Certainly none of the above will give you as good as an idea as a source code analysis or a fully profiled run, but they can help steer you in the right direction. -- Bill Broadley Computational Science and Engineering UC Davis From landman at scalableinformatics.com Tue Oct 12 22:12:31 2004 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097638968.8496.184.camel@trinity> References: <1097638968.8496.184.camel@trinity> Message-ID: <416CB93F.40502@scalableinformatics.com> Hi Matt: Good to see you here ... :) Matt L. Leininger wrote: > > > > There are multiple 128 node (and greater) IB systems that are stable >and are being used for production apps. The #7 top500 machine from >RIKEN is using IB and has been in production for over six months. My >cluster at Sandia (about 128 nodes) is being used for IB R&D and > > FWIW I used the nice setup that the AMD Dev center team have set up for benchmarking and testing. They have a nice IB platform there. [...] > QP scaling isn't as critical an issue if the MPI implementation sets >up the connections as needed (kinda of a lazy connection setup). Why >set up an all-to-all QP connectivity if the MPI implements an all-to-all >or collectives as tree based pt2pt algorithms. Network congestion on >larger clusters can be reduced by using source based adaptive >(multipath) routing instead of the standard IB static routing. > > On features utility ... (qp scaling, ...) (more to Mark than Matt here) One of the things I remember as a "feature" much touted by the marketeers in the ccNUMA 6.5 IRIX days was page migration. This feature was supposed to ameliorate memory access hotspots in parallel codes. Enough hits on a page from a remote CPU, and whammo, off it went to the remote CPU. Turns out this was "A Bad Thing(TM)". There were many reasons for this, but in the end, page migration was little more than a marginal feature, best used in specific corner cases. Sure, someone will speak up and tell me how much pain it saved them, or made their code 3 orders of magnitude faster. I never saw that in general. I got better results from dplace, and large pages than I ever got from some of these other features. The point is that there are often lots of features. Some of which might even be generally useful. Others might simply not be useful as the application level issues might be better served by other methods (as you pointed out). IB works pretty nicely on clusters. So do many of the other interconnects. If you have latency bound or bandwidth bound problems, certainly it would be worth looking into. The original question was which to look at. First the need needs to be assessed, and from there, a reasonable comparison may be made. IB does look like it is drawing wide support right now, and is not single sourced. It may be possible (though I haven't done much in the way of measurement) that tcp offload systems might help as well. If you are not extremely sensitive to latency, you might be able to use these. If you are, you should stick to the low latency fabrics. > Also remember that IB has a lot more field experience than the latest >Myricom hardware and MX software stack. 
> > Joe From deadline at linux-mag.com Wed Oct 13 15:41:32 2004 From: deadline at linux-mag.com (Douglas Eadline, Cluster World Magazine) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] a cluster to drive a wall of monitors In-Reply-To: <416C8931.4030805@duke.edu> Message-ID: We did an issue on this: http://www.clusterworld.com/issues/jul-04-preview.shtml BTW: Issue gallery is here: http://www.clusterworld.com/issues.shtml We are working on way make back issues available. For now, if you know someone that gets ClusterWorld, maybe you can borrow an issue. Oh, I see you are at Duke. Maybe contact rgb. (http://www.phy.duke.edu/~rgb/) Doug On Tue, 12 Oct 2004, Evan Cull wrote: > Hi all, > > I was told this list would be a good place to ask for advice on the > following project. (I've tried to search through list archives for > related info, but I haven't managed to spot anything so far.) > > I'm helping with a project that want's to drive a wall of about 50 LCD > panels with a linux cluster running Syzygy: > http://www.isl.uiuc.edu/syzygy.htm > > I was considering a cluster of either 50 single processor nodes or 25 > dual processor + dual output graphics card nodes. I suppose 50 dual > processor nodes would be nice, but I'm pretty sure that's well out of my > budget range. I'm betting that the 50 single processor nodes would > easily have twice the graphics performance of the 25 dual nodes because > they have 2x as many video cards. The tradeoff here is that the dual > processor nodes might be more useful for other more general computing > tasks we could run on them. > > Does anyone here have experience buying rackmountable cluster nodes > *with graphics cards* who can point me to a vendor? > > For that matter, have any of you built a similar system & have any > suggestions / comments? > > thanks, > Evan Cull > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- ---------------------------------------------------------------- Editor-in-chief ClusterWorld Magazine Desk: 610.865.6061 Fax: 610.865.6618 www.clusterworld.com From rgb at phy.duke.edu Wed Oct 13 16:01:55 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] a cluster to drive a wall of monitors In-Reply-To: References: Message-ID: On Wed, 13 Oct 2004, Douglas Eadline, Cluster World Magazine wrote: > > We did an issue on this: > > http://www.clusterworld.com/issues/jul-04-preview.shtml > > BTW: Issue gallery is here: > http://www.clusterworld.com/issues.shtml > > We are working on way make back issues available. For now, if > you know someone that gets ClusterWorld, maybe you can borrow an issue. > > Oh, I see you are at Duke. Maybe contact rgb. > (http://www.phy.duke.edu/~rgb/) Oh yeah, kinda forgot about that. Busy week. I'll see if I can dig out the issue from my neatly organized stash (ha!) rgb > > Doug > > > On Tue, 12 Oct 2004, Evan Cull wrote: > > > Hi all, > > > > I was told this list would be a good place to ask for advice on the > > following project. (I've tried to search through list archives for > > related info, but I haven't managed to spot anything so far.) 
> > > > I'm helping with a project that want's to drive a wall of about 50 LCD > > panels with a linux cluster running Syzygy: > > http://www.isl.uiuc.edu/syzygy.htm > > > > I was considering a cluster of either 50 single processor nodes or 25 > > dual processor + dual output graphics card nodes. I suppose 50 dual > > processor nodes would be nice, but I'm pretty sure that's well out of my > > budget range. I'm betting that the 50 single processor nodes would > > easily have twice the graphics performance of the 25 dual nodes because > > they have 2x as many video cards. The tradeoff here is that the dual > > processor nodes might be more useful for other more general computing > > tasks we could run on them. > > > > Does anyone here have experience buying rackmountable cluster nodes > > *with graphics cards* who can point me to a vendor? > > > > For that matter, have any of you built a similar system & have any > > suggestions / comments? > > > > thanks, > > Evan Cull > > > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > > -- > ---------------------------------------------------------------- > Editor-in-chief ClusterWorld Magazine > Desk: 610.865.6061 > Fax: 610.865.6618 www.clusterworld.com > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From kus at free.net Wed Oct 13 11:09:54 2004 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: Message-ID: In message from "Michael T. Prinkey" (Tue, 12 Oct 2004 17:05:27 -0400 (EDT)): > >This won't help with your Opteron systems as they probably have >broadcom >(tg3) NICs, but GAMMA has just released an update that supports Intel >(e1000) gigabit cards: > >http://www.disi.unige.it/project/gamma/index.html > >They have an MPI implementation as well: > >http://www.disi.unige.it/project/gamma/mpigamma/index.html We had some experience with older GAMMA versions, but we stopped using them because of the absence of reliable SMP support and because of the e1000 cards we had begun to use. Now e1000 is supported only for 2.6 kernels (we are using 2.4 for x86_64). The latest pair of GAMMA versions for the 2.4 and 2.6 kernels is attractive and may give GAMMA "new life". But I'm wary of the instability of Intel Pro/1000 NICs, in the sense that Intel exchanges the NIC chips extremely often between different NIC "versions". The GAMMA developers mention the i82546 chipset; what about the others? Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow >They claim vastly improved latency and incrementally improved >bandwidth on >gigabit hardware relative to TCP/IP. We are planning to test it with >the >new Xeon cluster we will be building next month. It will be >interesting >to see how it fairs with LINPACK and the MFIX CFD code. > >Anyone given GAMMA a try?
> >Mike > From hugo at dolphinics.no Wed Oct 13 11:03:09 2004 From: hugo at dolphinics.no (Hugo Kohmann) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] Re: HPC in Windows In-Reply-To: <416B9356.40400@craigsplanet.force9.co.uk> References: <416B9356.40400@craigsplanet.force9.co.uk> Message-ID: All, A very good and stable open source MPI package for Windows ( and Linux/Solaris ) can be found at http://www.lfbs.rwth-aachen.de/content/index.php?ctl_pos=172 This package has been available for several years. Best regards Hugo ========================================================================================= Hugo Kohmann | Dolphin Interconnect Solutions AS | E-mail: P.O. Box 150 Oppsal | hugo at dolphinics.com N-0619 Oslo, Norway | Web: Tel:+47 23 16 71 73 | http://www.dolphinics.com Fax:+47 23 16 71 80 | Visiting Address: Olaf Helsets vei 6 | From michael at halligan.org Wed Oct 13 22:38:33 2004 From: michael at halligan.org (Michael T. Halligan) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] RedHat Satellite Server as a cluster management tool. Message-ID: <416E10D9.70006@halligan.org> Has anybody used (or tried to use) the RHN system as a HPC management tool. I've implemented this successfully in a 100 host environment for a customer of mine, and am in the process of re-architecting an infrastructure with about 150 nodes.. That's about as far as I've gotten with it. Once I get past the cost, the poor documentation, and "OK" support, I'm finding that it's actually a great (though slightly immature) piece of software for the enterprise. The ease of keeping an infrastructure in sync, and tthe lowered workload for sysadmins At 100 nodes, the pricing seems to be about $274/year per node including licensing, entitlements, and the software cost of a RHN server (add another $5k-$7k for a pair of beefy boxes to act as the RHN server.. though as far as I can tell, redhat's specs on the RHN server are far exagerrated.. I could get by with $2500 worth of servers on that end for the environments I've deployed on). So, in the end, $28k/year for an enterprise of 100 servers, in one environment has meant being able to shrink the next year staffing needs by 2 people, and in one by one person, it pays for itself.. We have a 512 node render farm project we're bidding on for a new customer, and I'm wondering how those in the beowulf community who have used RHN satellite server perceive it. So far we're considering LFS and Enfusion, which are both more HPC oriented, but I'm really enjoying RHN as a management system. ---------------- BitPusher, LLC http://www.bitpusher.com/ 1.888.9PUSHER (415) 724.7998 - Mobile From rgb at phy.duke.edu Thu Oct 14 10:39:55 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] RedHat Satellite Server as a cluster management tool. In-Reply-To: <416E10D9.70006@halligan.org> Message-ID: On Wed, 13 Oct 2004, Michael T. Halligan wrote: > Has anybody used (or tried to use) the RHN system as a HPC management > tool. I've implemented this > successfully in a 100 host environment for a customer of mine, and am in > the process of > re-architecting an infrastructure with about 150 nodes.. That's about as > far as I've gotten > with it. Once I get past the cost, the poor documentation, and "OK" > support, I'm finding > that it's actually a great (though slightly immature) piece of software > for the enterprise. The ease of keeping > an infrastructure in sync, and tthe lowered workload for sysadmins I can only say "why bother". 
Everything it does can be done easier, faster, and better with PXE/kickstart for the base install followed by yum for fine tuning the install, updates and maintenance (all totally automagical). Yum is in RHEL, is fully GPL, is well documented, has a mailing list providing the active support of LOTS of users as well as the developers/maintainers, and is free as in air. Oh, and it works EQUALLY well with Centos, SuSE, Fedora Core 2, and other RPM-based distros, and is in wide use in clusters (and LANs) across the country. With PXE/kickstart/yum, you just build and test a kickstart file for the basic node install (necessary in any event), bootstrap the install over the net via PXE, and then forget the node altogether. yum automagically handles updates, and can also manage things like distributed installs and locking a node to a common specified set of packages. It manages all dependencies for you so that things work properly. It takes me ten minutes to install ten nodes, mostly because I like to watch the install start before moving on to handle the rare install that is interrupted for some reason (e.g. a faulty network connection). One can do a lot more than this much faster if you control the boot strictly from PXE so you don't even need to interact with the node on the console at all. How much better than that can you do? Alternatively, there are things like warewulf and scyld where even commercial solutions probably won't work out to be much more (if any more) expensive. Especially when you add in the cost of those two "beefy boxes acting as RHN servers". What a waste! We use a single repository to manage installs and updates for our entire campus (close to 1000 systems just in clusters, plus that many more in LANs and on personal desktops). And the server isn't terribly beefy -- it is actually a castoff desktop being pressed into extended service, although we finally have plans to put a REAL server in pretty soon. I mean, what kind of load does a cluster node generally PLACE on a repository server after the original install? Try "none" and you'd be really close to the truth -- an average of a single package a week updated is probably too high an estimate, and that consumes (let's see) something like 1 network-second of capacity between server and node a week with plain old 100BT. There are solutions that are designed to be scalable and easy to understand and maintain, and then there are solutions designed to be topdown manageable with a nifty GUI (and sell a lot of totally unneeded resources at the same time). Guess which one RHN falls under. Flamingly yours (not at you, but at RHN) rgb > > At 100 nodes, the pricing seems to be about $274/year per node including > licensing, entitlements, and the > software cost of a RHN server (add another $5k-$7k for a pair of beefy > boxes to act as the > RHN server.. though as far as I can tell, redhat's specs on the RHN > server are far exagerrated.. I > could get by with $2500 worth of servers on that end for the > environments I've deployed on). So, in the > end, $28k/year for an enterprise of 100 servers, in one environment has > meant being able to shrink the > next year staffing needs by 2 people, and in one by one person, it pays > for itself.. > > We have a 512 node render farm project we're bidding on for a new > customer, and I'm wondering how those in the > beowulf community who have used RHN satellite server perceive it. 
So far > we're considering LFS and Enfusion, > which are both more HPC oriented, but I'm really enjoying RHN as a > management system. > > ---------------- > BitPusher, LLC > http://www.bitpusher.com/ > 1.888.9PUSHER > (415) 724.7998 - Mobile > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Thu Oct 14 12:14:51 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] RedHat Satellite Server as a cluster management tool. In-Reply-To: <53761.66.150.251.142.1097780055.squirrel@mail3.bitpusher.com> Message-ID: On Thu, 14 Oct 2004, Michael T. Halligan wrote: > > Robert, > > So have you actually used the satellite server? My biggest problem with > using RHN has been the strong lack of deployments it's had.. A lot of We looked at it pretty seriously at Duke -- RH is a short walk away, we've had a long and productive relationship with them, and they were offering us a "deal" on their RHN supported product. The problem (and the likely reason for the strong lack of deployment) is the cost scaling and minimum buy-in. Frankly, if they gave it away free the server requirements are kind of crazy, given the number of machines we run on campus (and the fact that we manage it now, quite successfully, on a shoestring and a mix/choice of Centos and FC2). I think that RHN's major advantage to consumers is topdown network management in corporate environments where the costs of this sort of management tool are swallowed in the greater TCO issues of running a major data center (and where local sysadmin competence is likely to be "red hat certified systems engineers" who've gone through their training and are roughly as deep-roots hapless as their MCSE counterparts tend to be). That is, they know how to use RH's GUI tools, but they really don't UNDERSTAND that much about the systems they manage. For a corporation who just wants it to work and considers spending $100K on this and that so it works with the human resources they have available to be petty cash, that's fine. In the University/Research world, resources tend to be tight, expertise levels are relatively high, and there is even opportunity cost labor in the expert labor pools that can be diverted to learning how to do something really cheaply AND really well. That's why you have Debian clusters, ROCKS clusters, RH/FC/Centos clusters, SuSE clusters, Mandrake clusters, Warewulf clusters, Clustermatic clusters -- all largely "homebrew" at the administrative level (although the cluster-specific projects can get pretty fancy wrapping up the brew;-) -- that avoid using a) anything that you have to pay for if possible; b) anything that you have to pay a LOT for period; c) anything that doesn't scale. Yum requires more investment of effort at the beginning to learn it, as it is a real, command line sysadmin tool and yes, you'll need to read the documentation (some of which I wrote:-), work with it, play with it, figure out how to make it jump through hoops, and ultimately realize that it is REALLY powerful. Designed by sysadmins, for sysadmins. Designed and maintained by people who use it every day in large scale deployments in resource starved institutions. 
That sort of thing. Like all GUI tools vs command line tools, there is the usual "learn to use it in a day, pay for using it forever" that plagues the user of any windowing interface that actually has to manipulate large numbers of files and complex relationships (GUIs are all about simplicity, but not everything is "simple"). So Duke has at least for the moment tabled the RHN issue until there is a clear and burning need for it that justifies the cost, including the cost of diverting our human resources AWAY from using a tool that manifestly scales better once it is mastered. > people just naturally assume redhat is bad (hell, I even do. I use debian > for all of my personal and corporate servers).. But very few who > automatically take that stance have actually worked with the products > enough to give emperical evidence as to why. I love RH. I used to pay them money for their OS distro every major release voluntarily, until they went hypercommercial. Now I use FC2 and may migrate even further away. RH's pricing model is purely corporate. I just don't think they've grokked either the university or the personal or the HPC cluster market, or maybe they have and just don't care. rgb > > It took a while to gather enthusiasm enough to evaluate it, and a couple > of months of solid testing before I could recommend it. I've built about > 1/2 dozen similar deployment/management tools at this point, each one > built for a customer (hence the reason building 6 instead of just > improving upon the same one). > > Imaging is one thing, and yeah kickstart is easy, no objections to that.. > RHN just makes it a lot easier to deal with kickstart. It also gives a > rather useful, but more enterprise focused management system to allow you > to manage (software|config) channels, server groups, and a good method to > deal with groups with unions & intersections. I'm finding it especially > nice at one site at which 1/2 of their servers are used for testing and > 1/2 for their production environment. Pushing new patches, scripts, > commands, files to select sets of systems requires very little effort. > > RedHat's configuration management system is actually really nice. They've > put a simple (but extensible) macro system into it, which allows you to > keep one configuration file for all of the servers in a given class, when > only a few things change, and having system-specific variables be parsed > out when servers pull configs from the gold server.. Sure, you can do this > with cfengine or pikt, but uploading a config file to a webform is a lot > simpler than setting up cfengine/pikt and implementing it (I know this > from a lot of experience. > > One of the lackings of using a yum/pxe/kickstart environment (of which I'm > rather familiar with, currently managing 6 customers with a similar > environment) is that there's no "already there" configuration/versioning > management system. That was one of the key points of redhat, the fact > that I can do at-will repurposing/reprovisioning (like turning a 100 > server 30/70 app db server/app server environment into a 70/30 app/db > server environment in 5 minutes without kickstarting and zero manual > interaction).. Sure, I agree. Although it needn't take THAT long to do with yum and the cluster shell of your choice. Versioning is currently done fairly casually, or outside yum itself. There is no point and click package selector (although any editor works just fine). 
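For what it's worth, the "yum plus the cluster shell of your choice" approach can be sketched in a few lines of Python; the node names below are placeholders, and a real parallel shell does the same job with far more polish:

# Run the same command (here a yum update) on a list of nodes over ssh.
# Node names are placeholders; substitute your own, or use a real cluster shell.
import subprocess

nodes = ["node01", "node02", "node03"]
command = "yum -y update"

failed = []
for node in nodes:
    print("=== %s ===" % node)
    rc = subprocess.call(["ssh", node, command])
    if rc != 0:
        failed.append(node)

if failed:
    print("nodes needing attention: %s" % ", ".join(failed))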
However, you can buy an awful lot of FTE minutes for hundreds of dollars a seat plus thousands for servers, and clusters typically DON'T change often, or much, once their prototypes are completed and debugged. > In the end, it's probably just an apples/oranges comparison.. in a science > lab/school cluster environment, it's probably more a more valuable place > to use a more manual process because grad students are cheap, and interns > are free.. :) In a corporate world, the $28k i'd spend for a 100 server > environment to save a sysadmin's worth of time, pays for itself 10 fold in > terms of environment consistency.. For servers, especially heterogenous servers, it might be worth it. If by "servers" you mean identical nodes (server or otherwise) I'd say this is a waste of money. In HPC it is the latter. In a lot of corporate environments, it is a mix of mostly the latter and some of the former. But I totally agree, the tool is designed for that kind of environment -- structurally complex and deep pocketed. > Either way, I'm not trying to evangelize, just relate my own experiences, > and try to find the best solution for a given problem. What tools out > there are good for this type of a situation, then? Thanks for the refs to > werewulf, I'm checking it out now. No problem. I just am a cost-benefit fanatic. You have to work to convince me that spending order of 20% of the nodes you might be able to buy in a compute cluster on RHN will get more work done per total dollar invested, in most HPC cluster environments, compared to any of a number of GPL free alternatives (many of which have further benefits to their use anyway). rgb > > > > > > > > > On Wed, 13 Oct 2004, Michael T. Halligan wrote: > > > >> Has anybody used (or tried to use) the RHN system as a HPC management > >> tool. I've implemented this > >> successfully in a 100 host environment for a customer of mine, and am in > >> the process of > >> re-architecting an infrastructure with about 150 nodes.. That's about as > >> far as I've gotten > >> with it. Once I get past the cost, the poor documentation, and "OK" > >> support, I'm finding > >> that it's actually a great (though slightly immature) piece of software > >> for the enterprise. The ease of keeping > >> an infrastructure in sync, and tthe lowered workload for sysadmins > > > > > > > > I can only say "why bother". Everything it does can be done easier, > > faster, and better with PXE/kickstart for the base install followed by > > yum for fine tuning the install, updates and maintenance (all totally > > automagical). Yum is in RHEL, is fully GPL, is well documented, has a > > mailing list providing the active support of LOTS of users as well as > > the developers/maintainers, and is free as in air. Oh, and it works > > EQUALLY well with Centos, SuSE, Fedora Core 2, and other RPM-based > > distros, and is in wide use in clusters (and LANs) across the country. > > > > With PXE/kickstart/yum, you just build and test a kickstart file for the > > basic node install (necessary in any event), bootstrap the install over > > the net via PXE, and then forget the node altogether. yum automagically > > handles updates, and can also manage things like distributed installs > > and locking a node to a common specified set of packages. It manages > > all dependencies for you so that things work properly. 
> > > > It takes me ten minutes to install ten nodes, mostly because I like to > > watch the install start before moving on to handle the rare install that > > is interrupted for some reason (e.g. a faulty network connection). One > > can do a lot more than this much faster if you control the boot strictly > > from PXE so you don't even need to interact with the node on the console > > at all. How much better than that can you do? > > > > Alternatively, there are things like warewulf and scyld where even > > commercial solutions probably won't work out to be much more (if any > > more) expensive. Especially when you add in the cost of those two > > "beefy boxes acting as RHN servers". What a waste! We use a single > > repository to manage installs and updates for our entire campus (close > > to 1000 systems just in clusters, plus that many more in LANs and on > > personal desktops). And the server isn't terribly beefy -- it is > > actually a castoff desktop being pressed into extended service, although > > we finally have plans to put a REAL server in pretty soon. > > > > I mean, what kind of load does a cluster node generally PLACE on a > > repository server after the original install? Try "none" and you'd be > > really close to the truth -- an average of a single package a week > > updated is probably too high an estimate, and that consumes (let's see) > > something like 1 network-second of capacity between server and node a > > week with plain old 100BT. > > > > There are solutions that are designed to be scalable and easy to > > understand and maintain, and then there are solutions designed to be > > topdown manageable with a nifty GUI (and sell a lot of totally unneeded > > resources at the same time). Guess which one RHN falls under. > > > > > > Flamingly yours (not at you, but at RHN) > > > > rgb > > > >> > >> At 100 nodes, the pricing seems to be about $274/year per node including > >> licensing, entitlements, and the > >> software cost of a RHN server (add another $5k-$7k for a pair of beefy > >> boxes to act as the > >> RHN server.. though as far as I can tell, redhat's specs on the RHN > >> server are far exagerrated.. I > >> could get by with $2500 worth of servers on that end for the > >> environments I've deployed on). So, in the > >> end, $28k/year for an enterprise of 100 servers, in one environment has > >> meant being able to shrink the > >> next year staffing needs by 2 people, and in one by one person, it pays > >> for itself.. > >> > >> We have a 512 node render farm project we're bidding on for a new > >> customer, and I'm wondering how those in the > >> beowulf community who have used RHN satellite server perceive it. So far > >> we're considering LFS and Enfusion, > >> which are both more HPC oriented, but I'm really enjoying RHN as a > >> management system. > >> > >> ---------------- > >> BitPusher, LLC > >> http://www.bitpusher.com/ > >> 1.888.9PUSHER > >> (415) 724.7998 - Mobile > >> > >> > >> _______________________________________________ > >> Beowulf mailing list, Beowulf@beowulf.org > >> To change your subscription (digest mode or unsubscribe) visit > >> http://www.beowulf.org/mailman/listinfo/beowulf > >> > > > > -- > > Robert G. Brown http://www.phy.duke.edu/~rgb/ > > Duke University Dept. of Physics, Box 90305 > > Durham, N.C. 27708-0305 > > Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu > > > > > > > > > > > ------------------- > BitPusher, LLC > http://www.bitpusher.com/ > 1.888.9PUSHER > (415) 724.7998 - Mobile > -- Robert G. 
Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From agrajag at dragaera.net Thu Oct 14 11:30:41 2004 From: agrajag at dragaera.net (Sean Dilda) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] RedHat Satellite Server as a cluster management tool. In-Reply-To: <416E10D9.70006@halligan.org> References: <416E10D9.70006@halligan.org> Message-ID: <1097778641.22262.27.camel@pel> On Thu, 2004-10-14 at 01:38, Michael T. Halligan wrote: > So, in the > end, $28k/year for an enterprise of 100 servers, in one environment has > meant being able to shrink the > next year staffing needs by 2 people, and in one by one person, it pays > for itself.. As the only sysadmin for a 260-node cluster, I'm extremely curious what jobs those 2 people were supposed to be doing. I have an operations staff to rely on for some environmental stuff and for handling service calls with vendors (I report the problem to them and do the hw replacement, they just take care of the phone call). However, even with 260 nodes I still find a lot of my time spent in trying to improve the cluster as opposed to just keeping it running. From michael at halligan.org Thu Oct 14 11:54:15 2004 From: michael at halligan.org (Michael T. Halligan) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] RedHat Satellite Server as a cluster management tool. In-Reply-To: References: <416E10D9.70006@halligan.org> Message-ID: <53761.66.150.251.142.1097780055.squirrel@mail3.bitpusher.com> Robert, So have you actually used the satellite server? My biggest problem with using RHN has been the strong lack of deployments it's had.. A lot of people just naturally assume redhat is bad (hell, I even do. I use debian for all of my personal and corporate servers).. But very few who automatically take that stance have actually worked with the products enough to give emperical evidence as to why. It took a while to gather enthusiasm enough to evaluate it, and a couple of months of solid testing before I could recommend it. I've built about 1/2 dozen similar deployment/management tools at this point, each one built for a customer (hence the reason building 6 instead of just improving upon the same one). Imaging is one thing, and yeah kickstart is easy, no objections to that.. RHN just makes it a lot easier to deal with kickstart. It also gives a rather useful, but more enterprise focused management system to allow you to manage (software|config) channels, server groups, and a good method to deal with groups with unions & intersections. I'm finding it especially nice at one site at which 1/2 of their servers are used for testing and 1/2 for their production environment. Pushing new patches, scripts, commands, files to select sets of systems requires very little effort. RedHat's configuration management system is actually really nice. They've put a simple (but extensible) macro system into it, which allows you to keep one configuration file for all of the servers in a given class, when only a few things change, and having system-specific variables be parsed out when servers pull configs from the gold server.. Sure, you can do this with cfengine or pikt, but uploading a config file to a webform is a lot simpler than setting up cfengine/pikt and implementing it (I know this from a lot of experience. 
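The one-template-per-class idea, with host-specific variables filled in when the config is delivered, can be illustrated generically in a few lines of Python; this shows only the concept, not RHN's actual macro syntax, and the hostnames and fields are made up:

# Generate per-host config files from one shared template by substituting
# host-specific values. Purely illustrative; hostnames and fields are made up.
from string import Template

template = Template(
    "hostname = $hostname\n"
    "role     = $role\n"
    "dbserver = $dbserver\n"
)

hosts = {
    "app01": {"role": "app", "dbserver": "db01"},
    "app02": {"role": "app", "dbserver": "db01"},
    "db01":  {"role": "db",  "dbserver": "localhost"},
}

for hostname, values in hosts.items():
    config = template.substitute(hostname=hostname, **values)
    with open("%s.conf" % hostname, "w") as f:
        f.write(config)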
One of the lackings of using a yum/pxe/kickstart environment (of which I'm rather familiar with, currently managing 6 customers with a similar environment) is that there's no "already there" configuration/versioning management system. That was one of the key points of redhat, the fact that I can do at-will repurposing/reprovisioning (like turning a 100 server 30/70 app db server/app server environment into a 70/30 app/db server environment in 5 minutes without kickstarting and zero manual interaction).. In the end, it's probably just an apples/oranges comparison.. in a science lab/school cluster environment, it's probably more a more valuable place to use a more manual process because grad students are cheap, and interns are free.. :) In a corporate world, the $28k i'd spend for a 100 server environment to save a sysadmin's worth of time, pays for itself 10 fold in terms of environment consistency.. Either way, I'm not trying to evangelize, just relate my own experiences, and try to find the best solution for a given problem. What tools out there are good for this type of a situation, then? Thanks for the refs to werewulf, I'm checking it out now. > > On Wed, 13 Oct 2004, Michael T. Halligan wrote: > >> Has anybody used (or tried to use) the RHN system as a HPC management >> tool. I've implemented this >> successfully in a 100 host environment for a customer of mine, and am in >> the process of >> re-architecting an infrastructure with about 150 nodes.. That's about as >> far as I've gotten >> with it. Once I get past the cost, the poor documentation, and "OK" >> support, I'm finding >> that it's actually a great (though slightly immature) piece of software >> for the enterprise. The ease of keeping >> an infrastructure in sync, and tthe lowered workload for sysadmins > > > > I can only say "why bother". Everything it does can be done easier, > faster, and better with PXE/kickstart for the base install followed by > yum for fine tuning the install, updates and maintenance (all totally > automagical). Yum is in RHEL, is fully GPL, is well documented, has a > mailing list providing the active support of LOTS of users as well as > the developers/maintainers, and is free as in air. Oh, and it works > EQUALLY well with Centos, SuSE, Fedora Core 2, and other RPM-based > distros, and is in wide use in clusters (and LANs) across the country. > > With PXE/kickstart/yum, you just build and test a kickstart file for the > basic node install (necessary in any event), bootstrap the install over > the net via PXE, and then forget the node altogether. yum automagically > handles updates, and can also manage things like distributed installs > and locking a node to a common specified set of packages. It manages > all dependencies for you so that things work properly. > > It takes me ten minutes to install ten nodes, mostly because I like to > watch the install start before moving on to handle the rare install that > is interrupted for some reason (e.g. a faulty network connection). One > can do a lot more than this much faster if you control the boot strictly > from PXE so you don't even need to interact with the node on the console > at all. How much better than that can you do? > > Alternatively, there are things like warewulf and scyld where even > commercial solutions probably won't work out to be much more (if any > more) expensive. Especially when you add in the cost of those two > "beefy boxes acting as RHN servers". What a waste! 
We use a single > repository to manage installs and updates for our entire campus (close > to 1000 systems just in clusters, plus that many more in LANs and on > personal desktops). And the server isn't terribly beefy -- it is > actually a castoff desktop being pressed into extended service, although > we finally have plans to put a REAL server in pretty soon. > > I mean, what kind of load does a cluster node generally PLACE on a > repository server after the original install? Try "none" and you'd be > really close to the truth -- an average of a single package a week > updated is probably too high an estimate, and that consumes (let's see) > something like 1 network-second of capacity between server and node a > week with plain old 100BT. > > There are solutions that are designed to be scalable and easy to > understand and maintain, and then there are solutions designed to be > topdown manageable with a nifty GUI (and sell a lot of totally unneeded > resources at the same time). Guess which one RHN falls under. > > > Flamingly yours (not at you, but at RHN) > > rgb > >> >> At 100 nodes, the pricing seems to be about $274/year per node including >> licensing, entitlements, and the >> software cost of a RHN server (add another $5k-$7k for a pair of beefy >> boxes to act as the >> RHN server.. though as far as I can tell, redhat's specs on the RHN >> server are far exagerrated.. I >> could get by with $2500 worth of servers on that end for the >> environments I've deployed on). So, in the >> end, $28k/year for an enterprise of 100 servers, in one environment has >> meant being able to shrink the >> next year staffing needs by 2 people, and in one by one person, it pays >> for itself.. >> >> We have a 512 node render farm project we're bidding on for a new >> customer, and I'm wondering how those in the >> beowulf community who have used RHN satellite server perceive it. So far >> we're considering LFS and Enfusion, >> which are both more HPC oriented, but I'm really enjoying RHN as a >> management system. >> >> ---------------- >> BitPusher, LLC >> http://www.bitpusher.com/ >> 1.888.9PUSHER >> (415) 724.7998 - Mobile >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > -- > Robert G. Brown http://www.phy.duke.edu/~rgb/ > Duke University Dept. of Physics, Box 90305 > Durham, N.C. 27708-0305 > Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu > > > > ------------------- BitPusher, LLC http://www.bitpusher.com/ 1.888.9PUSHER (415) 724.7998 - Mobile From michael at halligan.org Thu Oct 14 11:58:46 2004 From: michael at halligan.org (Michael T. Halligan) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] RedHat Satellite Server as a cluster management tool. In-Reply-To: <1097778641.22262.27.camel@pel> References: <416E10D9.70006@halligan.org> <1097778641.22262.27.camel@pel> Message-ID: <53784.66.150.251.142.1097780326.squirrel@mail3.bitpusher.com> > As the only sysadmin for a 260-node cluster, I'm extremely curious what > jobs those 2 people were supposed to be doing. I have an operations > staff to rely on for some environmental stuff and for handling service > calls with vendors (I report the problem to them and do the hw > replacement, they just take care of the phone call). 
However, even with > 260 nodes I still find a lot of my time spent in trying to improve the > cluster as opposed to just keeping it running. Well, this is probably an apples to oranges comparison.. I've worked in environments where I was the only systems administrator, and ran 500 servers on my own.. It's rather trivial to administer a real cluster, where there's only one or two functions for the entire thing.. It's exponentially more work to keep good process in terms of consistency, configuration management, version control, patch management, and the general overall health in a non-cluster environment where you might have 100 servers, in groups of 2 or 4 servers per function, and maybe even several one-off servers. This is my first forray into building a single-function cluster in several years, and I'm trying to determine if tried & true enterprise management techniques can be a value or a detriment in a beowulf environment, or at least figure out which concepts carry over, which are superfluous, and which just aren't applicable. ------------------- BitPusher, LLC http://www.bitpusher.com/ 1.888.9PUSHER (415) 724.7998 - Mobile From tmattox at gmail.com Thu Oct 14 19:56:44 2004 From: tmattox at gmail.com (Tim Mattox) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] RedHat Satellite Server as a cluster management tool. In-Reply-To: <53761.66.150.251.142.1097780055.squirrel@mail3.bitpusher.com> References: <416E10D9.70006@halligan.org> <53761.66.150.251.142.1097780055.squirrel@mail3.bitpusher.com> Message-ID: Hello Michael, I'm one of the co-developers of Warewulf. (http://warewulf-cluster.org/) We try to make it as sysadmin friendly as we can. If you haven't seen it yet, check out the README file inside the RPM (it gets put in /usr/share/doc/warewulf...) It explains the philosophy behind Warewulf's development. If you have any questions about Warewulf, feel free to post to it's mailing list. I also follow the Beowulf mailing list, but not daily. I admit Warewulf's documentation can be lacking (or there, but talking about a previous version), but once you get into it a bit, the system makes quite a lot of sense... mostly ;-) For your specific task you describe, I would think Warewulf would work well for you. It's not perfect, but we eat our own dogfood, and this is the best tasting "dogfood" I've used for cluster management. ;-) Managing a cluster with Warewulf is kind of like sysadmining less than two machines... the boot server, and then a "virtual node"... which is just a chroot on the boot server. If your cluster is heterogeneous, you can set up more than one VNFS (virtual node file system). And I can't pass up commenting about the costs for "per node" software... I grimace at anything where the cost of the software has ANY non-zero multiple related to the number of nodes. Why? The hardware costs in the cluster's I've helped build tend to be far under $1k per node, and usually under $500 per node. RHN is just not an option for that kind of cluster. Anyway, good luck choosing a cluster management tool for your setup. The ones rgb mentioned are all worth considering. -- Tim Mattox - tmattox@gmail.com - http://homepage.mac.com/tmattox/ From llwaeva at 21cn.com Fri Oct 15 11:38:45 2004 From: llwaeva at 21cn.com (llwaeva@21cn.com) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] about managment Message-ID: <20041016022741.B9B6.LLWAEVA@21cn.com> Hi all, I am running 8-node LAM/MPI parallel computing system. 
I have found it troublesome to maintain the user accounts and software distribution on all the nodes. For example, whenever I install new software, I have to repeat the job 8 times! The most annoying thing is that the configuration or management of the user accounts over the network is a heavy job. Someone suggested that I should utilize NFS and NIS. However, in my case, it's difficult to have an additional computer as a server. Would anyone please share your experience in maintaining the beowulf cluster? Thanks in advance. From rgb at phy.duke.edu Fri Oct 15 15:55:25 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] about managment In-Reply-To: <20041016022741.B9B6.LLWAEVA@21cn.com> References: <20041016022741.B9B6.LLWAEVA@21cn.com> Message-ID: On Sat, 16 Oct 2004 llwaeva@21cn.com wrote: > Hi all, > I am running 8-node LAM/MPI parallel computing system. I found it's > trouble to maintain the user accounts and software distribution on all > the nodes. For example, whenever I install a new software , I have to > repeat the job 8 times! The most annoying thing is that the > configuration or managment of the user accounts over the network is a > heavy job. Someone suggests that I should utilize NFS and NIS. However, > in my case, it's difficult to have an additional computer as a server. > Would anyone please share your experience in maintaining the beowulf > cluster? You don't need an additional computer as a server to run NIS -- just use your head node as an NIS server. Ditto crossmounting disk space with NFS -- just export a partition (it doesn't have to be huge, just big enough to hold your user home directories and workspace) to all the nodes. Remember, NIS and NFS were in use some twenty years ago -- when I first started managing Unix systems a "server" supporting dozens of user accounts might have one or two hundred MEGABYTES of disk in exported directories (an amount of disk that cost many thousands of dollars) and might deliver a whopping 4 MIPS of performance, and yet the server would still be useable for NIS, NFS and even some modest amounts of computation in its relatively miniscule memory. For a mini-cluster with only eight nodes, serving NIS and NFS won't even warm it up, and the smallest disks being sold today are some 60 GB in size -- an amount that even five or six years ago would have constituted a departmental server's collective store (and that server might have served 50 or 100 workstations, 100 or so users, and managed it with a processor ten to twenty times slower). There are also numerous alternative solutions, if setting up servers is something you don't know how to do or don't want to do for other reasons. You could use rsync to synchronize user accounts across nodes. This works well (and even yields a performance advantage) if they change slowly, but will be a pain if they change a lot (the advantage of NIS is that it pretty much completely automates this after a bit of work setting it up originally). You can and should use tools like the ones that were just discussed, e.g. kickstart and yum, to automate installation and maintenance. Finally, look into the various "cluster distributions" that do it all for you, notably ROCKS and warewulf. Be aware, though, that they are pretty likely to use things like NFS and NIS as (possibly optional) components of their solutions. [BTW, y'all are just spiled, spiled rotten.
Why, back in the OLD days geeks were geeks and had to slam massive amounts of cola to get real work done on networks, CPUs, memory that my PDA has beat hands down today. Here you put together a networked cluster the least component of which would have been "inconceivably" powerful (in every dimension) three decades ago, which is more powerful all by itself than the first twenty or so beowulfs ever built, and you can't find a "server"...;-)] rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From tmattox at gmail.com Fri Oct 15 16:00:29 2004 From: tmattox at gmail.com (Tim Mattox) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] about managment In-Reply-To: <20041016022741.B9B6.LLWAEVA@21cn.com> References: <20041016022741.B9B6.LLWAEVA@21cn.com> Message-ID: You should look at any of a variety of cluster management software packages. Some are free, some are commercial. Here is a short list that I can name off the top of my head: Rocks, Clustermatic, Oscar, Warewulf, Scyld and others. I'm a co-developer of Warewulf, a free one that has a fairly unique approach to the problem. You can find out more on it's website: http://warewulf-cluster.org/ The short version is that Warewulf builds a ramdisk image that it uses to network boot the nodes. The ramdisk is built from a Virtual Node File System (VNFS) that is maintained on the boot server. You can use pretty much any RPM based Linux distribution for the boot server and the VNFS. With this approach, the nodes get a fresh filesystem at boot time, without any worries about version or package creep. Adding or upgrading programs and changing the list of users is very easy. Good luck. On Sat, 16 Oct 2004 02:38:45 +0800, llwaeva@21cn.com wrote: > Hi all, > I am running 8-node LAM/MPI parallel computing system. I found it's > trouble to maintain the user accounts and software distribution on all > the nodes. For example, whenever I install a new software , I have to > repeat the job 8 times! The most annoying thing is that the > configuration or managment of the user accounts over the network is a > heavy job. Someone suggests that I should utilize NFS and NIS. However, > in my case, it's difficult to have an additional computer as a server. > Would anyone please share your experience in maintaining the beowulf > cluster? > > Thanks in advance. -- Tim Mattox - tmattox@gmail.com - http://homepage.mac.com/tmattox/ From andrewxwang at yahoo.com.tw Fri Oct 15 19:59:46 2004 From: andrewxwang at yahoo.com.tw (Andrew Wang) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] Grid Engine question In-Reply-To: <1097612732.18951.14.camel@pel> Message-ID: <20041016025946.91468.qmail@web18010.mail.tpe.yahoo.com> They call it "Resource Reservation". For a list of new features, see this paper: http://www.sun.com/products-n-solutions/edu/whitepapers/pdf/N1GridEngine6.pdf Andrew. --- Sean Dilda ªº°T®§¡G > In SGE 6.0 they added a feature they call 'advanced > reservations'. Its > not really advanced, and its not what I consider > 'reservations' to be, > but it is exactly what you want. When reservations > are enabled on the > cluster, and the job is submitted with '-R y', the > mutli-processor job > will be able to 'hold' available resources until it > has enough to run, > and thus keep lower priority jobs from using them. > > However, to do this you need to upgrade to at least > version 6.0. 
> However, 6.0 also has cluster queues which I find > makes administration > much easier (it allows you to create one queue setup > and assign it to > multiple hosts instead of doing a separate setup for > each compute host). > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or > unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > ----------------------------------------------------------------- Yahoo!©_¼¯¹q¤l«H½c 100MB §K¶O«H½c¡A¹q¤l«H½c·s¬ö¤¸±q³o¶}©l¡I http://mail.yahoo.com.tw/ From andrewxwang at yahoo.com.tw Fri Oct 15 20:17:41 2004 From: andrewxwang at yahoo.com.tw (Andrew Wang) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] 64bit comparisons In-Reply-To: <436A69503F99214D9EBC27415F66C40C3288E1@blums0008.sd.gd.com> Message-ID: <20041016031741.31345.qmail@web18007.mail.tpe.yahoo.com> I believe you can get more info from the following mailing lists: "hpc", "scitech", "xgrid-users" see: http://lists.apple.com/mailman/listinfo Also, people on those lists (if I remember correctly) use LAM-MPI, GridEngine, and also the IBM xlc/xlf compilers (XL compilers generate faster code for the G5). Andrew. --- "Hujsak, Jonathan T (US SSA)" ªº°T®§¡G > Have you gained any new 'lessons learned' since the > communication > below? Can you recommend a good version of MPI to > use for these? > > We've been looking at MPICH, MPIPro and also the > Apple xgrid... > > > > Thanks! > > > > Jonathan Hujsak > > BAE Systems > > San Diego > > > > Bill Broadley bill at cse.ucdavis.edu > s&In-Reply-To=200405141644.i4EGi1Aq023213%40marvin.ibest.uidaho.edu> > > Fri May 14 11:48:21 PDT 2004 > > * Previous message: [Beowulf] 64bit comparisons > > > * Next message: [Beowulf] 64bit comparisons > > > * Messages sorted by: [ date ] > > [ > thread ] > > [ > subject ] > > [ author ] > > > > _____ > > On Fri, May 14, 2004 at 09:44:01AM -0700, Robert B > Heckendorn wrote: > > One of the options we are strongly considering for > our next cluster is > > going with Apple X-servers. There performance is > purported to be good > > Careful to benchmark both processors at the same > time if that is your > intended usage pattern. Are the dual-g5's shipping > yet? Last I heard > yield problems were resulting in only uniprocessor > shipments. My main > concern that despite the marketing blurb of 2 > 10GB/sec CPU interfaces > or similar that there is a shared 6.4 GB/sec memory > bus. > > > and their power consumption is small. > > Has anyone measured a dual g5 xserv with a > kill-a-watt or similar? > > > Can people comment on any comparisons betwee Apple > and (Athlon64 > > or Opteron)? > > Personally I've had problems, I need to spend more > time resolving them, > things like: > * Need to tweak /etc/rc to allow Mpich to use > shared memory > * Latency between two mpich processes on the > same node is 10-20 > times the > linux latency. I've yet to try LAM. > * Differences in semaphores requires a rewrite for > some linux code I > had > * Difference in the IBM fortran compiler required > a rewrite compared > to code > that ran on Intel's, portland group's, and GNU's > fortran compiler. > > > Given all that I'm still interested to see what the > G5 is good at and > under > what workloads the G5 wins perf/price or perf/watt. 
> > -- > Bill Broadley > Computational Science and Engineering > UC Davis > > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or > unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > ----------------------------------------------------------------- Yahoo!©_¼¯¹q¤l«H½c 100MB §K¶O«H½c¡A¹q¤l«H½c·s¬ö¤¸±q³o¶}©l¡I http://mail.yahoo.com.tw/ From john.hearns at clustervision.com Fri Oct 15 21:52:10 2004 From: john.hearns at clustervision.com (John Hearns) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] about managment In-Reply-To: <20041016022741.B9B6.LLWAEVA@21cn.com> References: <20041016022741.B9B6.LLWAEVA@21cn.com> Message-ID: <1097902329.6223.945.camel@vigor12> On Fri, 2004-10-15 at 19:38, llwaeva@21cn.com wrote: > Hi all, > I am running 8-node LAM/MPI parallel computing system. I found it's > trouble to maintain the user accounts and software distribution on all > the nodes. For example, whenever I install a new software , I have to > repeat the job 8 times! There are several answers to this question, which you can learn about by staying on this group, and consulting online resources. A quick answer is that you could construct the cluster using one of the toolkits, such as Rocks or Warewulf - many others. And a very quick answer to your current dilemma. There are utilities which allow parallel execution of commands on a set of machines, or even to have a terminal session in parallel across a set of machines. Once you have a server (below) you can rsync each node to that. > The most annoying thing is that the > configuration or managment of the user accounts over the network is a > heavy job. Someone suggests that I should utilize NFS and NIS. However, > in my case, it's difficult to have an additional computer as a server. Not meaning to be rude, but you are wrong there. Just use one of your compute nodes as the server. The additional CPU load will not be great. You should use some sort of centralised account management NIS or LDAP. Even if you point blank refuse to do that, a cron job to rsync the relevant files will help cut down your admin load. And remember - eight machines may not seem a lot. But what happens if you make a mistake on one machine, or one machine is down when you are adding an account or software. Are you sure to run identical commands by hand the next time it is up? From eugen at leitl.org Sat Oct 16 11:45:31 2004 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] InfiniBand Drivers Released for Xserve G5 Clusters (fwd from brian-slashdotnews@hyperreal.org) Message-ID: <20041016184531.GF1457@leitl.org> ----- Forwarded message from brian-slashdotnews@hyperreal.org ----- From: brian-slashdotnews@hyperreal.org Date: 16 Oct 2004 01:26:01 -0000 To: slashdotnews@hyperreal.org Subject: InfiniBand Drivers Released for Xserve G5 Clusters User-Agent: SlashdotNewsScooper/0.0.3 Link: http://slashdot.org/article.pl?sid=04/10/15/2135211 Posted by: pudge, on 2004-10-15 23:30:00 from the insert-grunting-noise-here dept. A user writes, "A company called [1]Small Tree just [2]announced the release of InfiniBand drivers for the Mac, for more supercomputing speed. People have already been making supercomputer clusters for the Mac, including Virginia Tech's [3]third-fastest supercomputer in the world, but InfiniBand is supposed to make the latency drop. A lot. 
[4]Voltaire also makes some sort of Apple InfiniBand products, though it's not clear whether they make the drivers or hardware." IFRAME: [5]pos6 References 1. http://www.small-tree.com/ 2. http://www.wistechnology.com/article.php?id=1255 3. http://www.macobserver.com/article/2003/11/17.1.shtml 4. http://www.voltaire.com/apple.html 5. http://ads.osdn.com/?ad_id=2936&alloc_id=10685&site_id=1&request_id=2846371 ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041016/ff36ac07/attachment.bin From atp at piskorski.com Sat Oct 16 12:01:34 2004 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] s_update() missing from AFAPI ? Message-ID: <20041016190133.GA42332@piskorski.com> The old 1997 paper by Dietz, Mattox, and Krishnamurthy, "The Aggregate Function API: It's Not Just For PAPERS Anymore", briefly mentions that their AFAPI library also supports, "fully coherent, polyatomic, replicated shared memory". It even gives a little chart showing how many microseconds their s_update() function takes to update that shared memory. That sounds interesting (even given the extremely low bandwith of the PAPERS hardware, etc.), but, no such function exists in the last 1999-12-22 AFAPI release! s_update() just isn't in there at all. Why? Tim M., I know you follow the Beowulf list, so could you fill us in a bit on what what happened there? http://aggregate.org/TechPub/lcpc97.html http://aggregate.org/AFAPI/AFAPI_19991222.tgz While I'm at it I might as well ask this too: That same old PAPERS papers says "UDPAPERS", using Ethernet and UDP, was implemented, but it doesn't seem to be in the AFAPI release either. What happened with that? Did it work? As well as the custom PAPERS hardware? If so, how? Dirt cheap 10/100 cards and UTP cable would certainly be a lot more convenient than custom PAPERS hardware for anyone wanting to experiment with the AFAPI stuff, but I'm confused about what part of the ethernet network could be magically made to act as the NAND gate for the aggregate operations. Did it need to use some particular programmable ethernet switch? Or the aggregate operations were actually done on each of the nodes? -- Andrew Piskorski http://www.piskorski.com/ From hahn at physics.mcmaster.ca Sat Oct 16 14:24:01 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] InfiniBand Drivers Released for Xserve G5 Clusters (fwd from brian-slashdotnews@hyperreal.org) In-Reply-To: <20041016184531.GF1457@leitl.org> Message-ID: > world, but InfiniBand is supposed to make the latency drop. A lot. sigh. small-tree claims 6.13 us, which is certainly not exceptional latency these days. for instance, there are three vendors who are shipping <2 us MPI today. maybe I'm just being extra-surly, but if you crow too much about a non-novel accomplishment, you look pretty silly to anyone in the field... From hahn at physics.mcmaster.ca Sat Oct 16 14:36:01 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] bandwidth: who needs it? 
Message-ID: do you have applications that are pushing the limits of MPI bandwidth? for instance, code that actually comes close to using the 8-900 MB/s that current high-end interconnect provides? we have a fairly wide variety of codes inside SHARCnet, but I haven't found anyone who is even complaining about our last-generation fabric (quadrics elan3, around 250 MB/s). is it just that we don't have the right researchers? I've heard people mutter about earthquake researchers being able to pin a 800 MB/s network, and claims that big FFT folk can do so as well. by contrast, many people claim to notice improvements in latency from old/mundane (6-7 us) to new/good (<2 us). I'd be interested in hearing about applications you know of which are very sensitive to having large bandwidth (say, .8 GB/s today). thanks, mark hahn. From tmattox at gmail.com Sat Oct 16 15:15:14 2004 From: tmattox at gmail.com (Tim Mattox) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] s_update() missing from AFAPI ? In-Reply-To: <20041016190133.GA42332@piskorski.com> References: <20041016190133.GA42332@piskorski.com> Message-ID: Hello Andrew (and the Beowulf list as well), You ask some very good questions, and you asked the person who should know the answers. Hopefully my answers below make sense. On Sat, 16 Oct 2004 15:01:34 -0400, Andrew Piskorski wrote: > The old 1997 paper by Dietz, Mattox, and > Krishnamurthy, "The Aggregate Function API: It's > Not Just For PAPERS Anymore", briefly mentions > that their AFAPI library also supports, "fully > coherent, polyatomic, replicated shared memory". > It even gives a little chart showing how many > microseconds their s_update() function takes to > update that shared memory. > > That sounds interesting (even given the extremely > low bandwith of the PAPERS hardware, etc.), but, > no such function exists in the last 1999-12-22 > AFAPI release! s_update() just isn't in there at > all. Why? Tim M., I know you follow the Beowulf > list, so could you fill us in a bit on what what > happened there? The s_update() function went away because we changed the underlying implementation of the "asyncronous" s_ routines. The new approach had a hardware limit of 3 "fast" signals, and we deemed that it was best to not hard code any of those for this rarely used shared memory functionality. We had intended to supply a routine that replaced the functionality of s_update() that you could install as one of the 3 signal handlers if you chose to. I'm not sure why that code wasn't released. But, over time, it became a moot point, since the speed of the processors improved so much, that the busy-wait/polling scheme we were using for the s_ routines made it very difficult to get any speedup using the equivalent of the s_update() routine. With the parallel port not actively causing an interrupt, all the nodes had to poll for pending s_ operations. Going from the 486 to the Pentium was a dramatic change on the relative overheads for this polling operation and general computations. Basically, the Pentium and later processors were slowed down so dramatically whenever you would do a single IO space read (the polling function) to see if any pending shared memory operations needed to be dealt with, that it was difficult to get any speedup, even with only two processors. On the testing codes I wrote at the time, it was hard to find the right balance for how frequently to poll. If you polled too frequently, the Pentium was slowed down to a crawl on purely local operations. 
We speculated that the IO instruction caused a flush of the Pentium's pipeline, but we didn't explore it to great detail. Also, if you polled too infrequently, the shared memory operations were stalled for long periods of time, causing the other processor(s) to sit idle waiting to get their shared memory writes processed. Yes, the performance numbers in the LCPC 1997 paper are measured on a 4 node Pentium cluster, but I don't think we had time yet to play with "real" codes that used the s_update routine on a Pentium cluster. That was a long time ago, so I might not be remembering this part very well. But I do remember that once we had more time to play with it on Pentiums, it was clear that no performance critical codes would be using the s_update routine, much less any of the s_ routines as far as we could tell. So, that is why the s_update routine was pulled from the library, to free up the signaling slot for potentially more useful things. > http://aggregate.org/TechPub/lcpc97.html > http://aggregate.org/AFAPI/AFAPI_19991222.tgz > > While I'm at it I might as well ask this too: > That same old PAPERS papers says "UDPAPERS", using > Ethernet and UDP, was implemented, but it doesn't > seem to be in the AFAPI release either. What > happened with that? The UDPAPERS code was being worked on by a colleague of mine for his parallel file system work, and unfortunately for the rest of us, he only implemented the minimum amount of functionality that he needed for his project, not the full AFAPI. Back in 1999 I had hoped to have time to finish it off myself, but it wasn't my top priority, and if you have followed our work, the KLAT2 cluster in the spring of 2000 brought in some much more interesting new ideas with the FNN stuff. > Did it work? Yes, to some degree, but there were still some important corner cases (certain packet loss scenarios) that hadn't been dealt with, and as I said, the full AFAPI wasn't implemented, just a few basic routines. > As well as the custom PAPERS hardware? No, not as well as the custom hardware. Speaking of which: The custom PAPERS hardware has had some additional work since we last published on it. But due to changing priorities, it has been sitting waiting for the next bright student or two to revive it for more modern IO ports (USB, Firewire, ???). You can see the last parts list and board layouts here: http://aggregate.org/AFN/000601/ Unfortunately, the assembly documentation for that board was never written. It's a "small change" from the PAPERS 960801 board, but enough that if you don't know what each thing is intended for, you might not get it right. That's why we haven't posted a public link to the 000601 board design (until now). We almost made a 12 port version of the PCB, but again, the student involved on that finished their project, and the design hasn't been validated, so it's not been sent out to a PCB fab to be built. As a group we decided it would be better to find students interested in doing a new design that used more modern IO ports than the parallel printer port. Know anyone interested in a Masters project were they have to build hardware that actually works? ;-) Academically, it's hard to make such a thing be for a Ph.D. due to the fact that it's mostly just "implementation/development" at this point, with little "academic" research. > If so, how? 
Dirt cheap 10/100 cards and UTP cable > would certainly be a lot more convenient than > custom PAPERS hardware for anyone wanting to > experiment with the AFAPI stuff, but I'm confused > about what part of the ethernet network could be > magically made to act as the NAND gate for the > aggregate operations. Yep, no NAND gate in the ethernet... > Did it need to use some particular programmable > ethernet switch? Or the aggregate operations > were actually done on each of the nodes? Yeah, the aggregate operations were actually performed within each node on local copies of the data from all the nodes. The basic idea was to have each node send its new data along with all the known data from anyone else for the current (and previous) operation with a UDP broadcast/multicast. Just this semester we finally have a new student working on a UDP/Multicast implementation of AFAPI... or something like it. They are just now getting up to speed on things, so don't hold your breath. Also, it's unlikely we would actually target a new AFAPI release. With the dominance of MPI, it would only make sense to build such a thing for use as a module for LAM-MPI or the new OpenMPI. I hope this answers your questions, but if not, feel free to ask more. I am busy with my own FNN dissertation work now (plus Warewulf), so I won't be working on AFN/AFAPI/PAPERS stuff to any degree until my Ph.D. is finished. -- Tim Mattox - tmattox@gmail.com http://homepage.mac.com/tmattox/ From atp at piskorski.com Sat Oct 16 19:20:47 2004 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] Re: s_update() missing from AFAPI ? In-Reply-To: References: <20041016190133.GA42332@piskorski.com> Message-ID: <20041017022047.GA44676@piskorski.com> On Sat, Oct 16, 2004 at 06:15:14PM -0400, Tim Mattox wrote: > I hope this answers your questions, but if not, feel free to ask > more. I am busy with my own FNN dissertation work now (plus > Warewulf), so I won't be working on AFN/AFAPI/PAPERS stuff to any > degree until my Ph.D. is finished. Actually, that was excellent, seemed to fill in most of the important PAPERS-related holes in my basic background knowledge. Thanks! -- Andrew Piskorski http://www.piskorski.com/ From agrajag at dragaera.net Sun Oct 17 04:42:38 2004 From: agrajag at dragaera.net (Sean Dilda) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] about managment In-Reply-To: <20041016022741.B9B6.LLWAEVA@21cn.com> References: <20041016022741.B9B6.LLWAEVA@21cn.com> Message-ID: <1098013358.4390.44.camel@pel> On Fri, 2004-10-15 at 14:38, llwaeva@21cn.com wrote: > Hi all, > I am running 8-node LAM/MPI parallel computing system. I found it's > trouble to maintain the user accounts and software distribution on all > the nodes. For example, whenever I install a new software , I have to > repeat the job 8 times! The most annoying thing is that the > configuration or managment of the user accounts over the network is a > heavy job. Someone suggests that I should utilize NFS and NIS. However, > in my case, it's difficult to have an additional computer as a server. > Would anyone please share your experience in maintaining the beowulf > cluster? My cluster relies heavily on NIS and NFS. NIS is used to share user login information so that I only have to make an account once, then create the NIS maps. (I actually have pam configured to authenticate off of the campus kerberos server on the head nodes, and using ssh's host-based authentication across the cluster) We also use NFS for users home directories. 
However, I make a point of trying to package up third party software that's used and install the rpms on all of the machines in the cluster. As for an extra server, as RGB pointed out, NFS and NIS can easily run on the same box. Do you currently have a scheduler? If so, you can run NIS/NFS on that box. If nothing else, you can just run them on whatever box people use to login to the cluster. The de facto standard for small clusters is to have a single head node that users login to to launch jobs, servers out home directories and possibly NIS, as well as any scheduler you might have. From gary at sharcnet.ca Mon Oct 18 05:59:01 2004 From: gary at sharcnet.ca (Gary Molenkamp) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] about managment In-Reply-To: <1097902329.6223.945.camel@vigor12> Message-ID: On Sat, 16 Oct 2004, John Hearns wrote: > On Fri, 2004-10-15 at 19:38, llwaeva@21cn.com wrote: > > Hi all, > > I am running 8-node LAM/MPI parallel computing system. I found it's > > trouble to maintain the user accounts and software distribution on all > > the nodes. For example, whenever I install a new software , I have to > > repeat the job 8 times! > There are several answers to this question, > which you can learn about by staying on this group, and consulting > online resources. > > A quick answer is that you could construct the cluster using one of the > toolkits, such as Rocks or Warewulf - many others. > > And a very quick answer to your current dilemma. There are utilities > which allow parallel execution of commands on a set of machines, > or even to have a terminal session in parallel across a set of machines. > > Once you have a server (below) you can rsync each node to that. For software distribution, I use systemimager from the Systeminstaller Suite. It simplifies using rsync for managing images of nodes, and works well even across cluster (I have 4 clusters running off one server, that is also the master node of a cluster). > > The most annoying thing is that the > > configuration or managment of the user accounts over the network is a > > heavy job. Someone suggests that I should utilize NFS and NIS. However, > > in my case, it's difficult to have an additional computer as a server. > Not meaning to be rude, but you are wrong there. > Just use one of your compute nodes as the server. The additional CPU > load will not be great. > You should use some sort of centralised account management NIS or LDAP. I've recently deployed LDAP at SHARCNET and it really simplifies the account management process. I still nfs mount home accounts, but I used to rcp the passwd,shadow, and group files around. This made it difficult for users to maintain there account info, and had a long delay to propigate to 200+ busy machines. > Even if you point blank refuse to do that, a cron job to rsync the > relevant files will help cut down your admin load. > > And remember - eight machines may not seem a lot. But what happens if > you make a mistake on one machine, or one machine is down when you are > adding an account or software. Are you sure to run identical commands by > hand the next time it is up? 
> > > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Gary Molenkamp SHARCNET Systems Administrator University of Western Ontario gary@sharcnet.ca http://www.sharcnet.ca (519) 661-2111 x88429 (519) 661-4000 From kus at free.net Mon Oct 18 09:10:42 2004 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] Mellanox IB problem: xp0 module ? In-Reply-To: Message-ID: Dear colleagues ! I've some problem w/Mellanox IBHPC-0.5.0 software inst (in particular, the absence of xp0 kernel module) on "standalone" node which isn't connected currently w/IB switch or other node w/IB device. I've installed Infiniband HCA (PCI-X Infinihost MT23108 low profile) to upgarde my interconnect from GigEth to IB. Tyan S2880/Opteron 242 under SuSE Linux 9.0 (2.4.21-243) is used as the node for this installation. It's official software platform supported by whole Mellanox IBHPC-0.5.0 software collection (it includes , in aprticular, THCA-3.2 driver). Software environment is "fixed" because of a set of binary applications requiremnets, so last IBHPC-1.6.0 looks as inappropriate for as. 1)After minor source modification (in mosal.c) the the IBHPC installation (INSTALL script) was finished successfully. IPoIB parameters setting was also performed in the frames of INSTALL script dialog. 2)But after finish of INSTALL and reboot I see that a) mst tools started successfully b) and I see then following boot messages: Setting up network interfaces : eth0 eth1 - both done ib0: modprobe: modprobe: can't locate module xp0 and ib0 interface is down (I should note that IB cable isn't connected to HCA really). But I may do ib0 "up" manually; in particular, /etc/init.d/network start put ib0 in "up" state. I didn't find xb0.o in /lib/modules/..., and in any Mellanox software rpm's also ! I don't know what do xp0 module and where I may found it :-( Any reccomendations/ideas are welcome ! (FYI: some IB things like FLINT verification are OK, and opensm & mst started successfully). 2) I configured IPoIB at IBHPC installation. (To try IBsNice) I issued vapi start after boot, and then I see in particular the message Loading mod_ib_mgt FAILED "Manual" modprobe mod_ib_mgt leads to the message init_module: device or resource busy If I run IBsNice.sh, then I receive the same message about mod_ib_mgt but IBsNice creates virtual eth2 , and ping to the IP of eth2 works normally. I'll be very appreciate if somebody clarify me this situation w/mod_ib_mgt. May be it's simple because of some misconfiguration of some IB software component ? (I didn't configure anythings after running of INSTALL script). Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow From bropers at cct.lsu.edu Mon Oct 18 11:29:28 2004 From: bropers at cct.lsu.edu (Brian D. Ropers-Huilman) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] about managment In-Reply-To: References: Message-ID: <41740B88.10400@cct.lsu.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Gary Molenkamp said the following on 2004-10-18 07:59: | I've recently deployed LDAP at SHARCNET and it really simplifies the | account management process. How do you allow users to change their passwords, shells, or GECOS information? - -- Brian D. Ropers-Huilman .:. Asst. Director .:. HPC and Computation Center for Computation & Technology (CCT) bropers@cct.lsu.edu Johnston Hall, Rm. 
350 +1 225.578.3272 (V) Louisiana State University +1 225.578.5362 (F) Baton Rouge, LA 70803-1900 USA http://www.cct.lsu.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.6 (GNU/Linux) iD8DBQFBdAuIwRr6eFHB5lgRAiCSAKCa+75nNOWGFDxmZV/hGfb0wW85yQCguMo0 v8F2Mp3kzpCvK4dYBw1SUWk= =iAHb -----END PGP SIGNATURE----- From cjoung at tpg.com.au Mon Oct 18 19:14:53 2004 From: cjoung at tpg.com.au (cjoung@tpg.com.au) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] MPI & ScaLAPACK: error in MPI_Comm_size: Invalid communicator Message-ID: <1098152093.4174789d863ce@postoffice.tpg.com.au> Hi, I was hoping someone could help me with a F77,MPI & ScaLAPACK problem. Basically, I have a problem making the Scalapack libraries work in my program. Programs with MPI-only calls work fine, e.g. the "pi.f" MPI program that comes with the MPI installation works fine (the one that predicts pi), as do other examples I've gotten from books & simple ones I've written myself, but whenever I try an example with scalapack & blacs calls, it falls over with the same error message (which I can't decipher). If you can help, then I have a more detailed account of whats going on below, Any advice would be most gratefully appreciated, Clint Joung Postdoctoral Research Associate, Department of Chemical Engineering University of SYdney, NSW 2006 Australia ************************************************************** I'm just learning parallel programming. The netlib scalapack website has an example program called 'example1.f' It uses a scalapack subroutine PSGESV to solve the standard matrix equation [A]x=b, and return the answer, vector x. It seemed to compile ok, but on running, I got some error messages. So I systematically stripped down 'example1.f' in stages, recompling & running each time, trying to achieve a working program, eliminating potential bugs & rebuild it from there. Eventually I got down to the following emaciated F77 program (see below). All it does now is initialize a 2x3 process grid, then release it - thats all. 
****example2.f******************************************* program example2 integer ictxt,mycol,myrow,npcol,nprow nprow=2 nocol=3 call SL_INIT(ictxt,nprow,npcol) call BLACS_EXIT(0) STOP END ********************************************************* Yet, it still doesn't work!, the following is the output when I try to compile and run it, ********************************************************* [tony@carmine clint]$ mpif77 -o example2 example2.f -L/opt/intel/mkl70cluster/lib/32 -lmkl_scalapack -lmkl_blacsF77init -lmkl_blacs -lmkl_blacsF77init -lmkl_lapack -lmkl_ia32 -lguide -lpthread -static-libcxa [tony@carmine clint]$ mpirun -n 6 ./example2 aborting job: Fatal error in MPI_Comm_size: Invalid communicator, error stack: MPI_Comm_size(82): MPI_Comm_size(comm=0x5b, size=0x80d807c) failed MPI_Comm_size(66): Null Comm pointer aborting job: Fatal error in MPI_Comm_size: Invalid communicator, error stack: MPI_Comm_size(82): MPI_Comm_size(comm=0x5b, size=0x80d807c) failed MPI_Comm_size(66): Null Comm pointer rank 5 in job 17 carmine.soprano.org_32782 caused collective abort of all ranks exit status of rank 5: return code 13 aborting job: Fatal error in MPI_Comm_size: Invalid communicator, error stack: MPI_Comm_size(82): MPI_Comm_size(comm=0x5b, size=0x80d807c) failed MPI_Comm_size(66): Null Comm pointer rank 1 in job 17 carmine.soprano.org_32782 caused collective abort of all ranks exit status of rank 1: return code 13 rank 0 in job 17 carmine.soprano.org_32782 caused collective abort of all ranks exit status of rank 0: return code 13 [tony@carmine clint]$ ********************************************************* ..so apparently somethings wrong with MPI_Comm_size, but beyond that, I can't figure it out. My system details: * I am running this on a '1 node' cluster - i.e. my notebook. (just to prototype before I run on a proper cluster) * O/S: Redhat Fedora Core 1, Kernel 2.4.22 * Compiler: Intel Fortran Compiler for linux 8.0 * MPI: MPICH2 ver 0.971 (was compiled with the ifort compiler, so it should work ok with the ifort compiler) * The Scalapack, blacs, blas and lapack come from the Intel Cluster Maths Kernel Library for Linux 7.0 If you know how to fix this problem, I'd appreciate to hear from you. Please consider me a NOVICE with all three - linux, MPI and Scalapack. The simpler the explanation, the better! with thanks, clint joung From cjoung at tpg.com.au Mon Oct 18 21:59:55 2004 From: cjoung at tpg.com.au (cjoung@tpg.com.au) Date: Wed Nov 25 01:03:29 2009 Subject: [Beowulf] Re: MPI & ScaLAPACK: error in MPI_Comm_size: Invalid communicator Message-ID: <1098161995.41749f4b65333@postoffice.tpg.com.au> Hi again - I need to correct something in my earlier email - I left out the subroutine SL_INIT, Here is the entire code again, ****example2.f*************** program example2 integer ictxt,npcol,nprow nprow=2 nocol=3 call sl_init(ictxt,nprow,npcol) call BLACS_EXIT(0) stop end subroutine sl_init(ictxt,nprow,npcol) integer ictxt,nprow,npcol,iam,nprocs external BLACS_GET,BLACS_GRIDINIT,BLACS_PINFO,BLACS_SETUP call BLACS_PINFO(iam,nprocs) if (nprocs.lt.1) then if (iam.eq.0) nprocs=nprow*npcol call BLACS_SETUP(iam,nprocs) endif call BLACS_GET(-1,0,ictxt) call BLACS_GRIDINIT(ictxt,'Row-major',nprow,npcol) return end ***************************** The errors are still the same however - it doesn't like any of my BLACS calls. 
Any help would be greatly appreciated, thanks Clint Joung ----- Forwarded message from cjoung@tpg.com.au ----- Date: Tue, 19 Oct 2004 12:14:53 +1000 From: cjoung@tpg.com.au Subject: MPI & ScaLAPACK: error in MPI_Comm_size: Invalid communicator To: beowulf@beowulf.org Hi, I was hoping someone could help me with a F77,MPI & ScaLAPACK problem. Yet, it still doesn't work!, the following is the output when I try to compile and run it, ********************************************************* [tony@carmine clint]$ mpif77 -o example2 example2.f -L/opt/intel/mkl70cluster/lib/32 -lmkl_scalapack -lmkl_blacsF77init -lmkl_blacs -lmkl_blacsF77init -lmkl_lapack -lmkl_ia32 -lguide -lpthread -static-libcxa [tony@carmine clint]$ mpirun -n 6 ./example2 aborting job: Fatal error in MPI_Comm_size: Invalid communicator, error stack: MPI_Comm_size(82): MPI_Comm_size(comm=0x5b, size=0x80d807c) failed MPI_Comm_size(66): Null Comm pointer aborting job: Fatal error in MPI_Comm_size: Invalid communicator, error stack: MPI_Comm_size(82): MPI_Comm_size(comm=0x5b, size=0x80d807c) failed MPI_Comm_size(66): Null Comm pointer rank 5 in job 17 carmine.soprano.org_32782 caused collective abort of all ranks exit status of rank 5: return code 13 aborting job: Fatal error in MPI_Comm_size: Invalid communicator, error stack: MPI_Comm_size(82): MPI_Comm_size(comm=0x5b, size=0x80d807c) failed MPI_Comm_size(66): Null Comm pointer rank 1 in job 17 carmine.soprano.org_32782 caused collective abort of all ranks exit status of rank 1: return code 13 rank 0 in job 17 carmine.soprano.org_32782 caused collective abort of all ranks exit status of rank 0: return code 13 [tony@carmine clint]$ ********************************************************* .so apparently somethings wrong with MPI_Comm_size, but beyond that, I can't figure it out. From kus at free.net Mon Oct 18 11:34:13 2004 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Re: [suse-amd64] Mellanox Infiniband on SuSE 9.0 - xp0 module, etc In-Reply-To: <20041018184354.GA26195@Baruch.pantasys.com> Message-ID: In message from Bob Lee (Mon, 18 Oct 2004 11:43:54 -0700): >On Sun, Oct 17, 2004 at 11:00:13PM +0400, Mikhail Kuzminsky wrote: >> Dear colleagues ! > >> I've some problem w/Mellanox IB software installation (in >>particular, >> the absence of xp0 kernel module). > >> I've installed Infiniband HCA (PCI-X Infinihost MT23108 low profile) >> to upgarde my interconnect on Tyan S2880 under SuSE Linux 9.0 >> (2.4.21-243). > >> It's official software platform supported by whole Mellanox >> IBHPC-0.5.0 software collection. > > I had difficulty with the 0.5.0 release on SuSE (9.1), but > the latest release (1.6.0) which is available through the > their web site (with registration). This worked seamlessly > IPoIB came right up no problem. The resulting package is > a bit bloated, but you do get everything. I beleive (according 1.6.0 documentations) that it'll not work under SuSE 9.0 :-( Yours Mikhail > > ... > >> 3) TO BE MORE CORRECT: pls take into account, that my host >>w/installed >> software& Mellanox hardware *isn't connected* currently with IB >>switch >> (i.e. is "standalone" server !) > > Remember that you need some form of subnet management to > assign LIDs to the ports (minism or opensm on one node > after the port is in "INIT" state -- using vstat). > > ... remaining deleted to save to old growth electrons ... 
> >> Yours >> Mikhail Kuzminsky >> Zelinsky Institute of Organic Chemistry >> Moscow > >hope this helps >-bob From Nout.Gemmeke at nl.fujitsu.com Tue Oct 19 08:33:31 2004 From: Nout.Gemmeke at nl.fujitsu.com (Gemmeke Nout) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Bonding results in 1 HBA for Tx and 1 HBA for Rx. Message-ID: Hi there Beowulf, Just a quick question. When using bonding on RehHat AS V3.0 I see that eth0 is used for sending data (Tx) and eth1 for receive of data (Rx). This results in bandwidth of only 1Gbit/sec..... Fail over works fine however. Any idea if this can be configured ?? Thanks, Nout Gemmeke Consultant Enterprise Services FUJITSU SERVICES Fujitsu Services B.V., Het Kwadrant 1 P.O. Box 1067, 3600 BB Maarssen, The Netherlands Tel: +31 346 598451 Mob: +31 651 218661 Fax: +31 346 561909 Email: nout.gemmeke@nl.fujitsu.com Web: nl.fujitsu.com Fujitsu Services B.V., Registered in the Netherlands no 30078286 ___________________________________________________________________________ The information in this e-mail (and its attachments) is confidential and intended solely for the addressee(s). If this message is not addressed to you, please be aware that you have no authorisation to read this e-mail, to copy it or to forward it to any person other than the addressee(s). Should you have received this e-mail by mistake, please bring this to the attention of the sender, after which you are kindly requested to destroy the original message and delete any copies held in your system. Fujitsu Services and its affiliated companies cannot be held responsible or liable in any way whatsoever for and/or in connection with any consequences and/or damage resulting from the contents of this e-mail and its proper and complete dispatch and receipt. Fujitsu Services does not guarantee that this e-mail has not been intercepted and amended, nor that it is virus-free. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041019/6a64df37/attachment.html From hahn at physics.mcmaster.ca Tue Oct 19 12:33:22 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] MPI & ScaLAPACK: error in MPI_Comm_size: Invalid communicator In-Reply-To: <1098152093.4174789d863ce@postoffice.tpg.com.au> Message-ID: > MPI_Comm_size(82): MPI_Comm_size(comm=0x5b, size=0x80d807c) failed I have no insight into any of this, except that these parameters are obviously reversed. 0x5b is a sensible size, and 0x80d807c is not. but 0x80d807c is a sensible pointer... if this isn't a source-level transposition, perhaps it has to do with mixed calling conventions? From henry.gabb at intel.com Tue Oct 19 14:21:31 2004 From: henry.gabb at intel.com (Gabb, Henry) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] MPI & ScaLAPACK: error in MPI_Comm_size: Invalid communicator Message-ID: Hi Clint, The "Null Comm pointer" error that you're seeing is almost always due to a header mismatch. I don't think Intel Cluster MKL 7.0 supports MPICH2 yet. According to the Intel Cluster MKL system requirements (http://www.intel.com/software/products/clustermkl/sysreq.htm), MPICH-1.2.5 is supported. MPICH-1.2.6 should work too. You might post a question to the Intel MKL forum (http://softwareforums.intel.com/ids) about MPICH2 support. 
I did a quick test of your program on one of my clusters and it ran fine (after fixing the typo in the nocol=3 statement): [henry@castor1 cmkl-test]$ /opt/mpich-1.2.6-gcc/bin/mpif77 -o example example.f \ -L/opt/intel/mkl70cluster/lib/64 -lmkl_scalapack \ -lmkl_blacsF77init_gnu -lmkl_blacs -lmkl_blacsF77init_gnu \ -lmkl_lapack -lmkl -lguide -lpthread [henry@castor1 cmkl-test]$ /opt/mpich-1.2.6-gcc/bin/mpirun -n 6 ./example [henry@castor1 cmkl-test]$ I used a GNU-built MPICH-1.2.6 because I didn't have an Intel-ready MPICH installation handy. Best regards, Henry Gabb Intel Parallel and Distributed Solutions Division From ashley at quadrics.com Wed Oct 20 06:44:49 2004 From: ashley at quadrics.com (Ashley Pittman) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] bandwidth: who needs it? In-Reply-To: References: Message-ID: <1098279888.854.220.camel@ashley> On Sat, 2004-10-16 at 22:36, Mark Hahn wrote: > do you have applications that are pushing the limits of MPI bandwidth? > for instance, code that actually comes close to using the 8-900 MB/s > that current high-end interconnect provides? > > we have a fairly wide variety of codes inside SHARCnet, but I haven't > found anyone who is even complaining about our last-generation fabric > (quadrics elan3, around 250 MB/s). is it just that we don't have the > right researchers? I've heard people mutter about earthquake researchers > being able to pin a 800 MB/s network, and claims that big FFT folk can > do so as well. by contrast, many people claim to notice improvements > in latency from old/mundane (6-7 us) to new/good (<2 us). > > I'd be interested in hearing about applications you know of which are > very sensitive to having large bandwidth (say, .8 GB/s today). It's not so much that you don't have the right researchers, it's the type of projects they are researching or at least the way they are attacking the problem. Latency is every bit as critical as bandwidth and in many cases more so. Latency at scale is also critical, multi-hop networks dictate the need to use nearest-neighbour algorithms and therefore have trouble scaling to large CPU counts. It's also harder for newcomers and non technical people to conceptualise latency and especially scalable latency. >From code optimisation that I've done in the past I've also found that bandwidth is easier to hide via pipelining than latency and therefore is less critical to wall clock time. Also don't forget that SMP boxes are getting wider, think in terms of Mb/s/CPU and todays 900Mb/s network bandwidth suddenly doesn't sound that much. The good news here however is that the large SMPs tend to have multiple PCI-X busses so can use multiple networks effectively. Ashley, From tony at mpi-softtech.com Wed Oct 20 06:39:19 2004 From: tony at mpi-softtech.com (Anthony Skjellum) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] MPI & ScaLAPACK: error in MPI_Comm_size: Invalid communicator In-Reply-To: References: Message-ID: <41766A87.5000003@mpi-softtech.com> Just my two cents... Size is an out parameter, so it should be an address. Depending on the MPI, MPI_Comm is also secretly mapped to a pointer, but it could also be an index into an array structure (or hash) inside the MPI. So, it is hard to infer anything from the value of comm... It would be interesting to do printf("%x %x",(int)MPI_COMM_WORLD,&size) before calling the Scalapack to get an idea of these quantities... 
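A minimal sketch expanding on the printf idea above (everything below, including the file name comm_check.c, is illustrative and not from the original posts). It prints how the MPI you are actually running encodes MPI_COMM_WORLD and where the out parameter lives, then calls MPI_Comm_size() normally. Incidentally, the failing comm=0x5b is 91 decimal, which -- if memory serves -- is the handle value MPICH-1's mpif.h assigns to MPI_COMM_WORLD; that would be consistent with the header/library mismatch diagnosed earlier in the thread (an MKL BLACS built against MPICH-1 handing its communicator to an MPICH2 runtime). If this little program runs cleanly under the same wrappers used for the ScaLAPACK build while the BLACS calls still fail, the problem is in the BLACS/MPI pairing rather than in MPI itself.

**** comm_check.c (hypothetical) ****************************
/* Print this MPI's representation of MPI_COMM_WORLD and the address
 * of the 'size' out parameter, then call MPI_Comm_size() normally. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int size = -1, rank = -1;

    MPI_Init(&argc, &argv);

    /* MPI_Comm is an opaque handle: a small integer in some
     * implementations, a pointer in others. */
    printf("MPI_COMM_WORLD handle = 0x%lx, &size = %p\n",
           (unsigned long) MPI_COMM_WORLD, (void *) &size);

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
*************************************************************

Compile and run it with the same wrappers used for example2.f, e.g. "mpicc -o comm_check comm_check.c" and "mpirun -n 2 ./comm_check" (wrapper names assumed).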
-Tony

Mark Hahn wrote:
>>MPI_Comm_size(82): MPI_Comm_size(comm=0x5b, size=0x80d807c) failed
>>
>
>I have no insight into any of this, except that these parameters
>are obviously reversed. 0x5b is a sensible size, and 0x80d807c
>is not. but 0x80d807c is a sensible pointer...
>
>if this isn't a source-level transposition, perhaps it has to do
>with mixed calling conventions?
>
>
>_______________________________________________
>Beowulf mailing list, Beowulf@beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

From eugen at leitl.org Thu Oct 21 02:35:52 2004 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] PARALLEL PROGRAMMING WORKSHOP Nov 29 - Dec 1, 2004, Juelich, Call for Participation (fwd from rabenseifner@hlrs.de) Message-ID: <20041021093552.GS1457@leitl.org> ----- Forwarded message from Rolf Rabenseifner ----- From: Rolf Rabenseifner Date: Thu, 21 Oct 2004 11:15:53 +0200 (CEST) To: eugen@leitl.org Subject: PARALLEL PROGRAMMING WORKSHOP Nov 29 - Dec 1, 2004, Juelich, Call for Participation

Dear Madam or Sir, could you please forward this announcement to any interested colleagues, since people interested in courses on "Parallel Programming" often cannot be reached directly through this mailing list. Places are still available in the MPI/OpenMP course at Forschungszentrum Jülich. The lectures are in German; the slides are in English. Kind regards, Rolf Rabenseifner

======================================================================
Call for Participation
======================================================================

PARALLEL PROGRAMMING WORKSHOP Fall 2004

Date             Location         Content (for beginners/advanced)
---------------- ---------------- --------------------------------
Nov. 29 - Dec. 1 FZ Jülich, ZAM   Parallel Programming (70% / 30%)
(3-day course in German)

Registration and further information: http://www.hlrs.de/news-events/events/2004/parallel_prog_fall2004/ (course G)

The aim of this workshop is to give people with some programming experience an introduction to the basics of parallel programming. The focus is on the programming models MPI and OpenMP, and on PETSc. Language support is given for Fortran and C. The course was developed by HLRS, EPCC, NIC and ZHR. Hands-on sessions will allow users to test and understand the basic constructs of MPI, OpenMP, and PETSc. Message passing with MPI is the major programming model on large distributed-memory systems in high-performance computing. OpenMP is dedicated to shared-memory systems. PETSc is a high-level programming interface for parallel solvers. Lectures will be given by Dr. Rolf Rabenseifner (HLRS, member of the MPI-2 Forum). Extended registration deadline: Nov. 12, 2004. The course language is German. All slides and handouts are in English.

---------------------------------------------------------------------
Please forward this announcement to any colleagues who may be interested. Our apologies if you receive multiple copies.
---------------------------------------------------------------------
---------------------------------------------------------------------
Dr.
Rolf Rabenseifner High Performance Computing Parallel Computing Center Stuttgart (HLRS) Rechenzentrum Universitaet Stuttgart (RUS) Phone: ++49 711 6855530 Allmandring 30 FAX: ++49 711 6787626 D-70550 Stuttgart rabenseifner@hlrs.de Germany http://www.hlrs.de/people/rabenseifner --------------------------------------------------------------------- ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041021/a8a96556/attachment.bin From scheinin at crs4.it Thu Oct 21 01:48:33 2004 From: scheinin at crs4.it (Alan Scheinine) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations Message-ID: <200410210848.i9L8mXfm005676@dali.crs4.it> Many venders sell U1 cases for dual Opteron based on the Tyan main boards, but on the other hand, a vender here says that the product Newisys 2100 is much more reliable than Tyan though it costs 10 to 20 percent more. I have not previously heard of Newisys and I do not recall it being mentioned in this mailing list. Would anyone like to comment? best regards, Alan Scheinine Email: scheinin@crs4.it From mphelps at cfa.harvard.edu Thu Oct 21 09:54:31 2004 From: mphelps at cfa.harvard.edu (Matt Phelps) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: <200410210848.i9L8mXfm005676@dali.crs4.it> References: <200410210848.i9L8mXfm005676@dali.crs4.it> Message-ID: <4177E9C7.601@cfa.harvard.edu> Alan Scheinine wrote: > Many venders sell U1 cases for dual Opteron based on the Tyan > main boards, but on the other hand, a vender here says that the > product Newisys 2100 is much more reliable than Tyan though it > costs 10 to 20 percent more. I have not previously heard of > Newisys and I do not recall it being mentioned in this mailing > list. Would anyone like to comment? > best regards, > Alan Scheinine Email: scheinin@crs4.it > Alan, These are available from Sun as the V20z server. We have a 32 node cluster of 'em and (so far ;-) are happy. -- Matt Phelps System Administrator, Computation Facility Harvard - Smithsonian Center for Astrophysics mphelps@cfa.harvard.edu, http://cfa-www.harvard.edu From lindahl at pathscale.com Thu Oct 21 10:29:04 2004 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: <200410210848.i9L8mXfm005676@dali.crs4.it> References: <200410210848.i9L8mXfm005676@dali.crs4.it> Message-ID: <20041021172904.GA1351@greglaptop.internal.keyresearch.com> > Many venders sell U1 cases for dual Opteron based on the Tyan > main boards, but on the other hand, a vender here says that the > product Newisys 2100 is much more reliable than Tyan though it > costs 10 to 20 percent more. Newisys came out with one of the first Opteron motherboards, and their 2 cpu motherboard is still shipped by Sun as the Sunfire 20z. It's a pretty expensive motherboard, but the % added to the final price depends on how much memory you buy and which cpus you're using. There are a *lot* of clusters out there using that Tyan motherboard. 
I don't think anyone's decided that it's any less reliable than any other low-end server motherboard. -- g From lindahl at pathscale.com Thu Oct 21 10:58:55 2004 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] bandwidth: who needs it? In-Reply-To: References: Message-ID: <20041021175855.GA1724@greglaptop.internal.keyresearch.com> > do you have applications that are pushing the limits of MPI bandwidth? > for instance, code that actually comes close to using the 8-900 MB/s > that current high-end interconnect provides? Bandwidth is important not only for huge messages that hit 900 MB/s, but also for medium sized messages. A naive formula for how long it takes to send a message is: T_size = T_0 + size / max_bandwidth For example, for a 4k message with T_0 = 5 usec and either 400 MB/s or 800 MB/s, T_4k_400M = 5 + 4k/400M = 5 + 10 = 15 usec T_4k_800M = 5 + 4k/800M = 5 + 5 = 10 usec A big difference. But you're only getting 266 MB/s and 400 MB/s bandwidth, respectively. Of course performance is usually a bit less than this naive model. But the effect is real, becoming unimportant for packets smaller than ~ 2k in this example. The size at which this effect becomes unimportant depends on T_0 and the bandwidth. -- greg From joelja at darkwing.uoregon.edu Thu Oct 21 10:32:24 2004 From: joelja at darkwing.uoregon.edu (Joel Jaeggli) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: <200410210848.i9L8mXfm005676@dali.crs4.it> References: <200410210848.i9L8mXfm005676@dali.crs4.it> Message-ID: I haven't personally dealt with newisys although we're looking at them for larger multiway boxen. however if you're looking at tyan, they sell completely integrated building blocks built around their motherboards. take a look at the tyan tranport gx28 which comes in 2 bay and 4 bay flavors http://www.tyan.com/products/html/gx28b2882.html On Thu, 21 Oct 2004, Alan Scheinine wrote: > Many venders sell U1 cases for dual Opteron based on the Tyan > main boards, but on the other hand, a vender here says that the > product Newisys 2100 is much more reliable than Tyan though it > costs 10 to 20 percent more. I have not previously heard of > Newisys and I do not recall it being mentioned in this mailing > list. Would anyone like to comment? > best regards, > Alan Scheinine Email: scheinin@crs4.it > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- -------------------------------------------------------------------------- Joel Jaeggli Unix Consulting joelja@darkwing.uoregon.edu GPG Key Fingerprint: 5C6E 0104 BAF0 40B0 5BD3 C38B F000 35AB B67F 56B2 From jholmes at psu.edu Thu Oct 21 11:34:29 2004 From: jholmes at psu.edu (Jason Holmes) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: References: <200410210848.i9L8mXfm005676@dali.crs4.it> Message-ID: <41780135.6020806@psu.edu> FWIW, we have 80 of the Sun v20z's (Newisys 2100) and we've been very happy with them (very well built, reliable, and no problems so far). We have 16 Angstrom blades with dual Opteron Tyan motherboards (s2882) in them as well. They initially had problems, but Tyan figured out the issue and shipped us replacement motherboards overnight at no cost. 
Thanks, -- Jason Holmes Joel Jaeggli wrote: > I haven't personally dealt with newisys although we're looking at them > for larger multiway boxen. > > however if you're looking at tyan, they sell completely integrated > building blocks built around their motherboards. > > take a look at the tyan tranport gx28 which comes in 2 bay and 4 bay > flavors > > http://www.tyan.com/products/html/gx28b2882.html > > On Thu, 21 Oct 2004, Alan Scheinine wrote: > >> Many venders sell U1 cases for dual Opteron based on the Tyan >> main boards, but on the other hand, a vender here says that the >> product Newisys 2100 is much more reliable than Tyan though it >> costs 10 to 20 percent more. I have not previously heard of >> Newisys and I do not recall it being mentioned in this mailing >> list. Would anyone like to comment? >> best regards, >> Alan Scheinine Email: scheinin@crs4.it >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > From james.p.lux at jpl.nasa.gov Thu Oct 21 11:53:17 2004 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] bandwidth: who needs it? References: <20041021175855.GA1724@greglaptop.internal.keyresearch.com> Message-ID: <000801c4b79f$3d016080$33a8a8c0@LAPTOP152422> Bandwidth is also important if you have any sort of "store and forward" process in the comm link (say, in a switch), because, typically, you have to wait for the entire message to arrive (so you can check the CRCC) before you can send it on it's way to the next destination. I'm sure that there are high performance switches around that only wait 'til enough of the header arrives to make the routing decision, but then, the switch has to passively pass the data through without error checking. ----- Original Message ----- From: "Greg Lindahl" To: "Mark Hahn" Cc: Sent: Thursday, October 21, 2004 10:58 AM Subject: Re: [Beowulf] bandwidth: who needs it? > > do you have applications that are pushing the limits of MPI bandwidth? > > for instance, code that actually comes close to using the 8-900 MB/s > > that current high-end interconnect provides? > > Bandwidth is important not only for huge messages that hit 900 MB/s, > but also for medium sized messages. A naive formula for how long it > takes to send a message is: From seth at integratedsolutions.org Thu Oct 21 10:37:50 2004 From: seth at integratedsolutions.org (Seth Bardash) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: <200410210848.i9L8mXfm005676@dali.crs4.it> Message-ID: <200410211737.i9LHbu409434@integratedsolutions.org> -----Original Message----- From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org] On Behalf Of Alan Scheinine Sent: Thursday, October 21, 2004 2:49 AM To: Beowulf@beowulf.org Subject: [Beowulf] dual Opteron recommendations Many venders sell U1 cases for dual Opteron based on the Tyan main boards, but on the other hand, a vender here says that the product Newisys 2100 is much more reliable than Tyan though it costs 10 to 20 percent more. I have not previously heard of Newisys and I do not recall it being mentioned in this mailing list. Would anyone like to comment? 
best regards, Alan Scheinine Email: scheinin@crs4.it _______________________________________________ Alan and the list, First, let me say that I am not trying to suggest to anyone to buy our products, only provide information we have seen integrating Dual Opteron systems: 1) The Newisys boards are made expecially for them and only come in their cases. I can not comment on their reliability as we have not used nor tried to integrate their systems. I think Sun uses them and charges appropriately for Sun. 2) We have built over 250 Dual Opteron systems, mostly 1U's used in large linux clusters. Initially, we used Tyan MB's and found that we were getting around a 10% to 15% DOA rate here before burn-in. After burn-in we had no failures here or deployed. So.... My take on the Tyan Dual Opteron MB's is that they work fine once they have gone through burn-in but the DOA rate out of the box is not good. YMMV. We then switched to the Arima (www.accelertech.com) HDAMA, ATO-2161 motherboards. These have had DOA's only caused by the UPS gorillas - All 2 of the HDAMA MB's that were DOA were received in badly damaged boxes. Over 190 have now been received and integrated with no failures - either here, in burn-in or in the field. BTW, this motherboard is the AMD reference design and has been very robust even with Enhanced Latency memory (CL 2-3-2-6-1). We have installed Fedora Core 2, RH ES 3.0, White Box Linux and SUSE 9.1 and they all work fine. We are testing the Server and Workstation Iwill motherboards (www.iwillusa.com) and they seem to work fine so far. There are many other factors that should influence Dual Opteron vendor selection. These factors are: cooling, performance, configuration, I/O, reliability, technical expertise, support and price - your order of importance will usually dictate a vendor. Hope this provides the feedback required to make an informed decision about motherboard selection and system vendors. Seth Bardash Integrated Solutions and Systems http://www.integratedsolutions.org Supplier of AMD and Intel Servers and Systems running Windows and Linux. *Failure can not cope with perseverance* --- Outgoing mail is certified Virus Free. Checked by AVG anti-virus system (http://www.grisoft.com). Version: 6.0.779 / Virus Database: 526 - Release Date: 10/19/2004 From lindahl at pathscale.com Thu Oct 21 12:06:59 2004 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] bandwidth: who needs it? In-Reply-To: <000801c4b79f$3d016080$33a8a8c0@LAPTOP152422> References: <20041021175855.GA1724@greglaptop.internal.keyresearch.com> <000801c4b79f$3d016080$33a8a8c0@LAPTOP152422> Message-ID: <20041021190659.GF1724@greglaptop.internal.keyresearch.com> On Thu, Oct 21, 2004 at 11:53:17AM -0700, Jim Lux wrote: > I'm sure that there are > high performance switches around that only wait 'til enough of the header > arrives to make the routing decision, but then, the switch has to passively > pass the data through without error checking. Actually, it's more common to route immediately after you've seen the header, but to compute the entire CRC and then do something minimal if the CRC turns out to be wrong. I may be confusing the exact details, but I think that Infiniband just counts the bad packets and depends on the endpoint to discard the packet. Myrinet both counts the error, and sticks a zero into the CRC, so that subsequent hops will know that the CRC was found to be bad earlier in the path. 
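Before the next message, a back-of-the-envelope sketch (not part of the original thread; file name and numbers chosen for illustration) of the store-and-forward point Jim Lux raises above: a switch that must buffer an entire frame before forwarding it adds frame_bits / link_rate of serialization delay per hop, on top of any fixed switch latency. For a 1500-byte Ethernet frame this works out to roughly 12 us per hop at 1 Gbit/s and 1.2 us at 10 Gbit/s, in the same ballpark as the ~15 us vs ~1.5 us figures quoted in the reply that follows (wire-level overheads push the observed numbers a little higher).

**** sf_delay.c (hypothetical) *******************************
/* Per-hop store-and-forward serialization delay:
 * t = frame_bits / link_rate. */
#include <stdio.h>

/* Serialization time in microseconds for one frame on one link. */
static double serialization_us(double frame_bytes, double gbit_per_s)
{
    return (frame_bytes * 8.0) / (gbit_per_s * 1e9) * 1e6;
}

int main(void)
{
    const double frame = 1500.0;          /* standard Ethernet MTU */
    const double rates[] = { 1.0, 10.0 }; /* GigE and 10 GigE */
    int i;

    for (i = 0; i < 2; i++)
        printf("%.0f-byte frame at %.0f Gbit/s: %.1f us per store-and-forward hop\n",
               frame, rates[i], serialization_us(frame, rates[i]));

    return 0;
}
*************************************************************

A cut-through switch that forwards as soon as the header has arrived avoids most of this term, which is the trade-off against end-to-end CRC checking discussed in this part of the thread.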
GigE and 10 gigE switches receive the whole packet before sending it on, and so the higher bandwidth is a huge help for 10 gig's latency -- drops it from 15 usec for a 1500 byte packet to 1.5 usec. -- greg From hanzl at noel.feld.cvut.cz Thu Oct 21 13:09:16 2004 From: hanzl at noel.feld.cvut.cz (hanzl@noel.feld.cvut.cz) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Storage and cachefs on nodes - NFS support exists In-Reply-To: <20041011001046L.hanzl@unknown-domain> References: <20041008233311G.hanzl@unknown-domain> <20041011001046L.hanzl@unknown-domain> Message-ID: <20041021220916X.hanzl@unknown-domain> I have the pleasure to give you very optimistic update on persistent file caching. Few days ago I wrote these skeptic lines: > I am not sure how much can I expect from linux cachefs as seen in > e.g. 2.6.9-rc3-mm3 - if I got it right, it is a kernel subsystem with > intra-kernel API, being now tested with AFS and intended as usable for > NFS. It is however "low" on NFS team priority list. So linux cachefs > might provide cleaner solutions than Solaris cachefs - if it ever > provides them. and now I see that NFS already can use this local-disk-caching subsystem! There is linux-cachefs maillist for this, you may want to read: http://www.redhat.com/archives/linux-cachefs/2004-October/msg00027.html - 2.6.9-rc4-mm1 patch that will enable NFS (even NFS4) to do persistent file caching on the local harddisk http://www.redhat.com/archives/linux-cachefs/2004-October/msg00004.html - older message explaining what is going on http://www.redhat.com/archives/linux-cachefs/2004-October/msg00019.html - about ways to get this to the mainline kernel http://www.redhat.com/mailman/listinfo/linux-cachefs - list archives and subscription page I believe that this subsystem will be an immense help for work on huge data with mostly read access. And much less administrative hassle - once this gets to the mainline kernel (well, yeah, any help to push it there is welcome!) it will be much much easier to use. Just a normal NFS server. Just a normal NFS client with the NFS_MOUNT_FSCACHE or NFS4_MOUNT_FSCACHE mounting option ON. Hope that this attempt to make relatively simple persistent caching for Linux will catch up and survive even kernel_version+=0.2 (usual killer for similar projects). Best Regards Vaclav Hanzl From redboots at ufl.edu Thu Oct 21 13:19:45 2004 From: redboots at ufl.edu (JOHNSON,PAUL C) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] parallel sparse linear solver choice? Message-ID: <1377588425.1098389985998.JavaMail.osg@osgjas02.cns.ufl.edu> All: I was wondering whats your choice for a parallel sparse linear solver? I have a beowulf cluster(~4 nodes, ok really small I know) connected at 100Mbps. The computers are P4 2.2GHz with 1Gb ram. The matrices are formed by a finite element program. They are sparse, square, symmetric, and I would like to solve problems with more than 200000 columns. Which of the solvers is easiest to set up and utilize? One problem I am trying to solve is 156,240 x 156,240 with 6,023,241 non-zero entries. Thanks for any help, Paul -- JOHNSON,PAUL C From rbw at ahpcrc.org Thu Oct 21 13:59:52 2004 From: rbw at ahpcrc.org (Richard Walsh) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] bandwidth: who needs it? In-Reply-To: <20041021175855.GA1724@greglaptop.internal.keyresearch.com> Message-ID: <20041021205952.CF20C6EB10@clam.ahpcrc.org> Greg Lindahl wrote: >> do you have applications that are pushing the limits of MPI bandwidth? 
>> for instance, code that actually comes close to using the 8-900 MB/s >> that current high-end interconnect provides? > >Bandwidth is important not only for huge messages that hit 900 MB/s, >but also for medium sized messages. A naive formula for how long it >takes to send a message is: > >T_size = T_0 + size / max_bandwidth > >For example, for a 4k message with T_0 = 5 usec and either 400 MB/s or >800 MB/s, > >T_4k_400M = 5 + 4k/400M = 5 + 10 = 15 usec >T_4k_800M = 5 + 4k/800M = 5 + 5 = 10 usec > >A big difference. But you're only getting 266 MB/s and 400 MB/s >bandwidth, respectively. > >Of course performance is usually a bit less than this naive model. But >the effect is real, becoming unimportant for packets smaller than ~ 2k >in this example. The size at which this effect becomes unimportant >depends on T_0 and the bandwidth. The above also makes a point about a mid-range regime of message sizes whose transfer times are affected ~equally by bandwidth and latency changes. Halving the latency in the 4K/800M case above is equivalent to doubling the bandwidth for a message of this size: T_4k_800M. = 2.5 + 4k/800M = 2.5 + 5.0 = 7.5 usec T_4k_800M = 5.0 + 4k/800M = 5.0 + 5.0 = 10.0 usec T_4k_1600M = 5.0 + 4k/1600M = 5.0 + 2.5 = 7.5 usec For a given interconnect with a known latency and bandwidth there is a "characteristic" message size whose transfer time is equally sensitive to perturbations in bandwidth and latency (latency and bandwidth piece of the transfer time are equal). So, for an "Elan-4-like" interconnect characteristic message length would be 1.6k: T_4k_800M = 1.0 + 1.6k/800M = 1.0 + 2.0 = 3.0 usec T_4k_800M = 2.0 + 1.6k/800M = 2.0 + 2.0 = 4.0 usec T_4k_1600M = 2.0 + 1.6k/1600M = 2.0 + 1.0 = 3.0 usec Messages sizes in the vicinity of the characteristic length will respond approximately equally to improvements in either factor. Messages much larger in size will be more sensitive to bandwidth improvements in an interconnect upgrade while message sizes much smaller will be more sensitive to latency improvements in an upgrade. One might argue that bandwidth actually matters more because message sizes (along with problem sizes) can in theory grow indefinitely (drop in some more memory and double you array sizes) while they can be made only be so small -- this is a position supported by the rate of storage growth, but undermined by slower bandwidth growth and processor count increases. I think I will keep my bandwidth though ... and take any off of the hands of those who ... don't need it ... ;-) ... rbw #--------------------------------------------------- # Richard Walsh # Project Manager, Cluster Computing, Computational # Chemistry and Finance # netASPx, Inc. # 1200 Washington Ave. So. # Minneapolis, MN 55415 # VOX: 612-337-3467 # FAX: 612-337-3400 # EMAIL: rbw@networkcs.com, richard.walsh@netaspx.com # rbw@ahpcrc.org # #--------------------------------------------------- # "What you can do, or dream you can, begin it; # Boldness has genius, power, and magic in it." # -Goethe #--------------------------------------------------- # "Without mystery, there can be no authority." # -Charles DeGaulle #--------------------------------------------------- # "Why waste time learning when ignornace is # instantaneous?" 
-Thomas Hobbes #--------------------------------------------------- From bill at cse.ucdavis.edu Thu Oct 21 19:03:50 2004 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: <200410210848.i9L8mXfm005676@dali.crs4.it> References: <200410210848.i9L8mXfm005676@dali.crs4.it> Message-ID: <20041022020350.GA32640@cse.ucdavis.edu> On Thu, Oct 21, 2004 at 10:48:33AM +0200, Alan Scheinine wrote: > Many venders sell U1 cases for dual Opteron based on the Tyan > main boards, but on the other hand, a vender here says that the > product Newisys 2100 is much more reliable than Tyan though it > costs 10 to 20 percent more. I have not previously heard of > Newisys and I do not recall it being mentioned in this mailing > list. Would anyone like to comment? > best regards, I'm familar with 48 sun v20z (newisys) machines around here, only one died so far with a hard memory error (I.e. won't boot). We also have 40 ish Tyan systems without any failures, all were burned in by the vendor. Speaking of which, has anyone done anything useful with the v20z LCD display, ours just say something like IP address of the management interface and OS booted or similar. I was hoping for hostname, maybe system load, even a way to pull a node out of the queue (there are several buttons under the LCD). -- Bill Broadley Computational Science and Engineering UC Davis From bill at cse.ucdavis.edu Thu Oct 21 19:10:54 2004 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: <200410210848.i9L8mXfm005676@dali.crs4.it> References: <200410210848.i9L8mXfm005676@dali.crs4.it> Message-ID: <20041022021054.GB32640@cse.ucdavis.edu> Oh, speaking of which the main advantage I've seen in the newisys is the remote managability. You can ssh to the management interface, check temperatures, turn the machine on/off, and other related functionality. Alas, as far as I can tell the passthru for the management interface (it has 2 ethernet ports for the management) isn't usable in any sane way. Unless you want to do something like: ssh node001 ssh node002 ssh node003 ssh node004 .... turn node off exit ... exit exit The idea of not requiring a masterswitch+ for power management, a cyclades or similar for serial management, or a switch for a seperate management network can be attractive. Not that other motherboards don't have management options. -- Bill Broadley Computational Science and Engineering UC Davis From cjoung at tpg.com.au Thu Oct 21 23:02:17 2004 From: cjoung at tpg.com.au (cjoung@tpg.com.au) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] p4_error: interrupt SIGSEGV: 11 Killed by signal 2. Message-ID: <1098424937.4178a269ccd90@postoffice.tpg.com.au> Firstly, Thank you to Mark H, Anthony S and in particular Henry G for your helpful comments on my MPI/scaLAPACK problem. After reading your comments, I removed MPICH2, installed mpich1.2.6 and this did indeed fix the "Null Comm Pointer" error! > Date: Tue, 19 Oct 2004 14:21:31 -0700 > From: "Gabb, Henry" > Subject: Re: [Beowulf] MPI & ScaLAPACK: > error in MPI_Comm_size: Invalid communicator > The "Null Comm pointer" error that you're seeing is almost always due to > a header mismatch. I don't think Intel Cluster MKL 7.0 supports MPICH2 yet. 
> > > Error message: > > aborting job: Fatal error in MPI_Comm_size: Invalid communicator, error stack: > > MPI_Comm_size(82): MPI_Comm_size(comm=0x5b, size=0x80d807c) failed > > MPI_Comm_size(66): Null Comm pointer ********************************************************** Current Problem: I have another problem now, also related to MPI/scaLAPACK. I tried to compile and run the example1.f program - It uses the scalapack PDGESV subroutine to solve the equation [A]x=b (from the netlib/scalapack website - I did NOT modify it) (see end of email for copy of example1.f) It seems to compile ok, but on execution it gives this error message: **************************************** [tony@carmine clint]$ mpif77 -o ex1 example1.f -L/opt/intel/mkl70cluster/lib/32 -lmkl_scalapack -lmkl_blacsF77init -lmkl_blacs -lmkl_blacsF77init -lmkl_lapack -lmkl_ia32 -lguide -lpthread -static-libcxa [tony@carmine clint]$ mpirun -np 6 ./ex1 p0_24505: p4_error: interrupt SIGSEGV: 11 Killed by signal 2. Killed by signal 2. Killed by signal 2. Killed by signal 2. Killed by signal 2. [tony@carmine clint]$ **************************************** (there are also comments about "broken pipes", but I didn't include that part above) ..as far as I can discover on the web, SIGSEGV represents a "segmentation fault", beyond this, I don't know what to do for a fix. Some people have suggested fixes such as: * increasing memory sizes: i.e. type in: > >export P4_GLOBMEMSIZE=536870912 > > > >(=512MB). ..but this didn't do anything for me. Generally, I have not seen any other solutions which have had a positive response. They say, a problem like this is normally attributed to a bug in the source code, but seeing as the netlib/scalapack developers give this source code out as the beginners basic 'hello world' program, I doubt this program would carry a bug in it! I would have to guess that there's something ELSE wrong. (In any case, I tried another program out of a book that does basically the same thing - same problem) I was hoping a reader in this forum has seen this problem before, and knows of a solution. Any suggestions, even speculative, would be most appreciated, (Please use simple language - I am quite a novice at all of this...) with many thanks, Clint Joung Postdoctoral Research Associate Department of Chemical Engineering University of Sydney, NSW 2006 Australia ps: My system details and the source code 'example1.f': OS: Linux Redhat Fedora Core 2 FC: Intel Fortran Compiler V8.0 (ifort) (I've also tried building MPI libraries using GNU g77 - same problem) CC: GNU gcc (for some reason, MPI libraries don't 'make' properly using intel icpc) Scalapack et al: Intel Cluster Maths Kernel Library for Linux v7.0 MPI: mpich-1.2.6 (I've also tried mpich1.2.5.2 - same problem) The example1.f program. It runs ok as far as the actual call to PDGESV, then it falls over..... **EXAMPLE1.F*************************************** PROGRAM EXAMPLE1 * * Example Program solving Ax=b via ScaLAPACK routine PDGESV * * .. Parameters .. INTEGER DLEN_, IA, JA, IB, JB, M, N, MB, NB, RSRC, $ CSRC, MXLLDA, MXLLDB, NRHS, NBRHS, NOUT, $ MXLOCR, MXLOCC, MXRHSC PARAMETER ( DLEN_ = 9, IA = 1, JA = 1, IB = 1, JB = 1, $ M = 9, N = 9, MB = 2, NB = 2, RSRC = 0, $ CSRC = 0, MXLLDA = 5, MXLLDB = 5, NRHS = 1, $ NBRHS = 1, NOUT = 6, MXLOCR = 5, MXLOCC = 4, $ MXRHSC = 1 ) DOUBLE PRECISION ONE PARAMETER ( ONE = 1.0D+0 ) * .. * .. Local Scalars .. INTEGER ICTXT, INFO, MYCOL, MYROW, NPCOL, NPROW DOUBLE PRECISION ANORM, BNORM, EPS, RESID, XNORM * .. * .. Local Arrays .. 
INTEGER DESCA( DLEN_ ), DESCB( DLEN_ ), $ IPIV( MXLOCR+NB ) DOUBLE PRECISION A( MXLLDA, MXLOCC ), A0( MXLLDA, MXLOCC ), $ B( MXLLDB, MXRHSC ), B0( MXLLDB, MXRHSC ), $ WORK( MXLOCR ) * .. * .. External Functions .. DOUBLE PRECISION PDLAMCH, PDLANGE EXTERNAL PDLAMCH, PDLANGE * .. * .. External Subroutines .. EXTERNAL BLACS_EXIT, BLACS_GRIDEXIT, BLACS_GRIDINFO, $ DESCINIT, MATINIT, PDGEMM, PDGESV, PDLACPY, $ SL_INIT * .. * .. Intrinsic Functions .. INTRINSIC DBLE * .. * .. Data statements .. DATA NPROW / 2 / , NPCOL / 3 / * .. * .. Executable Statements .. * * INITIALIZE THE PROCESS GRID * CALL SL_INIT( ICTXT, NPROW, NPCOL ) CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL ) * * If I'm not in the process grid, go to the end of the program * IF( MYROW.EQ.-1 ) $ GO TO 10 * * DISTRIBUTE THE MATRIX ON THE PROCESS GRID * Initialize the array descriptors for the matrices A and B * CALL DESCINIT( DESCA, M, N, MB, NB, RSRC, CSRC, ICTXT, MXLLDA, $ INFO ) CALL DESCINIT( DESCB, N, NRHS, NB, NBRHS, RSRC, CSRC, ICTXT, $ MXLLDB, INFO ) * * Generate matrices A and B and distribute to the process grid * CALL MATINIT( A, DESCA, B, DESCB ) * * Make a copy of A and B for checking purposes * CALL PDLACPY( 'All', N, N, A, 1, 1, DESCA, A0, 1, 1, DESCA ) CALL PDLACPY( 'All', N, NRHS, B, 1, 1, DESCB, B0, 1, 1, DESCB ) * * CALL THE SCALAPACK ROUTINE * Solve the linear system A * X = B * CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, $ INFO ) * IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN WRITE( NOUT, FMT = 9999 ) WRITE( NOUT, FMT = 9998 )M, N, NB WRITE( NOUT, FMT = 9997 )NPROW*NPCOL, NPROW, NPCOL WRITE( NOUT, FMT = 9996 )INFO END IF * * Compute residual ||A * X - B|| / ( ||X|| * ||A|| * eps * N ) * EPS = PDLAMCH( ICTXT, 'Epsilon' ) ANORM = PDLANGE( 'I', N, N, A, 1, 1, DESCA, WORK ) BNORM = PDLANGE( 'I', N, NRHS, B, 1, 1, DESCB, WORK ) CALL PDGEMM( 'N', 'N', N, NRHS, N, ONE, A0, 1, 1, DESCA, B, 1, 1, $ DESCB, -ONE, B0, 1, 1, DESCB ) XNORM = PDLANGE( 'I', N, NRHS, B0, 1, 1, DESCB, WORK ) RESID = XNORM / ( ANORM*BNORM*EPS*DBLE( N ) ) * IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN IF( RESID.LT.10.0D+0 ) THEN WRITE( NOUT, FMT = 9995 ) WRITE( NOUT, FMT = 9993 )RESID ELSE WRITE( NOUT, FMT = 9994 ) WRITE( NOUT, FMT = 9993 )RESID END IF END IF * * RELEASE THE PROCESS GRID * Free the BLACS context * CALL BLACS_GRIDEXIT( ICTXT ) 10 CONTINUE * * Exit the BLACS * CALL BLACS_EXIT( 0 ) * 9999 FORMAT( / 'ScaLAPACK Example Program #1 -- May 1, 1997' ) 9998 FORMAT( / 'Solving Ax=b where A is a ', I3, ' by ', I3, $ ' matrix with a block size of ', I3 ) 9997 FORMAT( 'Running on ', I3, ' processes, where the process grid', $ ' is ', I3, ' by ', I3 ) 9996 FORMAT( / 'INFO code returned by PDGESV = ', I3 ) 9995 FORMAT( / $ 'According to the normalized residual the solution is correct.' $ ) 9994 FORMAT( / $ 'According to the normalized residual the solution is incorrect.' $ ) 9993 FORMAT( / '||A*x - b|| / ( ||x||*||A||*eps*N ) = ', 1P, E16.8 ) STOP END SUBROUTINE MATINIT( AA, DESCA, B, DESCB ) * * MATINIT generates and distributes matrices A and B (depicted in * Figures 2.5 and 2.6) to a 2 x 3 process grid * * .. Array Arguments .. INTEGER DESCA( * ), DESCB( * ) DOUBLE PRECISION AA( * ), B( * ) * .. * .. Parameters .. INTEGER CTXT_, LLD_ PARAMETER ( CTXT_ = 2, LLD_ = 9 ) * .. * .. Local Scalars .. INTEGER ICTXT, MXLLDA, MYCOL, MYROW, NPCOL, NPROW DOUBLE PRECISION A, C, K, L, P, S * .. * .. External Subroutines .. EXTERNAL BLACS_GRIDINFO * .. * .. Executable Statements .. 
* ICTXT = DESCA( CTXT_ ) CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL ) * S = 19.0D0 C = 3.0D0 A = 1.0D0 L = 12.0D0 P = 16.0D0 K = 11.0D0 * MXLLDA = DESCA( LLD_ ) * IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN AA( 1 ) = S AA( 2 ) = -S AA( 3 ) = -S AA( 4 ) = -S AA( 5 ) = -S AA( 1+MXLLDA ) = C AA( 2+MXLLDA ) = C AA( 3+MXLLDA ) = -C AA( 4+MXLLDA ) = -C AA( 5+MXLLDA ) = -C AA( 1+2*MXLLDA ) = A AA( 2+2*MXLLDA ) = A AA( 3+2*MXLLDA ) = A AA( 4+2*MXLLDA ) = A AA( 5+2*MXLLDA ) = -A AA( 1+3*MXLLDA ) = C AA( 2+3*MXLLDA ) = C AA( 3+3*MXLLDA ) = C AA( 4+3*MXLLDA ) = C AA( 5+3*MXLLDA ) = -C B( 1 ) = 0.0D0 B( 2 ) = 0.0D0 B( 3 ) = 0.0D0 B( 4 ) = 0.0D0 B( 5 ) = 0.0D0 ELSE IF( MYROW.EQ.0 .AND. MYCOL.EQ.1 ) THEN AA( 1 ) = A AA( 2 ) = A AA( 3 ) = -A AA( 4 ) = -A AA( 5 ) = -A AA( 1+MXLLDA ) = L AA( 2+MXLLDA ) = L AA( 3+MXLLDA ) = -L AA( 4+MXLLDA ) = -L AA( 5+MXLLDA ) = -L AA( 1+2*MXLLDA ) = K AA( 2+2*MXLLDA ) = K AA( 3+2*MXLLDA ) = K AA( 4+2*MXLLDA ) = K AA( 5+2*MXLLDA ) = K ELSE IF( MYROW.EQ.0 .AND. MYCOL.EQ.2 ) THEN AA( 1 ) = A AA( 2 ) = A AA( 3 ) = A AA( 4 ) = -A AA( 5 ) = -A AA( 1+MXLLDA ) = P AA( 2+MXLLDA ) = P AA( 3+MXLLDA ) = P AA( 4+MXLLDA ) = P AA( 5+MXLLDA ) = -P ELSE IF( MYROW.EQ.1 .AND. MYCOL.EQ.0 ) THEN AA( 1 ) = -S AA( 2 ) = -S AA( 3 ) = -S AA( 4 ) = -S AA( 1+MXLLDA ) = -C AA( 2+MXLLDA ) = -C AA( 3+MXLLDA ) = -C AA( 4+MXLLDA ) = C AA( 1+2*MXLLDA ) = A AA( 2+2*MXLLDA ) = A AA( 3+2*MXLLDA ) = A AA( 4+2*MXLLDA ) = -A AA( 1+3*MXLLDA ) = C AA( 2+3*MXLLDA ) = C AA( 3+3*MXLLDA ) = C AA( 4+3*MXLLDA ) = C B( 1 ) = 1.0D0 B( 2 ) = 0.0D0 B( 3 ) = 0.0D0 B( 4 ) = 0.0D0 ELSE IF( MYROW.EQ.1 .AND. MYCOL.EQ.1 ) THEN AA( 1 ) = A AA( 2 ) = -A AA( 3 ) = -A AA( 4 ) = -A AA( 1+MXLLDA ) = L AA( 2+MXLLDA ) = L AA( 3+MXLLDA ) = -L AA( 4+MXLLDA ) = -L AA( 1+2*MXLLDA ) = K AA( 2+2*MXLLDA ) = K AA( 3+2*MXLLDA ) = K AA( 4+2*MXLLDA ) = K ELSE IF( MYROW.EQ.1 .AND. MYCOL.EQ.2 ) THEN AA( 1 ) = A AA( 2 ) = A AA( 3 ) = -A AA( 4 ) = -A AA( 1+MXLLDA ) = P AA( 2+MXLLDA ) = P AA( 3+MXLLDA ) = -P AA( 4+MXLLDA ) = -P END IF RETURN END SUBROUTINE SL_INIT( ICTXT, NPROW, NPCOL ) * * .. Scalar Arguments .. INTEGER ICTXT, NPCOL, NPROW * .. * * Purpose * ======= * * SL_INIT initializes an NPROW x NPCOL process grid using a row-major * ordering of the processes. This routine retrieves a default system * context which will include all available processes. In addition it * spawns the processes if needed. * * Arguments * ========= * * ICTXT (global output) INTEGER * ICTXT specifies the BLACS context handle identifying the * created process grid. The context itself is global. * * NPROW (global input) INTEGER * NPROW specifies the number of process rows in the grid * to be created. * * NPCOL (global input) INTEGER * NPCOL specifies the number of process columns in the grid * to be created. * * ===================================================================== * * .. Local Scalars .. INTEGER IAM, NPROCS * .. * .. External Subroutines .. EXTERNAL BLACS_GET, BLACS_GRIDINIT, BLACS_PINFO, $ BLACS_SETUP * .. * .. Executable Statements .. 
* * Get starting information * CALL BLACS_PINFO( IAM, NPROCS ) * * If machine needs additional set up, do it now * IF( NPROCS.LT.1 ) THEN IF( IAM.EQ.0 ) $ NPROCS = NPROW*NPCOL CALL BLACS_SETUP( IAM, NPROCS ) END IF * * Define process grid * CALL BLACS_GET( -1, 0, ICTXT ) CALL BLACS_GRIDINIT( ICTXT, 'Row-major', NPROW, NPCOL ) * RETURN * * End of SL_INIT * END *************************************************** From philippe.blaise at cea.fr Fri Oct 22 00:41:58 2004 From: philippe.blaise at cea.fr (Philippe Blaise) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] bandwidth: who needs it? In-Reply-To: <20041021205952.CF20C6EB10@clam.ahpcrc.org> References: <20041021205952.CF20C6EB10@clam.ahpcrc.org> Message-ID: <4178B9C5.4060004@cea.fr> The naive formula T_size = T_0 + size / max_bandwidth for size1/2 = T_0 * max_bandwith gives T_size1/2 = 2 * T0 which is a characteristic message length : you reach (more or less) half the bandwith, and it takes 2 * latency seconds to send / recv the message. For example, with T_0 = 5 usec and max_bandwith = 400 or 800 MB/s you obtain T_size1/2(5, 400) = 5 * 400 = 2 kB T_size1/2(5, 800) = 5 * 800 = 4 kB Phil. Richard Walsh wrote: >Greg Lindahl wrote: > > > >>>do you have applications that are pushing the limits of MPI bandwidth? >>>for instance, code that actually comes close to using the 8-900 MB/s >>>that current high-end interconnect provides? >>> >>> >>Bandwidth is important not only for huge messages that hit 900 MB/s, >>but also for medium sized messages. A naive formula for how long it >>takes to send a message is: >> >>T_size = T_0 + size / max_bandwidth >> >>For example, for a 4k message with T_0 = 5 usec and either 400 MB/s or >>800 MB/s, >> >>T_4k_400M = 5 + 4k/400M = 5 + 10 = 15 usec >>T_4k_800M = 5 + 4k/800M = 5 + 5 = 10 usec >> >>A big difference. But you're only getting 266 MB/s and 400 MB/s >>bandwidth, respectively. >> >>Of course performance is usually a bit less than this naive model. But >>the effect is real, becoming unimportant for packets smaller than ~ 2k >>in this example. The size at which this effect becomes unimportant >>depends on T_0 and the bandwidth. >> >> > >The above also makes a point about a mid-range regime of message sizes >whose transfer times are affected ~equally by bandwidth and latency >changes. Halving the latency in the 4K/800M case above is equivalent >to doubling the bandwidth for a message of this size: > > T_4k_800M. = 2.5 + 4k/800M = 2.5 + 5.0 = 7.5 usec > T_4k_800M = 5.0 + 4k/800M = 5.0 + 5.0 = 10.0 usec > T_4k_1600M = 5.0 + 4k/1600M = 5.0 + 2.5 = 7.5 usec > >For a given interconnect with a known latency and bandwidth there is >a "characteristic" message size whose transfer time is equally sensitive >to perturbations in bandwidth and latency (latency and bandwidth piece >of the transfer time are equal). So, for an "Elan-4-like" interconnect >characteristic message length would be 1.6k: > > T_4k_800M = 1.0 + 1.6k/800M = 1.0 + 2.0 = 3.0 usec > T_4k_800M = 2.0 + 1.6k/800M = 2.0 + 2.0 = 4.0 usec > T_4k_1600M = 2.0 + 1.6k/1600M = 2.0 + 1.0 = 3.0 usec > >Messages sizes in the vicinity of the characteristic length will >respond approximately equally to improvements in either factor. >Messages much larger in size will be more sensitive to bandwidth >improvements in an interconnect upgrade while message sizes much >smaller will be more sensitive to latency improvements in an upgrade. 
> >One might argue that bandwidth actually matters more because message >sizes (along with problem sizes) can in theory grow indefinitely (drop >in some more memory and double you array sizes) while they can be made >only be so small -- this is a position supported by the rate of storage >growth, but undermined by slower bandwidth growth and processor count >increases. > >I think I will keep my bandwidth though ... and take any off of the >hands of those who ... don't need it ... ;-) ... > >rbw > >#--------------------------------------------------- ># Richard Walsh ># Project Manager, Cluster Computing, Computational ># Chemistry and Finance ># netASPx, Inc. ># 1200 Washington Ave. So. ># Minneapolis, MN 55415 ># VOX: 612-337-3467 ># FAX: 612-337-3400 ># EMAIL: rbw@networkcs.com, richard.walsh@netaspx.com ># rbw@ahpcrc.org ># >#--------------------------------------------------- ># "What you can do, or dream you can, begin it; ># Boldness has genius, power, and magic in it." ># -Goethe >#--------------------------------------------------- ># "Without mystery, there can be no authority." ># -Charles DeGaulle >#--------------------------------------------------- ># "Why waste time learning when ignornace is ># instantaneous?" -Thomas Hobbes >#--------------------------------------------------- > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > From Thomas.Alrutz at dlr.de Fri Oct 22 05:05:47 2004 From: Thomas.Alrutz at dlr.de (Thomas Alrutz) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: <200410210848.i9L8mXfm005676@dali.crs4.it> References: <200410210848.i9L8mXfm005676@dali.crs4.it> Message-ID: <4178F79B.5050903@dlr.de> Alan Scheinine schrieb: > Many venders sell U1 cases for dual Opteron based on the Tyan > main boards, but on the other hand, a vender here says that the > product Newisys 2100 is much more reliable than Tyan though it > costs 10 to 20 percent more. I have not previously heard of > Newisys and I do not recall it being mentioned in this mailing > list. Would anyone like to comment? > best regards, > Alan Scheinine Email: scheinin@crs4.it > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf Hi Alan, we have some Opteron cluster nodes and some Opteron workstations based on the Arima HDAMA MB (Rioworks http://www.rioworks.com/HDAMA.htm). We are quite happy with those systems, because there are running without any failure since installation (one year ago). And as far as I know, there are some 1U barebones with this type of board, too. If you looking for a remote managment system on your motherboards, there is the possibility to attach an IPMI-Board called ARMC to the HDAMA. This would give you the same mamagement features like the Newisys boards, but for additional costs (~300 US$ ??) and size. Thomas -- __/|__ | Dipl.-Math. Thomas Alrutz /_/_/_/ | DLR Institute of Aerodynamics and Flow Technology |/ | Numerical Methods Department DLR | Bunsenstr. 10 | D-37073 Goettingen/Germany From thomas.clausen at aoes.com Fri Oct 22 02:58:26 2004 From: thomas.clausen at aoes.com (Thomas Clausen) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] parallel sparse linear solver choice? 
In-Reply-To: <1377588425.1098389985998.JavaMail.osg@osgjas02.cns.ufl.edu> References: <1377588425.1098389985998.JavaMail.osg@osgjas02.cns.ufl.edu> Message-ID: <20041022095824.GE19008@aoes.com> Hi Paul, You might want to have a look at http://www-unix.mcs.anl.gov/petsc/petsc-2/index.html Thomas On Thu, Oct 21, 2004 at 04:19:45PM -0400, JOHNSON,PAUL C wrote: > All: > > I was wondering whats your choice for a parallel sparse linear > solver? I have a beowulf cluster(~4 nodes, ok really small I > know) connected at 100Mbps. The computers are P4 2.2GHz with 1Gb > ram. The matrices are formed by a finite element program. They > are sparse, square, symmetric, and I would like to solve problems > with more than 200000 columns. Which of the solvers is easiest to > set up and utilize? One problem I am trying to solve is 156,240 > x 156,240 with 6,023,241 non-zero entries. > > Thanks for any help, > Paul > > -- > JOHNSON,PAUL C > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Thomas Clausen, PhD. AOES Group BV, http://www.aoes.com Phone +31(0)71 5795563 Fax +31(0)71572 1277 From hahn at physics.mcmaster.ca Fri Oct 22 09:41:12 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: <20041022021054.GB32640@cse.ucdavis.edu> Message-ID: > Oh, speaking of which the main advantage I've seen in the newisys is > the remote managability. You can ssh to the management interface, check > temperatures, turn the machine on/off, and other related functionality. this is not newisis-specific, of course! we've got a cluster of HP DL145's (which look a LOT like Celestica a2210's). they have a nice lan-enabled IPMI card, which you can telnet to or use the reasonably secure IPMItools. HP certainly has ssh on other of their products (switches, for instance), so I'd expect them to add ssh support everywhere. > Alas, as far as I can tell the passthru for the management interface > (it has 2 ethernet ports for the management) isn't usable in any sane way. ipmitools seem to work nicely. I've got an perl/expect script for dealing with the telnet interface, if anyone wants it. > The idea of not requiring a masterswitch+ for power management, a > cyclades or similar for serial management, or a switch for a seperate decent IPMI support obsoletes all that junk. the DL145 gives you working bios redirection, as well as power control, warm/cold reset, lm_sensors-type data, etc. we're requiring this kind of remote management in all future purposes, and I'm convinced everyone doing clusters should do so as well. regards, mark hahn. From bmayer at cs.umn.edu Fri Oct 22 08:26:46 2004 From: bmayer at cs.umn.edu (Benjamin W. Mayer) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] parallel sparse linear solver choice? In-Reply-To: <20041022095824.GE19008@aoes.com> References: <1377588425.1098389985998.JavaMail.osg@osgjas02.cns.ufl.edu> <20041022095824.GE19008@aoes.com> Message-ID: I am not sure if this is a good pointer or not. Yousef Saad has a lot of work which may be useful for this application. http://www-users.cs.umn.edu/~saad/software/home.html SPARSKIT A basic tool-kit for sparse matrix computations. pARMS , parallel Algebraic recursive Multilevel Solver. There are also links to related packages on the above page. 
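For a feel of what is inside these packages: Paul's matrix is sparse, square and symmetric, and if it is also positive definite it is the natural target for a conjugate gradient iteration over compressed-sparse-row (CSR) storage. The C sketch below is a minimal serial illustration of that kernel only -- it is not code from SPARSKIT, pARMS or PETSc, and the 3x3 test system, tolerance and iteration cap are made up. The real packages add preconditioning, reorderings and the MPI data distribution.

/* cg_csr.c -- serial conjugate gradient on a CSR matrix (illustration only).
 * Build: cc -O2 cg_csr.c -lm -o cg_csr
 */
#include <math.h>
#include <stdio.h>

#define N     3        /* tiny made-up SPD test system */
#define NNZ   7
#define MAXIT 1000
#define TOL   1e-10

/* CSR storage of A = [4 1 0; 1 3 1; 0 1 2], symmetric positive definite */
static const int    row_ptr[N + 1] = { 0, 2, 5, 7 };
static const int    col_idx[NNZ]   = { 0, 1, 0, 1, 2, 1, 2 };
static const double val[NNZ]       = { 4, 1, 1, 3, 1, 1, 2 };

static void spmv(const double *x, double *y)           /* y = A*x */
{
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            s += val[k] * x[col_idx[k]];
        y[i] = s;
    }
}

static double dot(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += a[i] * b[i];
    return s;
}

int main(void)
{
    double b[N] = { 1, 2, 3 };
    double x[N] = { 0 }, r[N], p[N], Ap[N];

    for (int i = 0; i < N; i++) { r[i] = b[i]; p[i] = r[i]; }   /* x0 = 0 */
    double rs_old = dot(r, r);

    int it = 0;
    while (it < MAXIT && sqrt(rs_old) > TOL) {
        spmv(p, Ap);
        double alpha = rs_old / dot(p, Ap);
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rs_new = dot(r, r);
        for (int i = 0; i < N; i++) p[i] = r[i] + (rs_new / rs_old) * p[i];
        rs_old = rs_new;
        it++;
    }

    printf("iterations %d, residual %g, x = %g %g %g\n",
           it, sqrt(rs_old), x[0], x[1], x[2]);
    return 0;
}

The same spmv / dot-product / vector-update structure is what the parallel solvers distribute across nodes, at the cost of one halo exchange and a couple of allreduces per iteration -- which is where a 100 Mbps interconnect will make itself felt.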
On Fri, 22 Oct 2004, Thomas Clausen wrote: > Hi Paul, > > You might want to have a look at > > http://www-unix.mcs.anl.gov/petsc/petsc-2/index.html > > Thomas > > On Thu, Oct 21, 2004 at 04:19:45PM -0400, JOHNSON,PAUL C wrote: > > All: > > > > I was wondering whats your choice for a parallel sparse linear > > solver? I have a beowulf cluster(~4 nodes, ok really small I > > know) connected at 100Mbps. The computers are P4 2.2GHz with 1Gb > > ram. The matrices are formed by a finite element program. They > > are sparse, square, symmetric, and I would like to solve problems > > with more than 200000 columns. Which of the solvers is easiest to > > set up and utilize? One problem I am trying to solve is 156,240 > > x 156,240 with 6,023,241 non-zero entries. > > > > Thanks for any help, > > Paul > > > > -- > > JOHNSON,PAUL C > > > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > > > -- > > Thomas Clausen, PhD. > AOES Group BV, http://www.aoes.com > Phone +31(0)71 5795563 Fax +31(0)71572 1277 > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From mathog at mendel.bio.caltech.edu Fri Oct 22 09:45:04 2004 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Tyan mobo and /proc/mtrr Message-ID: How do mtrr settings affect performance? Anybody know what /proc/mtrr "should" say on various Tyan mobos? Three types of Tyan systems here, this is what /proc/mtrr has for each: S2468UGN 2.6.8.1 (in MDK10.0) reg00: base=0x00000000 ( 0MB), size= 512MB: write-back, count=1 reg01: base=0xf5000000 (3920MB), size= 1MB: write-combining, count=1 S2466N 2.6.8.1 (in MDK10.0) reg00: base=0x00000000 ( 0MB), size=1024MB: write-back, count=1 reg01: base=0xf5000000 (3920MB), size= 1MB: write-combining, count=1 S2466N 2.4.18-10 (in RH7.3) reg00: base=0x00000000 ( 0MB), size=1024MB: write-back, count=1 The first line describes the total memory in the system. The second line, if present, corresponds to a setting from the ATI RAGE XL graphics card. The ones running the older OS don't even do that. Should there be additional mtrr settings, and if so, why? Note, lspci reports this for the RAGE XL graphics (on all): Memory at f5000000 (32-bit, non-prefetchable) [size=16M] I/O ports at 1000 [size=256] Memory at f4000000 (32-bit, non-prefetchable) [size=4K] Unclear to me why only 1M of the reported 16M is mapped (apparently) by the mtrr. Possibly this relates to these messages: /var/log/messages:Oct 20 12:44:59 safserver kernel: mtrr:\ 0xf5000000,0x400000 overlaps existing 0xf5000000,0x100000 On the other hand, the graphics aren't really used on the S2466N compute nodes. Graphics are used more on the S2468UGN server. Beats me where one would change this value though. Thanks, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From brian.dobbins at yale.edu Fri Oct 22 10:51:29 2004 From: brian.dobbins at yale.edu (Brian Dobbins) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] parallel sparse linear solver choice? 
In-Reply-To: <1377588425.1098389985998.JavaMail.osg@osgjas02.cns.ufl.edu> References: <1377588425.1098389985998.JavaMail.osg@osgjas02.cns.ufl.edu> Message-ID: <1098467489.12272.23.camel@rdx.eng.yale.edu> Hi Paul, If you're looking for packaged solvers, then PETSc (already mentioned) and Aztec could be of interest to you. Aztec is a parallel iterative solver for sparse systems developed by Sandia Labs. (Aztec: http://www.cs.sandia.gov/CRF/aztec1.html ) There is also a direct solver for symmetric positive definite matrices called PSPASES (http://www-users.cs.umn.edu/~mjoshi/pspases/), but I haven't ever looked at that myself, so I can't really tell you much about it. Another direct solution package, originally written for the RS6000 platform but now available for linux (I believe!), is the Watson Sparse Matrix Package (WSMP) (http://www-users.cs.umn.edu/~agupta/wsmp.html), by Anshul Gupta. Not only is the package said to be very good, but Gupta has done a LOT of research on sparse matrices, and it couldn't hurt to read some of his publications. On a different note, if you're not looking for packaged solvers, and just want to know about various methods or want to implement your own -often faster, if you have a known structure in your matrix-, you might want to read up on Yousef Saad's work, as mentioned before. Also, a very useful (and thick!) book is Golub and Van Loan's "Matrix Computations". Finally, if you find you're having difficulty with getting decent preconditioners (if necessary), I'd also suggest taking a look at Michele Benzi's work on sparse approximate inverse preconditioners. (http://www.mathcs.emory.edu/~benzi/) (That last link also has links to other people/places doing research that may be of interest to you.) Hope some of that is of interest to you, - Brian From hahn at physics.mcmaster.ca Fri Oct 22 15:04:12 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Tyan mobo and /proc/mtrr In-Reply-To: Message-ID: > How do mtrr settings affect performance? they, along with per-page attributes, determine how the CPU treats the cachability of an address. > Anybody know what /proc/mtrr "should" say on various Tyan > mobos? why do you think there's a problem? > Three types of Tyan systems here, this is what /proc/mtrr has for each: > > S2468UGN 2.6.8.1 (in MDK10.0) > reg00: base=0x00000000 ( 0MB), size= 512MB: write-back, count=1 > reg01: base=0xf5000000 (3920MB), size= 1MB: write-combining, count=1 > S2466N 2.6.8.1 (in MDK10.0) > reg00: base=0x00000000 ( 0MB), size=1024MB: write-back, count=1 > reg01: base=0xf5000000 (3920MB), size= 1MB: write-combining, count=1 > S2466N 2.4.18-10 (in RH7.3) > reg00: base=0x00000000 ( 0MB), size=1024MB: write-back, count=1 > > The first line describes the total memory in the system. > The second line, if present, corresponds to a setting from > the ATI RAGE XL graphics card. The ones running the older OS > don't even do that. mtrr values are supposed to be set by the bios; the OS certainly can change them, but doesn't have to. I forget for sure, but suspect that in the absence of an mtrr, the mem-mapped video area on the third machine would default to write-through or -combining. of course, all this stuff was terribly novel back in the dark ages of RH 7.x. it wouldn't be shocking if RH7.3 got it wrong... > Should there be additional mtrr settings, and if so, why? 
the main thing is simply to have all real ram in write-back; mtrr's for video can make a difference, but often not as much as you might expect, because the CPU does a certain amount of write coalescing before data even gets to the cache. > Note, lspci reports this for the RAGE XL graphics (on all): > > Memory at f5000000 (32-bit, non-prefetchable) [size=16M] > I/O ports at 1000 [size=256] > Memory at f4000000 (32-bit, non-prefetchable) [size=4K] > > Unclear to me why only 1M of the reported 16M is mapped (apparently) > by the mtrr. Possibly this relates to these messages: > > /var/log/messages:Oct 20 12:44:59 safserver kernel: mtrr:\ > 0xf5000000,0x400000 overlaps existing 0xf5000000,0x100000 IIRC, this is perfectly legal - in fact, the correct way to define a 3MB region is to define a 4MB region and an overlapping 1MB region. > On the other hand, the graphics aren't really used > on the S2466N compute nodes. so ignore them. they can't possibly matter... > Graphics are used more on the > S2468UGN server. Beats me where one would change this value > though. there's an mtrr doc on kernel/Documentation which is all you need, *if* you actually need to change anything. it seems like X started fixing mtrr's several years ago using this interface (/proc/mtrr). I would guess that mtrr's on the framebuffer were less relevant for a number of years (using a video card's hw acceleration obviates the need for the host to directly manipulate the pixels). this may be less true now with some of the newfangled client-side rendering, etc. From mkamranmustafa at gmail.com Fri Oct 22 22:53:11 2004 From: mkamranmustafa at gmail.com (Kamran Mustafa) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Need Help...! Message-ID: Hi, I am working as an IT Manager at NED University of Engineering & Technology, Karachi, Pakistan, and currently managing a Linux based Cluster of 50 nodes. I just wanted to ask you that how to manage licensing issues on a beowulf cluster. Lets say, if you want to run an application software on 50 nodes then will you purchase 50 licenses of that software or if there is any other alternative to handle this licensing issue, because purchasing such a huge number of licences will definitely be very expensive. Actually, I also want to purchase different software for my 50 noded cluster but purchasing 50 licences of each software costs me alot, thats why I am in need of your guidance and kind suggestions. Regards, Muhammad Kamran Mustafa I.T. Manager Centre for Simulation & Modeling, NED University of Engineering & Technology, Karachi, Pakistan. Tel: (9221) 9243261-8 ext 2372 Fax: (9221) 9243248 From rgb at phy.duke.edu Sat Oct 23 09:14:22 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Need Help...! In-Reply-To: References: Message-ID: On Sat, 23 Oct 2004, Kamran Mustafa wrote: > Hi, > > I am working as an IT Manager at NED University of Engineering & > Technology, Karachi, Pakistan, and currently managing a Linux based > Cluster of 50 nodes. I just wanted to ask you that how to manage > licensing issues on a beowulf cluster. Lets say, if you want to run an > application software on 50 nodes then will you purchase 50 licenses of > that software or if there is any other alternative to handle this > licensing issue, because purchasing such a huge number of licences > will definitely be very expensive. 
Actually, I also want to purchase > different software for my 50 noded cluster but purchasing 50 licences > of each software costs me alot, thats why I am in need of your > guidance and kind suggestions. > > Regards, > > Muhammad Kamran Mustafa Dear Kamran, Please give us a bit more detail. In particular, what software are we talking about? Different packages have very different licensing schmea, and one usually has to go with what a package supports. For example, matlab is in use on some clusters on campus here. matlab uses a license manager that can regulate the number of instances of matlab in use on a cluster. Quite a few packages, actually, use a license manager that can regulate the number of packages one has to buy relative to the number of platforms one wishes to run them on, but of course this is a case by case thing. Compilers have a slightly different issue. There there may be floating license managers, but because compiler usage is sporadic many sites just buy a single license and put in on a specific node, e.g. the head node or the server node (which has direct access to the disk and thus avoids a networking hit). The issue there is libraries -- many compilers come with special libraries that are part of how they get good performance. In some cases the libraries can be used on many systems as long as you buy the compiler/library package for one. I don't know the exact state of things now but at one point in time at least you had to by library licenses for every node for at least some compilers out there in order to run the binaries generated by a compiler-licensed node. Finally there is the OS itself -- commercial linux distributions. There the licensing arrangements are whatever you dicker out of the company. Unfortunately, most of the companies about clusters and what consitutes "reasonable" cost scaling in a cluster where 50-500 systems are literally clones of a basic node configuration, and will cheerily charge hundreds of dollars per node as if those nodes generate some sort of incremental cost for "support". I think it is safe to say that "most" cluster sites avoid this cost by using e.g. Centos (logo-free GPL-based rebuild of RHEL), Fedora Core, Debian, Caosity -- one of the still-free linux distributions. As a FC user, I can attest to the fact that it is entirely possible to assemble a stable and highly functional cluster node (or desktop workstation) on top of FC. Admins tend to lean a bit more towards Centos for high availability/mission critical servers in the expectation of a bit more immediate support, but in the case of a cluster server I'd fully expect FC to be adequately stable and provide good performance. So if your issue is OS license management, I'd suggest going toward one of the fully open/free linuces -- those will certainly minimize your per-box outlay, and from what I can tell there is basically no difference whatsoever in ease of installation or maintenance. You can even get your cluster installation prepackaged for you (for free) from e.g. ROCKS or wulfware, which seem to be stabilizing and have active participants that are keeping them nicely current. Hope this helps. If you want better help, please include detail -- the specific packages you're concerned about, the particular setup of your cluster, and what sort of licensing scheme the packages are supposed to use (the vendors should be able to help you out here). rgb > I.T. Manager > Centre for Simulation & Modeling, > NED University of Engineering & Technology, > Karachi, Pakistan. 
> Tel: (9221) 9243261-8 ext 2372 > Fax: (9221) 9243248 > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From atp at piskorski.com Sat Oct 23 13:03:56 2004 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Re: Beowulf of bare motherboards In-Reply-To: <000c01c4a564$c967c430$33a8a8c0@LAPTOP152422> References: <20040927182134.GA23662@piskorski.com> <4158D794.9090704@verizon.net> <20040928040034.GA93760@piskorski.com> <000c01c4a564$c967c430$33a8a8c0@LAPTOP152422> Message-ID: <20041023200355.GA44250@piskorski.com> I recently experimented with running multiple motherboards off a single power supply. This is pretty easy now, because you can buy Y power cables now - no soldering necessary: "ZIPPY Power Cable Splitter: ATX 20 pin to Two ATX 20 pin for ATX Power Supplies", $11 for one; 8 of these cost me $8.49 each shipped: http://www.micer.com/viewItem.asp?idProduct=453056250 I wanted to find out how many nodes I could power from a single supply, and I happened to have 4 different power supplies on hand for testing. In all cases, I plugged the supply into my Kill-a-Watt, attached 3 of the above y-cables to the supply, and simply varied the number of motherboards plugged into those 4 connectors. The 4 nodes in question are all Ebay specials, configured like so: - Motherboard: ECS P4VXMS http://www.ecsusa.com/products/p4vxms.html - 1 Pentium 4 CPU, socket 423, 256 KB cache, 400 FSB; speed GHz: 1.3, 1.4, 1.5, 1.7 - 1 stick RAM, 512 MB PC133 CL3 - 1 AGP graphics card installed, various models. - 1 Panaflo fan (80 mm, 12 V, 0.1 A) blowing on the CPU heat exchanger. - THAT'S IT. (No hard drives, etc.) Below, the reported Watts is simply the approximate maximum W value I saw on the Kill-a-Watt as the nodes booted. The Powe Factor is the lowest and/or most typical PF reported by the Kill-a-Watt: ThermalTake Purepower HPC-420-302 DF, Active PFC, 420 W http://www.newegg.com/app/ViewProductDesc.asp?description=17-153-005 http://www.newegg.com/app/viewProductDesc.asp?description=17-153-005R $53 +$7 from newegg.com 2 nodes, 175 W, PF 0.98, $34.25 per node 3 nodes, would not boot, [$25.67 per node] 4 nodes, would not boot, [$21.38 per node] MGE SuperCharger, 600W http://www.newegg.com/app/viewProductDesc.asp?description=17-167-010 $48 +$7 from newegg.com 2 nodes, 175 W, PF 0.66, $31.75 per node 3 nodes, 255 W, PF 0.67, $24.00 per node 4 nodes, would not boot, [$20.13 per node] Enermax EG301P-VB, 300 W http://www.newegg.com/app/viewProductDesc.asp?description=17-103-423 $31.50 +$7 from newegg.com 2 nodes, 155 W, PF 0.67, $23.50 per node 3 nodes, 226 W, PF 0.68, $18.50 per node 4 nodes, would not boot, [$16.00 per node] Sparkle FSP250-61GT, 250 W Ancient, used to power my old AMD K6-II 380 MHz dektop. 2 nodes, 170 W, PF 0.64 3 nodes, 241 W, PF 0.64 4 nodes, 331 W, PF 0.65 Note that I didn't actually RUN anything on the nodes at all, I just plugged in a monitor and verified that they got through the POST ok and attempted to boot. (They attempt to PXE boot, but I don't yet have anything set up for them to PXE boot FROM.) 
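One more way to read the PF column: real watts divided by power factor gives the apparent power (VA) that the wall circuit or a UPS actually has to source, which is where a supply without PFC hurts. The short C sketch below just does that division for the measurements above; the 120 V line voltage is an assumption.

/* va_load.c -- convert Kill-a-Watt readings (real watts and power factor)
 * into apparent power and line current.  The W/PF pairs are the
 * measurements reported above; the 120 V mains figure is an assumption.
 */
#include <stdio.h>

int main(void)
{
    const struct { const char *psu; double watts, pf; } m[] = {
        { "ThermalTake 420W, 2 nodes", 175, 0.98 },
        { "MGE 600W, 3 nodes",         255, 0.67 },
        { "Enermax 300W, 3 nodes",     226, 0.68 },
        { "Sparkle 250W, 4 nodes",     331, 0.65 },
    };
    const double mains_v = 120.0;   /* assumed line voltage */

    for (int i = 0; i < 4; i++) {
        double va = m[i].watts / m[i].pf;          /* apparent power */
        printf("%-28s %3.0f W  PF %.2f  -> %5.1f VA, %4.1f A at %.0f V\n",
               m[i].psu, m[i].watts, m[i].pf, va, va / mains_v, mains_v);
    }
    return 0;
}

By that arithmetic the non-PFC supplies pull roughly half again as many volt-amps as watts, which matters more for UPS and branch-circuit sizing than for the power bill.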
Newegg used to advertise the MGE 600 W supply above as having active PFC, (which is why I bought it), but nothing on the supply itself says anything about PFC, and the Kill-a-Watt results definitely show that it doesn't have PFC. I find it interesting that the smallest, oldest, and probably cheapest supply is the only one that successfully booted all 4 nodes at once. Perhaps it is running out of spec, and simply lacks the circuitry to shut down in such cases? These motherboards each beep once when they boot, and the beeps seemed to all come very close together with some supplies, and further apart with others. I didn't pay attention to which supplies did this, but this is probably why the Kill-a-Watt seemed to show lower peak Watts for the Enermax supply? Unfortunately I didn't have any el-cheap $12 (plus shipping) supplies to test. Particularly since these nodes are diskless, those might actually work just fine. Newegg is also now selling the slightly larger 480 W ThermalTake active PFC supply for about the same price as the 420 W supply above, which would be worth trying if you really want PFC. -- Andrew Piskorski http://www.piskorski.com/ From hahn at physics.mcmaster.ca Sat Oct 23 13:49:36 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Re: Beowulf of bare motherboards In-Reply-To: <20041023200355.GA44250@piskorski.com> Message-ID: > testing. In all cases, I plugged the supply into my Kill-a-Watt, > attached 3 of the above y-cables to the supply, and simply varied the > number of motherboards plugged into those 4 connectors. very interesting, but somewhat hard to interpret. I love my killawatt, too, but unfortunately for this experiment, you need to know how much current the nodes were drawing on each voltage (and the PS specs.) for instance, the TT PS might actually provide 420, but only enough on 3.3 to support a single CPU's VRM, but *lots* of 12V umph. that wouldn't be unreasonable, given the market for "normal" uses - big PS's support machines with lots of disks, or possibly systems that use extra 12V for hot AGP cards, etc. on the other hand, Sparkle is the only one of these PS vendors that I see in OEM settings - TT/MGE/Enermax all seem to be mostly after-market vendors. interpret that how you will ;) regards, mark hahn. From Glen.Gardner at verizon.net Sat Oct 23 09:20:36 2004 From: Glen.Gardner at verizon.net (Glen Gardner) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Need Help...! References: Message-ID: <417A84D4.8030303@verizon.net> Muhamed; In general, one uses a computer as a license server for all the other machines. This might be on the development node of the cluster, or on a dedicated server. This way you can have all of your licenses served from one machine. As far as commecrcial software being expensive is concerned.... The basic idea behind Beowulf is to use readily available freeware to cut the operating costs. For most Beowulf projects, using payware is a great luxury and most tools are developed from existing freeware resources. However, using commecrially available compilers on an otherwise freeware Beowulf is becoming commonplace (but probably not as big an advantage as some would have you think). 
If you already have the personnel resources it might prove cheaper to develop much of your own software from existing freeware tools than to buy payware all along the way, and augment that with a few carefully chosen commercial compilers and maybe one or two carefully chosen commercial applications programming libraries. This approach is economical and effective, but will require some time from system admins and programmers to make it all go. If you need to have something that is turnkey, prepackaged and ready to go all the way, then perhaps Beowulf is the wrong concept for your needs and you should consider a commercially built cluster or supercomputer with prepackaged software so your people can just login and go straight to work, with little or no development time. Glen Gardner Kamran Mustafa wrote: >Hi, > >I am working as an IT Manager at NED University of Engineering & >Technology, Karachi, Pakistan, and currently managing a Linux based >Cluster of 50 nodes. I just wanted to ask you that how to manage >licensing issues on a beowulf cluster. Lets say, if you want to run an >application software on 50 nodes then will you purchase 50 licenses of >that software or if there is any other alternative to handle this >licensing issue, because purchasing such a huge number of licences >will definitely be very expensive. Actually, I also want to purchase >different software for my 50 noded cluster but purchasing 50 licences >of each software costs me alot, thats why I am in need of your >guidance and kind suggestions. > >Regards, > >Muhammad Kamran Mustafa >I.T. Manager >Centre for Simulation & Modeling, >NED University of Engineering & Technology, >Karachi, Pakistan. >Tel: (9221) 9243261-8 ext 2372 >Fax: (9221) 9243248 >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > -- Glen E. Gardner, Jr. AA8C AMSAT MEMBER 10593 http://members.bellatlantic.net/~vze24qhw/index.html From michael at halligan.org Sat Oct 23 10:45:08 2004 From: michael at halligan.org (Michael T. Halligan) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Need Help...! In-Reply-To: References: Message-ID: <417A98A4.5030602@halligan.org> Kamran, My first approach would be to call up the application manufacturer directly and ask them how their licensing works. Explain your situation, and get their recommendation. After you find out the deal on that piece of software, start talking about bulk pricing. A lot of companies, especially ones with specialized software, are willing to give good bulk discounts to universities. If that fails, find a VAR (Value Added Reseller). Typically a good VAR can get you discounts on anything they sell of 25-45% off of the manufacturer's price. You'll get even better discounts if you can purchase say the 50 pieces of software, as well as some servers, or some other software, or some support licenses through the var. I always try to make my purchases large, and combined. If I know I'm going to need to spend $250k in a quarter on software & hardware, but I won't need a portion of it until the end of the quarter, I'll let my vendor know what I plan on ordering, and how much I want to order initially, and they will almost always offer me extra discounts to do the entire purchase at once. >Hi, > >I am working as an IT Manager at NED University of Engineering & >Technology, Karachi, Pakistan, and currently managing a Linux based >Cluster of 50 nodes. 
I just wanted to ask you that how to manage >licensing issues on a beowulf cluster. Lets say, if you want to run an >application software on 50 nodes then will you purchase 50 licenses of >that software or if there is any other alternative to handle this >licensing issue, because purchasing such a huge number of licences >will definitely be very expensive. Actually, I also want to purchase >different software for my 50 noded cluster but purchasing 50 licences >of each software costs me alot, thats why I am in need of your >guidance and kind suggestions. > >Regards, > >Muhammad Kamran Mustafa >I.T. Manager >Centre for Simulation & Modeling, >NED University of Engineering & Technology, >Karachi, Pakistan. >Tel: (9221) 9243261-8 ext 2372 >Fax: (9221) 9243248 >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > -- ------------------- BitPusher, LLC http://www.bitpusher.com/ 1.888.9PUSHER (415) 724.7998 - Mobile From reuti at staff.uni-marburg.de Sat Oct 23 13:41:52 2004 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Need Help...! In-Reply-To: References: Message-ID: <1098564112.417ac2102bb5b@home.staff.uni-marburg.de> Hi, > Please give us a bit more detail. In particular, what software are we > talking about? Different packages have very different licensing schmea, > and one usually has to go with what a package supports. For example, > matlab is in use on some clusters on campus here. matlab uses a > license manager that can regulate the number of instances of matlab in > use on a cluster. Quite a few packages, actually, use a license manager > that can regulate the number of packages one has to buy relative to the > number of platforms one wishes to run them on, but of course this is a > case by case thing. also when there is no license manager included, you have to stay in the range of the bought licenses with some counter in the queuing system you are using (with some of them e.g. SGE you can also control the interactive usage). Some software companies also have different license conditions for commercial usage (pay per machine or sometimes pay per CPU in the machine) or academical usage (pay per platform). Depending on the price, it may be cheaper to buy a site license in some cases (although you will use it in your cluster only). As pointed out, this you have to check for each software you intend to use in detail. > Compilers have a slightly different issue. There there may be floating > license managers, but because compiler usage is sporadic many sites just > buy a single license and put in on a specific node, e.g. the head node > or the server node (which has direct access to the disk and thus avoids Agreed. > a networking hit). The issue there is libraries -- many compilers come > with special libraries that are part of how they get good performance. > In some cases the libraries can be used on many systems as long as you > buy the compiler/library package for one. I don't know the exact state > of things now but at one point in time at least you had to by library > licenses for every node for at least some compilers out there in order > to run the binaries generated by a compiler-licensed node. E.g. the Portland license allows you also to sell the compiled program and distribute some .so files without any extra fee. For the Intel ones, you may in addition distribute the .a files. 
In each case there is a detailed list, what library files are valid for it. So it should be save to use them (the libraries) on all nodes in a cluster also. > Unfortunately, most of the companies about clusters and what consitutes > "reasonable" cost scaling in a cluster where 50-500 systems are > literally clones of a basic node configuration, and will cheerily charge > hundreds of dollars per node as if those nodes generate some sort of > incremental cost for "support". I think it is safe to say that "most" > cluster sites avoid this cost by using e.g. Centos (logo-free GPL-based > rebuild of RHEL), Fedora Core, Debian, Caosity -- one of the still-free What about SuSE? You can download some floppies from their server and install it over net. And if you want: you can buy support. Cheers - Reuti From taylor65 at cox.net Sat Oct 23 10:52:15 2004 From: taylor65 at cox.net (Ryan Taylor) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Question about v9fs_wire Message-ID: <000e01c4b929$088a8530$1702a8c0@secondneverman> Getting a kmod: failed to exec /sbin/mdprobe ..... v9fs_wire errorno=2, "Unable to handle kernel NULL pointer..." Couldn't find much info on v9fs_wire, can someone help. Using RedHat Linux 9 with ClusterMatic 4. Booting the node (just one test node for now) straight from the CD. Anyone have any suggestions for me. Thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041023/9357eb6c/attachment.html From tmattox at gmail.com Sat Oct 23 09:26:19 2004 From: tmattox at gmail.com (Tim Mattox) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Need Help...! In-Reply-To: References: Message-ID: Hello Kamran Mustafa, Our research group tends to avoid software with "per node" licensing fees. Depending on the kind of software, you should have free/open source alternatives that require no licensing fees at all. If you could list the specific software you are worried about, maybe we (the beowulf list) can suggest free (or single cost) alternatives. For operating systems, I would suggest you look at http://caosity.org/ which I am involved with. Similarly, for a cluster management software, check out http://warewulf-cluster.org/ which I am also involved with. There are plenty more alternatives to those two, but they are a good start looking for the base stuff. As for application level software and/or compilers, I'll leave that to other bewoulfers to comment on. Our group tends to work with application developers, so they have their own codes they compile. On Sat, 23 Oct 2004 10:53:11 +0500, Kamran Mustafa wrote: > Hi, > > I am working as an IT Manager at NED University of Engineering & > Technology, Karachi, Pakistan, and currently managing a Linux based > Cluster of 50 nodes. I just wanted to ask you that how to manage > licensing issues on a beowulf cluster. Lets say, if you want to run an > application software on 50 nodes then will you purchase 50 licenses of > that software or if there is any other alternative to handle this > licensing issue, because purchasing such a huge number of licences > will definitely be very expensive. Actually, I also want to purchase > different software for my 50 noded cluster but purchasing 50 licences > of each software costs me alot, thats why I am in need of your > guidance and kind suggestions. > > Regards, > > Muhammad Kamran Mustafa > I.T. Manager > Centre for Simulation & Modeling, > NED University of Engineering & Technology, > Karachi, Pakistan. 
> Tel: (9221) 9243261-8 ext 2372 > Fax: (9221) 9243248 -- Tim Mattox - tmattox@gmail.com - http://homepage.mac.com/tmattox/ From atp at piskorski.com Sat Oct 23 20:52:24 2004 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Re: Beowulf of bare motherboards In-Reply-To: <20041023200355.GA44250@piskorski.com> References: <20040927182134.GA23662@piskorski.com> <4158D794.9090704@verizon.net> <20040928040034.GA93760@piskorski.com> <000c01c4a564$c967c430$33a8a8c0@LAPTOP152422> <20041023200355.GA44250@piskorski.com> Message-ID: <20041024035224.GA44285@piskorski.com> At Mark Hahn's suggestion, I checked the rated amperage on the +3.3 volt line is for each supply. That didn't seem to correlate with anything though, so I've recorded all the amps here: Rated Amps for each line: Volts: +3.3, +5, +12, -5, -12, +5 Sb Nodes, PSU ---- -- --- --- --- ----- 2, TTake: 30, 40, 18, 0.3, 0.8, 2.0 3, MGE: 20, 45, 24, 0.6, 0.6, 2.0 3, Enermax: 28, 30, 22, 1.0, 1 , 2.2 4, Sparkle: 14, 25, 8, 0.8, 0.8, 0.8 The ratings on the -5 V line seem to line up pretty closely with my "how many motherboards can this supply power up" metric, but, the 20 pin ATX power connector on the motherboard doesn't even HAVE a -5 V line, right? Any thoughts on what the driving factor here could be? ThermalTake Purepower HPC-420-302 DF, Active PFC, 420 W +3.3 V: 30 A http://www.newegg.com/app/ViewProductDesc.asp?description=17-153-005 http://www.newegg.com/app/viewProductDesc.asp?description=17-153-005R $53 +$7 from newegg.com 2 nodes, 175 W, PF 0.98, $34.25 per node 3 nodes, would not boot, [$25.67 per node] 4 nodes, would not boot, [$21.38 per node] MGE SuperCharger, 600W +3.3 V: 20 A http://www.newegg.com/app/viewProductDesc.asp?description=17-167-010 $48 +$7 from newegg.com 2 nodes, 175 W, PF 0.66, $31.75 per node 3 nodes, 255 W, PF 0.67, $24.00 per node 4 nodes, would not boot, [$20.13 per node] Enermax EG301P-VB, 300 W +3.3 V: 28 A http://www.newegg.com/app/viewProductDesc.asp?description=17-103-423 $31.50 +$7 from newegg.com 2 nodes, 155 W, PF 0.67, $23.50 per node 3 nodes, 226 W, PF 0.68, $18.50 per node 4 nodes, would not boot, [$16.00 per node] Sparkle FSP250-61GT, 250 W +3.3 V: 14 A Ancient, used to power my old AMD K6-II 380 MHz dektop. 2 nodes, 170 W, PF 0.64 3 nodes, 241 W, PF 0.64 4 nodes, 331 W, PF 0.65 -- Andrew Piskorski http://www.piskorski.com/ From Glen.Gardner at verizon.net Sat Oct 23 22:37:22 2004 From: Glen.Gardner at verizon.net (Glen Gardner) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Re: Beowulf of bare motherboards References: <20040927182134.GA23662@piskorski.com> <4158D794.9090704@verizon.net> <20040928040034.GA93760@piskorski.com> <000c01c4a564$c967c430$33a8a8c0@LAPTOP152422> <20041023200355.GA44250@piskorski.com> <20041024035224.GA44285@piskorski.com> Message-ID: <417B3F92.5060104@verizon.net> You might try turning on one node at a time if you can. You ought to try to run two nodes at full throttle on the same psu. I suspect you will run into more problems. Install an OS on each computer and run something like a heapsort benchmark or linpack on both at the same time and see if you get a crash. Andrew Piskorski wrote: >At Mark Hahn's suggestion, I checked the rated amperage on the +3.3 >volt line is for each supply. 
That didn't seem to correlate with >anything though, so I've recorded all the amps here: > > Rated Amps for each line: > Volts: +3.3, +5, +12, -5, -12, +5 Sb >Nodes, PSU ---- -- --- --- --- ----- >2, TTake: 30, 40, 18, 0.3, 0.8, 2.0 >3, MGE: 20, 45, 24, 0.6, 0.6, 2.0 >3, Enermax: 28, 30, 22, 1.0, 1 , 2.2 >4, Sparkle: 14, 25, 8, 0.8, 0.8, 0.8 > >The ratings on the -5 V line seem to line up pretty closely with my >"how many motherboards can this supply power up" metric, but, the 20 >pin ATX power connector on the motherboard doesn't even HAVE a -5 V >line, right? Any thoughts on what the driving factor here could be? > >ThermalTake Purepower HPC-420-302 DF, Active PFC, 420 W >+3.3 V: 30 A > http://www.newegg.com/app/ViewProductDesc.asp?description=17-153-005 > http://www.newegg.com/app/viewProductDesc.asp?description=17-153-005R > $53 +$7 from newegg.com >2 nodes, 175 W, PF 0.98, $34.25 per node >3 nodes, would not boot, [$25.67 per node] >4 nodes, would not boot, [$21.38 per node] > >MGE SuperCharger, 600W >+3.3 V: 20 A > http://www.newegg.com/app/viewProductDesc.asp?description=17-167-010 > $48 +$7 from newegg.com >2 nodes, 175 W, PF 0.66, $31.75 per node >3 nodes, 255 W, PF 0.67, $24.00 per node >4 nodes, would not boot, [$20.13 per node] > >Enermax EG301P-VB, 300 W >+3.3 V: 28 A > http://www.newegg.com/app/viewProductDesc.asp?description=17-103-423 > $31.50 +$7 from newegg.com >2 nodes, 155 W, PF 0.67, $23.50 per node >3 nodes, 226 W, PF 0.68, $18.50 per node >4 nodes, would not boot, [$16.00 per node] > >Sparkle FSP250-61GT, 250 W >+3.3 V: 14 A > Ancient, used to power my old AMD K6-II 380 MHz dektop. >2 nodes, 170 W, PF 0.64 >3 nodes, 241 W, PF 0.64 >4 nodes, 331 W, PF 0.65 > > > -- Glen E. Gardner, Jr. AA8C AMSAT MEMBER 10593 Glen.Gardner@verizon.net http://members.bellatlantic.net/~vze24qhw/index.html From hanzl at noel.feld.cvut.cz Sun Oct 24 01:35:49 2004 From: hanzl at noel.feld.cvut.cz (hanzl@noel.feld.cvut.cz) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Question about v9fs_wire In-Reply-To: <000e01c4b929$088a8530$1702a8c0@secondneverman> References: <000e01c4b929$088a8530$1702a8c0@secondneverman> Message-ID: <20041024103549H.hanzl@unknown-domain> > Getting a kmod: failed to exec /sbin/mdprobe ..... v9fs_wire > errorno=2, "Unable to handle kernel NULL pointer..." > Couldn't find much info on v9fs_wire, can someone help. > Using RedHat Linux 9 with ClusterMatic 4. Booting the node (just one > test node for now) straight from the CD. I guess you could just disable the corresponding kernel module in the config file. My knowledge is not quite up-to-date but I know they (Clustermatic team, Ron Minnich in particular) did interesting experiments with Plan-9-like filesystem which could export filesystems and create private namespaces on per-user basis. Then they did not pay too much attention to this for some time; I do not know the current status. You might get more details on the bproc list: http://lists.sourceforge.net/lists/listinfo/bproc-users HTH Vaclav Hanzl From cflau at clc.cuhk.edu.hk Mon Oct 25 03:58:00 2004 From: cflau at clc.cuhk.edu.hk (John Lau) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Can we set MPICH to use ssh instead of rsh at runtime? Message-ID: <1098701880.2791.287.camel@nuts.clc.cuhk.edu.hk> Hi, Can we set MPICH to use ssh instead of rsh at runtime? I know it can be set in compile time by configure opinion. And LAM can do that by setting $LAMRSH environment variable. 
Best regards, John Lau From edemir_at_andrew at yahoo.com Sun Oct 24 21:31:54 2004 From: edemir_at_andrew at yahoo.com (Ergin Demir) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Re: Beowulf of bare motherboards In-Reply-To: <20041023200355.GA44250@piskorski.com> Message-ID: <20041025043154.58715.qmail@web53705.mail.yahoo.com> How do you boot or shut down individual mobos? I think in this configuration all mobos will boot up or shut down simultaneously. Andrew Piskorski wrote: I recently experimented with running multiple motherboards off a single power supply. This is pretty easy now, because you can buy Y power cables now - no soldering necessary: __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20041024/bf7d81e6/attachment.html From mkamranmustafa at gmail.com Sun Oct 24 23:04:52 2004 From: mkamranmustafa at gmail.com (Kamran Mustafa) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Need Help...! In-Reply-To: <1098564112.417ac2102bb5b@home.staff.uni-marburg.de> References: <1098564112.417ac2102bb5b@home.staff.uni-marburg.de> Message-ID: Hi, Thanks alot for the prompt reply. Right at the moment I am asked to purchased the following for my cluster: 1) MPI/Pro by verari systems 2) PGI CDK Cluster Development Kit by Portland Group Purchasing 100 processes of MPI/Pro is really very expensive for me. Similarly, for my 100 processors I have to purchase 256 licences of PGI CDK because they offer licences in groups of 16/64/256 CPUs. Even if I purchase 256 licences for just 2 simultaneous counts, it costs me a lot... Kindly help me in this issue as soon as possible. I will be thankful to you. Regards, Muhammad Kamran Mustafa I.T. Manager Centre for Simulation & Modeling, NED University of Engineering & Technology, Karachi, Pakistan. Tel: (9221) 9243261-8 ext 2372 Fax: (9221) 9243248 ------------------------------------------------------------------------------------------------------------------------ On Sat, 23 Oct 2004 22:41:52 +0200, Reuti wrote: > Hi, > > > > > Please give us a bit more detail. In particular, what software are we > > talking about? Different packages have very different licensing schmea, > > and one usually has to go with what a package supports. For example, > > matlab is in use on some clusters on campus here. matlab uses a > > license manager that can regulate the number of instances of matlab in > > use on a cluster. Quite a few packages, actually, use a license manager > > that can regulate the number of packages one has to buy relative to the > > number of platforms one wishes to run them on, but of course this is a > > case by case thing. > > also when there is no license manager included, you have to stay in the range > of the bought licenses with some counter in the queuing system you are using > (with some of them e.g. SGE you can also control the interactive usage). > > Some software companies also have different license conditions for commercial > usage (pay per machine or sometimes pay per CPU in the machine) or academical > usage (pay per platform). Depending on the price, it may be cheaper to buy a > site license in some cases (although you will use it in your cluster only). As > pointed out, this you have to check for each software you intend to use in > detail. > > > Compilers have a slightly different issue. 
There there may be floating > > license managers, but because compiler usage is sporadic many sites just > > buy a single license and put in on a specific node, e.g. the head node > > or the server node (which has direct access to the disk and thus avoids > > Agreed. > > > a networking hit). The issue there is libraries -- many compilers come > > with special libraries that are part of how they get good performance. > > In some cases the libraries can be used on many systems as long as you > > buy the compiler/library package for one. I don't know the exact state > > of things now but at one point in time at least you had to by library > > licenses for every node for at least some compilers out there in order > > to run the binaries generated by a compiler-licensed node. > > E.g. the Portland license allows you also to sell the compiled program and > distribute some .so files without any extra fee. For the Intel ones, you may in > addition distribute the .a files. In each case there is a detailed list, what > library files are valid for it. So it should be save to use them (the > libraries) on all nodes in a cluster also. > > > Unfortunately, most of the companies about clusters and what consitutes > > "reasonable" cost scaling in a cluster where 50-500 systems are > > literally clones of a basic node configuration, and will cheerily charge > > hundreds of dollars per node as if those nodes generate some sort of > > incremental cost for "support". I think it is safe to say that "most" > > cluster sites avoid this cost by using e.g. Centos (logo-free GPL-based > > rebuild of RHEL), Fedora Core, Debian, Caosity -- one of the still-free > > What about SuSE? You can download some floppies from their server and install > it over net. And if you want: you can buy support. > > Cheers - Reuti > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From reuti at staff.uni-marburg.de Mon Oct 25 01:21:38 2004 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Need Help...! In-Reply-To: References: <1098564112.417ac2102bb5b@home.staff.uni-marburg.de> Message-ID: <1098692498.417cb792cc1d9@home.staff.uni-marburg.de> Hi again, > Thanks alot for the prompt reply. Right at the moment I am asked to > purchased the following for my cluster: > > 1) MPI/Pro by verari systems > 2) PGI CDK Cluster Development Kit by Portland Group > Purchasing 100 processes of MPI/Pro is really very expensive for me. > Similarly, for my 100 processors I have to purchase 256 licences of > PGI CDK because they offer licences in groups of 16/64/256 CPUs. Even > if I purchase 256 licences for just 2 simultaneous counts, it costs me > a lot... is there a direct use of the cluster features of the PGI CDK? According to their websites it's just a combination of the compilers, MPICH and a part of OpenPBS. And OpenMP is also included in standard package of their compilers. Is your main application to use the compilers for software development or to use the compiled programs? So you could just buy one license of the normal compiler, download MPICH at http://www-unix.mcs.anl.gov/mpi/mpich and choose a queuing system (maybe OpenPBS or better) SGE from SUN at http://gridengine.sunsource.net , it's also free. In contrast to OpenPBS, you can kill the slave processes on the nodes nicely with SGE and it runs really stable. 
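A rough sketch of that free combination (the install prefix, the SGE parallel environment name "mpich" and the job script name are only examples here; the configure options for picking the PGI compilers are described in the MPICH install notes):

# build MPICH once on the head node, against the single licensed compiler
# (prefix and version number are placeholders)
./configure --prefix=/opt/mpich-1.2.6
make && make install

# make the ch_p4 device start remote processes with ssh instead of rsh
export P4_RSHCOMMAND=ssh

# quick interactive test with a hand-written machine file and the cpi example
/opt/mpich-1.2.6/bin/mpirun -np 4 -machinefile ./machines ./cpi

# later, submit through SGE with a parallel environment set up for MPICH
qsub -pe mpich 8 myjob.sh
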
Just test the performance with this combination, and you can add MPI/Pro later, after you tested the performance gain with a demo of it. Because the performance will also depend on the used network. I don't know the prices of MPI/pro, but maybe a second dedicaded network only for MPI communication is also an option and speed up the things (or Myrinet, Infiniband - this depends on the amount of MPI traffic your applications will generate). Best greetings - Reuti From lusk at mcs.anl.gov Mon Oct 25 08:35:48 2004 From: lusk at mcs.anl.gov (Rusty Lusk) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Can we set MPICH to use ssh instead of rsh at runtime? In-Reply-To: <1098701880.2791.287.camel@nuts.clc.cuhk.edu.hk> References: <1098701880.2791.287.camel@nuts.clc.cuhk.edu.hk> Message-ID: <20041025.103548.29035877.lusk@localhost> > Hi, > > Can we set MPICH to use ssh instead of rsh at runtime? I know it can be > set in compile time by configure opinion. And LAM can do that by setting > $LAMRSH environment variable. > No, there is not a way to do that. I would recommend the use of MPICH2 (http://www.mcs.anl.gov/mpi/mpich2), which doesn't use rsh or ssh to start jobs, but rather a set of daemons. The daemons can be started any way you like, including either rsh or ssh. Regards, Rusty Lusk From reuti at staff.uni-marburg.de Mon Oct 25 09:08:35 2004 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Can we set MPICH to use ssh instead of rsh at runtime? In-Reply-To: <1098701880.2791.287.camel@nuts.clc.cuhk.edu.hk> References: <1098701880.2791.287.camel@nuts.clc.cuhk.edu.hk> Message-ID: <417D2503.3060805@staff.uni-marburg.de> > Can we set MPICH to use ssh instead of rsh at runtime? I know it can be > set in compile time by configure opinion. And LAM can do that by setting > $LAMRSH environment variable. export P4_RSHCOMMAND=ssh or the appropiate path direct to your binary. - Reuti From lusk at mcs.anl.gov Mon Oct 25 10:50:39 2004 From: lusk at mcs.anl.gov (Rusty Lusk) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Can we set MPICH to use ssh instead of rsh at runtime? In-Reply-To: <417D2503.3060805@staff.uni-marburg.de> References: <1098701880.2791.287.camel@nuts.clc.cuhk.edu.hk> <417D2503.3060805@staff.uni-marburg.de> Message-ID: <20041025.125039.103756653.lusk@localhost> From: Reuti Subject: Re: [Beowulf] Can we set MPICH to use ssh instead of rsh at runtime? Date: Mon, 25 Oct 2004 18:08:35 +0200 > > Can we set MPICH to use ssh instead of rsh at runtime? I know it can be > > set in compile time by configure opinion. And LAM can do that by setting > > $LAMRSH environment variable. > > export P4_RSHCOMMAND=ssh > > or the appropiate path direct to your binary. - Reuti My face is red. I forgot you could do it in MPICH1 with an environment variable. -Rusty From john.hearns at clustervision.com Mon Oct 25 08:59:39 2004 From: john.hearns at clustervision.com (John Hearns) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] New OReilly book on clusters Message-ID: <1098719978.17215.144.camel@vigor12> A friend pointed me towards the new OReilly book on clusters. http://www.oreilly.com/catalog/highperlinuxc/ High Performance Linux Clusters with OSCAR, Rocks, OpenMosix, and MPI by Joseph D Sloan As I'm a sucker for OReilly books, no doubt I'll be adding this one to my menagerie. From john.hearns at clustervision.com Mon Oct 25 12:00:49 2004 From: john.hearns at clustervision.com (John Hearns) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Need Help...! 
In-Reply-To: <1098692498.417cb792cc1d9@home.staff.uni-marburg.de> References: <1098564112.417ac2102bb5b@home.staff.uni-marburg.de> <1098692498.417cb792cc1d9@home.staff.uni-marburg.de> Message-ID: <1098730849.17215.166.camel@vigor12> On Mon, 2004-10-25 at 09:21, Reuti wrote: > Hi again, > > > Thanks alot for the prompt reply. Right at the moment I am asked to > > purchased the following for my cluster: > > > > 1) MPI/Pro by verari systems > > 2) PGI CDK Cluster Development Kit by Portland Group > So you could just buy one license of the normal compiler, download MPICH at > http://www-unix.mcs.anl.gov/mpi/mpich and choose a queuing system (maybe > OpenPBS or better) SGE from SUN at http://gridengine.sunsource.net , it's also > free. In contrast to OpenPBS, you can kill the slave processes on the nodes > nicely with SGE and it runs really stable. I agree with what Reuti says. Witht he caveat that Portland make excellent products (as I'm sure everyone on this list agrees). In addition, I would say that you should start by looking at online resources such as www.clusterworld.com www.beowulf.org http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book.php Once you are a little bit familiar with these, you should try to find a friendly company which provides turn-key Linux clusters. They can help you with the choice of toolkits, compilers. monitoring applications etc. And if no such company exists in Karachi... well then there's an opportunity for somebody. From cflau at clc.cuhk.edu.hk Mon Oct 25 19:21:01 2004 From: cflau at clc.cuhk.edu.hk (John Lau) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] Can we set MPICH to use ssh instead of rsh at runtime? In-Reply-To: <20041025.125039.103756653.lusk@localhost> References: <1098701880.2791.287.camel@nuts.clc.cuhk.edu.hk> <417D2503.3060805@staff.uni-marburg.de> <20041025.125039.103756653.lusk@localhost> Message-ID: <1098757261.2791.307.camel@nuts.clc.cuhk.edu.hk> Hi, It works! Thank you very much. John Lau 2004-10-26 01:50, Rusty Lusk¡G > From: Reuti > Subject: Re: [Beowulf] Can we set MPICH to use ssh instead of rsh at runtime? > Date: Mon, 25 Oct 2004 18:08:35 +0200 > > > > Can we set MPICH to use ssh instead of rsh at runtime? I know it can be > > > set in compile time by configure opinion. And LAM can do that by setting > > > $LAMRSH environment variable. > > > > export P4_RSHCOMMAND=ssh > > > > or the appropiate path direct to your binary. - Reuti > > My face is red. I forgot you could do it in MPICH1 with an environment > variable. > > -Rusty From icub3d at gmail.com Tue Oct 26 12:08:00 2004 From: icub3d at gmail.com (Joshua Marsh) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] High Performance for Large Database Message-ID: <38242de90410261208b9ae5f2@mail.gmail.com> Hi all, I'm currently working on a project that will require fast access to data stored in a postgreSQL database server. I've been told that a Beowulf cluster may help increase performance. Since I'm not very familar with Beowulf clusters, I was hoping that you might have some advice or information on whether a cluster would increase performance for a PostgreSQL database. The major tables accessed are around 150-200 million records. On a stand alone server, it can take several minutes to perform a simple select query. It seems like once we start pricing for servers with 16+ processors and 64+ GB of RAM, the prices sky rocket. If I can acheive high performance with a cluster, using 15-20 dual processor machines, that would be great. Thanks for any help you may have! 
-Josh From reuti at staff.uni-marburg.de Tue Oct 26 14:03:30 2004 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: <38242de90410261208b9ae5f2@mail.gmail.com> References: <38242de90410261208b9ae5f2@mail.gmail.com> Message-ID: <1098824610.417ebba20d313@home.staff.uni-marburg.de> Hi, > I'm currently working on a project that will require fast access to > data stored in a postgreSQL database server. ?I've been told that a > Beowulf cluster may help increase performance. ?Since I'm not very > familar with Beowulf clusters, I was hoping that you might have some > advice or information on whether a cluster would increase performance > for a PostgreSQL database. ?The major tables accessed are around > 150-200 million records. ?On a stand alone server, it can take several > minutes to perform a simple select query. > > It seems like once we start pricing for servers with 16+ processors > and 64+ GB of RAM, the prices sky rocket. ?If I can acheive high > performance with a cluster, using 15-20 dual processor machines, that > would be great. what is your configuration now and what disks are you using at this time? Any RAID array with SCSI? - Reuti From rgb at phy.duke.edu Tue Oct 26 14:21:57 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: <38242de90410261208b9ae5f2@mail.gmail.com> References: <38242de90410261208b9ae5f2@mail.gmail.com> Message-ID: On Tue, 26 Oct 2004, Joshua Marsh wrote: > Hi all, > > I'm currently working on a project that will require fast access to > data stored in a postgreSQL database server. I've been told that a > Beowulf cluster may help increase performance. Since I'm not very > familar with Beowulf clusters, I was hoping that you might have some > advice or information on whether a cluster would increase performance > for a PostgreSQL database. The major tables accessed are around > 150-200 million records. On a stand alone server, it can take several > minutes to perform a simple select query. > > It seems like once we start pricing for servers with 16+ processors > and 64+ GB of RAM, the prices sky rocket. If I can acheive high > performance with a cluster, using 15-20 dual processor machines, that > would be great. This sort of cluster isn't a "beowulf" cluster; rather it is a variant of a high availability cluster. It's Extreme Linux, just not beowulf. The beowulf design (and focus of this list) is "high performance computing" clusters, aka supercomputing clusters. With that said, there may be some resources out there that can help you, and listening in on this list and learning how HPC clusters work will certainly help you with other kinds, as the issues are in many cases similar. The first/best place to look is the September issue of Cluster World Magazine (www.clusterworld.com/issues.html). Its cover focus is on "Database Clusters". My copy is at Duke (and I'm at home:-) so although I'm pretty sure it covers mysql used in a cluster environment I cannot recall if it discusses alternatives such as oracle or postgres. Other CWM issues will also be pertinent, regardless. One major issue associated with any kind of file access is assembling a large, shared file store that avoids the file and communications bottlenecks that are as much an issue in HPC as they are in HA. 
A series of articles just begun by Jeff Layton deals with SAN's and massive scalable storage in general -- he's only done a couple of articles so far, so if there are still September/October issues around you'd be in great shape. CWM also abounds with ads for large and scalable and blindingly fast storage solutions. We just had an extensive discussion on this very list on storage (I kicked it off as we have a big proposal out that had a very large storage component and I needed to learn -- fast!). The recent list archives should show you the thread. Finally, there are some companies out there that make their bread and butter by assembling custom clusters to accomplish very specific tasks at a cost (as you note) far less than the cost of a big multiprocessor machine even though they make a healthy (and well earned) profit on the deal. Some of them have employees or owners on this list -- if any of them can help you I expect they'll talk to you offline. That's about all the help I personally can offer; I haven't built a large database cluster and only have listened halfheartedly when they were discussed on list in the past (although there have been previous discussions you can also google for in the list archives, I think). The problem is a fairly complex one -- not just various file latency and bandwidth issues (these are likely the "easy part") but the issue of sharing the underlying DB brings up locking. It is one thing to provide lots of nodes read-only access to a DB on a SAN engineered for fast, cached, read-only access; it is another to provide all the nodes with read AND write access, as writing requires a lock, and a lock effectively serializes access. This (and related problems) are serious issues with speeding up databases through parallelism. I vaguely recall that big companies like Oracle have dumped pretty serious money into this kind of thing looking for solutions that scale well. Maybe somebody else on list knows more than I do, though, and maybe they'll tell all of us! rgb > > Thanks for any help you may have! > > -Josh > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From laurenceliew at yahoo.com.sg Tue Oct 26 18:29:58 2004 From: laurenceliew at yahoo.com.sg (Laurence Liew) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: <38242de90410261208b9ae5f2@mail.gmail.com> References: <38242de90410261208b9ae5f2@mail.gmail.com> Message-ID: <417EFA16.8020503@yahoo.com.sg> Hi, You may wish to search thru the beowulf list or google for "beowulf and databases and postgresql"... there were a couple of threads on exeactly this issue. Very briefly 1. Beowulf clusters CANNOT help make Postgresql or any databases run faster. You need the database code to be modified to do that (think Oracle 10g). I met a company at Supercomputer 03 last year that had Mysql running on a cluster... you may wish to query for them. 2. You could try to sponsor the development of a parallel postrgresql - talk to the postgresql development team... when I broached the idea in 1998.. there was some interest.. unfortunately.. I could not afford the development/sponsorship costs then. 3. 
Try running Postgresql on a cluster filesystem like PVFS - it is not gauranteed as it probably fails the ACID test for a SQL compliant database. The basic idea is that if we cannot parallelise the database - we make the underlying IO parallel and hence boost the IO performance of the system.. and any applications that run on them.. and this includes Postgresql. Hope this helps. Cheers! Laurence Scalable Systems Singapore Joshua Marsh wrote: > Hi all, > > I'm currently working on a project that will require fast access to > data stored in a postgreSQL database server. I've been told that a > Beowulf cluster may help increase performance. Since I'm not very > familar with Beowulf clusters, I was hoping that you might have some > advice or information on whether a cluster would increase performance > for a PostgreSQL database. The major tables accessed are around > 150-200 million records. On a stand alone server, it can take several > minutes to perform a simple select query. > > It seems like once we start pricing for servers with 16+ processors > and 64+ GB of RAM, the prices sky rocket. If I can acheive high > performance with a cluster, using 15-20 dual processor machines, that > would be great. > > Thanks for any help you may have! > > -Josh > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- A non-text attachment was scrubbed... Name: laurenceliew.vcf Type: text/x-vcard Size: 150 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041027/bb931a9a/laurenceliew.vcf From kmurphy at dolphinics.com Wed Oct 27 03:39:41 2004 From: kmurphy at dolphinics.com (Keith Murphy) Date: Wed Nov 25 01:03:30 2009 Subject: [Beowulf] High Performance for Large Database References: <38242de90410261208b9ae5f2@mail.gmail.com> <417EFA16.8020503@yahoo.com.sg> Message-ID: <037701c4bc11$455857e0$6901a8c0@dolphinics.no> Check out this url http://www.linuxlabs.com/clusgres.html they look like they have a solution for scaleable Postgres Kindest Regards Keith Murphy Dolphin Interconnect 818-292-5100 kmurphy@dolphinics.com www.dolphinics.com ----- Original Message ----- From: "Laurence Liew" To: "Joshua Marsh" Cc: Sent: Wednesday, October 27, 2004 3:29 AM Subject: Re: [Beowulf] High Performance for Large Database > Hi, > > You may wish to search thru the beowulf list or google for "beowulf and > databases and postgresql"... there were a couple of threads on exeactly > this issue. > > Very briefly > > 1. Beowulf clusters CANNOT help make Postgresql or any databases run > faster. You need the database code to be modified to do that (think > Oracle 10g). I met a company at Supercomputer 03 last year that had > Mysql running on a cluster... you may wish to query for them. > > 2. You could try to sponsor the development of a parallel postrgresql - > talk to the postgresql development team... when I broached the idea in > 1998.. there was some interest.. unfortunately.. I could not afford the > development/sponsorship costs then. > > 3. Try running Postgresql on a cluster filesystem like PVFS - it is not > gauranteed as it probably fails the ACID test for a SQL compliant > database. The basic idea is that if we cannot parallelise the database - > we make the underlying IO parallel and hence boost the IO performance of > the system.. and any applications that run on them.. 
and this includes > Postgresql. > > Hope this helps. > > Cheers! > Laurence > Scalable Systems > Singapore > > > > Joshua Marsh wrote: > > Hi all, > > > > I'm currently working on a project that will require fast access to > > data stored in a postgreSQL database server. I've been told that a > > Beowulf cluster may help increase performance. Since I'm not very > > familar with Beowulf clusters, I was hoping that you might have some > > advice or information on whether a cluster would increase performance > > for a PostgreSQL database. The major tables accessed are around > > 150-200 million records. On a stand alone server, it can take several > > minutes to perform a simple select query. > > > > It seems like once we start pricing for servers with 16+ processors > > and 64+ GB of RAM, the prices sky rocket. If I can acheive high > > performance with a cluster, using 15-20 dual processor machines, that > > would be great. > > > > Thanks for any help you may have! > > > > -Josh > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > ---------------------------------------------------------------------------- ---- > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From hanzl at noel.feld.cvut.cz Wed Oct 27 02:42:15 2004 From: hanzl at noel.feld.cvut.cz (hanzl@noel.feld.cvut.cz) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: References: <38242de90410261208b9ae5f2@mail.gmail.com> Message-ID: <20041027114215V.hanzl@unknown-domain> > > I'm currently working on a project that will require fast access to > > data stored in a postgreSQL database server. I've been told that a > > ... > > and 64+ GB of RAM, the prices sky rocket. If I can acheive high > > performance with a cluster, using 15-20 dual processor machines, that > > would be great. > > This sort of cluster isn't a "beowulf" cluster; rather it is a variant > of a high availability cluster. It's Extreme Linux, just not beowulf. > The beowulf design (and focus of this list) is "high performance > computing" clusters, aka supercomputing clusters. I think that while this is true in many particular cases, it is far from being true in general. There are applications which involve databases and could be as beowulfish as it can get. I know reseachers who work with extremely huge and complex graphs and use a database for this. Should they have say a MPI-based database with all data in RAM they could get tremendous speedups. They would be happy to copy the database to the distributed cluster RAM, do few zillions of operations on it and then copy some results back. I do agree that a database might not be the best tool for their job and complete rewrite of all the code they have might help :-) However I consider programming against a db API to be an important knowledge reuse and nice split of their problem into two parts which together take more computer time than one monolith would but one of them (the db searches) is a problem with commodity solutions. (And I might even argue that even high availability databases may very well use The True Beowulf as a component doing searches on mostly read-only data cached in cluster RAM or even cached in local harddisks.) 
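(To make that last idea concrete, here is a crude shell sketch; the host names, the file name and the search pattern are made up for illustration:

# copy the read-only data into a RAM-backed filesystem (tmpfs) on each node
for n in node01 node02 node03 node04; do
    scp big-readonly.dat $n:/dev/shm/
done

# any node can now answer a query from its in-RAM copy, so queries can be
# spread round robin over the nodes, e.g.:
ssh node03 "grep -c 'some pattern' /dev/shm/big-readonly.dat"

No real database is involved here, but the data-distribution pattern is the same.)
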
The only difference I can see is the application (which is not a CFD or galactic evolution or similar). From the point of wiew of interconnects, OS types, parallel libraries used, RAM, processors, cluster management etc. I see no reason why databases and beowulf could not overlap. Best Regards Vaclav Hanzl From mechti01 at luther.edu Tue Oct 26 22:09:48 2004 From: mechti01 at luther.edu (Timo Mechler) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Clic 2.0 lockup problems Message-ID: <2668.172.22.17.130.1098853788.squirrel@172.22.17.130> Hi all, I just finished installing Clic 2.0 on a cluster of 1 server and 12 nodes. After running the setup_auto_cluster script I got everything installed. I created a "cluster user" and proceeded to test out some of the included mpi sample code. This ran fine. I next tried to start this code remotely (through SSH), but when I did this, the server locked up and had to be rebooted. It actually locked up while connecting via SSH, not when executing the sample mpi code. Any idea what might cause this? The server has 3 network interfaces: eth0 - administration eth1 - outside (internet) eth2 - message passing (computing) (The nodes each have 2 interfaces, one for administration and one for message passing) It also seems that when I logon as the "cluster user" or root, and try to access an external website (e.g. google), the server will lockup and need to be rebooted again. Any idea why I'm experiencing these lockups? Is something configured incorrectly? Is it a faulty network card? I was able to access outside websites fine before I ran the setup scripts. Thanks in advance for your help. Regards, -Timo Mechler -- Timo R. Mechler mechti01@luther.edu From rgb at phy.duke.edu Wed Oct 27 08:08:04 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: <20041027114215V.hanzl@unknown-domain> References: <38242de90410261208b9ae5f2@mail.gmail.com> <20041027114215V.hanzl@unknown-domain> Message-ID: On Wed, 27 Oct 2004 hanzl@noel.feld.cvut.cz wrote: (a bunch of stuff, leading to...) > The only difference I can see is the application (which is not a CFD or > galactic evolution or similar). From the point of wiew of > interconnects, OS types, parallel libraries used, RAM, processors, > cluster management etc. I see no reason why databases and beowulf > could not overlap. And I would agree, and even pointed out that there WERE areas of logical overlap. The problems being solved and bottlenecks involved are in many cases the same. However, by convention "beowulf" clusters per se and MOST of the energy of this list is devoted to HPC -- numerical computations and applications. It is undeniable that numerical applications exist that interact with data stores of many different forms, including I'm sure databases. Databases are also used to manage clusters. Grids, in particular, tend to integrate tightly with databases to be able to manage distributed storage resources that aren't necessarily viewable or accessible as "mounts" of an ordinary filesystem. When one has thousands of (potential) users and millions of inodes spread out across hundreds of disks with a complicated set of relationships regarding access to the HPC-generated data, a DB is needed just to permit search and retrieval of your OWN results, let alone somebody else's. 
Nevertheless, databases per se are not numerical HPC, and a cluster built to do SQL transactions on a collective shared database is not properly called a "beowulf" cluster or even a more general HPC cluster or grid. That is one reason that they ARE rarely discussed on this list. In fact, most of the discussion that has occurred and that is in the archives concerns why database server clusters aren't really HPC or beowulf clusters, not how one might build a cluster-based database. The latter is more the purview of: http://www.linux-ha.org/ and its associated list; at least they address the data reliability, failover, data store design, logical volume management aspects of shared DB access. Even this list doesn't actually address "cluster implementation of a database server program" though, because that is actually a very narrow topic. So narrow that it is arguably confined to particular database servers, one at a time. To put it another way, writing a SQL database server is a highly nontrivial task, and good open source servers are rare. Mysql is common and open source (if semi-commercial) for example, but there exist absolute rants on the web against mysql as being a high quality, scalable DB for all of that. (I'm not religiously involved in this debate, BTW, so flames->/dev/null or find the referents with google and flame them instead, I'm just pointing out that the debate exists;-). Writing a PARALLEL SQL database server is even MORE nontrivial, and while yes, some reasons for this are shared by the HPC community, the bulk of them are related directly to locking and the file system and to SQL itself. Indeed, most are humble variants of the time-honored problem of how to deal with race conditions and the like when I'm trying to write to a particular record with a SQL statement on node A at the same time you're trying to read from the record on node B, or worse yet, when you're trying to write at the same time. Most of the solutions to this problem lead to godawful and rapid corruption of the record itself if not the entire database. Robust solutions tend to (necessarily) serialize access, which defeats all trivial parallelizations. NONtrivial parallelizations are things like distributing the execution of actual SQL search statements across a cluster. Whether there is any point in this depends strongly on the design of the database store itself; if it is a single point of presence to the entire cluster, there is an obvious serial bottleneck there that AGAIN defeats most straightforward parallelizations (depending a bit on just how long a node has to crunch a block of the DB compared to the time required to read it in from the server). It also depends strongly on how the DB itself is organized, as the very concept of "block of the DB" may be meaningless. In fact, to make a really efficient parallel DB program, I believe that you have to integrate a design from the datastore on up to avoid serializing bottlenecks. The actual DB has to be stored in a way that can be read in units that can be independently processed. It has to be organized in such a way that the hashing and information-theoretic-efficient parsing of the blocked information can proceed efficiently on the nodes (not easy when there is record linkage in a relational DB -- maybe not POSSIBLE in general in a relational DB). The distributed tasks have to be rewritten from scratch by Very Smart Humans to use parallelizable algorithms (that integrate with the underlying file store and with the underlying DB organization). 
These algorithms are likely to be so specialized as to be patentable (and I'll bet that e.g. Oracle owns a slew of patents on this very thing). Finally the specter of locking looms over everything, threatening all of your work unless you can arrange for record modification not to serialize everything. For read only access, life is probably livable if not good. RW access to a large relational DB to be distributed across N nodes -- just plain ouch... So yes, it is fun to kick around on this list in the abstract BECAUSE lots of these are also problems in parallel applications that work with data (in a DB per se or not) but in direct reference to the question, no, this list isn't going to provide direct guidance on how to parallelize mysql or oracle or sybase or postgres or peoplesoft because EACH of these has to engineer an efficient parallel solution all the way from the file store to the user interface and API, at least if one wants to get reliable/predictable and beneficial scalability. There may, however, be people on the list that have messed with parallelized versions of at least some of these DBs. There has certainly been list discussion on parallizing postgres before (e.g. http://beowulf.org/pipermail/beowulf/1998-October/002070.html Which is alas no longer accessible in the archives at this address, although google still turns it and a number of other hits up; perhaps it is a part of what was lost when beowulf.org crashed a short while ago. Unfortunately, I failed to capture the list archives in my last mirror of this site. And Doug Eadline probably can say a few words about the parallelization of mysql (which has ALSO been discussed on the list back in 1999 and is ALSO missing from the archives). Both mysql and postgres appear to have a parallel implementation with at least some scalability, see: http://www.illusionary.com/snort_db.html Mysql's is an actual cluster implementation: http://www.mysql.com/products/cluster/ (note that bit about "Designed for 99.999% Availability" -- high availability, not HPC). A thread on mysql and postgres clustering on slashdot: http://developers.slashdot.org/comments.pl?sid=62549&cid=5843509 (a search is complicated by the fact that postgres refers to relational database structures on disk as "clusters" and has actual commands to create them etc.). Postgres based clustering project (of sorts) lives here: http://gborg.postgresql.org/project/erserver/projdisplay.php There is a sourceforge project trying to implement some sort of lowest-common-denominator embarrassingly parallel cluster DB solution that can be implemented "on top of" SQL DBs (as I make it out, read it for yourself). http://ha-jdbc.sourceforge.net/ Really, google is your friend in this. In a nutshell, it IS possible to find support for cluster-type access to at least mysql and postgres in the open source community, in at least two ways (native and as an add-on layer in each case). Add-on layer clustering provides a better than nothing solution to the serial bottleneck problem, but it will not scale well for all kinds of access and has the usual problems with the design of the data store itself. I can't comment on the native implementations beyond observing that mysql looks like it is in production while postgres looks like it is still very much under development, and that both of them are "replication" models that likely won't scale well at all for write access (they likely handle locking at the granularity of the entire DB, although I >>do not know<< and don't plan to look it up:-). 
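For the read-mostly case, the poor man's version of the replication model is simple enough to script by hand -- the host names and database name below are invented, and this cheerfully ignores every locking and consistency issue raised above:

# dump the master copy once
pg_dump mydb > /tmp/mydb.sql

# reload it onto each read-only query node (node and db names are placeholders)
for n in dbnode01 dbnode02 dbnode03; do
    ssh $n "dropdb mydb; createdb mydb"
    psql -h $n -d mydb < /tmp/mydb.sql
done

# the application then spreads its SELECTs over dbnode01..dbnode03, sends all
# writes to the master, and reruns this refresh as often as it can afford
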
Hope this helps somebody. If nothing else, it is likely worthwhile to reinsert a discussion on this into the archives because of recent developments and because previous discussions have gone away. rgb > > Best Regards > > Vaclav Hanzl > > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From hahn at physics.mcmaster.ca Wed Oct 27 10:25:58 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: Message-ID: > relationships regarding access to the HPC-generated data, a DB is needed > just to permit search and retrieval of your OWN results, let alone > somebody else's. right. the distinction here is that HPC and filesystems tend to have a very simple DB schema ;) > Writing a PARALLEL SQL database server is even MORE nontrivial, and > while yes, some reasons for this are shared by the HPC community, the > bulk of them are related directly to locking and the file system and to > SQL itself. depends. for instance, it's not *that* uncommon to have DB's which see almost nothing but read-only queries (and updates, if they happen at all, can be batched during an off-time.) that makes a parallel version quite easy, actually: imagine a bunch of 8GB dual-opterons running queries on a simple NFS v3 server over Myrinet. for a read-mostly load, especially one with enough locality to make 8GB caches effective, this would probably *fly*. tweak it with iSCSI and go to 64 GB quad- opterons. how many tables out there wouldn't have a good hit rate in 64GB? > NONtrivial parallelizations are things like distributing the execution > of actual SQL search statements across a cluster. Whether there is any it's easy to imagine that a stream of SQL queries could actually be handled in sort of an adaptive data refinement manner, where most of the thought goes in to managing division of the query labor (distributed indices searched in parallel, etc) , and in placement of data (especially ownership/locking of writable data). I have no idea whether Oracle-level DB's do this, but it wouldn't surprise me. the irony is that most of the thought that goes into advanced Beowulf applications is doing exactly this sort of labor/data division/balancing. I'd hazard a guess that the place to start putting parallelism in a DB is the underlying isam-like table layer... From gary at sharcnet.ca Wed Oct 27 09:22:58 2004 From: gary at sharcnet.ca (Gary Molenkamp) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] dual Opteron recommendations In-Reply-To: <20041022020350.GA32640@cse.ucdavis.edu> Message-ID: On Thu, 21 Oct 2004, Bill Broadley wrote: > I'm familar with 48 sun v20z (newisys) machines around here, only one > died so far with a hard memory error (I.e. won't boot). So far, one of 24 with a power issue (still debugging). > Speaking of which, has anyone done anything useful with the v20z LCD > display, ours just say something like IP address of the management > interface and OS booted or similar. > > I was hoping for hostname, maybe system load, even a way to pull a node > out of the queue (there are several buttons under the LCD). Just a hostname. 
:) -- Gary Molenkamp SHARCNET Systems Administrator University of Western Ontario gary@sharcnet.ca http://www.sharcnet.ca (519) 661-2111 x88429 (519) 661-4000 From hanzl at noel.feld.cvut.cz Wed Oct 27 10:42:40 2004 From: hanzl at noel.feld.cvut.cz (hanzl@noel.feld.cvut.cz) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: References: <20041027114215V.hanzl@unknown-domain> Message-ID: <20041027194240L.hanzl@unknown-domain> Hello RGB, I had no intention to take up your whole morning :-) (Neither did I intend to exploit your susceptibility to DoS attack by making provocative comments :-)) ) Of course, I agree with your explanation about databases. However, > MOST of the energy of this list is devoted to HPC -- numerical > computations and applications. > ... > Nevertheless, databases per se are not numerical HPC, and a cluster > built to do SQL transactions on a collective shared database is not > properly called a "beowulf" yes, but I still have a feeling that you are trying to squeeze _numerical_ to definition of beowulf which would be a pity because there are problems with _numerical/symbolic_ mix best solved on exactly the same type of hardware as the _numerical_ ones. I hope these can live on this list as well, unless cooling the FPU portion of the chip bacames the main topic here :-)) OK, I do confess that I do pursue my selfish goals because my problems are numerical/symbolic mix :-) And no, I do not use SQL databases for them. However I know people who do misuse SQL databases this way (in the similar manner we lazy people waste computer power with perl or Matlab) and who could make easy progress by MPI implementing very very limited subset of SQL, just enough to run those stupid select()s found in their code. But I repeat, normal SQL databases are mostly out of topic here, no doubt. Best Regards Vaclav Hanzl From atp at piskorski.com Wed Oct 27 10:48:19 2004 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Re: High Performance for Large Database In-Reply-To: <38242de90410261208b9ae5f2@mail.gmail.com> References: <38242de90410261208b9ae5f2@mail.gmail.com> Message-ID: <20041027174819.GB29850@piskorski.com> On Tue, Oct 26, 2004 at 01:08:00PM -0600, Joshua Marsh wrote: > Hi all, > > I'm currently working on a project that will require fast access to > data stored in a postgreSQL database server. I've been told that a > Beowulf cluster may help increase performance. Unlikely in general, although possible in certain cases. For example, look into Clusgres, memcached, and Backplane. I've previously given links and discussion here: http://openacs.org/forums/message-view?message_id=128060 http://openacs.org/forums/message-view?message_id=179348 > Since I'm not very familar with Beowulf clusters, I was hoping that It is more important that you are extensively familiar with RDBMSs in general, and PostgreSQL in particular. Are you? > you might have some advice or information on whether a cluster would > increase performance for a PostgreSQL database. The major tables > accessed are around 150-200 million records. On a stand alone > server, it can take several minutes to perform a simple select > query. 200 million rows is not that big. What's the approximate total size of your database on disk? Your "several minutes for a simple select" query performance is abysmal, and this is unlikely to be because of your hardware. 
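A first, concrete sanity check (the table and column names below are invented for the example):

# ask PostgreSQL how it actually executes one of the slow queries
psql -d mydb -c "EXPLAIN ANALYZE SELECT * FROM big_table WHERE customer_id = 42;"

# a "Seq Scan on big_table" over 150-200 million rows is the classic culprit;
# an index on the filtered column often turns minutes into milliseconds
psql -d mydb -c "CREATE INDEX big_table_customer_id_idx ON big_table (customer_id);"
psql -d mydb -c "VACUUM ANALYZE big_table;"
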
Most likely, your queries just suck, and you need to do some serious SQL tuning work before even considering big huge fancy hardware. Once you have tables with tens or hundreds of millions of rows, doing ANY full table scans of that table at all sucks really badly, so you MUST profile and tune your queries. And eliminating full table scans of large tables is just the first and most obvious step, it is not unusual to still have very sucky queries after that. > It seems like once we start pricing for servers with 16+ processors > and 64+ GB of RAM, the prices sky rocket. If I can acheive high > performance with a cluster, using 15-20 dual processor machines, that > would be great. If you are even thinking about buying an 8-way or larger box, then you are certainly a candidate for several 2-way boxes with an (expensive) SCI interconnect, so see Clusgres in my links above. Or, if want want to spend a lot less money, your access is read-mostly, and you DON'T need full ACID transactional support for your read-only queries, look into using memcached to cache query results in other machines' RAM. However, VERY few people need such large RDBMS boxes. What makes you think you do? What exactly is your application doing, and what sort of load do you need it to sustain? Have you profiled and tuned all your SQL? Tuned your PostgreSQL and Linux kernel settings? Have you read and worked through all the PostgreSQL docs on tuning? (You didn't install PostgreSQL with its DEFAULT settings, did you? Those are intended to just get it up and running on ALL the platforms PostgreSQL supports, not to give good performance.) Investing hundreds of thousands of dollars in fancy server hardware without first doing your basic RDBMS homework makes no sense at all. If your database is dog slow because of poor data modeling or grossly untuned queries, throwing $300k of hardware at the problem may not help much at all. -- Andrew Piskorski http://www.piskorski.com/ From reuti at staff.uni-marburg.de Wed Oct 27 11:28:11 2004 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Clic 2.0 lockup problems In-Reply-To: <2668.172.22.17.130.1098853788.squirrel@172.22.17.130> References: <2668.172.22.17.130.1098853788.squirrel@172.22.17.130> Message-ID: <1098901691.417fe8bb76972@home.staff.uni-marburg.de> Hi, > I just finished installing Clic 2.0 on a cluster of 1 server and 12 nodes. > After running the setup_auto_cluster script I got everything installed. > I created a "cluster user" and proceeded to test out some of the included > mpi sample code. This ran fine. I next tried to start this code remotely > (through SSH), but when I did this, the server locked up and had to be > rebooted. It actually locked up while connecting via SSH, not when > executing the sample mpi code. Any idea what might cause this? The > server has 3 network interfaces: > > eth0 - administration > eth1 - outside (internet) > eth2 - message passing (computing) > > (The nodes each have 2 interfaces, one for administration and one for > message passing) > > It also seems that when I logon as the "cluster user" or root, and try to > access an external website (e.g. google), the server will lockup and need > to be rebooted again. > > Any idea why I'm experiencing these lockups? Is something configured > incorrectly? Is it a faulty network card? I was able to access outside > websites fine before I ran the setup scripts. 
From reuti at staff.uni-marburg.de Wed Oct 27 11:28:11 2004 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Clic 2.0 lockup problems In-Reply-To: <2668.172.22.17.130.1098853788.squirrel@172.22.17.130> References: <2668.172.22.17.130.1098853788.squirrel@172.22.17.130> Message-ID: <1098901691.417fe8bb76972@home.staff.uni-marburg.de> Hi, > I just finished installing Clic 2.0 on a cluster of 1 server and 12 nodes. > After running the setup_auto_cluster script I got everything installed. > I created a "cluster user" and proceeded to test out some of the included > mpi sample code. This ran fine. I next tried to start this code remotely > (through SSH), but when I did this, the server locked up and had to be > rebooted. It actually locked up while connecting via SSH, not when > executing the sample mpi code. Any idea what might cause this? The > server has 3 network interfaces: > > eth0 - administration > eth1 - outside (internet) > eth2 - message passing (computing) > > (The nodes each have 2 interfaces, one for administration and one for > message passing) > > It also seems that when I logon as the "cluster user" or root, and try to > access an external website (e.g. google), the server will lockup and need > to be rebooted again. > > Any idea why I'm experiencing these lockups? Is something configured > incorrectly? Is it a faulty network card? I was able to access outside > websites fine before I ran the setup scripts. Have you tried to log in to the server on all three of the interfaces, to check whether it's really completely down and not only one interface? - Reuti From rgb at phy.duke.edu Wed Oct 27 11:41:31 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: <20041027194240L.hanzl@unknown-domain> References: <20041027114215V.hanzl@unknown-domain> <20041027194240L.hanzl@unknown-domain> Message-ID: On Wed, 27 Oct 2004 hanzl@noel.feld.cvut.cz wrote: > Hello RGB, > > I had no intention to take up your whole morning :-) (Neither did I > intend to exploit your susceptibility to DoS attack by making > provocative comments :-)) ) Naaa, it was fun. And the list has always been fairly tolerant of a broad definition of HPC as long as it is fun and relevant (I really think that fun is more the criterion than whether it is floating point intensive) just as it is tolerant of those people who do HPC on clusters that aren't really "beowulf" style clusters in the original sense of its definition (like me:-). I'm also very interested in just what sort of symbolic manipulation you are working on. I've worked with some of the various algebraic manipulation packages that have existed running back into the dawn of time -- FORMAC, Macsyma, maple, mathematica .. and agree that there is much in the token parsing and algebraic reconstruction process that could be parallelized, as parts of algebra are intrinsically independent. There are also still an abundance of problems (in physics, especially) where a good non-commutative algebra engine that can be taught about a set of generators/commutators can really help out. And then there is geometric algebra (the descendant of quaternions, Grassmann algebras, Clifford algebras) where I think of things as barely being begun, especially since there is a geometric/visualization component that tags along with the algebraic component. Anything like that? rgb > > Of course, I agree with your explanation about databases. However, > > > MOST of the energy of this list is devoted to HPC -- numerical > > computations and applications. > > ... > > Nevertheless, databases per se are not numerical HPC, and a cluster > > built to do SQL transactions on a collective shared database is not > > properly called a "beowulf" > > yes, but I still have a feeling that you are trying to squeeze > _numerical_ to definition of beowulf which would be a pity because > there are problems with _numerical/symbolic_ mix best solved on > exactly the same type of hardware as the _numerical_ ones. I hope > these can live on this list as well, unless cooling the FPU portion of > the chip bacames the main topic here :-)) > > OK, I do confess that I do pursue my selfish goals because my problems > are numerical/symbolic mix :-) And no, I do not use SQL databases for > them. However I know people who do misuse SQL databases this way (in > the similar manner we lazy people waste computer power with perl or > Matlab) and who could make easy progress by MPI implementing very very > limited subset of SQL, just enough to run those stupid select()s found > in their code. > > But I repeat, normal SQL databases are mostly out of topic here, no > doubt. > > Best Regards > > Vaclav Hanzl > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C.
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Wed Oct 27 11:54:04 2004 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: References: Message-ID: On Wed, 27 Oct 2004, Mark Hahn wrote: > > NONtrivial parallelizations are things like distributing the execution > > of actual SQL search statements across a cluster. Whether there is any > > it's easy to imagine that a stream of SQL queries could actually > be handled in sort of an adaptive data refinement manner, where most > of the thought goes in to managing division of the query labor (distributed > indices searched in parallel, etc) , and in placement of data (especially > ownership/locking of writable data). I have no idea whether Oracle-level DB's > do this, but it wouldn't surprise me. the irony is that most of the thought > that goes into advanced Beowulf applications is doing exactly this sort of > labor/data division/balancing. > > I'd hazard a guess that the place to start putting parallelism in a DB > is the underlying isam-like table layer... As always, google is your friend. parallel database algorithms turns up lots of current work; I'm sure a look at specific open source projects would turn up more (and maybe more relevant) work. Some of the tools I turned up in my short former query do exploit the kind of simple read-only data parallelism you described, though, and wrap it up all pretty. For small read only databases (backing e.g. a website), the very simplest approach is likely to put a full copy of the DB on each server and distribute the transactions themselves round robin. Use rsync to periodically update the images to accomodate distributed changes, if you permit distributed write and you dare (merging in changes is nontrivial). Or write an engine that uses idle time of the node engines themselves to distribute inserts to be scheduled. Google itself is a pretty good example, actually. Complex searches of a read-only and truly encyclopediac database. All I really know is that this is real computer science, and I am only an amateur at best. I suspect that large scale database parallelism is the subject of much current algorithmic research, as parts of the problem are likely NP complete or at least NP hard. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From mechti01 at luther.edu Wed Oct 27 12:16:42 2004 From: mechti01 at luther.edu (mechti01) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Clic 2.0 lockup problems In-Reply-To: <1098901691.417fe8bb76972@home.staff.uni-marburg.de> References: <2668.172.22.17.130.1098853788.squirrel@172.22.17.130> <1098901691.417fe8bb76972@home.staff.uni-marburg.de> Message-ID: <2277.172.17.4.251.1098904602.squirrel@172.17.4.251> Hi Thanks for your help. No, eth1, because that is the only external interface. In other words, I tried to logon from an external machine. Ssh works fine on other two interfaces though. Thanks again. -Timo > Hi, > >> I just finished installing Clic 2.0 on a cluster of 1 server and 12 >> nodes. >> After running the setup_auto_cluster script I got everything installed. >> I created a "cluster user" and proceeded to test out some of the >> included >> mpi sample code. This ran fine. 
I next tried to start this code >> remotely >> (through SSH), but when I did this, the server locked up and had to be >> rebooted. It actually locked up while connecting via SSH, not when >> executing the sample mpi code. Any idea what might cause this? The >> server has 3 network interfaces: >> >> eth0 - administration >> eth1 - outside (internet) >> eth2 - message passing (computing) >> >> (The nodes each have 2 interfaces, one for administration and one for >> message passing) >> >> It also seems that when I logon as the "cluster user" or root, and try >> to >> access an external website (e.g. google), the server will lockup and >> need >> to be rebooted again. >> >> Any idea why I'm experiencing these lockups? Is something configured >> incorrectly? Is it a faulty network card? I was able to access outside >> websites fine before I ran the setup scripts. > > you tried to login to the server on all of the three interfaces, to check > whether it's really completely down and not only one interface? - Reuti > -- From hanzl at noel.feld.cvut.cz Wed Oct 27 12:52:06 2004 From: hanzl at noel.feld.cvut.cz (hanzl@noel.feld.cvut.cz) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: References: <20041027194240L.hanzl@unknown-domain> Message-ID: <20041027215206U.hanzl@unknown-domain> > I'm also very interested in just what sort of symbolic manipulation you > are working on. My numerical/symbolic mix underlying my opinions is from natural language processing, mostly speech recognition. Involves training phase which uses huge amount of recorded speech which is iteratively turned into estimated statistical distributions of phoneme sounds (multivariate gaussians with some 500.000 parameters, work for the FPU) and huge amount of text turned into dictionaries and grammar rules (symbolic and maybe even SQL). This phase is not very beowulfish, processes can work locally for minutes. Then there is the recognition phase when we match unknown utterances against our models of sounds and pronunciation and dictionaries and grammar and this is very beowulfish as we need to estimate zillions of partial hypothesis and compose them together somehow, likely in real time, and we are happy to pass quick messages around and keep most things in aggregated cluster RAM. Training on huge speech data has very much the pattern just described by Mark Hahn: > depends. for instance, it's not *that* uncommon to have DB's which > see almost nothing but read-only queries (and updates, if they happen > at all, can be batched during an off-time.) that makes a parallel > version quite easy (thought we do not have wav files in SQL :-) ) and we are much interested in ways to divide our data to chunks cached on local harddisks on nodes and repeatedly processed again and again (say 30 times during one itarative process, and we try many variants of this process on the same data.) 
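A minimal sketch of that chunk-caching pattern, under made-up assumptions (the corpus sits on a shared /home, every node has a local /scratch, chunks are assigned by hashing the file name, and the node names are placeholders):

# Sketch only: cache this node's share of the corpus on local disk once,
# then iterate over the local copies.  Paths and node names are invented.
import os, shutil, socket, zlib

CORPUS_DIR  = "/home/speech/corpus"     # shared (e.g. NFS) -- assumed
SCRATCH_DIR = "/scratch/speech/corpus"  # local to each node -- assumed
NODES = ["node001", "node002", "node003", "node004"]   # hypothetical names

def my_chunk(files, hostname):
    """Deterministically pick the subset of files this node owns."""
    rank = NODES.index(hostname)
    return [f for f in files if zlib.crc32(f.encode()) % len(NODES) == rank]

def ensure_cached(files):
    """Copy this node's chunk to local disk once; later passes reuse it."""
    os.makedirs(SCRATCH_DIR, exist_ok=True)
    local = []
    for name in files:
        dst = os.path.join(SCRATCH_DIR, name)
        if not os.path.exists(dst):
            shutil.copy(os.path.join(CORPUS_DIR, name), dst)
        local.append(dst)
    return local

if __name__ == "__main__":
    all_files = sorted(os.listdir(CORPUS_DIR))
    mine = ensure_cached(my_chunk(all_files, socket.gethostname()))
    for iteration in range(30):         # ~30 passes over the same data
        for path in mine:
            pass                        # run the local training pass here

After the first pass each node reads only from its own disk, so the shared file server is hit once per chunk instead of once per iteration.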
Of course we have just one cluster for both things, so it constantly switches between being a beowulf and not being a beowulf :-) Best Regards Vaclav Hanzl From mwill at penguincomputing.com Wed Oct 27 09:29:07 2004 From: mwill at penguincomputing.com (Michael Will) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: <037701c4bc11$455857e0$6901a8c0@dolphinics.no> References: <38242de90410261208b9ae5f2@mail.gmail.com> <417EFA16.8020503@yahoo.com.sg> <037701c4bc11$455857e0$6901a8c0@dolphinics.no> Message-ID: <200410270929.07862.mwill@penguincomputing.com> On Wednesday 27 October 2004 03:39 am, Keith Murphy wrote: > Check out this url http://www.linuxlabs.com/clusgres.html they look like > they have a solution for scaleable Postgres > > Kindest Regards > > Keith Murphy > Dolphin Interconnect Hey Keith, that is a really cool link. What interconnect does that lock them into again, though? On a more serious side: They advertise a beowulf-with-shared-memory solution, which demands low latency high bandwidth interconnects, and AFAIK they only support Dolphin Interconnect (SCI). Has anybody tried their product yet and can comment on its efficiency and scalability ? It does sound promising for any SMP type software that does not run well on a cluster because of its lack of shared memory. Also check out mysql.com's in-ram database product - they created a database that is not relying on any shared memory, but instead redundandly distributes the data out onto a cluster, using RAM only and claiming to be really fast. http://www.mysql.com/products/cluster/ And then there is oracle that advertises together with Infinicon, HP and AMD they would have set a new TPC-H One-Terabyte record: http://www.oracle.com/corporate/press/home/index.html Michael > 818-292-5100 > kmurphy@dolphinics.com > www.dolphinics.com > ----- Original Message ----- > From: "Laurence Liew" > To: "Joshua Marsh" > Cc: > Sent: Wednesday, October 27, 2004 3:29 AM > Subject: Re: [Beowulf] High Performance for Large Database > > > > Hi, > > > > You may wish to search thru the beowulf list or google for "beowulf and > > databases and postgresql"... there were a couple of threads on exeactly > > this issue. > > > > Very briefly > > > > 1. Beowulf clusters CANNOT help make Postgresql or any databases run > > faster. You need the database code to be modified to do that (think > > Oracle 10g). I met a company at Supercomputer 03 last year that had > > Mysql running on a cluster... you may wish to query for them. > > > > 2. You could try to sponsor the development of a parallel postrgresql - > > talk to the postgresql development team... when I broached the idea in > > 1998.. there was some interest.. unfortunately.. I could not afford the > > development/sponsorship costs then. > > > > 3. Try running Postgresql on a cluster filesystem like PVFS - it is not > > gauranteed as it probably fails the ACID test for a SQL compliant > > database. The basic idea is that if we cannot parallelise the database - > > we make the underlying IO parallel and hence boost the IO performance of > > the system.. and any applications that run on them.. and this includes > > Postgresql. > > > > Hope this helps. > > > > Cheers! > > Laurence > > Scalable Systems > > Singapore > > > > > > > > Joshua Marsh wrote: > > > Hi all, > > > > > > I'm currently working on a project that will require fast access to > > > data stored in a postgreSQL database server. 
I've been told that a > > > Beowulf cluster may help increase performance. Since I'm not very > > > familar with Beowulf clusters, I was hoping that you might have some > > > advice or information on whether a cluster would increase performance > > > for a PostgreSQL database. The major tables accessed are around > > > 150-200 million records. On a stand alone server, it can take several > > > minutes to perform a simple select query. > > > > > > It seems like once we start pricing for servers with 16+ processors > > > and 64+ GB of RAM, the prices sky rocket. If I can acheive high > > > performance with a cluster, using 15-20 dual processor machines, that > > > would be great. > > > > > > Thanks for any help you may have! > > > > > > -Josh > > > _______________________________________________ > > > Beowulf mailing list, Beowulf@beowulf.org > > > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > > > > ---------------------------------------------------------------------------- > ---- > > > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Michael Will, Linux Sales Engineer NEWS: We have moved to a larger iceberg :-) NEWS: 300 California St., San Francisco, CA. Tel: 415-954-2822 Toll Free: 888-PENGUIN Fax: 415-954-2899 www.penguincomputing.com From eugen at leitl.org Thu Oct 28 01:50:18 2004 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Re: MySQL Cluster with SCI interconnect (fwd from tamada@acornnetworks.co.jp) Message-ID: <20041028085018.GN1457@leitl.org> Of tenuous relevance, but it's a cluster database which uses SCI for signalling fabric. MySQL 4.1 (just out) actually has cluster functionality built-in (but for Sun Solaris, which is a release bug, about to be fixed). For those of you who're interested in HA Linux, the next release is not far away, and removes the 2-node limitation: http://linuxha.trick.ca/ ----- Forwarded message from Junzo Tamada ----- From: "Junzo Tamada" Date: Thu, 28 Oct 2004 16:00:53 +0900 To: "Hugo Kohmann" Cc: "Hugo Kohmann" , Subject: Re: MySQL Cluster with SCI interconnect Reply-To: "Junzo Tamada" X-Mailer: Microsoft Outlook Express 6.00.2900.2180 Dear Hugo, Yes, I'm working on it. The latest code from Mikael works well with Ethernet. I built it with SCI yesterday and will test today. I'll get back to you after verification of MySQL Cluster with SCI. Thank you very much. /Junzo ----- Original Message ----- From: "Hugo Kohmann" To: "Junzo Tamada" Cc: "Hugo Kohmann" ; Sent: Thursday, October 28, 2004 4:02 AM Subject: Re: MySQL Cluster with SCI interconnect > >Dear Junzo, > >MySQL cluster has native support for SCI our you can use the SCI Socket >software. You will find more information about this and some >benchmark numbers at >http://dev.mysql.com/doc/mysql/en/MySQL_Cluster_Interconnects.html > >With SCI Socket, you can configure MySQL cluster for regular networking >and just use SCI Socket between the nodes that requires low latency/high >bandiwidth connections. 
SCI Socket has typically 10x lower latency than >Gigagit Ethernet and should support any version of MySQL Cluster that runs >properly over Ethernet. > >You will find the SCI Socket software and more information at >http://www.dolphinics.com/products/software/sci_sockets.html > >Best regards > >Hugo > > >On Tue, 26 Oct 2004, Junzo Tamada wrote: > >>Date: Tue, 26 Oct 2004 23:46:49 +0900 >>From: Junzo Tamada >>To: cluster@lists.mysql.com >>Subject: MySQL Cluster with SCI interconnect >> >>Hello >> >>Very recently I found sections in the chapter of Cluster, which is >>decsribing utilization of high-speed interconnection called SCI. >>I would like to test it. >>Currently I have totaly 5 servers and 4 out of 5 are equiped with SCI >>network card. >>Please anyone provide me any suggestions regarding the following cluster >>configuration. >>All server are connected together with Ethernet. >>Node a1 through a4 are with SCI. >>I would like to assign mgmd and api(mysqld) to front-end and ndbd for >>a[1-4]. >>Is this feasible and reasonable configuration ? >> >>I am using MySQL 4.1.7-gamma as of Oct. 26th, 2004. >> >>Thank you in advance. >>/Junzo >> >>-- >>MySQL Cluster Mailing List >>For list archives: http://lists.mysql.com/cluster >>To unsubscribe: >>http://lists.mysql.com/cluster?unsub=hugo@dolphinics.com >> > > >========================================================================================= >Hugo Kohmann | >Dolphin Interconnect Solutions AS | E-mail: >P.O. Box 150 Oppsal | hugo@dolphinics.com >N-0619 Oslo, Norway | Web: >Tel:+47 23 16 71 73 | http://www.dolphinics.com >Fax:+47 23 16 71 80 | Visiting Address: Olaf Helsets >vei 6 | > -- MySQL Cluster Mailing List For list archives: http://lists.mysql.com/cluster To unsubscribe: http://lists.mysql.com/cluster?unsub=eugen@leitl.org ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041028/f6fca634/attachment.bin From john at clustervision.com Thu Oct 28 02:00:48 2004 From: john at clustervision.com (john@clustervision.com) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Mac OS X and High Performance Heterogenous Environments - London Message-ID: <1098954048.4180b540b0544@www.unreal-inc.net> There has been interest on this list on MacOS clusters. This event in London on Monday should be of interest. The UK Unix Users Group is good at organising technically focussed events. I would say that this won't be a company product puff. http://www.ukuug.org/events/apple04/ Who: Jordan Hubbard (Apple), and other speakers When: Monday 1st November, 2004 10:30am start; 10am doors Where: Institute of Physics, 76 Portland Place, London Cost: Free entry although preregistration is required and places are strictly limited. You do not have to be a UKUUG member to attend Sadly I doubt if I will be able to attend. 
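Back on the large-database thread, here is a minimal sketch of the read-mostly scheme suggested earlier (a full copy of the database on every server, with the stream of read-only queries dealt out round robin). It assumes the psycopg2 driver; the replica host names and connection parameters are invented, and writes are deliberately left out because they would still have to go through a single master and be propagated separately.

# Sketch only: deal independent read-only queries out to replicas in
# round-robin order.  Host names and connection details are placeholders.
import itertools
import psycopg2

READ_REPLICAS = ["db1.example.org", "db2.example.org", "db3.example.org"]
_next_host = itertools.cycle(READ_REPLICAS)

def read_query(sql, params=()):
    """Send one read-only query to the next replica in the cycle."""
    host = next(_next_host)
    conn = psycopg2.connect(host=host, dbname="mydb", user="reader")
    try:
        cur = conn.cursor()
        cur.execute(sql, params)
        return cur.fetchall()
    finally:
        conn.close()

# e.g. rows = read_query("SELECT count(*) FROM big_table WHERE customer_id = %s", (42,))

This only helps for the workload several posters describe: a large number of independent SELECTs. An individual query still runs at single-node speed, and anything that writes needs its own consistent path to all of the copies.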
From jakob at unthought.net Thu Oct 28 03:03:56 2004 From: jakob at unthought.net (Jakob Oestergaard) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database In-Reply-To: <38242de90410261208b9ae5f2@mail.gmail.com> References: <38242de90410261208b9ae5f2@mail.gmail.com> Message-ID: <20041028100356.GN12752@unthought.net> On Tue, Oct 26, 2004 at 01:08:00PM -0600, Joshua Marsh wrote: > Hi all, > > I'm currently working on a project that will require fast access to > data stored in a postgreSQL database server. I've been told that a > Beowulf cluster may help increase performance. Since I'm not very > familar with Beowulf clusters, I was hoping that you might have some > advice or information on whether a cluster would increase performance > for a PostgreSQL database. The major tables accessed are around > 150-200 million records. On a stand alone server, it can take several > minutes to perform a simple select query. > > It seems like once we start pricing for servers with 16+ processors > and 64+ GB of RAM, the prices sky rocket. If I can acheive high > performance with a cluster, using 15-20 dual processor machines, that > would be great. It depends. I was involved in one project where we had some hosts doing a *massive* number of queries against postgres, but no or few updates. This parallelizes very well. A single quiery would not run faster, but when you run thousands of queries, running them against a cluster of postgresql databases will even out the load just nicely, giving you linear scaling (sustained queries per second versus machines in the cluster). I don't think you'll have any luck finding off-the-shelf production-quality database software that will parallelize a single query on a number of nodes. If you just want throughput, large numbers of queries on a large number of databases, and you are doing mostly selects with very few (if any) updates/inserts/deletes, then PostgreSQL comes with software that can help you mirror your database. What you do is, you have a 'master' database - you will perform all updates/deletes/inserts against this master. The master will relay updates to a number of slave databases. You perform all selects against the slaves. Simply, stable, and works perfectly within the limits inherent in such a setup (eg. a single query won't parallelize, the master cannot scale to more updates than what is possible on a single system, etc.) -- / jakob From Dennis_Currit at ATK.COM Thu Oct 28 10:09:00 2004 From: Dennis_Currit at ATK.COM (Currit, Dennis) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Newbie question. Message-ID: <7B8D37027B57D7459984035D83864E360682AC68@exchangeut1.atk.com> We are thinking of putting up a cluster to run MSC Nastran and have about $25,000 budgeted for hardware. Is this enough to get get started? Any suggestions as to what I should buy? Currently we are running large jobs on an older AIX multiprocessor system and smaller jobs on a Xeon 2.8 system, so I think even a small cluster should be an improvement. From landman at scalableinformatics.com Thu Oct 28 11:06:09 2004 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Updated NCBI rpms released for the 2.2.10 toolkit Message-ID: <41813511.6000507@scalableinformatics.com> Folks: We rebuilt the NCBI rpms for AMD64, i386, i586 (non-P4), athlon, and i686 (p4). 
Feel free to grab them from our site http://downloads.scalableinformatics.com/downloads/ncbi/ They are named NCBI-2.2.10-1.*.rpm, where * = {src,x86_64,i386,i586,i686} They were built on RHEL/SuSE/Fedora Core2 machines. Should install without problems (and use the source if you have problems). Please note that if you have a non-pentium4/non-athlon machine (PIII) you want the i586 or i386 version. If you have a pentium4 based machine, you want the i686 version. AMD64 (and probably EM64T) will use the x86_64. Athlon's will use the athlon version. Unless someone supplies me with G5 or Itanium2, I probably wont be able to do builds for those platforms. Enjoy, and as usual, bug reports/problems to us, not to NCBI. We built the RPMS, so if they are broken, we need to know. Joe -------- Original Message -------- Subject: [blast-announce] [ BLAST_Announce #044] BLAST 2.2.10 released Date: Thu, 28 Oct 2004 11:35:37 -0400 From: Mcginnis, Scott (NIH/NLM/NCBI) To: 'blast-announce@ncbi.nlm.nih.gov' Binaries can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.10/ Source code can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/old/20041020/ Additionally, NCBI now provides anoncvs access (http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=too lkit.section.cvs_external) to toolkit sources. A cvsweb source browser (http://www.ncbi.nlm.nih.gov/cvsweb/index.cgi/internal/c++/src/algo/blast/co re/) and doxygen documentation (http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/group__AlgoBlast.h tml) are also available. Notes for the 2.2.10 release New engine We have been rewriting and restructuring the BLAST engine in order to make BLAST more modular and extensible. bl2seq and megablast currently support the new engine; it can be enabled with the -V F option. Using the new engine may result in significant performance improvements in some cases. General changes -megablast now performs ungapped extensions in order to prevent suboptimal alignments -consolidated formatting code -removed fmerge.c -small fixes to sum statistics code -better error handling -fixed masking of translated queries -fixed several readdb threading bugs -improved protein neighbor generation -hsp sorting/inclusion fixes -many changes in HSP linking -several fixes for translated RPS blast BlastPGP -added code to spread out gap costs when extracting data from the sequence alignment to build PSSM -changed handling of all-zero columns of residue frequencies to use the underlying scoring matrix frequency ratios rather than scoring matrix's scores - disallowed an ungapped search if more than one iteration is specified scoremat.asn specification -added a new 'formatrpsdb' application; given a collection of Score-mat ASN.1 files, this application creates a database suitable for use with RPSBLAST -Simplified NCBI-ScoreMat specification to represent PSSMs instead of arbitrary scoring matrices. blastpgp and formatrpsdb can deal with this format. 
If you have any questions please write to blast-help@ncbi.nlm.nih.gov -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 612 4615 _______________________________________________ Bioclusters maillist - Bioclusters@bioinformatics.org https://bioinformatics.org/mailman/listinfo/bioclusters -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 612 4615 From rokrau at yahoo.com Thu Oct 28 12:06:40 2004 From: rokrau at yahoo.com (Roland Krause) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron Message-ID: <20041028190640.9782.qmail@web52907.mail.yahoo.com> Folks, does anybody here have positive or negative experiences with using the Intel EMT-64 Fortran compiler on AMD Opteron systems? I am at this point not so much interested in speed issues but more stability and correctness especially with respect to OpenMP. Or in other words: Is it worth trying yet? Best regards, Roland __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From ctierney at HPTI.com Thu Oct 28 12:39:52 2004 From: ctierney at HPTI.com (Craig Tierney) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron In-Reply-To: <20041028190640.9782.qmail@web52907.mail.yahoo.com> References: <20041028190640.9782.qmail@web52907.mail.yahoo.com> Message-ID: <1098992392.3052.83.camel@hpti9.fsl.noaa.gov> On Thu, 2004-10-28 at 13:06, Roland Krause wrote: > Folks, > does anybody here have positive or negative experiences with using the > Intel EMT-64 Fortran compiler on AMD Opteron systems? > Is it even going to work until the Opteron supports SSE3? I suspect if you don't vectorize, or only build 32-bit apps you will be ok. However, for most applications the vectorization is going to give you the big win. Craig > I am at this point not so much interested in speed issues but more > stability and correctness especially with respect to OpenMP. Or in > other words: Is it worth trying yet? > > Best regards, > Roland > > > > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pathscale.com Thu Oct 28 13:28:35 2004 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron In-Reply-To: <1098992392.3052.83.camel@hpti9.fsl.noaa.gov> References: <20041028190640.9782.qmail@web52907.mail.yahoo.com> <1098992392.3052.83.camel@hpti9.fsl.noaa.gov> Message-ID: <20041028202835.GB2227@greglaptop.internal.keyresearch.com> On Thu, Oct 28, 2004 at 01:39:52PM -0600, Craig Tierney wrote: > However, for most applications the vectorization > is going to give you the big win. People think that, but did you know that SIMD vectorization doesn't help any of the codes in SPECfp? Remember that the Opteron can use both fp pipes with scalar code. This is very different from the Pentium4. I'd say this myth is the #1 myth in the HPC industry right now. 
-- greg From csamuel at vpac.org Thu Oct 28 21:46:56 2004 From: csamuel at vpac.org (Chris Samuel) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] choosing a high-speed interconnect In-Reply-To: <1097615626.28704.129.camel@syru212-207.syr.edu> References: <1097611167.28704.104.camel@syru212-207.syr.edu> <416C4339.3040309@scalableinformatics.com> <1097615626.28704.129.camel@syru212-207.syr.edu> Message-ID: <200410291446.58866.csamuel@vpac.org> On Wed, 13 Oct 2004 07:13 am, Chris Sideroff wrote: > We run exclusively computational fluid dynamics on it. One program is > Fluent, the other is an in-house turbo-machinery code. My experiences so > far have led me to believe Fluent is much more sensitive to the > network's performance than the in-house program. Thus my inquiry into a > higher performance network. Fluent is very latency sensitive, and apparently the next release of Fluent will support Myrinet on Opteron, which will be nice to see. The list of technologies that they support is at: http://www.fluent.com/about/news/newsletters/03v12i2_fall/img/a26i1_lg.gif Yes, that really is an image.. :-/ Chris -- Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20041029/6ec8516d/attachment.bin From Glen.Gardner at verizon.net Thu Oct 28 18:37:17 2004 From: Glen.Gardner at verizon.net (Glen Gardner) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Newbie question. References: <7B8D37027B57D7459984035D83864E360682AC68@exchangeut1.atk.com> Message-ID: <41819ECD.1080007@verizon.net> $25K will build a nice cluster. I'd be tempted to go with something in small cube cases with on-board gigabit. P4 is good, Xeon is very good, as is Opteron. Mac is an option, but costs a lot more. Depending on how you go about it, $25K ought to be enough money to build a really nice 16 node cluster and maybe you can throw in 4-5 cheap machines to use as PVFS I/O servers. It all depends on how much you want to buy pre-assembled and how much you are willing to build yourself. Glen Currit, Dennis wrote: >We are thinking of putting up a cluster to run MSC Nastran and have about >$25,000 budgeted for hardware. Is this enough to get get started? Any >suggestions as to what I should buy? Currently we are running large jobs on >an older AIX multiprocessor system and smaller jobs on a Xeon 2.8 system, so >I think even a small cluster should be an improvement. >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > -- Glen E. Gardner, Jr. AA8C AMSAT MEMBER 10593 Glen.Gardner@verizon.net http://members.bellatlantic.net/~vze24qhw/index.html From i.kozin at dl.ac.uk Fri Oct 29 02:17:19 2004 From: i.kozin at dl.ac.uk (Kozin, I (Igor)) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron Message-ID: Hi Roland, I think it is. It seems like 8.1 has much better OpenMP support than the previous versions although it is still not perfect. As far as I know it works on Opterons. Best, Igor I.
Kozin (i.kozin at dl.ac.uk) CCLRC Daresbury Laboratory tel: 01925 603308 http://www.cse.clrc.ac.uk/disco > -----Original Message----- > From: Roland Krause [mailto:rokrau@yahoo.com] > Sent: 28 October 2004 20:07 > To: beowulf@beowulf.org > Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron > > > Folks, > does anybody here have positive or negative experiences with using the > Intel EMT-64 Fortran compiler on AMD Opteron systems? > > I am at this point not so much interested in speed issues but more > stability and correctness especially with respect to OpenMP. Or in > other words: Is it worth trying yet? > > Best regards, > Roland > > > > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) > visit http://www.beowulf.org/mailman/listinfo/beowulf > From kus at free.net Fri Oct 29 08:55:26 2004 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron In-Reply-To: <20041028202835.GB2227@greglaptop.internal.keyresearch.com> Message-ID: In message from Greg Lindahl (Thu, 28 Oct 2004 13:28:35 -0700): >On Thu, Oct 28, 2004 at 01:39:52PM -0600, Craig Tierney wrote: > >> However, for most applications the vectorization >> is going to give you the big win. > >People think that, but did you know that SIMD vectorization doesn't >help any of the codes in SPECfp? It's interesting ! Opteron SPECfp2000 results obtained w/help of PGI 5.1-3 includes -fastsse copmiler option. SPECfp2000 results (for Opteron) based on old ifc 7.0 compiler include options like -xW which allow to create SIMD instructions. Etc. There is 2 possibilities a) These compilers didn't generate SSE2-containing codes for any program from SPECfp2000 - what looks strange for me b) In the case we'll re-translate the source of SPECfp2000 w/suppression of SSE commands generation, performance results will be the same. Do I understand you correctly, that you say about case b) ? BTW, if I remember correctly, ATLAS dgemm codes for Opteron are better if they are using SIMD fp operations - but of course, it's "out of SPECfp2000 codes" > Remember that the Opteron can use >both fp pipes with scalar code. This is very different from the >Pentium4. Yes, but 32-bit ifc compilers (which don't know about Opteron microarchitecture) gave better results than pgi compilers oriented to "right" microarchitecture. Of course, I don't say about yours PathScale compilers which usually are the best (in the perofrmance of codes generated) but too expensive :-( . Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow >I'd say this myth is the #1 myth in the HPC industry right >now. > From kus at free.net Fri Oct 29 08:58:07 2004 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron In-Reply-To: <1098992392.3052.83.camel@hpti9.fsl.noaa.gov> Message-ID: In message from Craig Tierney (Thu, 28 Oct 2004 13:39:52 -0600): >On Thu, 2004-10-28 at 13:06, Roland Krause wrote: >> Folks, >> does anybody here have positive or negative experiences with using >>the >> Intel EMT-64 Fortran compiler on AMD Opteron systems? >> > >Is it even going to work until the Opteron supports SSE3? 
You may generate the codes w/o SSE3 commands and w/64-bit support. Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow >I suspect if you don't vectorize, or only build 32-bit apps >you will be ok. However, for most applications the vectorization >is going to give you the big win. > >Craig > >> I am at this point not so much interested in speed issues but more >> stability and correctness especially with respect to OpenMP. Or in >> other words: Is it worth trying yet? > > >> >> Best regards, >> Roland >> >> >> >> >> __________________________________________________ >> Do You Yahoo!? >> Tired of spam? Yahoo! Mail has the best spam protection around >> http://mail.yahoo.com >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >>http://www.beowulf.org/mailman/listinfo/beowulf > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf From kus at free.net Fri Oct 29 09:09:46 2004 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron In-Reply-To: <20041028190640.9782.qmail@web52907.mail.yahoo.com> Message-ID: In message from Roland Krause (Thu, 28 Oct 2004 12:06:40 -0700 (PDT)): >Folks, >does anybody here have positive or negative experiences with using >the >Intel EMT-64 Fortran compiler on AMD Opteron systems? We have very small ifort/8.1.023 experiense on our Opteron - it's not enough to say about compilers comparison. But you may found comparison results at //www.polyhedron.com site. As I remember, best results are for PathScale, which is really the best in a lot of tests, and ifort is on the second position. But you should take into account that ifort 8.1.023 at some highest optimization level compiler keys generate the codes, which check (at the run time) the processor used and will not work on Opteron. > >I am at this point not so much interested in speed issues but more >stability and correctness especially with respect to OpenMP. According our experience w/ifc -7.1, it is realtive stable. I beleive ifort-8.1 will be also good. But we didn't use OpenMP in our application programs (we checked OpenMP only on some tests). Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow > Or in >other words: Is it worth trying yet? > >Best regards, >Roland > > > > >__________________________________________________ >Do You Yahoo!? >Tired of spam? Yahoo! Mail has the best spam protection around >http://mail.yahoo.com >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf From mack.joseph at epa.gov Fri Oct 29 05:38:01 2004 From: mack.joseph at epa.gov (Joseph Mack) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] High Performance for Large Database References: <38242de90410261208b9ae5f2@mail.gmail.com> <20041028100356.GN12752@unthought.net> Message-ID: <418239A9.70C3A907@epa.gov> Jakob Oestergaard wrote: > > On Tue, Oct 26, 2004 at 01:08:00PM -0600, Joshua Marsh wrote: > > Hi all, > > > > I'm currently working on a project that will require fast access to > > data stored in a postgreSQL database server. I've been told that a > > Beowulf cluster may help increase performance. 
The Linux Virtual Server (LVS) project www.linuxvirtualserver.org is a load balancer which allows multiple requests to be balanced amongst a set of backend machines. It works perfectly for readonly. If clients write to the backend machines, then the updates have to be propagated to the other backend machines, and you have to do this outside LVS. If your usage is read mostly and you want a low cost solution, then LVS will do what you want. If you want a real parallel database, be prepared to pay lots of money to Oracle. disclaimer: I'm part of the LVS project Joe -- Joseph Mack PhD, High Performance Computing & Scientific Visualization LMIT, Supporting the EPA Research Triangle Park, NC 919-541-0007 Federal Contact - John B. Smith 919-541-1087 - smith.johnb@epa.gov From nick at brealey.org Fri Oct 29 02:13:36 2004 From: nick at brealey.org (Nicholas Brealey) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron In-Reply-To: <20041028190640.9782.qmail@web52907.mail.yahoo.com> References: <20041028190640.9782.qmail@web52907.mail.yahoo.com> Message-ID: <418209C0.8090007@brealey.org> Roland Krause wrote: > Folks, > does anybody here have positive or negative experiences with using the > Intel EMT-64 Fortran compiler on AMD Opteron systems? > > I am at this point not so much interested in speed issues but more > stability and correctness especially with respect to OpenMP. Or in > other words: Is it worth trying yet? > Take a look at the 64 bit AMD Opteron benchmarks results at http://www.polyhedron.com/ and google comp.lang.fortran. It seemed to be able to run all the benchmarks correctly. The Polyhedron benchmarks showed the Intel compiler coming in just behind the Pathscale compiler in 64 bit mode on an Opteron. The benchmarks don't use OpenMP though. Nick From rsweet at aoes.com Fri Oct 29 02:18:59 2004 From: rsweet at aoes.com (Ryan Sweet) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Newbie question. In-Reply-To: <7B8D37027B57D7459984035D83864E360682AC68@exchangeut1.atk.com> References: <7B8D37027B57D7459984035D83864E360682AC68@exchangeut1.atk.com> Message-ID: On Thu, 28 Oct 2004, Currit, Dennis wrote: > We are thinking of putting up a cluster to run MSC Nastran and have about > $25,000 budgeted for hardware. Have you also discussed licensing with MSC? It may have changed recently, as my experience here is a few months old, but Distributed Parallel features are licensed differently than the SMP Parallel features, and are significantly more expensive. > Is this enough to get get started? Certainly it is enough to buy a few nodes, but also if you are just getting started with clustering it may be good to think a couple of times about the direction you want to go and to analyse the associated costs of things like additional power consumption, air-conditioning, etc.... If you already have a sufficient server room infrastructure (racks, power, ac) then perhaps this isn't much of a consideration. > Any suggestions as to what I should buy? We are currently running NASTRAN with success on opteron, though we have also had good experience with PIV and Athlon MPs. I think the best thing to do is to try and benchmark your actual jobs on a few different systems if you can. > Currently we are running large jobs on > an older AIX multiprocessor system and smaller jobs on a Xeon 2.8 system, so > I think even a small cluster should be an improvement. It depends upon the jobs that you are running. 
If your jobs do benefit well from running in SMP on the AIX system, then you may also have good efficiency with DMP on a cluster. We have had mixed results with our structural analyses: some had near linear speedups on smallish (<16 cpus) cluster runs, and others would gain only about 20% for each cpu added. For many of our NASTRAN runs high speed disk IO is nearly as important as as the CPU. Find out what your job mix is like, and then spend your money accordingly (you may want to splash on fast SATA local disks for scratch space, or on more RAM to use RAM disks for scratch). Also have the engineers read through the manual regarding the DMP options because you have to pay a bit more attention to how your jobs are configured when using distributed parallel. good luck, -Ryan -- Ryan Sweet Advanced Operations and Engineering Services AOES Group BV http://www.aoes.com Phone +31(0)71 5795521 Fax +31(0)71572 1277 From Serguei.Patchkovskii at sympatico.ca Thu Oct 28 15:56:48 2004 From: Serguei.Patchkovskii at sympatico.ca (Serguei.Patchkovskii@sympatico.ca) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron In-Reply-To: <20041028190640.9782.qmail@web52907.mail.yahoo.com> Message-ID: <418140F0.24192.5C7CF22@localhost> On 28 Oct 2004 at 12:06, Roland Krause wrote: > does anybody here have positive or negative experiences with using the > Intel EMT-64 Fortran compiler on AMD Opteron systems? It works fine, as long as you do not use Prescott new instructions (which are in any event of less importance on Opterons - they are not as decode-crippled as P4s are), I've built a few moderately complex quantum chemistry codes with EM64 ifort, and they run OK on Opterons. > I am at this point not so much interested in speed issues but more > stability and correctness especially with respect to OpenMP. Or in > other words: Is it worth trying yet? I can't comment on OpenMP, but for serial code it is a nice, fast and reasonably stable compiler. Not as stable or as fast as Pathscale's, but far superior to PGI's. As usual, YMMV - and probably will. Serguei From vinicius at centrodecitricultura.br Fri Oct 29 05:33:23 2004 From: vinicius at centrodecitricultura.br (Vinicius) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] error in tstmachines In-Reply-To: <20041025.125039.103756653.lusk@localhost> References: <1098701880.2791.287.camel@nuts.clc.cuhk.edu.hk> <417D2503.3060805@staff.uni-marburg.de> <20041025.125039.103756653.lusk@localhost> Message-ID: <1099053203.11351.2.camel@swingle> help me!!!! rsh ok: [swingle@swingle bin]$ /usr/bin/rsh swingle3 Last login: Tue Oct 26 16:26:46 from swingle [swingle@swingle3 swingle]$ [swingle@swingle bin]$ ./tstmachines Errors while trying to run /usr/bin/rsh swingle3 -n /bin/ls /home/swingle/programs/mpich-1.2.6/bin/mpichfoo Unexpected response from swingle3: --> /bin/ls: /home/swingle/programs/mpich-1.2.6/bin/mpichfoo: No such file or directory The ls test failed on some machines. This usually means that you do not have a common filesystem on all of the machines in your machines list; MPICH requires this for mpirun (it is possible to handle this in a procgroup file; see the documentation for more details). Other possible problems include: The remote shell command /usr/bin/rsh does not allow you to run ls. See the documentation about remote shell and rhosts. You have a common file system, but with inconsistent names. See the documentation on the automounter fix. 
1 errors were encountered while testing the machines list for LINUX Only these machines seem to be available swingle From hahn at physics.mcmaster.ca Fri Oct 29 17:36:37 2004 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Intel 64bit (emt) Fortran code and AMD Opteron In-Reply-To: <1098992392.3052.83.camel@hpti9.fsl.noaa.gov> Message-ID: > > does anybody here have positive or negative experiences with using the > > Intel EMT-64 Fortran compiler on AMD Opteron systems? > > Is it even going to work until the Opteron supports SSE3? what SSE3 adds over SSE2 is remarkably minor. > I suspect if you don't vectorize, or only build 32-bit apps > you will be ok. However, for most applications the vectorization > is going to give you the big win. the big win is getting away from the x87 FP stack. vectorization is a wonderful thing, but practically any FP code will see a nice speedup with purely scalar SSE usage (such as you'd get with current gcc.) regards, mark hahn. From gvinodh1980 at yahoo.co.in Sat Oct 30 00:38:35 2004 From: gvinodh1980 at yahoo.co.in (Vinodh) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] MPICH fault handling Message-ID: <20041030073835.95072.qmail@web8503.mail.in.yahoo.com> hello, i established a four node beowulf cluster using MPICH. while testing, i started mpd daemon in all the nodes from the master by mpdboot, then i unplugged one slave node from LAN, and now i tried to execute a program using mpiexec, the master node is not recognising that one of the node has failed. then i checked in www.beowulf.org - Archives, the last discussion about the mpi node failure was at Jan - 2003. so now i want to know, whether there is any update of MPI fault handling. what can i do if 1. any slave node fails. 2. master node fails. __________________________________ Do you Yahoo!? Yahoo! Mail Address AutoComplete - You start. We finish. http://promotions.yahoo.com/new_mail From mechti01 at luther.edu Fri Oct 29 23:02:46 2004 From: mechti01 at luther.edu (Timo Mechler) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Rocks Cluster and 2 Ethernet networks Message-ID: <2937.172.22.17.130.1099116166.squirrel@172.22.17.130> Hi all, I'm considering installing the Rocks cluster distro on a cluster that uses only ethernet. As I understand it, eth0 (or first network interface) is used for administration and also message passing if no other high speed interface is present (e.g. myrinet). My question is, if each of my compute node have two ethernet interfaces, say eth0 and eth1, can the cluster be configured that message passing takes place only over eth1? It would be nice to have an interface devoted to just message passing. If it is possible, how would I go about setting it up? If it's not possible, is there are a lot of performance loss due to the fact that other tasks (such as administration, etc.) are also taking place over eth0? Thanks in advance for your help. -Timo Mechler -- Timo R. Mechler mechti01@luther.edu From jcandy at san.rr.com Sat Oct 30 12:41:47 2004 From: jcandy at san.rr.com (Jeff Candy) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] PVFS on 80 proc (40 node) cluster Message-ID: <4183EE7B.7050804@san.rr.com> Greetings, Does anyone have experience with PVFS on a cluster in the range of 80 processors (40 dual nodes with gigE)? I am considering this over the usual NFS-master node stup since we expect to multiple users/jobs running concurrently. I am interested to hear any information/horror stories, etc. 
Thanks, Jeff From reuti at staff.uni-marburg.de Sat Oct 30 14:45:42 2004 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Rocks Cluster and 2 Ethernet networks In-Reply-To: <2937.172.22.17.130.1099116166.squirrel@172.22.17.130> References: <2937.172.22.17.130.1099116166.squirrel@172.22.17.130> Message-ID: <1099172742.41840b869e823@home.staff.uni-marburg.de> Hi, > I'm considering installing the Rocks cluster distro on a cluster that uses > only ethernet. As I understand it, eth0 (or first network interface) is > used for administration and also message passing if no other high speed > interface is present (e.g. myrinet). My question is, if each of my > compute node have two ethernet interfaces, say eth0 and eth1, can the > cluster be configured that message passing takes place only over eth1? It > would be nice to have an interface devoted to just message passing. If it > is possible, how would I go about setting it up? If it's not possible, is > there are a lot of performance loss due to the fact that other tasks (such > as administration, etc.) are also taking place over eth0? Thanks in > advance for your help. do you want to use the ch_p4 device of MPICH for communication? Then you simply have to set the machinefile for mpirun to include only the names of the second interface in all nodes. Maybe your queuingsystem can do this already for you. Furthermore, you have to change the setting in mpirun.args that way, that instead: MPI_HOST=`hostname` will be substituded with the name of the second interface. E.g. MPI_HOST=`hostname | sed "s/^node/internal/"` to change the name from node001 to internal001 or whatever names you use. Otherwise your machinefile will be scanned in a wrong way (wrong distribution of the processes to the nodes in the end), and the communication back from the slaves to the head node of the job will still use the wrong interface. You can simply include this at the beginning of the mpirun.arg file. If it's already set, it will no be set later in the script. Cheers - Reuti From reuti at staff.uni-marburg.de Sat Oct 30 15:50:48 2004 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] PVFS on 80 proc (40 node) cluster In-Reply-To: <4183EE7B.7050804@san.rr.com> References: <4183EE7B.7050804@san.rr.com> Message-ID: <1099176648.41841ac8ad892@home.staff.uni-marburg.de> Hi, > Does anyone have experience with PVFS on a cluster > in the range of 80 processors (40 dual nodes with > gigE)? > > I am considering this over the usual NFS-master > node stup since we expect to multiple users/jobs > running concurrently. on the one hand it sounds interesting. I would fear that in a cluster (where each node should do heavy calculations and use the own disk for local scratch data) the performance will be worse than a dedicated file server with a RAID. What programs will your cluster run and how are the users submitting the jobs? - Reuti From jcandy at san.rr.com Sat Oct 30 21:14:43 2004 From: jcandy at san.rr.com (Jeff Candy) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] PVFS on 80 proc (40 node) cluster In-Reply-To: <1099176648.41841ac8ad892@home.staff.uni-marburg.de> References: <4183EE7B.7050804@san.rr.com> <1099176648.41841ac8ad892@home.staff.uni-marburg.de> Message-ID: <418466B3.4000809@san.rr.com> Jeff: >>Does anyone have experience with PVFS on a cluster >>in the range of 80 processors (40 dual nodes with >>gigE)? 
>> >>I am considering this over the usual NFS-master >>node stup since we expect to multiple users/jobs >>running concurrently. Reuti: > on the one hand it sounds interesting. I would fear that in a cluster (where > each node should do heavy calculations and use the own disk for local scratch > data) the performance will be worse than a dedicated file server with a RAID. > What programs will your cluster run and how are the users submitting the jobs? - the program is a large physics code that does I/O (200KB or less) every 10 to 60 sec. Every 10min or so, a 100MB file is written. - users will submit with PBS (typically, I expect <= 3 jobs to run concurrently). - I want a *single* filesystem, so no local scratch will be used. Are you in favour of a single master with a RAID filesystem, NFS mounted by all nodes? I wonder what fraction of systems now use this scheme. Thanks for your input. Jeff From mechti01 at luther.edu Sat Oct 30 21:54:41 2004 From: mechti01 at luther.edu (mechti01) Date: Wed Nov 25 01:03:31 2009 Subject: [Beowulf] Rocks Cluster and 2 Ethernet networks In-Reply-To: <1099172742.41840b869e823@home.staff.uni-marburg.de> References: <2937.172.22.17.130.1099116166.squirrel@172.22.17.130> <1099172742.41840b869e823@home.staff.uni-marburg.de> Message-ID: <1152.172.22.17.130.1099198481.squirrel@172.22.17.130> Hi Reuti, Thanks for your help. I have not installed Rocks just yet. Can you explain to me what the ch_p4 device of MPI_CH is? Nochmals, vielen Dank! -Timo > Hi, > >> I'm considering installing the Rocks cluster distro on a cluster that >> uses >> only ethernet. As I understand it, eth0 (or first network interface) is >> used for administration and also message passing if no other high speed >> interface is present (e.g. myrinet). My question is, if each of my >> compute node have two ethernet interfaces, say eth0 and eth1, can the >> cluster be configured that message passing takes place only over eth1? >> It >> would be nice to have an interface devoted to just message passing. If >> it >> is possible, how would I go about setting it up? If it's not possible, >> is >> there are a lot of performance loss due to the fact that other tasks >> (such >> as administration, etc.) are also taking place over eth0? Thanks in >> advance for your help. > > do you want to use the ch_p4 device of MPICH for communication? Then you > simply > have to set the machinefile for mpirun to include only the names of the > second > interface in all nodes. Maybe your queuingsystem can do this already for > you. > Furthermore, you have to change the setting in mpirun.args that way, that > instead: > > MPI_HOST=`hostname` > > will be substituded with the name of the second interface. E.g. > > MPI_HOST=`hostname | sed "s/^node/internal/"` > > to change the name from node001 to internal001 or whatever names you use. > Otherwise your machinefile will be scanned in a wrong way (wrong > distribution > of the processes to the nodes in the end), and the communication back from > the > slaves to the head node of the job will still use the wrong interface. You > can > simply include this at the beginning of the mpirun.arg file. If it's > already > set, it will no be set later in the script. 
From mechti01 at luther.edu Sat Oct 30 21:54:41 2004
From: mechti01 at luther.edu (mechti01)
Date: Wed Nov 25 01:03:31 2009
Subject: [Beowulf] Rocks Cluster and 2 Ethernet networks
In-Reply-To: <1099172742.41840b869e823@home.staff.uni-marburg.de>
References: <2937.172.22.17.130.1099116166.squirrel@172.22.17.130> <1099172742.41840b869e823@home.staff.uni-marburg.de>
Message-ID: <1152.172.22.17.130.1099198481.squirrel@172.22.17.130>

Hi Reuti,

Thanks for your help. I have not installed Rocks just yet. Can you explain to me
what the ch_p4 device of MPICH is? Thanks again!

-Timo

> Hi,
>
>> I'm considering installing the Rocks cluster distro on a cluster that uses
>> only ethernet. As I understand it, eth0 (or the first network interface) is
>> used for administration and also for message passing if no other high-speed
>> interface is present (e.g. Myrinet). My question is: if each of my
>> compute nodes has two ethernet interfaces, say eth0 and eth1, can the
>> cluster be configured so that message passing takes place only over eth1? It
>> would be nice to have an interface devoted to just message passing. If it
>> is possible, how would I go about setting it up? If it's not possible, is
>> there a lot of performance loss due to the fact that other tasks (such
>> as administration, etc.) are also taking place over eth0? Thanks in
>> advance for your help.
>
> Do you want to use the ch_p4 device of MPICH for communication? Then you simply
> have to set the machinefile for mpirun to include only the names of the second
> interface on all nodes. Maybe your queuing system can already do this for you.
> Furthermore, you have to change the setting in mpirun.args so that, instead of:
>
> MPI_HOST=`hostname`
>
> the name of the second interface is substituted, e.g.:
>
> MPI_HOST=`hostname | sed "s/^node/internal/"`
>
> to change the name from node001 to internal001, or whatever names you use.
> Otherwise your machinefile will be scanned in the wrong way (leading to a wrong
> distribution of the processes to the nodes), and the communication back from the
> slaves to the head node of the job will still use the wrong interface. You can
> simply include this at the beginning of the mpirun.args file; if MPI_HOST is
> already set there, it will not be set again later in the script.
>
> Cheers - Reuti
> --

From reuti at staff.uni-marburg.de Sun Oct 31 01:54:10 2004
From: reuti at staff.uni-marburg.de (Reuti)
Date: Wed Nov 25 01:03:31 2009
Subject: [Beowulf] Rocks Cluster and 2 Ethernet networks
In-Reply-To: <1152.172.22.17.130.1099198481.squirrel@172.22.17.130>
References: <2937.172.22.17.130.1099116166.squirrel@172.22.17.130> <1099172742.41840b869e823@home.staff.uni-marburg.de> <1152.172.22.17.130.1099198481.squirrel@172.22.17.130>
Message-ID: <1099216450.4184b6425cef7@home.staff.uni-marburg.de>

Hi Timo,

> Thanks for your help. I have not installed Rocks just yet. Can you
> explain to me what the ch_p4 device of MPICH is? Thanks again!

MPI is a standardized interface for writing parallel programs. MPICH is one
implementation of this standard (there are others, including commercial ones).
Inside MPICH you have different "devices" for the communication between the
nodes to choose from, so you can pick the one that best fits your computer
system and network. The ch_p4 device is the one that uses the p4 communication
library. It uses rsh/ssh to start the tasks on the nodes; other devices need a
special daemon running on each node. All the programs we use rely on the ch_p4
device.

Cheers - Reuti
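To make the ch_p4 workflow concrete, a minimal session might look like the sketch below. The program name, host names, and the choice of ssh are assumptions for illustration (ch_p4 starts remote tasks with rsh by default; the P4_RSHCOMMAND environment variable selects an alternative).

    # build against MPICH with its compiler wrapper
    mpicc -o my_program my_program.c

    # have ch_p4 start the remote processes with ssh instead of rsh
    P4_RSHCOMMAND=ssh
    export P4_RSHCOMMAND

    # machines: one host name per line, e.g. the eth1 names from the
    # earlier posting if the traffic should stay on the second network
    mpirun -np 4 -machinefile machines ./my_program

No daemon is needed on the compute nodes for this device; each task is started over rsh/ssh, which is why passwordless login from the head node to every node is required.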
From reuti at staff.uni-marburg.de Sun Oct 31 04:29:43 2004
From: reuti at staff.uni-marburg.de (Reuti)
Date: Wed Nov 25 01:03:31 2009
Subject: [Beowulf] PVFS on 80 proc (40 node) cluster
In-Reply-To: <418466B3.4000809@san.rr.com>
References: <4183EE7B.7050804@san.rr.com> <1099176648.41841ac8ad892@home.staff.uni-marburg.de> <418466B3.4000809@san.rr.com>
Message-ID: <1099225783.4184dab7129b7@home.staff.uni-marburg.de>

> Reuti:
>
> > What programs will your cluster run, and how are the users submitting the
> > jobs?
>
> Jeff:
>
> - The program is a large physics code that does I/O
> (200KB or less) every 10 to 60 sec. Every 10 min or
> so, a 100MB file is written.

That is completely different from our requirements. We share /home with the
(small) input files, and each node needs a large local /scratch space (100GB
and more).

> - I want a *single* filesystem, so no local scratch
> will be used.

A single file system for /home and /scratch (will your software need a common
/scratch space)? Is there any fault tolerance in PVFS in case a disk or node
fails? Another option could be IBM's GPFS if you need a big and fast common
file space.

Cheers - Reuti

From brian at cypher.acomp.usf.edu Sun Oct 31 19:14:44 2004
From: brian at cypher.acomp.usf.edu (Brian Smith)
Date: Wed Nov 25 01:03:31 2009
Subject: [Beowulf] PVFS on 80 proc (40 node) cluster
In-Reply-To: <4183EE7B.7050804@san.rr.com>
References: <4183EE7B.7050804@san.rr.com>
Message-ID: <1099278884.24797.17.camel@ava>

Jeff,

You should definitely consider PVFS or another parallel filesystem over NFS
mounting for concurrent scratch space. I read your requirements, the number of
writes, etc., in another post, and those writes would likely flood even the most
respectable file server. PVFS2 has much-improved fault tolerance over PVFS1 in
that there can be redundant file nodes, whereas with PVFS1, if one node dropped
dead, your FS was toast. If you go to their web site, there should be plenty of
documentation on how to set it up. You may also want to consider investigating
GFS from Red Hat and Lustre.

Brian Smith

On Sat, 2004-10-30 at 12:41 -0700, Jeff Candy wrote:
> Greetings,
>
> Does anyone have experience with PVFS on a cluster
> in the range of 80 processors (40 dual nodes with
> gigE)?
>
> I am considering this over the usual NFS-master
> node setup since we expect multiple users/jobs
> running concurrently.
>
> I am interested to hear any information/horror
> stories, etc.
>
> Thanks,
>
> Jeff
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf