AMP-USERS Archives

AMP-USERS@LISTSERV.BROWN.EDU


Subject: Re: Getting AMP to work on a PBS system
From: Ben Comer <[log in to unmask]>
Reply-To: Amp Users List <[log in to unmask]>
Date: Fri, 28 Jul 2017 09:46:20 -0400
Content-Type: text/plain

Andy,

Thanks for the insight. As you hypothesized, the problem was that the worker nodes were not inheriting the parent environment. Passing in an `envcommand` appears to have made the code work. And yes, I was talking about running on multiple nodes.

The error files look clean, so I think that fixed it. I'll let you know if I find a way the code can be improved once I work through this.
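
For anyone hitting the same issue, here is a minimal sketch of one way to build such an `envcommand` string from the environment of the master process. `build_envcommand` is a hypothetical helper, not part of Amp, and the choice of `PATH` and `PYTHONPATH` is an assumption; it also assumes that `envcommand` is simply a shell command run on each worker before the worker starts, as this thread suggests.

```python
import os
import shlex


def build_envcommand(names=("PATH", "PYTHONPATH"), environ=None):
    """Return a shell snippet that re-exports the selected variables.

    Hypothetical helper (not part of Amp): it only builds a string from
    the given mapping, defaulting to the current process environment.
    Variables that are not set are skipped.
    """
    environ = os.environ if environ is None else environ
    exports = [
        "export {}={}".format(name, shlex.quote(environ[name]))
        for name in names
        if name in environ
    ]
    return "; ".join(exports)


if __name__ == "__main__":
    # Print the command so it can be inspected before handing it to Amp
    # through the envcommand keyword discussed in this thread.
    print(build_envcommand())
```

The values are shell-quoted so that paths containing spaces survive the trip through ssh. If your site uses environment modules, a hand-written command that loads the right modules may be a better fit than re-exporting variables.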

On 07/23/2017 10:16 AM, Peterson, Andrew wrote:
> Ben,
>
> If the code had gotten one step further, it would have generated worker-specific stderr files and this would be easier to debug. ;)
>
> The code assumes you can ssh freely between nodes, and that the environment on the worker nodes is inherited from the master node. To check this, you might want to debug in a bash script: see if you can ssh to the nodes assigned to your job. Once you are on those nodes, run `printenv` to see if your Amp is listed in your Python path. I'm guessing it's one of those two problems. If it's the former, you may need to talk to your system administrator. If it's the latter, you might be able to fix it with the `envcommand` (described here).
>
> Also, I'm assuming from your email that this is only a problem when trying to run on multiple nodes. Is that true? E.g., if you manually specify cores=8, which keeps it on the single node, does it work?
>
> Let us know how it goes, and hopefully with your help we can update the code to make this work better in the future.
>
> Andy
>
> On Fri, Jul 21, 2017 at 4:42 PM, Ben Comer <[log in to unmask]> wrote:
>
>     Hey all, I wrote a little bit of code to return the correct value for the cores variable, so that it gives the names of the nodes and the number of cores on each for our PBS queuing system. However, this alone did not solve the problem. When the system runs start_workers, the code times out at the line containing ssh.expect('<amp-connect>'). Could someone provide insight into how this system works and what might be going wrong?
>
>     Thanks,
>     Ben Comer
>     Georgia Tech
>
> -- 
> Andrew Peterson
> Assistant Professor
> Brown University School of Engineering
> Barus & Holley 247
> 184 Hope Street
> Providence, RI 02912
> (401) 863-2153
> http://brown.edu/go/catalyst
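
The two checks discussed in this thread (which nodes and cores the job was given, and whether each node can be reached with a usable environment) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Amp's own code: the function names are hypothetical, the node file is assumed to list one hostname per allocated core (the usual layout of the file named by `PBS_NODEFILE`), and the resulting dictionary of node name to core count is assumed to match what the `cores` variable expects.

```python
import os
from collections import Counter


def cores_from_pbs_nodefile(path=None):
    """Count how many times each host appears in a PBS node file.

    Hypothetical helper: assumes one hostname per line, repeated once
    per allocated core. Blank lines are ignored.
    """
    path = path or os.environ["PBS_NODEFILE"]
    with open(path) as handle:
        hosts = [line.strip() for line in handle if line.strip()]
    return dict(Counter(hosts))


def env_check_commands(cores):
    """Build one ssh command per node that prints the remote PYTHONPATH.

    BatchMode=yes makes ssh fail instead of prompting for a password,
    which is what "ssh freely between nodes" requires.
    """
    return [
        ["ssh", "-o", "BatchMode=yes", host, "printenv", "PYTHONPATH"]
        for host in sorted(cores)
    ]


if __name__ == "__main__" and "PBS_NODEFILE" in os.environ:
    cores = cores_from_pbs_nodefile()
    print(cores)
    for command in env_check_commands(cores):
        # Print rather than run, so the commands can be tried by hand.
        print(" ".join(command))
```

Run inside a job script, this prints the node-to-cores mapping and the ssh commands to try. A command that hangs or fails points to an ssh problem; one that succeeds but does not show the Amp directory points to an environment problem.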
