Andy,

Thanks for the insight. The problem was that the worker nodes were not inheriting the parent environment, as you hypothesized. Passing in an 'envcommand' appears to have made the code work. And yes, I was talking about running on multiple nodes.

The error files look clean, so I think that made it work. I'll let you know if there is a way the code can be improved once I work through this.
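
For anyone hitting the same timeout, here is a minimal sketch of the environment check discussed in this thread: ssh to a worker node, run `printenv`, and see whether Amp shows up in PYTHONPATH. The helper names (`remote_printenv`, `pythonpath_entries`, `amp_on_path`) and the "amp" substring match are illustrative assumptions, not part of Amp, and the way the optional command is prepended here only mimics the idea of an 'envcommand'; it is not taken from Amp's source.

```python
import subprocess


def pythonpath_entries(printenv_output):
    """Return the PYTHONPATH entries found in the text printed by `printenv`."""
    for line in printenv_output.splitlines():
        name, sep, value = line.partition("=")
        if sep and name == "PYTHONPATH":
            return [entry for entry in value.split(":") if entry]
    return []


def amp_on_path(printenv_output, marker="amp"):
    """True if any PYTHONPATH entry contains `marker` (hypothetical heuristic)."""
    return any(marker in entry for entry in pythonpath_entries(printenv_output))


def remote_printenv(host, envcommand=None, timeout=30):
    """Run `printenv` on `host` over ssh, optionally after an environment command.

    BatchMode=yes makes ssh fail instead of prompting for a password, so a
    node that cannot be reached freely shows up as an error, not a hang.
    """
    remote = "printenv" if envcommand is None else f"{envcommand}; printenv"
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, remote],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    result.check_returncode()
    return result.stdout
```

On a cluster, something like `amp_on_path(remote_printenv("node01"))` (with a real hostname from the job's node list) would show whether a worker sees Amp without any extra setup; passing the same string used for 'envcommand' as the second argument shows whether that command fixes it.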

On 07/23/2017 10:16 AM, Peterson, Andrew wrote:
> Ben,
>
> If the code had gotten one step further, it would have generated worker-specific stderr files and this would be easier to debug. ;)
>
> The code assumes you can ssh freely between nodes, and that the environment on the worker nodes is inherited from the master node. To see if this is the case, you might want to debug in a bash script: check whether you can ssh to the nodes assigned to your job. Once you are on those nodes, run `printenv` to see whether Amp is listed in your Python path. I'm guessing it's one of those two problems. If it's the former, you may need to talk to your system administrator. If it's the latter, you might be able to fix it with the `envcommand` (described here).
>
> Also, I'm assuming from your email that this is only a problem when trying to run on multiple nodes. Is that true? E.g., if you manually specify cores=8, which keeps it on the single node, does it work?
>
> Let us know how it goes, and hopefully with your help we can update the code to make this work better in the future.
>
> Andy
>
> On Fri, Jul 21, 2017 at 4:42 PM, Ben Comer wrote:
>
>     Hey all, I wrote a little bit of code so that the cores variable gets the correct value on our PBS queuing system: the names of the nodes and the number of cores on each. However, this alone did not solve the problem. When the system runs start_workers, the code times out at the line containing ssh.expect('<amp-connect>'). Could someone provide insight into how this system works and what might be going wrong?
>
>     Thanks,
>     Ben Comer
>     Georgia Tech
>
>
>
>
> -- 
> Andrew Peterson
> Assistant Professor
> Brown University School of Engineering
> Barus & Holley 247
> 184 Hope Street
> Providence, RI 02912
> (401) 863-2153
> http://brown.edu/go/catalyst