On Wed, Mar 7, 2018 at 11:40 PM, Efrem Braun <[log in to unmask]> wrote:
> Thanks all for that very helpful information.
> Three scenarios I've considered over the last few days:
> 1: If the user has separate data sets and wants to free memory before
> performing runs with the later data sets (e.g., I've been training a neural
> net and then running it on a validation data set within one script, and
> there's no need for me to keep the training data set in memory once I've
> moved on to the validation data set.)
> My first thought was that it would be good to give the user an option to
> remove the Data object from memory by adding a
> delete_from_memory_fingerprints() method to the descriptor classes. That
> way, the data is stored in memory by default, as is currently done, but if
> the user is sure that they're done with it, they can free it up.
> However, doing this with Python is tricky. As Muammar said, just putting in
> a "del self.neighborlist" and "del self.fingerprints" isn't going to help
> much, since even though this might help the garbage collector return the
> memory to Python, the memory won't get re-used if it isn't the right size.
Thanks for confirming this; that matches the theory. `del` only
removes a reference and marks the object as collectable, and the
garbage collector decides when to actually reclaim it. Besides, even
when memory is reclaimed, there is no guarantee that whatever object
we create after the fingerprinting will exactly match in size with
what we freed, so the allocator cannot necessarily reuse those blocks.
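To make this concrete, here is a minimal sketch of what the proposed `delete_from_memory_fingerprints()` method would do, and why it only helps partially (the `Descriptor` class and its attributes here are toy stand-ins, not Amp's actual code):

```python
import gc

class Descriptor:
    """Toy stand-in for an Amp descriptor holding fingerprint data."""

    def __init__(self):
        # Dummy payloads standing in for the real neighborlist/fingerprints.
        self.neighborlist = [list(range(1000)) for _ in range(100)]
        self.fingerprints = [list(range(1000)) for _ in range(100)]

    def delete_from_memory_fingerprints(self):
        # Dropping the references makes the objects collectable, and
        # gc.collect() reclaims them inside the interpreter -- but
        # CPython's allocator may keep the freed arenas for reuse rather
        # than return them to the OS, and later allocations of a
        # different size cannot necessarily reuse them anyway.
        del self.neighborlist
        del self.fingerprints
        gc.collect()

d = Descriptor()
d.delete_from_memory_fingerprints()
print(hasattr(d, 'fingerprints'))  # → False
```

So the method would shrink the interpreter's live object set, but the process's resident memory as seen by the OS may not go down, which is exactly the behavior described above.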
> A much simpler way to clear the memory between two separate data sets is just
> to separate the multiple tasks into two separate Python scripts. In one
> script, train the neural net, and in the second, run it on the validation
> dataset. This obviously forces Python to return memory to the OS. This is
> what I've started doing, and it works fine. It's not particularly inelegant,
> and I think there's not much reason to work on fixing this.
You are right: it seems that separating the Python scripts (_creating
different processes_) is the key here. See this, and look for the word
`kill`. I am thinking that a possible way of fixing the problem would
be to use multiprocessing instead of threading (fingerprint creation
is currently called using a thread from the main process) and create a
dedicated process for that purpose. Then, after the fingerprints are
returned/written, we kill/close the process, which should return its
memory to the OS -- does that make sense? Then, as you showed, loading
the fingerprints from disk keeps memory consumption low, although it
will increase I/O operations.
> 2: If the user has a large data set of energies/forces to validate a neural net against:
> After Amp calculates the errors on a given fingerprint, there's no need for
> that fingerprint to be stored in memory anymore since the user just needs
> the error values for plotting purposes. As with scenario 1, the user can
> calculate the errors in multiple Python scripts. This does mean that the
> tools in analysis.py can't be used though. It's not too hard to work around
> this, but it is somewhat inelegant.
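One way to keep this in a single script is to stream the errors with a generator, so each fingerprint is released as soon as its error is computed and only the scalar errors accumulate. A sketch under the assumption that fingerprints are pickled to disk one file per image (`iter_errors` and the file layout are hypothetical, not part of analysis.py):

```python
import os
import pickle
import tempfile

def iter_errors(paths, predict, references):
    # Load one fingerprint at a time; only the scalar error survives
    # each iteration, so the full set never sits in memory at once.
    for path, ref in zip(paths, references):
        with open(path, 'rb') as f:
            fingerprint = pickle.load(f)
        yield predict(fingerprint) - ref

# Toy demonstration: three "fingerprints" pickled to disk.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmpdir, 'fp_%d.pkl' % i)
    with open(p, 'wb') as f:
        pickle.dump([float(i)] * 4, f)
    paths.append(p)

# `sum` stands in for the trained model's prediction function.
errors = list(iter_errors(paths, predict=sum, references=[0.0, 3.0, 9.0]))
print(errors)  # → [0.0, 1.0, -1.0]
```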
> 3: If the user has a large data set of energies/forces to train a neural net with:
> The fingerprints all need to be stored in memory, so all we can do is ensure
> that the program isn't storing duplicates of any data, unnecessarily hogging
> memory. However, Esben's workaround indicates that this IS occurring, since
> he's finding that loading all the fingerprint derivatives is less
> memory-intensive than calculating them all. To me it seems rather silly to
> have to use this workaround though, so I looked into it.
> I've confirmed that Esben's description is happening on my own system using
> Memory Profiler (https://pypi.python.org/pypi/memory_profiler). I ran the
> attached amp_script.py twice (most of the script is making the training and
> validation trajectories; only the last few lines are important). There are 4
> images, and I ran it on my local desktop using 2 cores. The first time, it
> had to calculate the fingerprints, and it maxed out at ~950 MiB (see
> calculating.png). The second time, the fingerprints were already calculated
> and so they just needed to be loaded in; it maxed out at ~340 MiB.
I think this is expected given what we now know about the garbage
collector. The fingerprints for energies differ in size from those for
forces, so freed blocks do not match later allocations and overall
memory usage grows. However, when we have them on disk and load them,
the process only allocates the memory they actually need.
> I haven't been able to figure out where in the code the issue is occurring
> though. I think it has to be in the FingerprintPrimeCalculator class,
> because that's the major difference between loading and calculating
> fingerprints, but I don't see a clear problem. I don't suspect that the
> memory leak is in the Fortran code because most of the increase in memory
> comes at the very end of the run.
I don't think the problem is in the Fortran code either, because there
are `deallocate` calls after each array is returned to Python, which
ensures memory is released on that side. If my assumption about
multiprocessing is correct, then the problem would be in the main Amp
class at the moment we call the fingerprinting computations from
`calculate`. This has to be confirmed :).
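One stdlib way to confirm it would be `tracemalloc`, which attributes Python-level allocations to source lines (it does not see Fortran-side allocations, consistent with ruling those out). Wrapping the `calculate` call between a start and a snapshot would show where the growth comes from; here is a generic sketch with a stand-in workload:

```python
import tracemalloc

tracemalloc.start()

# Stand-in for the suspect code path, e.g. the call into the
# fingerprinting computations from Amp's `calculate`.
workload = [[float(i)] * 1000 for i in range(500)]

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:3]:
    # Each entry reports file:lineno, total size, and allocation count.
    print(stat)
tracemalloc.stop()
```

Running this around the calculating vs. loading paths should point at the exact lines responsible for the extra ~600 MiB in the first run.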
Muammar W El Khatib Rodriguez
Postdoctoral Research Associate
Brown University School of Engineering
184 Hope Street
Providence, RI, 02912, USA