Why Did My Run Crash?

Option 1: The forces on the atoms became too large to be stored in the output. You can tell whether this happened by looking at the run information outfile specified in the SBATCH header (--output). Look for this line:

    cudaMemcpy GpuBuffer::Download failed an illegal memory access was encountered

If this occurs occasionally, another replicate can be run to replace the experiment. If it occurs repeatedly, something likely went wrong earlier in the process.

Option 2: A GPU was lost between Amber calls. You can tell whether this happened by looking at the energy information outfile specified in the call to Amber (-o filename.out). The job will have terminated right after reading the input file, ending with the lines:

    Note: ig = -1. Setting random seed to 14899 based on wallclock time in microseconds.
    | irandom = 1, using AMBER's internal random number generator (default).

If this occurs, the job can simply be restarted at the point where the GPU was lost: comment out the calls to Amber that completed in your original job script and resubmit. For example, if the job was supposed to run experiments 1-4 and the GPU was lost between the experiment 2 heating step and the experiment 2 equilibration step, you can comment out everything before the experiment 2 equilibration call (see the example job script fragment at the end of this page).

Option 3: A GPU was lost during the job. There is no easy way to check for this, but if you do not see either of the two outputs above, you can assume the GPU was lost in the middle of the job. Sometimes an entire compute node has an error, which shows up as multiple jobs on the same node terminating around the same time. If you suspect this has occurred, comment out the parts of the job that have already completed and resubmit the script (see Option 2). One common scenario where this happens is a power outage.
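A quick way to check for the signatures above is to grep your outfiles. This is a minimal sketch, assuming the SBATCH run information files match slurm-*.out and that filename.out is the energy outfile from your Amber call; adjust both to your own naming.

    # Option 1: list run information outfiles containing the illegal-memory-access message
    grep -l "cudaMemcpy GpuBuffer::Download failed" slurm-*.out

    # Option 2: show the tail of an energy outfile to see whether it stopped
    # right after the random seed lines (filename.out is a placeholder)
    tail -n 5 filename.out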
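For the Option 2 restart, the edit to the job script looks something like the sketch below. The input, topology, and restart filenames (heat.in, equil.in, exp2.prmtop, and so on) are placeholders, and a single-GPU pmemd.cuda run is assumed; keep your own SBATCH header and filenames. Calls for experiment 1 and the earlier experiment 2 steps would be commented out in the same way.

    #!/bin/bash
    #SBATCH --output=run_info.%j.out   # run information outfile checked in Option 1
    #SBATCH --gres=gpu:1

    # Experiment 2 heating completed before the GPU was lost, so its call is commented out:
    # pmemd.cuda -O -i heat.in -p exp2.prmtop -c exp2_min.rst -o exp2_heat.out -r exp2_heat.rst -x exp2_heat.nc

    # Resubmit starting from experiment 2 equilibration, which reads the heating restart file:
    pmemd.cuda -O -i equil.in -p exp2.prmtop -c exp2_heat.rst -o exp2_equil.out -r exp2_equil.rst -x exp2_equil.nc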
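For Option 3, one way to see whether other jobs on the same node terminated around the same time is to query the SLURM accounting database. This is a sketch, assuming sacct is available on your cluster; the node name and time window are placeholders to fill in for your case.

    # Jobs that ran on the suspect node in the surrounding time window,
    # with their final state and end time
    sacct --nodelist=node001 --starttime=2024-01-01T00:00 --endtime=2024-01-02T00:00 \
          --format=JobID,JobName,State,End,NodeList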