WhyIsCottontailSoSlow

Because cottontail is the head node, it constantly handles many requests: users are always logging in, submitting jobs, and manipulating files. Even a job that the scheduler dispatches to a compute node can add to cottontail's load, because our MD simulations write output periodically throughout the run. Each write requires cottontail to open a connection to the compute node, transfer the data, and close the connection. With many users running many jobs, this traffic can overwhelm the head node.

A potential solution is to use a scratch directory: an automatically generated, job-specific directory on the node where your job runs. In your submission script, you change into that directory, copy over any input files, run the calculation, and write output locally on the job node. Once the run finishes, the data can be copied to a storage node such as mindstore. An example script using scratch follows. One caveat: localscratch on the exx96 nodes holds only 800GB, so delete your data from localscratch once you have confirmed that it copied correctly.

   #!/bin/bash
   #BSUB -e 5JUP_COD1_C2_mR146_err
   #BSUB -o 5JUP_COD1_C2_mR146_out
   #BSUB -q exx96
   #BSUB -J 5JUP_mR146_12to13
   #BSUB -n 1
   #BSUB -R "rusage[gpu4=1:mem=6288],span[hosts=1]"

   # cuda
   export CUDA_HOME=/usr/local/n37-cuda-9.2
   export PATH=/usr/local/n37-cuda-9.2/bin:$PATH
   export LD_LIBRARY_PATH=/usr/local/n37-cuda-9.2/lib64:$LD_LIBRARY_PATH
   export LD_LIBRARY_PATH="/usr/local/n37-cuda-9.2/lib:${LD_LIBRARY_PATH}"

   # openmpi
   export PATH=/share/apps/CENTOS6/openmpi/1.8.4/bin:$PATH
   export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.8.4/lib:$LD_LIBRARY_PATH

   # python
   export PATH=/share/apps/CENTOS6/python/2.7.9/bin:$PATH
   export LD_LIBRARY_PATH=/share/apps/CENTOS6/python/2.7.9/lib:$LD_LIBRARY_PATH

   # amber18
   source /share/apps/CENTOS7/amber/amber18/amber.sh

   # navigate to auto-generated scratch directory
   MYLOCALSCRATCH=/localscratch/$LSB_JOBID
   export MYLOCALSCRATCH
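   # suggested safety check (not in the original script): make sure the
   # scratch directory exists before cd-ing into it, so a failed cd does
   # not leave the job running in the wrong directory
   if [ ! -d "$MYLOCALSCRATCH" ]; then
       echo "ERROR: scratch directory $MYLOCALSCRATCH not found" >&2
       exit 1
   fi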
   cd $MYLOCALSCRATCH

   ###########
   # MD Runs #
   ###########

   # stage input data in scratch
   cp /mindstore/home33ext/kscopino/5JUP_COD1_C2_mR146/* .

   # call Amber18
   n37.openmpi.wrapper pmemd.cuda -O -i 20ps_heat.in -p 5JUP_COD1_C2_mR146_wat.prmtop -c 5JUP_COD1_C2_mR146_emin11.rst -r 5JUP_COD1_C2_mR146_heat_12.rst \
       -ref 5JUP_COD1_C2_mR146_emin11.rst -o 5JUP_COD1_C2_mR146_heat_12.out -x 5JUP_COD1_C2_mR146_heat_mdcrd_heat_12

   # copy results back to storage node
   cp 5JUP_COD1_C2_mR146* /mindstore/home33ext/kscopino/5JUP_COD1_C2_mR146
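   # suggested addition (not in the original script): only clear
   # localscratch after the copy back to mindstore succeeds, since
   # localscratch on the exx96 nodes holds only 800GB
   if [ $? -eq 0 ]; then
       cd /
       rm -rf "$MYLOCALSCRATCH"
   fi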

For now, if traffic on cottontail is heavy, it should be possible to submit jobs from cottontail2, the backup submission node. It is also a good idea to check resource usage with 'bjobs -u all' when slowdowns occur. If a particular user is running jobs consistently during a slowdown, they may not be using a scratch directory. This can be checked by ssh-ing into their compute node and looking for their job directory under /localscratch.
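The check above can be sketched as follows (the node name is a placeholder; on the exx96 nodes, a job using scratch leaves a directory named after its LSF job ID):

```shell
# list all running jobs to find the busy user and their execution host
bjobs -u all

# log into the compute node the job is running on (n90 is a placeholder)
ssh n90

# a job using scratch will have a directory named after its LSB_JOBID
ls /localscratch
```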

Additional Resources
Scratch Directory wiki

Page last modified on July 14, 2020, at 02:53 PM