Why Is Cottontail So Slow?

Because cottontail is the head node, it is constantly handling requests: users are always logging in, submitting jobs, and manipulating files. But even a job that the scheduler dispatches to a compute node can add to cottontail's burden. For our MD simulations, output is written periodically throughout the run, and every time this happens cottontail has to open a connection to the compute node, write the data, and close the connection. With many users running many jobs, this can overwhelm the node.

A potential solution is to use a scratch directory. This entails working inside an automatically generated, job-specific directory on the node where your job is running: in your submission script, you copy over any input files, run your calculations, and write the output locally on that node. Once the run finishes, the data can be copied over to a storage node such as mindstore.

The following is an example script using scratch. One consideration is that the localscratch storage on the exx96 nodes is only 800 GB, so the data should be deleted from localscratch once you have confirmed that it copied back correctly (see the cleanup sketch at the bottom of this page).

#!/bin/bash
#BSUB -e 5JUP_COD1_C2_mR146_err
#BSUB -o 5JUP_COD1_C2_mR146_out
#BSUB -q exx96
#BSUB -J 5JUP_mR146_12to13
#BSUB -n 1
#BSUB -R "rusage[gpu4=1:mem=6288],span[hosts=1]"

# cuda
export CUDA_HOME=/usr/local/n37-cuda-9.2
export PATH=/usr/local/n37-cuda-9.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/n37-cuda-9.2/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH="/usr/local/n37-cuda-9.2/lib:${LD_LIBRARY_PATH}"

# openmpi
export PATH=/share/apps/CENTOS6/openmpi/1.8.4/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.8.4/lib:$LD_LIBRARY_PATH

# python
export PATH=/share/apps/CENTOS6/python/2.7.9/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/python/2.7.9/lib:$LD_LIBRARY_PATH

# amber18
source /share/apps/CENTOS7/amber/amber18/amber.sh

# navigate to auto-generated scratch directory
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYLOCALSCRATCH
cd $MYLOCALSCRATCH

###########
# MD Runs #
###########

# stage input data in scratch
cp /mindstore/home33ext/kscopino/5JUP_COD1_C2_mR146/* .

# call Amber18
n37.openmpi.wrapper pmemd.cuda -O -i 20ps_heat.in \
    -p 5JUP_COD1_C2_mR146_wat.prmtop \
    -c 5JUP_COD1_C2_mR146_emin11.rst \
    -r 5JUP_COD1_C2_mR146_heat_12.rst \
    -ref 5JUP_COD1_C2_mR146_emin11.rst \
    -o 5JUP_COD1_C2_mR146_heat_12.out \
    -x 5JUP_COD1_C2_mR146_heat_mdcrd_heat_12

# copy results back to storage node
cp 5JUP_COD1_C2_mR146* /mindstore/home33ext/kscopino/5JUP_COD1_C2_mR146

For now, if traffic on cottontail is significant, it should be possible to submit jobs from cottontail2, the backup submission node. It is also a good idea to look at resource usage with 'bjobs -u all' when slowdowns occur. If a user is running jobs consistently during a slowdown, it is possible that they are not using a scratch directory; this can be checked by ssh-ing into their compute node and looking for their job directory.
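Submitting from cottontail2 requires no changes to the job script itself, only logging in to the other node first. A minimal sketch, assuming your script sits on shared storage; the username and script name below are placeholders:

# log in to the backup submission node instead of cottontail
ssh username@cottontail2

# submit the same LSF script as usual
bsub < my_job_script.sh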
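One way to carry out that check is sketched below. The node name n37 is only illustrative; use whichever execution host 'bjobs' reports for the suspect job.

# list everyone's jobs and the execution hosts they landed on
bjobs -u all

# log in to the compute node running the suspect job
ssh n37

# a job that uses scratch should have a directory here named after its LSF job ID
ls -l /localscratch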
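Finally, the cleanup step referenced above. This is a sketch, not a tested recipe: it reuses the paths from the example script, assumes it is appended to the end of that script (so it runs inside $MYLOCALSCRATCH), and uses 'cmp' as just one way of confirming the copy before deleting anything.

# verify the copy back to mindstore, then clear this job's scratch space
DEST=/mindstore/home33ext/kscopino/5JUP_COD1_C2_mR146
ok=1
for f in 5JUP_COD1_C2_mR146*; do
    # cmp -s exits nonzero if a file differs or is missing at the destination
    cmp -s "$f" "$DEST/$f" || ok=0
done

if [ "$ok" -eq 1 ]; then
    # everything matched; step out of the directory before removing it
    cd /tmp && rm -rf "/localscratch/$LSB_JOBID"
else
    echo "copy-back verification failed; leaving /localscratch/$LSB_JOBID in place" >&2
fi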