I have a fairly hacky bash script that monitors a Python training script. The training script launches and interacts with some black-box Docker containers, which unfortunately have a rare segfault that can cause the training script to hang. Because this happens rarely, my ugly workaround is to monitor the Python script and check whether the current update takes longer than a set amount of time; if it does, I close everything and restart the script.
The script works when I test it locally, and seems to work most of the time on the Linux Google Cloud machine I’ve been using, but I’ve noticed a really strange issue. I run the bash script inside a screen session on the Cloud machine, and everything seems fine. Then I detach and leave it for a while, and if a reset needs to happen, it sometimes just stops: it prints “Timeout reached…” but nothing else. CPU usage drops to zero and stays there, but if I then open a terminal and reattach to the screen session, it suddenly kicks in and starts again. Does anyone have any idea what could be causing this?
A reduced version of the bash script is the following:
    #!/bin/bash
    export CURR_UPDATE_NUM=0
    export PREV_UPDATE_NUM=0
    export UPDATE_TIMEOUT=250

    rm -f monitor_update_number.txt
    # Current update number saved in a temporary file. Python modifies this when it finishes an update.
    echo "$CURR_UPDATE_NUM" > monitor_update_number.txt

    # Start the main training script:
    python -u train.py &

    while true; do
        CURR_UPDATE_NUM=$(< monitor_update_number.txt)
        PREV_UPDATE_NUM=$CURR_UPDATE_NUM
        # Sleep for the timeout. If Python hasn't updated the txt file after this sleep, something is wrong and we restart everything.
        sleep "$UPDATE_TIMEOUT"
        CURR_UPDATE_NUM=$(< monitor_update_number.txt)
        if [ "$PREV_UPDATE_NUM" = "$CURR_UPDATE_NUM" ]; then
            echo "Timeout reached and update is not completed. Restarting everything and loading the latest save point."
            pkill -9 python
            echo "Killed all Python processes"
            docker kill $(docker ps -a -q)
            echo "Stopped all docker containers."
            docker rm $(docker ps -a -q)
            rm -f monitor_update_number.txt
            echo "$CURR_UPDATE_NUM" > monitor_update_number.txt
            python -u train.py --load-from-file &
            echo "Restarted training script"
        fi
    done
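Since the only evidence of the stall is what reaches the screen terminal, it might help to record each restart step in a file with a timestamp, so the last command reached can be read off afterwards. The following is only a sketch of that idea, not a fix for the hang: the helper names `log_step`, `read_update_num` and `update_stalled`, and the log path `monitor.log`, are assumptions made for illustration and are not part of the script above.

```shell
#!/bin/bash
# Sketch only: helper names and the log path are illustrative assumptions.

LOG_FILE="${LOG_FILE:-monitor.log}"   # assumed location for the step log

# Append a timestamped message to the log file, independent of the terminal.
log_step() {
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$LOG_FILE"
}

# Print the counter stored in a file; print nothing if the file is unreadable.
read_update_num() {
    local file="$1"
    if [ -r "$file" ]; then
        cat "$file"
    fi
}

# Succeed (exit status 0) when the counter did not change between two reads.
update_stalled() {
    [ "$1" = "$2" ]
}
```

In the restart branch, a call such as `log_step "about to run pkill"` could be placed before each command (`pkill`, `docker kill`, `docker rm`, and the relaunch). If the script stalls while detached, the last line in the log file would then show how far it got, without relying on output to the screen session.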