Instance related¶
How are processor cores and memory allocated for the instance?¶
The processor cores and memory of an instance are allocated in proportion to the number of rented graphics cards out of the machine's total graphics cards. Take a machine with 64 cores, 512 GB of memory, and 8 graphics cards: if only 1 graphics card is rented, the instance is allocated 64 / 8 = 8 cores and its memory is limited to 512 / 8 = 64 GB. The free command shows the total memory of the machine and ignores the instance memory limit. If a process exceeds the memory limit, it will be forcibly stopped.
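To see the limit actually applied to the instance rather than the host total, you can read the container's cgroup values. This is a minimal sketch assuming the instance uses cgroup v1; the paths differ under cgroup v2.
# Memory limit applied to the instance, in bytes (cgroup v1 path)
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
# Current memory usage of the instance, in bytes
cat /sys/fs/cgroup/memory/memory.usage_in_bytes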
What if the image you need is not officially available?¶
Some commonly used libraries and software can be installed with commands; see Common Commands for how to install software. You can also use the conda command to create a virtual environment and install packages inside it. The use of conda is described in conda, for example installing PyTorch 1.7: conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 -c pytorch.
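For example, a minimal sketch of creating and using a virtual environment (the environment name torch17 and the Python version are illustrative; if conda activate is not available in your shell, run conda init bash and reopen the terminal first):
# Create a virtual environment named torch17 (name and Python version are examples)
conda create -n torch17 python=3.8 -y
# Activate the environment
conda activate torch17
# Install PyTorch 1.7 into the environment
conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 -c pytorch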
How to keep a training task running in the background so it is not interrupted by network jitter?¶
It is recommended to use the Tmux terminal multiplexer, which keeps processes running in the background and lets you take them over again when needed. To prevent training from being interrupted when the SSH connection drops, it is recommended to run all long-running tasks inside a Tmux session. Refer to the Tmux documentation.
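A minimal sketch of the Tmux workflow (the session name train is illustrative):
# Create a new session named train and run the training command inside it
tmux new -s train
# Detach from the session with Ctrl+b followed by d; the process keeps running
# Reattach to the session later, e.g. after reconnecting over SSH
tmux attach -t train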
Will the training task be interrupted if I shut down my local computer?¶
- If the task is running in the background via Tmux, or in a JupyterLab browser session, shutting down the local computer will not interrupt training.
- If training is run directly in a terminal, or through an IDE connection such as VSCode, shutting down the local computer will interrupt it.
Will the training task be interrupted if I close the JupyterLab browser page?¶
As long as the instance itself is not shut down, closing the JupyterLab browser page has no effect: training tasks running in JupyterLab Notebooks and Terminals continue to run.
Will the training task be interrupted if I close an IDE or terminal such as VSCode, PyCharm, or iTerm2?¶
If you run training through an IDE connected to the instance, closing the IDE or terminal will interrupt the training task. If you need the task to keep running in the background, it is recommended to use a Tmux terminal. Refer to the Tmux documentation.
A command or program reports that a package cannot be found. How do I install it?¶
Refer to Common Commands: use apt to install system software, or pip to install Python packages.
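For example, a minimal sketch (the package names are illustrative):
# Install a system package with apt (zip is an example)
apt-get update && apt-get install -y zip
# Install a Python package with pip (requests is an example)
pip install requests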
How to shut down automatically after training?¶
Run the shutdown command in the instance terminal to shut the instance down. You can also call this command at the end of your training code so that the instance shuts down automatically once training has finished.
import os
os.system('shutdown')
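Alternatively, a minimal sketch of chaining shutdown after a training script started from the terminal (train.py is an illustrative script name; run it inside Tmux so the session survives disconnects):
# Run training and shut the instance down when the script exits,
# regardless of whether it succeeded
python3 train.py; shutdown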
What to do when shutting down fails with a prompt that the disk is full?¶
The root directory disk usage of the instance can be viewed with the commands below. If the disk is full, delete some files to free up space, or move files to /hy-nas (only available on shared-storage models) or /hy-tmp (emptied 24 hours after the instance shuts down). When the disk is full the instance cannot start normally, so some space must be freed before shutting down.
After entering the instance terminal, you can use the following commands to find the files that occupy space. Alternatively, open the instance list in the console, click the Manage button under System Disk for the instance, and delete files in the panel that opens.
# View instance root directory disk usage
df -h | grep "/$" | awk '{print $5" "$3"/"$2}'
# View the size of each directory under the /root and /home directories
du -h --max-depth=1 /root /home
# View the size of each directory in the current directory
du -h --max-depth=1 .
# View the size of each file in the current directory
ll -h | grep ^- | awk '{print $5"\t"$9}'
Training hangs at startup on RTX 3000 series graphics cards?¶
Check whether the CUDA version used by the framework is lower than 11.0. RTX 3000 series graphics cards require CUDA 11 or above; using a lower version causes the process to hang.
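For example, if training uses PyTorch, a minimal sketch to check which CUDA version the installed framework was built against:
# Print the CUDA version PyTorch was built with (requires PyTorch to be installed)
python3 -c "import torch; print(torch.version.cuda)"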
What are the CUDA and cuDNN versions?¶
The CUDA Version shown by nvidia-smi is the highest version supported by the current driver, not the version installed in the instance. The installed version depends on the official image selected when the instance was created.
# View CUDA version
root@I15b96311d0280127d:~# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
# View CUDNN version
root@I15b96311d0280127d:~# dpkg -l | grep libcudnn | awk '{print $2}'
libcudnn8
libcudnn8-dev
# View CUDNN location
root@I15b96311d0280127d:~# dpkg -L libcudnn8 | grep so
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.1
...
How to check graphics card usage?¶
Run the nvidia-smi command in the terminal to check the status of the graphics card, including power consumption, video memory usage, and so on.
root@I15b96311d0280127d:~# nvidia-smi
Mon Jan 11 13:42:18 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 On | 00000000:02:00.0 Off | N/A |
| 63% 55C P2 298W / 370W | 23997MiB / 24268MiB | 62% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------
Because instances are Docker containers, nvidia-smi cannot show processes due to container PID isolation. Run the py3smi command in the terminal to see which processes are using the graphics card.
root@I15b96311d0280127d:~# py3smi
Mon Jan 11 13:43:00 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI Driver Version: 460.27.04 |
+---------------------------------+---------------------+---------------------+
| GPU Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
+=================================+=====================+=====================+
| 0 63% 55C 2 284W / 370W | 23997MiB / 24268MiB | 80% Default |
+---------------------------------+---------------------+---------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU Owner PID Uptime Process Name Usage |
+=============================================================================+
| 0 ??? 10494 23995MiB |
+-----------------------------------------------------------------------------+
GPU utilization not going up during training?¶
If you check the graphics card during training and find that core utilization and power consumption are low, the card is not being fully used. In this case it is likely that in each training step most of the time is spent on the CPU (for example, data loading and preprocessing) rather than on the GPU, which causes GPU utilization to fluctuate periodically. Solving this requires improving the code; you can refer to Xi Xiaoyao's article "Low training efficiency? GPU utilization is not going up?".
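A minimal sketch for confirming the bottleneck while training is running (run these in a separate terminal):
# Refresh GPU utilization and power draw every second
nvidia-smi -l 1
# In another terminal, watch the CPU usage of the training process
top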
How to speed up cloning code or downloading files from GitHub?¶
To speed up cloning, you can replace github.com with the mirror address github.com.cnpmjs.org. For example, if the repository address is https://github.com/kelseyhightower/nocode.git, the replaced address is https://github.com.cnpmjs.org/kelseyhightower/nocode.git.
# Original address https://github.com/kelseyhightower/nocode.git
# github.com is replaced by github.com.cnpmjs.org
git clone https://github.com.cnpmjs.org/kelseyhightower/nocode.git
To download GitHub Releases and Raw files, you can use the GitHub Proxy service: prefix the full address with https://mirror.ghproxy.com/.
# Original address https://raw.githubusercontent.com/kelseyhightower/nocode/master/README.md
# Address prefix https://mirror.ghproxy.com/
curl -L https://mirror.ghproxy.com/https://raw.githubusercontent.com/kelseyhightower/nocode/master/README.md
Because the machines are located in different regions, different mirror addresses may perform differently. If the download speed is still unsatisfactory, try another mirror address.
TensorFlow training reports a ptxas fatal error¶
When training with the TensorFlow 2.4 For CUDA 11.0 image on RTX 3000 series graphics cards, the following warning appears.
W tensorflow/stream_executor/gpu/asm_compiler.cc:235] Your CUDA software stack is old. We fallback to the NVIDIA driver for some compilation. Update your CUDA version to get the best performance. The ptxas error was: ptxas fatal : Value 'sm_86' is not defined for option 'gpu-name'
The reason is that this version of the PTX compiler does not support compute capability 8.6. This is only a warning and does not affect training; it can be silenced with os.environ['TF_CPP_MIN_LOG_LEVEL'] = "2".
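A minimal sketch of silencing it from the shell instead, since TensorFlow also reads TF_CPP_MIN_LOG_LEVEL from the environment (train.py is an illustrative script name):
# Silence TensorFlow C++ warnings for this run only
TF_CPP_MIN_LOG_LEVEL=2 python3 train.py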
Although this problem does not prevent training, performance is reduced because compilation falls back to the driver. It is recommended to create the instance with the TensorFlow 2.5 For CUDA 11.2 image, which does not have this problem.
How does an application expose a port to the outside world?¶
Instances do not have public IP addresses; services are reached through ports mapped to a public access point. If needed, stop the JupyterLab or TensorBoard service, configure your application to use the same port (8888 or 6006), and make it listen on 0.0.0.0.
# Stop JupyterLab or TensorBoard service
supervisorctl stop tensorboard
supervisorctl stop jupyterlab
# Set the boot to not start JupyterLab or TensorBoard
grep -E "autostart" /etc/supervisor/conf.d/tensorboard.conf || echo "autostart = false" >>/etc/supervisor/conf.d/tensorboard.conf
grep -E "autostart" /etc/supervisor/conf.d/jupyterlab.conf || echo "autostart = false" >>/etc/supervisor/conf.d/jupyterlab.conf
# update configuration
supervisorctl update
Next, start the application listening on 0.0.0.0:6006 or 0.0.0.0:8888. It can then be accessed externally through the JupyterLab or TensorBoard links of the instance in the console.
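For example, a minimal sketch that serves the current directory on the JupyterLab port, useful for verifying the mapping before switching to your own application:
# Serve the current directory on 0.0.0.0:8888 with Python's built-in HTTP server
python3 -m http.server 8888 --bind 0.0.0.0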
What if JupyterLab asks for a password?¶
You can get the JupyterLab login token by running the jupyter server list command in the terminal. In the example output below, the token is 3fq593blw4afqjtqgdp3ldk5.
root@I15b96311d0280127d:~# jupyter server list
Currently running servers:
http://0.0.0.0:8888/?token=3fq593blw4afqjtqgdp3ldk5::/