The new Multi-Instance GPU (MIG) feature allows GPUs (starting with NVIDIA Ampere architecture) to be securely partitioned into up to seven separate GPU Instances for CUDA applications, providing multiple users with separate GPU resources for optimal GPU utilization.
This feature is particularly beneficial for workloads that do not fully saturate the GPU's compute capacity, so users may want to run different workloads in parallel to maximize utilization.
MIG supports the following deployment configurations:
Bare-metal, including containers
GPU pass-through virtualization to Linux guests on top of supported hypervisors
vGPU on top of supported hypervisors
MIG splits a single physical GPU into multiple smaller GPU instances.
For example:
A bare-metal server has a single GPU card with 141 GB of memory.
With MIG we can create up to 7 mini-GPU instances, each with roughly 20 GB of memory.
Each contains:
Compute cores
Memory
Cache
Scheduling engine
Each behaves like a single, independent GPU to the system.
GPU instance (GI)
- A slice of the GPU that includes compute and cache resources. (Hardware allocation)
Compute instance (CI)
- Like a virtual machine or containerized environment that uses a GI. (Execution unit)
Compute slices:
MIG Profile - Configuration for splitting the GPU.
For example:
1g.5gb
4g.20gb
A 40 GB GPU has:
8 x 5 GB memory slices - portions of the GPU memory (VRAM).
7 compute slices - portions of the GPU compute power.
If the GPU has 7 compute engines (GPCs), a MIG instance can get:
1 slice = lowest power (e.g. 1g.5gb)
4 slices = medium power (e.g. 4g.20gb)
7 slices = full GPU (e.g. 7g.40gb)
Note:
How to check GPU compute engines: nvidia-smi -q
Compute instance:
3c.4g.20gb
This assigns a CI (container / execution engine) 3 of the 4 compute slices of a 4g.20gb GPU instance (4 compute slices, 20 GB memory).
That is, 3 compute slices are combined together to create 3c.4g.20gb; the remaining slice can go into another CI, so the GI is shared between apps.
If it is 4c.4g.20gb, that is equal to 4g.20gb (the CI spans the whole GI).
Difference between 4g.20gb and 2c.4g.20gb
4g.20gb
- 1 big GPU instance
- Runs a single job
- All 4 compute slices and the 20 GB of memory are used together by one process
- Only one big process can run.
2c.4g.20gb
- Still 4 slices and 20 GB memory.
- But it is split into 2 compute instances.
- You can run two separate jobs (containers, users, or apps).
- The two jobs run independently of each other.
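As a hedged sketch of how the two layouts above could be created (profile names taken from the 40 GB example; <GI_ID> is a placeholder for the instance ID reported by nvidia-smi mig -lgi):
# sudo nvidia-smi mig -cgi 4g.20gb -C ## one GI with one full-size CI: a single big job
# sudo nvidia-smi mig -cgi 4g.20gb ## one GI, no CI yet
# sudo nvidia-smi mig -cci 1,1 -gi <GI_ID> ## split it into two 2c.4g.20gb CIs: two independent jobs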
GPU instances and compute instances are enumerated in /proc:
# ls -l /proc/driver/nvidia-caps/
To view the GPU (and MIG device) UUIDs reported by the NVIDIA driver:
# nvidia-smi -L
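These UUIDs are what CUDA uses to target a single MIG device. For example (a sketch: the MIG UUID is a placeholder to be copied from nvidia-smi -L, and my_cuda_app stands for any CUDA workload):
# CUDA_VISIBLE_DEVICES=MIG-<uuid> ./my_cuda_app ## the process sees only this one MIG device
Note that a single CUDA process can enumerate at most one MIG device at a time.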
Enable MIG Mode:
Check whether MIG mode is enabled:
# nvidia-smi -i 0
# nvidia-smi -i 0 --query-gpu=pci.bus_id,mig.mode.current --format=csv
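To check the current MIG mode on every GPU at once (same query fields, just without -i):
# nvidia-smi --query-gpu=index,pci.bus_id,mig.mode.current --format=csv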
To enable it,
# sudo nvidia-smi -i <GPU ID> -mig 1
# sudo nvidia-smi -i 0 -mig 1
If no GPU ID is specified, MIG mode will be applied to all the GPUs on the system.
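For example, to enable MIG mode on every GPU in the system:
# sudo nvidia-smi -mig 1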
When MIG mode is enabled on the GPU, depending on the product, the driver will attempt to reset the GPU so that the change can take effect.
Note that the GPU instances and compute instances created on top of MIG mode do not survive a reboot and must be recreated; on some products MIG mode itself may also need to be re-enabled.
In some cases (for example on DGX systems) you need to stop the nvsm and dcgm services before enabling MIG mode:
# sudo systemctl stop nvsm
# sudo systemctl stop dcgm
# sudo nvidia-smi -i 0 -mig 1
Enabled MIG Mode for GPU 00000000:07:00.0
All done.
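A compact end-to-end sketch of the same sequence, assuming a DGX-style system where both services exist (restarting them afterwards is optional but usually desired):
# sudo systemctl stop nvsm dcgm
# sudo nvidia-smi -i 0 -mig 1
# sudo systemctl start nvsm dcgm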
List all possible GPU instance profiles:
# nvidia-smi mig -lgip
For example, the listing may show that the user can create two instances of 3g.71gb,
or seven instances of 1g.18gb.
List the possible placements of GPU Instances:
# nvidia-smi mig -lgipp
List the possible placements of Compute instances:
# nvidia-smi mig -lcipp
Now create GPU instances:
Simply enabling MIG mode on the GPU is not enough.
Without creating GPU instances, CUDA workloads cannot run on the GPU.
These instances are not persistent across reboots, so they need to be recreated.
To make them persistent, use the mig-parted tool (covered later).
Check available GPU instance profiles:
# nvidia-smi mig -lgip
Create GPU instances from a MIG profile:
# sudo nvidia-smi mig -cgi 9,3g.20gb -C
Here 9 is the profile ID and 3g.20gb is the profile name; a profile can be given either way.
By default (without -i) the instances are created on GPU 0.
OR
# sudo nvidia-smi mig -cgi 19,14,5
This also creates the instances on the default GPU.
To find the profile IDs: # nvidia-smi mig -lgip
List the created GPU instances:
# nvidia-smi mig -lgi
Enable MIG on Specific GPU
Enable MIG on GPU ID 1:
# sudo nvidia-smi -i <GPU ID> -mig 1
# sudo nvidia-smi -i 1 -mig 1
After enabling MIG, check the supported profiles:
# nvidia-smi mig -lgip -i 1
Create GPU instance on GPU ID 1:
# nvidia-smi mig -cgi 19,15 -i 1 -C
OR
# nvidia-smi mig -cgi 19,1g.18gb -i 1 -C
OR
# sudo nvidia-smi mig -cgi 14,19,19,19,19,19
Here,
19 and 15 are profile IDs (1g.18gb and 1g.35gb respectively).
A profile can be given either by ID or by name, as in 19,1g.18gb above.
-C : This flag will create Compute instances along with GPU instances.
Note:
Once the GPU instances are created, you also need to create the corresponding compute instances (CIs) using the -cci option (unless -C already created them).
- Create instances in order of geometry, largest profile first, so that the placements fit.
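A hedged two-step sketch of this (the profile ID 9, GPU ID 1 and GI ID 2 are illustrative; take the real GI ID from nvidia-smi mig -lgi):
# sudo nvidia-smi mig -cgi 9 -i 1 ## create the GPU instance only (no -C)
# sudo nvidia-smi mig -cci 2 -gi 2 -i 1 ## then create a CI inside it (-cci; CI profile 2 = the full 3-slice CI here)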
If any error occurs during creation, clean up the GPU:
# sudo nvidia-smi mig -dci -i 1
# sudo nvidia-smi mig -dgi -i 1
Now list the created GPU instances:
# nvidia-smi mig -lgi -i 1
OR
# nvidia-smi mig -lgi ## for listing the GPU instances
# nvidia-smi mig -lci ## for listing the Compute instance
Now verify that the GIs and corresponding CIs are created:
# nvidia-smi
Available GPU profiles capacity:
# nvidia-smi mig -lgip -i 1
List all created GPU instances (GI):
# sudo nvidia-smi mig -lgi -i 1
List all created compute instances (CI):
# sudo nvidia-smi mig -lci -i 1
List all GPUs and the MIG devices created on them:
# nvidia-smi -L
# nvidia-smi
- Delete GI
Check the available MIG instances
# nvidia-smi
- First delete the CI
# sudo nvidia-smi mig -dci -i <GPU ID> -ci <CI ID>
# sudo nvidia-smi mig -dci -i 1 -ci 0
Note: if -gi is not specified, the compute instance with CI ID 0 is deleted from every GI on that GPU; the GI itself remains until it is deleted with -dgi.
If you want to remove only the GI, use:
# sudo nvidia-smi mig -dgi -i <GPU_ID> -gi <GI_ID>
# sudo nvidia-smi mig -dgi -gi 13 -i 1
Delete all CIs from GI 1, then delete GI 1 itself:
# sudo nvidia-smi mig -dci -gi 1 -i 1
# sudo nvidia-smi mig -dgi -gi 1 -i 1
Destroy all CIs and GIs:
# sudo nvidia-smi mig -dci && sudo nvidia-smi mig -dgi
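If you also want to turn MIG mode back off after clearing everything (a sketch; drop the last command to keep MIG enabled):
# sudo nvidia-smi mig -dci && sudo nvidia-smi mig -dgi && sudo nvidia-smi -mig 0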
Compute instances:
Use case: this is needed when, for example, we want to give out a 2g.71gb-style slice, which is not available as a profile by default. We can achieve this by splitting a 3g.71gb GPU instance.
So it can be split into 2c.3g.71gb and 1c.3g.71gb.
Whether a GPU instance can be split into multiple CIs depends on the CI placements:
# nvidia-smi mig -lcipp
Only profile IDs 1, 2 and 7 have multi-instance CI support.
If, for GPU 0, GI 2, profile ID 7, the placements are {0,2}:2,
then you can create 2 CIs under that GI.
For 2c.3g.71gb:
# nvidia-smi mig -gi <GI_ID> -cci <Profile_ID> -C
# nvidia-smi mig -gi 2 -cci 1 -C
If you hit an error:
# nvidia-smi mig -gi 2 -cci 1 -C
Unable to create a compute instance on GPU 0 GPU instance ID 2 using profile 1: Insufficient Resources
Failed to create compute instances: Insufficient Resources
Solution:
It is possible that a CI was already created automatically along with the GI (for example by the -C flag). Check with:
# nvidia-smi mig -lci
How to resolve?
Delete the automatically created CI:
# nvidia-smi mig -dci -gi <GI ID> -ci <CI ID>
# nvidia-smi mig -dci -gi 2 -ci 1 -i 0
Then try to re-create it,
# nvidia-smi mig -gi 2 -cci 1 -C -i 0 ## for 2c.3g.71gb
# nvidia-smi mig -gi 2 -cci 0 -C -i 0 ## for 1c.3g.71gb
Here,
-cci 0 - 1c
-cci 1 - 2c
-cci 2 - 3c
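The same 2c + 1c split can also be requested in a single command by listing both CI profile IDs, largest first (a sketch assuming the same GI ID 2 on GPU 0):
# nvidia-smi mig -cci 1,0 -gi 2 -i 0 ## creates one 2c and one 1c CI inside the 3g GI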
Compute instances are generally created automatically when you create GPU instances, especially when using the -C flag.
# nvidia-smi mig -lci
If they are not created automatically, then follow the steps below.
A further level of concurrency is achieved by using CIs.
For example, three CUDA processes can run on the same GI, each in its own CI.
# nvidia-smi mig -lgi -i 1
List Already created compute instances:
# nvidia-smi mig -lci -i 7
# nvidia-smi mig -lci
List the GI:
# nvidia-smi
List all supported compute instance profiles:
# sudo nvidia-smi mig -lcip -gi <GI ID> -i <GPU ID>
# sudo nvidia-smi mig -lcip -gi 2 -i 7
Create a CI on a GI:
# sudo nvidia-smi mig -cci <profile_ID> -gi <GI_ID> -i <GPU_ID>
# sudo nvidia-smi mig -cci 7 -gi 2 -i 1
Here 7 - CI profile ID
2 - GPU instance ID
1 - GPU ID
OR
Create multiple at once, e.g. 3 CIs, each of 1c compute capacity (profile ID 0), on GI 1:
# sudo nvidia-smi mig -cci 0,0,0 -gi 1
Now the GIs and CIs are created.
# nvidia-smi
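With the GIs and CIs in place, containers can be pinned to individual MIG devices. A sketch assuming the NVIDIA Container Toolkit is installed; <cuda-image> and the MIG UUID are placeholders (copy the UUID from nvidia-smi -L):
# docker run --rm --gpus device=MIG-<uuid> <cuda-image> nvidia-smi -L ## the container sees only that one MIG device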
Error:
# nvidia-smi mig -cgi 9,19,19,19,19 -i 0 -C
Unable to create a GPU instance on GPU 0 using profile 9: In use by another client
Failed to create GPU instances: In use by another client
Then check whether some processes are still running and using the GPU:
# sudo lsof /dev/nvidia*
Kill those processes and re-create the instances.
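For example (a sketch; <PID> is whatever lsof reported, and fuser is just an alternative way to list the PIDs):
# sudo fuser -v /dev/nvidia*
# sudo kill <PID>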
NVIDIA mig-parted
When we create GPU partitions and reboot the server, the partitions are automatically removed, and when we create them again the UUIDs change.
To overcome this issue, we use the nvidia-mig-parted tool.
Install nvidia-mig-parted:
Link - https://github.com/NVIDIA/mig-parted/releases
Download the .deb file and install it.
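A hedged example of the install (the version and package filename are placeholders; check the releases page for the actual asset name):
# wget https://github.com/NVIDIA/mig-parted/releases/download/<version>/<package>.deb
# sudo dpkg -i <package>.deb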
Clone the mig-parted git repository:
# cd /home/script
# git clone https://github.com/purvalpatel/mig-parted.git
Now create/edit the YAML config file at /home/script/mig-parted/examples/config.yaml:
Location on live server: /home/script/mig-parted/
# config.yaml
version: v1
mig-configs:
  - devices: [0]
    mig-enabled: true
    mig-devices:
      1c.3g.71gb: 1
      1g.18gb: 4
      2c.3g.71gb: 1
  - devices: [1, 2, 3, 4, 5, 6]
    mig-enabled: false
    mig-devices: {}
  - devices: [7]
    mig-enabled: true
    mig-devices:
      1c.3g.71gb: 1
      1g.18gb: 3
      2c.3g.71gb: 1
Verify that the configuration is valid:
# nvidia-mig-parted assert -f config.yaml
Apply the changes:
# nvidia-mig-parted apply -f config.yaml
Verify that it works across a reboot:
# reboot
After the reboot, run the command below.
# nvidia-mig-parted apply -f config.yaml
The same partitions will then be recreated, with the same UUIDs.
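To avoid re-running the apply step by hand after every reboot, one option is to wrap it in a small script and hook that into the boot process. This is only a sketch: the script path is illustrative, and you should check whether your mig-parted package already provides a boot-time service before adding your own.
File: /usr/local/bin/apply-mig.sh (illustrative path)
#!/bin/bash
set -e
## re-apply the saved MIG layout using the config created above
nvidia-mig-parted apply -f /home/script/mig-parted/examples/config.yaml
Call this script from a systemd unit or an @reboot cron entry so the partitions come back automatically after each boot.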