Optimizing Communication on Multi-GPU Systems (#102) · ff5108e0
Authored by Marko Kabic
    This PR enables COSMA to take advantage of fast GPU-to-GPU interconnects such as NVLink, in order to efficiently utilize modern multi-GPU systems. This is achieved in two ways:
    - **Using the `NCCL`/`RCCL` Libraries:** by specifying the `-DCOSMA_WITH_NCCL=ON` (for NVIDIA GPUs) or `-DCOSMA_WITH_RCCL=ON` (for AMD GPUs) CMake option.
    - **Using GPU-aware MPI:** by specifying the `-DCOSMA_WITH_GPU_AWARE_MPI=ON` CMake option, as proposed [here](https://github.com/eth-cscs/COSMA/pull/101#issuecomment-1126514781).
    See [README](https://github.com/eth-cscs/COSMA/blob/master/README.md) and [INSTALL](https://github.com/eth-cscs/COSMA/blob/master/INSTALL.md) for more info on how to build.
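    For quick reference, a typical out-of-source configure step with these options might look like the following sketch (directory layout and any additional flags are placeholders; the linked INSTALL guide remains authoritative):

    ```shell
    # Configure COSMA with the NCCL backend on NVIDIA GPUs (placeholder setup).
    mkdir -p build && cd build
    cmake -DCOSMA_WITH_NCCL=ON ..
    # On AMD GPUs, use RCCL instead:
    #   cmake -DCOSMA_WITH_RCCL=ON ..
    # Or rely on GPU-aware MPI rather than NCCL/RCCL:
    #   cmake -DCOSMA_WITH_GPU_AWARE_MPI=ON ..
    cmake --build .
    ```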
    
    In addition, the following performance improvements have been made:
    - **Improved Caching:**
        - all NCCL buffers, MPI communicators, and NCCL communicators are cached and reused where appropriate.
        - all device memory is cached and reused.
    - **Reduced Data Transfers:** COSMA's GPU backend, [Tiled-MM](https://github.com/eth-cscs/Tiled-MM), has been extended with the option to leave the resulting matrix C on the GPU. In that case, matrix C does not need to be transferred from device to host, which not only reduces communication but also speeds up the whole CPU->GPU pipeline, since no additional synchronizations are needed. Furthermore, the `reduce-scatter` operation no longer has to wait for C to be transferred back to the host; it is invoked immediately with GPU pointers, utilizing the fast inter-GPU links. This way, no unnecessary data transfers occur between CPU and GPU.
    - **All Collectives Updated:** both the `all-gather` and `reduce-scatter` collectives have been improved.
    - **Reduced Data Reshuffling:** double reshuffling of data is avoided, i.e. data from the NCCL/RCCL GPU buffers is copied directly into the target layout, without an intermediate reshuffling step.
    - **Works for Variable Block Sizes:** NCCL/RCCL's `reduce_scatter` operation assumes that all blocks have the same size, so it is not fully equivalent to the `MPI_Reduce_scatterv` we used previously. To overcome this, we pad all blocks to a uniform size.
    - **Portability:** Supports both NVIDIA and AMD GPUs.
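    The caching strategy above can be sketched with a minimal host-side model (not COSMA's actual classes): buffers are keyed by size and handed out again on later requests instead of being re-allocated, mirroring how device memory and communicator handles are reused across multiplications.

    ```cpp
    // Minimal sketch of size-keyed buffer caching (illustrative only).
    #include <cassert>
    #include <cstddef>
    #include <map>
    #include <vector>

    class buffer_cache {
    public:
        // Returns a cached buffer of at least `n` elements, allocating on a miss.
        std::vector<double>& get(std::size_t n) {
            auto it = cache_.lower_bound(n);   // smallest cached buffer >= n
            if (it == cache_.end()) {
                ++misses;
                it = cache_.emplace(n, std::vector<double>(n)).first;
            } else {
                ++hits;
            }
            return it->second;
        }
        int hits = 0;
        int misses = 0;
    private:
        std::map<std::size_t, std::vector<double>> cache_;
    };

    int main() {
        buffer_cache cache;
        cache.get(1024);         // first request: allocates
        cache.get(1024);         // same size: reused
        cache.get(512);          // smaller request: the 1024-element buffer is reused
        assert(cache.misses == 1);
        assert(cache.hits == 2);
        return 0;
    }
    ```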
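    The "leave C on the GPU" option can be illustrated with a toy host-side model. Plain arrays stand in for device memory, and the `copy_c_back` flag and `gemm_like` function are illustrative names, not necessarily Tiled-MM's exact API; the point is only that when the result is consumed on the GPU, the device-to-host copy of C (and its synchronization) is skipped.

    ```cpp
    // Toy model of skipping the device-to-host copy of matrix C (illustrative).
    #include <cassert>
    #include <vector>

    struct device_buffer { std::vector<double> data; };  // stand-in for GPU memory

    // Writes the result into "device" memory; copies it back to the host buffer
    // only if the caller asks for it. Returns the number of D2H transfers done.
    int gemm_like(device_buffer& c_dev, std::vector<double>& c_host, bool copy_c_back) {
        int transfers = 0;
        for (auto& x : c_dev.data) x = 1.0;   // pretend the GPU computed C here
        if (copy_c_back) {
            c_host = c_dev.data;              // device -> host copy (plus a sync)
            ++transfers;
        }
        return transfers;
    }

    int main() {
        device_buffer c_dev{std::vector<double>(4, 0.0)};
        std::vector<double> c_host(4, 0.0);

        // Result consumed on the GPU: no device-to-host transfer is needed.
        assert(gemm_like(c_dev, c_host, /*copy_c_back=*/false) == 0);
        assert(c_host[0] == 0.0);             // host copy untouched

        // Result needed on the host: exactly one transfer.
        assert(gemm_like(c_dev, c_host, /*copy_c_back=*/true) == 1);
        assert(c_host[0] == 1.0);
        return 0;
    }
    ```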
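    The single-reshuffle idea can be sketched as follows: blocks arriving contiguously from a collective (rank 0's block, then rank 1's, and so on) are copied straight into their final strided positions in the target layout in one pass, instead of first being unpacked into an intermediate buffer and reshuffled a second time. The layout below is a made-up example, not COSMA's actual data layout.

    ```cpp
    // Sketch: copy contiguous per-rank blocks directly into a strided layout.
    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Places block `r` (length `block_len`) of the contiguous receive buffer at
    // column `r` of a row-major `block_len x nranks` target layout.
    void place_blocks(const std::vector<double>& recv, std::size_t nranks,
                      std::size_t block_len, std::vector<double>& target) {
        for (std::size_t r = 0; r < nranks; ++r)
            for (std::size_t i = 0; i < block_len; ++i)
                target[i * nranks + r] = recv[r * block_len + i];  // single pass
    }

    int main() {
        // Two ranks; blocks {1,2} and {3,4} received back-to-back.
        std::vector<double> recv = {1, 2, 3, 4};
        std::vector<double> target(4, 0.0);
        place_blocks(recv, /*nranks=*/2, /*block_len=*/2, target);
        // Interleaved target layout: row i holds element i of every rank's block.
        assert((target == std::vector<double>{1, 3, 2, 4}));
        return 0;
    }
    ```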
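    The padding workaround for variable block sizes can be simulated on the host. Since NCCL/RCCL's `reduce_scatter` expects a uniform per-rank count, each block is padded with zeros up to the largest block size; zeros are neutral for a sum reduction, so the real elements are unaffected and each rank simply ignores its padding afterwards. The `pad_blocks` helper below is illustrative, not COSMA's actual code.

    ```cpp
    // Sketch: pad variable-size blocks to a uniform size for reduce-scatter.
    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Pads each block (concatenated in `data`, with per-block sizes in `sizes`)
    // with zeros up to the maximum block size.
    std::vector<double> pad_blocks(const std::vector<double>& data,
                                   const std::vector<std::size_t>& sizes) {
        std::size_t max_size = *std::max_element(sizes.begin(), sizes.end());
        std::vector<double> padded;
        std::size_t offset = 0;
        for (std::size_t s : sizes) {
            for (std::size_t i = 0; i < max_size; ++i)
                padded.push_back(i < s ? data[offset + i] : 0.0);  // zero padding
            offset += s;
        }
        return padded;
    }

    int main() {
        // Two blocks of unequal size: {1,2} and {3,4,5}.
        std::vector<double> data = {1, 2, 3, 4, 5};
        std::vector<std::size_t> sizes = {2, 3};
        std::vector<double> padded = pad_blocks(data, sizes);
        // Every block is now 3 elements long, so per-rank counts are uniform.
        assert((padded == std::vector<double>{1, 2, 0, 3, 4, 5}));
        return 0;
    }
    ```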
    
    This therefore fixes the limitations of https://github.com/eth-cscs/COSMA/pull/101 and brings the above-mentioned improvements.
    
    Thanks to @alazzaro and @gsitaram for their great feedback and contribution to this PR!
This project is licensed under the BSD 3-Clause "New" or "Revised" License.