Optimizing Communication on Multi-GPU Systems (#102) · ff5108e0
Authored by Marko Kabic
    This PR enables COSMA to take advantage of fast GPU-to-GPU interconnects such as NVLink, in order to efficiently utilize modern multi-GPU systems. This is achieved in two ways:
    - **Using the `NCCL`/`RCCL` Libraries:** by specifying the `-DCOSMA_WITH_NCCL=ON` (for NVIDIA GPUs) or `-DCOSMA_WITH_RCCL=ON` (for AMD GPUs) CMake option.
    - **Using GPU-aware MPI:** by specifying the `-DCOSMA_WITH_GPU_AWARE_MPI=ON` CMake option, as proposed [here](https://github.com/eth-cscs/COSMA/pull/101#issuecomment-1126514781).
    See [README](https://github.com/eth-cscs/COSMA/blob/master/README.md) and [INSTALL](https://github.com/eth-cscs/COSMA/blob/master/INSTALL.md) for more info on how to build.
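    For quick reference, a typical out-of-source configure step with these options might look like the following sketch (directory layout and any additional flags are placeholders; the linked INSTALL guide remains authoritative):

    ```shell
    # Configure COSMA with the NCCL backend on NVIDIA GPUs (placeholder setup).
    mkdir -p build && cd build
    cmake -DCOSMA_WITH_NCCL=ON ..
    # On AMD GPUs, use RCCL instead:
    #   cmake -DCOSMA_WITH_RCCL=ON ..
    # Or rely on GPU-aware MPI rather than NCCL/RCCL:
    #   cmake -DCOSMA_WITH_GPU_AWARE_MPI=ON ..
    cmake --build .
    ```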
    
    In addition, the following performance improvements have been made:
    - **Improved Caching:**
        - all NCCL buffers, MPI communicators, and NCCL communicators are cached and reused where appropriate.
        - all device memory is cached and reused.
    - **Reduced Data Transfers:** COSMA's GPU backend, [Tiled-MM](https://github.com/eth-cscs/Tiled-MM), has been extended with the option to leave the resulting matrix C on the GPU. In that case, matrix C does not need to be transferred from device to host, which not only reduces communication but also speeds up the whole CPU->GPU pipeline, since no additional synchronizations are needed. Furthermore, the `reduce-scatter` operation no longer has to wait for C to be transferred back to the host; it is invoked immediately with GPU pointers, utilizing the fast inter-GPU links. This way, no unnecessary data transfers occur between CPU and GPU.
    - **All Collectives Updated:** both the `all-gather` and `reduce-scatter` collectives have been improved.
    - **Reduced Data Reshuffling:** double reshuffling of data is avoided, i.e. data from the NCCL/RCCL GPU buffers is copied directly into the target layout, without an intermediate reshuffling step.
    - **Works for Variable Block Sizes:** NCCL/RCCL's `reduce_scatter` operation assumes that all blocks have the same size, so it is not fully equivalent to the `MPI_Reduce_scatterv` we used previously. To overcome this, we pad all blocks to a uniform size.
    - **Portability:** Supports both NVIDIA and AMD GPUs.
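    The caching strategy above can be sketched with a minimal host-side model (not COSMA's actual classes): buffers are keyed by size and handed out again on later requests instead of being re-allocated, mirroring how device memory and communicator handles are reused across multiplications.

    ```cpp
    // Minimal sketch of size-keyed buffer caching (illustrative only).
    #include <cassert>
    #include <cstddef>
    #include <map>
    #include <vector>

    class buffer_cache {
    public:
        // Returns a cached buffer of at least `n` elements, allocating on a miss.
        std::vector<double>& get(std::size_t n) {
            auto it = cache_.lower_bound(n);   // smallest cached buffer >= n
            if (it == cache_.end()) {
                ++misses;
                it = cache_.emplace(n, std::vector<double>(n)).first;
            } else {
                ++hits;
            }
            return it->second;
        }
        int hits = 0;
        int misses = 0;
    private:
        std::map<std::size_t, std::vector<double>> cache_;
    };

    int main() {
        buffer_cache cache;
        cache.get(1024);         // first request: allocates
        cache.get(1024);         // same size: reused
        cache.get(512);          // smaller request: the 1024-element buffer is reused
        assert(cache.misses == 1);
        assert(cache.hits == 2);
        return 0;
    }
    ```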
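    The "leave C on the GPU" option can be illustrated with a toy host-side model. Plain arrays stand in for device memory, and the `copy_c_back` flag and `gemm_like` function are illustrative names, not necessarily Tiled-MM's exact API; the point is only that when the result is consumed on the GPU, the device-to-host copy of C (and its synchronization) is skipped.

    ```cpp
    // Toy model of skipping the device-to-host copy of matrix C (illustrative).
    #include <cassert>
    #include <vector>

    struct device_buffer { std::vector<double> data; };  // stand-in for GPU memory

    // Writes the result into "device" memory; copies it back to the host buffer
    // only if the caller asks for it. Returns the number of D2H transfers done.
    int gemm_like(device_buffer& c_dev, std::vector<double>& c_host, bool copy_c_back) {
        int transfers = 0;
        for (auto& x : c_dev.data) x = 1.0;   // pretend the GPU computed C here
        if (copy_c_back) {
            c_host = c_dev.data;              // device -> host copy (plus a sync)
            ++transfers;
        }
        return transfers;
    }

    int main() {
        device_buffer c_dev{std::vector<double>(4, 0.0)};
        std::vector<double> c_host(4, 0.0);

        // Result consumed on the GPU: no device-to-host transfer is needed.
        assert(gemm_like(c_dev, c_host, /*copy_c_back=*/false) == 0);
        assert(c_host[0] == 0.0);             // host copy untouched

        // Result needed on the host: exactly one transfer.
        assert(gemm_like(c_dev, c_host, /*copy_c_back=*/true) == 1);
        assert(c_host[0] == 1.0);
        return 0;
    }
    ```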
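    The single-reshuffle idea can be sketched as follows: blocks arriving contiguously from a collective (rank 0's block, then rank 1's, and so on) are copied straight into their final strided positions in the target layout in one pass, instead of first being unpacked into an intermediate buffer and reshuffled a second time. The layout below is a made-up example, not COSMA's actual data layout.

    ```cpp
    // Sketch: copy contiguous per-rank blocks directly into a strided layout.
    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Places block `r` (length `block_len`) of the contiguous receive buffer at
    // column `r` of a row-major `block_len x nranks` target layout.
    void place_blocks(const std::vector<double>& recv, std::size_t nranks,
                      std::size_t block_len, std::vector<double>& target) {
        for (std::size_t r = 0; r < nranks; ++r)
            for (std::size_t i = 0; i < block_len; ++i)
                target[i * nranks + r] = recv[r * block_len + i];  // single pass
    }

    int main() {
        // Two ranks; blocks {1,2} and {3,4} received back-to-back.
        std::vector<double> recv = {1, 2, 3, 4};
        std::vector<double> target(4, 0.0);
        place_blocks(recv, /*nranks=*/2, /*block_len=*/2, target);
        // Interleaved target layout: row i holds element i of every rank's block.
        assert((target == std::vector<double>{1, 3, 2, 4}));
        return 0;
    }
    ```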
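    The padding workaround for variable block sizes can be simulated on the host. Since NCCL/RCCL's `reduce_scatter` expects a uniform per-rank count, each block is padded with zeros up to the largest block size; zeros are neutral for a sum reduction, so the real elements are unaffected and each rank simply ignores its padding afterwards. The `pad_blocks` helper below is illustrative, not COSMA's actual code.

    ```cpp
    // Sketch: pad variable-size blocks to a uniform size for reduce-scatter.
    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Pads each block (concatenated in `data`, with per-block sizes in `sizes`)
    // with zeros up to the maximum block size.
    std::vector<double> pad_blocks(const std::vector<double>& data,
                                   const std::vector<std::size_t>& sizes) {
        std::size_t max_size = *std::max_element(sizes.begin(), sizes.end());
        std::vector<double> padded;
        std::size_t offset = 0;
        for (std::size_t s : sizes) {
            for (std::size_t i = 0; i < max_size; ++i)
                padded.push_back(i < s ? data[offset + i] : 0.0);  // zero padding
            offset += s;
        }
        return padded;
    }

    int main() {
        // Two blocks of unequal size: {1,2} and {3,4,5}.
        std::vector<double> data = {1, 2, 3, 4, 5};
        std::vector<std::size_t> sizes = {2, 3};
        std::vector<double> padded = pad_blocks(data, sizes);
        // Every block is now 3 elements long, so per-rank counts are uniform.
        assert((padded == std::vector<double>{1, 2, 0, 3, 4, 5}));
        return 0;
    }
    ```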
    
    This therefore fixes the limitations of https://github.com/eth-cscs/COSMA/pull/101 and brings the above-mentioned improvements.
    
    Thanks to @alazzaro and @gsitaram for their great feedback and contribution to this PR!
This project is licensed under the BSD 3-Clause "New" or "Revised" License.