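This project is mirrored from https://github.com/microsoft/DeepSpeed.
Pull mirroring failed.
Repository mirroring has been paused due to too many failed attempts. It can be resumed by a project maintainer or owner.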
- Aug 18, 2023

Authored by Michael Wyatt

Authored by Lev Kurilenko
* Add DSE branch input to nv-ds-chat
* Use provided DSE branch
* Echo DSE branch

- Aug 17, 2023

Authored by Ma, Guokai
* distinguish shm name with uid and addr_port
* fix formatting

Authored by Lev Kurilenko
* Add DS Chat CI workflow
* Add CRITIC_CKPT_DIR env variable to actions.yml
* Update step 2 opt 125m ckpt dir name
* Update test dir
* Add workflow_dispatch
* Add :
* Add nv-ds-chat badge to main README
* Open GH issue if DS Chat CI fails
* Remove pull_request and merge_group conditions
* Update and test torch version
* Remove PR trigger
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

- Aug 16, 2023

Authored by Michael Wyatt

Authored by Logan Adams

Authored by Sam Foreman

- Aug 15, 2023

Authored by Molly Smith
* Return nn.parameter type for weights and biases
* whitespace
* Fix bias tensor size

Authored by Logan Adams
* Update library installed checker to use check_cmd
* This code was used for checking if aio was installed, but this was refactored and this code was left

- Aug 14, 2023

Authored by Olatunji Ruwase
* Respect memory pinning config
* Bug fix

Authored by Olatunji Ruwase
* Fix unit test
* Fix unit test

Authored by Chris M
* Update engine.py
  This branch includes changes to handle potential exceptions that may occur when attempting to change file permissions using the os.chmod function within the DeepSpeed engine. The specific issue addressed is the PermissionError that may arise when working with certain filesystems or under restricted permissions.
* Change to use logger
* Split permissions out and add unit test
* UnitTest(use DistTestClass) + trailing whitespace
* update unit test
* UT parametrize 1, 2, 3
* trim white space from unit test
* change to PermissionError
* run pre-commit formats
* Catch FileNotFoundError & PermissionError
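
The commit above describes catching `PermissionError` and `FileNotFoundError` around `os.chmod` and reporting the failure through a logger instead of aborting. A minimal sketch of that pattern follows. The helper name `try_make_executable` and the choice of the owner-execute bit are assumptions made for illustration; this is not the code from DeepSpeed's `engine.py`.

```python
import logging
import os
import stat

logger = logging.getLogger(__name__)


def try_make_executable(path: str) -> bool:
    """Add the owner-execute bit to `path`; return True on success.

    Hypothetical helper for illustration, not DeepSpeed's actual function.
    """
    try:
        mode = os.stat(path).st_mode
        os.chmod(path, mode | stat.S_IXUSR)
        return True
    except (FileNotFoundError, PermissionError) as exc:
        # Some filesystems or restricted environments reject chmod.
        # Log a warning and let the caller continue.
        logger.warning("Could not change permissions of %s: %s", path, exc)
        return False
```

Returning a boolean rather than re-raising lets the caller decide whether a failed permission change matters for its workflow.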
- Aug 10, 2023

Authored by Logan Adams
* Fix torch19 tests
* test pip list and --no-build-isolation
* Enable verbosity
* pin to older accelerate version
* Update oldest tested torch to 1.10
* Properly rename directories
* Return PR tests to CI again.
* Remove -vv

- Aug 09, 2023

Authored by Logan Adams
* Update H100 workflow to open an issue if nightly CI fails
* Test running as not CI
* Add all nightly/switch envvar name
* Test with AMD
* Add way to get url, switch path of template
* Add additional checkout step
* Move actions checkout step
* Try absolute path with github workspace
* Create issue without template/path
* Re-enable and add debug logic
* add if failed()
* More debug
* Try without checkout action uses
* Rename file
* Update variables
* Update issue template
* Confirm removing permissions still work
* Revert "Confirm removing permissions still work"
  This reverts commit e7c2915a.
* Re-enable permissions
* Remove PR trigger for AMD MI200 tests
* Revert "Remove PR trigger for AMD MI200 tests"
  This reverts commit 5c5c5fd6.
* Test update_existing
* Switch to composite action
* Fix line ending encoding issue
* Switch failure to be a variable
* Test with second workflow
* Format fix
* Switch failure to always
* Switch back to previously working way
* Test permission changes
* Revert "Test permission changes"
  This reverts commit e051da75.
* Update existing bugs with newest build failure link
* Remove PR triggers that were used for testing.

Authored by Logan Adams

Authored by Joe Mayer
* removing bad check
* adding offload check for bf16 optimizer
* grad reduce for extra large param
* check grad_accum exists before converting
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

Authored by leiwen83
On machines with limited CPU RAM, loading a checkpoint at startup may cause an OOM because all ranks on the same node load the optimizer state at the same time. For this scenario, this change adds the choice to load the checkpoint in a pipelined way.
Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
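
The commit above motivates loading checkpoints one rank at a time on a node so that the peak CPU memory of all ranks does not coincide. The sketch below illustrates that idea in general terms: ranks take turns, separated by a barrier. The names `load_in_turns` and `simulate` are hypothetical, and ranks are simulated with threads; in a real job the barrier would come from the distributed backend (for example a per-node process group), which is an assumption not shown here. It is not DeepSpeed's implementation.

```python
import threading
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def load_in_turns(local_rank: int, local_world_size: int,
                  load_fn: Callable[[], T],
                  barrier: Callable[[], object]) -> Optional[T]:
    """Run load_fn on one local rank at a time, in rank order."""
    result = None
    for turn in range(local_world_size):
        if turn == local_rank:
            result = load_fn()  # only this rank is loading during this turn
        barrier()  # all ranks wait until the current rank has finished
    return result


def simulate(num_ranks: int = 4) -> int:
    """Simulate ranks with threads; return the peak number of concurrent loads."""
    barrier = threading.Barrier(num_ranks)
    lock = threading.Lock()
    active = 0
    peak = 0

    def fake_load() -> str:
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        # A real load would read optimizer state from disk here.
        with lock:
            active -= 1
        return "state"

    threads = [
        threading.Thread(target=load_in_turns,
                         args=(rank, num_ranks, fake_load, barrier.wait))
        for rank in range(num_ranks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak
```

The trade-off is startup time: loading is serialized across the ranks of a node in exchange for a lower memory peak.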

Authored by Conglong Li
* add deepspeed chat arxiv report
* add zeroquant v2 and fp
* add selective enhancement
* add ignore for 'Youn' in spell checker
Co-authored-by: yaozhewei <zheweiy@berkeley.edu>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

Authored by Connor Holmes
* Pass correct node size
* formatting
Co-authored-by: Connor Holmes <development@cmikeh2.me>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

- Aug 08, 2023

Authored by Olatunji Ruwase
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* base_dir may not always be present, which results in an incorrect path
* Update replace_module.py
* Update config.py
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

Authored by Michael Wyatt

Authored by Earlee
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

- Aug 07, 2023

Authored by Earlee

- Aug 04, 2023

Authored by mzl
* update ut/doc for glm/codegen
* formatting/spacing on docs
* re-order/alphabetize the models
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>

Authored by digger yu

- Aug 03, 2023

Authored by marcobellagente93
* update partition_uniform util function
* formatting
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

Authored by Lev Kurilenko
* Initial commit
* Clean up
* Fix formatting

- Aug 02, 2023

Authored by Michael Wyatt

- Aug 01, 2023

Authored by Molly Smith
* Refactor autoTP inference for HE
* Formatting
* Move redundant functions to autotp
* Remove self from loading class
* formatting
* Some gpt2 autotp path fixes
* precommit

- Jul 31, 2023

Authored by Hugh Pu

Authored by Xie Zejian
* add reproducible compilation environment
* fix ci
* fix typo for formatting check
* Fix casing for format
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>

- Jul 28, 2023

Authored by Zhen Zhang
Co-authored-by: Zhen Zhang <zhzhn@amazon.com>

Authored by Ma, Guokai
* Fix deadlock when allreduce spins too fast
* Change state to enum to increase readability
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

Authored by Olatunji Ruwase
* Option to override module apply
* Removing early partitioning in override
* Unit tests
* Cleanup
* Adapt unit test to succeed
* Handle missed params
* Add accelerate
* Code cleanup
* Add doc
* Add doc
* Add doc

- Jul 27, 2023

Authored by Ma, Guokai

Authored by mzl
* autoTP for fused qkv weight
* fix format
* clean up
* clean up
* clean up
* update
* make logic flow to util and move to file
* fix formatting
* remove empty line
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Authored by Wang, Yi
* enable autoTP for MPT
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add model specific func to auto_tp_model_utils.py
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Authored by Wang, Yi
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

Authored by Ma, Guokai