Commits · v0.10.1 · HPCSource / DeepSpeed

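This project is mirrored from https://github.com/microsoft/DeepSpeed. Pull mirroring failed 5 months ago.
Repository mirroring has been paused due to too many failed attempts. It can be resumed by a project maintainer or owner.
Last successful update 5 months ago.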

Aug 18, 2023
- pin transformers to last known good commit (#4174) · 46d859a7
  Michael Wyatt authored 1 year ago
  
  v0.10.1
  
  46d859a7
- Add DSE branch input to nv-ds-chat (#4173) · a3540f17
  Lev Kurilenko authored 1 year ago
  ```
  * Add DSE branch input to nv-ds-chat

  * Use provided DSE branch

  * Echo DSE branch
  ```
  a3540f17
Aug 17, 2023

[CPU][Bugfix] Make uid and addr_port part of SHM name in CCL backend (#4115) · 19e9a7c0
Ma, Guokai authored 1 year ago
```
* distinguish shm name with uid and addr_port

* fix formatting
```
19e9a7c0

Add DS-Chat CI workflow (#4127) · 64c670ef

Lev Kurilenko authored 1 year ago


* Add DS Chat CI workflow

* Add CRITIC_CKPT_DIR env variable to actions.yml

* Update step 2 opt 125m ckpt dir name

* Update test dir

* Add workflow_dispatch

* Add :

* Add nv-ds-chat badge to main README

* Open GH issue if DS Chat CI fails

* Remove pull_request and merge_group conditions

* Update and test torch version

* Remove PR trigger

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

64c670ef

Aug 16, 2023
- fix badges (#4162) · bd65eeaf
  Michael Wyatt authored 1 year ago
  
  bd65eeaf
- Handling for SIGTERM as well (#4160) · 1a295739
  Logan Adams authored 1 year ago
  
  1a295739
- Fixes #4151 (#4152) · 740b7805
  Sam Foreman authored 1 year ago
  
  740b7805
Aug 15, 2023

Return nn.parameter type for weights and biases (#4146) · 341cefd2
Molly Smith authored 1 year ago
```
* Return nn.parameter type for weights and biases

* whitespace

* Fix bias tensor size
```
341cefd2

Remove incorrect async-io library checking code. (#4150) · a4523018

Logan Adams authored 1 year ago

* Update library installed checker to use check_cmd

* This code was used for checking if aio was installed but this was refactored and this code was left

a4523018

Aug 14, 2023

Respect memory pinning config (#4131) · 9d79cfd1
Olatunji Ruwase authored 1 year ago
```
* Respect memory pinning config

* Bug fix
```
9d79cfd1
Generalize frozen weights unit test (#4140) · 7a282db8
Olatunji Ruwase authored 1 year ago
```
* Fix unit test

* Fix unit test
```
7a282db8

Handle PermissionError in os.chmod Call - Update engine.py (#4139) · 629b2039

Chris M authored 1 year ago

* Update engine.py

This branch includes changes to handle potential exceptions that may occur when attempting to change file permissions using the os.chmod function within the DeepSpeed engine. The specific issue addressed is the PermissionError that may arise when working with certain filesystems or under restricted permissions.

* Change to use logger

* Split permissions out and add unit test

* UnitTest(use DistTestClass) + trailing whitespace

* update unit test

* UT parametrize 1, 2 ,3

* trim white space from unit test

* change to PermissionError

* run pre-commit formats

* Catch FileNotFoundError & PermissionError

629b2039
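
The commit message for #4139 describes catching `PermissionError` (and later `FileNotFoundError`) around an `os.chmod` call and reporting the failure through a logger. A minimal sketch of that pattern follows. The helper name `try_make_executable` and the choice of the owner-execute bit are assumptions made for illustration; this is not the code from `engine.py`.

```python
import logging
import os
import stat

logger = logging.getLogger(__name__)


def try_make_executable(path):
    """Hypothetical helper: add the owner-execute bit, tolerating failures.

    Returns True if the mode was changed, False if the change was skipped.
    """
    try:
        mode = os.stat(path).st_mode
        os.chmod(path, mode | stat.S_IXUSR)
        return True
    except (FileNotFoundError, PermissionError) as exc:
        # Some filesystems or restricted environments refuse chmod;
        # log the problem and continue instead of aborting.
        logger.warning("Could not change permissions on %s: %s", path, exc)
        return False
```

Returning a flag instead of raising lets the caller decide whether a failed permission change is fatal; the sketch treats it as non-fatal.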

Aug 10, 2023

Update torch1.9 tests to 1.10 to match latest accelerate. (#4126) · ff7d5275

Logan Adams authored 1 year ago

* Fix torch19 tests

* test pip list and --no-build-isolation

* Enable verbosity

* pin to older accelerate version

* Update oldest tested torch to 1.10

* Properly rename directories

* Return PR tests to CI again.

* Remove -vv

ff7d5275

Aug 09, 2023

Update nightly workflows to open an issue if CI fails (#3952) · 0c75f4a3

Logan Adams authored 1 year ago

* Update H100 workflow to open an issue if nightly CI fails

* Test running as not CI

* Add all nightly/switch envvar name

* Test with AMD

* Add way to get url, switch path of template

* Add additional checkout step

* Move actions checkout step

* Try absolute path with github workspace

* Create issue without template/path

* Re-enable and add debug logic

* add if failed()

* More debug

* Try without checkout action uses

* Rename file

* Update variables

* Update issue template

* Confirm removing permissions still work

* Revert "Confirm removing permissions still work"

This reverts commit e7c2915a.

* Re-enable permissions

* Remove PR trigger for AMD MI200 tests

* Revert "Remove PR trigger for AMD MI200 tests"

This reverts commit 5c5c5fd6.

* Test update_existing

* Switch to composite action

* Fix line ending encoding issue

* Switch failure to be a variable

* Test with second workflow

* Format fix

* Switch failure to always

* Switch back to previously working way

* Test permission changes

* Revert "Test permission changes"

This reverts commit e051da75.

* Update existing bugs with newest build failure link

* Remove PR triggers for that were used for testing.

0c75f4a3

Add ops (#4119) · d300517f
Logan Adams authored 1 year ago

d300517f

Fix Issue 4083 (#4084) · 8a8683d3

Joe Mayer authored 1 year ago


* removing bad check

* adding offload check for bf16 optimizer

* grad reduce for extra large param

* check grad_accum exists before converting

---------

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

8a8683d3

enable pipeline checkpoint loading mode (#3629) · 1e0c39c6

leiwen83 authored 1 year ago


In cpu ram limited machine, loading checkpoint at the start up may
cause oom as all rank in the same node are loading the opt state
in the same time. So for this scenario, we make a choice that loading
checkpoint could be made pipeline way.

Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>

1e0c39c6
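
The message for #3629 explains the motivation: when every rank on a node loads optimizer state at once, host RAM can run out, so loading is done in a pipelined way. The sketch below illustrates one way such turn-taking could work, with threads standing in for the ranks of a single node. The names `load_in_turns` and `simulate`, and the use of a per-node barrier, are assumptions for illustration and are not taken from the DeepSpeed implementation.

```python
import threading


def load_in_turns(local_rank, local_size, load_fn, barrier):
    """Hypothetical helper: let one local rank load at a time.

    `barrier` is any callable that blocks until all local ranks reach it.
    """
    result = None
    for turn in range(local_size):
        if turn == local_rank:
            result = load_fn()
        barrier()  # all ranks wait until the current turn has finished
    return result


def simulate(local_size=4):
    """Run `local_size` threads as stand-ins for ranks on one node."""
    barrier = threading.Barrier(local_size)
    lock = threading.Lock()
    active = 0   # number of ranks loading right now
    peak = 0     # highest value `active` ever reached
    order = []   # order in which ranks loaded
    results = [None] * local_size

    def make_loader(rank):
        def load():
            nonlocal active, peak
            with lock:
                active += 1
                peak = max(peak, active)
                order.append(rank)
            # A real loader would read the checkpoint from disk here.
            with lock:
                active -= 1
            return f"state-for-rank-{rank}"
        return load

    def worker(rank):
        results[rank] = load_in_turns(rank, local_size, make_loader(rank), barrier.wait)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(local_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order, peak, results
```

In this sketch at most one rank holds a checkpoint in flight at any moment, which trades startup time for a lower peak in host memory.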

add deepspeed chat arxiv report (#4110) · 78d985ab

Conglong Li authored 1 year ago


* add deepspeed chat arxiv report

* add zeroquant v2 and fp

* add selective enhencement

* add ignore for 'Youn' in spell checker

---------

Co-authored-by: yaozhewei <zheweiy@berkeley.edu>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

78d985ab

Pass correct node size for ZeRO++ (#4085) · f0463b4d

Connor Holmes authored 1 year ago


* Pass correct node size

* formatting

---------

Co-authored-by: Connor Holmes <development@cmikeh2.me>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

f0463b4d

Aug 08, 2023
- Disable z3 tracing profiler (#4106) · 977254c1
  Olatunji Ruwase authored 1 year ago
  ```
  Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
  ```
  977254c1
- use correct ckpt path when base_dir not available (#4101) · abe293b4
  Polisetty V R K Jyothendra Varma authored 1 year ago
  ```
  * base_dir may not present all time and results in incorrect path

  * Update replace_module.py

  * Update config.py

  ---------

  Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
  ```
  abe293b4
- set temperature to avoid config validation error (#4107) · 975bcbc0
  Michael Wyatt authored 1 year ago
  
  975bcbc0
- add type checker ignore to resolve that pylance can't resolved noqa annotation (#4102) · 57a27b08
  Earlee authored 1 year ago
  ```
  Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
  ```
  57a27b08
Aug 07, 2023
- zero_to_fp32 script adds support for tag argument (#4089) · 241ae39a
  Earlee authored 1 year ago
  
  241ae39a
Aug 04, 2023

update ut/doc for glm/codegen (#4057) · 85dc854b

mzl authored 1 year ago


* update ut/doc for glm/codegen

* formatting/spacing on docs

* re-order/alphabetize the models

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>

85dc854b

fix typo: change polciies to policies (#4090) · 4cde5da8
digger yu authored 1 year ago

4cde5da8

Aug 03, 2023
- Spread layers more uniformly when using partition_uniform (#4053) · e8318634
  marcobellagente93 authored 1 year ago
  ```
  * update partition_uniform util function

  * formatting

  ---------

  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  ```
  e8318634
- Fix Stable Diffusion Injection (#4078) · 1ba40989
  Lev Kurilenko authored 1 year ago
  ```
  * Initial commit

  * Clean up

  * Fix formatting
  ```
  1ba40989
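
The title of #4053 above says layers are spread more uniformly when using `partition_uniform`. One common way to do that is to hand each leftover item to a different part, so that part sizes never differ by more than one. The sketch below shows that idea only. The function name `partition_uniform_sketch`, the boundary-list return format, and the choice to give the extra items to the earliest parts are assumptions, not details confirmed by the commit.

```python
def partition_uniform_sketch(num_items, num_parts):
    """Return num_parts + 1 boundaries that split num_items as evenly as possible.

    Part i covers the half-open range [bounds[i], bounds[i + 1]).
    """
    if num_parts <= 0:
        raise ValueError("num_parts must be positive")
    chunk, residual = divmod(num_items, num_parts)
    bounds = [0]
    for part in range(num_parts):
        # The first `residual` parts each take one extra item.
        size = chunk + (1 if part < residual else 0)
        bounds.append(bounds[-1] + size)
    return bounds
```

For example, 10 layers over 4 stages yield part sizes 3, 3, 2, 2 instead of leaving all leftover layers on a single stage.
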
Aug 02, 2023
- unpin datasets in UT (#4079) · a7fe3bcc
  Michael Wyatt authored 1 year ago
  
  a7fe3bcc
Aug 01, 2023

Refactor autoTP inference for HE (#4040) · 94c7233a

Molly Smith authored 1 year ago

* Refactor autoTP inference for HE

* Formatting

* Move redundant functions to autotp

* Remove self from loading class

* formatting

* Some gpt2 autotp path fixes

* precommit

94c7233a

Jul 31, 2023

fix: remove unnessary `#` punct in the second `sed` command (#4061) · e31b4041
Hugh Pu authored 1 year ago

e31b4041

add reproducible compilation environment (#3943) · f763b93d

Xie Zejian authored 1 year ago


* add reproducible compilation environment

* fix ci

* fix typo for formatting check

* Fix casing for format

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>

f763b93d

Jul 28, 2023

save_non_zero_checkpoint on first partition group (#3787) · 8a63754b
Zhen Zhang authored 1 year ago
```
Co-authored-by: Zhen Zhang <zhzhn@amazon.com>
```
8a63754b

Fix deadlock when SHM based allreduce spin too fast (#4048) · 82c498d9

Ma, Guokai authored 1 year ago


* Fix deadlock when allreduce spin too fast

* Change state to enum to increase readability

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

82c498d9

Multiple zero stage 3 related fixes (#3886) · 7f90ef4b

Olatunji Ruwase authored 1 year ago

* Option to override module apply

* Removing early partitioning in override

* Unit tests

* Cleanup

* Adapt unit test to succeed

* Handle missed params

* Add accelerate

* Code cleanup

* Add doc

* Add doc

* Add doc

7f90ef4b

Jul 27, 2023

faster allreduce with omp parallel for reduce kernel (#4049) · 7f26bb6a
Ma, Guokai authored 1 year ago

7f26bb6a

autoTP for fused qkv weight (#3844) · 6b877d2d

mzl authored 1 year ago


* autoTP for fused qkv weight

* fix format

* clean up

* clean up

* clean up

* update

* make logic flow to util and move to file

* fix formatting

* remove empty line

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

6b877d2d

enable autoTP for MPT (#3861) · 0bafeac4

Wang, Yi authored 1 year ago


* enable autoTP for MPT

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add model specific func to auto_tp_model_utils.py

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

0bafeac4

fix opt-350m shard loading issue in AutoTP (#3600) · 76953a37

Wang, Yi authored 1 year ago

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

76953a37

fix comm logging for inference (#4043) · 0b507253
Ma, Guokai authored 1 year ago

0b507253