[xla] hlo_computation: compact instructions' vector on Cleanup() (c99f3c4d) · Commit · HPCSource / tensorflow


Commit c99f3c4d authored 11 months ago by Emilio Cota, committed by TensorFlower Gardener 11 months ago

[xla] hlo_computation: compact instructions' vector on Cleanup()

tl;dr: this gives a 1.26x compilation time speedup for a large, dense
model in XLA:GPU.

The largest perf leaf seen in profiles of a large, dense model
is related to computing the post order. Surprisingly, it is not
the DFS itself that is most expensive; rather, most of the time is
spent scanning through HloComputation::Instructions() to identify
DFS roots.

The reason this scan becomes expensive as instructions are removed
is that the vector holding HloInstructionInfo (introduced in
cl/600130708 || https://github.com/openxla/xla/commit/247280ab727)
is not shrunk as it flows through the pipeline, forcing us
to walk through many deleted "tombstone" entries. Here is the
histogram of the number of tombstones encountered during post order
computations for this model:

```
[        1 - 1,536,345) ****************************** (1,300,248)
[1,536,345 - 3,072,690)  (2)
[3,072,690 - 4,609,034)  (364)
[4,609,034 - 6,145,378)  (10,443)
```

To ameliorate this, this CL shrinks the vector periodically,
so far only between passes. This is done by running compaction
on the vector during HloComputation::Cleanup(), which is called
after every pass. The cost of compaction is made proportional to
the number of deleted entries by swapping--if needed--each tombstone
with the rightmost (within the vector) non-deleted entry.
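
The swap-with-rightmost idea can be sketched roughly as follows. This is
an illustration only, not the code in this CL: Slot, CompactTombstones and
the separate list of tombstone indices are hypothetical names, and the
sketch assumes the caller tracks (without duplicates) which indices were
deleted.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for an entry in the instructions' vector.
struct Slot {
  int id;
  bool deleted;  // true marks a tombstone
};

// Removes every tombstone from `slots` and returns how many were removed.
// `dead` is assumed to list the index of each tombstone exactly once.
// Work is proportional to dead.size(): each iteration does O(1) swaps, and
// each pop_back discards exactly one tombstone.
// Note: the relative order of the surviving entries is not preserved.
std::size_t CompactTombstones(std::vector<Slot>& slots,
                              const std::vector<std::size_t>& dead) {
  const std::size_t original_size = slots.size();
  for (std::size_t i : dead) {
    // Trim tombstones at the tail so that back() is a live entry.
    while (!slots.empty() && slots.back().deleted) slots.pop_back();
    // This tombstone was already trimmed from the tail.
    if (i >= slots.size()) continue;
    assert(slots[i].deleted);
    // Move the rightmost live entry into the hole, then drop the tombstone.
    std::swap(slots[i], slots.back());
    slots.pop_back();
  }
  return original_size - slots.size();
}
```

With live entries 0, 2, 4 and tombstones at indices 1, 3, 5, this leaves
the entries 0, 4, 2: entry 4 fills the hole at index 1, and the other two
tombstones are simply trimmed from the tail.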

This brings the number of seen tombstones down significantly:

```
[        1 -   327,699) ****************************** (937,541)
[  327,699 -   655,396)  (308)
[  655,396 -   983,094)  (0)
[  983,094 - 1,310,792)  (1)
```

Note: we could further improve compaction by calling Cleanup()
from some passes, instead of just between passes. However, that
would not yield a significant gain; at least for this model,
scanning the instructions' vector now takes ~1% of total time
(vs. ~17% before).
FUTURE_COPYBARA_INTEGRATE_REVIEW=https://github.com/openxla/xla/pull/10503 from pearu:pearu/log1p d35cef4f5fa09482c49edfee709e86c5ca29adde
PiperOrigin-RevId: 619057964

Parent a4d9df4e


Showing 2216 additions and 33 deletions
