TensorFlow 和 PyTorch 都是常用的深度学习框架,各自有一套独特但又相似的 GPU 内存管理机制(BFC 算法)。它山之石可以攻玉。了解 TensorFlow 的 BFC 算法有助于学习 PyTorch 管理 GPU 内存的精妙之处。本文重点关注 TensorFlow BFC 算法的核心思想,研究的 TensorFlow 版本为
1f1379c5f721579c58b4ee560daac7df90f8519a
。更多内容请参考:
文章目录
- 关于 TensorFlow BFC 算法
- 关于 XLA
- 初见端倪
- 初识 XLA
- 主要数据结构
- Chunk
- Bin
- 辅助类
- BFC 算法核心
- 1 分配内存
- 1.1 调整大小
- 1.2 查找Bin
- 1.3 查找Chunk
- 1.4 扩展内存
- 2 回收内存
- 2.1 获取ChunkHandle
- 2.2 更新状态
- 2.3 合并Chunk
- 总结
- 参考资料
- 关于 XLA
- 关于 TensorFlow BFC
关于 TensorFlow BFC 算法
以下内容来自:TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack
背景:使用 GPU 训练时,一次训练任务无论是模型参数还是中间结果都需要占用大量显存。为了避免每次训练重新开辟显存带来计算之外的开销,一般框架的做法是在真正的训练任务开始前,将每个节点的输入和输出,以及模型参数的 shape 计算出来并全局开辟一次,例如 Caffe 就是这种做法。随着深度学习模型的发展和迭代,不仅模型训练的数据 shape 可能发生变化,就连模型本身在训练过程中也可能发生变化,那么按照固定 shape 一次开辟显存的做法就不能满足需求了。为此,TensorFlow 重新设计了较为灵活的显存管理机制,它使用了名为 BFC 的分配算法,并通过 BFC Allocator 为每个 Tensor 分配满足需求的显存。
问题:显存分配与回收的性能需求。Tensor 在每次创建时会得到存储区域,而每一轮训练都要重新创建新的 Tensor,那么这里面临的一个问题:如此频繁的分配和回收存储区,如何才能做的高效?试想对于 GPU 来说,如果Allocate
函数直接封装 CUDA 中昂贵的cudaMalloc
函数,当 Tensor 被释放时直接调用cudaFree
函数,那么训练速度将会因为这些 overhead 大打折扣。
思路:存储池。如果你对操作系统这门课比较熟悉,那么应该很容易想到解决办法:将显存按照不同的大小一次性开辟出来,并组成存储池,每次调用Allocate
函数时从存储池中获取,Tensor 回收时将显存重新挂到存储池中。这样做确实可以满足性能需求,但是需要为此设计一个相对复杂的存储管理器。BFC Allocator 就是 TensorFlow 中管理 GPU 显存的存储管理器。
关于 XLA
初见端倪
为什么要先说 XLA 呢?
笔者在 TensorFlow 仓库的 main 分支寻找与 BFC (Best-Fit with Coalescing) 算法有关的源码时,首先是定位到了 tensorflow/core/common_runtime/gpu,该目录下有 gpu_bfc_allocator.h 和 gpu_bfc_allocator.cc 两个文件,但并没有找到具体的与 BFC 算法相关的代码实现。
gpu_bfc_allocator.h 的核心代码:
#include <memory>
#include <optional>
#include <string>
#include "xla/tsl/framework/allocator.h"
#include "xla/tsl/framework/bfc_allocator.h"
#include "xla/tsl/platform/macros.h"
namespace tensorflow {
// A GPU memory allocator that implements a 'best-fit with coalescing'
// algorithm.
class GPUBFCAllocator : public tsl::BFCAllocator {
public:
// See BFCAllocator::Options.
struct Options {
// Overridden by TF_FORCE_GPU_ALLOW_GROWTH if that envvar is set.
bool allow_growth = false;
// If nullopt, defaults to TF_ENABLE_GPU_GARBAGE_COLLECTION, or true if that
// envvar is not present.
//
// Note:
//
// - BFCAllocator defaults garbage_collection to false, not true.
// - this is not the same override behavior as TF_FORCE_GPU_ALLOW_GROWTH.
std::optional<bool> garbage_collection;
double fragmentation_fraction = 0;
bool allow_retry_on_failure = true;
};
GPUBFCAllocator(std::unique_ptr<tsl::SubAllocator> sub_allocator,
size_t total_memory, const std::string& name,
const Options& opts);
~GPUBFCAllocator() override {}
GPUBFCAllocator(const GPUBFCAllocator&) = delete;
void operator=(const GPUBFCAllocator&) = delete;
};
} // namespace tensorflow
但从 gpu_bfc_allocator.h 的代码中,笔者发现GPUBFCAllocator
类继承自tsl::BFCAllocator
,而tsl
命名空间就来自第三方库 third_party/xla。
而 BFC 算法的具体实现就在 xla/tsl/framework 目录下的 bfc_allocator.h 和 bfc_allocator.cc 两个文件中。
初识 XLA
TensorFlow 通过 XLA 编译器优化 GPU 代码执行。
下图来自:openxla/xla: A machine learning compiler for GPUs, CPUs, and ML accelerators
data:image/s3,"s3://crabby-images/ff092/ff092819edc7d720bad5789f535621c7271c7c6a" alt=""
下图来自:技术栈架构 - AI技术栈解析及应用- 作者:张真瑜 | 山东大学智能创新研究院
OpenXLA 作为一个灵活的深度学习编译器框架,与 PyTorch 和 TensorFlow 深度集成,通过自定义算子、JIT 编译和 GPU 内核融合等技术,大幅提升了这些深度学习框架在 GPU 上的执行效率。同时,OpenXLA 还利用 CUDA Runtime API 和 CUDA Driver API,实现了对 GPU 硬件的精细控制和优化,包括内存管理、设备操作和内核启动等。
主要数据结构
Chunk
以下内容部分来自:TensorFlow 源码分析之内存管理BFC算法——DeepReve
整个内存空间由一个按基址升序排列的Chunk双向链表来表示,它们的直接前趋和后继必须在地址连续的内存空间。Chunk结构体里含有实际大小、请求大小、是否被占用、基址、直接前趋、直接后继、Bin索引等信息。
data:image/s3,"s3://crabby-images/2883a/2883a408ff5430ff71f6710df7ac88af2b626ad4" alt=""
以下内容部分来自:Tensorflow 源码分析- 从GPU OOM开始说Tensorflow的BFC内存管理
Chunk 是 TensorFlow 的最小内存单位,由数倍 256 bytes (kMinAllocationSize) 的连续内存块组成。TensorFlow 的内存管理是基于 Chunk 的管理。
data:image/s3,"s3://crabby-images/bcc85/bcc85cd48ca318fda677edfc44475d7d286360ec" alt=""
以下内容部分来自:TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack
Chunk 是 BFC 最核心的数据结构之一,在 TensorFlow 源码中是以struct
来描述的。具体来说,一个 Chunk 代表一段连续的存储空间,BFC 要求各个 Chunk 要按照基地址升序排列并组织成双向链表,下图展示了 Chunk 的结构以及 Chunk 之间的连接关系。初始时,每个 Chunk 都有自己的size
,并且这些size
都是以 256 字节为模。应当注意,每个 Chunk 或者完全被标记为使用,或者完全标记为空闲,不存在该 Chunk 内只有部分空间被使用的情况。
data:image/s3,"s3://crabby-images/d1eba/d1ebac0ae86caae2dc980051285c2f9d96a126c1" alt=""
Chunk
的代码实现:
// A Chunk points to a piece of memory that's either entirely free or entirely
// in use by one user memory allocation.
//
// An AllocationRegion's memory is split up into one or more disjoint Chunks,
// which together cover the whole region without gaps. Chunks participate in
// a doubly-linked list, and the prev/next pointers point to the physically
// adjacent chunks.
//
// Since a chunk cannot be partially in use, we may need to split a free chunk
// in order to service a user allocation. We always merge adjacent free
// chunks.
//
// Chunks contain information about whether they are in use or whether they
// are free, and contain a pointer to the bin they are in.
struct Chunk {
size_t size = 0; // Full size of buffer.
// We sometimes give chunks that are larger than needed to reduce
// fragmentation. requested_size keeps track of what the client
// actually wanted so we can understand whether our splitting
// strategy is efficient.
size_t requested_size = 0;
// allocation_id is set to -1 when the chunk is not in use. It is assigned a
// value greater than zero before the chunk is returned from
// AllocateRaw, and this value is unique among values assigned by
// the parent allocator.
int64_t allocation_id = -1;
void* ptr = nullptr; // pointer to granted subbuffer.
// If not kInvalidChunkHandle, the memory referred to by 'prev' is directly
// preceding the memory used by this chunk. E.g., It should start
// at 'ptr - prev->size'
ChunkHandle prev = kInvalidChunkHandle;
// If not kInvalidChunkHandle, the memory referred to by 'next' is directly
// following the memory used by this chunk. E.g., It should be at
// 'ptr + size'
ChunkHandle next = kInvalidChunkHandle;
// What bin are we in?
BinNum bin_num = kInvalidBinNum;
// Optional count when this chunk was most recently made free.
uint64 freed_at_count = 0;
bool in_use() const { return allocation_id != -1; }
#ifdef TENSORFLOW_MEM_DEBUG
// optional debugging info
const char* op_name = nullptr;
uint64 step_id = 0;
int64 action_count = 0;
#endif
string DebugString(BFCAllocator* a,
bool recurse) ABSL_NO_THREAD_SAFETY_ANALYSIS {
string dbg;
absl::StrAppend(
&dbg, " Size: ", strings::HumanReadableNumBytes(size),
" | Requested Size: ", strings::HumanReadableNumBytes(requested_size),
" | in_use: ", in_use(), " | bin_num: ", bin_num);
if (recurse && prev != BFCAllocator::kInvalidChunkHandle) {
Chunk* p = a->ChunkFromHandle(prev);
absl::StrAppend(&dbg, ", prev: ", p->DebugString(a, false));
}
if (recurse && next != BFCAllocator::kInvalidChunkHandle) {
Chunk* n = a->ChunkFromHandle(next);
absl::StrAppend(&dbg, ", next: ", n->DebugString(a, false));
}
#ifdef TENSORFLOW_MEM_DEBUG
absl::StrAppend(&dbg, ", for: ", op_name ? op_name : "UNKNOWN",
", stepid: ", step_id, ", last_action: ", action_count);
#endif
return dbg;
}
};
Bin
以下内容部分来自:TensorFlow内存管理bfc算法
BFC 算法采取的是被动分块的策略。最开始整个内存是一个 Chunk,随着用户申请空间的次数增加,最开始的大 Chunk 会被不断的 Split 开来,从而产生越来越多的小 Chunk。当 Chunk 数量很大时,为了寻找一个合适的内存块而遍历双链表无疑是一笔巨大的开销。
为了实现对空闲块的高效管理,BFC 算法设计了 Bin 这个抽象数据结构。
- 每个 Bin 都有一个
size
属性,一个 Bin 是一个拥有 Chunk size >= Bin size的空闲 Chunk 的集合。集合中的 Chunk 按照 Chunk size 的升序组织成单链表。 - BFC 算法维护了一个 Bin 的集合:Bins。它由多个 Bin 以及从属于每个 Bin 的 Chunk 组成。内存中所有的空闲 Chunk 都由 Bins 管理。
data:image/s3,"s3://crabby-images/75ff2/75ff280196d45605d1271bec64803f5713924064" alt=""
图中每一列表示一个 Bin,列首方格中的数字表示 Bin 的size。Bin size 的大小都是 256 的 2^n 的倍。每个 Bin 下面挂载了一系列的空闲 Chunk,每个 Chunk 的 Chunk size 都大于等于所属的 Bin 的 Bin size,按照 Chunk size 的升序挂载成单链表。
以下内容部分来自:Tensorflow 源码分析- 从GPU OOM开始说Tensorflow的BFC内存管理
当申请新的内存的时候,如何更快高效的查找匹配的空闲 Chunk,这是非常重要的。TensorFlow 基于 Chunk 构建了一个全局的 Bin,每个 Bin 里管理的是内存大小在一定范围的 Chunk(内存大小范围 (2^bin_num)*256 到 (2^(bin_num+1))*256-1的,bin_num 代表的是 Bin 的序列号)。每个 Bin 里会保存着一个空闲的 Chunk 的 Set。
data:image/s3,"s3://crabby-images/510bb/510bbddd145ffed57dc4b214416a229fe4de55c6" alt=""
以下内容部分来自:TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack
如果我们想查询某一块符合条件的空闲 Chunk 并取出,那么只能对双向链表做遍历,显然这个效率不是很高。为了加速查询某块 Chunk 的速度,可以在创建 Chunk 链表时按一定顺序排列,并将整个有序链表在逻辑上切分成多个段,为每个段记录所包含的 Chunk 的范围,这种结构就是 Bin,它相当于一种索引。因此,Bin 结构是为了方便 Chunk 的查询而出现的。
在 BFC Allocator 中,每个段中 Chunk 的顺序是按照size
和基地址升序排序的,每个 Bin 都设有自己的bin_size
,该bin_size
表示该段包含的最小 Chunk 的size
。这样一来,用户端就可以根据所需要申请的 Memory 大小直接找到对应的 Bin,然后在该 Bin 中遍历寻找适合的 Chunk。为了能够根据bin_size
直接定位到 Bin,规定bin_size
与bin_num
的大小关系为:bin_size=256 * 2^bin_num。用户在申请 Memory 时,会将实际大小映射到最适合的bin_size
上,然后再根据bin_size
与bin_num
的关系找到对应的 Bin,进而在该段中遍历搜索。
Bin 中 Chunk 的是通过 Set 组织的,为了能在 Set 中体现双向链表的逻辑,只需要让 Chunk 在 Set 中按照规则升序排列,并修正前驱后继指针即可。
data:image/s3,"s3://crabby-images/52f9a/52f9a0c391690b963e569f9ac88b3be6fc1997e1" alt=""
以下内容部分来自:极简入门TensorFlow 内存管理
BFCAllocator 下的两个比较重要的数据结构,Chunk 和 Bin,两者之间的关系如下图,看起来像一个个糖葫芦,第一个 bin size 为 256<<1, 第二个为 256<<2, 一次类推,TF 内有 21 个 bin,最后 bin size 为 256 << 21 为 512MB,每一个 bin 下面会接下若干个大于 bin size 的 Chunk,整个内存空间由以下的结构来组织,当分配内存大小指定时,系统会遍历 bin,找到能够第一次满足 Chunk > bin_size,每一个 bin 下的 Chunk 是有序的(Bin 下的 ChunkComparator)
data:image/s3,"s3://crabby-images/0370c/0370cab6ea5ddb635848e7796c12e869678bb851" alt=""
Bin
的代码实现:
// A Bin is a collection of similar-sized free chunks.
// Allocated chunks are never in a Bin.
struct Bin {
// All chunks in this bin have >= bin_size memory.
size_t bin_size = 0;
class ChunkComparator {
public:
explicit ChunkComparator(BFCAllocator* allocator)
: allocator_(allocator) {}
// Sort first by size and then use pointer address as a tie breaker.
bool operator()(const ChunkHandle ha, const ChunkHandle hb) const
ABSL_NO_THREAD_SAFETY_ANALYSIS {
const Chunk* a = allocator_->ChunkFromHandle(ha);
const Chunk* b = allocator_->ChunkFromHandle(hb);
if (a->size != b->size) {
return a->size < b->size;
}
return a->ptr < b->ptr;
}
private:
BFCAllocator* allocator_; // The parent allocator
};
typedef std::set<ChunkHandle, ChunkComparator> FreeChunkSet;
// List of free chunks within the bin, sorted by chunk size.
// Chunk * not owned.
FreeChunkSet free_chunks;
Bin(BFCAllocator* allocator, size_t bs)
: bin_size(bs), free_chunks(ChunkComparator(allocator)) {}
};
辅助类
以下内容部分来自:Nvidia GPU显存池-BFC
AllocationRegion 与 RegionManager 两个辅助类的主要功能是记录每次分配给用户的显存指针和对应 Chunk 的映射关系,方便后续显存回收。
在本文,我们把重点放在 TensorFlow BFC 算法的核心思想,不展开讨论辅助类 AllocationRegion 和 RegionManager,感兴趣可以学习:
- AllocationRegion 的代码实现
- RegionManager 的代码实现
BFC 算法核心
1 分配内存
以下内容部分来自:TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack
这是 BFCAllocator 的为用户分配 Chunk 的总体流程,该过程涉及到了几个比较重要的子过程。它们分别是
- 遍历搜索寻找最佳 Chunk 指针的
FindChunkPtr
过程, - 当 Chunk 链表中不存在合适的 Chunk 以至于不得不向物理设备申请新存储空间的
Extend
过程, - 以及分配 Chunk 时为缓解碎片问题而出现的
SplitChunk
过程。
data:image/s3,"s3://crabby-images/24d16/24d164becff1d7c5871e1fcc18284e0fc8a87e58" alt=""
总流程AllocateRawInternal
的代码实现:
void* BFCAllocator::AllocateRawInternal(size_t unused_alignment,
size_t num_bytes,
bool dump_log_on_failure,
uint64 freed_before) {
if (num_bytes == 0) {
VLOG(2) << "tried to allocate 0 bytes";
return nullptr;
}
// 1) 调整大小
// First, always allocate memory of at least kMinAllocationSize
// bytes, and always allocate multiples of kMinAllocationSize bytes
// so all memory addresses are nicely byte aligned.
size_t rounded_bytes = RoundedBytes(num_bytes);
// 2) 查找 Bin
// The BFC allocator tries to find the best fit first.
BinNum bin_num = BinNumForSize(rounded_bytes);
absl::MutexLock l(&mutex_);
if (!timestamped_chunks_.empty()) {
// Merge timestamped chunks whose counts have become safe for general use.
MergeTimestampedChunks(0);
}
// 3) 查找 Chunk
void* ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
if (ptr != nullptr) {
AddTraceMe("MemoryAllocation", ptr);
return ptr;
}
// 4) 如果查找失败,那么扩展内存,然后再查找合适的空闲Chunk
// Try to extend
if (Extend(unused_alignment, rounded_bytes)) {
ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before); // 4.8 再次查找 Chunk
if (ptr != nullptr) {
AddTraceMe("MemoryAllocation", ptr);
return ptr;
}
}
// 5) 第一次尝试合并,并再次查找 Chunk
if ((freed_before == 0) && (!timestamped_chunks_.empty())) {
// We're unable to satisfy an allocation request without a specific
// timestamp requirement. Rather than fail, try merging any held-out
// timestamped chunks more aggressively until a free chunk of the necessary
// size is formed.
if (MergeTimestampedChunks(rounded_bytes)) {
ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
if (ptr != nullptr) {
AddTraceMe("MemoryAllocation", ptr);
return ptr;
}
}
}
// 6) 第二次尝试合并,并再次查找 Chunk
// Reaching this point means that no chunks can satisfy the request. Also,
// the unallocated bytes cannot satisfy the request. Before giving up, let's
// try deallocating free regions so that suballocator can combine them with
// the unallocated bytes and form a larger region.
if (DeallocateFreeRegions(rounded_bytes) &&
Extend(unused_alignment, rounded_bytes)) {
ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
if (ptr != nullptr) {
AddTraceMe("MemoryAllocation", ptr);
return ptr;
}
}
// 7) 可用内存不足导致分配失败,报告 OOM
// We searched all bins for an existing free chunk to use and
// couldn't find one. This means we must have run out of memory,
// Dump the memory log for analysis.
MaybeWriteMemoryMap();
if (dump_log_on_failure) {
LOG(WARNING)
<< "Allocator (" << Name() << ") ran out of memory trying "
<< "to allocate " << strings::HumanReadableNumBytes(num_bytes)
<< " (rounded to " << rounded_bytes << ")" << "requested by op "
<< tsl::profiler::ScopedMemoryDebugAnnotation::CurrentAnnotation()
.pending_op_name
<< "\nIf the cause is memory fragmentation maybe the environment "
<< "variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will "
<< "improve the situation. \nCurrent allocation summary follows."
<< "\nCurrent allocation summary follows.";
DumpMemoryLog(rounded_bytes);
LOG(WARNING) << RenderOccupancy();
}
return nullptr;
}
1.1 调整大小
RoundedBytes
的代码实现:
size_t BFCAllocator::RoundedBytes(size_t bytes) {
size_t rounded_bytes =
(kMinAllocationSize *
((bytes + kMinAllocationSize - 1) / kMinAllocationSize));
DCHECK_EQ(size_t{0}, rounded_bytes % kMinAllocationSize);
return rounded_bytes;
}
将每次分配的内存大小调整为kMinAllocationSize
的N倍,这样所有内存地址都是很好地按字节对齐了。
1.2 查找Bin
BinNumForSize
的代码实现:
BinNum BinNumForSize(size_t bytes) {
uint64 v = std::max<size_t>(bytes, 256) >> kMinAllocationBits;
int b = std::min(kNumBins - 1, tsl::Log2Floor64(v));
return b;
}
Bin size 是 256 的 2^N 倍。std::max<size_t>(bytes, 256) >> kMinAllocationBits
先将分配内存的字节数右移 8 位,然后把结果用在std::min(kNumBins - 1, tsl::Log2Floor64(v))
,计算出的二进制有效位数即为 Bin 在 Bins 中的索引。
1.3 查找Chunk
FindChunkPtr
的代码实现:
void* BFCAllocator::FindChunkPtr(BinNum bin_num, size_t rounded_bytes,
size_t num_bytes, uint64 freed_before) {
// First identify the first bin that could satisfy rounded_bytes.
for (; bin_num < kNumBins; bin_num++) {
// Start searching from the first bin for the smallest chunk that fits
// rounded_bytes.
Bin* b = BinFromIndex(bin_num);
for (auto citer = b->free_chunks.begin(); citer != b->free_chunks.end();
++citer) {
// 1) 从之前得到的 Bin 索引开始,查找合适的空闲 Chunk
const BFCAllocator::ChunkHandle h = (*citer);
BFCAllocator::Chunk* chunk = ChunkFromHandle(h);
DCHECK(!chunk->in_use());
if (freed_before > 0 && freed_before < chunk->freed_at_count) {
continue;
}
if (chunk->size >= rounded_bytes) {
// 2) 将找到的 Chunk 从 Bin 中移除
// We found an existing chunk that fits us that wasn't in use, so remove
// it from the free bin structure prior to using.
RemoveFreeChunkIterFromBin(&b->free_chunks, citer);
// 3) 拆分 Chunk:当以下两个条件之一满足时,SplitChunk过程将被触发。
// 1. 当 Chunk 的 size 是用户请求的 round size 两倍及以上时(用户请求的 size 会根据最小分配单元做 round 近似)
// 2. 当 Chunk 的 size 减去用户请求的 round size 后依然大于等于最大碎片限定时(128MB)
// If we can break the size of the chunk into two reasonably large
// pieces, do don't waste more than max_internal_fragmentation_bytes on
// padding. If this threshold is not set by the user, then use 128MB as
// the default.
const int64_t max_internal_fragmentation_bytes =
(opts_.fragmentation_fraction > 0.0)
? opts_.fragmentation_fraction * memory_limit_
: 128 << 20;
if (chunk->size >= rounded_bytes * 2 ||
static_cast<int64_t>(chunk->size) - rounded_bytes >=
max_internal_fragmentation_bytes) {
SplitChunk(h, rounded_bytes);
chunk = ChunkFromHandle(h); // Update chunk pointer in case it moved
}
// 4) 修改 Chunk 的请求大小、分配 ID(标记 Chunk 被占用)
// The requested size of the returned chunk is what the user
// has allocated.
chunk->requested_size = num_bytes;
// Assign a unique id and increment the id counter, marking the
// chunk as being in use.
chunk->allocation_id = next_allocation_id_++;
// 5) 更新统计
// Update stats.
++stats_.num_allocs;
stats_.bytes_in_use += chunk->size;
if (stats_.bytes_in_use > stats_.peak_bytes_in_use) {
VLOG(2) << "New Peak memory usage of " << stats_.bytes_in_use
<< " bytes for " << Name();
}
stats_.peak_bytes_in_use =
std::max(stats_.peak_bytes_in_use, stats_.bytes_in_use);
stats_.largest_alloc_size =
std::max<std::size_t>(stats_.largest_alloc_size, chunk->size);
#ifdef TENSORFLOW_MEM_DEBUG
if (ShouldRecordOpName()) {
const auto& annotation =
profiler::ScopedMemoryDebugAnnotation::CurrentAnnotation();
if (annotation.pending_op_name != nullptr) {
chunk->op_name = annotation.pending_op_name;
} else {
LOG(INFO) << "missing pending_op_name for " << Name()
<< " reading addr "
<< static_cast<const void*>(&annotation.pending_op_name)
<< "\n"
<< CurrentStackTrace();
chunk->op_name = nullptr;
}
chunk->action_count = ++action_counter_;
chunk->step_id = annotation.pending_step_id;
int slot = chunk->action_count % MEM_DEBUG_SIZE_HISTORY_SIZE;
size_history_[slot] = stats_.bytes_in_use;
}
#endif
VLOG(4) << "Returning: " << chunk->ptr;
if (VLOG_IS_ON(4)) {
LOG(INFO) << "A: " << RenderOccupancy();
}
// 6) 成功时返回找到的 Chunk 指针
return chunk->ptr;
}
}
}
// 7) 失败时返回空指针
return nullptr;
}
SplitChunk
的代码实现:
void BFCAllocator::SplitChunk(BFCAllocator::ChunkHandle h, size_t num_bytes) {
// Allocate the new chunk before we do any ChunkFromHandle
ChunkHandle h_new_chunk = AllocateChunk();
Chunk* c = ChunkFromHandle(h);
CHECK(!c->in_use() && (c->bin_num == kInvalidBinNum));
// Create a new chunk starting num_bytes after c
BFCAllocator::Chunk* new_chunk = ChunkFromHandle(h_new_chunk);
new_chunk->ptr = static_cast<void*>(static_cast<char*>(c->ptr) + num_bytes);
region_manager_.set_handle(new_chunk->ptr, h_new_chunk);
// Set the new sizes of the chunks.
new_chunk->size = c->size - num_bytes;
c->size = num_bytes;
// The new chunk is not in use.
new_chunk->allocation_id = -1;
// It inherits the freed time.
new_chunk->freed_at_count = c->freed_at_count;
// 1) 调整 Chunk 的前驱后继指针
// Maintain the pointers.
// c <-> c_neighbor becomes
// c <-> new_chunk <-> c_neighbor
BFCAllocator::ChunkHandle h_neighbor = c->next;
new_chunk->prev = h;
new_chunk->next = h_neighbor;
c->next = h_new_chunk;
if (h_neighbor != kInvalidChunkHandle) {
Chunk* c_neighbor = ChunkFromHandle(h_neighbor);
c_neighbor->prev = h_new_chunk;
}
// 2) 根据 Free Chunk 的大小,将它插入到对应的 Bin 中
// Add the newly free chunk to the free bin.
InsertFreeChunkIntoBin(h_new_chunk);
}
1.4 扩展内存
Extend
的代码实现:
bool BFCAllocator::Extend(size_t alignment, size_t rounded_bytes) {
size_t available_bytes = memory_limit_ - *stats_.pool_bytes;
// Rounds available_bytes down to the nearest multiple of kMinAllocationSize.
available_bytes = (available_bytes / kMinAllocationSize) * kMinAllocationSize;
// 1) 已占用的加上申请的内存大小,超过最大内存限制时,返回失败。
// Do we have enough space to handle the client's request?
// If not, fail immediately.
if (rounded_bytes > available_bytes) {
return false;
}
// 2) 循环将当前区域可分配的内存扩充 1 倍,直到大于等于申请的内存大小。
// If curr_region_allocation_bytes_ is not enough to satisfy the
// allocation, keep multiplying by a power of two until that is
// sufficient.
bool increased_allocation = false;
while (rounded_bytes > curr_region_allocation_bytes_) {
curr_region_allocation_bytes_ *= 2;
increased_allocation = true;
}
// 3) 从内存池里分配内存
// Try allocating.
size_t bytes = std::min(curr_region_allocation_bytes_, available_bytes);
size_t bytes_received;
void* mem_addr = sub_allocator_->Alloc(alignment, bytes, &bytes_received);
// 4) 如果失败,尝试分配申请内存大小的 90%。一直重复,直到分配成功或可用内存不足。
if (mem_addr == nullptr) {
static constexpr float kBackpedalFactor = 0.9;
// Try allocating less memory.
while (mem_addr == nullptr) {
bytes = RoundedBytes(bytes * kBackpedalFactor);
if (bytes < rounded_bytes) {
return false;
}
mem_addr = sub_allocator_->Alloc(alignment, bytes, &bytes_received);
}
}
if (!increased_allocation) {
// Increase the region size of the next required allocation.
curr_region_allocation_bytes_ *= 2;
}
VLOG(1) << "Extending allocation by "
<< strings::HumanReadableNumBytes(bytes_received) << " bytes for "
<< Name() << ".";
*stats_.pool_bytes += bytes_received;
*stats_.peak_pool_bytes =
std::max(*stats_.pool_bytes, *stats_.peak_pool_bytes);
VLOG(1) << "Total allocated bytes: "
<< strings::HumanReadableNumBytes(*stats_.pool_bytes);
VLOG(1) << "Allocated memory at " << mem_addr << " to "
<< static_cast<void*>(static_cast<char*>(mem_addr) + bytes_received);
// 5) 给分配的内存添加 AllocationRegion
AllocationRegion* maybe_extended_region = nullptr;
if (coalesce_regions_) {
maybe_extended_region =
region_manager_.AddOrExtendAllocationRegion(mem_addr, bytes_received);
} else {
region_manager_.AddAllocationRegion(mem_addr, bytes_received);
}
// 6) 创建一个空闲 Chunk
// Create one large chunk for the whole memory space that will
// be chunked later.
ChunkHandle h = AllocateChunk();
BFCAllocator::Chunk* c = ChunkFromHandle(h);
c->ptr = mem_addr;
c->size = bytes_received;
c->allocation_id = -1;
c->prev = kInvalidChunkHandle;
c->next = kInvalidChunkHandle;
c->freed_at_count = 0;
region_manager_.set_handle(c->ptr, h);
// If the region was extended, then there exists a previous chunk that should
// be linked to the new chunk.
if (maybe_extended_region != nullptr) {
ChunkHandle prev =
maybe_extended_region->get_handle(maybe_extended_region->ptr());
BFCAllocator::Chunk* prev_chunk = ChunkFromHandle(prev);
// Find the last recorded chunk in the extended region.
while (prev_chunk->next != kInvalidChunkHandle) {
prev = prev_chunk->next;
prev_chunk = ChunkFromHandle(prev);
}
c->prev = prev;
prev_chunk->next = h;
}
// 7) 将空闲 Chunk 插入 Bin 中
// Maybe merge adjacent chunks and insert the chunk into the right bin.
InsertFreeChunkIntoBin(TryToCoalesce(h, /*ignore_freed_at=*/false));
return true;
}
2 回收内存
以下内容部分来自:TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack
因为在回收时只知道存储空间首地址指针,并不知道其对应的 Chunk,所以需要先借助region_manager
等辅助工具获取其所对应的 Chunk 指针,然后考虑其前驱后继节点是否可以合并。下面展示了整体流程。
data:image/s3,"s3://crabby-images/30282/30282f1d3a22ac0eebe8721881c04b16f8195897" alt=""
总流程DeallocateRawInternal
的代码实现:
void BFCAllocator::DeallocateRawInternal(void* ptr) {
if (ptr == nullptr) {
VLOG(2) << "tried to deallocate nullptr";
return;
}
absl::MutexLock l(&mutex_);
// 1) 从 RegionManager 的指针到 ChunkHandle 的映射关系中得到 ChunkHandle
// Find the chunk from the ptr.
BFCAllocator::ChunkHandle h = region_manager_.get_handle(ptr);
CHECK(h != kInvalidChunkHandle);
// Record chunk information before it's freed.
Chunk* chunk = ChunkFromHandle(h);
void* chunk_ptr = chunk->ptr;
int64_t req_bytes = chunk->requested_size;
int64_t alloc_bytes = chunk->size;
// 2) 将 Chunk 标记为空闲,然后把总占用的内存量减去 Chunk 的大小
MarkFree(h);
// Consider coalescing it.
if (timing_counter_) {
InsertFreeChunkIntoBin(h);
timestamped_chunks_.push_back(h);
} else {
// 3) 合并 Chunk
// 4) 将合并后的空闲 Chunk 插入 Bin
InsertFreeChunkIntoBin(TryToCoalesce(h, false));
}
// TraceMe needs to be added after MarkFree and InsertFreeChunkIntoBin for
// correct aggregation stats (bytes_in_use, fragmentation).
AddTraceMe("MemoryDeallocation", chunk_ptr, req_bytes, alloc_bytes);
if (VLOG_IS_ON(4)) {
LOG(INFO) << "F: " << RenderOccupancy();
}
}
2.1 获取ChunkHandle
2.2 更新状态
MarkFree
的代码实现:
void BFCAllocator::MarkFree(BFCAllocator::ChunkHandle h) {
Chunk* c = ChunkFromHandle(h);
CHECK(c->in_use() && (c->bin_num == kInvalidBinNum));
// 1) 将 Chunk 标记为空闲
// Mark the chunk as no longer in use.
c->allocation_id = -1;
// Optionally record the free time.
if (timing_counter_) {
c->freed_at_count = timing_counter_->next();
}
// 2) 更新状态,把总占用的内存量减去 Chunk 的大小
// Updates the stats.
stats_.bytes_in_use -= c->size;
#ifdef TENSORFLOW_MEM_DEBUG
if (ShouldRecordOpName()) {
c->action_count = ++action_counter_;
int slot = c->action_count % MEM_DEBUG_SIZE_HISTORY_SIZE;
size_history_[slot] = stats_.bytes_in_use;
}
#endif
}
2.3 合并Chunk
TryToCoalesce
的代码实现:
BFCAllocator::ChunkHandle BFCAllocator::TryToCoalesce(ChunkHandle h,
bool ignore_freed_at) {
Chunk* c = ChunkFromHandle(h);
if ((!ignore_freed_at) && c->freed_at_count > 0) return h;
ChunkHandle coalesced_chunk = h;
// 1) 合并直接后继
// If the next chunk is free, merge it into c and delete it.
if (c->next != kInvalidChunkHandle && !ChunkFromHandle(c->next)->in_use()) {
Chunk* n = ChunkFromHandle(c->next);
if ((n->freed_at_count == 0) || ignore_freed_at) {
VLOG(4) << "Merging c->next " << n->ptr << " with c " << c->ptr;
// 1.1) 将直接后继从 Bin 中移除
RemoveFreeChunkFromBin(c->next);
// 1.2) 合并
Merge(h, c->next);
}
}
// 2) 合并直接前趋
// If the previous chunk is free, merge c into it and delete c.
if (c->prev != kInvalidChunkHandle && !ChunkFromHandle(c->prev)->in_use()) {
Chunk* n = ChunkFromHandle(c->prev);
if ((n->freed_at_count == 0) || ignore_freed_at) {
VLOG(4) << "Merging c " << c->ptr << " into c->prev " << n->ptr;
coalesced_chunk = c->prev;
// 2.1) 将直接前趋从 Bin 中移除
RemoveFreeChunkFromBin(c->prev);
// 2.2) 合并
Merge(c->prev, h);
}
}
return coalesced_chunk;
}
Merge
的代码实现:
// Merges h1 and h2 when Chunk(h1)->next is h2 and Chunk(h2)->prev is c1.
// We merge Chunk(h2) into Chunk(h1).
void BFCAllocator::Merge(BFCAllocator::ChunkHandle h1,
BFCAllocator::ChunkHandle h2) {
Chunk* c1 = ChunkFromHandle(h1);
Chunk* c2 = ChunkFromHandle(h2);
// We can only merge chunks that are not in use.
CHECK(!c1->in_use() && !c2->in_use());
// c1's prev doesn't change, still points to the same ptr, and is
// still not in use.
// 1) 修改 Chunk 的指向关系
// Fix up neighbor pointers
//
// c1 <-> c2 <-> c3 should become
// c1 <-> c3
BFCAllocator::ChunkHandle h3 = c2->next;
c1->next = h3;
CHECK(c2->prev == h1);
if (h3 != kInvalidChunkHandle) {
BFCAllocator::Chunk* c3 = ChunkFromHandle(h3);
c3->prev = h1;
}
// 2) 更新 Chunk 大小
// Set the new size
c1->size += c2->size;
// Pick latest free time.
c1->freed_at_count = std::max(c1->freed_at_count, c2->freed_at_count);
// 3) 删除被合并的 Chunk
DeleteChunk(h2);
}
总结
以下内容来自:TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack
本文总结了 TensorFlow 中存储管理器——BFC Allocator。它的设计思路来自于经典来的 dlmalloc 分配算法,是 Best fit coalecing 的简单实现版本。BFC Allocator 是为了应对 TensorFlow 中频繁分配释放存储空间需求的场景而出现的解决方案,通过事先将存储空间从物理设备上开辟好,并将这些空闲存储空间封装成 Chunk,组织成有序双向链表,然后利用 Bin 这一种索引结构为 Chunk 的查询做加速,最终完成了高效的分配算法。在实际分配时,可能会遇到 Chunk 链表中不存在符合要求的空闲 Chunk 情况,这时候就可能需要向物理设备中再次开辟新的存储空间,这个过程被视为对 Chunk 链表的扩展,对应的过程是Extend
。因为是按 Chunk 进行分配,势必可能造成存储碎片,为了解决碎片问题,BFC Allocator 设计了SplitChunk
和Merge
函数。
参考资料
关于 XLA
- openxla/xla: A machine learning compiler for GPUs, CPUs, and ML accelerators
- OpenXLA (NVIDIA) - AI技术栈解析及应用- 作者:张真瑜 | 山东大学智能创新研究院
- XLA:优化机器学习编译器 | TensorFlow
- XLA优化原理简介-云社区-华为云
- 如何看待OpenXLA这个开源项目? - TanyoKwok 郭天佑的回答 - 知乎
关于 TensorFlow BFC
- TensorFlow 源码分析之内存管理BFC算法——DeepReve
- TensorFlow内存管理bfc算法
- Tensorflow 源码分析- 从GPU OOM开始说Tensorflow的BFC内存管理(文中还对 Region 进行了介绍)
- TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack(文中有个分析出的结论:存储区分配的时机是在一个 Tensor 对象被创建时立即发生的;文中还对辅助类 AllocationRegion 与 RegionManager 进行了介绍)
- 极简入门TensorFlow 内存管理
- Tensorflow源码剖析:Allocator(基础篇)(文中介绍了 Allocator 类)
- TensorFlow源码剖析:Allocator(提高篇)(文中介绍了 AllocatorStats 类,其余大部分内容和TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack相似)
- TensorFlow源码剖析:Allocator(进阶篇)(文中介绍了 AllocatorRetry 类)
- TensorFlow源码剖析:Allocator(应用篇)(文中介绍了 TensorFlow 下 GPU 相关配置项)
- Nvidia GPU显存池-BFC(文中介绍了 Linux 内核内存池、tcmalloc 和 dlmalloc 等内存分配器、由
allow_growth
参数决定的两种分配模式)