PyTorch 源码学习：GPU 内存管理之它山之石—

TensorFlow 和 PyTorch 都是常用的深度学习框架，各自有一套独特但又相似的 GPU 内存管理机制（BFC 算法）。它山之石可以攻玉。了解 TensorFlow 的 BFC 算法有助于学习 PyTorch 管理 GPU 内存的精妙之处。本文重点关注 TensorFlow BFC 算法的核心思想，研究的 TensorFlow 版本为1f1379c5f721579c58b4ee560daac7df90f8519a。更多内容请参考：

Ubuntu 22.04 LTS 源码编译安装 PyTorch
【翻译】pytorch/CONTRIBUTING.md
深度学习框架与静态/动态计算图【笔记】
PyTorch 源码学习：阅读经验 & 代码结构
PyTorch 源码学习：从 Tensor 到 Storage
PyTorch 源码学习：Dispatch & Autograd & Operators

文章目录

关于 TensorFlow BFC 算法
关于 XLA
- 初见端倪
- 初识 XLA
主要数据结构
- Chunk
- Bin
- 辅助类
BFC 算法核心
- 1 分配内存
- - 1.1 调整大小
  - 1.2 查找Bin
  - 1.3 查找Chunk
  - 1.4 扩展内存
- 2 回收内存
- - 2.1 获取ChunkHandle
  - 2.2 更新状态
  - 2.3 合并Chunk
总结
参考资料
- 关于 XLA
- 关于 TensorFlow BFC

关于 TensorFlow BFC 算法

以下内容来自：TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack

背景：使用 GPU 训练时，一次训练任务无论是模型参数还是中间结果都需要占用大量显存。为了避免每次训练重新开辟显存带来计算之外的开销，一般框架的做法是在真正的训练任务开始前，将每个节点的输入和输出，以及模型参数的 shape 计算出来并全局开辟一次，例如 Caffe 就是这种做法。随着深度学习模型的发展和迭代，不仅模型训练的数据 shape 可能发生变化，就连模型本身在训练过程中也可能发生变化，那么按照固定 shape 一次开辟显存的做法就不能满足需求了。为此，TensorFlow 重新设计了较为灵活的显存管理机制，它使用了名为 BFC 的分配算法，并通过 BFC Allocator 为每个 Tensor 分配满足需求的显存。

问题：显存分配与回收的性能需求。Tensor 在每次创建时会得到存储区域，而每一轮训练都要重新创建新的 Tensor，那么这里面临的一个问题：如此频繁的分配和回收存储区，如何才能做的高效？试想对于 GPU 来说，如果Allocate函数直接封装 CUDA 中昂贵的cudaMalloc函数，当 Tensor 被释放时直接调用cudaFree函数，那么训练速度将会因为这些 overhead 大打折扣。

思路：存储池。如果你对操作系统这门课比较熟悉，那么应该很容易想到解决办法：将显存按照不同的大小一次性开辟出来，并组成存储池，每次调用Allocate函数时从存储池中获取，Tensor 回收时将显存重新挂到存储池中。这样做确实可以满足性能需求，但是需要为此设计一个相对复杂的存储管理器。BFC Allocator 就是 TensorFlow 中管理 GPU 显存的存储管理器。

关于 XLA

初见端倪

为什么要先说 XLA 呢？

笔者在 TensorFlow 仓库的 main 分支寻找与 BFC (Best-Fit with Coalescing) 算法有关的源码时，首先是定位到了 tensorflow/core/common_runtime/gpu，该目录下有 gpu_bfc_allocator.h 和 gpu_bfc_allocator.cc 两个文件，但并没有找到具体的与 BFC 算法相关的代码实现。

gpu_bfc_allocator.h 的核心代码：

#include <memory>
#include <optional>
#include <string>

#include "xla/tsl/framework/allocator.h"
#include "xla/tsl/framework/bfc_allocator.h"
#include "xla/tsl/platform/macros.h"

namespace tensorflow {

// A GPU memory allocator that implements a 'best-fit with coalescing'
// algorithm.
class GPUBFCAllocator : public tsl::BFCAllocator {
 public:
  // See BFCAllocator::Options.
  struct Options {
    // Overridden by TF_FORCE_GPU_ALLOW_GROWTH if that envvar is set.
    bool allow_growth = false;

    // If nullopt, defaults to TF_ENABLE_GPU_GARBAGE_COLLECTION, or true if that
    // envvar is not present.
    //
    // Note:
    //
    //  - BFCAllocator defaults garbage_collection to false, not true.
    //  - this is not the same override behavior as TF_FORCE_GPU_ALLOW_GROWTH.
    std::optional<bool> garbage_collection;

    double fragmentation_fraction = 0;
    bool allow_retry_on_failure = true;
  };

  GPUBFCAllocator(std::unique_ptr<tsl::SubAllocator> sub_allocator,
                  size_t total_memory, const std::string& name,
                  const Options& opts);

  ~GPUBFCAllocator() override {}

  GPUBFCAllocator(const GPUBFCAllocator&) = delete;
  void operator=(const GPUBFCAllocator&) = delete;
};

}  // namespace tensorflow

但从 gpu_bfc_allocator.h 的代码中，笔者发现GPUBFCAllocator类继承自tsl::BFCAllocator，而tsl命名空间就来自第三方库 third_party/xla。

而 BFC 算法的具体实现就在 xla/tsl/framework 目录下的 bfc_allocator.h 和 bfc_allocator.cc 两个文件中。

初识 XLA

TensorFlow 通过 XLA 编译器优化 GPU 代码执行。

下图来自：openxla/xla: A machine learning compiler for GPUs, CPUs, and ML accelerators

下图来自：技术栈架构 - AI技术栈解析及应用- 作者：张真瑜 | 山东大学智能创新研究院

在这里插入图片描述

OpenXLA 作为一个灵活的深度学习编译器框架，与 PyTorch 和 TensorFlow 深度集成，通过自定义算子、JIT 编译和 GPU 内核融合等技术，大幅提升了这些深度学习框架在 GPU 上的执行效率。同时，OpenXLA 还利用 CUDA Runtime API 和 CUDA Driver API，实现了对 GPU 硬件的精细控制和优化，包括内存管理、设备操作和内核启动等。

主要数据结构

Chunk

以下内容部分来自：TensorFlow 源码分析之内存管理BFC算法——DeepReve

整个内存空间由一个按基址升序排列的Chunk双向链表来表示，它们的直接前趋和后继必须在地址连续的内存空间。Chunk结构体里含有实际大小、请求大小、是否被占用、基址、直接前趋、直接后继、Bin索引等信息。

以下内容部分来自：Tensorflow 源码分析- 从GPU OOM开始说Tensorflow的BFC内存管理

Chunk 是 TensorFlow 的最小内存单位，由数倍 256 bytes (kMinAllocationSize) 的连续内存块组成。TensorFlow 的内存管理是基于 Chunk 的管理。

以下内容部分来自：TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack

Chunk 是 BFC 最核心的数据结构之一，在 TensorFlow 源码中是以struct来描述的。具体来说，一个 Chunk 代表一段连续的存储空间，BFC 要求各个 Chunk 要按照基地址升序排列并组织成双向链表，下图展示了 Chunk 的结构以及 Chunk 之间的连接关系。初始时，每个 Chunk 都有自己的size，并且这些size都是以 256 字节为模。应当注意，每个 Chunk 或者完全被标记为使用，或者完全标记为空闲，不存在该 Chunk 内只有部分空间被使用的情况。

Chunk的代码实现：

  // A Chunk points to a piece of memory that's either entirely free or entirely
  // in use by one user memory allocation.
  //
  // An AllocationRegion's memory is split up into one or more disjoint Chunks,
  // which together cover the whole region without gaps.  Chunks participate in
  // a doubly-linked list, and the prev/next pointers point to the physically
  // adjacent chunks.
  //
  // Since a chunk cannot be partially in use, we may need to split a free chunk
  // in order to service a user allocation.  We always merge adjacent free
  // chunks.
  //
  // Chunks contain information about whether they are in use or whether they
  // are free, and contain a pointer to the bin they are in.
  struct Chunk {
    size_t size = 0;  // Full size of buffer.

    // We sometimes give chunks that are larger than needed to reduce
    // fragmentation.  requested_size keeps track of what the client
    // actually wanted so we can understand whether our splitting
    // strategy is efficient.
    size_t requested_size = 0;

    // allocation_id is set to -1 when the chunk is not in use. It is assigned a
    // value greater than zero before the chunk is returned from
    // AllocateRaw, and this value is unique among values assigned by
    // the parent allocator.
    int64_t allocation_id = -1;
    void* ptr = nullptr;  // pointer to granted subbuffer.

    // If not kInvalidChunkHandle, the memory referred to by 'prev' is directly
    // preceding the memory used by this chunk.  E.g., It should start
    // at 'ptr - prev->size'
    ChunkHandle prev = kInvalidChunkHandle;

    // If not kInvalidChunkHandle, the memory referred to by 'next' is directly
    // following the memory used by this chunk.  E.g., It should be at
    // 'ptr + size'
    ChunkHandle next = kInvalidChunkHandle;

    // What bin are we in?
    BinNum bin_num = kInvalidBinNum;

    // Optional count when this chunk was most recently made free.
    uint64 freed_at_count = 0;

    bool in_use() const { return allocation_id != -1; }

#ifdef TENSORFLOW_MEM_DEBUG
    // optional debugging info
    const char* op_name = nullptr;
    uint64 step_id = 0;
    int64 action_count = 0;
#endif

    string DebugString(BFCAllocator* a,
                       bool recurse) ABSL_NO_THREAD_SAFETY_ANALYSIS {
      string dbg;
      absl::StrAppend(
          &dbg, "  Size: ", strings::HumanReadableNumBytes(size),
          " | Requested Size: ", strings::HumanReadableNumBytes(requested_size),
          " | in_use: ", in_use(), " | bin_num: ", bin_num);
      if (recurse && prev != BFCAllocator::kInvalidChunkHandle) {
        Chunk* p = a->ChunkFromHandle(prev);
        absl::StrAppend(&dbg, ", prev: ", p->DebugString(a, false));
      }
      if (recurse && next != BFCAllocator::kInvalidChunkHandle) {
        Chunk* n = a->ChunkFromHandle(next);
        absl::StrAppend(&dbg, ", next: ", n->DebugString(a, false));
      }
#ifdef TENSORFLOW_MEM_DEBUG
      absl::StrAppend(&dbg, ", for: ", op_name ? op_name : "UNKNOWN",
                      ", stepid: ", step_id, ", last_action: ", action_count);
#endif
      return dbg;
    }
  };

Bin

以下内容部分来自：TensorFlow内存管理bfc算法

BFC 算法采取的是被动分块的策略。最开始整个内存是一个 Chunk，随着用户申请空间的次数增加，最开始的大 Chunk 会被不断的 Split 开来，从而产生越来越多的小 Chunk。当 Chunk 数量很大时，为了寻找一个合适的内存块而遍历双链表无疑是一笔巨大的开销。

为了实现对空闲块的高效管理，BFC 算法设计了 Bin 这个抽象数据结构。

每个 Bin 都有一个size属性，一个 Bin 是一个拥有 Chunk size >= Bin size的空闲 Chunk 的集合。集合中的 Chunk 按照 Chunk size 的升序组织成单链表。
BFC 算法维护了一个 Bin 的集合：Bins。它由多个 Bin 以及从属于每个 Bin 的 Chunk 组成。内存中所有的空闲 Chunk 都由 Bins 管理。

图中每一列表示一个 Bin，列首方格中的数字表示 Bin 的size。Bin size 的大小都是 256 的 2^n 的倍。每个 Bin 下面挂载了一系列的空闲 Chunk，每个 Chunk 的 Chunk size 都大于等于所属的 Bin 的 Bin size，按照 Chunk size 的升序挂载成单链表。

以下内容部分来自：Tensorflow 源码分析- 从GPU OOM开始说Tensorflow的BFC内存管理

当申请新的内存的时候，如何更快高效的查找匹配的空闲 Chunk，这是非常重要的。TensorFlow 基于 Chunk 构建了一个全局的 Bin，每个 Bin 里管理的是内存大小在一定范围的 Chunk（内存大小范围 (2^bin_num)*256 到 (2^(bin_num+1))*256-1的，bin_num 代表的是 Bin 的序列号）。每个 Bin 里会保存着一个空闲的 Chunk 的 Set。

以下内容部分来自：TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack

如果我们想查询某一块符合条件的空闲 Chunk 并取出，那么只能对双向链表做遍历，显然这个效率不是很高。为了加速查询某块 Chunk 的速度，可以在创建 Chunk 链表时按一定顺序排列，并将整个有序链表在逻辑上切分成多个段，为每个段记录所包含的 Chunk 的范围，这种结构就是 Bin，它相当于一种索引。因此，Bin 结构是为了方便 Chunk 的查询而出现的。

在 BFC Allocator 中，每个段中 Chunk 的顺序是按照size和基地址升序排序的，每个 Bin 都设有自己的bin_size，该bin_size表示该段包含的最小 Chunk 的size。这样一来，用户端就可以根据所需要申请的 Memory 大小直接找到对应的 Bin，然后在该 Bin 中遍历寻找适合的 Chunk。为了能够根据bin_size直接定位到 Bin，规定bin_size与bin_num的大小关系为：bin_size=256 * 2^bin_num。用户在申请 Memory 时，会将实际大小映射到最适合的bin_size上，然后再根据bin_size与bin_num的关系找到对应的 Bin，进而在该段中遍历搜索。

Bin 中 Chunk 的是通过 Set 组织的，为了能在 Set 中体现双向链表的逻辑，只需要让 Chunk 在 Set 中按照规则升序排列，并修正前驱后继指针即可。

以下内容部分来自：极简入门TensorFlow 内存管理

BFCAllocator 下的两个比较重要的数据结构，Chunk 和 Bin，两者之间的关系如下图，看起来像一个个糖葫芦，第一个 bin size 为 256<<1，第二个为 256<<2, 一次类推，TF 内有 21 个 bin，最后 bin size 为 256 << 21 为 512MB，每一个 bin 下面会接下若干个大于 bin size 的 Chunk，整个内存空间由以下的结构来组织，当分配内存大小指定时，系统会遍历 bin，找到能够第一次满足 Chunk > bin_size，每一个 bin 下的 Chunk 是有序的（Bin 下的 ChunkComparator）

Bin的代码实现：

  // A Bin is a collection of similar-sized free chunks.
  // Allocated chunks are never in a Bin.
  struct Bin {
    // All chunks in this bin have >= bin_size memory.
    size_t bin_size = 0;

    class ChunkComparator {
     public:
      explicit ChunkComparator(BFCAllocator* allocator)
          : allocator_(allocator) {}
      // Sort first by size and then use pointer address as a tie breaker.
      bool operator()(const ChunkHandle ha, const ChunkHandle hb) const
          ABSL_NO_THREAD_SAFETY_ANALYSIS {
        const Chunk* a = allocator_->ChunkFromHandle(ha);
        const Chunk* b = allocator_->ChunkFromHandle(hb);
        if (a->size != b->size) {
          return a->size < b->size;
        }
        return a->ptr < b->ptr;
      }

     private:
      BFCAllocator* allocator_;  // The parent allocator
    };

    typedef std::set<ChunkHandle, ChunkComparator> FreeChunkSet;
    // List of free chunks within the bin, sorted by chunk size.
    // Chunk * not owned.
    FreeChunkSet free_chunks;
    Bin(BFCAllocator* allocator, size_t bs)
        : bin_size(bs), free_chunks(ChunkComparator(allocator)) {}
  };

辅助类

以下内容部分来自：Nvidia GPU显存池-BFC

AllocationRegion 与 RegionManager 两个辅助类的主要功能是记录每次分配给用户的显存指针和对应 Chunk 的映射关系，方便后续显存回收。

在本文，我们把重点放在 TensorFlow BFC 算法的核心思想，不展开讨论辅助类 AllocationRegion 和 RegionManager，感兴趣可以学习：

AllocationRegion 的代码实现
RegionManager 的代码实现

BFC 算法核心

1 分配内存

以下内容部分来自：TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack

这是 BFCAllocator 的为用户分配 Chunk 的总体流程，该过程涉及到了几个比较重要的子过程。它们分别是

遍历搜索寻找最佳 Chunk 指针的FindChunkPtr过程，
当 Chunk 链表中不存在合适的 Chunk 以至于不得不向物理设备申请新存储空间的Extend过程，
以及分配 Chunk 时为缓解碎片问题而出现的SplitChunk过程。

总流程AllocateRawInternal的代码实现：

void* BFCAllocator::AllocateRawInternal(size_t unused_alignment,
                                        size_t num_bytes,
                                        bool dump_log_on_failure,
                                        uint64 freed_before) {
  if (num_bytes == 0) {
    VLOG(2) << "tried to allocate 0 bytes";
    return nullptr;
  }
  
  // 1) 调整大小
  // First, always allocate memory of at least kMinAllocationSize
  // bytes, and always allocate multiples of kMinAllocationSize bytes
  // so all memory addresses are nicely byte aligned.
  size_t rounded_bytes = RoundedBytes(num_bytes);
 
  // 2) 查找 Bin
  // The BFC allocator tries to find the best fit first.
  BinNum bin_num = BinNumForSize(rounded_bytes);

  absl::MutexLock l(&mutex_);
  if (!timestamped_chunks_.empty()) {
    // Merge timestamped chunks whose counts have become safe for general use.
    MergeTimestampedChunks(0);
  }
  
  // 3) 查找 Chunk
  void* ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
  if (ptr != nullptr) {
    AddTraceMe("MemoryAllocation", ptr);
    return ptr;
  }
  
  // 4) 如果查找失败，那么扩展内存，然后再查找合适的空闲Chunk
  // Try to extend
  if (Extend(unused_alignment, rounded_bytes)) {
    ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);  // 4.8 再次查找 Chunk
    if (ptr != nullptr) {
      AddTraceMe("MemoryAllocation", ptr);
      return ptr;
    }
  }
  
  // 5) 第一次尝试合并，并再次查找 Chunk
  if ((freed_before == 0) && (!timestamped_chunks_.empty())) {
    // We're unable to satisfy an allocation request without a specific
    // timestamp requirement.  Rather than fail, try merging any held-out
    // timestamped chunks more aggressively until a free chunk of the necessary
    // size is formed.
    if (MergeTimestampedChunks(rounded_bytes)) {
      ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
      if (ptr != nullptr) {
        AddTraceMe("MemoryAllocation", ptr);
        return ptr;
      }
    }
  }

  // 6) 第二次尝试合并，并再次查找 Chunk
  // Reaching this point means that no chunks can satisfy the request. Also,
  // the unallocated bytes cannot satisfy the request. Before giving up, let's
  // try deallocating free regions so that suballocator can combine them with
  // the unallocated bytes and form a larger region.
  if (DeallocateFreeRegions(rounded_bytes) &&
      Extend(unused_alignment, rounded_bytes)) {
    ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
    if (ptr != nullptr) {
      AddTraceMe("MemoryAllocation", ptr);
      return ptr;
    }
  }
  
  // 7) 可用内存不足导致分配失败，报告 OOM
  // We searched all bins for an existing free chunk to use and
  // couldn't find one.  This means we must have run out of memory,
  // Dump the memory log for analysis.
  MaybeWriteMemoryMap();
  if (dump_log_on_failure) {
    LOG(WARNING)
        << "Allocator (" << Name() << ") ran out of memory trying "
        << "to allocate " << strings::HumanReadableNumBytes(num_bytes)
        << " (rounded to " << rounded_bytes << ")" << "requested by op "
        << tsl::profiler::ScopedMemoryDebugAnnotation::CurrentAnnotation()
               .pending_op_name
        << "\nIf the cause is memory fragmentation maybe the environment "
        << "variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will "
        << "improve the situation. \nCurrent allocation summary follows."
        << "\nCurrent allocation summary follows.";
    DumpMemoryLog(rounded_bytes);
    LOG(WARNING) << RenderOccupancy();
  }
  return nullptr;
}

1.1 调整大小

RoundedBytes的代码实现：

size_t BFCAllocator::RoundedBytes(size_t bytes) {
  size_t rounded_bytes =
      (kMinAllocationSize *
       ((bytes + kMinAllocationSize - 1) / kMinAllocationSize));
  DCHECK_EQ(size_t{0}, rounded_bytes % kMinAllocationSize);
  return rounded_bytes;
}

将每次分配的内存大小调整为kMinAllocationSize的N倍，这样所有内存地址都是很好地按字节对齐了。

1.2 查找Bin

BinNumForSize的代码实现：

  BinNum BinNumForSize(size_t bytes) {
    uint64 v = std::max<size_t>(bytes, 256) >> kMinAllocationBits;
    int b = std::min(kNumBins - 1, tsl::Log2Floor64(v));
    return b;
  }

Bin size 是 256 的 2^N 倍。std::max<size_t>(bytes, 256) >> kMinAllocationBits先将分配内存的字节数右移 8 位，然后把结果用在std::min(kNumBins - 1, tsl::Log2Floor64(v))，计算出的二进制有效位数即为 Bin 在 Bins 中的索引。

1.3 查找Chunk

FindChunkPtr的代码实现：

void* BFCAllocator::FindChunkPtr(BinNum bin_num, size_t rounded_bytes,
                                 size_t num_bytes, uint64 freed_before) {
  // First identify the first bin that could satisfy rounded_bytes.
  for (; bin_num < kNumBins; bin_num++) {
    // Start searching from the first bin for the smallest chunk that fits
    // rounded_bytes.
    Bin* b = BinFromIndex(bin_num);
    for (auto citer = b->free_chunks.begin(); citer != b->free_chunks.end();
         ++citer) {
         
      // 1) 从之前得到的 Bin 索引开始，查找合适的空闲 Chunk
      const BFCAllocator::ChunkHandle h = (*citer);
      BFCAllocator::Chunk* chunk = ChunkFromHandle(h);
      
      DCHECK(!chunk->in_use());
      if (freed_before > 0 && freed_before < chunk->freed_at_count) {
        continue;
      }
      if (chunk->size >= rounded_bytes) {
        
        // 2) 将找到的 Chunk 从 Bin 中移除
        // We found an existing chunk that fits us that wasn't in use, so remove
        // it from the free bin structure prior to using.
        RemoveFreeChunkIterFromBin(&b->free_chunks, citer);

		// 3) 拆分 Chunk：当以下两个条件之一满足时，SplitChunk过程将被触发。
		// 		1. 当 Chunk 的 size 是用户请求的 round size 两倍及以上时（用户请求的 size 会根据最小分配单元做 round 近似）
		// 		2. 当 Chunk 的 size 减去用户请求的 round size 后依然大于等于最大碎片限定时（128MB）
        // If we can break the size of the chunk into two reasonably large
        // pieces, do don't waste more than max_internal_fragmentation_bytes on
        // padding. If this threshold is not set by the user, then use 128MB as
        // the default.
        const int64_t max_internal_fragmentation_bytes =
            (opts_.fragmentation_fraction > 0.0)
                ? opts_.fragmentation_fraction * memory_limit_
                : 128 << 20;
		
        if (chunk->size >= rounded_bytes * 2 ||
            static_cast<int64_t>(chunk->size) - rounded_bytes >=
                max_internal_fragmentation_bytes) {
          SplitChunk(h, rounded_bytes);
          chunk = ChunkFromHandle(h);  // Update chunk pointer in case it moved
        }
		
		// 4) 修改 Chunk 的请求大小、分配 ID（标记 Chunk 被占用）
        // The requested size of the returned chunk is what the user
        // has allocated.
        chunk->requested_size = num_bytes;
        // Assign a unique id and increment the id counter, marking the
        // chunk as being in use.
        chunk->allocation_id = next_allocation_id_++;
		
		// 5) 更新统计
        // Update stats.
        ++stats_.num_allocs;
        stats_.bytes_in_use += chunk->size;
        if (stats_.bytes_in_use > stats_.peak_bytes_in_use) {
          VLOG(2) << "New Peak memory usage of " << stats_.bytes_in_use
                  << " bytes for " << Name();
        }
        stats_.peak_bytes_in_use =
            std::max(stats_.peak_bytes_in_use, stats_.bytes_in_use);
        stats_.largest_alloc_size =
            std::max<std::size_t>(stats_.largest_alloc_size, chunk->size);

#ifdef TENSORFLOW_MEM_DEBUG
        if (ShouldRecordOpName()) {
          const auto& annotation =
              profiler::ScopedMemoryDebugAnnotation::CurrentAnnotation();
          if (annotation.pending_op_name != nullptr) {
            chunk->op_name = annotation.pending_op_name;
          } else {
            LOG(INFO) << "missing pending_op_name for " << Name()
                      << " reading addr "
                      << static_cast<const void*>(&annotation.pending_op_name)
                      << "\n"
                      << CurrentStackTrace();
            chunk->op_name = nullptr;
          }
          chunk->action_count = ++action_counter_;
          chunk->step_id = annotation.pending_step_id;
          int slot = chunk->action_count % MEM_DEBUG_SIZE_HISTORY_SIZE;
          size_history_[slot] = stats_.bytes_in_use;
        }
#endif

        VLOG(4) << "Returning: " << chunk->ptr;
        if (VLOG_IS_ON(4)) {
          LOG(INFO) << "A: " << RenderOccupancy();
        }
        
        // 6) 成功时返回找到的 Chunk 指针
        return chunk->ptr;
      }
    }
  }
  
  // 7) 失败时返回空指针
  return nullptr;
}

SplitChunk的代码实现：

void BFCAllocator::SplitChunk(BFCAllocator::ChunkHandle h, size_t num_bytes) {
  // Allocate the new chunk before we do any ChunkFromHandle
  ChunkHandle h_new_chunk = AllocateChunk();

  Chunk* c = ChunkFromHandle(h);
  CHECK(!c->in_use() && (c->bin_num == kInvalidBinNum));

  // Create a new chunk starting num_bytes after c
  BFCAllocator::Chunk* new_chunk = ChunkFromHandle(h_new_chunk);
  new_chunk->ptr = static_cast<void*>(static_cast<char*>(c->ptr) + num_bytes);
  region_manager_.set_handle(new_chunk->ptr, h_new_chunk);

  // Set the new sizes of the chunks.
  new_chunk->size = c->size - num_bytes;
  c->size = num_bytes;

  // The new chunk is not in use.
  new_chunk->allocation_id = -1;

  // It inherits the freed time.
  new_chunk->freed_at_count = c->freed_at_count;

  // 1) 调整 Chunk 的前驱后继指针
  // Maintain the pointers.
  // c <-> c_neighbor becomes
  // c <-> new_chunk <-> c_neighbor
  BFCAllocator::ChunkHandle h_neighbor = c->next;
  new_chunk->prev = h;
  new_chunk->next = h_neighbor;
  c->next = h_new_chunk;
  if (h_neighbor != kInvalidChunkHandle) {
    Chunk* c_neighbor = ChunkFromHandle(h_neighbor);
    c_neighbor->prev = h_new_chunk;
  }

  // 2) 根据 Free Chunk 的大小，将它插入到对应的 Bin 中
  // Add the newly free chunk to the free bin.
  InsertFreeChunkIntoBin(h_new_chunk);
}

1.4 扩展内存

Extend的代码实现：

bool BFCAllocator::Extend(size_t alignment, size_t rounded_bytes) {
  size_t available_bytes = memory_limit_ - *stats_.pool_bytes;
  // Rounds available_bytes down to the nearest multiple of kMinAllocationSize.
  available_bytes = (available_bytes / kMinAllocationSize) * kMinAllocationSize;
  
  // 1) 已占用的加上申请的内存大小，超过最大内存限制时，返回失败。
  // Do we have enough space to handle the client's request?
  // If not, fail immediately.
  if (rounded_bytes > available_bytes) {
    return false;
  }

  // 2) 循环将当前区域可分配的内存扩充 1 倍，直到大于等于申请的内存大小。
  // If curr_region_allocation_bytes_ is not enough to satisfy the
  // allocation, keep multiplying by a power of two until that is
  // sufficient.
  bool increased_allocation = false;
  while (rounded_bytes > curr_region_allocation_bytes_) {
    curr_region_allocation_bytes_ *= 2;
    increased_allocation = true;
  }
  
  // 3) 从内存池里分配内存
  // Try allocating.
  size_t bytes = std::min(curr_region_allocation_bytes_, available_bytes);
  size_t bytes_received;
  void* mem_addr = sub_allocator_->Alloc(alignment, bytes, &bytes_received);
  
  // 4) 如果失败，尝试分配申请内存大小的 90%。一直重复，直到分配成功或可用内存不足。
  if (mem_addr == nullptr) {
    static constexpr float kBackpedalFactor = 0.9;

    // Try allocating less memory.
    while (mem_addr == nullptr) {
      bytes = RoundedBytes(bytes * kBackpedalFactor);
      if (bytes < rounded_bytes) {
        return false;
      }
      mem_addr = sub_allocator_->Alloc(alignment, bytes, &bytes_received);
    }
  }

  if (!increased_allocation) {
    // Increase the region size of the next required allocation.
    curr_region_allocation_bytes_ *= 2;
  }

  VLOG(1) << "Extending allocation by "
          << strings::HumanReadableNumBytes(bytes_received) << " bytes for "
          << Name() << ".";

  *stats_.pool_bytes += bytes_received;
  *stats_.peak_pool_bytes =
      std::max(*stats_.pool_bytes, *stats_.peak_pool_bytes);
  VLOG(1) << "Total allocated bytes: "
          << strings::HumanReadableNumBytes(*stats_.pool_bytes);

  VLOG(1) << "Allocated memory at " << mem_addr << " to "
          << static_cast<void*>(static_cast<char*>(mem_addr) + bytes_received);
  
  // 5) 给分配的内存添加 AllocationRegion
  AllocationRegion* maybe_extended_region = nullptr;
  if (coalesce_regions_) {
    maybe_extended_region =
        region_manager_.AddOrExtendAllocationRegion(mem_addr, bytes_received);
  } else {
    region_manager_.AddAllocationRegion(mem_addr, bytes_received);
  }
  
  // 6) 创建一个空闲 Chunk
  // Create one large chunk for the whole memory space that will
  // be chunked later.
  ChunkHandle h = AllocateChunk();
  BFCAllocator::Chunk* c = ChunkFromHandle(h);
  c->ptr = mem_addr;
  c->size = bytes_received;
  c->allocation_id = -1;
  c->prev = kInvalidChunkHandle;
  c->next = kInvalidChunkHandle;
  c->freed_at_count = 0;

  region_manager_.set_handle(c->ptr, h);

  // If the region was extended, then there exists a previous chunk that should
  // be linked to the new chunk.
  if (maybe_extended_region != nullptr) {
    ChunkHandle prev =
        maybe_extended_region->get_handle(maybe_extended_region->ptr());
    BFCAllocator::Chunk* prev_chunk = ChunkFromHandle(prev);
    // Find the last recorded chunk in the extended region.
    while (prev_chunk->next != kInvalidChunkHandle) {
      prev = prev_chunk->next;
      prev_chunk = ChunkFromHandle(prev);
    }
    c->prev = prev;
    prev_chunk->next = h;
  }
  
  // 7) 将空闲 Chunk 插入 Bin 中
  // Maybe merge adjacent chunks and insert the chunk into the right bin.
  InsertFreeChunkIntoBin(TryToCoalesce(h, /*ignore_freed_at=*/false));

  return true;
}

2 回收内存

以下内容部分来自：TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack

因为在回收时只知道存储空间首地址指针，并不知道其对应的 Chunk，所以需要先借助region_manager等辅助工具获取其所对应的 Chunk 指针，然后考虑其前驱后继节点是否可以合并。下面展示了整体流程。

总流程DeallocateRawInternal的代码实现：

void BFCAllocator::DeallocateRawInternal(void* ptr) {
  if (ptr == nullptr) {
    VLOG(2) << "tried to deallocate nullptr";
    return;
  }
  absl::MutexLock l(&mutex_);
  
  // 1) 从 RegionManager 的指针到 ChunkHandle 的映射关系中得到 ChunkHandle
  // Find the chunk from the ptr.
  BFCAllocator::ChunkHandle h = region_manager_.get_handle(ptr);
  CHECK(h != kInvalidChunkHandle);
  // Record chunk information before it's freed.
  Chunk* chunk = ChunkFromHandle(h);
  void* chunk_ptr = chunk->ptr;
  int64_t req_bytes = chunk->requested_size;
  int64_t alloc_bytes = chunk->size;
  
  // 2) 将 Chunk 标记为空闲，然后把总占用的内存量减去 Chunk 的大小
  MarkFree(h);

  // Consider coalescing it.
  if (timing_counter_) {
    InsertFreeChunkIntoBin(h);
    timestamped_chunks_.push_back(h);
  } else {
    // 3) 合并 Chunk
    // 4) 将合并后的空闲 Chunk 插入 Bin
    InsertFreeChunkIntoBin(TryToCoalesce(h, false));
  }

  // TraceMe needs to be added after MarkFree and InsertFreeChunkIntoBin for
  // correct aggregation stats (bytes_in_use, fragmentation).
  AddTraceMe("MemoryDeallocation", chunk_ptr, req_bytes, alloc_bytes);

  if (VLOG_IS_ON(4)) {
    LOG(INFO) << "F: " << RenderOccupancy();
  }
}

2.1 获取ChunkHandle

2.2 更新状态

MarkFree的代码实现：

void BFCAllocator::MarkFree(BFCAllocator::ChunkHandle h) {
  Chunk* c = ChunkFromHandle(h);
  CHECK(c->in_use() && (c->bin_num == kInvalidBinNum));

  // 1) 将 Chunk 标记为空闲
  // Mark the chunk as no longer in use.
  c->allocation_id = -1;

  // Optionally record the free time.
  if (timing_counter_) {
    c->freed_at_count = timing_counter_->next();
  }
  
  // 2) 更新状态，把总占用的内存量减去 Chunk 的大小
  // Updates the stats.
  stats_.bytes_in_use -= c->size;

#ifdef TENSORFLOW_MEM_DEBUG
  if (ShouldRecordOpName()) {
    c->action_count = ++action_counter_;
    int slot = c->action_count % MEM_DEBUG_SIZE_HISTORY_SIZE;
    size_history_[slot] = stats_.bytes_in_use;
  }
#endif
}

2.3 合并Chunk

TryToCoalesce的代码实现：

BFCAllocator::ChunkHandle BFCAllocator::TryToCoalesce(ChunkHandle h,
                                                      bool ignore_freed_at) {
  Chunk* c = ChunkFromHandle(h);
  if ((!ignore_freed_at) && c->freed_at_count > 0) return h;
  ChunkHandle coalesced_chunk = h;
  
  // 1) 合并直接后继
  // If the next chunk is free, merge it into c and delete it.
  if (c->next != kInvalidChunkHandle && !ChunkFromHandle(c->next)->in_use()) {
    Chunk* n = ChunkFromHandle(c->next);
    if ((n->freed_at_count == 0) || ignore_freed_at) {
      VLOG(4) << "Merging c->next " << n->ptr << " with c " << c->ptr;
      // 1.1) 将直接后继从 Bin 中移除
      RemoveFreeChunkFromBin(c->next);
      // 1.2) 合并
      Merge(h, c->next);
    }
  }
  
  // 2) 合并直接前趋
  // If the previous chunk is free, merge c into it and delete c.
  if (c->prev != kInvalidChunkHandle && !ChunkFromHandle(c->prev)->in_use()) {
    Chunk* n = ChunkFromHandle(c->prev);
    if ((n->freed_at_count == 0) || ignore_freed_at) {
      VLOG(4) << "Merging c " << c->ptr << " into c->prev " << n->ptr;
      coalesced_chunk = c->prev;
      // 2.1) 将直接前趋从 Bin 中移除
      RemoveFreeChunkFromBin(c->prev);
      // 2.2) 合并
      Merge(c->prev, h);
    }
  }

  return coalesced_chunk;
}

Merge的代码实现：

// Merges h1 and h2 when Chunk(h1)->next is h2 and Chunk(h2)->prev is c1.
// We merge Chunk(h2) into Chunk(h1).
void BFCAllocator::Merge(BFCAllocator::ChunkHandle h1,
                         BFCAllocator::ChunkHandle h2) {
  Chunk* c1 = ChunkFromHandle(h1);
  Chunk* c2 = ChunkFromHandle(h2);
  // We can only merge chunks that are not in use.
  CHECK(!c1->in_use() && !c2->in_use());

  // c1's prev doesn't change, still points to the same ptr, and is
  // still not in use.

  // 1) 修改 Chunk 的指向关系
  // Fix up neighbor pointers
  //
  // c1 <-> c2 <-> c3 should become
  // c1 <-> c3

  BFCAllocator::ChunkHandle h3 = c2->next;
  c1->next = h3;
  CHECK(c2->prev == h1);
  if (h3 != kInvalidChunkHandle) {
    BFCAllocator::Chunk* c3 = ChunkFromHandle(h3);
    c3->prev = h1;
  }

  // 2) 更新 Chunk 大小
  // Set the new size
  c1->size += c2->size;

  // Pick latest free time.
  c1->freed_at_count = std::max(c1->freed_at_count, c2->freed_at_count);
  
  // 3) 删除被合并的 Chunk
  DeleteChunk(h2);
}

总结

以下内容来自：TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack

本文总结了 TensorFlow 中存储管理器——BFC Allocator。它的设计思路来自于经典来的 dlmalloc 分配算法，是 Best fit coalecing 的简单实现版本。BFC Allocator 是为了应对 TensorFlow 中频繁分配释放存储空间需求的场景而出现的解决方案，通过事先将存储空间从物理设备上开辟好，并将这些空闲存储空间封装成 Chunk，组织成有序双向链表，然后利用 Bin 这一种索引结构为 Chunk 的查询做加速，最终完成了高效的分配算法。在实际分配时，可能会遇到 Chunk 链表中不存在符合要求的空闲 Chunk 情况，这时候就可能需要向物理设备中再次开辟新的存储空间，这个过程被视为对 Chunk 链表的扩展，对应的过程是Extend。因为是按 Chunk 进行分配，势必可能造成存储碎片，为了解决碎片问题，BFC Allocator 设计了SplitChunk和Merge函数。

参考资料

关于 XLA

openxla/xla: A machine learning compiler for GPUs, CPUs, and ML accelerators
OpenXLA (NVIDIA) - AI技术栈解析及应用- 作者：张真瑜 | 山东大学智能创新研究院
XLA：优化机器学习编译器 | TensorFlow
XLA优化原理简介-云社区-华为云
如何看待OpenXLA这个开源项目？ - TanyoKwok 郭天佑的回答 - 知乎

关于 TensorFlow BFC

TensorFlow 源码分析之内存管理BFC算法——DeepReve
TensorFlow内存管理bfc算法
Tensorflow 源码分析- 从GPU OOM开始说Tensorflow的BFC内存管理（文中还对 Region 进行了介绍）
TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack（文中有个分析出的结论：存储区分配的时机是在一个 Tensor 对象被创建时立即发生的；文中还对辅助类 AllocationRegion 与 RegionManager 进行了介绍）
极简入门TensorFlow 内存管理
Tensorflow源码剖析：Allocator（基础篇）（文中介绍了 Allocator 类）
- TensorFlow源码剖析：Allocator（提高篇）（文中介绍了 AllocatorStats 类，其余大部分内容和TensorFlow中的显存管理器——BFC Allocator - DeepLearningStack相似）
- TensorFlow源码剖析：Allocator（进阶篇）（文中介绍了 AllocatorRetry 类）
- TensorFlow源码剖析：Allocator（应用篇）（文中介绍了 TensorFlow 下 GPU 相关配置项）
Nvidia GPU显存池-BFC（文中介绍了 Linux 内核内存池、tcmalloc 和 dlmalloc 等内存分配器、由allow_growth参数决定的两种分配模式）