gdbprof分析ceph的4K-randwrite性能

分析ceph进行4k随机写时哪些线程及其函数较为耗资源

Posted by YMG on March 17, 2018

gdbprof,是一款基于系统时间(而非cpu事件)的性能分析工具,它通过sampling的方式进行性能统计的。每隔period(默认为0.1秒),给gdb发送sigint信号,然后循环每个thread,根据该thread的call trace,统计函数的调用情况。 它是基于python实现的,中间用到了gdb的python api。
gdbprof源码 本文采用的gdbprof优化版本 本人forked版本,随时欢迎。

实验环境

搭建ceph-osd集群(本文用了3个osd.0、osd.1、osd.2)
服务端:server_host(osd)
客户端:client_host(mons、mgrs)
服务端需安装gdb,客户端需安装fio(都自行解决),下载gdbprof

实验过程

(1)简单查看下集群是否正常:

$ceph osd tree

$ceph -s

(2)查看所有osd对应进程及进程号

$ps aux|grep ceph-osd

实验一 求解耗系统资源的线程(单image和10-images)

客户端(server_host)操作:
(1)fio 创建pool(或默认rbd)和image,并进行后续操作
$ rbd create –pool ymg –image img01 –size 40G
(2)先进行填充实验
$fio -direct=1 -iodepth=256 -ioengine=rbd -pool=rbd -rbdname=img01 -rw=write -bs=1M -size=40G -ramp_time=5 -group_reporting -name=full-fill
(3)再进行4k-randWrite的单image操作
$ fio -direct=1 -iodepth=256 -ioengine=rbd -pool=rbd -rbdname=img01 -rw=randwrite -bs=4K -runtime=300 -numjobs=1 -ramp_time=5 -group_reporting -name=one_gdbprof
(4)与此同时,服务端机器(client_host)需要并行执行top命令
$top -p 7922 -H
也即追踪进程7922(随意选一个)在过程中的所有线程对资源消耗情况,结果如下所示:

对比实验:10个images的4k-randWrite的实验操作:
$ BS=4k RW=randwrite fio images.fio
得先创建10个images
$rbd create ymg/img00 -s 40G
images.fio内容如下所示:

  [global]
  direct=1
  ioengine=rbd
  pool=ymg
  bs=${BS}
  size=40G
  time_based
  runtime=500
  ramp_time=5
  iodepth=256
  rw=${RW}
  group_reporting

  [00]
  rbdname=img00
  [01]
  rbdname=img01
  [02]
  rbdname=img02
  [03]
  rbdname=img03
  [04]
  rbdname=img04
  [05]
  rbdname=img05
  [06]
  rbdname=img06
  [07]
  rbdname=img07
  [08]
  rbdname=img08
  [09]
  rbdname=img09

服务端执行 $top -p 7922 -H
结果如下所示

总结
由上面的一对儿实验可知:4k-randwrite过程中,比较消耗资源的线程有:msgr-worker-0(7926)/msgr-worker-1(7927)、bstore_kv_sync(7972)、rocksdb::bg0(21792)、finisher(7971)、bstore_kv_final(7973)、log(7924)、tp_osd_tp(8109)
接下来,我们需要具体看这些线程中具体哪些函数最消耗时间。这就是gdbprof的亮相了。

实验二 求解耗系统资源的具体函数(单image)

在客户端进行4k-randWrite操作同时(fio命令),服务端要跑gdbprof工具(可下面参考原文中的.docx文件),切换到gdbprof目录,有gdbprof.py
$vim gdbprof.py
line140,调节采样运行时间(无需太短或长),据说单位为秒,但感觉是采样次数,也就是样本个数,本人设过500,1000,2000,5000(等待时间太长),如下所示:

$sudo gdb -ex ‘set pagination off’ -ex ‘attach 7922’ -ex ‘source gdbprof.py’ -ex ‘profile begin’ -ex ‘quit’
注:在运行过程中,可能会出现一些下面的debuginfo信息
$yum install ceph-debuginfo.x86_64 –enablerepo didi_deph

运行结束后就可以看到打印出来的结果,共有69个线程的信息
先找出打印出的线程函数信息与上面线程的对应关系:
$gdb
(gdb) attach 7922
(gdb) info threads

接着,贴出6个较为耗资源的线程的结果:Thread: 2===21792===rokcsdb::bg0、Thread: 28===8109===tp_osd_tp、Thread: 51===7973===bstore_kv_final、Thread: 52===7972===bstore_kv_sync、Thread: 53===7971===finisher、Thread: 68===7926===msgr_worker_0,可以看出哪些函数较为消耗资源,由于结果较长,部分如下所示:

Thread: 2===21792===rokcsdb::bg0

+ 100.00% clone
  + 100.00% start_thread
    + 100.00% None
      + 100.00% rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper
        + 100.00% rocksdb::ThreadPoolImpl::Impl::BGThread
          + 69.00% rocksdb::DBImpl::BackgroundCallCompaction
          | + 69.00% rocksdb::DBImpl::BackgroundCompaction
          |   + 69.00% rocksdb::CompactionJob::Run
          |     + 69.00% rocksdb::CompactionJob::ProcessKeyValueCompaction
          |       + 56.80% rocksdb::CompactionIterator::Next
          |       | + 42.20% rocksdb::MergingIterator::Next
          |       | | + 42.00% Next
          |       | | | + 25.00% b::(anonymous namespace)::TwoLevelIterator::Next
          |       | | | | + 25.00% Next
          |       | | | |   + 24.40% b::(anonymous namespace)::TwoLevelIterator::SkipEmptyDataBlocksForward
          |       | | | |   | + 24.40% b::(anonymous namespace)::TwoLevelIterator::InitDataBlock
          |       | | | |   |   + 24.20% rocksdb::BlockBasedTable::BlockEntryIteratorState::NewSecondaryIterator
          |       | | | |   |   | + 24.20% rocksdb::BlockBasedTable::NewDataBlockIterator
          |       | | | |   |   |   + 23.60% b::(anonymous namespace)::ReadBlockFromFile
          |       | | | |   |   |   | + 23.40% rocksdb::ReadBlockContents
          |       | | | |   |   |   | | + 23.40% ReadBlock
          |       | | | |   |   |   | |   + 22.40% rocksdb::RandomAccessFileReader::Read
          |       | | | |   |   |   | |   | + 22.40% b::(anonymous namespace)::ReadaheadRandomAccessFile::Read
          |       | | | |   |   |   | |   |   + 22.40% ReadIntoBuffer
          |       | | | |   |   |   | |   |     + 22.40% BlueRocksRandomAccessFile::Read
          |       | | | |   |   |   | |   |       + 22.40% read_random
          |       | | | |   |   |   | |   |         + 22.40% BlueFS::_read_random
          |       | | | |   |   |   | |   |           + 22.40% KernelDevice::read_random
          |       | | | |   |   |   | |   |             + 22.40% pread
          |       | | | |   |   |   | |   |               + 22.40% pread64
          |       | | | |   |   |   | |   + 1.00% Value
          |       | | | |   |   |   | |     + 1.00% b::crc32c::ExtendImpl<rocksdb::crc32c::Fast_CRC32>
          |       | | | |   |   |   | |       + 1.00% Fast_CRC32
          |       | | | |   |   |   | |         + 1.00% Slow_CRC32
          |       | | | |   |   |   | + 0.20% rocksdb::Block::Block(rocksdb::BlockContents&&, unsigned long, unsigned long, rocksdb::Statistics*)
          |       | | | |   |   |   |   + 0.20% BlockContents
          |       | | | |   |   |   |     + 0.20% operator=
          |       | | | |   |   |   + 0.40% rocksdb::BlockBasedTable::MaybeLoadDataBlockToCache
          |       | | | |   |   |     + 0.40% rocksdb::BlockBasedTable::GetDataBlockFromCache
          |       | | | |   |   |       + 0.40% b::(anonymous namespace)::GetEntryFromCache
          |       | | | |   |   |         + 0.40% rocksdb::ShardedCache::Lookup
          |       | | | |   |   |           + 0.20% rocksdb::LRUCache::GetShard
          |       | | | |   |   + 0.20% b::(anonymous namespace)::TwoLevelIterator::SetSecondLevelIterator
          |       | | | |   + 0.40% rocksdb::BlockIter::ParseNextKey
          |       | | | |   | + 0.20% TrimAppend
          |       | | | |   | | + 0.20% EnlargeBufferIfNeeded
          |       | | | |   | |   + 0.20% tc_newarray
          |       | | | |   | + 0.20% DecodeEntry
          |       | | | |   + 0.20% b::(anonymous namespace)::TwoLevelIterator::Next
          |       | | | |     + 0.20% Next
          |       | | | + 17.00% b::(anonymous namespace)::TwoLevelIterator::SkipEmptyDataBlocksForward
          |       | | |   + 17.00% b::(anonymous namespace)::TwoLevelIterator::InitDataBlock
          |       | | |     + 17.00% rocksdb::BlockBasedTable::BlockEntryIteratorState::NewSecondaryIterator
          |       | | |       + 17.00% rocksdb::BlockBasedTable::NewDataBlockIterator
          |       | | |         + 16.80% b::(anonymous namespace)::ReadBlockFromFile
          |       | | |           + 16.80% rocksdb::ReadBlockContents
          |       | | |             + 16.80% ReadBlock
          |       | | |               + 16.60% rocksdb::RandomAccessFileReader::Read
          |       | | |               | + 16.60% b::(anonymous namespace)::ReadaheadRandomAccessFile::Read
          |       | | |               |   + 16.40% ReadIntoBuffer
          |       | | |               |   | + 16.40% BlueRocksRandomAccessFile::Read
          |       | | |               |   |   + 16.40% read_random
          |       | | |               |   |     + 16.40% BlueFS::_read_random
          |       | | |               |   |       + 16.40% KernelDevice::read_random
          |       | | |               |   |         + 16.40% pread
          |       | | |               |   |           + 16.40% pread64
          |       | | |               |   + 0.20% TryReadFromCache
          |       | | |               |     + 0.20% memcpy
          |       | | |               |       + 0.20% __memcpy_ssse3_back
          |       | | |               + 0.20% Value
          |       | | |                 + 0.20% b::crc32c::ExtendImpl<rocksdb::crc32c::Fast_CRC32>
          |       | | |                   + 0.20% Fast_CRC32
          |       | | |                     + 0.20% Slow_CRC32
          |       | | + 0.20% replace_top
          |       | |   + 0.20% downheap
          |       | |     + 0.20% operator()
          |       | + 14.60% rocksdb::CompactionIterator::NextFromInput
          |       |   + 13.20% rocksdb::MergingIterator::Next
          |       |   | + 12.80% Next
          |       |   | | + 9.40% b::(anonymous namespace)::TwoLevelIterator::Next
          |       |   | | | + 9.40% Next
          |       |   | | |   + 9.20% b::(anonymous namespace)::TwoLevelIterator::SkipEmptyDataBlocksForward
          |       |   | | |   | + 9.20% b::(anonymous namespace)::TwoLevelIterator::InitDataBlock
        ......
        ......
        ......

Thread: 28===8109===tp_osd_tp
......
......
......

具体文件可下载 由上面结果可以知道6个线程中哪些函数较为消耗资源,如下所示:
(1)Thread: 2===21792===rokcsdb::bg0:

(2)Thread: 28===8109===tp_osd_tp:

(3)Thread: 51===7973===bstore_kv_final

(4)Thread: 52===7972===bstore_kv_sync

(5)Thread: 53===7971===finisher

(6)Thread: 68===7926===msgr_worker_0

想得到所有69个线程的函数堆栈情况,可去下载