Profiling ngcore in go

Posted on 2021-07-10 Edited on 2021-10-11

因为新版本的ngcore已经把mining模块删除了，因此在空闲状态从理论上看应该是非常低调的一个p2p“哑巴”。但是实际在PC这个3600x相对高级u的环境下都能占据>10%的cpu且高耗电。因此这里就开始对程序尝试调优。

暂时就只关注cpu利用率的问题（高耗电在客户端上太致命了

首先通过注释掉import _ "net/http/pprof"确认了高电量不是由于引入profile导致的。

然后根据profile给出的结果对几个loop都加了runtime.Goshced，并没有太大起色。

profile结果全是runtime.xxx这种内部的方法。

go的profile

首先肯定是需要熟练一下go自带的pprof工具。

为了方便profiling，ngcore会在一个随机可用端口上建立一个http server来为pprof提供当前运行数据。
一般可以使用go tool pprof http://localhost:[port]来开启命令行模式，如果带-http=:[another_port]参数就可以在another_port上建一个网页来展示diagrams

下面说一下在命令行模式下的一些命令

topNN

NN可以换成任何数字，比如top10就profile（CPU）前十的func。

go的blog上示例明显老旧了，这里我拿go1.16下的结果讲解一下

(pprof) top10
Showing nodes accounting for 15330ms, 54.15% of 28310ms total
Dropped 151 nodes (cum <= 141.55ms)
Showing top 10 nodes out of 86
显示节点占15330ms，占28310ms总数的54.15% （pprof标识的有效样本时间，和总样本时间）
删除了 151 个节点（cum <= 141.55ms，意思就是这151个剩余的加起来不超过141ms）
显示 86 个节点中的前 10 个节点（即top10）

      flat  flat%   sum%        cum   cum%
    3530ms 12.47% 12.47%     8150ms 28.79%  runtime.selectgo
    2920ms 10.31% 22.78%     2930ms 10.35%  runtime.stdcall1
    1660ms  5.86% 28.65%     1700ms  6.00%  runtime.unlock2
    1320ms  4.66% 33.31%    11930ms 42.14%  runtime.schedule
    1310ms  4.63% 37.94%     1310ms  4.63%  runtime.stdcall2
    1270ms  4.49% 42.42%     3300ms 11.66%  runtime.lock2
     960ms  3.39% 45.81%     9160ms 32.36%  github.com/ngchain/ngcore/consensus.(*PoWork).eventLoop
     920ms  3.25% 49.06%     6950ms 24.55%  runtime.findrunnable
     850ms  3.00% 52.07%     3230ms 11.41%  runtime.checkTimers
     590ms  2.08% 54.15%     1630ms  5.76%  runtime.sellock

flat：样本中running时间（不包含waiting for a called function to return，默认按此排序）
flat%：占总样本时间百分比
sum：即从上往下sum(flat%)
cum：func出现（包括running或waiting for a called function to return）的样本时间（用-cum就用它排序）
cum%：占总百分比

web

web命令就是在网页显示stack trace。命令行打开的web不如直接host在http。

通过trace就可以看到是主要哪里call了这个在前面疯狂running的func。

list

list Func会给你打印Func这个func的源码，并且在每行的左边显示对应的flat和cum

优化

从上面的可以看出来其实是eventLoop里的select导致了selectgo的大量调用。

在这个函数里我用了两个case的for-select结构，两个case都是从channel里接受内容。

尝试了下把for-select-<- ch分解成了两个for-<- ch（不影响业务），cpu占用瞬间变成1-2%（而且是平板上）可以说是相当显著。

有关这个for-select性能的讨论

https://stackoverflow.com/questions/42059800/performance-of-golang-select-statement-in-a-for-loop/42062485

Ref

https://blog.golang.org/pprof

https://github.com/dgryski/go-perfbook/blob/master/performance.md