一般来说,想检查磁盘I/O情况,可以使用iostat、iotop、sar等,但这些命令只能做一个整体的了解,没法具体到某一次io的详细情况,而今天介绍的blktrace就可以深入到Linux I/O栈的方方面面,把一切都搞得明明白白的!
先了解下Linux内核I/O栈,一般来说,我们对网络协议栈比较熟悉,比如网络协议分了5层,每一层都是干什么的,但对于I/O栈,则了解相对较少,如下:
Linux内核I/O栈大概分了6层,如下:
blktrace跟踪在块存储层(包括卷管理层),它的使用方法如下:
其中第六个字段表示I/O事件,它代表了 I/O 请求进行到了哪一阶段,有如下这些事件:
这些Event中常见的出现顺序如下:
Q – 即将生成 I/O 请求
|
G – I/O 请求生成
|
I – I/O 请求进入 I/O Scheduler 队列
|
D – I/O 请求进入 Driver
|
C – I/O 请求执行完毕
由于每个Event都有出现的时间戳,根据这个时间戳就可以计算出 I/O 请求在每个阶段所消耗的时间,比如从Q事件到C事件的时间叫Q2C,那么常见阶段称呼如下:
Q------------>G----------------->I----------------------------------->D----------------------------------->C
|-Q time(Q2G)-|-Insert time(G2I)-|------schduler wait time(I2D)-------|--------driver,disk time(D2C)-------|
|------------------------------- await time in iostat output(Q2C) -----------------------------------------|
ok,了解了这些关键概念后,就可以看懂blktrace命令的输出结果了,继续往下看...
# 安装blktrace包
$ sudo apt install blktrace
# blktrace依赖debugfs,需要挂载它
$ sudo mount -t debugfs debugfs /sys/kernel/debug
blktrace包安装后有blktrace、blkparse、btt、blkiomon这4个命令,其中blktrace负责采集I/O事件数据,blkparse负责将每一个I/O事件数据解析为纯文本方便阅读,btt、blkiomon负责统计分析。
$ sudo blktrace -d /dev/sda -o - | blkparse -i -
8,0 3 1 0.000000000 3064475 A W 367809144 + 8 <- (8,1) 367807096
8,0 3 2 0.000001498 3064475 Q W 367809144 + 8 [kworker/u256:3]
8,0 3 3 0.000013880 3064475 G W 367809144 + 8 [kworker/u256:3]^C
# 采集io事件数据
$ sudo blktrace -d /dev/sda
^C=== sda ===
CPU 0: 477 events, 23 KiB data
CPU 1: 1089 events, 52 KiB data
CPU 2: 216 events, 11 KiB data
CPU 3: 122 events, 6 KiB data
Total: 1904 events (dropped 0), 90 KiB data
# 数据保存在sda.blktrace.<cpu>中,每个cpu一个文件
$ ls -l *blktrace*
-rw-r--r-- 1 root root 22928 2022-02-12 16:11:37 sda.blktrace.0
-rw-r--r-- 1 root root 52304 2022-02-12 16:11:37 sda.blktrace.1
-rw-r--r-- 1 root root 10408 2022-02-12 16:11:37 sda.blktrace.2
-rw-r--r-- 1 root root 5864 2022-02-12 16:11:37 sda.blktrace.3
# 查看I/O事件
$ blkparse -i sda
8,0 3 1 0.000000000 3064475 A W 367809144 + 8 <- (8,1) 367807096
8,0 3 2 0.000001498 3064475 Q W 367809144 + 8 [kworker/u256:3]
8,0 3 3 0.000013880 3064475 G W 367809144 + 8 [kworker/u256:3]
8,0 3 4 0.000016012 3064475 P N [kworker/u256:3]
8,0 3 5 0.000025289 3064475 A W 367809136 + 8 <- (8,1) 367807088
8,0 3 6 0.000025867 3064475 Q W 367809136 + 8 [kworker/u256:3]
...
Total (sda):
Reads Queued: 14, 220KiB Writes Queued: 299, 1556KiB
Read Dispatches: 14, 220KiB Write Dispatches: 213, 1556KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 14, 220KiB Writes Completed: 213, 1556KiB
Read Merges: 0, 0KiB Write Merges: 86, 348KiB
IO unplugs: 34 Timer unplugs: 39
Throughput (R/W): 4KiB/s / 30KiB/s
Events (sda): 1766 entries
Skips: 0 forward (0 - 0.0%)
由于I/O事件通常很多,一个个的分析并不容易,因此一般使用btt来进行统计分析。
# 先合并为一个文件
$ blkparse -i sda -d sda.blktrace.bin
$ ll sda.blktrace.bin
-rw-rw-r-- 1 work work 90K 2022-02-12 16:12:56 sda.blktrace.bin
$ btt -i sda.blktrace.bin
==================== All Devices ====================
ALL MIN AVG MAX N
--------------- ------------- ------------- ------------- -----------
Q2Q 0.000000410 0.165743313 5.193234517 312
Q2G 0.000000297 0.000169175 0.004261675 227
G2I 0.000000341 0.000023664 0.000284838 225
Q2M 0.000000108 0.000000336 0.000002798 86
I2D 0.000000716 0.000708445 0.003329311 227
M2D 0.000004063 0.000106955 0.000737934 84
D2C 0.000111877 0.000718769 0.004067916 313
Q2C 0.000134521 0.001402772 0.008234969 313
==================== Device Overhead ====================
DEV | Q2G G2I Q2M I2D D2C
---------- | --------- --------- --------- --------- ---------
( 8, 0) | 8.7464% 1.2126% 0.0066% 36.6269% 51.2391%
---------- | --------- --------- --------- --------- ---------
Overall | 8.7464% 1.2126% 0.0066% 36.6269% 51.2391%
==================== Device Merge Information ====================
DEV | #Q #D Ratio | BLKmin BLKavg BLKmax Total
---------- | -------- -------- ------- | -------- -------- -------- --------
( 8, 0) | 313 227 1.4 | 8 15 200 3552
==================== Device Q2Q Seek Information ====================
DEV | NSEEKS MEAN MEDIAN | MODE
---------- | --------------- --------------- --------------- | ---------------
( 8, 0) | 313 24620107.8 0 | 0(94)
---------- | --------------- --------------- --------------- | ---------------
Overall | NSEEKS MEAN MEDIAN | MODE
Average | 313 24620107.8 0 | 0(94)
==================== Device D2D Seek Information ====================
DEV | NSEEKS MEAN MEDIAN | MODE
---------- | --------------- --------------- --------------- | ---------------
( 8, 0) | 227 35300234.0 0 | 0(10) 8
---------- | --------------- --------------- --------------- | ---------------
Overall | NSEEKS MEAN MEDIAN | MODE
Average | 227 35300234.0 0 | 0(10)
...(more)
注意这里面的D2C与Q2C,D2C代表硬盘实际处理时间,不包含在I/O队列中的等待时间,而Q2C代表整个I/O在块层的处理时间。
可以看到上面Q2C(IO处理时间)平均耗时1.402ms,最大8.234ms,D2C(硬盘处理时间)平均耗时0.718ms,最大4.076ms,而旋转磁盘耗时一般在几毫秒到十几毫秒,所以这个磁盘表现正常,但如果D2C经常到100ms以上,则可能磁盘损坏了,需要尽快更换磁盘。
如果要检测磁盘情况,还有一些简便的工具,如ioping或fio等,如下:
$ ioping .
4 KiB <<< . (ext4 /dev/sda1): request=1 time=307.9 us (warmup)
4 KiB <<< . (ext4 /dev/sda1): request=5 time=818.9 us
4 KiB <<< . (ext4 /dev/sda1): request=6 time=381.2 us
4 KiB <<< . (ext4 /dev/sda1): request=7 time=825.4 us (slow)
4 KiB <<< . (ext4 /dev/sda1): request=8 time=791.7 us
4 KiB <<< . (ext4 /dev/sda1): request=11 time=837.8 us (slow)
4 KiB <<< . (ext4 /dev/sda1): request=12 time=815.6 us
...
^C
--- . (ext4 /dev/sda1) ioping statistics ---
23 requests completed in 17.0 ms, 92 KiB read, 1.35 k iops, 5.28 MiB/s
generated 24 requests in 24.0 s, 96 KiB, 1 iops, 4.00 KiB/s
min/avg/max/mdev = 381.2 us / 739.2 us / 842.7 us / 151.3 us
可以发现平均响应时间也在739.2us,和上面blktrace的结果基本一致。
blkiomon也能分析I/O事件,对设备/dev/sda的io监控120秒,每2秒显示一次,如下:
$ sudo blktrace /dev/sda -a issue -a complete -w 120 -o - | blkiomon -I 2 -h -
time: Sat Feb 12 16:37:25 2022
device: 8,0
sizes read (bytes): num 0, min -1, max 0, sum 0, squ 0, avg 0.0, var 0.0
sizes write (bytes): num 2, min 4096, max 24576, sum 28672, squ 620756992, avg 14336.0, var 104857600.0
d2c read (usec): num 0, min -1, max 0, sum 0, squ 0, avg 0.0, var 0.0
d2c write (usec): num 2, min 659, max 731, sum 1390, squ 968642, avg 695.0, var 1296.0
throughput read (bytes/msec): num 0, min -1, max 0, sum 0, squ 0, avg 0.0, var 0.0
throughput write (bytes/msec): num 2, min 6215, max 33619, sum 39834, squ 1168863386, avg 19917.0, var 187744804.0
sizes histogram (bytes):
0: 0 1024: 0 2048: 0 4096: 1
8192: 0 16384: 0 32768: 1 65536: 0
131072: 0 262144: 0 524288: 0 1048576: 0
2097152: 0 4194304: 0 8388608: 0 > 8388608: 0
d2c histogram (usec):
0: 0 8: 0 16: 0 32: 0
64: 0 128: 0 256: 0 512: 0
1024: 2 2048: 0 4096: 0 8192: 0
16384: 0 32768: 0 65536: 0 131072: 0
262144: 0 524288: 0 1048576: 0 2097152: 0
4194304: 0 8388608: 0 16777216: 0 33554432: 0
>33554432: 0
bidirectional requests: 0
另外,发现一大佬基于blkparse写了一个shell脚本来分析I/O事件数据,如下:
========
summary:
========
total number of reads: 1081513
total number of writes: 0
slowest read : 0.032560 second
slowest write: 0.000000 second
reads
> 1ms: 18253
>10ms: 17058
>20ms: 17045
>30ms: 780
writes
> 1ms: 0
>10ms: 0
>20ms: 0
>30ms: 0
block size: Read Count
256: 93756
64: 98084
56: 7475
8: 101218
48: 15889
40: 21693
24: 37787
128: 97382
16: 399850
可以很方便地看到I/O请求耗时分布、I/O请求大小分布。