
Ceph-Fuse Perms heap-buffer-overflow Issue

Running ceph-fuse under Address-Sanitizer shows that every MDS failover triggers a heap-buffer-overflow in the client (the full ASAN output is included below).

Problem Analysis

With debuginfo installed, the ASAN report is fairly complete. It places the bad access in MClientRequest::set_gid_list at src/messages/MClientRequest.h:151 and notes that the address being read is a wild pointer. Here is what MClientRequest::set_gid_list does:

void set_gid_list(int count, const gid_t *gids) {
  gid_list.reserve(count);
  for (int i = 0; i < count; ++i) {
    gid_list.push_back(gids[i]);  // ASAN flags this read (MClientRequest.h:151)
  }
}

The problem is clearly with gids, so the next step is to find where this pointer is passed in:

MClientRequest::ref Client::build_client_request(MetaRequest *request)
{
  ...
  const gid_t *_gids;
  int gid_count = request->perms.get_gids(&_gids);
  req->set_gid_list(gid_count, _gids);
  ...

So the failure happens when the gids are read out of request->perms and set on the MClientRequest. This is what get_gids does:

int get_gids(const gid_t **_gids) const { *_gids = gids; return gid_count; }

Nothing looks wrong here, so the next question is where gids itself comes from. In MetaRequest:

struct MetaRequest
{
  ...
  void set_caller_perms(const UserPerm& _perms) {
    perms.shallow_copy(_perms);
    head.caller_uid = perms.uid();
    head.caller_gid = perms.gid();
  }
  ...

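Seeing shallow_copy here was already a hint: most likely the UserPerm passed to set_caller_perms had been freed by the time it was needed again, leaving MetaRequest::perms holding a dangling pointer. Tracing the flow confirmed this. shallow_copy only assigns the gids pointer and does not copy the array it points to. Start from the place where perms is set: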

int Client::make_request(MetaRequest* request, const UserPerm& perms,
                         InodeRef* ptarget, bool* pcreated, mds_rank_t use_mds,
                         bufferlist* pdirbl)
{
  ...
  request->set_caller_perms(perms);
  ...
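To make the shallow-copy behaviour concrete, here is a minimal sketch. ToyPerm and its members are hypothetical stand-ins written for illustration, not the real UserPerm class; the only point is that after a shallow copy both objects refer to the same gid array, and only the source owns it.

```cpp
#include <algorithm>
#include <cassert>
#include <sys/types.h>

// Hypothetical stand-in for UserPerm (NOT the real Ceph class).
struct ToyPerm {
  int gid_count = 0;
  gid_t *gids = nullptr;   // supplementary group list
  bool owns_gids = false;  // only the owner frees the array

  ToyPerm() = default;
  ToyPerm(const ToyPerm &) = delete;
  ToyPerm &operator=(const ToyPerm &) = delete;
  ~ToyPerm() { if (owns_gids) delete[] gids; }

  // Take a private copy of `count` gids from `src`.
  void init_gids(const gid_t *src, int count) {
    if (owns_gids) delete[] gids;
    gids = new gid_t[count];
    std::copy_n(src, count, gids);
    gid_count = count;
    owns_gids = true;
  }

  // Copies the pointer only: the array is shared, not duplicated.
  void shallow_copy(const ToyPerm &o) {
    if (owns_gids) delete[] gids;
    gid_count = o.gid_count;
    gids = o.gids;
    owns_gids = false;
  }
};

// Returns true when the copy aliases the source's array.
bool shallow_copy_aliases() {
  ToyPerm src;
  const gid_t groups[] = {10, 20, 30};
  src.init_gids(groups, 3);

  ToyPerm dst;
  dst.shallow_copy(src);
  return dst.gids == src.gids && dst.gid_count == 3;
}
```

As long as the source object outlives the copy this is harmless. Once the source is destroyed first, the copy is left pointing at freed memory.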

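For a MetaRequest, perms always comes from a UserPerm that was constructed (or synthesized) when the operation started, and make_request stores it in MetaRequest::perms. Once the client receives the unsafe reply from the MDS, it treats the request as finished: the caller is woken up inside make_request and returns. The UserPerm constructed earlier is therefore released as soon as the operation ends (see below). If the MDS then fails over, the client resends its unsafe requests during the reconnect phase, which means rebuilding an MClientRequest from the MetaRequest and, in doing so, reading gids that have already been freed.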

static void fuse_ll_getattr(fuse_req_t req, fuse_ino_t ino,
                            struct fuse_file_info *fi)
{
  UserPerm perms(ctx->uid, ctx->gid);
  ...
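The ordering problem described above can be modelled safely. The sketch below is an assumption-laden toy: ToyClient, ToyRequest and handle_op are invented names, and a std::weak_ptr stands in for the raw gids pointer so that the dangling state can be observed without undefined behaviour.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>
#include <vector>

using GidList = std::vector<unsigned int>;

// Hypothetical stand-in for MetaRequest. The weak_ptr plays the role of
// the shallow-copied raw pointer: it does not keep the gids alive.
struct ToyRequest {
  std::weak_ptr<GidList> gids;
  bool got_unsafe_reply = false;
};

// Hypothetical stand-in for the client's table of unsafe requests.
struct ToyClient {
  std::map<uint64_t, ToyRequest> unsafe_requests;
  uint64_t next_tid = 1;

  // Models a FUSE handler calling make_request: the perms object lives
  // only for the duration of this call.
  uint64_t handle_op() {
    auto perms = std::make_shared<GidList>(GidList{10, 20, 30});
    ToyRequest req;
    req.gids = perms;             // "shallow copy": no ownership taken
    req.got_unsafe_reply = true;  // MDS replied, caller is woken up
    const uint64_t tid = next_tid++;
    unsafe_requests.emplace(tid, std::move(req));
    return tid;                   // perms is destroyed here
  }

  // Models the resend after an MDS failover: can the gid list still be read?
  bool can_rebuild(uint64_t tid) const {
    return !unsafe_requests.at(tid).gids.expired();
  }
};
```

In this model can_rebuild returns false for any request whose handler has already returned, which corresponds to the window in which the real client dereferences a freed gids pointer.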

Finally, here is the full Address-Sanitizer output:

=================================================================
==39169==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60200056aa70 at pc 0x7fd0c67c82b8 bp 0x7fd0a45ce410 sp 0x7fd0a45ce400
READ of size 4 at 0x60200056aa70 thread T32 (ms_dispatch)
#0 0x7fd0c67c82b7 in MClientRequest::set_gid_list(int, unsigned int const*) /usr/src/debug/ceph-14.2.5.mh218/src/messages/MClientRequest.h:151
#1 0x7fd0c67c82b7 in Client::build_client_request(MetaRequest*) /usr/src/debug/ceph-14.2.5.mh218/src/client/Client.cc:2654
#2 0x7fd0c67c8b62 in Client::send_request(MetaRequest*, MetaSession*, bool) /usr/src/debug/ceph-14.2.5.mh218/src/client/Client.cc:2578
#3 0x7fd0c685b971 in Client::resend_unsafe_requests(MetaSession*, bool) /usr/src/debug/ceph-14.2.5.mh218/src/client/Client.cc:3301
#4 0x7fd0c6910afd in Client::send_reconnect(MetaSession*) /usr/src/debug/ceph-14.2.5.mh218/src/client/Client.cc:3172
#5 0x7fd0c6932d75 in Client::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&) /usr/src/debug/ceph-14.2.5.mh218/src/client/Client.cc:3125
#6 0x7fd0c69366c7 in Client::ms_dispatch2(boost::intrusive_ptr<Message> const&) /usr/src/debug/ceph-14.2.5.mh218/src/client/Client.cc:2945
#7 0x7fd0bb7c5e0b in Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&) /usr/src/debug/ceph-14.2.5.mh218/src/msg/Messenger.h:694
#8 0x7fd0bb7c5e0b in DispatchQueue::entry() /usr/src/debug/ceph-14.2.5.mh218/src/msg/DispatchQueue.cc:199
#9 0x7fd0bbad5d90 in DispatchQueue::DispatchThread::entry() /usr/src/debug/ceph-14.2.5.mh218/src/msg/DispatchQueue.h:102
#10 0x7fd0b8d10dc4 in start_thread (/lib64/libpthread.so.0+0x7dc4)
#11 0x7fd0b79d628c in clone (/lib64/libc.so.6+0xf628c)
Address 0x60200056aa70 is a wild pointer.
SUMMARY: AddressSanitizer: heap-buffer-overflow /usr/src/debug/ceph-14.2.5.mh218/src/messages/MClientRequest.h:151 in MClientRequest::set_gid_list(int, unsigned int const*)
Shadow bytes around the buggy address:
0x0c04800a54f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c04800a5500: fa fa fd fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c04800a5510: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fd fa
0x0c04800a5520: fa fa fa fa fa fa fa fa fa fa fd fa fa fa fa fa
0x0c04800a5530: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c04800a5540: fa fa fa fa fa fa fa fa fa fa fa fa fa fa[fa]fa
0x0c04800a5550: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c04800a5560: fa fa fa fa fa fa fd fa fa fa fa fa fa fa fa fa
0x0c04800a5570: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c04800a5580: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c04800a5590: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Thread T32 (ms_dispatch) created by T0 here:
#0 0x7fd0c54e0e6f in pthread_create (/lib64/libasan.so.5+0x51e6f)
#1 0x7fd0bb2d7928 in Thread::try_create(unsigned long) /usr/src/debug/ceph-14.2.5.mh218/src/common/Thread.cc:136
#2 0x7fd0bb2d7b46 in Thread::create(char const*, unsigned long) /usr/src/debug/ceph-14.2.5.mh218/src/common/Thread.cc:151
#3 0x7fd0bb7b1fc1 in DispatchQueue::start() /usr/src/debug/ceph-14.2.5.mh218/src/msg/DispatchQueue.cc:233
#4 0x7fd0bbb120a2 in AsyncMessenger::ready() /usr/src/debug/ceph-14.2.5.mh218/src/msg/async/AsyncMessenger.cc:334
#5 0x7fd0c67b4463 in Messenger::add_dispatcher_tail(Dispatcher*) /usr/src/debug/ceph-14.2.5.mh218/src/msg/Messenger.h:400
#6 0x7fd0c67b4463 in StandaloneClient::init() /usr/src/debug/ceph-14.2.5.mh218/src/client/Client.cc:16561
#7 0x7fd0c6739432 in main /usr/src/debug/ceph-14.2.5.mh218/src/ceph_fuse.cc:263
#8 0x7fd0b7901b14 in __libc_start_main (/lib64/libc.so.6+0x21b14)
==39169==ABORTING
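The bracketed shadow byte in the report can be located by hand. The sketch below assumes the default ASAN mapping for x86-64 Linux, shadow = (addr >> 3) + 0x7fff8000; the helper name shadow_address is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed: default ASAN shadow mapping on x86-64 Linux.
constexpr uint64_t kShadowScale = 3;               // 8 app bytes per shadow byte
constexpr uint64_t kShadowOffset = 0x7fff8000ULL;  // default shadow offset

// Map an application address to the address of its shadow byte.
constexpr uint64_t shadow_address(uint64_t app_addr) {
  return (app_addr >> kShadowScale) + kShadowOffset;
}
```

For the faulting address 0x60200056aa70 this gives 0x0c04800a554e, which is the bracketed byte in the row starting at 0x0c04800a5540. That byte is fa (heap left redzone) rather than fd (freed heap region), which is likely why the report is labeled heap-buffer-overflow even though the root cause is a dangling pointer.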

Fix and Follow-up

The fix itself is simple: replace shallow_copy with deep_copy. The more important part is the reasoning that led to it.
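As a rough illustration of why a deep copy is enough, here is a minimal sketch. ToyPerm is again a hypothetical stand-in rather than the real UserPerm, and this is not the code from the actual patch; it only shows that a request holding its own copy of the gid array can be rebuilt after the caller's object is gone.

```cpp
#include <algorithm>
#include <cassert>
#include <sys/types.h>
#include <vector>

// Hypothetical stand-in for UserPerm (NOT the real Ceph class).
struct ToyPerm {
  int gid_count = 0;
  gid_t *gids = nullptr;
  bool owns_gids = false;

  ToyPerm() = default;
  ToyPerm(const ToyPerm &) = delete;
  ToyPerm &operator=(const ToyPerm &) = delete;
  ~ToyPerm() { if (owns_gids) delete[] gids; }

  void init_gids(const gid_t *src, int count) {
    if (owns_gids) delete[] gids;
    gids = new gid_t[count];
    std::copy_n(src, count, gids);
    gid_count = count;
    owns_gids = true;
  }

  // Allocates a private array and copies every gid into it.
  void deep_copy(const ToyPerm &o) {
    if (this == &o) return;
    if (owns_gids) delete[] gids;
    gid_count = o.gid_count;
    gids = gid_count > 0 ? new gid_t[gid_count] : nullptr;
    std::copy_n(o.gids, gid_count, gids);
    owns_gids = gids != nullptr;
  }
};

// Models rebuilding the gid list for a resend after the caller's
// perms object has already been destroyed.
std::vector<gid_t> rebuild_after_caller_returns() {
  ToyPerm request_perms;
  {
    ToyPerm caller;  // lives only in this scope, like a handler's local
    const gid_t groups[] = {10, 20, 30};
    caller.init_gids(groups, 3);
    request_perms.deep_copy(caller);
  }  // caller is destroyed here; request_perms still owns its copy
  return std::vector<gid_t>(request_perms.gids,
                            request_perms.gids + request_perms.gid_count);
}
```

The cost is one extra allocation and copy per request, which is small compared with a metadata round trip to the MDS.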

I submitted PR #51188 to the community, and it has already been approved.