Upstream describes LLDB as a next-generation, high-performance debugger. It is built on top of the LLVM/Clang toolchain and integrates closely with it. At the moment, it primarily supports debugging C, C++ and ObjC code, and there is interest in extending it to more languages.
In February, I started working on LLDB, as contracted by the NetBSD Foundation. So far I've been working on reenabling continuous integration, squashing bugs, improving NetBSD core file support, extending NetBSD's ptrace interface to cover more register types and fix compat32 issues, and fixing watchpoint support. After that, I started working on improving thread support. You can read more about that in my July 2019 report.
I was on vacation in August, and in September I resumed work on LLDB. I started by fixing new regressions in the LLVM suite, then improved my previous patches and continued debugging the test failures and timeouts resulting from them.
LLVM 8 and 9 in NetBSD
Updates to LLVM 8 src branch
I was asked to rebase my llvm8 branch of the NetBSD src tree. I did that, and updated it to LLVM 8.0.1 while at it.
LLVM 9 release
LLVM 9.0.0 final was tagged in September. I did the pre-release testing for it, and discovered that the following tests were hanging:
LLVM :: ExecutionEngine/MCJIT/eh-lg-pic.ll
LLVM :: ExecutionEngine/MCJIT/eh.ll
LLVM :: ExecutionEngine/MCJIT/multi-module-eh-a.ll
LLVM :: ExecutionEngine/OrcMCJIT/eh-lg-pic.ll
LLVM :: ExecutionEngine/OrcMCJIT/eh.ll
LLVM :: ExecutionEngine/OrcMCJIT/multi-module-eh-a.ll
I couldn't reproduce the problem with LLVM trunk, so I focused on looking for the fix instead. I came to the conclusion that the problem had been fixed by adding a missing linked library. I requested a backport in bug 43196, and it was merged in r371042.
I didn't put more effort into figuring out why the lack of this linkage caused issues for us. However, as Lang Hames said on the bug, ‘adding the dependency was the right thing to do’.
LLVM 9 for NetBSD src
Afterwards, I started working on updating my NetBSD src branch for LLVM 9. However, in the middle of that I was informed that Joerg had already finished doing that independently, so I stopped.
Furthermore, I was informed that LLVM 9.0.0 will not make it to src, since it still lacks some fixes (most notably, adding a pass to lower is.constant and objectsize intrinsics). Joerg plans to import some revision of the trunk instead.
Buildbot regressions
Initial regressions
The first problem that needed solving was an LLDB build failure caused by replacing std::once_flag with llvm::once_flag.
I came to the conclusion that the build failed because the call site in LLDB combined std::call_once with llvm::once_flag. The solution was to replace the former with llvm::call_once.
After fixing the build failure, we had a bunch of test failures on buildbot to address. Kamil helped me and tracked one of them down to a new test for stack exhaustion handling. The test author decided that it ‘is only a best-effort mitigation for the case where things have already gone wrong’, and marked it unsupported on NetBSD.
On the plus side, two of the tests previously failing on NetBSD have been fixed upstream. I've un-XFAIL-ed them appropriately. Five new test failures in LLDB were related to those tests being unconditionally skipped before — I've marked them XFAIL pending further investigation in the future.
Another set of issues was caused by enabling -fvisibility=hidden for libc++, which broke builds done with GCC. After being pinged, the author decided to enable it only for builds done using clang.
New issues through September
During September, two new issues arose. The first one was my fault, so I'm going to cover it in the appropriate section below. The second one was a new thread_local test failing. Since it was a newly added test that failed on most of the supported platforms, I've just added NetBSD to the list of failing platforms.
Current buildbot status
After fixing the immediate issues, the buildbot returned to its previous status. The majority of tests pass, with one flaky test repeatedly timing out. Normally, I would skip this specific test in order to have the buildbot report only fresh failures. However, since it is threading-related, I'm waiting to finish my threading update and will reassess afterwards.
Furthermore, I have added --shuffle to the lit arguments in order to randomize the order in which the tests are run. According to upstream, this reduces the chance of load-intensive tests being run simultaneously and therefore causing timeouts.
The buildbot host seems to have started crashing recently. OpenMP tests were causing similar issues in the past, and I'm currently trying to figure out whether they are the culprit again.
__has_feature(leak_sanitizer)
Kamil asked me to implement a feature check for the leak sanitizer being used. The __has_feature(leak_sanitizer) preprocessor macro is complementary to __SANITIZE_LEAK__ used in NetBSD gcc, and is used to avoid reports when leaks are known but the cost of fixing them exceeds the gain.
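As a rough illustration of how such a check might be consumed, here is a minimal sketch that folds both spellings into one switch. The names LEAK_CHECK_ENABLED, leak_check_enabled() and release_at_exit() are hypothetical and exist only for this example; they are not taken from NetBSD or LLVM. The sketch assumes a program that normally leaves a buffer to be reclaimed at process exit and frees it only when a leak checker would otherwise report it.

```c
#include <stdlib.h>

/* __has_feature is a Clang extension; give other compilers a fallback
   so the #if below stays valid everywhere. */
#ifndef __has_feature
#define __has_feature(x) 0
#endif

/* LEAK_CHECK_ENABLED is a hypothetical name used only in this sketch.
   __SANITIZE_LEAK__ is the GCC-side spelling mentioned in the text,
   __has_feature(leak_sanitizer) is the Clang-side one. */
#if defined(__SANITIZE_LEAK__) || __has_feature(leak_sanitizer)
#define LEAK_CHECK_ENABLED 1
#else
#define LEAK_CHECK_ENABLED 0
#endif

/* Returns 1 when the build was detected as leak-checked, 0 otherwise. */
int leak_check_enabled(void)
{
    return LEAK_CHECK_ENABLED;
}

/* Releases a buffer that the program would normally leave to process
   exit. Returns 1 if the buffer was freed, 0 if freeing was skipped. */
int release_at_exit(char *buf)
{
#if LEAK_CHECK_ENABLED
    free(buf);
    return 1;
#else
    (void)buf; /* deliberately left for the OS to reclaim */
    return 0;
#endif
}
```

In a regular build release_at_exit() compiles down to a no-op, so the cleanup cost is only paid in builds where the leak checker is actually watching.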
Progress in threading support
Fixing LLDB bugs
In the course of previous work, I had a patch for threading support in LLDB partially ready. However, the improvements also caused some of the tests to start hanging. The main focus of my recent work was investigating those problems.
The first issue that I discovered was an inconsistency in how 'no signal sent' was expressed. In some places, LLDB used LLDB_INVALID_SIGNAL (-1) to express that, in others it used 0. So far this went unnoticed since the end result in ptrace calls was the same. However, the reworked NetBSD threading support used an explicit PT_SET_SIGINFO which, combined with the wrong signal parameter, wiped the previously queued signal.
I've fixed the C packet handler, then fixed the c, vCont and s handlers to use LLDB_INVALID_SIGNAL correctly. However, I had only tested the fixes with my updated thread support, which caused a regression in the old code. Therefore, I also had to fix LLDB_INVALID_SIGNAL handling in the NetBSD plugin for the time being.
Thread suspend/resume kernel problem
Sadly, further investigation of the hanging tests led me to the conclusion that they are caused by kernel bugs. The first bug I've noticed is that PT_SUSPEND/PT_RESUME do not cause the thread to be resumed correctly. I have written the following reproducer for it:
#include <assert.h>
#include <lwp.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Second thread of the tracee: prints a counter every two seconds. */
void* thread_func(void* foo) {
    int i;
    printf("in thread_func, lwp = %d\n", _lwp_self());
    for (i = 0; i < 100; ++i) {
        printf("t2 %d\n", i);
        sleep(2);
    }
    printf("out thread_func\n");
    return NULL;
}

int main() {
    int ret;
    pid_t pid = fork();
    assert(pid != -1);
    if (pid == 0) {
        /* Child (tracee): request tracing, then run two counting threads. */
        int i;
        pthread_t t2;
        ret = ptrace(PT_TRACE_ME, 0, NULL, 0);
        assert(ret != -1);
        printf("in main, lwp = %d\n", _lwp_self());
        ret = pthread_create(&t2, NULL, thread_func, NULL);
        assert(ret == 0);
        printf("thread started\n");
        for (i = 0; i < 100; ++i) {
            printf("t1 %d\n", i);
            sleep(2);
        }
        ret = pthread_join(t2, NULL);
        assert(ret == 0);
        printf("thread joined\n");
        /* Do not fall through into the tracer code below. */
        return 0;
    }

    /* Parent (tracer): stop the tracee so ptrace requests can be issued. */
    sleep(1);
    ret = kill(pid, SIGSTOP);
    assert(ret == 0);
    printf("stopped\n");
    pid_t waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    printf("wait: %d\n", ret);

    /* Suspend LWP 2 (the second thread), then let the process run. */
    printf("t2 suspend\n");
    ret = ptrace(PT_SUSPEND, pid, NULL, 2);
    assert(ret == 0);
    ret = ptrace(PT_CONTINUE, pid, (void*)1, 0);
    assert(ret == 0);

    /* Stop the tracee again after a few seconds. */
    sleep(3);
    ret = kill(pid, SIGSTOP);
    assert(ret == 0);
    printf("stopped\n");
    waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    printf("wait: %d\n", ret);

    /* Resume LWP 2; "t2" output is expected to reappear after this. */
    printf("t2 resume\n");
    ret = ptrace(PT_RESUME, pid, NULL, 2);
    assert(ret == 0);
    ret = ptrace(PT_CONTINUE, pid, (void*)1, 0);
    assert(ret == 0);

    /* Give both threads time to print, then terminate the tracee. */
    sleep(5);
    ret = kill(pid, SIGTERM);
    assert(ret == 0);
    waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    printf("wait: %d\n", ret);
    return 0;
}
The program should run a two-threaded subprocess, with both threads outputting successive numbers. The second thread should be suspended briefly, then resumed. However, currently it does not resume.
I believe that this is caused by ptrace_startstop() altering process flags without reimplementing the complete logic used by lwp_suspend() and lwp_continue().
I've been able to move forward by calling the latter two functions from ptrace_startstop(). However, Kamil has indicated that he'd like to make those routines use separate bits (to distinguish LWPs stopped by the process from LWPs stopped by the debugger), so I haven't pushed my patch forward.
Multiple thread reporting kernel problem
The second and more important problem is related to how new LWPs are reported to the debugger. Or rather, that they are not reported reliably. When many threads are started by the process in a short time (e.g. in a loop), the debugger receives reports only for some of them.
This can be reproduced using the following program:
#include <assert.h>
#include <lwp.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Each tracee thread reports its LWP id and then idles. */
void* thread_func(void* foo) {
    printf("in thread, lwp = %d\n", _lwp_self());
    sleep(10);
    return NULL;
}

int main() {
    int ret;
    pid_t pid = fork();
    assert(pid != -1);
    if (pid == 0) {
        /* Child (tracee): stop once so the tracer can set the event mask,
           then start ten threads in a tight loop. */
        int i;
        pthread_t t[10];
        ret = ptrace(PT_TRACE_ME, 0, NULL, 0);
        assert(ret != -1);
        printf("in main, lwp = %d\n", _lwp_self());
        raise(SIGSTOP);
        printf("main resumed\n");
        for (i = 0; i < 10; i++) {
            ret = pthread_create(&t[i], NULL, thread_func, NULL);
            assert(ret == 0);
            printf("thread %d started\n", i);
        }
        for (i = 0; i < 10; i++) {
            ret = pthread_join(t[i], NULL);
            assert(ret == 0);
            printf("thread %d joined\n", i);
        }
        return 0;
    }

    /* Parent (tracer): wait for the initial SIGSTOP. */
    pid_t waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    printf("wait: %d\n", ret);
    assert(WSTOPSIG(ret) == SIGSTOP);

    /* Ask to be notified about LWP creation and exit. */
    struct ptrace_event ev;
    ev.pe_set_event = PTRACE_LWP_CREATE | PTRACE_LWP_EXIT;
    ret = ptrace(PT_SET_EVENT_MASK, pid, &ev, sizeof(ev));
    assert(ret == 0);
    ret = ptrace(PT_CONTINUE, pid, (void*)1, 0);
    assert(ret == 0);

    /* Print every SIGTRAP event until the tracee exits. */
    while (1) {
        waited = waitpid(pid, &ret, 0);
        assert(waited == pid);
        printf("wait: %d\n", ret);
        if (WIFSTOPPED(ret)) {
            assert(WSTOPSIG(ret) == SIGTRAP);
            ptrace_siginfo_t info;
            ret = ptrace(PT_GET_SIGINFO, pid, &info, sizeof(info));
            assert(ret == 0);
            struct ptrace_state pst;
            ret = ptrace(PT_GET_PROCESS_STATE, pid, &pst, sizeof(pst));
            assert(ret == 0);
            printf("SIGTRAP: si_code = %d, ev = %d, lwp = %d\n",
                   info.psi_siginfo.si_code, pst.pe_report_event, pst.pe_lwp);
            ret = ptrace(PT_CONTINUE, pid, (void*)1, 0);
            assert(ret == 0);
        } else
            break;
    }
    return 0;
}
The program starts 10 threads, and the debugger should report 10 SIGTRAP events for LWPs being started (ev = 8) and the same number for LWPs exiting (ev = 16). However, initially I was getting as many as 4 SIGTRAPs, and the remaining 6 threads went unnoticed.
The issue is that do_lwp_create() does not raise SIGTRAP directly but defers that to mi_startlwp(), which is called asynchronously as the LWP starts. This means that the former function can return before SIGTRAP is emitted, and the program can start another LWP. Since signals are not properly queued, multiple SIGTRAPs can end up being issued simultaneously and lost.
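The general effect of unqueued signals can be observed from userland, too. The following is only a loose analogy, not the kernel code path described above: it raises the same standard signal several times while it is blocked and counts how often the handler runs once it is unblocked. The function name count_coalesced_deliveries() is hypothetical. POSIX leaves it to the implementation whether such signals are queued; on systems that do not queue them, the handler runs once no matter how many times the signal was raised.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t deliveries = 0;

/* Handler: only counts how many times the signal was delivered. */
static void count_delivery(int signo)
{
    (void)signo;
    deliveries++;
}

/* Raises SIGUSR1 `times` times while it is blocked, unblocks it, and
   returns the number of handler invocations (or -1 on setup failure).
   Intended for a single-threaded caller. */
int count_coalesced_deliveries(int times)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask;
    int i;

    sa.sa_handler = count_delivery;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, &old_sa) != 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old_mask) != 0)
        return -1;

    deliveries = 0;
    for (i = 0; i < times; i++)
        raise(SIGUSR1); /* stays pending while the signal is blocked */

    /* Unblocking delivers whatever is pending before the call returns. */
    sigprocmask(SIG_UNBLOCK, &block, NULL);

    /* Restore the caller's signal mask and disposition. */
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGUSR1, &old_sa, NULL);
    return (int)deliveries;
}
```

A caller that compares the return value with the number of raised signals sees how many notifications were merged, which is the same kind of accounting the tracer above does when it counts SIGTRAPs against started threads.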
Kamil has already worked on making simultaneous signal delivery more reliable. However, he reverted his commit as it caused regressions. Nevertheless, applying it made it possible for the test program to get all SIGTRAPs at least most of the time.
The ‘repeated’ SIGTRAPs did not include correct LWP information, though. Kamil has recently fixed that by moving the relevant data from process information to signal information struct. Combined with his earlier patch, this makes my test program pass most of the time (sadly, there seem to be some more race conditions involved).
Summary of threading work
My current work-in-progress patch can be found on Differential as D64647. However, it is currently unsuitable for merging as some tests start failing or hanging as a side effect of the changes. I'd like to get as many of them fixed as possible before pushing the changes to trunk, in order to avoid breaking the buildbot.
The status with the current set of Kamil's work-in-progress patches applied to the kernel includes approximately 4 failing tests and 10 hanging tests.
Other LLVM news
Manikishan Ghantasala has been working on NetBSD-specific clang-format improvements in this year's Google Summer of Code. He is continuing to work on clang-format, and has recently been given commit access to the LLVM project!
Besides NetBSD-specific work, I've been trying to improve a few other areas of LLVM. I've been working on fixing regressions in stand-alone build support and regressions in support for BUILD_SHARED_LIBS=ON builds. I have to admit that while a year ago I was the only person fixing those issues, nowadays I see more contributors submitting patches for breakages specific to those builds.
I have recently worked on fixing bad assumptions in LLDB's Python support. However, it seems that Haibo Huang has taken it from me and is doing a great job.
My most recent endeavor was fixing LLVM_DISTRIBUTION_COMPONENTS support in LLVM projects. This is going to make it possible to precisely fine-tune which components are installed, both in combined tree and stand-alone builds.
Future plans
My first goal right now is to ascertain what is causing the test host to crash, and restore buildbot stability. Afterwards, I'd like to continue investigating threading problems and provide more reproducers for any kernel issues we may be having. Once this is done, I'd like to finally push my LLDB patch.
Since threading is not the only goal left in the TODO, I may switch between working on it and on the remaining TODO items. Those are:
- Add support to backtrace through signal trampoline and extend the support to libexecinfo, unwind implementations (LLVM, nongnu). Examine adding CFI support to interfaces that need it to provide more stable backtraces (both kernel and userland).
- Add support for i386 and aarch64 targets.
- Stabilize LLDB and address breaking tests from the test suite.
- Merge LLDB with the base system (under LLVM-style distribution).
This work is sponsored by The NetBSD Foundation
The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:
Threading support
Threads and synchronization in the kernel are, in general, an evergreen task for kernel developers. The process of enhancing support for tracing multiple threads has been documented by Michal Gorny in his LLDB entry Threading support in LLDB continued.
Overall I have introduced these changes:
- Separate the flag for suspend from userland (_lwp_suspend(2)) from the flag for suspend by a debugger (PT_SUSPEND). This removes one of the underlying problems of threading stability, as a debuggee was able to accidentally resume a suspended thread. This property is needed whenever we want to trace a selection of threads (typically a single one).
- Store SIGTRAP event information inside siginfo_t, rather than in struct proc. Only a single signal can be reported to the debugger at a time, and its context is no longer prone to being overwritten by concurrent threads.
- Introduce restarts in the functions that notify debuggers about events. There was a time window between a thread registering an event, stopping the process, and unlocking the mutexes of the process, in which another process could take the mutexes before being stopped and overwrite the event with its own data. Now each event routine for the debugger checks whether the process is already stopping (or demising, or no longer being tracked), preserves the signal to be emitted in a local variable on the stack of the LWP, and continues stopping itself as requested by the other LWP. Once the thread is awakened, it retries emitting the signal and delivering the event to the debugger.
- Introduce PT_STOP, which combines kill(SIGSTOP) and ptrace(PT_CONTINUE,SIGSTOP) semantics in a single call.
It works like:
- kill(SIGSTOP) for unstopped tracee
- ptrace(PT_CONTINUE,SIGSTOP) for stopped tracee
For a stopped tracee, kill(SIGSTOP) has no effect. PT_CONTINUE+SIGSTOP cannot be used on an unstopped process (EBUSY).
This operation is modeled after PT_KILL, which plays the same role for SIGKILL. While there, allow PT_KILL on an unstopped traced child.
This operation is useful in an abnormal exit of a debugger from a signal handler, usually followed by waitpid(2) and ptrace(PT_DETACH).
To track the missing-in-action signals emitted by the tracee, I have introduced a feature in NetBSD truss (part of the picotrace repository) that registers syscall entry (SCE) and syscall exit (SCX) calls and tracks the SCE/SCX events that were never delivered. Unfortunately, the number of missing events was huge, even for simple 2-threaded applications.
truss[2585] running for 22.205305922 seconds
truss[2585] attached to child=759 ('firefox') for 22.204289369 seconds
syscall                  seconds   calls  errors missed-sce missed-scx
read                 0.048522952     609       0         54         76
write                0.044693735     487       0         35         66
open                 0.002516815      18       0          5          5
close                0.001015263      17       0          9          6
unlink               0.001375463      13       0          3          0
getpid               0.093458089    1993       0         16         56
geteuid              0.000049301       1       0          0          1
recvmsg              0.343353019    4828    3685         90        112
access               0.001450653      12       3          5          4
dup                  0.000570904      10       0          0          1
munmap               0.010375949      88       0          6          3
mprotect             0.196781932    2251       0         11         62
madvise              0.049820002     430       0         11         18
writev               0.237488362    1507       0         76         67
rename               0.000379918       2       0          1          0
mkdir                0.000283846       2       2          1          2
mmap                 0.033342935     481       0         15         40
lseek                0.003341775      62       0         25         24
ftruncate            0.000507707       9       0          1          0
__sysctl             0.000144506       2       0          0          0
poll                18.694195617    4531       0        106        191
__sigprocmask14      0.001585329      20       0          0          2
getcontext           0.000083238       1       0          0          0
_lwp_create          0.000104646       1       0          0          0
_lwp_self            0.001456718      22       0         24         79
_lwp_unpark          0.035319633     607       0         14         39
_lwp_unpark_all      0.020660377     250       0         38         50
_lwp_setname         0.000118418       2       0          0          0
__select50          15.125525493     637       0         82        125
__gettimeofday50     3.279021049    2930       0         40        135
__clock_gettime50   10.673311747   33132       0       1418       3003
__stat50             0.006375356      52       3         12          5
__fstat50            0.001490944      17       0          3          2
__lstat50            0.000110906       1       0          1          0
__getrusage50        0.008863815     109       0          7          1
___lwp_park60       62.720893458     964     251        454        453
                   ------------- ------- -------    -------    -------
                   111.638589870   56098    3944       2563       4628
With my kernel changes landed, the number of missed sce/scx events is down to zero (with exceptions to signals that e.g. never return such as the exit(2) call).
Once these changes settle in HEAD, I plan to backport them to NetBSD-9. I have already received feedback that GDB works much better now.
The kernel also has now more runtime asserts that validate correctness of the code paths.
Sanitizers
I've introduced a special preprocessor macro to detect LSan (__SANITIZE_LEAK__) and UBSan (__SANITIZE_UNDEFINED__) in GCC. The patches were submitted upstream to the GCC mailing list, in two patches (LSan + UBSan). Unfortunately, GCC does not see value in feature parity with LLVM and for the time being it will be a local NetBSD specific GCC extension. These macros are now integrated into the NetBSD public system headers, for use by the basesystem software.
The LSan macro is now used inside the LLVM codebase and the ps(1) program is the first user of it. The UBSan macro is now used to disable relaxed alignment on x86. While such code is still functional, it is not clean from undefined behavior as specified by C. This is especially needed in the kernel fuzzing process, as we can reduce noise from less interesting reports.
During the previous month a number of reports from kernel fuzzing were fixed. There is still more to go.
Almost all local patches needed for LSan were merged upstream. The last remaining local patch is scheduled for later as it is very invasive for all platforms and sanitizers. In the worst case we just have more false negatives in detection of leaks in specific scenarios.
Miscellaneous changes
I have fixed a regression in upstream GDB with SIGTTOU handling. This was an upstream bug fixed by Alan Hayward and cherry-picked by me. As a side effect, a certain environment setup would cause the tracer to sleep.
I have reverted the regression in changed in6_addr change. It appeased UBSan, but broke at least qemu networking. The regression was tracked down by Andreas Gustafsson and reported in the NetBSD's bug tracking system.
I have landed a patch that returns ELF loader dl_phdr_info information for dl_iterate_phdr(3). This synchronized the behavior with Linux, FreeBSD and OpenBSD and is used by sanitizers.
I have passed through core@ the patch to change the kevent::udata type from intptr_t to void*. The former is slightly more pedantic, but the latter is what is in all other kevent users and this mismatch of types affected specifically C++ users that needed special NetBSD-only workarounds.
I have marked git and hg meta files as ones to be ignored by cvs import. This was causing problems among people repackaging the NetBSD source code with other VCS software than CVS.
I keep working on getting GDB test-suite to run on NetBSD, I spent some time on getting fluent in the TCL programming language (as GDB uses dejagnu and TCL scripting). I have already fixed two bugs that affected NetBSD users in the TCL runtime: getaddrbyname_r and gethostbyaddr_r were falsely reported as available and picked on NetBSD, causing damage in operation. Fluency in TCL will allow me to be more efficient in addressing and debugging failing tests in GDB and likely reuse this knowledge in other fields useful for the project.
I made __CTASSERT a static assert again. Previously, this homegrown check for compile-time checks silently stopped working for C99 compilers supporting VLA (variable length array). It was caught by kUBSan that detected VLA of dynamic size of -1, that is still compatible but has unspecified runtime semantics. The new form is inspired by the Perl ctassert code and uses bit-field constant that enforces the assert to be effective again. Few misuses __CTASSERT, mostly in the Linux DRMKMS code, were fixed.
I have submitted a proposal to the C Working Group a proposal to add new methods for setting and getting the thread name.
Plan for the next milestone
Keep stabilizing the reliability debugging interfaces and get ATF and LLDB threading code reliably pass tests. Cover more scenarios with ptrace(2) in the ATF regression test-suite.
This work was sponsored by The NetBSD Foundation.
The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:
Threading support
Threads and synchronization in the kernel are, in general, an evergreen task for kernel developers. The process of enhancing support for tracing multiple threads has been documented by Michal Gorny in his LLDB entry Threading support in LLDB continued.
Overall I have introduced these changes:
- Separate the flag for suspension requested from userland (_lwp_suspend(2)) from the flag for suspension requested by a debugger (PT_SUSPEND). This removes one of the underlying threading stability problems, as a debuggee was able to accidentally resume a suspended thread. This property is needed whenever we want to trace a selection of threads (typically a single one).
- Store SIGTRAP event information inside siginfo_t, rather than in struct proc. Only a single signal can be reported to the debugger at a time, and its context is no longer prone to being overwritten by concurrent threads.
- Introduce restarts in the functions that notify debuggers about events. There was a time window between a thread registering an event, stopping the process, and unlocking the mutexes of the process, during which another process could take the mutexes before being stopped and overwrite the event with its own data. Now each debugger event routine checks whether the process is already stopping (or exiting, or no longer being traced), preserves the signal to be emitted in a local variable on the stack of the LWP, and continues stopping itself as requested by the other LWP. Once the thread is awakened, it retries emitting the signal and delivering the event to the debugger.
- Introduce PT_STOP, which combines the semantics of kill(SIGSTOP) and ptrace(PT_CONTINUE,SIGSTOP) in a single call.
It behaves like:
- kill(SIGSTOP) for an unstopped tracee
- ptrace(PT_CONTINUE,SIGSTOP) for a stopped tracee
For a stopped tracee, kill(SIGSTOP) has no effect, and PT_CONTINUE with SIGSTOP cannot be used on an unstopped process (it fails with EBUSY).
This operation is modeled after PT_KILL, which plays the same role for SIGKILL. While there, I allowed PT_KILL on an unstopped traced child.
This operation is useful when a debugger exits abnormally from a signal handler. It is usually followed by waitpid(2) and ptrace(PT_DETACH).
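The PT_STOP rules described above can be summarized in a small model. This is only an illustrative sketch in plain C, not kernel code: the enum, struct and function names are hypothetical, and the model merely encodes the behavior stated in the text without making any ptrace(2) calls.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical names, used only for this illustration. */
enum stop_request {
    REQ_KILL_SIGSTOP,        /* kill(pid, SIGSTOP) */
    REQ_PT_CONTINUE_SIGSTOP, /* ptrace(PT_CONTINUE, ..., SIGSTOP) */
    REQ_PT_STOP              /* ptrace(PT_STOP, ...) */
};

struct stop_outcome {
    int error;             /* 0 on success, otherwise an errno value */
    bool sigstop_reported; /* true if the request leads to a SIGSTOP stop */
};

/* Model the outcome of each request for a running or stopped tracee. */
struct stop_outcome
model_stop_request(enum stop_request req, bool tracee_stopped)
{
    struct stop_outcome out = { 0, false };

    switch (req) {
    case REQ_KILL_SIGSTOP:
        /* Has no effect on a tracee that is already stopped. */
        out.sigstop_reported = !tracee_stopped;
        break;
    case REQ_PT_CONTINUE_SIGSTOP:
        /* Only valid for a stopped tracee; EBUSY otherwise. */
        if (tracee_stopped)
            out.sigstop_reported = true;
        else
            out.error = EBUSY;
        break;
    case REQ_PT_STOP:
        /* Works in both states: picks the applicable behavior. */
        out.sigstop_reported = true;
        break;
    }
    return out;
}
```

In this model, REQ_PT_STOP is the only request that succeeds and reports a stop regardless of the tracee state, which is the property that makes it convenient in an abnormal debugger exit path.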
To track signals emitted by the tracee that went missing in action, I have introduced a feature in NetBSD truss (part of the picotrace repository) that registers syscall entry (SCE) and syscall exit (SCX) events and tracks the SCE/SCX events that were never delivered. Unfortunately, the number of missing events was huge, even for simple 2-threaded applications.
truss[2585] running for 22.205305922 seconds
truss[2585] attached to child=759 ('firefox') for 22.204289369 seconds
syscall                  seconds   calls  errors missed-sce missed-scx
read                 0.048522952     609       0         54         76
write                0.044693735     487       0         35         66
open                 0.002516815      18       0          5          5
close                0.001015263      17       0          9          6
unlink               0.001375463      13       0          3          0
getpid               0.093458089    1993       0         16         56
geteuid              0.000049301       1       0          0          1
recvmsg              0.343353019    4828    3685         90        112
access               0.001450653      12       3          5          4
dup                  0.000570904      10       0          0          1
munmap               0.010375949      88       0          6          3
mprotect             0.196781932    2251       0         11         62
madvise              0.049820002     430       0         11         18
writev               0.237488362    1507       0         76         67
rename               0.000379918       2       0          1          0
mkdir                0.000283846       2       2          1          2
mmap                 0.033342935     481       0         15         40
lseek                0.003341775      62       0         25         24
ftruncate            0.000507707       9       0          1          0
__sysctl             0.000144506       2       0          0          0
poll                18.694195617    4531       0        106        191
__sigprocmask14      0.001585329      20       0          0          2
getcontext           0.000083238       1       0          0          0
_lwp_create          0.000104646       1       0          0          0
_lwp_self            0.001456718      22       0         24         79
_lwp_unpark          0.035319633     607       0         14         39
_lwp_unpark_all      0.020660377     250       0         38         50
_lwp_setname         0.000118418       2       0          0          0
__select50          15.125525493     637       0         82        125
__gettimeofday50     3.279021049    2930       0         40        135
__clock_gettime50   10.673311747   33132       0       1418       3003
__stat50             0.006375356      52       3         12          5
__fstat50            0.001490944      17       0          3          2
__lstat50            0.000110906       1       0          1          0
__getrusage50        0.008863815     109       0          7          1
___lwp_park60       62.720893458     964     251        454        453
                   ------------- ------- -------    -------    -------
                   111.638589870   56098    3944       2563       4628
With my kernel changes landed, the number of missed SCE/SCX events is down to zero (except for events of calls that never return, such as exit(2)).
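One way a tracer can account for missed SCE/SCX events is to expect strict alternation of entry and exit events per thread. The sketch below illustrates that general technique under this assumption. It is not the actual truss implementation, and all names in it are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical event kinds reported by the tracer loop for one LWP. */
enum sc_event { SC_ENTRY, SC_EXIT };

/* Per-thread bookkeeping; one instance per traced LWP. */
struct sc_tracker {
    bool in_syscall;          /* an entry was seen and its exit is pending */
    unsigned long missed_sce; /* exits that arrived without an entry */
    unsigned long missed_scx; /* entries that arrived while an exit was pending */
};

void
sc_tracker_init(struct sc_tracker *t)
{
    t->in_syscall = false;
    t->missed_sce = 0;
    t->missed_scx = 0;
}

/* Feed one observed event and update the missed-event counters. */
void
sc_tracker_report(struct sc_tracker *t, enum sc_event ev)
{
    if (ev == SC_ENTRY) {
        if (t->in_syscall)
            t->missed_scx++; /* the previous exit was never delivered */
        t->in_syscall = true;
    } else {
        if (!t->in_syscall)
            t->missed_sce++; /* the matching entry was never delivered */
        t->in_syscall = false;
    }
}
```

For example, the sequence entry, exit, exit, entry, entry leaves both counters at 1. A real tracer would additionally have to special-case calls that never return.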
Once these changes settle in HEAD, I plan to backport them to NetBSD-9. I have already received feedback that GDB works much better now.
The kernel now also has more runtime assertions that validate the correctness of the code paths.
Sanitizers
I've introduced special preprocessor macros to detect LSan (__SANITIZE_LEAK__) and UBSan (__SANITIZE_UNDEFINED__) in GCC. They were submitted upstream to the GCC mailing list as two patches (LSan + UBSan). Unfortunately, GCC does not see value in feature parity with LLVM, so for the time being they will remain a local NetBSD-specific GCC extension. These macros are now integrated into the NetBSD public system headers, for use by the base system software.
The LSan macro is now used inside the LLVM codebase, and the ps(1) program is its first user. The UBSan macro is now used to disable relaxed alignment on x86. While code relying on relaxed alignment is still functional, it is not free of undefined behavior as specified by C. This is especially needed in the kernel fuzzing process, as it reduces noise from less interesting reports.
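As a rough illustration of how such macros can be consumed, the sketch below frees memory before exit only when leak detection is compiled in. The macro names __SANITIZE_LEAK__ and __SANITIZE_UNDEFINED__ come from the text above, while the helper names and the BUILT_WITH_* wrappers are hypothetical. On compilers that do not define the macros, the code simply takes the non-sanitizer branch.

```c
#include <stdlib.h>
#include <string.h>

/* Map the (NetBSD-local) GCC macros to plain 0/1 constants. */
#if defined(__SANITIZE_LEAK__)
#define BUILT_WITH_LSAN 1
#else
#define BUILT_WITH_LSAN 0
#endif

#if defined(__SANITIZE_UNDEFINED__)
#define BUILT_WITH_UBSAN 1
#else
#define BUILT_WITH_UBSAN 0
#endif

int built_with_lsan(void)  { return BUILT_WITH_LSAN; }
int built_with_ubsan(void) { return BUILT_WITH_UBSAN; }

/* Hypothetical helper: heap-allocate a copy of a string. */
char *
copy_name(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);

    if (p != NULL)
        memcpy(p, s, n);
    return p;
}

/*
 * Hypothetical helper for a short-lived program: skip the cleanup in
 * regular builds (the process is about to exit anyway), but free the
 * memory under LSan so that it is not reported as a leak.
 */
void
release_before_exit(char *p)
{
#if BUILT_WITH_LSAN
    free(p);
#else
    (void)p;
#endif
}
```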
During the previous month, a number of reports from kernel fuzzing were fixed. There are still more to go.
Almost all local patches needed for LSan have been merged upstream. The last remaining local patch is scheduled for later, as it is very invasive for all platforms and sanitizers. In the worst case, we just have more false negatives in leak detection in specific scenarios.
Miscellaneous changes
I have fixed a regression in upstream GDB related to SIGTTOU handling. This was an upstream bug fixed by Alan Hayward and cherry-picked by me. As a side effect of the bug, a certain environment setup would cause the tracer to sleep.
I have reverted the in6_addr change that introduced a regression. The change appeased UBSan, but broke at least qemu networking. The regression was tracked down by Andreas Gustafsson and reported in the NetBSD bug tracking system.
I have landed a patch that returns ELF loader dl_phdr_info information for dl_iterate_phdr(3). This synchronizes the behavior with Linux, FreeBSD and OpenBSD, and it is used by sanitizers.
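For readers unfamiliar with the interface, here is a minimal sketch of a dl_iterate_phdr(3) consumer that walks the loaded ELF objects and prints a few dl_phdr_info fields. The function names are hypothetical, and the _GNU_SOURCE definition is only an assumption needed for building the same file on glibc systems.

```c
#define _GNU_SOURCE /* exposes dl_iterate_phdr on glibc; must precede includes */
#include <link.h>
#include <stddef.h>
#include <stdio.h>

/* Callback invoked once per loaded object; returning 0 continues the walk. */
static int
print_object(struct dl_phdr_info *info, size_t size, void *data)
{
    int *count = data;
    const char *name = info->dlpi_name;

    (void)size;
    if (name == NULL || name[0] == '\0')
        name = "(unnamed object)";
    printf("%s: base 0x%llx, %u program headers\n", name,
        (unsigned long long)info->dlpi_addr, (unsigned)info->dlpi_phnum);
    (*count)++;
    return 0;
}

/* Print every loaded object and return how many were visited. */
int
print_loaded_objects(void)
{
    int count = 0;

    dl_iterate_phdr(print_object, &count);
    return count;
}
```

Sanitizer runtimes use the same kind of walk to learn where the program and its shared libraries are mapped.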
I have passed through core@ the patch to change the kevent::udata type from intptr_t to void*. The former is slightly more pedantic, but the latter is what all other kevent users have, and this type mismatch specifically affected C++ users, who needed special NetBSD-only workarounds.
I have marked git and hg metadata files as ones to be ignored by cvs import. They were causing problems for people repackaging the NetBSD source code with VCS software other than CVS.
I keep working on getting the GDB test suite to run on NetBSD. I spent some time getting fluent in the TCL programming language (as GDB uses dejagnu and TCL scripting). I have already fixed two bugs in the TCL runtime that affected NetBSD users: getaddrbyname_r and gethostbyaddr_r were falsely reported as available and picked on NetBSD, causing malfunctions at run time. Fluency in TCL will allow me to be more efficient in addressing and debugging failing GDB tests, and I will likely reuse this knowledge in other fields useful for the project.
I made __CTASSERT a static assert again. Previously, this homegrown compile-time check silently stopped working for C99 compilers supporting VLAs (variable length arrays). It was caught by kUBSan, which detected a VLA with a dynamic size of -1, which still compiles but has unspecified runtime semantics. The new form is inspired by the Perl ctassert code and uses a bit-field constant that makes the assert effective again. A few misuses of __CTASSERT, mostly in the Linux DRMKMS code, were fixed.
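The difference between the two idioms can be sketched as follows. This assumes the older form was the common negative-array-size trick. The macro names are hypothetical and the real __CTASSERT definition in the NetBSD headers differs in detail.

```c
/* Hypothetical helper macros for building unique identifiers. */
#define CT_CONCAT_(a, b) a##b
#define CT_CONCAT(a, b)  CT_CONCAT_(a, b)

/*
 * Array-based idiom, shown for comparison only and not used below.
 * If cond is not an integer constant expression, a C99 compiler can
 * treat the array as a VLA at block scope and accept the declaration,
 * so a false condition is no longer diagnosed at compile time.
 */
#define ARRAY_CTASSERT(cond) \
    typedef char CT_CONCAT(array_ctassert_, __LINE__)[(cond) ? 1 : -1]

/*
 * Bit-field idiom. The width of a bit-field has to be a non-negative
 * integer constant expression, so a false or non-constant cond is
 * rejected by the compiler.
 */
#define BITFIELD_CTASSERT(cond) \
    struct CT_CONCAT(bitfield_ctassert_, __LINE__) { \
        unsigned int check : (cond) ? 1 : -1; \
    }

/* Usable at file scope... */
BITFIELD_CTASSERT(sizeof(char) == 1);

/* ...and at block scope. */
int
ctassert_demo(void)
{
    BITFIELD_CTASSERT(sizeof(long) >= sizeof(int));
    return 0;
}
```

Changing either condition to a false one, for instance sizeof(char) == 2, should turn the declaration into a compile-time error about a negative bit-field width.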
I have submitted a proposal to the C Working Group to add new functions for setting and getting the thread name.
Plan for the next milestone
Keep improving the reliability of the debugging interfaces and get the ATF and LLDB threading code to pass tests reliably. Cover more scenarios with ptrace(2) in the ATF regression test suite.
This work was sponsored by The NetBSD Foundation.
The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can: