Debugging CircleCI build timeouts

The problem

About a month ago, we noticed unexplained timeouts cropping up in our CircleCI builds.

Without understanding the underlying problem, our only workaround was to frantically hit the rebuild button and hope that the timeout gremlins didn’t strike.

With builds turning red seemingly at random, our ability to merge and deploy code was severely restricted; something had to be done!

Step 1: what is timing out?

Looking at the CircleCI dashboard, we saw that the RSpec phase of the build had stopped in its tracks, ceasing to produce any output.

[Screenshot: the RSpec step in the CircleCI dashboard, hanging with no further output]

The build attempted to move on to the subsequent Cucumber tests, but they ran extremely slowly and timed out after 2 hours.

CircleCI allows you to SSH into the containers running your tests to see exactly what’s going on. Let’s have a look at what processes are running:

ubuntu@box592:~/rails-teespring$ ps -eo pid,rss,args --sort rss | tail -n 3
 17713 269960 ruby /home/ubuntu/rails-teespring/vendor/bundle/ruby/2.2.0/bin/cucumber --format progress …
 17667 343372 ruby /home/ubuntu/rails-teespring/vendor/bundle/ruby/2.2.0/bin/cucumber --format progress …
 15388 626036 ruby /home/ubuntu/rails-teespring/vendor/bundle/ruby/2.2.0/bin/rspec --format progress …
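
For reference, here's what that ps incantation does:

# -e                every process on the box
# -o pid,rss,args   output PID, resident memory (in KB), and the full command line
# --sort rss        sort ascending by memory, so the hungriest processes come last
ps -eo pid,rss,args --sort rss | tail -n 3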

Despite the build having moved on to running Cucumber, we can see there is still an RSpec process running—and using a lot of memory! We inferred that this lingering RSpec process was starving the subsequent Cucumber tests of resources, which led to their slowness and eventual timeout.
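
If you want to sanity-check a memory-starvation theory like this on a live container, a couple of standard tools will do it:

free -m      # overall memory and swap usage, in megabytes
vmstat 1 5   # sample once a second for five seconds; non-zero si/so columns mean the box is swapping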

Step 2: what was RSpec doing when it hung?

Looking at the tests’ log file on the container, we could see the last thing that happened before RSpec hung was an attempt to use Browserify to compile assets:

Started GET "/compiled_assets/v2/application.js" for 127.0.0.1 at 2015-09-05 22:39:19 +0000
Started GET "/compiled_assets/v2/application.css" for 127.0.0.1 at 2015-09-05 22:39:19 +0000
Started GET "/compiled_assets/jquery.liveaddress.js" for 127.0.0.1 at 2015-09-05 22:39:19 +0000
Browserify: /home/ubuntu/rails-teespring/node_modules/.bin/browserifyinc -t reactify --extension …
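
If you're following along over SSH, you can also watch this log live while the tests run; the path below assumes a standard Rails layout:

ubuntu@box592:~/rails-teespring$ tail -n 100 -f log/test.log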

What was it about this asset compilation that put RSpec into a flat spin?

Step 3: what’s happening inside RSpec?

We needed to know more about what RSpec’s process was doing. strace is a fantastic tool for spying on what a process is doing at the system call level:

[Screenshot: strace attached to the RSpec process, showing SIGVTALRM being delivered repeatedly]
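
That capture was taken by attaching strace to the RSpec PID we found earlier; the invocation was along these lines (-tt adds microsecond timestamps, -p attaches to a running process):

ubuntu@box592:~/rails-teespring$ sudo strace -tt -p 15388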

This shows the RSpec process is receiving a SIGVTALRM signal at roughly 100ms intervals, making no forward progress.

Using strace's -f flag to also trace the process's other tasks revealed where the signals were coming from: LWP 15390 (identified in the gdb output below as Ruby's internal timer thread) was repeatedly firing SIGVTALRM at the main thread, 15388:

ubuntu@box592:~/rails-teespring$ sudo strace -fftt -p 15388
…
[pid 15390] 23:12:53.204407 <... poll resumed> ) = 0 (Timeout)
[pid 15390] 23:12:53.204456 tgkill(15388, 15388, SIGVTALRM) = 0
[pid 15390] 23:12:53.204536 tgkill(15388, 15388, SIGVTALRM) = 0
[pid 15390] 23:12:53.204599 poll([{fd=3, events=POLLIN}], 1, 100 <unfinished ...>
[pid 15388] 23:12:53.204651 <... futex resumed> ) = ? ERESTARTSYS (To be restarted)
[pid 15388] 23:12:53.204690 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
[pid 15388] 23:12:53.204835 rt_sigreturn(0x1a) = -1 EINTR (Interrupted system call)
[pid 15388] 23:12:53.204898 futex(0xbd2f64, FUTEX_WAIT_PRIVATE, 8951, NULL <unfinished ...>
…

Multi-threaded activity, combined with the references to futex in the log output, got our spidey sense tingling: could this be a deadlock?
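
To get a feel for what a deadlock looks like from strace's perspective, here's a contrived example (not our code): two threads taking two mutexes in opposite order. Run it under strace -f and you'll see both threads parked in futex() waits, much like the output above, until MRI's deadlock detector aborts the process:

ubuntu@box592:~/rails-teespring$ strace -f ruby -e '
  a, b = Mutex.new, Mutex.new
  t1 = Thread.new { a.synchronize { sleep 0.1; b.synchronize {} } }
  t2 = Thread.new { b.synchronize { sleep 0.1; a.synchronize {} } }
  [t1, t2].each(&:join)  # all threads end up blocked; MRI raises a fatal deadlock error
'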

Step 4: what are RSpec’s threads doing?

To see exactly what the threads are doing, we can use gdb:

ubuntu@box592:~/rails-teespring$ sudo gdb
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04

(gdb) attach 15388
Attaching to process 15388

(gdb) info threads
  Id   Target Id         Frame
  13   Thread 0x7f41aae8b700 (LWP 15390) "ruby-timer-thr" 0x00007f41aa4f312d in poll () from /lib/x86_64-linux-gnu/libc.so.6
  12   Thread 0x7f41aae80700 (LWP 15393) "SignalSender" 0x00007f41aa1f6680 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
  11   Thread 0x7f419d337700 (LWP 15396) "parallel.rb:10" 0x00007f41aa1f4414 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  10   Thread 0x7f419d135700 (LWP 15397) "parallel.rb:10" 0x00007f41aa1f4414 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  9    Thread 0x7f418fcfc700 (LWP 16247) "timeout.rb:80" 0x00007f41aa1f4414 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  8    Thread 0x7f418f4e8700 (LWP 16248) "server.rb:69" 0x00007f41aa1f4414 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  7    Thread 0x7f418ea14700 (LWP 16251) "utils.rb:162" 0x00007f41aa1f4414 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  6    Thread 0x7f418f0e4700 (LWP 16254) "client.rb:55" 0x00007f41aa4f182d in read () from /lib/x86_64-linux-gnu/libc.so.6
  5    Thread 0x7f418f2e6700 (LWP 16267) "server.rb:283" 0x00007f41aa1f4414 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  4    Thread 0x7f418e711700 (LWP 16303) "server.rb:283" 0x00007f41aa1f6f2c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
  3    Thread 0x7f418e812700 (LWP 16304) "server.rb:283" 0x00007f41aa1f4414 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  2    Thread 0x7f418e610700 (LWP 16306) "server.rb:283" 0x00007f41aa1f4414 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
* 1    Thread 0x7f41aae63740 (LWP 15388) "ruby" 0x00007f41aa1f4414 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

Most of the threads are blocked on a condition variable; threads 4 and 12 are the odd ones out. Let's take a look at their stack traces:

(gdb) thread 4
[Switching to thread 4 (Thread 0x7f418e711700 (LWP 16303))]
#0  0x00007f41aa1f6f2c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007f41aa1f6f2c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f41aa1f2672 in _L_lock_953 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f41aa1f24da in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f419f6c54fd in v8::internal::ThreadManager::Lock() () from /home/ubuntu/rails-teespring/vendor/bundle/ruby/2.2.0/gems/therubyracer-0.12.0/lib/v8/init.so
#4  0x00007f419f6c5b0f in v8::Locker::Initialize(v8::Isolate*) () from /home/ubuntu/rails-teespring/vendor/bundle/ruby/2.2.0/gems/therubyracer-0.12.0/lib/v8/init.so
#5  0x00007f419f4b5861 in rr::Locker::setupLockAndCall (state=0x7f418e70961c, code=203531080) at locker.cc:47
…

(gdb) thread 12
[Switching to thread 12 (Thread 0x7f41aae80700 (LWP 15393))]
#0  0x00007f41aa1f6680 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007f41aa1f6680 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f419f74d3ad in v8::internal::LinuxSemaphore::Wait() () from /home/ubuntu/rails-teespring/vendor/bundle/ruby/2.2.0/gems/therubyracer-0.12.0/lib/v8/init.so
#2  0x00007f419f652c9c in v8::internal::RuntimeProfiler::WaitForSomeIsolateToEnterJS() () from /home/ubuntu/rails-teespring/vendor/bundle/ruby/2.2.0/gems/therubyracer-0.12.0/lib/v8/init.so
#3  0x00007f419f74d5b8 in v8::internal::SignalSender::Run() () from /home/ubuntu/rails-teespring/vendor/bundle/ruby/2.2.0/gems/therubyracer-0.12.0/lib/v8/init.so
#4  0x00007f419f74d4be in v8::internal::ThreadEntry(void*) () from /home/ubuntu/rails-teespring/vendor/bundle/ruby/2.2.0/gems/therubyracer-0.12.0/lib/v8/init.so
#5  0x00007f41aa1f0182 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
…
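
Incidentally, you don't have to poke around interactively; gdb can dump every thread's backtrace in one non-interactive shot:

ubuntu@box592:~/rails-teespring$ sudo gdb -p 15388 -batch -ex 'thread apply all bt'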

We have our culprit! therubyracer is a gem for interfacing with V8 from Ruby. There seem to be a few other people having deadlock problems when using it too!

The solution

therubyracer was being used in our asset compilation pipeline; specifically, in our situation it was compiling our LESS files.

Luckily for us, LESS was only being used in a couple of legacy areas of code (which we’d been meaning to clean up anyway). Our solution was to refactor that legacy code to use our new SASS templates, and remove therubyracer entirely.
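
Once the templates were ported, removing the gem was a one-line deletion from the Gemfile. A quick sanity check that nothing in the bundle still depends on it (assuming Bundler):

ubuntu@box592:~/rails-teespring$ bundle show | grep -i therubyracer   # no output means it's really gone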

Since that change, we've been deadlock-free!