Discussion:
[asterisk-dev] Deadlock between transaction and transport
Salahuddin Ahmed
2018-07-12 22:01:29 UTC
Permalink
Hello,

Recently I have experienced a deadlock situation in Asterisk Version
13.20.0. According to our gdb analysis there have 3 suspicious thread(3, 11
and 44). Thread 44 push a task to taskpool, thread 11 execute that task and
the thread 3 is a timer event. In thread 11 during a response going to
transmit a group lock has been acquired. and this thread was locked on this
point. In Thread 3 while retransmission timer triggered same group lock
again acquired and try to retransmit response. In this situation asterisk
was blocked.

Let me explain according to GDB log,

Thread-03:
#4 0x00007f00ba6d9db0 in tsx_send_msg (tsx=0x7eff8bdd4568,
tdata=0x7eff892ae668) at ../src/pjsip/sip_transaction.c:2122

Thread-11:
#6 0x00007f00ba6dc4ac in pjsip_tsx_send_msg (tsx=***@entry=0x7eff8bdd4568,
tdata=***@entry=0x7eff892ae668) at ../src/pjsip/sip_transaction.c:1783
(gdb) p tsx->transport_flag
$2 = 1(TSX_HAS_PENDING_TRANSPORT)

Here Thread-03 is execute first and try to retransmit a message before
acquire a tsx group lock in Thread-11, because in thread 3
tsx->transport_flag |= TSX_HAS_PENDING_TRANSPORT has been set, Now if we
read this flag from thread 11 in frame 04 then we have found this flag was
set by someone before acquiring tsx group lock. That means during this
acquiring lock in Thread3, Thread 11 trying to send data on transport. So
this might be the case of deadlock, somehow timer thread 3 execute before
thread-11.

Here I enclose Thread-3 and Thread-11 gdb backtrace.

Could anyone please help me on this?


Thread-11:
#0 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f00f62bf494 in _L_lock_952 () from
/lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f00f62bf2f6 in __GI___pthread_mutex_lock (mutex=0x7eff41197c58)
at ../nptl/pthread_mutex_lock.c:114
#3 0x00007f00b97e49a5 in pj_mutex_lock (mutex=0x7eff41197c58) at
../src/pj/os_core_unix.c:1265
#4 0x00007f00b97eace1 in grp_lock_acquire (p=0x7eff41197b48) at
../src/pj/lock.c:290
#5 0x00007f00b97ead22 in pj_grp_lock_acquire (grp_lock=<optimized out>) at
../src/pj/lock.c:478
#6 0x00007f00ba6dc4ac in pjsip_tsx_send_msg (tsx=***@entry=0x7eff8bdd4568,
tdata=***@entry=0x7eff892ae668) at ../src/pjsip/sip_transaction.c:1783
#7 0x00007f00ba6de952 in pjsip_dlg_send_response (dlg=<optimized out>,
tsx=0x7eff8bdd4568, tdata=***@entry=0x7eff892ae668) at
../src/pjsip/sip_dialog.c:1531
#8 0x00007f00bab1c09a in pjsip_inv_send_msg (inv=0x7eff88bdf9b8,
tdata=***@entry=0x7eff892ae668) at ../src/pjsip-ua/sip_inv.c:3231
#9 0x00007f0092874b4c in ast_sip_session_send_response
(session=***@entry=0x7eff89367700, tdata=0x7eff892ae668) at
res_pjsip_session.c:983
#10 0x00007f002a431841 in answer (data=0x7eff89367700) at chan_pjsip.c:591
#11 0x00007f0094b7a490 in sync_task (data=0x7efdbcf8b810) at
res_pjsip.c:4268
#12 0x00000000005d52ce in ast_taskprocessor_execute (tps=***@entry=0x267f9b0)
at taskprocessor.c:971
#13 0x00000000005dcc80 in execute_tasks (data=0x267f9b0) at
threadpool.c:1322
#14 0x00000000005d52ce in ast_taskprocessor_execute (tps=0x23e0580) at
taskprocessor.c:971
#15 0x00000000005dd558 in threadpool_execute (pool=<optimized out>) at
threadpool.c:351
#16 worker_active (worker=<optimized out>) at threadpool.c:1105
#17 worker_start (arg=0x7f0080001680) at threadpool.c:1024
#18 0x00000000005e6dfa in dummy_start (data=<optimized out>) at utils.c:1238
#19 0x00007f00f62bd064 in start_thread (arg=0x7eff67c9b700) at
pthread_create.c:309
#20 0x00007f00f55a562d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thread-3:
#0 sanitize_tdata (tdata=<optimized out>) at
res_pjsip/pjsip_message_filter.c:211
#1 filter_on_tx_message (tdata=0x7eff892ae668) at
res_pjsip/pjsip_message_filter.c:233
#2 0x00007f00ba6c5b0e in endpt_on_tx_msg (endpt=<optimized out>,
tdata=0x7eff892ae668) at ../src/pjsip/sip_endpoint.c:1088
#3 0x00007f00ba6cc5c0 in pjsip_transport_send (tr=0x2677718,
tdata=***@entry=0x7eff892ae668, addr=***@entry=0x7eff8bdd4644,
addr_len=16, token=***@entry=0x7eff8bdd4568,
cb=***@entry=0x7f00ba6d91b0 <transport_callback>) at
../src/pjsip/sip_transport.c:833
#4 0x00007f00ba6d9db0 in tsx_send_msg (tsx=0x7eff8bdd4568,
tdata=0x7eff892ae668) at ../src/pjsip/sip_transaction.c:2122
#5 0x00007f00ba6da8da in tsx_on_state_proceeding_uas (tsx=0x7eff8bdd4568,
event=<optimized out>) at ../src/pjsip/sip_transaction.c:2882
#6 0x00007f00ba6d9b86 in tsx_timer_callback (theap=<optimized out>,
entry=0x7eff8bdd46e8) at ../src/pjsip/sip_transaction.c:1189
#7 0x00007f00b97f6757 in pj_timer_heap_poll (ht=0x23e53a0,
next_delay=***@entry=0x7f0090649e50) at ../src/pj/timer.c:644
#8 0x00007f00ba6c65c4 in pjsip_endpt_handle_events2 (endpt=0x23e50b8,
max_timeout=***@entry=0x7f0090649ea0, p_count=***@entry=0x0) at
../src/pjsip/sip_endpoint.c:715
#9 0x00007f00ba6c66d7 in pjsip_endpt_handle_events (endpt=<optimized out>,
max_timeout=***@entry=0x7f0090649ea0) at
../src/pjsip/sip_endpoint.c:776
#10 0x00007f0094b7a118 in monitor_thread_exec (endpt=<optimized out>) at
res_pjsip.c:4345
#11 0x00007f00b97e4ca0 in thread_main (param=0x23e3c68) at
../src/pj/os_core_unix.c:541
#12 0x00007f00f62bd064 in start_thread (arg=0x7f009064a700) at
pthread_create.c:309
#13 0x00007f00f55a562d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:111


Thanks,
Salah Ahmed
Richard Mudgett
2018-07-13 22:13:03 UTC
Permalink
Post by Salahuddin Ahmed
Hello,
Recently I have experienced a deadlock situation in Asterisk Version
13.20.0. According to our gdb analysis there have 3 suspicious thread(3, 11
and 44). Thread 44 push a task to taskpool, thread 11 execute that task and
the thread 3 is a timer event. In thread 11 during a response going to
transmit a group lock has been acquired. and this thread was locked on this
point. In Thread 3 while retransmission timer triggered same group lock
again acquired and try to retransmit response. In this situation asterisk
was blocked.
Let me explain according to GDB log,
#4 0x00007f00ba6d9db0 in tsx_send_msg (tsx=0x7eff8bdd4568,
tdata=0x7eff892ae668) at ../src/pjsip/sip_transaction.c:2122
(gdb) p tsx->transport_flag
$2 = 1(TSX_HAS_PENDING_TRANSPORT)
Here Thread-03 is execute first and try to retransmit a message before
acquire a tsx group lock in Thread-11, because in thread 3
tsx->transport_flag |= TSX_HAS_PENDING_TRANSPORT has been set, Now if we
read this flag from thread 11 in frame 04 then we have found this flag was
set by someone before acquiring tsx group lock. That means during this
acquiring lock in Thread3, Thread 11 trying to send data on transport. So
this might be the case of deadlock, somehow timer thread 3 execute before
thread-11.
Here I enclose Thread-3 and Thread-11 gdb backtrace.
Could anyone please help me on this?
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/
linux/x86_64/lowlevellock.S:135
#1 0x00007f00f62bf494 in _L_lock_952 () from /lib/x86_64-linux-gnu/
libpthread.so.0
#2 0x00007f00f62bf2f6 in __GI___pthread_mutex_lock (mutex=0x7eff41197c58)
at ../nptl/pthread_mutex_lock.c:114
#3 0x00007f00b97e49a5 in pj_mutex_lock (mutex=0x7eff41197c58) at
../src/pj/os_core_unix.c:1265
#4 0x00007f00b97eace1 in grp_lock_acquire (p=0x7eff41197b48) at
../src/pj/lock.c:290
#5 0x00007f00b97ead22 in pj_grp_lock_acquire (grp_lock=<optimized out>)
at ../src/pj/lock.c:478
#7 0x00007f00ba6de952 in pjsip_dlg_send_response (dlg=<optimized out>,
../src/pjsip/sip_dialog.c:1531
#8 0x00007f00bab1c09a in pjsip_inv_send_msg (inv=0x7eff88bdf9b8,
#9 0x00007f0092874b4c in ast_sip_session_send_response
res_pjsip_session.c:983
#10 0x00007f002a431841 in answer (data=0x7eff89367700) at chan_pjsip.c:591
#11 0x00007f0094b7a490 in sync_task (data=0x7efdbcf8b810) at
res_pjsip.c:4268
at taskprocessor.c:971
#13 0x00000000005dcc80 in execute_tasks (data=0x267f9b0) at
threadpool.c:1322
#14 0x00000000005d52ce in ast_taskprocessor_execute (tps=0x23e0580) at
taskprocessor.c:971
#15 0x00000000005dd558 in threadpool_execute (pool=<optimized out>) at
threadpool.c:351
#16 worker_active (worker=<optimized out>) at threadpool.c:1105
#17 worker_start (arg=0x7f0080001680) at threadpool.c:1024
#18 0x00000000005e6dfa in dummy_start (data=<optimized out>) at utils.c:1238
#19 0x00007f00f62bd064 in start_thread (arg=0x7eff67c9b700) at
pthread_create.c:309
#20 0x00007f00f55a562d in clone () at ../sysdeps/unix/sysv/linux/
x86_64/clone.S:111
#0 sanitize_tdata (tdata=<optimized out>) at res_pjsip/pjsip_message_
filter.c:211
#1 filter_on_tx_message (tdata=0x7eff892ae668) at res_pjsip/pjsip_message_
filter.c:233
#2 0x00007f00ba6c5b0e in endpt_on_tx_msg (endpt=<optimized out>,
tdata=0x7eff892ae668) at ../src/pjsip/sip_endpoint.c:1088
#3 0x00007f00ba6cc5c0 in pjsip_transport_send (tr=0x2677718,
../src/pjsip/sip_transport.c:833
#4 0x00007f00ba6d9db0 in tsx_send_msg (tsx=0x7eff8bdd4568,
tdata=0x7eff892ae668) at ../src/pjsip/sip_transaction.c:2122
#5 0x00007f00ba6da8da in tsx_on_state_proceeding_uas (tsx=0x7eff8bdd4568,
event=<optimized out>) at ../src/pjsip/sip_transaction.c:2882
#6 0x00007f00ba6d9b86 in tsx_timer_callback (theap=<optimized out>,
entry=0x7eff8bdd46e8) at ../src/pjsip/sip_transaction.c:1189
#7 0x00007f00b97f6757 in pj_timer_heap_poll (ht=0x23e53a0,
#8 0x00007f00ba6c65c4 in pjsip_endpt_handle_events2 (endpt=0x23e50b8,
at ../src/pjsip/sip_endpoint.c:715
#9 0x00007f00ba6c66d7 in pjsip_endpt_handle_events (endpt=<optimized
../src/pjsip/sip_endpoint.c:776
#10 0x00007f0094b7a118 in monitor_thread_exec (endpt=<optimized out>) at
res_pjsip.c:4345
#11 0x00007f00b97e4ca0 in thread_main (param=0x23e3c68) at
../src/pj/os_core_unix.c:541
#12 0x00007f00f62bd064 in start_thread (arg=0x7f009064a700) at
pthread_create.c:309
#13 0x00007f00f55a562d in clone () at ../sysdeps/unix/sysv/linux/
x86_64/clone.S:111
* The attached backtrace does not match the threads you are highlighting in
your email. In fact
the attached backtrace was done with a non-bundled pjproject without any
symbols.

* Thread 3 above should have acquired the transaction group lock in
tsx_timer_callback() before
calling the state handler routine. From there the pjproject line numbers
do not match the bundled
pjproject version for 13.20.0. Also the thread is not blocked so I don't
see how it as participating
in a deadlock unless you think it is in an endless loop.

You *really* need to be using bundled pjproject [1] for the reasons listed
on that page.

Richard

[1] https://wiki.asterisk.org/wiki/display/AST/PJSIP-pjproject

Loading...