Skip to content

Potential infinite loop from multithreaded use of HashMap by way of "skip" HashSet in SaltScanner? #823

@thatsafunnyname

Description

@thatsafunnyname

Hello and thank you for opentsdb,

We are using opentsdb-2.2.0 with

tsd.storage.salt.buckets = 30
tsd.storage.salt.width = 1

and mapr-hadoop-core-2.7.0.37549.GA-1.x86_64, both using java-1.7.0-openjdk-1.7.0.101.x86_64 on RHEL6.6 on 2.6.32-573.12.1.el6.x86_64

We see the opentsdb process occasionally spinning CPU in user time, logs from querying the api/stats/threads HTTP interface show a thread resizing a HashMap at these times. We do not have a reproducer. We have a monit process that detects the high CPU and restarts opentsdb.

I read http://mailinator.blogspot.co.uk/2009/06/beautiful-race-condition.html and was wondering if we are hitting this race, and if this might explain what we see.

I will attach the api/stats/threads log files. Some process stats for opentsdb process (the host has 12 vCPUS):

# TOP PROCESSES sorted by time (counters are /sec) 16:57:00
# PID  User     PR  PPID THRD S   VSZ   RSS CP  SysT   UsrT Pct  AccuTime  RKB  WKB MajF MinF 
96659  opentsdb 20     1  213 S   15G  719M  7  0.01  0.12   0   0:24.21    0    1    0    8

# TOP PROCESSES sorted by time (counters are /sec) 16:58:00
# PID  User     PR  PPID THRD S   VSZ   RSS CP  SysT   UsrT Pct  AccuTime  RKB  WKB MajF MinF 
96659  opentsdb 20     1  213 S   16G    1G  7  1.01 179.33 300   3:24.55    0   30    0 4227

# TOP PROCESSES sorted by time (counters are /sec) 16:59:00
# PID  User     PR  PPID THRD S   VSZ   RSS CP  SysT   UsrT Pct  AccuTime  RKB  WKB MajF MinF 
96659  opentsdb 20     1  213 S   16G    1G  7  0.65 415.73 693  10:20.93    0    1    0    8

# TOP PROCESSES sorted by time (counters are /sec) 17:00:00
# PID  User     PR  PPID THRD S   VSZ   RSS CP  SysT   UsrT Pct  AccuTime  RKB  WKB MajF MinF 
96659  opentsdb 20     1  213 S   16G    1G  7  0.67 416.33 695  17:17.93    0    1    0    9

Thanks in advance for any help you can provide.

( Whilst looking at the stacks I also noticed that opentsdb is using version 1.4.0 of stumbleupon/async (

SUASYNC_VERSION := 1.4.0
) and not version 1.4.1 which contains a fix for OpenTSDB/async#4 ( OpenTSDB/async@3552d39 ) I was curious why 1.4.1 is not being used. )

Thread seen resizing HashMap

3dow_16h_58m.log-        "interrupted": false,
3dow_16h_58m.log-        "name": "pool-6-thread-48",
3dow_16h_58m.log-        "priority": 5,
3dow_16h_58m.log-        "stack": [
3dow_16h_58m.log-            "java.util.HashMap.transfer(HashMap.java:601)",
3dow_16h_58m.log:            "java.util.HashMap.resize(HashMap.java:581)",
3dow_16h_58m.log-            "java.util.HashMap.addEntry(HashMap.java:879)",
3dow_16h_58m.log-            "java.util.HashMap.put(HashMap.java:505)",
3dow_16h_58m.log-            "java.util.HashSet.add(HashSet.java:217)",
3dow_16h_58m.log-            "net.opentsdb.core.SaltScanner$ScannerCB$1MatchCB.call(SaltScanner.java:420)",
3dow_16h_58m.log-            "net.opentsdb.core.SaltScanner$ScannerCB$1MatchCB.call(SaltScanner.java:414)",
3dow_16h_58m.log-            "com.stumbleupon.async.Deferred.doCall(Deferred.java:1278)",
3dow_16h_58m.log-            "com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)",
3dow_16h_58m.log-            "com.stumbleupon.async.Deferred.handleContinuation(Deferred.java:1313)",
3dow_16h_58m.log-            "com.stumbleupon.async.Deferred.doCall(Deferred.java:1284)",
3dow_16h_58m.log-            "com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)",
3dow_16h_58m.log-            "com.stumbleupon.async.Deferred.callback(Deferred.java:1005)",
3dow_16h_58m.log-            "com.stumbleupon.async.DeferredGroup.done(DeferredGroup.java:173)",
3dow_16h_58m.log-            "com.stumbleupon.async.DeferredGroup.recordCompletion(DeferredGroup.java:158)",
3dow_16h_58m.log-            "com.stumbleupon.async.DeferredGroup.access$200(DeferredGroup.java:36)",
3dow_16h_58m.log-            "com.stumbleupon.async.DeferredGroup$1NotifyOrdered.call(DeferredGroup.java:97)",
3dow_16h_58m.log-            "com.stumbleupon.async.Deferred.doCall(Deferred.java:1278)",
3dow_16h_58m.log-            "com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)",
3dow_16h_58m.log-            "com.stumbleupon.async.Deferred.callback(Deferred.java:1005)",
3dow_16h_58m.log-            "org.hbase.async.HBaseRpc.callback(HBaseRpc.java:734)",
3dow_16h_58m.log-            "org.hbase.async.MapRThreadPool$RpcRunnable.run(MapRThreadPool.java:230)",
3dow_16h_58m.log-            "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)",
3dow_16h_58m.log-            "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)",
3dow_16h_58m.log-            "java.lang.Thread.run(Thread.java:745)"
3dow_16h_58m.log-        ],
3dow_16h_58m.log-        "state": "RUNNABLE",
3dow_16h_58m.log-        "threadID": 109
3dow_16h_59m.log-        "interrupted": false,
3dow_16h_59m.log-        "name": "pool-6-thread-48",
3dow_16h_59m.log-        "priority": 5,
3dow_16h_59m.log-        "stack": [
3dow_16h_59m.log-            "java.util.HashMap.transfer(HashMap.java:601)",
3dow_16h_59m.log:            "java.util.HashMap.resize(HashMap.java:581)",
3dow_16h_59m.log-            "java.util.HashMap.addEntry(HashMap.java:879)",
3dow_16h_59m.log-            "java.util.HashMap.put(HashMap.java:505)",
3dow_16h_59m.log-            "java.util.HashSet.add(HashSet.java:217)",
3dow_16h_59m.log-            "net.opentsdb.core.SaltScanner$ScannerCB$1MatchCB.call(SaltScanner.java:420)",
3dow_16h_59m.log-            "net.opentsdb.core.SaltScanner$ScannerCB$1MatchCB.call(SaltScanner.java:414)",
3dow_16h_59m.log-            "com.stumbleupon.async.Deferred.doCall(Deferred.java:1278)",
3dow_16h_59m.log-            "com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)",
3dow_16h_59m.log-            "com.stumbleupon.async.Deferred.handleContinuation(Deferred.java:1313)",
3dow_16h_59m.log-            "com.stumbleupon.async.Deferred.doCall(Deferred.java:1284)",
3dow_16h_59m.log-            "com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)",
3dow_16h_59m.log-            "com.stumbleupon.async.Deferred.callback(Deferred.java:1005)",
3dow_16h_59m.log-            "com.stumbleupon.async.DeferredGroup.done(DeferredGroup.java:173)",
3dow_16h_59m.log-            "com.stumbleupon.async.DeferredGroup.recordCompletion(DeferredGroup.java:158)",
3dow_16h_59m.log-            "com.stumbleupon.async.DeferredGroup.access$200(DeferredGroup.java:36)",
3dow_16h_59m.log-            "com.stumbleupon.async.DeferredGroup$1NotifyOrdered.call(DeferredGroup.java:97)",
3dow_16h_59m.log-            "com.stumbleupon.async.Deferred.doCall(Deferred.java:1278)",
3dow_16h_59m.log-            "com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)",
3dow_16h_59m.log-            "com.stumbleupon.async.Deferred.callback(Deferred.java:1005)",
3dow_16h_59m.log-            "org.hbase.async.HBaseRpc.callback(HBaseRpc.java:734)",
3dow_16h_59m.log-            "org.hbase.async.MapRThreadPool$RpcRunnable.run(MapRThreadPool.java:230)",
3dow_16h_59m.log-            "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)",
3dow_16h_59m.log-            "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)",
3dow_16h_59m.log-            "java.lang.Thread.run(Thread.java:745)"
3dow_16h_59m.log-        ],
3dow_16h_59m.log-        "state": "RUNNABLE",
3dow_16h_59m.log-        "threadID": 109
3dow_17h_00m.log-        "interrupted": false,
3dow_17h_00m.log-        "name": "pool-6-thread-48",
3dow_17h_00m.log-        "priority": 5,
3dow_17h_00m.log-        "stack": [
3dow_17h_00m.log-            "java.util.HashMap.transfer(HashMap.java:601)",
3dow_17h_00m.log:            "java.util.HashMap.resize(HashMap.java:581)",
3dow_17h_00m.log-            "java.util.HashMap.addEntry(HashMap.java:879)",
3dow_17h_00m.log-            "java.util.HashMap.put(HashMap.java:505)",
3dow_17h_00m.log-            "java.util.HashSet.add(HashSet.java:217)",
3dow_17h_00m.log-            "net.opentsdb.core.SaltScanner$ScannerCB$1MatchCB.call(SaltScanner.java:420)",
3dow_17h_00m.log-            "net.opentsdb.core.SaltScanner$ScannerCB$1MatchCB.call(SaltScanner.java:414)",
3dow_17h_00m.log-            "com.stumbleupon.async.Deferred.doCall(Deferred.java:1278)",
3dow_17h_00m.log-            "com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)",
3dow_17h_00m.log-            "com.stumbleupon.async.Deferred.handleContinuation(Deferred.java:1313)",
3dow_17h_00m.log-            "com.stumbleupon.async.Deferred.doCall(Deferred.java:1284)",
3dow_17h_00m.log-            "com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)",
3dow_17h_00m.log-            "com.stumbleupon.async.Deferred.callback(Deferred.java:1005)",
3dow_17h_00m.log-            "com.stumbleupon.async.DeferredGroup.done(DeferredGroup.java:173)",
3dow_17h_00m.log-            "com.stumbleupon.async.DeferredGroup.recordCompletion(DeferredGroup.java:158)",
3dow_17h_00m.log-            "com.stumbleupon.async.DeferredGroup.access$200(DeferredGroup.java:36)",
3dow_17h_00m.log-            "com.stumbleupon.async.DeferredGroup$1NotifyOrdered.call(DeferredGroup.java:97)",
3dow_17h_00m.log-            "com.stumbleupon.async.Deferred.doCall(Deferred.java:1278)",
3dow_17h_00m.log-            "com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)",
3dow_17h_00m.log-            "com.stumbleupon.async.Deferred.callback(Deferred.java:1005)",
3dow_17h_00m.log-            "org.hbase.async.HBaseRpc.callback(HBaseRpc.java:734)",
3dow_17h_00m.log-            "org.hbase.async.MapRThreadPool$RpcRunnable.run(MapRThreadPool.java:230)",
3dow_17h_00m.log-            "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)",
3dow_17h_00m.log-            "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)",
3dow_17h_00m.log-            "java.lang.Thread.run(Thread.java:745)"
3dow_17h_00m.log-        ],
3dow_17h_00m.log-        "state": "RUNNABLE",
3dow_17h_00m.log-        "threadID": 109

private final Set<String> skips = new HashSet<String>();

private final Set<String> skips = new HashSet<String>();

skips.add(tsuid);

skips.add(tsuid);

Log files from api/stats/threads :

monster4_3dow_16h_57m_log.txt
monster4_3dow_16h_58m_log.txt
monster4_3dow_16h_59m_log.txt
monster4_3dow_17h_00m_log.txt

The config (some names changed):

tsd.core.auto_create_metrics = true
tsd.core.meta.enable_realtime_ts = false
tsd.core.meta.enable_realtime_uid = true
tsd.core.meta.enable_tsuid_incrementing = false
tsd.core.meta.enable_tsuid_tracking = false
tsd.core.plugin_path = /home/opentsdb/opentsdb/releases/opentsdb-2.2.0/usr/share/opentsdb/plugins
tsd.core.tree.enable_processing = false
tsd.core.uid.random_metrics = true
tsd.http.cachedir = /dev/shm/opentsdb/tsdb-mdata
tsd.http.query.allow_delete = true
tsd.http.request.cors_domains = *
tsd.http.request.enable_chunked = true
tsd.http.request.max_chunk = 33554432
tsd.http.show_stack_trace = false
tsd.http.staticroot = /home/opentsdb/opentsdb/releases/opentsdb-2.2.0/usr/share/opentsdb/static
tsd.mode = ro
tsd.network.bind = 0.0.0.0
tsd.network.port = 4272
tsd.network.worker_threads = 24
tsd.query.allow_simultaneous_duplicates = false
tsd.query.timeout = 31000
tsd.storage.enable_appends = false
tsd.storage.enable_compaction = false
tsd.storage.fix_duplicates = true
tsd.storage.hbase.data_table = /mapr/maprtsdb01a/opentsdb/tsdb-mdata-legacy/tsdb
tsd.storage.hbase.meta_table = /mapr/maprtsdb01a/opentsdb/tsdb-mdata-legacy/tsdb-meta
tsd.storage.hbase.tree_table = /mapr/maprtsdb01a/opentsdb/tsdb-mdata-legacy/tsdb-tree
tsd.storage.hbase.uid_table = /mapr/maprtsdb01a/opentsdb/tsdb-mdata-legacy/tsdb-uid
tsd.storage.hbase.zk_quorum = maprtsdb01a01:5181,maprtsdb01a02:5181,maprtsdb01a03:5181,
tsd.storage.repair_appends = false
tsd.storage.salt.buckets = 30
tsd.storage.salt.width = 1

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions