-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
Hello and thank you for opentsdb,
We are using opentsdb-2.2.0 with
tsd.storage.salt.buckets = 30
tsd.storage.salt.width = 1
and mapr-hadoop-core-2.7.0.37549.GA-1.x86_64, both using java-1.7.0-openjdk-1.7.0.101.x86_64 on RHEL6.6 on 2.6.32-573.12.1.el6.x86_64
We see the opentsdb process occasionally spinning CPU in user time, logs from querying the api/stats/threads HTTP interface show a thread resizing a HashMap at these times. We do not have a reproducer. We have a monit process that detects the high CPU and restarts opentsdb.
I read http://mailinator.blogspot.co.uk/2009/06/beautiful-race-condition.html and was wondering if we are hitting this race, and if this might explain what we see.
I will attach the api/stats/threads log files. Some process stats for opentsdb process (the host has 12 vCPUS):
# TOP PROCESSES sorted by time (counters are /sec) 16:57:00
# PID User PR PPID THRD S VSZ RSS CP SysT UsrT Pct AccuTime RKB WKB MajF MinF
96659 opentsdb 20 1 213 S 15G 719M 7 0.01 0.12 0 0:24.21 0 1 0 8
# TOP PROCESSES sorted by time (counters are /sec) 16:58:00
# PID User PR PPID THRD S VSZ RSS CP SysT UsrT Pct AccuTime RKB WKB MajF MinF
96659 opentsdb 20 1 213 S 16G 1G 7 1.01 179.33 300 3:24.55 0 30 0 4227
# TOP PROCESSES sorted by time (counters are /sec) 16:59:00
# PID User PR PPID THRD S VSZ RSS CP SysT UsrT Pct AccuTime RKB WKB MajF MinF
96659 opentsdb 20 1 213 S 16G 1G 7 0.65 415.73 693 10:20.93 0 1 0 8
# TOP PROCESSES sorted by time (counters are /sec) 17:00:00
# PID User PR PPID THRD S VSZ RSS CP SysT UsrT Pct AccuTime RKB WKB MajF MinF
96659 opentsdb 20 1 213 S 16G 1G 7 0.67 416.33 695 17:17.93 0 1 0 9
Thanks in advance for any help you can provide.
( Whilst looking at the stacks I also noticed that opentsdb is using version 1.4.0 of stumbleupon/async (
opentsdb/third_party/suasync/include.mk
Line 16 in e2efe9a
| SUASYNC_VERSION := 1.4.0 |
Thread seen resizing HashMap
3dow_16h_58m.log- "interrupted": false,
3dow_16h_58m.log- "name": "pool-6-thread-48",
3dow_16h_58m.log- "priority": 5,
3dow_16h_58m.log- "stack": [
3dow_16h_58m.log- "java.util.HashMap.transfer(HashMap.java:601)",
3dow_16h_58m.log: "java.util.HashMap.resize(HashMap.java:581)",
3dow_16h_58m.log- "java.util.HashMap.addEntry(HashMap.java:879)",
3dow_16h_58m.log- "java.util.HashMap.put(HashMap.java:505)",
3dow_16h_58m.log- "java.util.HashSet.add(HashSet.java:217)",
3dow_16h_58m.log- "net.opentsdb.core.SaltScanner$ScannerCB$1MatchCB.call(SaltScanner.java:420)",
3dow_16h_58m.log- "net.opentsdb.core.SaltScanner$ScannerCB$1MatchCB.call(SaltScanner.java:414)",
3dow_16h_58m.log- "com.stumbleupon.async.Deferred.doCall(Deferred.java:1278)",
3dow_16h_58m.log- "com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)",
3dow_16h_58m.log- "com.stumbleupon.async.Deferred.handleContinuation(Deferred.java:1313)",
3dow_16h_58m.log- "com.stumbleupon.async.Deferred.doCall(Deferred.java:1284)",
3dow_16h_58m.log- "com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)",
3dow_16h_58m.log- "com.stumbleupon.async.Deferred.callback(Deferred.java:1005)",
3dow_16h_58m.log- "com.stumbleupon.async.DeferredGroup.done(DeferredGroup.java:173)",
3dow_16h_58m.log- "com.stumbleupon.async.DeferredGroup.recordCompletion(DeferredGroup.java:158)",
3dow_16h_58m.log- "com.stumbleupon.async.DeferredGroup.access$200(DeferredGroup.java:36)",
3dow_16h_58m.log- "com.stumbleupon.async.DeferredGroup$1NotifyOrdered.call(DeferredGroup.java:97)",
3dow_16h_58m.log- "com.stumbleupon.async.Deferred.doCall(Deferred.java:1278)",
3dow_16h_58m.log- "com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)",
3dow_16h_58m.log- "com.stumbleupon.async.Deferred.callback(Deferred.java:1005)",
3dow_16h_58m.log- "org.hbase.async.HBaseRpc.callback(HBaseRpc.java:734)",
3dow_16h_58m.log- "org.hbase.async.MapRThreadPool$RpcRunnable.run(MapRThreadPool.java:230)",
3dow_16h_58m.log- "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)",
3dow_16h_58m.log- "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)",
3dow_16h_58m.log- "java.lang.Thread.run(Thread.java:745)"
3dow_16h_58m.log- ],
3dow_16h_58m.log- "state": "RUNNABLE",
3dow_16h_58m.log- "threadID": 109
3dow_16h_59m.log- "interrupted": false,
3dow_16h_59m.log- "name": "pool-6-thread-48",
3dow_16h_59m.log- "priority": 5,
3dow_16h_59m.log- "stack": [
3dow_16h_59m.log- "java.util.HashMap.transfer(HashMap.java:601)",
3dow_16h_59m.log: "java.util.HashMap.resize(HashMap.java:581)",
3dow_16h_59m.log- "java.util.HashMap.addEntry(HashMap.java:879)",
3dow_16h_59m.log- "java.util.HashMap.put(HashMap.java:505)",
3dow_16h_59m.log- "java.util.HashSet.add(HashSet.java:217)",
3dow_16h_59m.log- "net.opentsdb.core.SaltScanner$ScannerCB$1MatchCB.call(SaltScanner.java:420)",
3dow_16h_59m.log- "net.opentsdb.core.SaltScanner$ScannerCB$1MatchCB.call(SaltScanner.java:414)",
3dow_16h_59m.log- "com.stumbleupon.async.Deferred.doCall(Deferred.java:1278)",
3dow_16h_59m.log- "com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)",
3dow_16h_59m.log- "com.stumbleupon.async.Deferred.handleContinuation(Deferred.java:1313)",
3dow_16h_59m.log- "com.stumbleupon.async.Deferred.doCall(Deferred.java:1284)",
3dow_16h_59m.log- "com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)",
3dow_16h_59m.log- "com.stumbleupon.async.Deferred.callback(Deferred.java:1005)",
3dow_16h_59m.log- "com.stumbleupon.async.DeferredGroup.done(DeferredGroup.java:173)",
3dow_16h_59m.log- "com.stumbleupon.async.DeferredGroup.recordCompletion(DeferredGroup.java:158)",
3dow_16h_59m.log- "com.stumbleupon.async.DeferredGroup.access$200(DeferredGroup.java:36)",
3dow_16h_59m.log- "com.stumbleupon.async.DeferredGroup$1NotifyOrdered.call(DeferredGroup.java:97)",
3dow_16h_59m.log- "com.stumbleupon.async.Deferred.doCall(Deferred.java:1278)",
3dow_16h_59m.log- "com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)",
3dow_16h_59m.log- "com.stumbleupon.async.Deferred.callback(Deferred.java:1005)",
3dow_16h_59m.log- "org.hbase.async.HBaseRpc.callback(HBaseRpc.java:734)",
3dow_16h_59m.log- "org.hbase.async.MapRThreadPool$RpcRunnable.run(MapRThreadPool.java:230)",
3dow_16h_59m.log- "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)",
3dow_16h_59m.log- "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)",
3dow_16h_59m.log- "java.lang.Thread.run(Thread.java:745)"
3dow_16h_59m.log- ],
3dow_16h_59m.log- "state": "RUNNABLE",
3dow_16h_59m.log- "threadID": 109
3dow_17h_00m.log- "interrupted": false,
3dow_17h_00m.log- "name": "pool-6-thread-48",
3dow_17h_00m.log- "priority": 5,
3dow_17h_00m.log- "stack": [
3dow_17h_00m.log- "java.util.HashMap.transfer(HashMap.java:601)",
3dow_17h_00m.log: "java.util.HashMap.resize(HashMap.java:581)",
3dow_17h_00m.log- "java.util.HashMap.addEntry(HashMap.java:879)",
3dow_17h_00m.log- "java.util.HashMap.put(HashMap.java:505)",
3dow_17h_00m.log- "java.util.HashSet.add(HashSet.java:217)",
3dow_17h_00m.log- "net.opentsdb.core.SaltScanner$ScannerCB$1MatchCB.call(SaltScanner.java:420)",
3dow_17h_00m.log- "net.opentsdb.core.SaltScanner$ScannerCB$1MatchCB.call(SaltScanner.java:414)",
3dow_17h_00m.log- "com.stumbleupon.async.Deferred.doCall(Deferred.java:1278)",
3dow_17h_00m.log- "com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)",
3dow_17h_00m.log- "com.stumbleupon.async.Deferred.handleContinuation(Deferred.java:1313)",
3dow_17h_00m.log- "com.stumbleupon.async.Deferred.doCall(Deferred.java:1284)",
3dow_17h_00m.log- "com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)",
3dow_17h_00m.log- "com.stumbleupon.async.Deferred.callback(Deferred.java:1005)",
3dow_17h_00m.log- "com.stumbleupon.async.DeferredGroup.done(DeferredGroup.java:173)",
3dow_17h_00m.log- "com.stumbleupon.async.DeferredGroup.recordCompletion(DeferredGroup.java:158)",
3dow_17h_00m.log- "com.stumbleupon.async.DeferredGroup.access$200(DeferredGroup.java:36)",
3dow_17h_00m.log- "com.stumbleupon.async.DeferredGroup$1NotifyOrdered.call(DeferredGroup.java:97)",
3dow_17h_00m.log- "com.stumbleupon.async.Deferred.doCall(Deferred.java:1278)",
3dow_17h_00m.log- "com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)",
3dow_17h_00m.log- "com.stumbleupon.async.Deferred.callback(Deferred.java:1005)",
3dow_17h_00m.log- "org.hbase.async.HBaseRpc.callback(HBaseRpc.java:734)",
3dow_17h_00m.log- "org.hbase.async.MapRThreadPool$RpcRunnable.run(MapRThreadPool.java:230)",
3dow_17h_00m.log- "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)",
3dow_17h_00m.log- "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)",
3dow_17h_00m.log- "java.lang.Thread.run(Thread.java:745)"
3dow_17h_00m.log- ],
3dow_17h_00m.log- "state": "RUNNABLE",
3dow_17h_00m.log- "threadID": 109
opentsdb/src/core/SaltScanner.java
Line 309 in e2efe9a
| private final Set<String> skips = new HashSet<String>(); |
private final Set<String> skips = new HashSet<String>();
opentsdb/src/core/SaltScanner.java
Line 420 in e2efe9a
| skips.add(tsuid); |
skips.add(tsuid);
Log files from api/stats/threads :
monster4_3dow_16h_57m_log.txt
monster4_3dow_16h_58m_log.txt
monster4_3dow_16h_59m_log.txt
monster4_3dow_17h_00m_log.txt
The config (some names changed):
tsd.core.auto_create_metrics = true
tsd.core.meta.enable_realtime_ts = false
tsd.core.meta.enable_realtime_uid = true
tsd.core.meta.enable_tsuid_incrementing = false
tsd.core.meta.enable_tsuid_tracking = false
tsd.core.plugin_path = /home/opentsdb/opentsdb/releases/opentsdb-2.2.0/usr/share/opentsdb/plugins
tsd.core.tree.enable_processing = false
tsd.core.uid.random_metrics = true
tsd.http.cachedir = /dev/shm/opentsdb/tsdb-mdata
tsd.http.query.allow_delete = true
tsd.http.request.cors_domains = *
tsd.http.request.enable_chunked = true
tsd.http.request.max_chunk = 33554432
tsd.http.show_stack_trace = false
tsd.http.staticroot = /home/opentsdb/opentsdb/releases/opentsdb-2.2.0/usr/share/opentsdb/static
tsd.mode = ro
tsd.network.bind = 0.0.0.0
tsd.network.port = 4272
tsd.network.worker_threads = 24
tsd.query.allow_simultaneous_duplicates = false
tsd.query.timeout = 31000
tsd.storage.enable_appends = false
tsd.storage.enable_compaction = false
tsd.storage.fix_duplicates = true
tsd.storage.hbase.data_table = /mapr/maprtsdb01a/opentsdb/tsdb-mdata-legacy/tsdb
tsd.storage.hbase.meta_table = /mapr/maprtsdb01a/opentsdb/tsdb-mdata-legacy/tsdb-meta
tsd.storage.hbase.tree_table = /mapr/maprtsdb01a/opentsdb/tsdb-mdata-legacy/tsdb-tree
tsd.storage.hbase.uid_table = /mapr/maprtsdb01a/opentsdb/tsdb-mdata-legacy/tsdb-uid
tsd.storage.hbase.zk_quorum = maprtsdb01a01:5181,maprtsdb01a02:5181,maprtsdb01a03:5181,
tsd.storage.repair_appends = false
tsd.storage.salt.buckets = 30
tsd.storage.salt.width = 1