IEP Monitoring Metrics

IEP Monitoring

GridGain lets you monitor processes in the cluster and identify which user tasks are executed and which resources each task uses. The available metrics are listed in the tables below:

cache:

# Metric Units Description

1

cache.<cache_name>.CacheEvictions

number

Total number of cache evictions.

2

cache.<cache_name>.CacheGets

number

The total number of gets to the cache.

3

cache.<cache_name>.CacheHits

number

The number of get requests that were satisfied by the cache.

4

cache.<cache_name>.CacheMisses

number

The number of get requests that were not satisfied by the cache.

5

cache.<cache_name>.CachePuts

number

The total number of puts to the cache.

6

cache.<cache_name>.CacheRemovals

number

The total number of removals from the cache.

7

cache.<cache_name>.CacheTxCommits

number

Total number of transaction commits.

8

cache.<cache_name>.CacheTxRollbacks

number

Total number of transaction rollbacks.

9

cache.<cache_name>.CommitTime

ns

Commit time in nanoseconds.

10

cache.<cache_name>.CommitTimeTotal

ns

The total time of commit, in nanoseconds.

11

cache.<cache_name>.EntryProcessorHits

number

The total number of invocations on keys that exist in the cache.

12

cache.<cache_name>.EntryProcessorInvokeTimeNanos

ns

The total time of cache invocations, in nanoseconds.

13

cache.<cache_name>.EntryProcessorMaxInvocationTime

number

Maximum time to execute cache invokes.

14

cache.<cache_name>.EntryProcessorMinInvocationTime

number

Minimum time to execute cache invokes.

15

cache.<cache_name>.EntryProcessorMisses

number

The total number of invocations on keys that do not exist in the cache.

16

cache.<cache_name>.EntryProcessorPuts

number

The total number of cache invocations that caused an update.

17

cache.<cache_name>.EntryProcessorReadOnlyInvocations

number

The total number of cache invocations that caused no updates.

18

cache.<cache_name>.EntryProcessorRemovals

number

The total number of cache invocations that caused removals.

19

cache.<cache_name>.EstimatedRebalancingKeys

number

Estimated number of keys to be rebalanced.

20

cache.<cache_name>.GetTime

ns

Get time in nanoseconds.

21

cache.<cache_name>.GetTimeTotal

ns

The total time of cache gets, in nanoseconds.

22

cache.<cache_name>.OffHeapEvictions

number

The total number of evictions from the off-heap memory.

23

cache.<cache_name>.OffHeapGets

number

The total number of get requests to the off-heap memory.

24

cache.<cache_name>.OffHeapHits

number

The number of get requests that were satisfied by the off-heap memory.

25

cache.<cache_name>.OffHeapMisses

number

The total number of misses (get requests that were not satisfied by the off-heap memory).

26

cache.<cache_name>.OffHeapPuts

number

The total number of put requests to the off-heap memory.

27

cache.<cache_name>.OffHeapRemovals

number

The total number of removals from the off-heap memory.

28

cache.<cache_name>.PutTime

ns

Put time in nanoseconds.

29

cache.<cache_name>.PutTimeTotal

ns

The total time of cache puts, in nanoseconds.

30

cache.<cache_name>.QueryCompleted

number

Total number of successfully completed queries initiated by the node.

31

cache.<cache_name>.QueryExecuted

number

Total number of completed queries initiated by the node.

32

cache.<cache_name>.QueryFailed

number

Total number of failed queries initiated by the node.

33

cache.<cache_name>.QueryMaximumTime

ms

Maximum execution time of a single query initiated by the node.

34

cache.<cache_name>.QueryMinimalTime

ms

Minimum execution time of a single query initiated by the node.

35

cache.<cache_name>.QuerySumTime

ms

Sum of execution time periods of all queries initiated by the node.

36

cache.<cache_name>.RebalanceClearingPartitionsLeft

number

Number of partitions that need to be cleared before the actual rebalance starts.

37

cache.<cache_name>.RebalanceStartTime

Rebalance start time.

38

cache.<cache_name>.RebalancedKeys

number

Number of already rebalanced keys.

39

cache.<cache_name>.RebalancingBytesRate

bytes/min

Estimated rebalancing speed in bytes.

40

cache.<cache_name>.RebalancingKeysRate

number of keys/min

Estimated rebalancing speed in keys.

41

cache.<cache_name>.RemoveTime

ns

Remove time in nanoseconds.

42

cache.<cache_name>.RemoveTimeTotal

ns

The total time of cache removal, in nanoseconds.

43

cache.<cache_name>.RollbackTime

ns

Rollback time in nanoseconds.

44

cache.<cache_name>.RollbackTimeTotal

ns

The total time of rollback, in nanoseconds.

45

cache.<cache_name>.TotalRebalancedBytes

number

Number of already rebalanced bytes.

46

cache.<cache_name>.TxKeyCollisions

List of cache keys with a large number of lock collisions and a long wait queue.
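
The counters in this table can be combined into derived indicators such as a hit ratio or a mean get latency. The following is a minimal sketch in Python, assuming the metric values have already been read into a plain dictionary by whatever exporter you use. The function name `cache_read_stats` and the dictionary layout are hypothetical and not part of GridGain, and the mean latency assumes that GetTimeTotal accumulates the time of all gets counted by CacheGets.

```python
def cache_read_stats(sample):
    """Derive a hit ratio and a mean get latency from one metrics sample.

    `sample` is a hypothetical dict keyed by the metric name suffixes
    listed in the table above (the part after cache.<cache_name>.).
    """
    gets = sample["CacheGets"]
    hits = sample["CacheHits"]
    # Guard against division by zero for a cache that has not been read yet.
    hit_ratio = hits / gets if gets else 0.0
    # GetTimeTotal is reported in nanoseconds; convert the mean to microseconds.
    avg_get_us = sample["GetTimeTotal"] / gets / 1_000 if gets else 0.0
    return {"hit_ratio": hit_ratio, "avg_get_us": avg_get_us}


# Illustrative values only, not measurements from a real cluster.
sample = {"CacheGets": 200, "CacheHits": 150, "CacheMisses": 50, "GetTimeTotal": 4_000_000}
stats = cache_read_stats(sample)
print(stats)
```

The same pattern applies to the put and remove counters and their corresponding total-time metrics.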

cacheGroups:

# Metric Units Description

47

cacheGroups.<group_name>.AffinityPartitionsAssignmentMap

Affinity partitions assignment map.

48

cacheGroups.<group_name>.Caches

List of caches in the cache group.

49

cacheGroups.<group_name>.IndexBuildCountPartitionsLeft

number

Number of partitions that still need to be processed to finish index creation or rebuilding.

50

cacheGroups.<group_name>.InitializedLocalPartitionsNumber

number

Number of local partitions initialized on the current node.

51

cacheGroups.<group_name>.LocalNodeMovingPartitionsCount

number

Count of partitions with state MOVING for this cache group located on this node.

52

cacheGroups.<group_name>.LocalNodeOwningPartitionsCount

number

Count of partitions with state OWNING for this cache group located on this node.

53

cacheGroups.<group_name>.LocalNodeRentingEntriesCount

number

Count of entries that remain to be evicted from RENTING partitions located on this node for this cache group.

54

cacheGroups.<group_name>.LocalNodeRentingPartitionsCount

number

Count of partitions with state RENTING for this cache group located on this node.

55

cacheGroups.<group_name>.MaximumNumberOfPartitionCopies

number

Maximum number of partition copies for all partitions of this cache group.

56

cacheGroups.<group_name>.MinimumNumberOfPartitionCopies

number

Minimum number of partition copies for all partitions of this cache group.

57

cacheGroups.<group_name>.MovingPartitionsAllocationMap

Allocation map of partitions with the MOVING state in the cluster.

58

cacheGroups.<group_name>.OwningPartitionsAllocationMap

Allocation map of partitions with the OWNING state in the cluster.

59

cacheGroups.<group_name>.PartitionIds

Local partition IDs.

60

cacheGroups.<group_name>.SparseStorageSize

bytes

Storage space allocated for a group adjusted for possible sparsity, in bytes.

61

cacheGroups.<group_name>.StorageSize

bytes

Storage space allocated for a group, in bytes.

62

cacheGroups.<group_name>.TotalAllocatedPages

number

Cache group total allocated pages.

63

cacheGroups.<group_name>.TotalAllocatedSize

bytes

Total size of memory allocated for group, in bytes.

communication:

# Metric Units Description

64

communication.tcp.<node_id>.receivedMessagesFromNode

number

Total number of messages received by the current node from the given node

65

communication.tcp.<node_id>.sentMessagesToNode

number

Total number of messages sent by the current node to the given node.

66

communication.tcp.outboundMessagesQueueSize

number

Number of messages waiting to be sent.

67

communication.tcp.receivedBytes

bytes

Total number of bytes received by the current node.

68

communication.tcp.receivedMessagesByType.<message_id>

number

Total number of messages of the given type received by the current node

69

communication.tcp.receivedMessagesCount

number

Total number of messages received by the current node.

70

communication.tcp.sentBytes

bytes

Total number of bytes sent by the current node.

71

communication.tcp.sentMessagesByType.<message_id>

number

Total number of messages of the given type sent by the current node

72

communication.tcp.sentMessagesCount

number

Total number of messages sent by the current node.
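
Most communication metrics are running totals, so throughput has to be derived from two samples taken some time apart. The sketch below shows one way to do that in Python. The function name `per_second_rate` is hypothetical, and the reset handling assumes that a total restarts from zero when the node restarts, which the table itself does not state.

```python
def per_second_rate(prev_value, curr_value, interval_seconds):
    """Convert two samples of a running total into a per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_value - prev_value
    # Assumption: a negative delta means the counter was reset between
    # samples, so only the value accumulated since the reset is counted.
    if delta < 0:
        delta = curr_value
    return delta / interval_seconds


# Example: sentMessagesCount sampled twice, 60 seconds apart (made-up values).
rate = per_second_rate(1_000, 4_000, 60)
print(rate)
```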

compute:

# Metric Units Description

73

compute.jobs.Active

number

Number of active jobs that are currently executing.

74

compute.jobs.Canceled

number

Number of canceled jobs that are still running.

75

compute.jobs.ExecutionTime

ms

Total execution time of jobs.

76

compute.jobs.Finished

number

Number of finished jobs.

77

compute.jobs.Rejected

number

Number of jobs rejected after the most recent collision resolution operation.

78

compute.jobs.Started

number

Number of started jobs.

79

compute.jobs.Waiting

number

Number of currently queued jobs waiting to be executed.

80

compute.jobs.WaitingTime

ms

Total time jobs spent in the waiting queue.

io:

# Metric Units Description

81

io.communication.OutboundMessagesQueueSize

number

Outbound messages queue size.

82

io.communication.ReceivedBytesCount

bytes

Received bytes count.

83

io.communication.ReceivedMessagesCount

number

Received messages count.

84

io.communication.SentBytesCount

bytes

Sent bytes count.

85

io.communication.SentMessagesCount

number

Sent messages count.

86

io.dataregion.<region_name>.AllocationRate

number/s

Allocation rate (pages per second) averaged across rateTimeInterval.

87

io.dataregion.<region_name>.CheckpointBufferSize

bytes

Checkpoint buffer size in bytes.

88

io.dataregion.<region_name>.DirtyPages

number

Number of pages in memory not yet synchronized with persistent storage.

89

io.dataregion.<region_name>.EmptyDataPages

number

Number of empty data pages in the region. Only completely free pages that can be reused are counted (e.g., pages contained in the reuse bucket of the free list).

90

io.dataregion.<region_name>.EvictionRate

number/s

Eviction rate (pages per second).

91

io.dataregion.<region_name>.LargeEntriesPagesCount

number

Count of pages that are fully occupied by large entries that exceed the page size.

92

io.dataregion.<region_name>.OffHeapSize

bytes

Off-heap size in bytes.

93

io.dataregion.<region_name>.OffheapUsedSize

bytes

Off-heap used size in bytes.

94

io.dataregion.<region_name>.PagesFillFactor

%

The percentage of the used space.

95

io.dataregion.<region_name>.PagesRead

number

Number of pages read since the last restart.

96

io.dataregion.<region_name>.PagesReplaceAge

ms

Average age at which pages in memory are replaced with pages from persistent storage (milliseconds).

97

io.dataregion.<region_name>.PagesReplaceRate

number/s

Rate at which pages in memory are replaced with pages from persistent storage (pages per second).

98

io.dataregion.<region_name>.PagesReplaced

number

Number of pages replaced since the last restart.

99

io.dataregion.<region_name>.PagesWritten

number

Number of pages written since the last restart.

100

io.dataregion.<region_name>.PhysicalMemoryPages

number

Number of pages residing in physical RAM.

101

io.dataregion.<region_name>.PhysicalMemorySize

bytes

Total size of pages loaded into RAM, in bytes.

102

io.dataregion.<region_name>.TotalAllocatedPages

number

Total number of allocated pages.

103

io.dataregion.<region_name>.TotalAllocatedSize

bytes

Total size of memory allocated in the data region, in bytes.

104

io.dataregion.<region_name>.UsedCheckpointBufferSize

bytes

Used checkpoint buffer size in bytes.

105

io.datastorage.CheckpointTotalTime

ms

Total duration of checkpoint.

106

io.datastorage.LastCheckpointCopiedOnWritePagesNumber

number

Number of pages copied to a temporary checkpoint buffer during the last checkpoint.

107

io.datastorage.LastCheckpointDataPagesNumber

number

Total number of data pages written during the last checkpoint.

108

io.datastorage.LastCheckpointDuration

ms

Duration of the last checkpoint in milliseconds.

109

io.datastorage.LastCheckpointFsyncDuration

ms

Duration of the sync phase of the last checkpoint in milliseconds.

110

io.datastorage.LastCheckpointLockWaitDuration

ms

Duration of the checkpoint lock wait in milliseconds.

111

io.datastorage.LastCheckpointMarkDuration

ms

Duration of the checkpoint mark phase in milliseconds.

112

io.datastorage.LastCheckpointPagesWriteDuration

ms

Duration of the checkpoint pages write in milliseconds.

113

io.datastorage.LastCheckpointTotalPagesNumber

number

Total number of pages written during the last checkpoint.

114

io.datastorage.SparseStorageSize

bytes

Storage space allocated adjusted for possible sparsity, in bytes.

115

io.datastorage.StorageSize

bytes

Storage space allocated, in bytes.

116

io.datastorage.WalArchiveSegments

number

Current number of WAL segments in the WAL archive.

117

io.datastorage.WalBuffPollSpinsRate

number

WAL buffer poll spins number over the last time interval.

118

io.datastorage.WalFsyncTimeDuration

ms

Total duration of fsync.

119

io.datastorage.WalFsyncTimeNum

number

Total number of fsync operations.

120

io.datastorage.WalLastRollOverTime

timestamp

Time of the last WAL segment rollover.

121

io.datastorage.WalLoggingRate

number/s

Average number of WAL records per second written during the last time interval.

122

io.datastorage.WalTotalSize

bytes

Total size of the stored WAL files, in bytes.

123

io.datastorage.WalWritingRate

bytes/s

Average number of bytes per second written during the last time interval.

124

io.statistics.cacheGroups.<group_name>.LOGICAL_READS

number

Number of times a page was read regardless of whether the page was in memory or not.

125

io.statistics.cacheGroups.<group_name>.PHYSICAL_READS

number

Number of times a page was read from disk to memory.

126

io.statistics.cacheGroups.<group_name>.grpId

string

Group identifier.

127

io.statistics.cacheGroups.<group_name>.name

string

Group name.

128

io.statistics.cacheGroups.<group_name>.startTime

timestamp

Group start timestamp.

129

io.statistics.hashIndexes.<cache_name>.<index_name>.LOGICAL_READS_INNER

number

Number of times an inner index page was read regardless of whether the page was in memory or not.

130

io.statistics.hashIndexes.<cache_name>.<index_name>.LOGICAL_READS_LEAF

number

Number of times a leaf index page was read regardless of whether the page was in memory or not.

131

io.statistics.hashIndexes.<cache_name>.<index_name>.PHYSICAL_READS_INNER

number

Number of times an inner index page was read from disk to memory.

132

io.statistics.hashIndexes.<cache_name>.<index_name>.PHYSICAL_READS_LEAF

number

Number of times a leaf index page was read from disk to memory.

133

io.statistics.hashIndexes.<cache_name>.<index_name>.indexName

string

Index name.

134

io.statistics.hashIndexes.<cache_name>.<index_name>.name

string

Cache name

135

io.statistics.hashIndexes.<cache_name>.<index_name>.startTime

timestamp

Index creation time

136

io.statistics.sortedIndexes.<cache_name>.<index_name>.LOGICAL_READS_INNER

number

Number of times an inner index page was read regardless of whether the page was in memory or not

137

io.statistics.sortedIndexes.<cache_name>.<index_name>.LOGICAL_READS_LEAF

number

Number of times a leaf index page was read regardless of whether the page was in memory or not

138

io.statistics.sortedIndexes.<cache_name>.<index_name>.PHYSICAL_READS_INNER

number

Number of times an inner index page was read from disk to memory

139

io.statistics.sortedIndexes.<cache_name>.<index_name>.PHYSICAL_READS_LEAF

number

Number of times a leaf index page was read from disk to memory

140

io.statistics.sortedIndexes.<cache_name>.<index_name>.indexName

string

Index name

141

io.statistics.sortedIndexes.<cache_name>.<index_name>.name

string

Cache name
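
Two ratios are commonly derived from the io metrics above: the share of page reads that had to go to disk, and how full the checkpoint buffer is. The following Python sketch illustrates both. The function names are hypothetical, and the inputs are assumed to be values of LOGICAL_READS, PHYSICAL_READS, UsedCheckpointBufferSize, and CheckpointBufferSize that you have already collected.

```python
def physical_read_ratio(logical_reads, physical_reads):
    """Share of page reads that required loading the page from disk."""
    # LOGICAL_READS counts every page read; PHYSICAL_READS counts only
    # the reads that went from disk to memory.
    return physical_reads / logical_reads if logical_reads else 0.0


def checkpoint_buffer_usage(used_bytes, total_bytes):
    """Fraction of the checkpoint buffer that is currently in use."""
    return used_bytes / total_bytes if total_bytes else 0.0


# Illustrative values only.
print(physical_read_ratio(1_000, 50))
print(checkpoint_buffer_usage(256, 1_024))
```

A rising physical read ratio suggests that more of the working set is being served from persistent storage rather than from memory.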

pme:

# Metric Units Description

142

pme.CacheOperationsBlockedDuration

ms

Current PME cache operations blocked duration in milliseconds.

144

pme.CacheOperationsBlockedDurationHistogram

ms

Histogram of cache operations blocked PME durations in milliseconds.

145

pme.Duration

ms

Current PME duration in milliseconds.

146

pme.DurationHistogram

ms

Histogram of PME durations in milliseconds.

sql:

# Metric Units Description

147

sql.memory.quotas.OffloadedQueriesNumber

number

Number of queries that were offloaded to disk locally

148

sql.memory.quotas.OffloadingRead

bytes

Number of bytes read from the disk during SQL query offloading

149

sql.memory.quotas.OffloadingWritten

bytes

Number of bytes written to the disk during SQL query offloading

150

sql.memory.quotas.freeMem

bytes

Amount of memory left available for the queries on this node, in bytes (negative value if SQL memory quotas are disabled)

151

sql.memory.quotas.maxMem

bytes

Total amount of memory available for all queries on the current node (negative value if SQL memory quotas are disabled)

152

sql.memory.quotas.requests

number

Total number of times memory quota has been requested on the current node by all the queries

153

sql.parser.cache.hits

number

Number of hits for queries cache

154

sql.parser.cache.misses

number

Number of misses for queries cache

155

sql.queries.user.canceled

number

Number of canceled queries initiated by the current node. This number is included in the general 'failed' metric.

156

sql.queries.user.failed

number

Number of failed queries (including OOME) initiated by the current node

157

sql.queries.user.failedByOOM

number

Number of queries failed due to out of memory protection initiated by the current node. This number is included in the general 'failed' metric.

158

sql.queries.user.success

number

Number of successfully executed queries initiated by the current node
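
Because the canceled and failedByOOM counts are already included in the general failed metric, they should be subtracted from it, not added to it, when breaking down query outcomes. The sketch below shows this in Python. The function name `sql_query_breakdown` and the returned keys are hypothetical, and it assumes that canceled and failedByOOM do not overlap with each other.

```python
def sql_query_breakdown(success, failed, canceled, failed_by_oom):
    """Break down sql.queries.user.* counters into non-overlapping parts."""
    # canceled and failedByOOM are subsets of failed, so the remainder
    # is the number of queries that failed for any other reason.
    other_failed = failed - canceled - failed_by_oom
    total = success + failed
    return {
        "total": total,
        "other_failed": other_failed,
        "success_ratio": success / total if total else 0.0,
    }


# Illustrative values only.
breakdown = sql_query_breakdown(success=90, failed=10, canceled=3, failed_by_oom=2)
print(breakdown)
```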

sys:

# Metric Units Description

159

sys.CpuLoad

%

CPU load.

160

sys.CurrentThreadCpuTime

ns

Total CPU time for the current thread in nanoseconds.

161

sys.CurrentThreadUserTime

ns

CPU time that the current thread has executed in user mode in nanoseconds.

162

sys.DaemonThreadCount

number

Number of live daemon threads.

163

sys.GcCpuLoad

%

GC CPU load.

164

sys.PeakThreadCount

number

Peak number of live JVM threads.

165

sys.SystemLoadAverage

%

System load average reported by the JVM OS MBean.

166

sys.ThreadCount

number

Number of live JVM threads.

167

sys.TotalExecutedTasks

number

Total executed tasks.

168

sys.TotalStartedThreadCount

number

Total number of threads created and started since the JVM started.

169

sys.UpTime

ms

JVM uptime.

170

sys.memory.heap.committed

bytes

Amount of memory that is committed for the JVM to use in bytes.

171

sys.memory.heap.init

bytes

Amount of memory that the JVM initially requests from the operating system for memory management in bytes.

172

sys.memory.heap.max

bytes

Maximum amount of memory that can be used for memory management in bytes.

173

sys.memory.heap.used

bytes

Amount of used memory in bytes.

174

sys.memory.nonheap.committed

bytes

Amount of memory that is committed for the JVM to use in bytes.

175

sys.memory.nonheap.init

bytes

Amount of memory that the JVM initially requests from the operating system for memory management in bytes.

176

sys.memory.nonheap.max

bytes

Maximum amount of memory that can be used for memory management in bytes.

177

sys.memory.nonheap.used

bytes

Amount of used memory in bytes.

threadPools:

# Metric Units Description

178

threadPools.<executor_name>.ActiveCount

number

Approximate number of threads that are actively executing tasks.

179

threadPools.<executor_name>.CompletedTaskCount

number

Approximate total number of tasks that have completed execution.

180

threadPools.<executor_name>.CorePoolSize

number

The core number of threads.

181

threadPools.<executor_name>.DetectStarvation

boolean

True if starvation in a striped pool is detected.

182

threadPools.<executor_name>.KeepAliveTime

ms

Thread keep-alive time, which is the amount of time which threads in excess of the core pool size may remain idle before being terminated.

183

threadPools.<executor_name>.LargestPoolSize

number

Largest number of threads that have ever simultaneously been in the pool.

184

threadPools.<executor_name>.MaximumPoolSize

number

Maximum number of allowed threads.

185

threadPools.<executor_name>.PoolSize

number

Current number of threads in the pool.

186

threadPools.<executor_name>.QueueSize

number

Current size of the execution queue.

187

threadPools.<executor_name>.RejectedExecutionHandlerClass

string

Class name of current rejection handler.

188

threadPools.<executor_name>.Shutdown

boolean

True if the executor has been shut down.

189

threadPools.<executor_name>.StripesActiveStatuses

number

Number of active tasks per stripe.

190

threadPools.<executor_name>.StripesCompletedTasksCounts

number

Number of completed tasks per stripe.

191

threadPools.<executor_name>.StripesCount

number

Stripes count.

192

threadPools.<executor_name>.StripesQueueSizes

number

Size of queue per stripe.

193

threadPools.<executor_name>.TaskCount

number

Approximate total number of tasks that have been scheduled for execution.

194

threadPools.<executor_name>.Terminated

boolean

True if all tasks have completed following shutdown.

195

threadPools.<executor_name>.Terminating

boolean

True if terminating but not yet terminated.

196

threadPools.<executor_name>.ThreadFactoryClass

string

Class name of thread factory used to create new threads.

197

threadPools.<executor_name>.TotalCompletedTasksCount

number

Completed tasks count of all stripes.

198

threadPools.<executor_name>.TotalQueueSize

number

Total queue size of all stripes.
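
The thread pool counters can be turned into a rough picture of how busy an executor is. The Python sketch below is one possible way to do that. The function name `pool_pressure` is hypothetical, and because ActiveCount, TaskCount, and CompletedTaskCount are approximate, the results are approximate as well.

```python
def pool_pressure(active_count, maximum_pool_size, task_count, completed_task_count):
    """Return (utilization, outstanding) for one executor.

    utilization: fraction of the allowed threads that are executing tasks.
    outstanding: tasks scheduled but not yet completed (queued or running).
    """
    utilization = active_count / maximum_pool_size if maximum_pool_size else 0.0
    # The counters are approximate and sampled at slightly different
    # moments, so clamp the difference at zero.
    outstanding = max(task_count - completed_task_count, 0)
    return utilization, outstanding


# Illustrative values only.
print(pool_pressure(6, 8, 1_200, 1_150))
```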

tx:

# Metric Units Description

199

tx.AllOwnerTransactions

Map of transactions owned by the local node.

200

tx.LockedKeysNumber

number

The number of keys locked on the node.

201

tx.OwnerTransactionsNumber

number

The number of active transactions for which this node is the initiator.

202

tx.TransactionsHoldingLockNumber

number

The number of active transactions holding at least one key lock.

203

tx.commitTime

ms

Last commit time.

204

tx.nodeSystemTimeHistogram

ms

Transaction system times on the node, represented as a histogram.

205

tx.nodeUserTimeHistogram

ms

Transaction user times on the node, represented as a histogram.

206

tx.rollbackTime

ms

Last rollback time.

207

tx.totalNodeSystemTime

ms

Total transaction system time on the node.

208

tx.totalNodeUserTime

ms

Total transaction user time on the node.

209

tx.txCommits

number

Number of transaction commits.

210

tx.txRollbacks

number

Number of transaction rollbacks.
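
Every metric name in the tables above starts with a category prefix (cache, cacheGroups, communication, compute, io, pme, sql, sys, threadPools, or tx). When processing a flat list of exported names, it can help to group them by that prefix. The following Python sketch shows one way to do it. The function name `group_by_category` is hypothetical, and the input is assumed to be a list of fully qualified metric names as they appear in this document.

```python
from collections import defaultdict


def group_by_category(metric_names):
    """Group fully qualified metric names by their top-level prefix."""
    groups = defaultdict(list)
    for name in metric_names:
        # Split only on the first dot: everything before it is the
        # category, everything after it is kept as-is.
        category, _, rest = name.partition(".")
        groups[category].append(rest)
    return dict(groups)


names = ["tx.txCommits", "tx.txRollbacks", "sys.CpuLoad", "sys.memory.heap.used"]
grouped = group_by_category(names)
print(grouped)
```

Splitting only on the first dot avoids making assumptions about dots inside cache, group, or index names.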