IEP Monitoring Metrics
IEP Monitoring
GridGain lets you monitor processes in the cluster and identify which user tasks are executed and which resources each task uses. The available metrics are listed in the tables below:
cache:
# | Metric | Units | Description |
---|---|---|---|
1 | cache.<cache_name>.CacheEvictions | number | Total number of cache evictions. |
2 | cache.<cache_name>.CacheGets | number | The total number of gets to the cache. |
3 | cache.<cache_name>.CacheHits | number | The number of get requests that were satisfied by the cache. |
4 | cache.<cache_name>.CacheMisses | number | The number of misses. A miss is a get request that is not satisfied by the cache. |
5 | cache.<cache_name>.CachePuts | number | The total number of puts to the cache. |
6 | cache.<cache_name>.CacheRemovals | number | The total number of removals from the cache. |
7 | cache.<cache_name>.CacheTxCommits | number | Total number of transaction commits. |
8 | cache.<cache_name>.CacheTxRollbacks | number | Total number of transaction rollbacks. |
9 | cache.<cache_name>.CommitTime | ns | Commit time in nanoseconds. |
10 | cache.<cache_name>.CommitTimeTotal | ns | The total commit time, in nanoseconds. |
11 | cache.<cache_name>.EntryProcessorHits | number | The total number of invocations on keys that exist in the cache. |
12 | cache.<cache_name>.EntryProcessorInvokeTimeNanos | ns | The total time of cache invocations, in nanoseconds. |
13 | cache.<cache_name>.EntryProcessorMaxInvocationTime | number | Maximum time to execute cache invokes. |
14 | cache.<cache_name>.EntryProcessorMinInvocationTime | number | Minimum time to execute cache invokes. |
15 | cache.<cache_name>.EntryProcessorMisses | number | The total number of invocations on keys that do not exist in the cache. |
16 | cache.<cache_name>.EntryProcessorPuts | number | The total number of cache invocations that caused an update. |
17 | cache.<cache_name>.EntryProcessorReadOnlyInvocations | number | The total number of cache invocations that caused no updates. |
18 | cache.<cache_name>.EntryProcessorRemovals | number | The total number of cache invocations that caused removals. |
19 | cache.<cache_name>.EstimatedRebalancingKeys | number | Estimated number of keys to be rebalanced. |
20 | cache.<cache_name>.GetTime | ns | Get time in nanoseconds. |
21 | cache.<cache_name>.GetTimeTotal | ns | The total time of cache gets, in nanoseconds. |
22 | cache.<cache_name>.OffHeapEvictions | number | The total number of evictions from the off-heap memory. |
23 | cache.<cache_name>.OffHeapGets | number | The total number of get requests to the off-heap memory. |
24 | cache.<cache_name>.OffHeapHits | number | The number of get requests that were satisfied by the off-heap memory. |
25 | cache.<cache_name>.OffHeapMisses | number | The total number of misses (get requests that are not satisfied by the off-heap memory). |
26 | cache.<cache_name>.OffHeapPuts | number | The total number of put requests to the off-heap memory. |
27 | cache.<cache_name>.OffHeapRemovals | number | The total number of removals from the off-heap memory. |
28 | cache.<cache_name>.PutTime | ns | Put time in nanoseconds. |
29 | cache.<cache_name>.PutTimeTotal | ns | The total time of cache puts, in nanoseconds. |
30 | cache.<cache_name>.QueryCompleted | number | Total number of successfully completed queries initiated by the node. |
31 | cache.<cache_name>.QueryExecuted | number | Total number of completed queries initiated by the node. |
32 | cache.<cache_name>.QueryFailed | number | Total number of failed queries initiated by the node. |
33 | cache.<cache_name>.QueryMaximumTime | ms | Maximum time spent on a single query initiated by the node. |
34 | cache.<cache_name>.QueryMinimalTime | ms | Minimum time spent on a single query initiated by the node. |
35 | cache.<cache_name>.QuerySumTime | ms | Sum of the execution times of all queries initiated by the node. |
36 | cache.<cache_name>.RebalanceClearingPartitionsLeft | number | Number of partitions that need to be cleared before the actual rebalance starts. |
37 | cache.<cache_name>.RebalanceStartTime | | Rebalance start time. |
38 | cache.<cache_name>.RebalancedKeys | number | Number of already rebalanced keys. |
39 | cache.<cache_name>.RebalancingBytesRate | bytes/min | Estimated rebalancing speed in bytes. |
40 | cache.<cache_name>.RebalancingKeysRate | number of keys/min | Estimated rebalancing speed in keys. |
41 | cache.<cache_name>.RemoveTime | ns | Remove time in nanoseconds. |
42 | cache.<cache_name>.RemoveTimeTotal | ns | The total time of cache removals, in nanoseconds. |
43 | cache.<cache_name>.RollbackTime | ns | Rollback time in nanoseconds. |
44 | cache.<cache_name>.RollbackTimeTotal | ns | The total rollback time, in nanoseconds. |
45 | cache.<cache_name>.TotalRebalancedBytes | number | Number of already rebalanced bytes. |
46 | cache.<cache_name>.TxKeyCollisions | | List of cache keys with a large number of lock collisions and a long wait queue. |
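
The raw cache counters are easier to interpret once they are combined into ratios. The sketch below is illustrative only: it assumes the metric values have already been exported into a plain dictionary keyed by metric name (how you export them is outside this example), that `CacheGets` counts both hits and misses, and that dividing `GetTimeTotal` by `CacheGets` gives a meaningful average. The function name `cache_ratios` and the cache name `orders` are hypothetical.

```python
def cache_ratios(metrics, cache_name):
    """Derive a hit ratio and an average get time from raw cache counters.

    `metrics` is assumed to be a dict of already-collected metric values;
    this helper is a hypothetical illustration, not a GridGain API.
    """
    prefix = f"cache.{cache_name}."
    gets = metrics.get(prefix + "CacheGets", 0)
    hits = metrics.get(prefix + "CacheHits", 0)
    get_time_total_ns = metrics.get(prefix + "GetTimeTotal", 0)
    return {
        # Avoid dividing by zero when the cache has not been read yet.
        "hit_ratio": hits / gets if gets else None,
        "avg_get_time_ns": get_time_total_ns / gets if gets else None,
    }


# Made-up sample values for a cache named "orders".
sample = {
    "cache.orders.CacheGets": 200,
    "cache.orders.CacheHits": 150,
    "cache.orders.GetTimeTotal": 1_000_000,
}
print(cache_ratios(sample, "orders"))
```

Returning `None` for an idle cache keeps "no data yet" distinct from a real ratio of zero.
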
cacheGroups:
# | Metric | Units | Description |
---|---|---|---|
47 | cacheGroups.<group_name>.AffinityPartitionsAssignmentMap | | Affinity partitions assignment map. |
48 | cacheGroups.<group_name>.Caches | | List of caches in the cache group. |
49 | cacheGroups.<group_name>.IndexBuildCountPartitionsLeft | number | Number of partitions that still need to be processed to finish index creation or rebuilding. |
50 | cacheGroups.<group_name>.InitializedLocalPartitionsNumber | number | Number of local partitions initialized on the current node. |
51 | cacheGroups.<group_name>.LocalNodeMovingPartitionsCount | number | Count of partitions with state MOVING for this cache group located on this node. |
52 | cacheGroups.<group_name>.LocalNodeOwningPartitionsCount | number | Count of partitions with state OWNING for this cache group located on this node. |
53 | cacheGroups.<group_name>.LocalNodeRentingEntriesCount | number | Count of entries remaining to be evicted in RENTING partitions located on this node for this cache group. |
54 | cacheGroups.<group_name>.LocalNodeRentingPartitionsCount | number | Count of partitions with state RENTING for this cache group located on this node. |
55 | cacheGroups.<group_name>.MaximumNumberOfPartitionCopies | number | Maximum number of partition copies for all partitions of this cache group. |
56 | cacheGroups.<group_name>.MinimumNumberOfPartitionCopies | number | Minimum number of partition copies for all partitions of this cache group. |
57 | cacheGroups.<group_name>.MovingPartitionsAllocationMap | | Allocation map of partitions with the MOVING state in the cluster. |
58 | cacheGroups.<group_name>.OwningPartitionsAllocationMap | | Allocation map of partitions with the OWNING state in the cluster. |
59 | cacheGroups.<group_name>.PartitionIds | | Local partition IDs. |
60 | cacheGroups.<group_name>.SparseStorageSize | bytes | Storage space allocated for a group adjusted for possible sparsity, in bytes. |
61 | cacheGroups.<group_name>.StorageSize | bytes | Storage space allocated for a group, in bytes. |
62 | cacheGroups.<group_name>.TotalAllocatedPages | number | Total number of pages allocated for the cache group. |
63 | cacheGroups.<group_name>.TotalAllocatedSize | bytes | Total size of memory allocated for the group, in bytes. |
communication:
# | Metric | Units | Description |
---|---|---|---|
64 | communication.tcp.<node_id>.receivedMessagesFromNode | number | Total number of messages received by the current node from the given node. |
65 | communication.tcp.<node_id>.sentMessagesToNode | number | Total number of messages sent by the current node to the given node. |
66 | communication.tcp.outboundMessagesQueueSize | number | Number of messages waiting to be sent. |
67 | communication.tcp.receivedBytes | bytes | Total number of bytes received by the current node. |
68 | communication.tcp.receivedMessagesByType.<message_id> | number | Total number of messages of the given type received by the current node. |
69 | communication.tcp.receivedMessagesCount | number | Total number of messages received by the current node. |
70 | communication.tcp.sentBytes | bytes | Total number of bytes sent by the current node. |
71 | communication.tcp.sentMessagesByType.<message_id> | number | Total number of messages of the given type sent by the current node. |
72 | communication.tcp.sentMessagesCount | number | Total number of messages sent by the current node. |
compute:
# | Metric | Units | Description |
---|---|---|---|
73 | compute.jobs.Active | number | Number of active jobs currently executing. |
74 | compute.jobs.Canceled | number | Number of canceled jobs that are still running. |
75 | compute.jobs.ExecutionTime | ms | Total execution time of jobs. |
76 | compute.jobs.Finished | number | Number of finished jobs. |
77 | compute.jobs.Rejected | number | Number of jobs rejected after the most recent collision resolution operation. |
78 | compute.jobs.Started | number | Number of started jobs. |
79 | compute.jobs.Waiting | number | Number of currently queued jobs waiting to be executed. |
80 | compute.jobs.WaitingTime | ms | Total time jobs spent in the waiting queue. |
io:
# | Metric | Units | Description |
---|---|---|---|
81 | io.communication.OutboundMessagesQueueSize | number | Outbound messages queue size. |
82 | io.communication.ReceivedBytesCount | bytes | Received bytes count. |
83 | io.communication.ReceivedMessagesCount | number | Received messages count. |
84 | io.communication.SentBytesCount | bytes | Sent bytes count. |
85 | io.communication.SentMessagesCount | number | Sent messages count. |
86 | io.dataregion.<region_name>.AllocationRate | number/s | Allocation rate (pages per second) averaged across rateTimeInterval. |
87 | io.dataregion.<region_name>.CheckpointBufferSize | bytes | Checkpoint buffer size in bytes. |
88 | io.dataregion.<region_name>.DirtyPages | number | Number of pages in memory not yet synchronized with persistent storage. |
89 | io.dataregion.<region_name>.EmptyDataPages | number | Number of empty data pages in the region. Only completely free pages that can be reused are counted (e.g., pages contained in the reuse bucket of the free list). |
90 | io.dataregion.<region_name>.EvictionRate | number/s | Eviction rate (pages per second). |
91 | io.dataregion.<region_name>.LargeEntriesPagesCount | number | Count of pages that are fully occupied by large entries that exceed the page size. |
92 | io.dataregion.<region_name>.OffHeapSize | bytes | Off-heap size in bytes. |
93 | io.dataregion.<region_name>.OffheapUsedSize | bytes | Off-heap used size in bytes. |
94 | io.dataregion.<region_name>.PagesFillFactor | % | The percentage of used space. |
95 | io.dataregion.<region_name>.PagesRead | number | Number of pages read since the last restart. |
96 | io.dataregion.<region_name>.PagesReplaceAge | ms | Average age at which pages in memory are replaced with pages from persistent storage (milliseconds). |
97 | io.dataregion.<region_name>.PagesReplaceRate | number/s | Rate at which pages in memory are replaced with pages from persistent storage (pages per second). |
98 | io.dataregion.<region_name>.PagesReplaced | number | Number of pages replaced since the last restart. |
99 | io.dataregion.<region_name>.PagesWritten | number | Number of pages written since the last restart. |
100 | io.dataregion.<region_name>.PhysicalMemoryPages | number | Number of pages residing in physical RAM. |
101 | io.dataregion.<region_name>.PhysicalMemorySize | bytes | Total size of pages loaded into RAM, in bytes. |
102 | io.dataregion.<region_name>.TotalAllocatedPages | number | Total number of allocated pages. |
103 | io.dataregion.<region_name>.TotalAllocatedSize | bytes | Total size of memory allocated in the data region, in bytes. |
104 | io.dataregion.<region_name>.UsedCheckpointBufferSize | bytes | Used checkpoint buffer size in bytes. |
105 | io.datastorage.CheckpointTotalTime | ms | Total duration of checkpoint. |
106 | io.datastorage.LastCheckpointCopiedOnWritePagesNumber | number | Number of pages copied to a temporary checkpoint buffer during the last checkpoint. |
107 | io.datastorage.LastCheckpointDataPagesNumber | number | Total number of data pages written during the last checkpoint. |
108 | io.datastorage.LastCheckpointDuration | ms | Duration of the last checkpoint in milliseconds. |
109 | io.datastorage.LastCheckpointFsyncDuration | ms | Duration of the sync phase of the last checkpoint in milliseconds. |
110 | io.datastorage.LastCheckpointLockWaitDuration | ms | Duration of the checkpoint lock wait in milliseconds. |
111 | io.datastorage.LastCheckpointMarkDuration | ms | Duration of the checkpoint mark in milliseconds. |
112 | io.datastorage.LastCheckpointPagesWriteDuration | ms | Duration of the checkpoint pages write in milliseconds. |
113 | io.datastorage.LastCheckpointTotalPagesNumber | number | Total number of pages written during the last checkpoint. |
114 | io.datastorage.SparseStorageSize | bytes | Storage space allocated adjusted for possible sparsity, in bytes. |
115 | io.datastorage.StorageSize | bytes | Storage space allocated, in bytes. |
116 | io.datastorage.WalArchiveSegments | number | Current number of WAL segments in the WAL archive. |
117 | io.datastorage.WalBuffPollSpinsRate | number | WAL buffer poll spins number over the last time interval. |
118 | io.datastorage.WalFsyncTimeDuration | ms | Total duration of fsync. |
119 | io.datastorage.WalFsyncTimeNum | number | Total count of fsync operations. |
120 | io.datastorage.WalLastRollOverTime | timestamp | Time of the last WAL segment rollover. |
121 | io.datastorage.WalLoggingRate | number/s | Average number of WAL records per second written during the last time interval. |
122 | io.datastorage.WalTotalSize | bytes | Total size of the storage WAL files, in bytes. |
123 | io.datastorage.WalWritingRate | bytes/s | Average number of bytes per second written during the last time interval. |
124 | io.statistics.cacheGroups.<group_name>.LOGICAL_READS | number | Number of times a page was read regardless of whether the page was in memory or not. |
125 | io.statistics.cacheGroups.<group_name>.PHYSICAL_READS | number | Number of times a page was read from disk to memory. |
126 | io.statistics.cacheGroups.<group_name>.grpId | string | Group identifier. |
127 | io.statistics.cacheGroups.<group_name>.name | string | Group name. |
128 | io.statistics.cacheGroups.<group_name>.startTime | timestamp | Group start timestamp. |
129 | io.statistics.hashIndexes.<cache_name>.<index_name>.LOGICAL_READS_INNER | number | Number of times an inner index page was read regardless of whether the page was in memory or not. |
130 | io.statistics.hashIndexes.<cache_name>.<index_name>.LOGICAL_READS_LEAF | number | Number of times a leaf index page was read regardless of whether the page was in memory or not. |
131 | io.statistics.hashIndexes.<cache_name>.<index_name>.PHYSICAL_READS_INNER | number | Number of times an inner index page was read from disk to memory. |
132 | io.statistics.hashIndexes.<cache_name>.<index_name>.PHYSICAL_READS_LEAF | number | Number of times a leaf index page was read from disk to memory. |
133 | io.statistics.hashIndexes.<cache_name>.<index_name>.indexName | string | Index name. |
134 | io.statistics.hashIndexes.<cache_name>.<index_name>.name | string | Cache name. |
135 | io.statistics.hashIndexes.<cache_name>.<index_name>.startTime | timestamp | Index creation time. |
136 | io.statistics.sortedIndexes.<cache_name>.<index_name>.LOGICAL_READS_INNER | number | Number of times an inner index page was read regardless of whether the page was in memory or not. |
137 | io.statistics.sortedIndexes.<cache_name>.<index_name>.LOGICAL_READS_LEAF | number | Number of times a leaf index page was read regardless of whether the page was in memory or not. |
138 | io.statistics.sortedIndexes.<cache_name>.<index_name>.PHYSICAL_READS_INNER | number | Number of times an inner index page was read from disk to memory. |
139 | io.statistics.sortedIndexes.<cache_name>.<index_name>.PHYSICAL_READS_LEAF | number | Number of times a leaf index page was read from disk to memory. |
140 | io.statistics.sortedIndexes.<cache_name>.<index_name>.indexName | string | Index name. |
141 | io.statistics.sortedIndexes.<cache_name>.<index_name>.name | string | Cache name. |
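
Several of the io metrics are most useful in pairs. The following sketch shows one possible way to combine them: the share of page reads that had to go to disk, how full the checkpoint buffer is, and the mean fsync duration. It assumes the metric values are already available in a plain dictionary; the helper names and the region and group names (`default`, `orders`) are hypothetical and not part of any GridGain API.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def io_summary(metrics, region_name, group_name):
    """Combine paired io metrics into ratios (hypothetical helper)."""
    region = f"io.dataregion.{region_name}."
    group = f"io.statistics.cacheGroups.{group_name}."
    return {
        # Fraction of logical page reads that required a read from disk.
        "physical_read_ratio": _ratio(
            metrics.get(group + "PHYSICAL_READS", 0),
            metrics.get(group + "LOGICAL_READS", 0),
        ),
        # Fraction of the checkpoint buffer currently in use.
        "checkpoint_buffer_usage": _ratio(
            metrics.get(region + "UsedCheckpointBufferSize", 0),
            metrics.get(region + "CheckpointBufferSize", 0),
        ),
        # Mean duration of a single WAL fsync, in milliseconds.
        "avg_wal_fsync_ms": _ratio(
            metrics.get("io.datastorage.WalFsyncTimeDuration", 0),
            metrics.get("io.datastorage.WalFsyncTimeNum", 0),
        ),
    }


# Made-up sample values.
sample_io = {
    "io.statistics.cacheGroups.orders.LOGICAL_READS": 1000,
    "io.statistics.cacheGroups.orders.PHYSICAL_READS": 250,
    "io.dataregion.default.CheckpointBufferSize": 1024,
    "io.dataregion.default.UsedCheckpointBufferSize": 256,
    "io.datastorage.WalFsyncTimeDuration": 300,
    "io.datastorage.WalFsyncTimeNum": 100,
}
print(io_summary(sample_io, "default", "orders"))
```
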
pme:
# | Metric | Units | Description |
---|---|---|---|
142 | pme.CacheOperationsBlockedDuration | ms | Current PME cache operations blocked duration in milliseconds. |
144 | pme.CacheOperationsBlockedDurationHistogram | ms | Histogram of cache operations blocked PME durations in milliseconds. |
145 | pme.Duration | ms | Current PME duration in milliseconds. |
146 | pme.DurationHistogram | ms | Histogram of PME durations in milliseconds. |
sql:
# | Metric | Units | Description |
---|---|---|---|
147 | sql.memory.quotas.OffloadedQueriesNumber | number | Number of queries that were offloaded to disk locally. |
148 | sql.memory.quotas.OffloadingRead | bytes | Number of bytes read from the disk during SQL query offloading. |
149 | sql.memory.quotas.OffloadingWritten | bytes | Number of bytes written to the disk during SQL query offloading. |
150 | sql.memory.quotas.freeMem | bytes | Amount of memory left available for the queries on this node, in bytes (negative value if SQL memory quotas are disabled). |
151 | sql.memory.quotas.maxMem | bytes | Total amount of memory available for all queries on the current node (negative value if SQL memory quotas are disabled). |
152 | sql.memory.quotas.requests | number | Total number of times memory quota has been requested on the current node by all the queries. |
153 | sql.parser.cache.hits | number | Number of hits for queries cache. |
154 | sql.parser.cache.misses | number | Number of misses for queries cache. |
155 | sql.queries.user.canceled | number | Number of canceled queries initiated by the current node. This number is included in the general 'failed' metric. |
156 | sql.queries.user.failed | number | Number of failed queries (including OOME) initiated by the current node. |
157 | sql.queries.user.failedByOOM | number | Number of queries failed due to out of memory protection initiated by the current node. This number is included in the general 'failed' metric. |
158 | sql.queries.user.success | number | Number of successfully executed queries initiated by the current node. |
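
Because `canceled` and `failedByOOM` are both included in `failed`, a breakdown of query outcomes has to subtract them rather than add them. The sketch below illustrates that arithmetic under two assumptions that the table does not state: that the total number of queries is `success + failed`, and that a query is never counted as both canceled and failed by OOM. The function name `sql_query_summary` and the sample values are hypothetical.

```python
def sql_query_summary(metrics):
    """Break down SQL query outcomes from the sql.queries.user.* counters.

    Hypothetical helper; `metrics` is a dict of already-collected values.
    """
    prefix = "sql.queries.user."
    success = metrics.get(prefix + "success", 0)
    failed = metrics.get(prefix + "failed", 0)
    canceled = metrics.get(prefix + "canceled", 0)
    failed_by_oom = metrics.get(prefix + "failedByOOM", 0)

    # Assumption: every finished query is counted as either success or failed.
    total = success + failed
    return {
        "total": total,
        # 'failed' already includes canceled and OOM failures, so subtract them.
        "other_failures": failed - canceled - failed_by_oom,
        "success_ratio": success / total if total else None,
    }


# Made-up sample values.
sample_sql = {
    "sql.queries.user.success": 90,
    "sql.queries.user.failed": 10,
    "sql.queries.user.canceled": 3,
    "sql.queries.user.failedByOOM": 2,
}
print(sql_query_summary(sample_sql))
```
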
sys:
# | Metric | Units | Description |
---|---|---|---|
159 | sys.CpuLoad | % | CPU load. |
160 | sys.CurrentThreadCpuTime | ns | Total CPU time for the current thread in nanoseconds. |
161 | sys.CurrentThreadUserTime | ns | CPU time that the current thread has executed in user mode in nanoseconds. |
162 | sys.DaemonThreadCount | number | Number of live daemon threads. |
163 | sys.GcCpuLoad | % | GC CPU load. |
164 | sys.PeakThreadCount | number | Peak number of live JVM threads. |
165 | sys.SystemLoadAverage | % | System load average reported by the JVM OS MBean. |
166 | sys.ThreadCount | number | Number of live JVM threads. |
167 | sys.TotalExecutedTasks | number | Total executed tasks. |
168 | sys.TotalStartedThreadCount | number | Total number of threads created and started since the JVM started. |
169 | sys.UpTime | ms | JVM uptime. |
170 | sys.memory.heap.committed | bytes | Amount of memory that is committed for the JVM to use in bytes. |
171 | sys.memory.heap.init | bytes | Amount of memory that the JVM initially requests from the operating system for memory management in bytes. |
172 | sys.memory.heap.max | bytes | Maximum amount of memory that can be used for memory management in bytes. |
173 | sys.memory.heap.used | bytes | Amount of used memory in bytes. |
174 | sys.memory.nonheap.committed | bytes | Amount of memory that is committed for the JVM to use in bytes. |
175 | sys.memory.nonheap.init | bytes | Amount of memory that the JVM initially requests from the operating system for memory management in bytes. |
176 | sys.memory.nonheap.max | bytes | Maximum amount of memory that can be used for memory management in bytes. |
177 | sys.memory.nonheap.used | bytes | Amount of used memory in bytes. |
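
The heap metrics can be combined into a single utilization figure. This is a minimal sketch that assumes the values have been collected into a dictionary, and that a non-positive `max` means the limit is undefined (the JVM's own memory API reports an undefined maximum as -1; whether this metric does the same is an assumption here). The function name `heap_utilization` is hypothetical.

```python
def heap_utilization(metrics):
    """Return used heap as a fraction of the maximum heap, or None if unknown.

    Hypothetical helper; `metrics` is a dict of already-collected values.
    """
    used = metrics.get("sys.memory.heap.used", 0)
    maximum = metrics.get("sys.memory.heap.max", 0)
    # Assumption: a non-positive maximum means the limit is undefined.
    if maximum <= 0:
        return None
    return used / maximum


# Made-up sample values, in bytes.
sample_sys = {
    "sys.memory.heap.used": 512 * 1024 * 1024,
    "sys.memory.heap.max": 2048 * 1024 * 1024,
}
print(heap_utilization(sample_sys))
```
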
threadPools:
# | Metric | Units | Description |
---|---|---|---|
178 | threadPools.<executor_name>.ActiveCount | number | Approximate number of threads that are actively executing tasks. |
179 | threadPools.<executor_name>.CompletedTaskCount | number | Approximate total number of tasks that have completed execution. |
180 | threadPools.<executor_name>.CorePoolSize | number | The core number of threads. |
181 | threadPools.<executor_name>.DetectStarvation | boolean | True if starvation in a striped pool is detected. |
182 | threadPools.<executor_name>.KeepAliveTime | ms | Thread keep-alive time, which is the amount of time that threads in excess of the core pool size may remain idle before being terminated. |
183 | threadPools.<executor_name>.LargestPoolSize | number | Largest number of threads that have ever simultaneously been in the pool. |
184 | threadPools.<executor_name>.MaximumPoolSize | number | Maximum number of allowed threads. |
185 | threadPools.<executor_name>.PoolSize | number | Current number of threads in the pool. |
186 | threadPools.<executor_name>.QueueSize | number | Current size of the execution queue. |
187 | threadPools.<executor_name>.RejectedExecutionHandlerClass | string | Class name of the current rejection handler. |
188 | threadPools.<executor_name>.Shutdown | boolean | True if the executor has been shut down. |
189 | threadPools.<executor_name>.StripesActiveStatuses | number | Number of active tasks per stripe. |
190 | threadPools.<executor_name>.StripesCompletedTasksCounts | number | Number of completed tasks per stripe. |
191 | threadPools.<executor_name>.StripesCount | number | Stripes count. |
192 | threadPools.<executor_name>.StripesQueueSizes | number | Size of queue per stripe. |
193 | threadPools.<executor_name>.TaskCount | number | Approximate total number of tasks that have been scheduled for execution. |
194 | threadPools.<executor_name>.Terminated | boolean | True if all tasks have completed following shutdown. |
195 | threadPools.<executor_name>.Terminating | boolean | True if terminating but not yet terminated. |
196 | threadPools.<executor_name>.ThreadFactoryClass | string | Class name of the thread factory used to create new threads. |
197 | threadPools.<executor_name>.TotalCompletedTasksCount | number | Completed tasks count of all stripes. |
198 | threadPools.<executor_name>.TotalQueueSize | number | Total queue size of all stripes. |
tx:
# | Metric | Units | Description |
---|---|---|---|
199 | tx.AllOwnerTransactions | | Map of local node owning transactions. |
200 | tx.LockedKeysNumber | number | The number of keys locked on the node. |
201 | tx.OwnerTransactionsNumber | number | The number of active transactions for which this node is the initiator. |
202 | tx.TransactionsHoldingLockNumber | number | The number of active transactions holding at least one key lock. |
203 | tx.commitTime | ms | Last commit time. |
204 | tx.nodeSystemTimeHistogram | ms | Transactions system times on node represented as histogram. |
205 | tx.nodeUserTimeHistogram | ms | Transactions user times on node represented as histogram. |
206 | tx.rollbackTime | ms | Last rollback time. |
207 | tx.totalNodeSystemTime | ms | Total transactions system time on node. |
208 | tx.totalNodeUserTime | ms | Total transactions user time on node. |
209 | tx.txCommits | number | Number of transaction commits. |
210 | tx.txRollbacks | number | Number of transaction rollbacks. |