»Metrics

The Nomad agent collects various runtime metrics about the performance of different libraries and subsystems. These metrics are aggregated on a ten-second interval and are retained for one minute.

This data can be accessed via an HTTP endpoint or by sending a signal to the Nomad process.

As of Nomad version 0.7, this data is available via HTTP at `/metrics`. See the Metrics API documentation for more information.

To view this data by sending a signal to the Nomad process: on Unix this is `USR1`, while on Windows it is `BREAK`. Once Nomad receives the signal, it will dump the current telemetry information to the agent's stderr.

This telemetry information can be used for debugging or otherwise getting a better view of what Nomad is doing.

Telemetry information can be streamed to both statsite and statsd by providing the appropriate configuration options.

To configure the telemetry output, please see the agent configuration.
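As an illustrative sketch (the stanza keys come from Nomad's telemetry configuration; the file name and sink addresses are placeholders), a minimal configuration that streams metrics to statsite and statsd might look like:

```shell
# Write a minimal agent configuration fragment that streams telemetry to
# statsite and statsd. File name and sink addresses are placeholders.
cat > telemetry.hcl <<'EOF'
telemetry {
  statsite_address = "127.0.0.1:8125"
  statsd_address   = "127.0.0.1:8125"
}
EOF

# The fragment would then be loaded alongside the rest of the agent
# configuration, e.g.: nomad agent -config=telemetry.hcl ...
cat telemetry.hcl
```

Both sinks accept a `host:port` address; either key may be used on its own if only one aggregator is in use.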

Below is sample output of a telemetry dump:

```
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_blocked': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.plan.queue_depth': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.malloc_count': 7568.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_runs': 8.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_ready': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.num_goroutines': 56.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.sys_bytes': 3999992.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.heap_objects': 4135.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.heartbeat.active': 1.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_unacked': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_waiting': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.alloc_bytes': 634056.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.free_count': 3433.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_pause_ns': 6572135.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.memberlist.msg.alive': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.serf.member.join': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.barrier': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.apply': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.nomad.rpc.query': Count: 2 Sum: 2.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Query': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.fsm.register_node': Count: 1 Sum: 1.296
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Intent': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.runtime.gc_pause_ns': Count: 8 Min: 126492.000 Mean: 821516.875 Max: 3126670.000 Stddev: 1139250.294 Sum: 6572135.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.leader.dispatchLog': Count: 3 Min: 0.007 Mean: 0.018 Max: 0.039 Stddev: 0.018 Sum: 0.054
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcileMember': Count: 1 Sum: 0.007
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcile': Count: 1 Sum: 0.025
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.fsm.apply': Count: 1 Sum: 1.306
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.get_allocs': Count: 1 Sum: 0.110
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.worker.dequeue_eval': Count: 29 Min: 0.003 Mean: 363.426 Max: 503.377 Stddev: 228.126 Sum: 10539.354
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Event': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.commitTime': Count: 3 Min: 0.013 Mean: 0.037 Max: 0.079 Stddev: 0.037 Sum: 0.110
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.barrier': Count: 1 Sum: 0.071
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.register': Count: 1 Sum: 1.626
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.eval.dequeue': Count: 21 Min: 500.610 Mean: 501.753 Max: 503.361 Stddev: 1.030 Sum: 10536.813
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.memberlist.gossip': Count: 12 Min: 0.009 Mean: 0.017 Max: 0.025 Stddev: 0.005 Sum: 0.204
```
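Each line of the dump carries a bracketed type tag — `[G]`, `[C]`, or `[S]` — which appears to correspond to gauges, counters, and timer/sample summaries respectively (an inference from the output above, not something stated elsewhere in this document). As a sketch, the format can be pulled apart with standard tools:

```shell
# Two lines copied from the sample dump above, saved to a file.
cat > dump.txt <<'EOF'
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.num_goroutines': 56.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.apply': Count: 1 Sum: 1.000
EOF

# Split each line on the single quotes around the metric name: field 1 holds
# the timestamp and type tag, field 2 the metric name. The type tag is the
# character that follows "][".
awk -F"'" '{ split($1, a, "\\]\\["); print substr(a[2], 1, 1), $2 }' dump.txt
# Expected output:
#   G nomad.runtime.num_goroutines
#   C nomad.raft.apply
```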

»Key Metrics

When telemetry is being streamed to statsite or statsd, `interval` is defined to be their flush interval. Otherwise, the interval can be assumed to be 10 seconds when retrieving metrics via the signals described above.

| Metric | Description | Unit | Type |
|--------|-------------|------|------|
| `nomad.runtime.num_goroutines` | Number of goroutines and general load pressure indicator | # of goroutines | Gauge |
| `nomad.runtime.alloc_bytes` | Memory utilization | # of bytes | Gauge |
| `nomad.runtime.heap_objects` | Number of objects on the heap. General memory pressure indicator | # of heap objects | Gauge |
| `nomad.raft.apply` | Number of Raft transactions | Raft transactions / `interval` | Counter |
| `nomad.raft.replication.appendEntries` | Raft transaction commit time | ms / Raft Log Append | Timer |
| `nomad.raft.leader.lastContact` | Time since last contact to leader. General indicator of Raft latency | ms / Leader Contact | Timer |
| `nomad.broker.total_ready` | Number of evaluations ready to be processed | # of evaluations | Gauge |
| `nomad.broker.total_unacked` | Evaluations dispatched for processing but incomplete | # of evaluations | Gauge |
| `nomad.broker.total_blocked` | Evaluations that are blocked until an existing evaluation for the same job completes | # of evaluations | Gauge |
| `nomad.plan.queue_depth` | Number of scheduler Plans waiting to be evaluated | # of plans | Gauge |
| `nomad.plan.submit` | Time to submit a scheduler Plan. Higher values cause lower scheduling throughput | ms / Plan Submit | Timer |
| `nomad.plan.evaluate` | Time to validate a scheduler Plan. Higher values cause lower scheduling throughput. Similar to `nomad.plan.submit` but does not include RPC time or time in the Plan Queue | ms / Plan Evaluation | Timer |
| `nomad.worker.invoke_scheduler.<type>` | Time to run the scheduler of the given type | ms / Scheduler Run | Timer |
| `nomad.worker.wait_for_index` | Time waiting for Raft log replication from leader. High delays result in lower scheduling throughput | ms / Raft Index Wait | Timer |
| `nomad.heartbeat.active` | Number of active heartbeat timers. Each timer represents a Nomad Client connection | # of heartbeat timers | Gauge |
| `nomad.heartbeat.invalidate` | The length of time it takes to invalidate a Nomad Client due to failed heartbeats | ms / Heartbeat Invalidation | Timer |
| `nomad.rpc.query` | Number of RPC queries | RPC Queries / `interval` | Counter |
| `nomad.rpc.request` | Number of RPC requests being handled | RPC Requests / `interval` | Counter |
| `nomad.rpc.request_error` | Number of RPC requests being handled that result in an error | RPC Errors / `interval` | Counter |

»Client Metrics

The Nomad client emits metrics related to the resource usage of the allocations and tasks running on it and of the node itself. Operators have to explicitly turn on publishing host and allocation metrics: publishing allocation and host metrics can be turned on by setting the value of `publish_allocation_metrics` and `publish_node_metrics` to true.

By default the collection interval is 1 second, but it can be changed by setting the value of the `collection_interval` key in the telemetry configuration block.

Please see the agent configuration page for more details.
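As a sketch using the configuration keys named above (the file name is a placeholder), enabling host and allocation metrics with a one-second collection interval might look like:

```shell
# Write a client telemetry fragment that turns on publishing of allocation
# and node (host) metrics, and sets the collection interval to one second.
cat > client-telemetry.hcl <<'EOF'
telemetry {
  publish_allocation_metrics = true
  publish_node_metrics       = true
  collection_interval        = "1s"
}
EOF
cat client-telemetry.hcl
```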

As of Nomad 0.9, Nomad will emit additional labels for parameterized and periodic jobs. Nomad emits the parent job id as a new label, `parent_id`. Also, the labels `dispatch_id` and `periodic_id` are emitted, containing the ID of the specific invocation of the parameterized or periodic job respectively. For example, a dispatch job with the id `myjob/dispatch-1312323423423` will have the following labels:

| Label | Value |
|-------|-------|
| job | `myjob/dispatch-1312323423423` |
| parent_id | `myjob` |
| dispatch_id | `1312323423423` |

»Host Metrics (post Nomad version 0.7)

Starting in version 0.7, Nomad emits tagged metrics in the format below:

| Metric | Description | Unit | Type | Labels |
|--------|-------------|------|------|--------|
| `nomad.client.allocated.cpu` | Total amount of CPU shares the scheduler has allocated to tasks | MHz | Gauge | node_id, datacenter |
| `nomad.client.unallocated.cpu` | Total amount of CPU shares free for the scheduler to allocate to tasks | MHz | Gauge | node_id, datacenter |
| `nomad.client.allocated.memory` | Total amount of memory the scheduler has allocated to tasks | Megabytes | Gauge | node_id, datacenter |
| `nomad.client.unallocated.memory` | Total amount of memory free for the scheduler to allocate to tasks | Megabytes | Gauge | node_id, datacenter |
| `nomad.client.allocated.disk` | Total amount of disk space the scheduler has allocated to tasks | Megabytes | Gauge | node_id, datacenter |
| `nomad.client.unallocated.disk` | Total amount of disk space free for the scheduler to allocate to tasks | Megabytes | Gauge | node_id, datacenter |
| `nomad.client.allocated.network` | Total amount of bandwidth the scheduler has allocated to tasks on the given device | Megabits | Gauge | node_id, datacenter, device |
| `nomad.client.unallocated.network` | Total amount of bandwidth free for the scheduler to allocate to tasks on the given device | Megabits | Gauge | node_id, datacenter, device |
| `nomad.client.host.memory.total` | Total amount of physical memory on the node | Bytes | Gauge | node_id, datacenter |
| `nomad.client.host.memory.available` | Total amount of memory available to processes, which includes free and cached memory | Bytes | Gauge | node_id, datacenter |
| `nomad.client.host.memory.used` | Amount of memory used by processes | Bytes | Gauge | node_id, datacenter |
| `nomad.client.host.memory.free` | Amount of memory which is free | Bytes | Gauge | node_id, datacenter |
| `nomad.client.uptime` | Uptime of the host running the Nomad client | Seconds | Gauge | node_id, datacenter |
| `nomad.client.host.cpu.total` | Total CPU utilization | Percentage | Gauge | node_id, datacenter, cpu |
| `nomad.client.host.cpu.user` | CPU utilization in the user space | Percentage | Gauge | node_id, datacenter, cpu |
| `nomad.client.host.cpu.system` | CPU utilization in the system space | Percentage | Gauge | node_id, datacenter, cpu |
| `nomad.client.host.cpu.idle` | Idle time spent by the CPU | Percentage | Gauge | node_id, datacenter, cpu |
| `nomad.client.host.disk.size` | Total size of the device | Bytes | Gauge | node_id, datacenter, disk |
| `nomad.client.host.disk.used` | Amount of space which has been used | Bytes | Gauge | node_id, datacenter, disk |
| `nomad.client.host.disk.available` | Amount of space which is available | Bytes | Gauge | node_id, datacenter, disk |
| `nomad.client.host.disk.used_percent` | Percentage of disk space used | Percentage | Gauge | node_id, datacenter, disk |
| `nomad.client.host.disk.inodes_percent` | Disk space consumed by the inodes | Percent | Gauge | node_id, datacenter, disk |
| `nomad.client.allocs.start` | Number of allocations starting | Integer | Counter | node_id, job, task_group |
| `nomad.client.allocs.running` | Number of allocations starting to run | Integer | Counter | node_id, job, task_group |
| `nomad.client.allocs.failed` | Number of allocations failing | Integer | Counter | node_id, job, task_group |
| `nomad.client.allocs.restart` | Number of allocations restarting | Integer | Counter | node_id, job, task_group |
| `nomad.client.allocs.complete` | Number of allocations completing | Integer | Counter | node_id, job, task_group |
| `nomad.client.allocs.destroy` | Number of allocations being destroyed | Integer | Counter | node_id, job, task_group |

Nomad 0.9 adds an additional `node_class` label from the client's `NodeClass` attribute. This label is set to the string "none" if empty.

»Host Metrics (deprecated post Nomad 0.7)

The metrics below are those emitted by Nomad in versions prior to 0.7. Post-0.7, these metrics can still be emitted in this format (in addition to the new format detailed above), but any new metrics are only available in the new format.

| Metric | Description | Unit | Type |
|--------|-------------|------|------|
| `nomad.client.allocated.cpu.<HostID>` | Total amount of CPU shares the scheduler has allocated to tasks | MHz | Gauge |
| `nomad.client.unallocated.cpu.<HostID>` | Total amount of CPU shares free for the scheduler to allocate to tasks | MHz | Gauge |
| `nomad.client.allocated.memory.<HostID>` | Total amount of memory the scheduler has allocated to tasks | Megabytes | Gauge |
| `nomad.client.unallocated.memory.<HostID>` | Total amount of memory free for the scheduler to allocate to tasks | Megabytes | Gauge |
| `nomad.client.allocated.disk.<HostID>` | Total amount of disk space the scheduler has allocated to tasks | Megabytes | Gauge |
| `nomad.client.unallocated.disk.<HostID>` | Total amount of disk space free for the scheduler to allocate to tasks | Megabytes | Gauge |
| `nomad.client.allocated.network.<Device-Name>.<HostID>` | Total amount of bandwidth the scheduler has allocated to tasks on the given device | Megabits | Gauge |
| `nomad.client.unallocated.network.<Device-Name>.<HostID>` | Total amount of bandwidth free for the scheduler to allocate to tasks on the given device | Megabits | Gauge |
| `nomad.client.host.memory.<HostID>.total` | Total amount of physical memory on the node | Bytes | Gauge |
| `nomad.client.host.memory.<HostID>.available` | Total amount of memory available to processes, which includes free and cached memory | Bytes | Gauge |
| `nomad.client.host.memory.<HostID>.used` | Amount of memory used by processes | Bytes | Gauge |
| `nomad.client.host.memory.<HostID>.free` | Amount of memory which is free | Bytes | Gauge |
| `nomad.client.uptime.<HostID>` | Uptime of the host running the Nomad client | Seconds | Gauge |
| `nomad.client.host.cpu.<HostID>.<CPU-Core>.total` | Total CPU utilization | Percentage | Gauge |
| `nomad.client.host.cpu.<HostID>.<CPU-Core>.user` | CPU utilization in the user space | Percentage | Gauge |
| `nomad.client.host.cpu.<HostID>.<CPU-Core>.system` | CPU utilization in the system space | Percentage | Gauge |
| `nomad.client.host.cpu.<HostID>.<CPU-Core>.idle` | Idle time spent by the CPU | Percentage | Gauge |
| `nomad.client.host.disk.<HostID>.<Device-Name>.size` | Total size of the device | Bytes | Gauge |
| `nomad.client.host.disk.<HostID>.<Device-Name>.used` | Amount of space which has been used | Bytes | Gauge |
| `nomad.client.host.disk.<HostID>.<Device-Name>.available` | Amount of space which is available | Bytes | Gauge |
| `nomad.client.host.disk.<HostID>.<Device-Name>.used_percent` | Percentage of disk space used | Percentage | Gauge |
| `nomad.client.host.disk.<HostID>.<Device-Name>.inodes_percent` | Disk space consumed by the inodes | Percent | Gauge |

»Allocation Metrics

| Metric | Description | Unit | Type |
|--------|-------------|------|------|
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.rss` | Amount of RSS memory consumed by the task | Bytes | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.cache` | Amount of memory cached by the task | Bytes | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.swap` | Amount of memory swapped by the task | Bytes | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.max_usage` | Maximum amount of memory ever used by the task | Bytes | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.kernel_usage` | Amount of memory used by the kernel for this task | Bytes | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.kernel_max_usage` | Maximum amount of memory ever used by the kernel for this task | Bytes | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.total_percent` | Total CPU resources consumed by the task across all cores | Percentage | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.system` | Total CPU resources consumed by the task in the system space | Percentage | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.user` | Total CPU resources consumed by the task in the user space | Percentage | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.throttled_time` | Total time that the task was throttled | Nanoseconds | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.total_ticks` | CPU ticks consumed by the process in the last collection interval | Integer | Gauge |

»Job Summary Metrics

Job summary metrics are emitted by the Nomad leader server.

| Metric | Description | Unit | Type | Labels |
|--------|-------------|------|------|--------|
| `nomad.job_summary.queued` | Number of queued allocations for a job | Integer | Gauge | job, task_group |
| `nomad.job_summary.complete` | Number of complete allocations for a job | Integer | Gauge | job, task_group |
| `nomad.job_summary.failed` | Number of failed allocations for a job | Integer | Gauge | job, task_group |
| `nomad.job_summary.running` | Number of running allocations for a job | Integer | Gauge | job, task_group |
| `nomad.job_summary.starting` | Number of starting allocations for a job | Integer | Gauge | job, task_group |
| `nomad.job_summary.lost` | Number of lost allocations for a job | Integer | Gauge | job, task_group |

»Job Status Metrics

Job status metrics are emitted by the Nomad leader server.

| Metric | Description | Unit | Type |
|--------|-------------|------|------|
| `nomad.job_status.pending` | Number of pending jobs | Integer | Gauge |
| `nomad.job_status.running` | Number of running jobs | Integer | Gauge |
| `nomad.job_status.dead` | Number of dead jobs | Integer | Gauge |

»Metric Types

| Type | Description | Quantiles |
|------|-------------|-----------|
| Gauge | Gauge types report an absolute number at the end of the aggregation interval | false |
| Counter | Counts are incremented and flushed at the end of the aggregation interval and then are reset to zero | true |
| Timer | Timers measure the time to complete a task and will include quantiles, means, standard deviation, etc. per interval | true |

»Tagged Metrics

As of version 0.7, Nomad emits metrics in a tagged format. Each metric can support more than one tag, making it possible to match over metrics for datapoints such as a particular datacenter and return all metrics with that tag. Nomad supports labels for namespaces as well.