Sled-level resource metrics #9559

@jmcarp

As an operator, I want to be able to understand the state of the various physical resources on the rack. Is a physical CPU core heavily utilized, waiting on another resource, or queueing operations? Is a disk almost full, or saturating its IOPS? We have coverage for some of these metrics in OxQL for virtual machines (virtual_machine:, virtual_disk:), but less for physical resources on the sled.

As @rmustacc pointed out on a call yesterday, there's a lot of nuance to consider here. For example, RFD 526 goes into great detail on just a single resource type. But there's also low-hanging fruit that can produce value to operators more quickly: we can add basic metrics before enumerating all the telemetry we eventually want to include.

One potential starting point is to identify the physical resources of interest and collect USE metrics (utilization, saturation, errors), or analogously the four "golden signals", for each one. We could start with CPU, memory, disk, and network (although note that we already have some sled-level network metrics in sled_data_link:*).
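
To make the USE idea concrete, here is a rough sketch of what a per-resource reading could look like. Everything in it is hypothetical: the `CpuSnapshot` and `UseSample` types, their field names, and the `cpu_use` and `disk_use` helpers are illustrations only, not existing oximeter schemas or sled-agent APIs. It also assumes the sled exposes cumulative busy/idle counters per core, and it treats disk "utilization" as fullness (capacity used) rather than busy time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuSnapshot:
    """Cumulative counters for one core (hypothetical fields)."""
    busy_ns: int        # total time spent running work
    idle_ns: int        # total time spent idle
    run_queue_len: int  # threads waiting to run at sample time


@dataclass(frozen=True)
class UseSample:
    """One USE reading for a single physical resource."""
    resource: str
    utilization: float  # fraction in [0.0, 1.0]
    saturation: float   # queued work the resource has not serviced yet
    errors: int         # error events attributed to the resource


def cpu_use(core_id: int, prev: CpuSnapshot, curr: CpuSnapshot) -> UseSample:
    """Derive a USE reading from two snapshots of cumulative counters."""
    busy = curr.busy_ns - prev.busy_ns
    idle = curr.idle_ns - prev.idle_ns
    total = busy + idle
    # Guard against an empty interval (identical snapshots).
    utilization = busy / total if total > 0 else 0.0
    return UseSample(
        resource=f"cpu{core_id}",
        utilization=utilization,
        saturation=float(curr.run_queue_len),
        errors=0,  # assumption: no per-core error counter in this sketch
    )


def disk_use(name: str, used_bytes: int, capacity_bytes: int,
             queue_depth: int, error_count: int) -> UseSample:
    """Capacity-based USE reading: 'is the disk almost full?'"""
    utilization = used_bytes / capacity_bytes if capacity_bytes > 0 else 0.0
    return UseSample(
        resource=name,
        utilization=utilization,
        saturation=float(queue_depth),
        errors=error_count,
    )
```

For example, `cpu_use(0, CpuSnapshot(1000, 1000, 0), CpuSnapshot(1750, 1250, 3))` covers an interval with 750 ns busy and 250 ns idle, so it reports a utilization of 0.75 and a saturation of 3.0. A real implementation would likely track several utilization dimensions per resource (for a disk, both capacity and IOPS), which is part of the nuance mentioned above.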
