Conversation

@karencfv karencfv commented Nov 21, 2025

This is the first PR for #9412

This PR only implements checking SMF service health for each sled. There will be follow-up PRs that deal with adding the data types to the DB.

NB: This PR is less than 1,000 lines, not ~10,000. Most of the changes here are API JSON additions.

Manual testing on a simulated omicron

System has two unhealthy enabled services

$ pfexec svcs -Za -H -o state,fmri,zone
<...>
maintenance    svc:/site/fake-service2:default                    global
maintenance    svc:/site/fake-service:default                     global
<...>

$ curl -H "api-version: 11.0.0"  http://[::1]:32824/inventory | jq
<...>
  "health_monitor": {
    "smf_services_in_maintenance": {
      "ok": {
        "services": [
          {
            "fmri": "svc:/site/fake-service2:default",
            "zone": "global"
          },
          {
            "fmri": "svc:/site/fake-service:default",
            "zone": "global"
          }
        ],
        "time_of_status": "2025-12-10T08:35:10.604488858Z"
      }
    }
  }
}

Disabling one of the unhealthy services should only show a single entry

$ svcadm disable svc:/site/fake-service:default
$ pfexec svcs -Za -H -o state,fmri,zone
<...>
maintenance    svc:/site/fake-service2:default                    global
disabled       svc:/site/fake-service:default                     global
<...>
$ curl -H "api-version: 11.0.0"  http://[::1]:32824/inventory | jq
<...>
  "health_monitor": {
    "smf_services_in_maintenance": {
      "ok": {
        "services": [
          {
            "fmri": "svc:/site/fake-service2:default",
            "zone": "global"
          }
        ],
        "time_of_status": "2025-12-10T08:35:10.604488858Z"
      }
    }
  }
}

Disabling the other service should not return any services

$ svcadm disable svc:/site/fake-service2:default
$ pfexec svcs -Za -H -o state,fmri,zone
<...>
disabled       svc:/site/fake-service2:default                    global
disabled       svc:/site/fake-service:default                     global
<...>
$ curl -H "api-version: 11.0.0"  http://[::1]:32824/inventory | jq
<...>
  "health_monitor": {
    "smf_services_in_maintenance": {
      "ok": {
        "services": [],
        "time_of_status": "2025-12-10T08:35:10.604488858Z"
      }
    }
  }
}

@karencfv karencfv marked this pull request as ready for review December 2, 2025 07:37
@karencfv karencfv requested a review from davepacheco December 2, 2025 07:38
karencfv commented Dec 3, 2025

After the sled agent version change I tested the API on a simulated omicron to make sure everything still works:

All enabled services are healthy

Using v10, the output doesn't contain the smf_services_enabled_not_running field:

$ curl -H "api-version: 10.0.0"  http://[::1]:41963/inventory | jq
<...>
    "mupdate_override": {
      "boot_disk_path": "/fake/path/install/mupdate_override.json",
      "boot_override": {
        "ok": null
      },
      "non_boot_status": []
    }
  }
}

Using v11 does

$ curl -H "api-version: 11.0.0"  http://[::1]:41963/inventory | jq
<...>
    "mupdate_override": {
      "boot_disk_path": "/fake/path/install/mupdate_override.json",
      "boot_override": {
        "ok": null
      },
      "non_boot_status": []
    }
  },
  "smf_services_enabled_not_running": []
}

There is an unhealthy enabled service

$ svcadm enable svc:/site/fake-service:default

Using v10, the output doesn't contain services that aren't running:

$ curl -H "api-version: 10.0.0"  http://[::1]:41963/inventory | jq
<...>
    "mupdate_override": {
      "boot_disk_path": "/fake/path/install/mupdate_override.json",
      "boot_override": {
        "ok": null
      },
      "non_boot_status": []
    }
  }
}

Using v11 does

$ curl -H "api-version: 11.0.0"  http://[::1]:41963/inventory | jq
<...>
    "mupdate_override": {
      "boot_disk_path": "/fake/path/install/mupdate_override.json",
      "boot_override": {
        "ok": null
      },
      "non_boot_status": []
    }
  },
  "smf_services_enabled_not_running": [
    {
      "fmri": "svc:/site/fake-service:default",
      "zone": "global",
      "state": "maintenance",
      "state_since": "2025-12-03T02:27:20Z",
      "reason": "Restarting too quickly.",
      "impact": "This service is not running.",
      "additional_info": [
        "http://illumos.org/msg/SMF-8000-L5",
        "/var/svc/log/site-fake-service:default.log"
      ]
    }
  ]
}

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Utilities for reporting SMF services' status.
Collaborator

I didn't know about this, but it looks like we have an smf crate:
https://github.com/oxidecomputer/smf

and it supports a Query operation that seems to basically be svcs:
https://docs.rs/smf/latest/smf/struct.Query.html

I think one problem with that is that it doesn't seem to support -Z (query all zones). It either does the current zone or a specific zone. But it looks like it could be easy to add -Z and then you could use that here?


If for whatever reason we don't want to do that: I think better than trying to parse svcs -Zx would be using svcs -Za -H -o fmri,state. svcs -xv seems much harder to parse and less stable. (It looks like neither is guaranteed to be stable but it would be much more surprising for -H -o to change.)

Contributor Author

Ha! I was not aware of this smf crate either, thanks. I took a look and I think I'm going to go with the second option though. The from_str method there returns an error if it is unable to parse a line of text. My intention here was to skip over any missing fields or lines that couldn't be parsed, and fill out as much of the struct as possible.
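
For what it's worth, the lenient parsing I have in mind would look roughly like this (just a sketch; the struct and names here are illustrative, not the exact types in this PR):

```rust
// Parse `svcs -Za -H -o state,fmri,zone` output, skipping any line
// that doesn't have all three columns instead of failing the whole
// health check.
#[derive(Debug)]
struct SvcInMaintenance {
    fmri: String,
    zone: String,
}

fn parse_svcs_output(stdout: &str) -> Vec<SvcInMaintenance> {
    stdout
        .lines()
        .filter_map(|line| {
            // Columns are separated by runs of whitespace; a line
            // missing any column is silently skipped.
            let mut cols = line.split_whitespace();
            let state = cols.next()?;
            let fmri = cols.next()?;
            let zone = cols.next()?;
            (state == "maintenance").then(|| SvcInMaintenance {
                fmri: fmri.to_string(),
                zone: zone.to_string(),
            })
        })
        .collect()
}
```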

I wonder about the additional information that svcs -x provides though. Would it not be useful for support to have the time when the service went into maintenance, whether it is in a degraded state, or the reason why the service is in such a state? Or is this information available to support elsewhere?

Collaborator

> Ha! I was not aware of this smf crate either, thanks. I took a look and I think I'm going to go with the second option though. The from_str method there returns an error if it is unable to parse a line of text. My intention here was to skip over any missing fields or lines that couldn't be parsed, and fill out as much of the struct as possible.

Hmm, that's interesting. This makes me wonder if what we want is actually a list of errors, which would be an enum with two variants: one is like ServiceInMaintenance and the other is like SvcsFailed. Then you could still use the existing thing. We wouldn't have information in the case that svcs failed, but that's already true if the whole command fails, and I believe this really shouldn't happen in practice. (If it does, we'll be able to see that and fix the bug.) I'm not totally sold on this, but it also doesn't seem great to duplicate this code just for an error case that shouldn't ever happen.
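
To make the idea concrete, roughly this shape (names illustrative, not a proposal for the exact types):

```rust
// Sketch of the two-variant idea: each entry in the reported list is
// either a service we actually observed in maintenance, or a record
// that we couldn't observe service health at all.
enum SmfHealthIssue {
    ServiceInMaintenance { fmri: String, zone: String },
    // `svcs` failed to run or its output couldn't be parsed; we have
    // no information about service health in this case.
    SvcsFailed { message: String },
}
```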

> I wonder about the additional information that svcs -x provides though. Would it not be useful for support to have the time when the service went into maintenance, whether it is in a degraded state, or the reason why the service is in such a state? Or is this information available to support elsewhere?

I assume support is going to need a support bundle to debug a service in maintenance. Ultimately, the information here just feeds a boolean to the user about whether to call support, and the value is really for the cases where there's no service in maintenance. (Right?)

Contributor Author

> Hmm, that's interesting. This makes me wonder if what we want is actually a list of errors, which would be an enum with two variants: one is like ServiceInMaintenance and the other is like SvcsFailed. Then you could still use the existing thing. We wouldn't have information in the case that svcs failed, but that's already true if the whole command fails, and I believe this really shouldn't happen in practice. (If it does, we'll be able to see that and fix the bug.) I'm not totally sold on this, but it also doesn't seem great to duplicate this code just for an error case that shouldn't ever happen.

Hmmm... Perhaps? I'll play around with this idea. But yeah, duplicating code is not the best. I'm a little concerned that we may be shoehorning what already exists into an enum that may not feel as natural. Let's say the parsing fails midway through a service in maintenance that we want to collect: instead of just filling the enum with as much information as we have, we end up with a SvcsFailed. Would we then report the health check as failed or not? There could also be the case where parsing fails on a service that is healthy; we would also get a SvcsFailed there, but in this case it doesn't mean the health check failed.

Thinking about all of this, I'm more inclined to use svcs -Za -H -o fmri,state for less ambiguity. Which brings me to your next point:

> I assume support is going to need a support bundle to debug a service in maintenance. Ultimately, the information here just feeds a boolean to the user about whether to call support, and the value is really for the cases where there's no service in maintenance. (Right?)

If this is the case, then svcs -Za -H -o fmri,state should be enough, no?

Collaborator

Yeah, I think that makes sense, but there's still the question of what to do if the command fails or we fail to parse it. This should not come up much in practice but it's certainly a thing that can happen.

Contributor Author

I was thinking of just logging an error, tbh. If it makes sense, we can also keep track of the time when the last successful check was taken.
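
Concretely, something like this (a sketch only, assuming chrono for the timestamp; it matches the SvcsInMaintenanceResult shape that shows up in the OpenAPI delta later in this thread):

```rust
use chrono::{DateTime, Utc};

// A service that is enabled but not running.
#[derive(Debug, Clone)]
pub struct SvcInMaintenance {
    pub fmri: String,
    pub zone: String,
}

// Services in maintenance (if any), errors hit while checking, and
// when the last successful check ran.
#[derive(Debug, Clone)]
pub struct SvcsInMaintenanceResult {
    pub services: Vec<SvcInMaintenance>,
    pub errors: Vec<String>,
    // `None` until the first successful `svcs` run completes.
    pub time_of_status: Option<DateTime<Utc>>,
}
```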

Contributor

We also have https://github.com/oxidecomputer/omicron/blob/745d8323643c7e1276b51c004ae5553a22a3bf33/illumos-utils/src/scf.rs, which wraps the SMF library in Crucible (https://github.com/oxidecomputer/crucible/tree/ec87da9b0d8997a8c1c969a11d7d41583526ad20/smf). That one uses libscf directly so doesn't have to do any subprocess management or parsing. Not on this PR but in principle: should we extend our libscf wrapper to be able to do this kind of thing without shelling out to svcs? (Or do we just end up reimplementing a subset of svcs in Rust?)

Contributor Author

Ooof I didn't know we had other SMF libraries.

> Not on this PR but in principle: should we extend our libscf wrapper to be able to do this kind of thing without shelling out to svcs

That would be nice, I think. But it would be a big effort, no? We'd probably want to migrate everything that's shelling out in illumos-utils as well? Otherwise everyone is just going to keep writing little helper functions, no?

Contributor

Yeah, I think there's a real project here (write a nice Rust wrapper for libscf) that is kind of a lot of work for not a ton of gain. "Less forking" is a real benefit, but it wouldn't remove all of sled-agent's forking. I'd grump and complain but stick with parsing smf's output (with the "make it parseable" flags as Dave suggested).

Comment on lines 807 to 808
let smf_services_enabled_not_running =
Svcs::enabled_not_running(&self.log).await?;
Collaborator

I don't think the simulated sled agent should call out to the real svcs command to see what services are in maintenance. It should probably just fake up a response indicating no services in maintenance. For testing, we might want a way to get it to report some services in maintenance, but we may not really need that since we can create fake inventory collections instead.

Contributor Author

Ugh yeah sorry, I was using this to test on my machine without having to spin up an a4x2 environment every time I tried out the endpoint and forgot to remove it.

Collaborator

Yeah, I can see the appeal of testing this in the simulated environment. When I've wanted to do something like this in the past, I added an API to the simulated sled agent that would let you configure its behavior. I could see doing something like that in the limit, but I could also see it not really being worth it -- whatever you think.

Contributor Author

I'll give it a go and see if it doesn't take too long to implement. This may be useful for any further health checks.

Comment on lines 1136 to 1137
let smf_services_enabled_not_running =
Svcs::enabled_not_running(&self.log).await?;
Collaborator

I'm curious for @jgallagher's thoughts on this but I feel like it would be better if this function remained non-async and just had this information available already. Maybe the config-reconciler could collect it? Or some other background task?

There are a few things I'm worried about here:

  • a stuck svcs call blocks our ability to collect any inventory at all
  • if a svcs call gets stuck, every call to inventory will accumulate another one

If we instead put this into a separate background task, there can be only one at a time, and we can always report the latest state, even if it's old.
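
A sketch of the shape I have in mind (assuming tokio; names illustrative):

```rust
use std::time::Duration;
use tokio::sync::watch;

#[derive(Clone, Debug, Default)]
struct SvcsStatus {
    services_in_maintenance: Vec<String>,
}

// One background task owns the `svcs` invocation; inventory collection
// just reads the most recently published value. A hung `svcs` can't
// block inventory, and there is never more than one call in flight.
fn spawn_health_monitor() -> watch::Receiver<SvcsStatus> {
    let (tx, rx) = watch::channel(SvcsStatus::default());
    tokio::spawn(async move {
        loop {
            let status = check_smf_services().await;
            let _ = tx.send(status);
            tokio::time::sleep(Duration::from_secs(60)).await;
        }
    });
    rx
}

async fn check_smf_services() -> SvcsStatus {
    // Placeholder: run `svcs -Za -H -o state,fmri,zone` and parse it.
    SvcsStatus::default()
}
```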

Contributor

Strongly agreed; failing to collect inventory because we're stuck in a hung SMF interaction would be quite painful (since inventory is supposed to report that kind of problem in the first place).

I'm less sure about putting it in the config reconciler vs spawning a new background task to do it. My gut reaction was to put it in a separate task, but since the config reconciler is responsible for starting up zones I could see it wanting to know whether any services in that zone are in trouble.

Contributor Author

Yeah, that all makes sense.

> I'm less sure about putting it in the config reconciler vs spawning a new background task to do it. My gut reaction was to put it in a separate task, but since the config reconciler is responsible for starting up zones I could see it wanting to know whether any services in that zone are in trouble.

This comment makes me think having it in the config-reconciler would be more useful?

Contributor

> This comment makes me think having it in the config-reconciler would be more useful?

(Recapping a live conversation) I thought so too until we were chatting, and realized that the config-reconciler is mostly dormant. It only runs when there's been a change it needs to reconcile (new config, change in hardware, ...) or if the previous attempt to reconcile didn't complete successfully (in which case it continues retrying periodically until it does succeed, at which point it goes dormant again waiting for changes). We want health check info to be fresher than that, so a new background task seems right.

@karencfv karencfv left a comment

Thanks for taking a look! I think I've addressed all of the comments, please let me know if there's anything missing.

NOTE: I think I should hold off on rebasing/merging main until #9514 lands. It seems things are going back to where they were? Sort of?

Comment on lines +142 to +148
// Example of the response from running `svcs -Za -H -o state,fmri,zone`
//
// legacy_run lrc:/etc/rc2_d/S20sysetup global
// maintenance svc:/site/fake-service:default global
// disabled svc:/network/tcpkey:default global
// disabled svc:/system/omicron/baseline:default global
// online svc:/milestone/sysconfig:default global
Contributor Author

🙇‍♀️

}
}

if !health_monitor.is_empty() {
Contributor Author

Yeah, that makes sense. Do you mind if I make this visible with the follow-up PR that adds the types to the DB? Otherwise, this will always show up as if everything was healthy and it could be misleading. I'll add a comment explaining for now.


#[derive(Debug, Clone)]
pub struct HealthMonitorHandle {
// Return a String instead of a custom error type as inventory requires
Contributor Author

Tried it and same issue with it not being cloneable :(

FWIW I copied this pattern from BootPartitionContents, which is used by ConfigReconcilerInventory

@karencfv karencfv requested a review from davepacheco December 15, 2025 08:36
@jgallagher jgallagher left a comment

Thanks, this is looking good! Mostly small nits, with a couple questions about error reporting.

@jgallagher jgallagher left a comment

Changes all LGTM, but there are quite a few merge conflicts (probably mostly the versioning reorganization that's now landed on main?). Happy to give it a quick once-over after those are resolved.

@karencfv

Thanks for taking a look @jgallagher @davepacheco! I've updated the PR to use the new versioning system, manually tested it, and it all worked as expected 🎉

@karencfv karencfv requested a review from jgallagher December 17, 2025 23:38

#[derive(Debug, Clone)]
pub struct HealthMonitorHandle {
// Return a String instead of a custom error type as inventory requires
Collaborator

I'm mentioning it because I used to use Strings for cases like this and I've always either changed them or wish that I had. Usually it's because I want to treat it like a StdError in various ways: either use InlineErrorChain, or use anyhow's .context(), have a higher-level thiserror with a #[from] that includes this error, etc. If you need it to impl Clone, maybe you could wrap it in an Arc?

This is not a big deal. We can always change it later if/when we want to do these other things with it.
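
For example, something like this (a sketch, assuming anyhow):

```rust
use std::sync::Arc;

// A cloneable wrapper around a real error type: `anyhow::Error` isn't
// `Clone`, but `Arc<anyhow::Error>` is.
#[derive(Debug, Clone)]
pub struct CloneableError(Arc<anyhow::Error>);

impl std::fmt::Display for CloneableError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        // `{:#}` includes the full error chain in the message.
        write!(f, "{:#}", self.0)
    }
}

impl From<anyhow::Error> for CloneableError {
    fn from(e: anyhow::Error) -> Self {
        CloneableError(Arc::new(e))
    }
}
```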

@davepacheco

Another thing I noticed is that in recent revs of this PR, the delta in the OpenAPI spec is pretty big and seems to have a bunch of unrelated changes. I may well be doing something wrong, but:

$ diff openapi/sled-agent/sled-agent-{11,12}*
10c10
<     "version": "11.0.0"
---
>     "version": "12.0.0"
2372a2373
>         "description": "Artifact configuration.\n\nThis type is used in both GET (response) and PUT (request) operations.",
2392a2394
>         "description": "Request body for copying artifacts from a depot.",
2403a2406
>         "description": "Response for copying artifacts from a depot.",
2406a2410
>         "description": "Response for listing artifacts.",
2426a2431
>         "description": "Response for putting an artifact.",
3053a3059
>         "description": "Status of the local bootstore node.",
4497c4503
<         "description": "Network configuration required to bring up the control plane\n\nThe fields in this structure are those from [`crate::rack_init::RackInitializeRequest`] necessary for use beyond RSS. This is just for the initial rack configuration and cold boot purposes. Updates come from Nexus.",
---
>         "description": "Network configuration required to bring up the control plane\n\nThe fields in this structure are those from `RackInitializeRequest` necessary for use beyond RSS. This is just for the initial rack configuration and cold boot purposes. Updates come from Nexus.",
4563a4570
>         "description": "An established connection to a bootstore peer.",
4882a4890,4937
>       "HealthMonitorInventory": {
>         "description": "Fields of sled-agent inventory reported by the health monitor subsystem.",
>         "type": "object",
>         "properties": {
>           "smf_services_in_maintenance": {
>             "x-rust-type": {
>               "crate": "std",
>               "parameters": [
>                 {
>                   "$ref": "#/components/schemas/SvcsInMaintenanceResult"
>                 },
>                 {
>                   "type": "string"
>                 }
>               ],
>               "path": "::std::result::Result",
>               "version": "*"
>             },
>             "oneOf": [
>               {
>                 "type": "object",
>                 "properties": {
>                   "ok": {
>                     "$ref": "#/components/schemas/SvcsInMaintenanceResult"
>                   }
>                 },
>                 "required": [
>                   "ok"
>                 ]
>               },
>               {
>                 "type": "object",
>                 "properties": {
>                   "err": {
>                     "type": "string"
>                   }
>                 },
>                 "required": [
>                   "err"
>                 ]
>               }
>             ]
>           }
>         },
>         "required": [
>           "smf_services_in_maintenance"
>         ]
>       },
5263c5318
<         "description": "Represents a multicast group membership for an instance.",
---
>         "description": "Represents a multicast group membership for an instance.\n\nIntroduced in v7.",
5433a5489,5491
>           "health_monitor": {
>             "$ref": "#/components/schemas/HealthMonitorInventory"
>           },
5487a5546
>           "health_monitor",
5770c5829
<         "description": "Dataset and Volume details for a Local Storage dataset ensure request",
---
>         "description": "Dataset and Volume details for a Local Storage dataset ensure request.",
5811a5871,6015
>       "ManifestBootInventory": {
>         "description": "Inventory representation of zone artifacts on the boot disk.\n\nPart of [`ManifestInventory`].",
>         "type": "object",
>         "properties": {
>           "artifacts": {
>             "title": "IdOrdMap",
>             "description": "The artifacts on disk.",
>             "x-rust-type": {
>               "crate": "iddqd",
>               "parameters": [
>                 {
>                   "$ref": "#/components/schemas/ZoneArtifactInventory"
>                 }
>               ],
>               "path": "iddqd::IdOrdMap",
>               "version": "*"
>             },
>             "type": "array",
>             "items": {
>               "$ref": "#/components/schemas/ZoneArtifactInventory"
>             },
>             "uniqueItems": true
>           },
>           "source": {
>             "description": "The manifest source.\n\nIn production this is [`OmicronInstallManifestSource::Installinator`], but in some development and testing flows Sled Agent synthesizes zone manifests. In those cases, the source is [`OmicronInstallManifestSource::SledAgent`].",
>             "allOf": [
>               {
>                 "$ref": "#/components/schemas/OmicronInstallManifestSource"
>               }
>             ]
>           }
>         },
>         "required": [
>           "artifacts",
>           "source"
>         ]
>       },
>       "ManifestInventory": {
>         "description": "Inventory representation of a manifest.\n\nPart of [`ZoneImageResolverInventory`].\n\nA manifest is used for both zones and reference measurements. A zone manifest is a listing of all the zones present in a system's install dataset. A measurement manifset is a listing of all the reference measurements present in a system's install dataset. This struct contains information about the install dataset gathered from a system.",
>         "type": "object",
>         "properties": {
>           "boot_disk_path": {
>             "description": "The full path to the zone manifest file on the boot disk.",
>             "type": "string",
>             "format": "Utf8PathBuf"
>           },
>           "boot_inventory": {
>             "description": "The manifest read from the boot disk, and whether the manifest is valid.",
>             "x-rust-type": {
>               "crate": "std",
>               "parameters": [
>                 {
>                   "$ref": "#/components/schemas/ManifestBootInventory"
>                 },
>                 {
>                   "type": "string"
>                 }
>               ],
>               "path": "::std::result::Result",
>               "version": "*"
>             },
>             "oneOf": [
>               {
>                 "type": "object",
>                 "properties": {
>                   "ok": {
>                     "$ref": "#/components/schemas/ManifestBootInventory"
>                   }
>                 },
>                 "required": [
>                   "ok"
>                 ]
>               },
>               {
>                 "type": "object",
>                 "properties": {
>                   "err": {
>                     "type": "string"
>                   }
>                 },
>                 "required": [
>                   "err"
>                 ]
>               }
>             ]
>           },
>           "non_boot_status": {
>             "title": "IdOrdMap",
>             "description": "Information about the install dataset on non-boot disks.",
>             "x-rust-type": {
>               "crate": "iddqd",
>               "parameters": [
>                 {
>                   "$ref": "#/components/schemas/ManifestNonBootInventory"
>                 }
>               ],
>               "path": "iddqd::IdOrdMap",
>               "version": "*"
>             },
>             "type": "array",
>             "items": {
>               "$ref": "#/components/schemas/ManifestNonBootInventory"
>             },
>             "uniqueItems": true
>           }
>         },
>         "required": [
>           "boot_disk_path",
>           "boot_inventory",
>           "non_boot_status"
>         ]
>       },
>       "ManifestNonBootInventory": {
>         "description": "Inventory representation of a zone manifest on a non-boot disk.\n\nUnlike [`ManifestBootInventory`] which is structured since Reconfigurator makes decisions based on it, information about non-boot disks is purely advisory. For simplicity, we store information in an unstructured format.",
>         "type": "object",
>         "properties": {
>           "is_valid": {
>             "description": "Whether the status is valid.",
>             "type": "boolean"
>           },
>           "message": {
>             "description": "A message describing the status.\n\nIf `is_valid` is true, then the message describes the list of artifacts found and their hashes.\n\nIf `is_valid` is false, then this message describes the reason for the invalid status. This could include errors reading the zone manifest, or zone file mismatches.",
>             "type": "string"
>           },
>           "path": {
>             "description": "The full path to the zone manifest JSON on the non-boot disk.",
>             "type": "string",
>             "format": "Utf8PathBuf"
>           },
>           "zpool_id": {
>             "description": "The ID of the non-boot zpool.",
>             "allOf": [
>               {
>                 "$ref": "#/components/schemas/InternalZpoolUuid"
>               }
>             ]
>           }
>         },
>         "required": [
>           "is_valid",
>           "message",
>           "path",
>           "zpool_id"
>         ]
>       },
6198a6403,6446
>       "OmicronInstallManifestSource": {
>         "description": "The source of truth for an Omicron zone manifest.",
>         "oneOf": [
>           {
>             "description": "The manifest was written out by installinator and the mupdate process.",
>             "type": "object",
>             "properties": {
>               "mupdate_id": {
>                 "description": "The UUID of the mupdate.",
>                 "allOf": [
>                   {
>                     "$ref": "#/components/schemas/MupdateUuid"
>                   }
>                 ]
>               },
>               "source": {
>                 "type": "string",
>                 "enum": [
>                   "installinator"
>                 ]
>               }
>             },
>             "required": [
>               "mupdate_id",
>               "source"
>             ]
>           },
>           {
>             "description": "The zone manifest was not found during the install process. A synthetic zone manifest was generated by Sled Agent by looking at all the `.tar.gz` files in the install dataset.",
>             "type": "object",
>             "properties": {
>               "source": {
>                 "type": "string",
>                 "enum": [
>                   "sled_agent"
>                 ]
>               }
>             },
>             "required": [
>               "source"
>             ]
>           }
>         ]
>       },
6394,6437d6641
<       "OmicronZoneManifestSource": {
<         "description": "The source of truth for an Omicron zone manifest.",
<         "oneOf": [
<           {
<             "description": "The manifest was written out by installinator and the mupdate process.",
<             "type": "object",
<             "properties": {
<               "mupdate_id": {
<                 "description": "The UUID of the mupdate.",
<                 "allOf": [
<                   {
<                     "$ref": "#/components/schemas/MupdateUuid"
<                   }
<                 ]
<               },
<               "source": {
<                 "type": "string",
<                 "enum": [
<                   "installinator"
<                 ]
<               }
<             },
<             "required": [
<               "mupdate_id",
<               "source"
<             ]
<           },
<           {
<             "description": "The zone manifest was not found during the install process. A synthetic zone manifest was generated by Sled Agent by looking at all the `.tar.gz` files in the install dataset.",
<             "type": "object",
<             "properties": {
<               "source": {
<                 "type": "string",
<                 "enum": [
<                   "sled_agent"
<                 ]
<               }
<             },
<             "required": [
<               "source"
<             ]
<           }
<         ]
<       },
7100c7304
<         "description": "The priority order for bundles during cleanup.\n\nBundles are sorted along the dimensions in [`PriorityDimension`], with each dimension appearing exactly once. During cleanup, lesser-priority bundles are pruned first, to maintain the dataset quota. Note that bundles are sorted by each dimension in the order in which they appear, with each dimension having higher priority than the next.",
---
>         "description": "The priority order for bundles during cleanup.\n\nBundles are sorted along the dimensions in [`PriorityDimension`], with each dimension appearing exactly once. During cleanup, lesser-priority bundles are pruned first, to maintain the dataset quota. Note that bundles are sorted by each dimension in the order in which they appear, with each dimension having higher priority than the next.\n\nTODO: The serde deserializer does not currently verify uniqueness of dimensions.",
8224c8428
<         "description": "Metadata about a support bundle",
---
>         "description": "Metadata about a support bundle.",
8239a8444
>         "description": "State of a support bundle.",
8254a8460,8502
>       "SvcInMaintenance": {
>         "description": "Information about an SMF service that is enabled but not running",
>         "type": "object",
>         "properties": {
>           "fmri": {
>             "type": "string"
>           },
>           "zone": {
>             "type": "string"
>           }
>         },
>         "required": [
>           "fmri",
>           "zone"
>         ]
>       },
>       "SvcsInMaintenanceResult": {
>         "description": "Lists services in maintenance status if any, and the time the health check for SMF services ran",
>         "type": "object",
>         "properties": {
>           "errors": {
>             "type": "array",
>             "items": {
>               "type": "string"
>             }
>           },
>           "services": {
>             "type": "array",
>             "items": {
>               "$ref": "#/components/schemas/SvcInMaintenance"
>             }
>           },
>           "time_of_status": {
>             "nullable": true,
>             "type": "string",
>             "format": "date-time"
>           }
>         },
>         "required": [
>           "errors",
>           "services"
>         ]
>       },
8444a8693
>         "description": "Request body for VMM disk snapshot requests.",
8456a8706
>         "description": "Response for VMM disk snapshot requests.",
8533c8783
<         "description": "Specifies the virtual hardware configuration of a new Propolis VMM in the form of a Propolis instance specification.\n\nSled-agent expects that when an instance spec is provided alongside an `InstanceSledLocalConfig` to initialize a new instance, the NIC IDs in that config's network interface list will match the IDs of the virtio network backends in the instance spec.",
---
>         "description": "Specifies the virtual hardware configuration of a new Propolis VMM in the form of a Propolis instance specification.",
8809c9059
<         "description": "Inventory representation of a single zone artifact on a boot disk.\n\nPart of [`ZoneManifestBootInventory`].",
---
>         "description": "Inventory representation of a single zone artifact on a boot disk.\n\nPart of [`ManifestBootInventory`].",
8984c9234
<                 "$ref": "#/components/schemas/ZoneManifestInventory"
---
>                 "$ref": "#/components/schemas/ManifestInventory"
8994,9138d9243
<       "ZoneManifestBootInventory": {
<         "description": "Inventory representation of zone artifacts on the boot disk.\n\nPart of [`ZoneManifestInventory`].",
<         "type": "object",
<         "properties": {
<           "artifacts": {
<             "title": "IdOrdMap",
<             "description": "The artifacts on disk.",
<             "x-rust-type": {
<               "crate": "iddqd",
<               "parameters": [
<                 {
<                   "$ref": "#/components/schemas/ZoneArtifactInventory"
<                 }
<               ],
<               "path": "iddqd::IdOrdMap",
<               "version": "*"
<             },
<             "type": "array",
<             "items": {
<               "$ref": "#/components/schemas/ZoneArtifactInventory"
<             },
<             "uniqueItems": true
<           },
<           "source": {
<             "description": "The manifest source.\n\nIn production this is [`OmicronZoneManifestSource::Installinator`], but in some development and testing flows Sled Agent synthesizes zone manifests. In those cases, the source is [`OmicronZoneManifestSource::SledAgent`].",
<             "allOf": [
<               {
<                 "$ref": "#/components/schemas/OmicronZoneManifestSource"
<               }
<             ]
<           }
<         },
<         "required": [
<           "artifacts",
<           "source"
<         ]
<       },
<       "ZoneManifestInventory": {
<         "description": "Inventory representation of a zone manifest.\n\nPart of [`ZoneImageResolverInventory`].\n\nA zone manifest is a listing of all the zones present in a system's install dataset. This struct contains information about the install dataset gathered from a system.",
<         "type": "object",
<         "properties": {
<           "boot_disk_path": {
<             "description": "The full path to the zone manifest file on the boot disk.",
<             "type": "string",
<             "format": "Utf8PathBuf"
<           },
<           "boot_inventory": {
<             "description": "The manifest read from the boot disk, and whether the manifest is valid.",
<             "x-rust-type": {
<               "crate": "std",
<               "parameters": [
<                 {
<                   "$ref": "#/components/schemas/ZoneManifestBootInventory"
<                 },
<                 {
<                   "type": "string"
<                 }
<               ],
<               "path": "::std::result::Result",
<               "version": "*"
<             },
<             "oneOf": [
<               {
<                 "type": "object",
<                 "properties": {
<                   "ok": {
<                     "$ref": "#/components/schemas/ZoneManifestBootInventory"
<                   }
<                 },
<                 "required": [
<                   "ok"
<                 ]
<               },
<               {
<                 "type": "object",
<                 "properties": {
<                   "err": {
<                     "type": "string"
<                   }
<                 },
<                 "required": [
<                   "err"
<                 ]
<               }
<             ]
<           },
<           "non_boot_status": {
<             "title": "IdOrdMap",
<             "description": "Information about the install dataset on non-boot disks.",
<             "x-rust-type": {
<               "crate": "iddqd",
<               "parameters": [
<                 {
<                   "$ref": "#/components/schemas/ZoneManifestNonBootInventory"
<                 }
<               ],
<               "path": "iddqd::IdOrdMap",
<               "version": "*"
<             },
<             "type": "array",
<             "items": {
<               "$ref": "#/components/schemas/ZoneManifestNonBootInventory"
<             },
<             "uniqueItems": true
<           }
<         },
<         "required": [
<           "boot_disk_path",
<           "boot_inventory",
<           "non_boot_status"
<         ]
<       },
<       "ZoneManifestNonBootInventory": {
<         "description": "Inventory representation of a zone manifest on a non-boot disk.\n\nUnlike [`ZoneManifestBootInventory`] which is structured since Reconfigurator makes decisions based on it, information about non-boot disks is purely advisory. For simplicity, we store information in an unstructured format.",
<         "type": "object",
<         "properties": {
<           "is_valid": {
<             "description": "Whether the status is valid.",
<             "type": "boolean"
<           },
<           "message": {
<             "description": "A message describing the status.\n\nIf `is_valid` is true, then the message describes the list of artifacts found and their hashes.\n\nIf `is_valid` is false, then this message describes the reason for the invalid status. This could include errors reading the zone manifest, or zone file mismatches.",
<             "type": "string"
<           },
<           "path": {
<             "description": "The full path to the zone manifest JSON on the non-boot disk.",
<             "type": "string",
<             "format": "Utf8PathBuf"
<           },
<           "zpool_id": {
<             "description": "The ID of the non-boot zpool.",
<             "allOf": [
<               {
<                 "$ref": "#/components/schemas/InternalZpoolUuid"
<               }
<             ]
<           }
<         },
<         "required": [
<           "is_valid",
<           "message",
<           "path",
<           "zpool_id"
<         ]
<       },

@karencfv karencfv left a comment

Thanks for taking a look @davepacheco I think this is ready to go! 🤞

@karencfv

> Another thing I noticed is that in recent revs of this PR, the delta in the OpenAPI spec is pretty big and seems to have a bunch of unrelated changes. I may well be doing something wrong, but:

Just recapping a live conversation: this is expected. Currently, new versions aren't rolled out for trivial changes, so when you do create a new version, these trivial changes show up in the JSON.

@davepacheco

> Another thing I noticed is that in recent revs of this PR, the delta in the OpenAPI spec is pretty big and seems to have a bunch of unrelated changes. I may well be doing something wrong, but:

> Just recapping a live conversation: this is expected. Currently, new versions aren't rolled out for trivial changes, so when you do create a new version, these trivial changes show up in the JSON.

oxidecomputer/dropshot-api-manager#35

@jgallagher

Merging with @karencfv's permission

@jgallagher jgallagher merged commit 63c0189 into oxidecomputer:main Dec 19, 2025
18 checks passed
@karencfv karencfv deleted the inventory-health-check-info branch December 22, 2025 01:48