Hi, I have been running some benchmarks with stargz-snapshotter. My test setup uses three image variants: a regular image, an esgz image with optimizations disabled, and an esgz image with optimizations enabled. For each variant, I run pull, create, and start as separate steps and measure the time each one takes.
I have been testing primarily with large AI models, since one of our use cases is to see whether stargz-snapshotter is a viable way to reduce the start time of containers that ship AI models.
During these experiments I came across some measurements that I am having a hard time explaining. For instance, when running llama 3.2 with the regular image, these are the timings I got (Start measures the time from issuing the container start request until the model is fully loaded and ready to serve):
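For anyone who wants to reproduce the methodology, here is a minimal sketch of timing each phase separately. It is not the script I used: `time_phases` is a hypothetical helper, and the placeholder callables stand in for whatever commands actually perform the pull, create, and start steps in your environment.

```python
import time
from typing import Callable, Dict


def time_phases(phases: Dict[str, Callable[[], None]]) -> Dict[str, float]:
    """Run each phase in order and return its wall-clock duration in seconds."""
    timings: Dict[str, float] = {}
    for name, run in phases.items():
        started = time.perf_counter()
        run()  # blocks until the phase is finished
        timings[name] = time.perf_counter() - started
    return timings


if __name__ == "__main__":
    # Placeholders only (assumption): replace each lambda with the real
    # pull / create / start step. "start" should only return once the
    # model is loaded and ready to serve, to match the Start column.
    placeholder_phases = {
        "pull": lambda: time.sleep(0.01),
        "create": lambda: time.sleep(0.01),
        "start": lambda: time.sleep(0.01),
    }
    for name, seconds in time_phases(placeholder_phases).items():
        print(f"{name}: {seconds:.3f}s")
```

Timing the phases individually matters here because lazy pulling moves work from pull into start, so a single end-to-end number would hide where the time goes.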
| Pull | Create | Start |
|---|---|---|
| 2m9s | 135ms | 47s |
For the esgz image without optimization, this is what I saw:
| Pull | Create | Start |
|---|---|---|
| 1.34s | 210ms | 4m27s |
For the sake of completeness, here are the timings for the esgz image with optimization enabled:
| Pull | Create | Start |
|---|---|---|
| 2.54s | 10.57s | 2m5s |
As the measurements show, the run phase (Start) takes much longer with the esgz images than with the regular image. I monitored network bandwidth usage during each run, and it is quite apparent that while the regular pull nearly saturated the available bandwidth, the same was not true during the run phase. I did find the chunk_size property under blob in the config, which defaults to 50,000. When I increased it to roughly 50 MB, I saw a boost in the run phase. Here are the measurements after setting blob chunk_size:
Regular image with the overlayfs snapshotter:
| Pull | Create | Start |
|---|---|---|
| 2m26s | 246ms | 45s |
esgz image without optimization:
| Pull | Create | Start |
|---|---|---|
| 1.9s | 129ms | 2m26s |
esgz image with optimization enabled:
| Pull | Create | Start |
|---|---|---|
| 2s | 10.28s | 2m10s |
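To compare the variants end to end, the three phases can be summed per table. The sketch below is only arithmetic over the numbers reported in this issue; `parse_duration` is a hypothetical helper, and the first table is assumed to be the regular image.

```python
import re

_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
# "ms" is listed first so that "135ms" is not read as "135m" followed by "s".
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")


def parse_duration(text: str) -> float:
    """Convert a duration such as '2m9s' or '135ms' to seconds."""
    total, pos = 0.0, 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            raise ValueError(f"unparseable duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"unparseable duration: {text!r}")
    return total


def total_seconds(pull: str, create: str, start: str) -> float:
    """End-to-end time: pull + create + start."""
    return sum(parse_duration(d) for d in (pull, create, start))


# Timings copied from the tables in this issue.
DEFAULT_CHUNK = {
    "regular": ("2m9s", "135ms", "47s"),  # assumed to be the regular image
    "esgz, no optimization": ("1.34s", "210ms", "4m27s"),
    "esgz, optimized": ("2.54s", "10.57s", "2m5s"),
}
LARGER_CHUNK = {
    "regular": ("2m26s", "246ms", "45s"),
    "esgz, no optimization": ("1.9s", "129ms", "2m26s"),
    "esgz, optimized": ("2s", "10.28s", "2m10s"),
}

if __name__ == "__main__":
    for label, table in (("default chunk_size", DEFAULT_CHUNK),
                         ("larger chunk_size", LARGER_CHUNK)):
        for variant, timings in table.items():
            print(f"{label:>18} | {variant:<22} | {total_seconds(*timings):8.2f}s")
```

Summed this way, the unoptimized esgz image goes from about 268.6s to about 148.0s end to end after the chunk_size change, while the regular image stays in the 176s to 191s range.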
With the larger chunk_size, the number of fetched bytes per layer also increased, according to the /.stargz-snapshotter/*.json files.
It would be great if you could help explain why bytes are fetched so much more slowly during the run phase than during a regular pull, and whether there is a way to increase that rate.
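My rough intuition for why chunk_size matters is the number of chunks a layer is split into. The sketch below is a back-of-the-envelope calculation only: the 2 GiB layer size is a hypothetical example, not a measured value, and I am assuming chunk_size is expressed in bytes. Whether each chunk actually maps to its own HTTP range request is also an assumption on my part, not something I have confirmed in the snapshotter code.

```python
import math


def chunk_count(blob_bytes: int, chunk_size: int) -> int:
    """Number of chunks needed to cover a blob of blob_bytes bytes."""
    if blob_bytes < 0 or chunk_size <= 0:
        raise ValueError("blob_bytes must be >= 0 and chunk_size must be > 0")
    return math.ceil(blob_bytes / chunk_size)


if __name__ == "__main__":
    layer_bytes = 2 * 1024**3  # hypothetical 2 GiB model layer (assumption)
    for chunk_size in (50_000, 50_000_000):
        print(f"chunk_size={chunk_size:>10}: "
              f"{chunk_count(layer_bytes, chunk_size)} chunks")
```

For that hypothetical layer, the default of 50,000 gives 42950 chunks, versus 43 chunks at 50,000,000. If per-chunk overhead dominates, that difference alone could explain part of the low bandwidth utilization during the run phase.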