Measuring the effectiveness of a scheduler
I’m currently writing a new scheduler, credit2, which I mean to be the general-purpose scheduler for the next generation of Xen, lasting at least several years before needing a significant overhaul.
But there are some problems with writing a scheduler:
- They’re used in every workload that uses a CPU — in other words, every workload.
- Schedulers really only matter when there’s competition. Simply running a single workload in isolation won’t tell you much about the effectiveness of the algorithm.
- Schedulers may work perfectly well for one combination of competing workloads, but poorly for others.
- Small changes can have unpredictable effects.
I’ve learned a lot so far about algorithms, what didn’t work for
credit1 and what I think works for credit2. But one of the most
important things I’ve learned is not to rely on my intuition,
but to make sure to do actual testing.
So what I’m working on now is an automated test framework, which will
measure the effectiveness of the scheduler. This is actually a bit
trickier than one might expect. You can’t simply run a workload by
itself and measure its performance; for that use case, you don’t
actually need a scheduler. You need to run each workload in
competition with an array of other workloads, and at various levels of
CPU “pressure”. The tested workload will get less CPU time, and the
performance will degrade. So we need to have an idea how much
performance degradation is expected and acceptable.
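To make that concrete, the harness I have in mind looks roughly like the sketch below. The workload names, the pressure levels, and the run_test() helper are purely illustrative; the real framework will have to boot VMs, drive the workloads, and collect the measurements itself.

```python
import itertools

# Hypothetical test matrix: each target workload is run against every
# combination of competing workload mix and level of CPU "pressure"
# (how oversubscribed the host is).
target_workloads = ["kernel-build", "netperf", "specjbb"]
competing_mixes  = [["cpu-hog"], ["cpu-hog"] * 3, ["netperf", "cpu-hog"]]
pressure_levels  = [1.0, 1.5, 2.0, 3.0]   # vCPUs demanded / pCPUs available

def run_all(run_test):
    """run_test(target, competitors, pressure) is assumed to set up the
    VMs, run the workloads, and return (performance, cpu_time)."""
    results = {}
    for target, mix, pressure in itertools.product(
            target_workloads, competing_mixes, pressure_levels):
        results[(target, tuple(mix), pressure)] = run_test(target, mix, pressure)
    return results
```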
So what do we want from a scheduler? I’ve identified three high-level
goals:
- Scheduler overhead should be low, about as low as that of the old scheduler.
- The amount of actual CPU time a VM gets should be close to its “fair share”.
- The impact on workload performance should be close to its “fair share”.
“Fair share” is defined on a per-scheduler basis. For the credit1
scheduler, the ideal was that CPU time was divided according to
weight; if some VMs were using less than their “fair share”, that time
was divided by weight among other VMs that wished to use it. I won’t
go into calculating exact fair share here, but it’s essentially doing
what the interface says it will do.
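The redistribution of unused time can be thought of as a simple water-filling pass: hand out time by weight, and re-divide whatever the satisfied VMs don’t want among those that still want more. Here is a rough sketch of that calculation; the function and its arguments are my own illustration, not Xen code.

```python
def fair_shares(weights, demands, capacity):
    """Split `capacity` CPU time among VMs in proportion to weight,
    handing any share a VM does not want back to the VMs that still
    want more (again in proportion to weight)."""
    shares = {vm: 0.0 for vm in weights}
    active = set(weights)              # VMs that may still want more CPU
    remaining = capacity
    while active and remaining > 1e-9:
        total_weight = sum(weights[vm] for vm in active)
        leftover = 0.0
        for vm in list(active):
            slice_ = remaining * weights[vm] / total_weight
            want = max(demands[vm] - shares[vm], 0.0)
            if want <= slice_:
                shares[vm] += want         # satisfied: give back the rest
                leftover += slice_ - want
                active.discard(vm)
            else:
                shares[vm] += slice_       # still hungry: keep the whole slice
        remaining = leftover
    return shares

# Example: three equal-weight VMs on 200% of CPU, one of which only
# wants 20%; the other two split the rest at 90% each.
print(fair_shares({"a": 256, "b": 256, "c": 256},
                  {"a": 20, "b": 150, "c": 150}, 200))
```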
So how do we define this precisely? First, we need to define several
variables:
- p_o: Performance of the workload running on the old scheduler, on an otherwise empty system
- p_b: “Baseline” performance of the workload on the new scheduler (on an otherwise empty system)
- t_b: The “baseline” amount of CPU time used by the target VM (on an otherwise empty system)
- f: “Fair share” of CPU time for the target VM, given the load of the system
- p_f: The performance of the workload, given fair share f
- t_f: The amount of CPU time consumed, given fair share f
Using these variables, and introducing some specific targets, we can
quantify the above goals:
- Overhead within 1% of the old scheduler:
  - p_b >= 0.99 * p_o
- CPU time consumed within 10% of “fair share”:
  - t_f >= 0.9 * f
- Performance fairness:
  - p_f >= p_b * (f / t_b) * F
  - Where F is:
    - Ideal: 1
    - Good enough: 0.9 (within 10% of ideal)
    - Cut-off: 0.7 (within 30% of ideal)
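As a rough illustration of how these targets might be checked automatically, here is a minimal sketch; the measurements are assumed to come from the test framework, and the thresholds are just the F values listed above.

```python
def evaluate_run(p_o, p_b, t_b, f, p_f, t_f):
    """Check one scheduler test run against the three targets above.
    Hypothetical helper: all arguments are assumed to be measured by
    the test framework."""
    results = {}

    # Goal 1: overhead within 1% of the old scheduler.
    results["overhead"] = p_b >= 0.99 * p_o

    # Goal 2: CPU time consumed within 10% of fair share.
    results["cpu_fairness"] = t_f >= 0.9 * f

    # Goal 3: performance within a factor F of the share-scaled baseline.
    ideal = p_b * (f / t_b)
    for label, F in (("ideal", 1.0), ("good_enough", 0.9), ("cutoff", 0.7)):
        results["perf_" + label] = p_f >= ideal * F

    return results
```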
These constraints are a starting point; I’m sure that at some point
they will run up against reality and need to be adjusted. But the
point is to have specific, measurable objectives, so that we know when
we have work to do, and when we can say, “This is good enough.”