How Do Real-Time Scheduling Policies Work with Intel’s Hybrid Architecture?
As revWhiteShadow, we at kts personal blog site have been investigating performance bottlenecks on Intel’s hybrid architecture, specifically concerning real-time (RT) scheduling policies. Our findings indicate that when employing SCHED_RR or SCHED_FIFO for CPU-intensive threads, these threads are often preferentially scheduled on Efficiency (E) cores rather than Performance (P) cores, leading to suboptimal performance. This article delves into the potential causes behind this behavior, explores kernel configuration considerations, and discusses potential solutions beyond simply forcing thread affinity.
Observed Behavior: Real-Time Threads Favoring E-Cores
Our initial observation stems from running CPU-intensive applications using real-time scheduling policies (SCHED_RR and SCHED_FIFO) on a system equipped with an Intel Core i9-12900 processor (a hybrid architecture featuring both P-cores and E-cores). The expectation was that computationally demanding real-time threads would be scheduled on P-cores to leverage their superior single-threaded performance. However, contrary to this expectation, these threads consistently ended up being scheduled on E-cores.
Using the stress-ng utility, we confirmed this behavior. With the default SCHED_NORMAL policy (stress-ng --cpu 4 --timeout 60s), the stress threads are dynamically scheduled across both P-cores and E-cores, with the scheduler seemingly prioritizing P-cores for the most demanding tasks. Switching to a real-time scheduling policy (stress-ng --sched rr --cpu 4 --timeout 60s), however, consistently resulted in the stress threads being scheduled solely on E-cores, with a noticeable drop in performance. This held true regardless of whether we used SCHED_RR or SCHED_FIFO.
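The following is a rough sketch of how this can be reproduced and verified from the shell. The exact stress-ng flags and the assumption that P-cores and E-cores can be told apart by their reported maximum frequency are ours; check the CPU numbering on your own system before drawing conclusions.

```bash
# Map logical CPUs to cores; on hybrid parts the P-cores typically report
# a higher MAXMHZ and have two sibling threads per core.
lscpu --extended=CPU,CORE,MAXMHZ

# Reproduce the issue: run round-robin real-time stressors
# (real-time priorities generally require root).
stress-ng --sched rr --sched-prio 50 --cpu 4 --timeout 60s &

# While it runs, show each stressor thread's scheduling class, RT priority,
# and the CPU it last ran on (the psr column).
ps -eLo pid,tid,cls,rtprio,psr,comm | grep stress-ng
```

Repeating the same check without --sched rr makes the contrast visible: under SCHED_NORMAL the psr values spread across P-core and E-core CPU numbers, while under SCHED_RR they cluster on the E-cores.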
Kernel Configuration and Scheduling Policies
The Linux kernel’s scheduler is responsible for allocating CPU time to different processes and threads. Real-time scheduling policies (SCHED_FIFO and SCHED_RR) are designed to provide predictable and low-latency execution for critical tasks. However, the interaction between these policies and the hybrid architecture of Intel processors introduces new complexities.
Real-Time Scheduling Fundamentals
- SCHED_FIFO (First-In, First-Out): This policy grants a thread uninterrupted access to the CPU until it voluntarily relinquishes control (e.g., by blocking for I/O) or is preempted by a higher-priority SCHED_FIFO or SCHED_RR thread.
- SCHED_RR (Round-Robin): Similar to SCHED_FIFO, but with a time quantum. If a SCHED_RR thread exceeds its time quantum, it is preempted and moved to the end of the run queue for threads of the same priority.
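For completeness, a minimal sketch of how a workload is launched or retuned under these policies from the shell uses chrt; the priority values and the my_rt_workload binary name are illustrative placeholders.

```bash
# Launch a program under SCHED_FIFO at priority 80
# (setting real-time priorities generally requires root or CAP_SYS_NICE).
chrt --fifo 80 ./my_rt_workload

# Switch an already-running thread (by TID) to SCHED_RR at priority 50.
chrt --rr --pid 50 <tid>

# Inspect a thread's current policy and priority.
chrt --pid <tid>
```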
Potential Kernel Configuration Issues
Given that we are running a custom kernel (5.14.0-427.76.1.el9_4.x86_64+rt), it’s crucial to examine specific kernel configurations that might influence scheduling behavior on hybrid architectures. Several factors could contribute to the observed affinity for E-cores:
- CPU Frequency Scaling Governors: The CPU frequency scaling governor dictates how CPU frequency is adjusted in response to load. Certain governors might aggressively favor E-cores for power efficiency, even when real-time threads demand higher performance. We should examine the active governor using cpupower frequency-info and consider switching to a performance-oriented governor such as performance (or the utilization-driven schedutil) for testing; a diagnostic sketch follows this list.
- Intel Turbo Boost Technology: Turbo Boost allows individual cores to run above their base operating frequency. Improper configuration or driver issues related to Turbo Boost could inadvertently limit the performance of P-cores or skew the scheduler’s perception of core capabilities. Ensuring that the intel_pstate driver is correctly configured and functioning is essential.
- C-States and P-States: Deeper C-states (power-saving states) on P-cores might introduce wake-up latency that the scheduler tries to avoid by favoring E-cores, which may be sitting in shallower C-states. Disabling or limiting deep C-states on P-cores, while increasing power consumption, might improve real-time performance.
- Scheduler Tunables: The kernel exposes a number of scheduler tunables, historically under /proc/sys/kernel/ and, on newer kernels, under /sys/kernel/debug/sched/. The defaults are usually adequate, but it is worth checking whether any custom modifications inadvertently bias placement toward E-cores. In particular, examine sched_migration_cost_ns and sched_nr_migrate, which influence thread migration behavior.
- Real-Time Preemption: The CONFIG_PREEMPT_RT patch set provides full real-time preemption. Verify that it is correctly applied and configured; an incomplete or improper RT configuration can lead to unpredictable scheduling behavior.
- IRQ Affinity: Check the Interrupt Request (IRQ) affinity. If IRQs are heavily concentrated on the E-cores, threads woken by those interrupts tend to be placed near them, pulling more work onto the E-cores.
- NUMA Configuration: Although the CPU is on a single socket, it is important to verify that the NUMA configuration is correct. Improper NUMA settings could lead to unexpected scheduling behavior.
- CPU Isolation: Check the setting for CPU isolation to make sure you have not accidentally excluded P-cores.
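The shell sketch below gathers most of these data points in one pass. Exact sysfs and procfs paths vary with kernel version and distribution (several scheduler tunables moved from /proc/sys/kernel/ to /sys/kernel/debug/sched/ around 5.13), so a missing file is a hint about your kernel rather than an error; the ITMT check in particular only exists where the platform exposes it.

```bash
# Active frequency-scaling driver and governor on every CPU.
cpupower frequency-info
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

# Switch to the performance governor for testing.
sudo cpupower frequency-set -g performance

# intel_pstate mode (active/passive/off) and whether turbo is disabled.
cat /sys/devices/system/cpu/intel_pstate/status
cat /sys/devices/system/cpu/intel_pstate/no_turbo

# Confirm the running kernel is actually a PREEMPT_RT build.
uname -v | grep -i preempt
grep PREEMPT_RT /boot/config-$(uname -r)

# ITMT (core-priority) scheduling, where exposed by the platform.
cat /proc/sys/kernel/sched_itmt_enabled 2>/dev/null

# Migration-related scheduler tunables (location depends on kernel version).
cat /proc/sys/kernel/sched_nr_migrate 2>/dev/null
sudo cat /sys/kernel/debug/sched/migration_cost_ns 2>/dev/null

# IRQ distribution per CPU, isolated CPUs, and relevant boot parameters.
head -n 40 /proc/interrupts
cat /sys/devices/system/cpu/isolated
tr ' ' '\n' < /proc/cmdline | grep -E 'isolcpus|nohz_full|rcu_nocbs'
```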
Kernel Patches and Bug Reports
Given the complexity of the Linux kernel and the relatively recent introduction of hybrid architectures, it’s possible that a kernel bug or missing patch is contributing to this behavior. We should investigate the following:
- Kernel Bug Trackers: Search kernel bug trackers (e.g., Bugzilla, mailing list archives) for reports related to real-time scheduling on Intel hybrid architectures.
- Relevant Patches: Review recent kernel patches related to the scheduler, CPU frequency scaling, and Intel-specific drivers. Pay close attention to patches that address performance issues or scheduling anomalies on hybrid systems.
- Community Forums: Consult relevant community forums (e.g., Linux kernel mailing lists, real-time Linux forums) for discussions about similar issues. Other users might have encountered the same problem and found a solution or workaround.
Alternatives to Forced Affinity
While manually setting thread affinity to P-cores provides a workaround, it defeats the purpose of dynamic scheduling and might not be optimal in all scenarios. We should explore alternative solutions that allow the scheduler to intelligently utilize both P-cores and E-cores:
- cgroups (Control Groups): cgroups let you group processes and apply resource limits and scheduling constraints to the group as a whole. It may be possible to create a cgroup specifically for real-time threads and configure it to prefer P-cores. The cpuset controller can restrict a group to the P-core CPUs, and utilization clamping (or the Android-specific schedtune controller, where it exists) can bias placement further; a sketch follows this list.
- cpupower and Performance Profiles: Use cpupower to set the CPU governor to performance and build custom performance profiles that prioritize P-core usage. While this affects the entire system, it can provide a more consistent and predictable scheduling environment.
- Enhanced Intel Speed Shift Technology: Intel Speed Shift Technology (HWP) hands frequency selection to the hardware, guided by hints from the operating system, which allows much faster frequency transitions. Ensure that Speed Shift is enabled and properly configured so that P-cores can ramp up quickly in response to real-time thread demands.
- taskset with Masking: Although we want to avoid forced affinity, taskset with a CPU mask is a possible compromise. Keep in mind that an affinity mask is a hard constraint, not a preference: a mask covering only the P-cores (for example, 0xFF if logical CPUs 0-7 are your P-cores) confines threads to those CPUs outright, while a mask that also includes E-core bits allows both sets but expresses no ordering between them.
- Kernel Tracepoints and Monitoring: Use kernel tracepoints (e.g., with perf or ftrace) to monitor scheduler decisions and identify the factors influencing thread placement. This can provide valuable insight into why real-time threads end up on E-cores; see the tracing example in the profiling section below.
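A minimal sketch of the cgroup approach under cgroup v2 (the default on recent EL9 systems) is shown below. The cpuset range 0-15 assumes that logical CPUs 0-15 are the P-core threads on this i9-12900, and my_rt_workload is a placeholder process name; verify both against lscpu and your own workload before using it.

```bash
# Enable the cpuset controller for child cgroups.
echo "+cpuset" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# Create a cgroup for latency-critical work and restrict it to the
# (assumed) P-core logical CPUs.
sudo mkdir -p /sys/fs/cgroup/rt-pcores
echo "0-15" | sudo tee /sys/fs/cgroup/rt-pcores/cpuset.cpus

# Move the real-time process into the cgroup; all of its threads follow.
echo "$(pidof -s my_rt_workload)" | sudo tee /sys/fs/cgroup/rt-pcores/cgroup.procs
```

Note that cpuset is still a hard restriction, only applied at the group level rather than per thread; its main advantage over per-thread taskset calls is that the policy lives in one place and newly created threads of the managed processes inherit it automatically.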
Profiling and Benchmarking
Thorough profiling and benchmarking are essential for understanding the performance impact of different scheduling configurations. We recommend using tools like:
- perf: A powerful performance analysis tool built into the Linux kernel. It allows you to collect detailed performance data, including CPU usage, cache misses, and branch mispredictions.
- ftrace: A tracing utility that allows you to monitor kernel events in real time. It can be used to track scheduler decisions, interrupt handling, and other low-level system activities.
- SystemTap: A scripting framework that lets you write custom probes to monitor kernel behavior. It is more complex than perf or ftrace but offers greater flexibility.
- LatencyTOP: Analyzes sources of system-wide latency.
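As a concrete starting point, the sketch below records the scheduler tracepoints system-wide with perf; the 30-second window and the grep pattern (stress-ng, or your own workload name) are arbitrary choices.

```bash
# Record wakeup, context-switch and migration tracepoints on all CPUs for 30 s.
sudo perf record -e sched:sched_wakeup -e sched:sched_switch \
     -e sched:sched_migrate_task -a -- sleep 30

# Dump the raw events; the dest_cpu field of sched_migrate_task and the CPU
# column of the other events show exactly where the RT threads are placed.
sudo perf script | grep stress-ng

# Alternatively, summarise per-thread run time and wait time over a trace.
sudo perf sched record -- sleep 30
sudo perf sched timehist
```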
By comparing the performance of real-time applications under different scheduling configurations, we can identify the optimal settings for our specific workload. This should include a diverse set of tests, not just synthetic benchmarks, to represent real-world usage scenarios.
Kernel Update and Testing
Upgrading to a more recent kernel version might address underlying scheduling issues. Newer kernels often include performance improvements and bug fixes that could resolve the observed behavior. We should consider upgrading to the latest stable kernel release and retesting the real-time scheduling performance. Backporting relevant patches from newer kernels to our existing kernel might also be an option.
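Before and after any such upgrade, it is worth confirming which scheduling-related options the kernel was built with. The sketch below assumes a distribution that installs /boot/config-<version>; some of the options only exist on newer kernels, so missing entries are informative rather than an error.

```bash
# Running kernel version.
uname -r

# Build options relevant to real-time preemption and hybrid-aware scheduling.
grep -E 'CONFIG_PREEMPT_RT|CONFIG_SCHED_MC|CONFIG_SCHED_CLUSTER|CONFIG_INTEL_HFI_THERMAL' \
    /boot/config-$(uname -r)
```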
Conclusion: A Multi-Faceted Approach
Addressing the preferential scheduling of real-time threads on E-cores in Intel’s hybrid architecture requires a comprehensive approach that encompasses kernel configuration analysis, potential bug investigation, and alternative scheduling strategies. While forcing thread affinity to P-cores provides a temporary workaround, it is not a sustainable solution. By thoroughly investigating the factors outlined above and leveraging the available profiling and benchmarking tools, we can strive for dynamic scheduling that effectively utilizes both P-cores and E-cores to optimize real-time application performance. This iterative process of investigation, testing, and refinement will ultimately lead to a more robust and efficient scheduling strategy for hybrid architectures.
Specifically, we will:
- Review and adjust CPU frequency scaling governors.
- Verify Intel Turbo Boost functionality and configuration.
- Examine and potentially limit C-state usage for P-cores.
- Analyze scheduler tunables for potential misconfigurations.
- Confirm correct application and configuration of the CONFIG_PREEMPT_RT patch.
- Investigate and adjust IRQ affinity.
- Verify NUMA configuration.
- Avoid CPU isolation of P-cores.
- Search for relevant kernel bugs and patches.
- Explore cgroups and cpupower for more nuanced scheduling control.
By pursuing this multi-faceted approach, we aim to achieve optimal real-time performance on Intel’s hybrid architecture without resorting to static thread affinity assignments. As revWhiteShadow, we are committed to providing detailed and practical solutions to complex performance challenges.