Is your cloud ready to perform? The move from dedicated appliances to virtualized network functions is full of risk. Techniques such as CPU pinning, NUMA allocation, and SR-IOV can improve throughput, but they also carry significant downsides, including increased latency, reduced mobility, and susceptibility to attacks, to name just a few.
To use these strategies successfully, it is critical to:
- Have a clear understanding of the risks and trade-offs
- Test with realistic workloads
- Have clear expectations for network performance
- Gain insight into how underlying resources are being utilized
For example, consider CPU pinning, which assigns jobs to specific virtual cores. Pinning does not mean that those cores will not be used by other jobs; in fact, quite the opposite. In a virtualized environment, cores are by definition still shared with other jobs, and how they are shared is up to the hypervisor. Hypervisors introduce further complexity of their own: the scheduling algorithm depends on the vendor and even the version, and hypervisors, like any other modern software product, are continually updated. Do the cores march together like soldiers, using the technique known as “gang scheduling”, or do they follow a more flexible algorithm?
To make matters even more complicated, the combined behavior of CPU pinning and the hypervisor depends on the nature of the workload itself. With certain workloads, pinning can actually reduce performance rather than improve it, defeating its original purpose. Because pinning requires a job to run on the specified cores, the timing of that job is at the mercy of the hypervisor algorithm. If the algorithm decides that a particular job must wait, performance takes a hit.
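To make the distinction concrete, here is a minimal sketch of pinning at the operating-system level, assuming a Linux system where Python exposes `os.sched_setaffinity`. It is an analogy for hypervisor-level pinning, not a hypervisor API, and the helper name `pin_to_one_cpu` is hypothetical. The call only restricts where the process may run; it does nothing to keep other work off that CPU.

```python
import os


def pin_to_one_cpu(pid=0):
    """Pin a process (0 = the caller) to one CPU from its allowed set.

    Returns (before, after) affinity sets, or None where the platform
    does not support or permit the Linux affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. macOS or Windows
    try:
        before = os.sched_getaffinity(pid)
        target = min(before)  # arbitrary choice for this sketch
        os.sched_setaffinity(pid, {target})
        after = os.sched_getaffinity(pid)
    except OSError:
        return None  # the environment refused the request
    return before, after


if __name__ == "__main__":
    result = pin_to_one_cpu()
    if result is None:
        print("CPU affinity calls are not available here")
    else:
        before, after = result
        print(f"allowed CPUs before pinning: {sorted(before)}")
        print(f"allowed CPUs after pinning:  {sorted(after)}")
        # Pinning restricts this process; it does not reserve the CPU
        # for it. Other processes can still be scheduled on that CPU.
        os.sched_setaffinity(0, before)  # undo the pinning
```

After the call, the process can run on exactly one CPU, yet the scheduler remains free to place any other job there, which is the situation described above.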
Non-Uniform Memory Access, or NUMA, is an architecture commonly used in modern chip designs. With NUMA, memory is organized into nodes, each ideally close to the cores that use it, on the premise that access to local memory is faster than access to remote memory. Again, the goal is to improve performance. However, virtualized environments are designed with flexibility in mind and can allow workloads to access remote memory. The net effect of remote memory access in a NUMA environment is increased latency. If a workload includes latency-sensitive traffic such as VoIP, it is critical to ensure that remote memory access does not push such services outside their acceptance criteria.
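One quick sanity check is whether the cores given to a workload all sit on the same NUMA node. The sketch below assumes a Linux host, where each node lists its CPUs in `/sys/devices/system/node/nodeN/cpulist`; the function names are hypothetical, and a real deployment would also need to check where the memory itself is allocated.

```python
from pathlib import Path
from typing import Dict, Iterable, Set


def parse_cpulist(text: str) -> Set[int]:
    """Parse a sysfs CPU list such as '0-3,8-11' into a set of CPU ids."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numa_topology(root: str = "/sys/devices/system/node") -> Dict[int, Set[int]]:
    """Map each NUMA node id to its CPUs (empty dict if none are found)."""
    topology: Dict[int, Set[int]] = {}
    for node_dir in Path(root).glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            node_id = int(node_dir.name[len("node"):])
            topology[node_id] = parse_cpulist(cpulist.read_text())
    return topology


def nodes_spanned(cpus: Iterable[int], topology: Dict[int, Set[int]]) -> Set[int]:
    """Return the NUMA nodes that the given CPUs belong to."""
    wanted = set(cpus)
    return {node for node, node_cpus in topology.items() if node_cpus & wanted}


if __name__ == "__main__":
    topology = numa_topology()
    print("NUMA nodes found:", {n: sorted(c) for n, c in topology.items()})
```

If `nodes_spanned` returns more than one node for a workload's cores, some of its memory accesses are likely to be remote, and latency-sensitive traffic deserves a closer look.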
SR-IOV, or Single Root I/O Virtualization, is a PCI Express technique for improving I/O throughput in virtualized environments. It defines virtual functions (VFs), each allocated just enough physical resources, such as registers and queues, to move packets efficiently through a virtual NIC. In effect, SR-IOV allows packets to be processed more rapidly because an interrupt to the hypervisor is no longer required. While SR-IOV can improve throughput, it can also increase latency when traffic consists of larger packets. Again, this can be a problem for latency-sensitive workloads such as VoIP. The efficiency of SR-IOV also comes into question in environments where security threats such as Denial of Service (DoS) attacks are present, since attack packets must still be processed by the hypervisor. Network architects must also consider the practical limits on the number of VFs per port. While SR-IOV theoretically allows for up to 256 VFs, in practice the limit is typically 8 VFs for a 1G card and 64 VFs for a 10G card. In a networking environment where workloads, such as those of a virtual router, are expected to scale dramatically, the practical limits may be even lower.
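The VF limits translate directly into port counts. As a rough planning sketch, the function below takes the per-port figures quoted above (8 VFs for 1G, 64 for 10G) as its defaults; the names and the `headroom` parameter are assumptions for illustration, and the real limit for a given card should be taken from the hardware itself.

```python
import math

# Per-port VF limits as quoted in the text; actual values vary by card.
PRACTICAL_VF_LIMIT = {"1G": 8, "10G": 64}


def ports_needed(vfs_required: int, card: str, headroom: float = 0.0) -> int:
    """Estimate how many ports are needed to host a number of VFs.

    headroom is the fraction of each port's VF limit held back,
    for example 0.25 to plan against 75% of the practical limit.
    """
    if vfs_required < 0:
        raise ValueError("vfs_required must not be negative")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in the range [0, 1)")
    usable = math.floor(PRACTICAL_VF_LIMIT[card] * (1.0 - headroom))
    if usable < 1:
        raise ValueError("headroom leaves no usable VFs per port")
    return math.ceil(vfs_required / usable)


if __name__ == "__main__":
    for card in ("1G", "10G"):
        print(card, "ports for 100 VFs:", ports_needed(100, card))
```

Holding back headroom is one simple way to reflect the point that practical limits may fall below the typical figures as workloads scale.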
While all of these techniques offer real benefits, they clearly also come with risks. There is no substitute for a genuine understanding of how they work and, more importantly, practical validation of their effectiveness in realistic test environments. Realism entails testing with realistic workloads and ensuring that networks process such workloads according to clearly defined acceptance criteria, including both throughput and latency. Meaningful testing should also go beyond simply offering a workload and declaring Pass/Fail; it should provide detailed results that build up to an overall test verdict. In a word, testing should provide insight. Test results should not be limited to classic metrics such as loss, latency, and throughput, but should also include visibility into the utilization of virtual resources in order to identify and mitigate potential bottlenecks, decreased performance, or even network outages.
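The idea of detailed results building up to an overall verdict can be sketched in a few lines. The example below is illustrative only: the `Check` class, the metric names, and the numbers are all hypothetical, and a real test tool would collect far more than three values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Check:
    """One acceptance criterion and the value measured against it."""
    name: str
    measured: float
    limit: float
    higher_is_better: bool

    @property
    def passed(self) -> bool:
        if self.higher_is_better:
            return self.measured >= self.limit
        return self.measured <= self.limit


def evaluate(checks: List[Check]) -> Tuple[bool, List[str]]:
    """Return the overall verdict plus one detail line per criterion."""
    report = []
    for check in checks:
        relation = ">=" if check.higher_is_better else "<="
        status = "PASS" if check.passed else "FAIL"
        report.append(
            f"{status}  {check.name}: {check.measured} "
            f"(required {relation} {check.limit})"
        )
    return all(check.passed for check in checks), report


if __name__ == "__main__":
    # Made-up numbers for illustration, not real measurements.
    checks = [
        Check("throughput_gbps", 9.4, 9.0, higher_is_better=True),
        Check("p99_latency_ms", 2.8, 2.0, higher_is_better=False),
        Check("vcpu_utilization_pct", 71.0, 85.0, higher_is_better=False),
    ]
    ok, report = evaluate(checks)
    print("\n".join(report))
    print("OVERALL:", "PASS" if ok else "FAIL")
```

With these made-up numbers the throughput criterion is met but the latency criterion is not, so the overall verdict fails even though a throughput-only test would have passed.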
To learn more about the nature of these risks, and how best to avert them, check out our latest white paper: Cloudy with a chance of poor network performance.