Friday, July 07, 2017

Oracle Linux - reading CPU flags from /proc/cpuinfo

Most people using Linux take the CPU they have for granted. Commonly the only question is, how fast is that CPU handling my code. However, in some cases, especially when doing lower level development work or doing more advanced Oracle Linux system tuning it is good to have a bit more insight into your physical CPU. The first place to look for more information on the CPU is /proc/cpuinfo which might give a wealth of additional information.

Below is an example of the content of my /proc/cpuinfo (running Oracle Linux in a Virtualbox image on a MacBook Pro):

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model  : 78
model name : Intel(R) Core(TM) i5-6267U CPU @ 2.90GHz
stepping : 3
cpu MHz  : 2903.998
cache size : 4096 KB
physical id : 0
siblings : 2
core id  : 0
cpu cores : 2
apicid  : 0
initial apicid : 0
fpu  : yes
fpu_exception : yes
cpuid level : 22
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx rdrand hypervisor lahf_lm abm 3dnowprefetch rdseed clflushopt
bugs  :
bogomips : 5807.99
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model  : 78
model name : Intel(R) Core(TM) i5-6267U CPU @ 2.90GHz
stepping : 3
cpu MHz  : 2903.998
cache size : 4096 KB
physical id : 0
siblings : 2
core id  : 1
cpu cores : 2
apicid  : 1
initial apicid : 1
fpu  : yes
fpu_exception : yes
cpuid level : 22
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx rdrand hypervisor lahf_lm abm 3dnowprefetch rdseed clflushopt
bugs  :
bogomips : 5807.99
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

As you can see, a lot of flags are mentioned, those might be interesting to understand the capabilities of your processor in more detail. In our case the following flags are mentioned:  fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx rdrand hypervisor lahf_lm abm 3dnowprefetch rdseed clflushopt.

fpu
Onboard FPU (floating point support). designed to carry out operations on floating point numbers. Typical operations are addition, subtraction, multiplication, division, square root, and bitshifting.

vme
Virtual 8086 mode enhancements. In the 80386 microprocessor and later, virtual 8086 mode (also called virtual real mode, V86-mode or VM86) allows the execution of real mode applications that are incapable of running directly in protected mode while the processor is running a protected mode operating system. It is a hardware virtualization technique that allowed multiple 8086 processors to be emulated by the 386 chip; it emerged from the painful experiences with the 80286 protected mode, which by itself was not suitable to run concurrent real mode applications well

de
Debugging Extensions (CR4.DE). A control register CR is a processor register which changes or controls the general behavior of a CPU or other digital device. Common tasks performed by control registers include interrupt control, switching the addressing mode, paging control, and coprocessor control. CR4 is used in protected mode to control operations such as virtual-8086 support, enabling I/O breakpoints, page size extension and machine check exceptions. CR4 bit 3 controls the DE (Debugging Extension), if set, enables debug register based breaks on I/O space access.

pse
Page Size Extensions (4MB memory pages). PSE is a feature of x86 processors that allows for pages larger than the traditional 4 KiB size. It was introduced in the original Pentium processor, but it was only publicly documented by Intel with the release of the Pentium Pro.

tsc
Time Stamp Counter (RDTSC). is a 64-bit register present on all x86 processors since the Pentium. It counts the number of cycles since reset. The instruction RDTSC returns the TSC in EDX:EAX. In x86-64 mode, RDTSC also clears the higher 32 bits of RAX and RDX. Its opcode is 0F 31.

msr
Model-Specific Registers (RDMSR, WRMSR). MSR is any of various control registers in the x86 instruction set used for debugging, program execution tracing, computer performance monitoring, and toggling certain CPU features.

pae
Physical Address Extensions (support for more than 4GB of RAM). Physical Address Extension (PAE), sometimes referred to as Page Address Extension, is a memory management feature for the x86 architecture. PAE was first introduced by Intel in the Pentium Pro, and later by AMD in the Athlon processor.[2] It defines a page table hierarchy of three levels, with table entries of 64 bits each instead of 32, allowing these CPUs to directly access a physical address space larger than 4 gigabytes

mce
Machine Check Exception. A Machine Check Exception (MCE) is a type of computer hardware error that occurs when a computer's central processing unit detects a hardware problem. Modern versions of Microsoft Windows handle machine check exceptions through the Windows Hardware Error Architecture. On Linux, a process (such as klogd[2]) writes a message to the kernel log and/or the console screen (usually only to the console when the error is non-recoverable and the machine crashes as a result)

cx8
CMPXCHG8 instruction (64-bit compare-and-swap). Compares the 64-bit value in EDX:EAX (or 128-bit value in RDX:RAX if operand size is 128 bits) with the operand (destination operand). If the values are equal, the 64-bit value in ECX:EBX (or 128-bit value in RCX:RBX) is stored in the destination operand. Otherwise, the value in the destination operand is loaded into EDX:EAX (or RDX:RAX). The destination operand is an 8-byte memory location (or 16-byte memory location if operand size is 128 bits). For the EDX:EAX and ECX:EBX register pairs, EDX and ECX contain the high-order 32 bits and EAX and EBX contain the low-order 32 bits of a 64-bit value. For the RDX:RAX and RCX:RBX register pairs, RDX and RCX contain the high-order 64 bits and RAX and RBX contain the low-order 64bits of a 128-bit value. This instruction encoding is not supported on Intel processors earlier than the Pentium processors.

apic
Onboard APIC. Advanced Programmable Interrupt Controller (APIC) is a family of interrupt controllers. As its name suggests, the APIC is more advanced than Intel's 8259 Programmable Interrupt Controller (PIC), particularly enabling the construction of multiprocessor systems. It is one of several architectural designs intended to solve interrupt routing efficiency issues in multiprocessor computer systems.

sep
SYSENTER/SYSEXIT. Executes a fast call to a level 0 system procedure or routine. SYSENTER is a companion instruction to SYSEXIT. The instruction is optimized to provide the maximum performance for system calls from user code running at privilege level 3 to operating system or executive procedures running at privilege level 0.

mtrr
Memory Type Range Registers. Memory type range registers (MTRRs) are a set of processor supplementary capabilities control registers that provide system software with control of how accesses to memory ranges by the CPU are cached. It uses a set of programmable model-specific registers (MSRs) which are special registers provided by most modern CPUs. Possible access modes to memory ranges can be uncached, write-through, write-combining, write-protect, and write-back. In write-back mode, writes are written to the CPU's cache and the cache is marked dirty, so that its contents are written to memory later.

pge
Page Global Enable (global bit in PDEs and PTEs). the page global enable (PGE) flag in the register CR4 and the global (G) flag of a page-directory or page-table entry can be used to prevent frequently used pages from being automatically invalidated in the TLBs on a task switch or a load of register CR3. Bit 7 in CR4 is used to set PGE, if set, address translations (PDE or PTE records) may be shared between address spaces.

mca
Machine Check Architecture. Machine Check Architecture (MCA) is an Intel mechanism in which the CPU reports hardware errors to the operating system. Intel's Pentium 4, Intel Xeon, P6 family processors as well as the Itanium architecture implement a machine check architecture that provides a mechanism for detecting and reporting hardware (machine) errors, such as: system bus errors, ECC errors, parity errors, cache errors, and translation lookaside buffer errors. It consists of a set of model-specific registers (MSRs) that are used to set up machine checking and additional banks of MSRs used for recording errors that are detected.

cmov
CMOV instructions (conditional move) (also FCMOV). FCMOV is a floating point conditional move opcode of the Intel x86 architecture, first introduced in Pentium Pro processors. It copies the contents of one of the floating point stack register, depending on the contents of EFLAGS integer flag register, to the ST(0) (top of stack) register. There are 8 variants of the instruction selected by the condition codes that need be set for the instruction to perform the move. Similar to the CMOV instruction, FCMOV allows some conditional operations to be performed without the usual branching overhead. However, it has a higher latency than conditional branch instructions.[2] Therefore, it is most useful for simple yet unpredictable comparison or conditional operations, where it can provide substantial performance gains. The instruction is usually used with the FCOMI instruction or the FCOM-FSTSW-SAHF idiom to set the relevant condition codes based on the result of a floating point comparison.

pat
Page Attribute Table. The page attribute table (PAT) is a processor supplementary capability extension to the page table format of certain x86 and x86-64 microprocessors. Like memory type range registers (MTRRs), they allow for fine-grained control over how areas of memory are cached, and are a companion feature to the MTRRs.

pse36
36-bit PSEs (huge pages). PSE-36 (36-bit Page Size Extension) refers to a feature of x86 processors that extends the physical memory addressing capabilities from 32 bits to 36 bits, allowing addressing to up to 64 GB of memory. Compared to the Physical Address Extension (PAE) method, PSE-36 is a simpler alternative to addressing more than 4 GB of memory. It uses the Page Size Extension (PSE) mode and a modified page directory table to map 4 MB pages into a 64 GB physical address space. PSE-36's downside is that, unlike PAE, it doesn't have 4-KB page granularity above the 4 GB mark.

clflush
Cache Line Flush instruction. Invalidates the cache line that contains the linear address specified with the source operand from all levels of the processor cache hierarchy (data and instruction). The invalidation is broadcast throughout the cache coherence domain. If, at any level of the cache hierarchy, the line is inconsistent with memory (dirty) it is written to memory before invalidation. The source operand is a byte memory location. The availability of CLFLUSH is indicated by the presence of the CPUID feature flag CLFSH (bit 19 of the EDX register, see Section , CPUID-CPU Identification). The aligned cache line size affected is also indicated with the CPUID instruction (bits 8 through 15 of the EBX register when the initial value in the EAX register is 1).

mmx
Multimedia Extensions. MMX is a single instruction, multiple data (SIMD) instruction set designed by Intel, introduced in 1997 with its P5-based Pentium line of microprocessors, designated as "Pentium with MMX Technology".It developed out of a similar unit introduced on the Intel i860, and earlier the Intel i750 video pixel processor. MMX is a processor supplementary capability that is supported on recent IA-32 processors by Intel and other vendors.

fxsr
FXSAVE/FXRSTOR, CR4.OSFXSR. Operating system support for FXSAVE and FXRSTOR instructions. If set, enables SSE instructions and fast FPU save & restore. bit 9 in CR4 is used to control fxsr.

sse
Intel SSE vector instructions. Streaming SIMD Extensions (SSE) is an SIMD instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in their Pentium III series of processors shortly after the appearance of AMD's 3DNow!. SSE contains 70 new instructions, most of which work on single precision floating point data. SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects. Typical applications are digital signal processing and graphics processing.

sse2
SSE2 (Streaming SIMD Extensions 2), is one of the Intel SIMD (Single Instruction, Multiple Data) processor supplementary instruction sets first introduced by Intel with the initial version of the Pentium 4 in 2001. It extends the earlier SSE instruction set, and is intended to fully replace MMX. Intel extended SSE2 to create SSE3 in 2004. SSE2 added 144 new instructions to SSE, which has 70 instructions. Competing chip-maker AMD added support for SSE2 with the introduction of their Opteron and Athlon 64 ranges of AMD64 64-bit CPUs in 2003.

ht
Hyper-threading (officially called Hyper-Threading Technology or HT Technology, and abbreviated as HTT or HT) is Intel's proprietary simultaneous multithreading (SMT) implementation used to improve parallelization of computations (doing multiple tasks at once) performed on x86 microprocessors. It first appeared in February 2002 on Xeon server processors and in November 2002 on Pentium 4 desktop CPUs.[4] Later, Intel included this technology in Itanium, Atom, and Core 'i' Series CPUs, among others.

syscall
SYSCALL (Fast System Call) and SYSRET (Return From Fast System Call). SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR (after saving the address of the instruction following SYSCALL into RCX). (The WRMSR instruction ensures that the IA32_LSTAR MSR always contain a canonical address.)

nx
Execute Disable. The NX bit, which stands for No-eXecute, is a technology used in CPUs to segregate areas of memory for use by either storage of processor instructions (code) or for storage of data, a feature normally only found in Harvard architecture processors. However, the NX bit is being increasingly used in conventional von Neumann architecture processors, for security reasons.

rdtscp
Read Time-Stamp Counter and Processor ID. Loads the current value of the processor’s time-stamp counter (a 64-bit MSR) into the EDX:EAX registers and also loads the IA32_TSC_AUX MSR (address C000_0103H) into the ECX register. The EDX register is loaded with the high-order 32 bits of the IA32_TSC MSR; the EAX register is loaded with the low-order 32 bits of the IA32_TSC MSR; and the ECX register is loaded with the low-order 32-bits of IA32_TSC_AUX MSR. On processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX, RDX, and RCX are cleared.

lm
Long Mode (x86-64: amd64, also known as Intel 64, i.e. 64-bit capable). Long mode is the mode where a 64-bit operating system can access 64-bit instructions and registers. 64-bit programs are run in a sub-mode called 64-bit mode, while 32-bit programs and 16-bit protected mode programs are executed in a sub-mode called compatibility mode. Real mode or virtual 8086 mode programs cannot be natively run in long mode.

constant_tsc
TSC ticks at a constant rate. Recent Intel processors include a constant rate TSC (identified by the kern.timecounter.invariant_tsc sysctl on FreeBSD or by the "constant_tsc" flag in Linux's /proc/cpuinfo). With these processors, the TSC ticks at the processor's nominal frequency, regardless of the actual CPU clock frequency due to turbo or power saving states. Hence TSC ticks are counting the passage of time, not the number of CPU clock cycles elapsed.

rep_good
rep microcode works well.

nopl
The NOPL (0F 1F) instructions

xtopology
cpu topology enum extensions

nonstop_tsc
TSC does not stop in C states. NONSTOP_TSC acts in conjunction with CONSTANT_TSC. CONSTANT_TSC indicates that the TSC runs at constant frequency irrespective of P/T- states, and NONSTOP_TSC indicates that TSC does not stop in deep C-states.

pni
SSE-3 (“Prescott New Instructions”). SSE3, Streaming SIMD Extensions 3, also known by its Intel code name Prescott New Instructions (PNI), is the third iteration of the SSE instruction set for the IA-32 (x86) architecture. Intel introduced SSE3 in early 2004 with the Prescott revision of their Pentium 4 CPU. In April 2005, AMD introduced a subset of SSE3 in revision E (Venice and San Diego) of their Athlon 64 CPUs. The earlier SIMD instruction sets on the x86 platform, from oldest to newest, are MMX, 3DNow! (developed by AMD, but not supported by Intel processors), SSE, and SSE2.

pclmulqdq
Perform a Carry-Less Multiplication of Quadword instruction — accelerator for GCM). Carry-less Multiplication (CLMUL) is an extension to the x86 instruction set used by microprocessors from Intel and AMD which was proposed by Intel in March 2008 and made available in the Intel Westmere processors announced in early 2010. One use of these instructions is to improve the speed of applications doing block cipher encryption in Galois/Counter Mode, which depends on finite field GF(2k)) multiplication, which can be implemented more efficiently with the new CLMUL instructions than with the traditional instruction set. Another application is the fast calculation of CRC values, including those used to implement the LZ77 sliding window DEFLATE algorithm in zlib and pngcrush.

ssse3
Supplemental SSE-3. SSSE3 was first introduced with Intel processors based on the Core microarchitecture on 26 June 2006 with the "Woodcrest" Xeons. SSSE3 has been referred to by the codenames Tejas New Instructions (TNI) or Merom New Instructions (MNI) for the first processor designs intended to support it. SSSE3 contains 16 new discrete instructions as a supplement on SSE-3

cx16
CMPXCHG16B. al words. This is useful for parallel algorithms that use compare and swap on data larger than the size of a pointer, common in lock-free and wait-free algorithms. Without CMPXCHG16B one must use workarounds, such as a critical section or alternative lock-free approaches.

sse4_1
SSE4.1 instruction set. These instructions were introduced with Penryn microarchitecture, the 45 nm shrink of Intel's Core microarchitecture. Support is indicated via the CPUID.01H:ECX.SSE41[Bit 19] flag.

sse4_2
SSE4.2 instruction set. SSE4.2 added STTNI (String and Text New Instructions), several new instructions that perform character searches and comparison on two operands of 16 bytes at a time. These were designed (among other things) to speed up the parsing of XML documents. It also added a CRC32 instruction to compute cyclic redundancy checks as used in certain data transfer protocols. These instructions were first implemented in the Nehalem-based Intel Core i7 product line and complete the SSE4 instruction set. Support is indicated via the CPUID.01H:ECX.SSE42[Bit 20] flag.

x2apic
The xAPIC was introduced with the Pentium 4, while the x2APIC is the most recent generation of the Intel's programmable interrupt controller, introduced with the Nehalem microarchitecture. The major improvements of the x2APIC address the number of supported CPUs and performance of the interface. The x2APIC now uses 32 bits to address CPUs, allowing to address up to 232 − 1 CPUs using the physical destination mode. The logical destination mode now works differently and introduces clusters; using this mode, one can address up to 220 − 16 processors. The x2APIC architecture also provides backward compatibility modes to the original Intel APIC Architecture (introduced with the Pentium/P6) and with the xAPIC architecture (introduced with the Pentium 4).

movbe
Move Data After Swapping Bytes instruction. Performs a byte swap operation on the data copied from the second operand (source operand) and store the result in the first operand (destination operand). The source operand can be a general-purpose register, or memory loca-tion; the destination register can be a general-purpose register, or a memory location; however, both operands can not be registers, and only one operand can be a memory location. Both operands must be the same size, which can be a word, a doubleword or quadword. The MOVBE instruction is provided for swapping the bytes on a read from memory or on a write to memory; thus providing support for converting little-endian values to big-endian format and vice versa. In 64-bit mode, the instruction's default operation size is 32 bits. Use of the REX.R prefix permits access to additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.

popcnt
Return the Count of Number of Bits Set to 1 instruction (Hamming weight, i.e. bit count). These instructions operate on integer rather than SSE registers, because they are not SIMD instructions, but appear at the same time and although introduced by AMD with the SSE4a instruction set, they are counted as separate extensions with their own dedicated CPUID bits to indicate support. Intel implements POPCNT beginning with the Nehalem microarchitecture and LZCNT beginning with the Haswell microarchitecture. AMD implements both beginning with the Barcelona microarchitecture. Population count (count number of bits set to 1). Support is indicated via the CPUID.01H:ECX.POPCNT[Bit 23] flag.

aes
Advanced Encryption Standard Instruction Set (or the Intel Advanced Encryption Standard New Instructions; AES-NI) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD proposed by Intel in March 2008. The purpose of the instruction set is to improve the speed of applications performing encryption and decryption using the Advanced Encryption Standard (AES).

xsave
Save Processor Extended States: also provides XGETBY,XRSTOR,XSETBY. Performs a full or partial save of processor state components to the XSAVE area located at the memory address specified by the destination operand. The implicit EDX:EAX register pair specifies a 64-bit instruction mask. The specific state components saved correspond to the bits set in the requested-feature bitmap (RFBM), which is the logical-AND of EDX:EAX and XCR0.

avx
Advanced Vector Extensions. Advanced Vector Extensions (AVX) are extensions to the x86 instruction set architecture for microprocessors from Intel and AMD proposed by Intel in March 2008 and first supported by Intel with the Sandy Bridge processor shipping in Q1 2011 and later on by AMD with the Bulldozer processor shipping in Q3 2011. AVX provides new features, new instructions and a new coding scheme.

rdrand
Read Random Number from hardware random number generator instruction. RDRAND (previously known as Bull Mountain) is an instruction for returning random numbers from an Intel on-chip hardware random number generator which has been seeded by an on-chip entropy source. RDRAND is available in Ivy Bridge processors and is part of the Intel 64 and IA-32 instruction set architectures. AMD added support for the instruction in June 2015. The random number generator is compliant with security and cryptographic standards such as NIST SP 800-90A, FIPS 140-2, and ANSI X9.82. Intel also requested Cryptography Research Inc. to review the random number generator in 1999 and 2012, which resulted in two published papers: The Intel Random Number Generator in 1999, and Analysis of Intel's Ivy Bridge Digital Random Number Generator in 2012.

hypervisor
This flag is set by the hypervisor to indicate that your machine is running as a virtual machine on a hypervisor. Even though the presence of the flag is a good indicator that your machine is in fact a virtual machine running on a hypervisor you should be careful when building logic upon the presence of this flag. It is the decision of the hypervisor to push the flag or not. This means that if the flag is not pushed by the hypervisor, and is not present, the machine can still be a virtual machine instead of a bare-metal machine.

lahf_lm 
Load AH from Flags (LAHF) and Store AH into Flags (SAHF) in long mode.

abm
Advanced Bit Manipulation. ABM is only implemented as a single instruction set by AMD; all AMD processors support both instructions or neither. Intel considers POPCNT as part of SSE4.2, and LZCNT as part of BMI1. POPCNT has a separate CPUID flag; however, Intel uses AMD's ABM flag to indicate LZCNT support (since LZCNT completes the ABM).

3dnowprefetch
3DNow prefetch instructions. 3DNow! is an extension to the x86 instruction set developed by Advanced Micro Devices (AMD). It adds single instruction multiple data (SIMD) instructions to the base x86 instruction set, enabling it to perform vector processing, which improves the performance of many graphic-intensive applications. The first microprocessor to implement 3DNow was the AMD K6-2, which was introduced in 1998. When the application was appropriate this raised the speed by about 2-4 times. However, the instruction set never gained much popularity, and AMD announced on August 2010 that support for 3DNow would be dropped in future AMD processors, except for two instructions (the PREFETCH and PREFETCHW instructions). The two instructions are also available in Bay-Trail Intel processors.

rdseed
The RDSEED instruction. Non-deterministic random bit generator compatible with NIST SP 800-90B & C (drafts)

clflushopt
CLFLUSHOPT instruction. With the CLFLUSHOPT instruction, a store buffer only needs to be flushed if it holds data from the same cache line that the CLFLUSHOPT is accessing.   Store buffers holding any other addresses can continue to "cache" their stored data until some other mechanism forces them to push that data to the L1 Data Cache (where it becomes visible to all agents in the coherence fabric).

Reading and understanding the above flags will give you a good insight into the capabilities of your processor and how Oracle Linux is seeing the processor. Additionally, when you are writing low level code for processes that need to interact relatively direct with the hardware (or virtual hardware) it is of vital essence to understand what the options are at your disposal and what the machine is capable of doing. 

No comments: