| Latency | Description |
|---|---|
| Name resolution latency | The time for a host to be resolved to an IP address, usually by DNS resolution—a common source of performance issues. |
| Ping latency | The time from an ICMP echo request to a response. This measures the network and kernel stack handling of the packet on each host. |
| TCP connection initialization latency | The time from when a SYN is sent to when the SYN,ACK is received. Since no applications are involved, this measures the network and kernel stack latency on each host, similar to ping latency, with some additional kernel processing for the TCP session. TCP Fast Open (TFO) may be used to reduce this latency. |
| TCP first-byte latency | Also known as the time-to-first-byte latency (TTFB), this measures the time from when a connection is established to when the first data byte is received by the client. This includes CPU scheduling and application think time for the host, making it more a measure of application performance and current load than of TCP connection latency. |
| TCP retransmits | If present, can add thousands of milliseconds of latency to network I/O. |
| TCP TIME_WAIT latency | The duration that locally closed TCP sessions are left waiting for late packets. |
| Connection/session lifespan | The duration of a network connection from initialization to close. Some protocols like HTTP can use a keep-alive strategy, leaving connections open and idle for future requests, to avoid the overheads and latency of repeated connection establishment. |
| System call send/receive latency | Time for the socket read/write calls (any syscalls that read/write to sockets, including read(2), write(2), recv(2), send(2), and variants). |
| System call connect latency | For connection establishment; note that some applications perform this as a non-blocking syscall. |
| Network round-trip time | The time for a network request to make a round-trip between endpoints. The kernel may use such measurements with congestion control algorithms. |
| Interrupt latency | Time from a network controller interrupt for a received packet to when it is serviced by the kernel. |
| Inter-stack latency | Time for a packet to move through the kernel TCP/IP stack. |
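The client-visible latencies above can be approximated from user space. Below is a minimal Python sketch (not from the source) that measures TCP connection latency and first-byte latency against a throwaway local server; the server, port selection, and one-byte reply are all made up for illustration:

```python
import socket
import threading
import time

def run_server(ready, port_holder):
    # Minimal TCP server: accept one connection, send one byte, close.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))          # let the OS pick a free port
    srv.listen(1)
    port_holder.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()
    conn.sendall(b"x")                  # the "first byte"
    conn.close()
    srv.close()

ready = threading.Event()
port_holder = []
threading.Thread(target=run_server, args=(ready, port_holder), daemon=True).start()
ready.wait()

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
t0 = time.perf_counter()
sock.connect(("127.0.0.1", port_holder[0]))   # SYN -> SYN,ACK -> ACK
t_connect = time.perf_counter() - t0

t1 = time.perf_counter()
first_byte = sock.recv(1)                     # wait for first data byte
t_first_byte = time.perf_counter() - t1
sock.close()

print(f"connect latency:    {t_connect * 1e6:.0f} us")
print(f"first-byte latency: {t_first_byte * 1e6:.0f} us")
```

Over loopback both numbers are tiny; against a remote host, the gap between them is roughly the server's think time plus scheduling delay.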
Tag: performance
TCP performance features
- Sliding window: This allows multiple packets up to the size of the window to be sent on the network before acknowledgments are received, providing high throughput even on high-latency networks. The size of the window is advertised by the receiver to indicate how many packets it is willing to receive at that time.
- Congestion avoidance: To prevent sending too much data and causing saturation, which can cause packet drops and worse performance.
- Slow-start: Part of TCP congestion control, this begins with a small congestion window and then increases it as acknowledgments (ACKs) are received within a certain time. When they are not, the congestion window is reduced.
- Selective acknowledgments (SACKs): Allow TCP to acknowledge discontinuous packets, reducing the number of retransmits required.
- Fast retransmit: Instead of waiting on a timer, TCP can retransmit dropped packets based on the arrival of duplicate ACKs. These are a function of round-trip time and not the typically much slower timer.
- Fast recovery: This recovers TCP performance after loss is detected via duplicate ACKs: the congestion window is reduced (typically halved) and transmission continues in congestion avoidance, rather than dropping all the way back to slow-start.
- TCP fast open: Allows a client to include data in a SYN packet, so that server request processing can begin earlier and not wait for the SYN handshake (RFC 7413). This can use a cryptographic cookie to authenticate the client.
- TCP timestamps: Includes a timestamp for sent packets that is returned in the ACK, so that round-trip time can be measured (RFC 1323) [Jacobson 92].
- TCP SYN cookies: Provides cryptographic cookies to clients during possible SYN flood attacks (full backlogs) so that legitimate clients can continue to connect, and without the server needing to store extra data for these connection attempts.
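Slow-start and congestion avoidance can be illustrated with a toy simulation. This is a simplified, Tahoe-style model (window counted in whole segments, loss modeled as a timeout), not how any real kernel implements congestion control:

```python
def cwnd_growth(rtts, ssthresh=8, loss_at=None):
    """Simulate congestion window growth, in segments, per RTT.

    Slow-start doubles cwnd each RTT until ssthresh; congestion
    avoidance then adds one segment per RTT. A loss (timeout) halves
    ssthresh and resets cwnd to 1 (Tahoe-style, simplified).
    """
    cwnd = 1
    history = []
    for rtt in range(rtts):
        history.append(cwnd)
        if loss_at is not None and rtt == loss_at:
            ssthresh = max(cwnd // 2, 2)
            cwnd = 1               # timeout: back to slow-start
        elif cwnd < ssthresh:
            cwnd *= 2              # slow-start: exponential growth
        else:
            cwnd += 1              # congestion avoidance: linear growth
    return history

print(cwnd_growth(8))                  # [1, 2, 4, 8, 9, 10, 11, 12]
print(cwnd_growth(6, loss_at=3))       # [1, 2, 4, 8, 1, 2]
```

The first run shows the exponential-then-linear shape; the second shows the window collapsing after a loss.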
Network terminologies
- Interface: The term interface port refers to the physical network connector. The term interface or link refers to the logical instance of a network interface port, as seen and configured by the OS. (Not all OS interfaces are backed by hardware: some are virtual.)
- Packet: The term packet refers to a message in a packet-switched network, such as IP packets.
- Frame: A physical network-level message, for example an Ethernet frame.
- Socket: An API originating from BSD for network endpoints.
- Bandwidth: The maximum rate of data transfer for the network type, usually measured in bits per second. “100 GbE” is Ethernet with a bandwidth of 100 Gbits/s. There may be bandwidth limits for each direction, so a 100 GbE interface may be capable of 100 Gbits/s transmit and 100 Gbits/s receive in parallel (200 Gbits/s total throughput).
- Throughput: The current data transfer rate between the network endpoints, measured in bits per second or bytes per second.
- Latency: Network latency can refer to the time it takes for a message to make a round-trip between endpoints, or the time required to establish a connection (e.g., TCP handshake), excluding the data transfer time that follows.
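Bandwidth and latency combine in a simple rough model: total transfer time is about one round-trip to issue the request plus the payload's serialization delay. A sketch, ignoring slow-start and protocol overhead:

```python
def transfer_time(size_bytes, bandwidth_bps, rtt_s):
    """Rough time to fetch size_bytes over a link: one round-trip to
    request, plus the serialization delay of the payload. Ignores TCP
    slow-start and headers, so it is a lower bound for real transfers."""
    return rtt_s + (size_bytes * 8) / bandwidth_bps

# 1 MiB over a 100 Mbit/s link with a 10 ms round-trip:
t = transfer_time(1 * 1024 * 1024, 100e6, 0.010)
print(f"{t * 1000:.1f} ms")   # 93.9 ms: dominated by serialization delay
```

For small payloads the round-trip dominates instead, which is why keep-alive connections and request batching matter on high-latency links.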
Memory architecture
This section introduces memory architecture, both hardware and software, including processor and operating system specifics.
Hardware
Main Memory
The common type of main memory in use today is dynamic random-access memory (DRAM). This is a type of volatile memory—its contents are lost when power is lost. DRAM provides high-density storage, as each bit is implemented using only two logical components: a capacitor and a transistor. The capacitor requires a periodic refresh to maintain charge.
Latency
The access time of main memory can be measured as the column address strobe (CAS) latency: the time between sending a memory module the desired address (column) and when the data is available to be read.
Main Memory Architecture
Uniform Memory Access

Non-Uniform Memory Access

Buses
Main memory may be accessed in one of the following ways:
- Shared system bus: Single or multiprocessor, via a shared system bus, a memory bridge controller, and finally a memory bus.
- Direct: Single processor with directly attached memory via a memory bus.
- Interconnect: Multiprocessor, each with directly attached memory via a memory bus, and processors connected via a CPU interconnect.
Multichannel
System architectures may support the use of multiple memory buses in parallel, to improve bandwidth. Common multiples are dual-, triple-, and quad-channel.
CPU Caches
Processors typically include on-chip hardware caches to improve memory access performance. The caches may include the following levels, of decreasing speed and increasing size:
- Level 1: Usually split into a separate instruction cache and data cache
- Level 2: A cache for both instructions and data
- Level 3: Another larger level of cache
Depending on the processor, Level 1 is typically referenced by virtual memory addresses, and Level 2 onward by physical memory addresses.
MMU
The MMU (memory management unit) is responsible for virtual-to-physical address translations. These are performed per page, and offsets within a page are mapped directly.

TLB
The MMU uses a TLB (translation lookaside buffer) as the first level of address translation cache, followed by the page tables in main memory. The TLB may be divided into separate caches for instruction and data pages.
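The VPN/offset split and the TLB-then-page-table lookup order can be sketched with a toy translation function. This is illustrative only: 4 KiB pages, a flat dict standing in for the page table, and hypothetical VPN-to-PFN entries:

```python
PAGE_SIZE = 4096            # 4 KiB pages
OFFSET_BITS = 12            # log2(PAGE_SIZE)

page_table = {0x00400: 0x1A2B3, 0x00401: 0x0F00D}   # VPN -> PFN (hypothetical)
tlb = {}                                             # VPN -> PFN cache

def translate(vaddr):
    vpn = vaddr >> OFFSET_BITS          # virtual page number
    offset = vaddr & (PAGE_SIZE - 1)    # offsets map through unchanged
    if vpn in tlb:
        pfn = tlb[vpn]                  # TLB hit: no page-table walk
    else:
        pfn = page_table[vpn]           # TLB miss: walk the page table
        tlb[vpn] = pfn                  # cache the translation
    return (pfn << OFFSET_BITS) | offset

print(hex(translate(0x00400123)))       # 0x1a2b3123
```

Note how only the page number is translated; the low 12 bits pass straight through, as the text above describes.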
Software
Freeing Memory
When the available memory on the system becomes low, there are various methods that the kernel can use to free up memory, adding it to the free list of pages.

- Free list: A list of pages that are unused (also called idle memory) and available for immediate allocation. This is usually implemented as multiple free page lists, one for each locality group (NUMA).
- Page cache: The file system cache. A tunable parameter called swappiness sets the degree to which the system should favor freeing memory from the page cache instead of swapping.
- Swapping: This is paging by the page-out daemon, kswapd, which finds not recently used pages to add to the free list, including application memory. These are paged out, which may involve writing to either a file system-based swap file or a swap device. Naturally, this is available only if a swap file or device has been configured.
- Reaping: When a low-memory threshold is crossed, kernel modules and the kernel slab allocator can be instructed to immediately free any memory that can easily be freed. This is also known as shrinking.
- OOM killer: The out-of-memory killer will free memory by finding and killing a sacrificial process, found using select_bad_process() and then killed by calling oom_kill_process(). This may be logged in the system log (/var/log/messages) as an “Out of memory: Kill process” message.
Free List(s)

Reaping
Reaping mostly involves freeing memory from the kernel slab allocator caches. These caches contain unused memory in slab-size chunks, ready for reuse. Reaping returns this memory to the system for page allocations.
Page Scanning
Freeing memory by paging is managed by the kernel page-out daemon. When available main memory in the free list drops below a threshold, the page-out daemon begins page scanning. Page scanning occurs only when needed. A normally balanced system may not page scan very often and may do so only in short bursts.
kswapd scans the inactive list first, and then the active list, if needed.

Process Virtual Address Space
Managed by both hardware and software, the process virtual address space is a range of virtual pages that are mapped to physical pages as needed. The addresses are split into areas called segments for storing the thread stacks, process executable, libraries, and heap.

- Executable text: Contains the executable CPU instructions for the process. This is mapped from the text segment of the binary program on the file system. It is read-only with the execute permission.
- Executable data: Contains initialized variables mapped from the data segment of the binary program. This has read/write permissions so that the variables can be modified while the program is running. It also has a private flag so that modifications are not flushed to disk.
- Heap: This is the working memory for the program and is anonymous memory (no file system location). It grows as needed and is allocated via malloc(3).
- Stack: Stacks of the running threads, mapped read/write.
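The heap's anonymous memory can be demonstrated with Python's mmap module, which exposes anonymous mappings directly. This is a sketch; real allocators such as malloc(3) layer bookkeeping on top of brk/mmap rather than handing out raw mappings:

```python
import mmap

# Anonymous mapping: memory with no file backing, like the heap.
# Passing fileno=-1 requests anonymous memory.
anon = mmap.mmap(-1, 4096)

anon[:5] = b"hello"          # pages are faulted in on first touch
data = bytes(anon[:5])
print(data)                  # b'hello'
anon.close()
```

Unlike the executable text and data segments above, this memory has no file system path, which is exactly what "anonymous" means here.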
Allocators

Slab
The kernel slab allocator manages caches of objects of a specific size, allowing them to be recycled quickly without the overhead of page allocation. This is especially effective for kernel allocations, which are frequently for fixed-size structs.
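The recycling idea can be sketched as a toy cache of fixed-size objects. This is illustrative only; the real slab allocator manages pages, per-CPU structures, and cache coloring:

```python
class SlabCache:
    """Toy slab cache: recycles fixed-size objects instead of
    performing a fresh allocation for every request."""

    def __init__(self, obj_size):
        self.obj_size = obj_size
        self.free = []                   # freed objects, ready for reuse
        self.total_allocated = 0         # fresh allocations performed

    def alloc(self):
        if self.free:
            return self.free.pop()       # fast path: recycle an object
        self.total_allocated += 1
        return bytearray(self.obj_size)  # slow path: allocate a new one

    def free_obj(self, obj):
        self.free.append(obj)            # keep for reuse, don't release

cache = SlabCache(256)
a = cache.alloc()
cache.free_obj(a)
b = cache.alloc()                        # recycled: same object as a
print(a is b, cache.total_allocated)     # True 1
```

The second alloc() reuses the freed object, so only one real allocation ever happened, which is the point of slab caching for fixed-size kernel structs.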
Slub
The Linux kernel SLUB allocator is based on the slab allocator and is designed to address various concerns, especially regarding the complexity of the slab allocator. Improvements include the removal of object queues and per-CPU caches, leaving NUMA optimization to the page allocator.
glibc
The glibc allocator's behavior depends on the allocation request size. Small allocations are served from bins of memory, containing units of a similar size, which can be coalesced using a buddy-like algorithm. Larger allocations can use a tree lookup to find space efficiently. Very large allocations switch to using mmap. The net result is a high-performing allocator that benefits from multiple allocation policies.
Memory concepts
Following are some commonly used memory-related terms:
- Main memory: Also referred to as physical memory, this describes the fast data storage area of a computer, commonly provided as DRAM.
- Virtual memory: An abstraction of main memory that is (almost) infinite and non-contended. Virtual memory is not real memory.
- Resident memory: Memory that currently resides in main memory.
- Anonymous memory: Memory with no file system location or path name. It includes the working data of a process address space, called the heap.
- Address space: A memory context. There are virtual address spaces for each process, and for the kernel.
- Segment: An area of virtual memory flagged for a particular purpose, such as for storing executable or writeable pages.
- Instruction text: Refers to CPU instructions in memory, usually in a segment.
- OOM: Out of memory, when the kernel detects low available memory.
- Page: A unit of memory, as used by the OS and CPUs. Historically it is either 4 or 8 Kbytes. Modern processors have multiple page size support for larger sizes.
- Page fault: An invalid memory access. These are normal occurrences when using on-demand virtual memory.
- Paging: The transfer of pages between main memory and the storage devices.
- Swapping: Linux uses the term swapping to refer to anonymous paging to the swap device (the transfer of swap pages). In Unix and other operating systems, swapping is the transfer of entire processes between main memory and the swap devices. This book uses the Linux version of the term.
- Swap: An on-disk area for paged anonymous data. It may be an area on a storage device, also called a physical swap device, or a file system file, called a swap file. Some tools use the term swap to refer to virtual memory (which is confusing and incorrect).
Virtual Memory
Virtual memory is an abstraction that provides each process and the kernel with its own large, linear, and private address space. It simplifies software development, leaving physical memory placement for the operating system to manage. It also supports multitasking (virtual address spaces are separated by design) and oversubscription (in-use memory can extend beyond main memory).

Paging
Paging is the movement of pages in and out of main memory, which are referred to as page-ins and page-outs, respectively.
File System Paging: File system paging is caused by the reading and writing of pages in memory-mapped files. This is normal behavior for applications that use file memory mappings (mmap(2)) and on file systems that use the page cache.
Anonymous Paging (Swapping): Anonymous paging involves data that is private to processes: the process heap and stacks. It is termed anonymous because it has no named location in the operating system (i.e., no file system path name). Anonymous page-outs require moving the data to the physical swap devices or swap files. Linux uses the term swapping to refer to this type of paging.
Demand Paging
Operating systems that support demand paging (most do) map pages of virtual memory to physical memory on demand. This defers the CPU overhead of creating the mappings until they are actually needed and accessed, instead of at the time a range of memory is first allocated.

If the mapping can be satisfied from another page in memory, it is called a minor fault. Page faults that require storage device access, such as accessing an uncached memory-mapped file, are called major faults.
States of a page in virtual memory:
A. Unallocated
B. Allocated, but unmapped (unpopulated, and not yet faulted)
C. Allocated, and mapped to main memory (RAM)
D. Allocated, and mapped to the physical swap device (disk)
- Resident set size (RSS): The size of allocated main memory pages (C)
- Virtual memory size: The size of all allocated areas (B + C + D)
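The RSS and virtual size definitions above can be checked with a toy page-state list (a hypothetical process, 4 KiB pages, one letter per page):

```python
# Page states: B = allocated but unmapped, C = mapped to RAM, D = on swap
pages = ["C", "C", "B", "D", "C", "B"]
PAGE_SIZE = 4096

rss = pages.count("C") * PAGE_SIZE                       # resident pages only
virt = sum(pages.count(s) for s in "BCD") * PAGE_SIZE    # all allocated pages

print(f"RSS:  {rss} bytes")    # 12288 (3 resident pages)
print(f"Virt: {virt} bytes")   # 24576 (all 6 allocated pages)
```

This is why a process's virtual size can greatly exceed its RSS: allocated-but-untouched (B) and swapped-out (D) pages count toward one but not the other.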
Overcommit
Linux supports the notion of overcommit, which allows more memory to be allocated than the system can possibly store—more than physical memory and swap devices combined. It relies on demand paging and the tendency of applications to not use much of the memory they have allocated.
Process Swapping
Process swapping is the movement of entire processes between main memory and the physical swap device or swap file.
File System Cache Usage
It is normal for memory usage to grow after system boot as the operating system uses available memory to cache the file system, improving performance. The principle is: If there is spare main memory, use it for something useful.
Utilization and Saturation
Main memory utilization can be calculated as used memory versus total memory. Memory used by the file system cache can be treated as unused, as it is available for reuse by applications. If demands for memory exceed the amount of main memory, main memory becomes saturated.
Allocators
While virtual memory handles multitasking of physical memory, the actual allocation and placement within a virtual address space are often handled by allocators.
Shared Memory
Memory can be shared between processes. This is commonly used for system libraries to save memory by sharing one copy of their read-only instruction text with all processes that use it.
Proportional set size (PSS)
Private (non-shared) memory plus shared memory divided by the number of processes sharing it.
Working Set Size
Working set size (WSS) is the amount of main memory a process frequently uses to perform work.
Word Size
Processors may support multiple word sizes, such as 32-bit and 64-bit, allowing software for either to run. As the address space size is bounded by the addressable range from the word size, applications requiring more than 4 Gbytes of memory are too large for a 32-bit address space and need to be compiled for 64 bits or higher.
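The 4 Gbyte limit follows directly from the addressable range of the word size:

```python
# Addressable range is bounded by the word size:
addressable_32 = 2 ** 32     # bytes addressable with 32-bit pointers
addressable_64 = 2 ** 64     # bytes addressable with 64-bit pointers

print(addressable_32 // (1024 ** 3), "GiB")   # 4 GiB
print(addressable_64 // (1024 ** 6), "EiB")   # 16 EiB
```

In practice a 32-bit process gets even less than 4 GiB, since the kernel reserves part of the address space for itself.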
CPU Architecture

The control unit is the heart of the CPU, performing instruction fetch, decoding, managing execution, and storing results.
- P-cache: Prefetch cache (per CPU core)
- W-cache: Write cache (per CPU core)
- Clock: Signal generator for the CPU clock (or provided externally)
- Timestamp counter: For high-resolution time, incremented by the clock
- Microcode ROM: Quickly converts instructions to circuit signals
- Temperature sensors: For thermal monitoring
- Network interfaces: If present on-chip (for high performance)
CPU Caches

They include:
- Level 1 instruction cache (I$)
- Level 1 data cache (D$)
- Translation lookaside buffer (TLB)
- Level 2 cache (E$)
- Level 3 cache (optional)
MMU

Scheduler

CPU concepts
- Clock rate: The clock is a digital signal that drives all processor logic. Each CPU instruction may take one or more cycles of the clock (called CPU cycles) to execute. CPUs execute at a particular clock rate; for example, a 4 GHz CPU performs 4 billion clock cycles per second.
- Instructions: CPUs execute instructions chosen from their instruction set. An instruction includes the following steps, each processed by a component of the CPU called a functional unit: instruction fetch, instruction decode, execute, memory access, and register write-back. Memory access is the slowest step.
- Instruction pipeline: A CPU architecture that can execute multiple instructions in parallel by executing different components of different instructions at the same time.
- Branch prediction: Processors predict the outcome of conditional branches so that the pipeline can keep executing speculatively instead of stalling. Modern processors can also perform out-of-order execution, where later instructions are completed while earlier instructions are stalled, improving instruction throughput.
- Instruction width: Multiple functional units of the same type can be included, so that even more instructions can make forward progress with each clock cycle. This CPU architecture is called superscalar and is typically used with pipelining to achieve a high instruction throughput.
- Instruction size: x86, which is classified as a complex instruction set computer (CISC), allows up to 15-byte instructions. ARM, which is a reduced instruction set computer (RISC), has 4-byte instructions with 4-byte alignment.
- SMT: Simultaneous multithreading makes use of a superscalar architecture and hardware multithreading support (by the processor) to improve parallelism. It allows a CPU core to run more than one thread, effectively scheduling between them during instructions.
- IPC, CPI: Instructions per cycle (IPC) is an important high-level metric for describing how a CPU is spending its clock cycles and for understanding the nature of CPU utilization. This metric may also be expressed as cycles per instruction (CPI), the inverse of IPC.
- Utilization: CPU utilization is measured by the time a CPU instance is busy performing work during an interval, expressed as a percentage. It can be measured as the time a CPU is not running the kernel idle thread but is instead running user-level application threads or other kernel threads, or processing interrupts.
- User time/kernel time: The CPU time spent executing user-level software is called user time, and kernel-level software is kernel time. Kernel time includes time during system calls, kernel threads, and interrupts. When measured across the entire system, the user time/kernel time ratio indicates the type of workload performed.
- Saturation: A CPU at 100% utilization is saturated, and threads will encounter scheduler latency as they wait to run on-CPU, decreasing overall performance. This latency is the time spent waiting on the CPU run queue or other structure used to manage threads.
- Preemption: Allows a higher-priority thread to preempt the currently running thread and begin its own execution instead. This eliminates the run-queue latency for higher-priority work, improving its performance.
- Priority inversion: Occurs when a lower-priority thread holds a resource and blocks a higher-priority thread from running. This reduces the performance of the higher-priority work, as it is blocked waiting.
- Multiprocess, multithreading: Most processors provide multiple CPUs of some form. For an application to make use of them, it needs separate threads of execution so that it can run in parallel.
- Word size: Processors are designed around a maximum word size, 32-bit or 64-bit, which is the integer size and register size.
- Compiler optimization: Compilers are frequently updated to take advantage of the latest CPU instruction sets and to implement other optimizations. Sometimes application performance can be significantly improved simply by using a newer compiler.
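The IPC/CPI relationship is simple arithmetic over counter values such as those printed by perf stat (the counts below are made up for illustration):

```python
def ipc_cpi(instructions, cycles):
    """IPC and CPI from raw counter values, e.g., from perf stat."""
    ipc = instructions / cycles
    return ipc, 1 / ipc

# Hypothetical counter values for some interval:
ipc, cpi = ipc_cpi(instructions=2_000_000_000, cycles=1_000_000_000)
print(f"IPC: {ipc:.2f}, CPI: {cpi:.2f}")   # IPC: 2.00, CPI: 0.50
```

A low IPC (well below 1 on modern superscalar CPUs) usually means the CPU is stalled, often on memory access, rather than doing useful work every cycle.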
Thread/Process states
- User: On-CPU in user mode
- Kernel: On-CPU in kernel mode
- Runnable: Off-CPU, waiting for a turn on-CPU
- Swapping (anonymous paging): Runnable, but blocked for anonymous page-ins
- Disk I/O: Waiting for block device I/O: reads/writes, data/text page-ins
- Net I/O: Waiting for network device I/O: socket reads/writes
- Sleeping: A voluntary sleep
- Lock: Waiting to acquire a synchronization lock (waiting on someone else)
- Idle: Waiting for work

Observability tools
Noting down some helpful sources 🙂


| Package | Provides |
|---|---|
| procps | ps(1), vmstat(8), uptime(1), top(1) |
| util-linux | dmesg(1), lsblk(1), lscpu(1) |
| sysstat | iostat(1), mpstat(1), pidstat(1), sar(1) |
| iproute2 | ip(8), ss(8), nstat(8), tc(8) |
| numactl | numastat(8) |
| linux-tools-common linux-tools-$(uname -r) | perf(1), turbostat(8) |
| bcc-tools (aka bpfcc-tools) | opensnoop(8), execsnoop(8), runqlat(8), runqlen(8), softirqs(8), hardirqs(8), ext4slower(8), ext4dist(8), biotop(8), biosnoop(8), biolatency(8), tcptop(8), tcplife(8), trace(8), argdist(8), funccount(8), stackcount(8), profile(8), and many more |
| bpftrace | bpftrace, basic versions of opensnoop(8), execsnoop(8), runqlat(8), runqlen(8), biosnoop(8), biolatency(8), and more |
| perf-tools-unstable | Ftrace versions of opensnoop(8), execsnoop(8), iolatency(8), iosnoop(8), bitesize(8), funccount(8), kprobe(8) |
| trace-cmd | trace-cmd(1) |
| nicstat | nicstat(1) |
| ethtool | ethtool(8) |
| tiptop | tiptop(1) |
| msr-tools | rdmsr(8), wrmsr(8) |
| github.com/brendangregg/msr-cloud-tools | showboost(8), cpuhot(8), cputemp(8) |
| github.com/brendangregg/pmc-cloud-tools | pmcarch(8), cpucache(8), icache(8), tlbstat(8), resstalls(8) |
- vmstat(8): Virtual and physical memory statistics, system-wide
- mpstat(1): Per-CPU usage
- iostat(1): Per-disk I/O usage, reported from the block device interface
- nstat(8): TCP/IP stack statistics
- sar(1): Various statistics; can also archive them for historical reporting
- ps(1): Shows process status and various process statistics, including memory and CPU usage.
- top(1): Shows top processes, sorted by CPU usage or another statistic.
- pmap(1): Lists process memory segments with usage statistics.
- perf(1): The standard Linux profiler, which includes profiling subcommands.
- profile(8): A BPF-based CPU profiler from the BCC repository (covered in Chapter 15, BPF) that frequency counts stack traces in kernel context.
- Intel VTune Amplifier XE: Linux and Windows profiling, with a graphical interface including source browsing.
- gprof(1): The GNU profiling tool, which analyzes profiling information added by compilers (e.g., gcc -pg).
- cachegrind: A tool from the Valgrind toolkit that can profile hardware cache usage (and more) and visualize profiles using kcachegrind.
- Java Flight Recorder (JFR): Programming languages often have their own special-purpose profilers that can inspect language context. For example, JFR for Java.
- tcpdump(8): Network packet tracing (uses libpcap)
- biosnoop(8): Block I/O tracing (uses BCC or bpftrace)
- execsnoop(8): New processes tracing (uses BCC or bpftrace)
- perf(1): The standard Linux profiler, can also trace events
- perf trace: A special perf subcommand that traces system calls system-wide
- Ftrace: The Linux built-in tracer
- BCC: A BPF-based tracing library and toolkit
- bpftrace: A BPF-based tracer (bpftrace(8)) and toolkit
- strace(1): System call tracing
- gdb(1): A source-level debugger
- perf stat: Performance counter statistics

Latency Numbers for various components in a system
Units of time
| Unit | Abbreviation | In seconds |
|---|---|---|
| Minute | m | 60 |
| Second | s | 1 |
| Millisecond | ms | 0.001 or 1 × 10⁻³ |
| Microsecond | μs | 1 × 10⁻⁶ |
| Nanosecond | ns | 1 × 10⁻⁹ |
| Picosecond | ps | 1 × 10⁻¹² |
Time scale of system latencies
Numbers are from runs on an Intel Cascade machine with 4 vCPUs, 16 GB of memory, and SSD persistent disk.
| Event | Latency |
|---|---|
| L1 cache reference | 1ns |
| Branch mispredict | 3ns |
| L2 cache reference | 4ns |
| Mutex lock/unlock | 17ns |
| Main memory reference | 100ns |
| SSD Random read | 16μs |
| Read 1 MiB from memory | 3μs |
| Read 1MiB sequentially from SSD | 49μs |
| Read 1MiB sequentially from HDD | 825μs |
| System Call | 500ns |
| Context Switch | 10μs |
| Sequential Memory R/W 1MiB | 100μs |
| Random HDD Seek 1MiB | 15ms |
| Random SSD Seek 1MiB | 15ms |
| Sequential SSD write 1MiB (with fsync) | 100ms |
| Sequential SSD write 1MiB (without fsync) | 1ms |
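To build intuition, a few of the table's numbers can be rescaled so that an L1 cache reference takes one second (values copied from the table above; event names abbreviated):

```python
# Latencies in nanoseconds, from the table above:
latencies_ns = {
    "L1 cache reference": 1,
    "Main memory reference": 100,
    "SSD random read": 16_000,
    "Random HDD seek (1 MiB)": 15_000_000,
}

# Rescale so an L1 reference takes 1 "human" second:
l1 = latencies_ns["L1 cache reference"]
for event, ns in latencies_ns.items():
    print(f"{event:26s} -> {ns / l1:>12,.0f} s (scaled)")
```

On this scale a main memory reference takes 100 seconds and an HDD seek takes about 15 million seconds (roughly six months), which is why avoiding disk I/O dominates so many performance wins.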