Playing with Swap Monitoring and Increasing Swap Space Using ZFS Volumes

In Oracle Solaris 11.1

by Alexandre Borges, Oracle ACE

Part 2 of a series that describes the key features of ZFS in Oracle Solaris 11.1 and provides step-by-step procedures explaining how to use them. This article describes how to monitor swap space and how to increase or decrease the swap space using ZFS volumes.

 

Published April 2014


  • Part 1 - Using COMSTAR and ZFS to Configure a Virtualized Storage Environment
  • Part 2 - Playing with Swap Monitoring and Increasing Swap Space Using ZFS Volumes
  • Part 3 - Playing with ZFS Shadow Migration
  • Part 4 - Delegating a ZFS Dataset to a Non-Global Zone
  • Part 5 - Playing with ZFS Encryption
  • Part 6 - Playing with ZFS Snapshots


During installation, Oracle Solaris 11 usually makes the swap space around one quarter of the RAM size. System and, particularly, application requirements can vary for each environment, so it's often appropriate to alter the swap space size by adding or removing space.

 

The swap space is an area of disk dedicated to holding anonymous memory pages that are paged out, and processes that are swapped out, because of a low amount of free RAM.

Monitoring Swap Space

There are several ways to see the current size of the swap space for your system, for example:

root@solaris11-1:~# swap -l

swapfile dev swaplo blocks free

/dev/zvol/dsk/rpool/swap 285,2 8 2097144 2097144

 

where:

  • swapfile indicates the swap space comes from a ZFS volume at /dev/zvol/dsk/rpool/swap.
  • dev shows the major and minor device numbers; in this case, the major number (285) confirms that the swap object is based on a ZFS volume:

root@solaris11-1:~# more /etc/name_to_major | grep 285

zfs 285

  • swaplo is the offset, in 512-byte blocks, at which the usable swap area begins. Here it is 8 blocks, which matches the memory page size (8 sectors x 512 bytes = 4K). The page size can be obtained by executing the following:

root@solaris11-1:~# pagesize

4096

 

A value of 4K is typically found on Intel machines. However, with Oracle Solaris 11 on SPARC machines, the page size can vary from 8K to 2 GB (this upper limit also applies for Intel processors). This upper limit is mainly used as the page size for the System Global Area (SGA), a dedicated shared-memory area for an instance of Oracle Database 11g. Additionally, it is worth noting that 2 GB pages are supported with Oracle Solaris 10 8/11 or later Oracle Solaris releases and Oracle's SPARC T4 processor, but this page size isn't enabled by default. If it's suitable for some applications, we have to enable it by inserting set max_uheap_lpsize=0x80000000 in the /etc/system file and then rebooting the system.

Furthermore, Oracle Solaris 11 supports multiple page sizes, which can be set manually according to the application profile or automatically through a new built-in memory prediction technology that is able to analyze the demands of applications in order to assign a suitable value.

The supported page sizes can be shown by running the following command (in this case, on an Intel processor):

root@solaris11-1:~# pagesize -a

4096

2097152

 

The example above shows us that two page sizes are supported: 4K and 2 MB (2097152 bytes). The main reason for using larger memory pages is to improve Memory Management Unit (MMU) performance by reducing Translation Lookaside Buffer (TLB) misses. The number of TLB misses can be verified by using the trapstat command (although trapstat is not usually implemented on Intel platforms).
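The byte counts printed by pagesize -a are easier to compare when converted to units. The following is a minimal sketch, not part of Oracle Solaris: human_pagesize is a hypothetical helper name, and it reads the pagesize -a output on standard input. The sample input is the output shown above.

```shell
#!/bin/sh
# Sketch: convert the byte counts printed by `pagesize -a` into K/M/G units.
# human_pagesize is a hypothetical helper, not a Solaris command.
human_pagesize() {
    awk '/^[0-9]+$/ {
        b = $1
        # Print the original value as a string and the quotient as an integer
        if (b >= 1073741824)   printf "%s bytes = %dG\n", $1, b / 1073741824
        else if (b >= 1048576) printf "%s bytes = %dM\n", $1, b / 1048576
        else                   printf "%s bytes = %dK\n", $1, b / 1024
    }'
}

# Sample input copied from the pagesize -a output in this article.
human_pagesize <<'EOF'
4096
2097152
EOF
```

On a live system, the same function could be fed directly: pagesize -a | human_pagesize.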

  • blocks is the total size of the swap space (2097144 x 512 bytes = 1 GB).
  • free represents the free swap space (1 GB).
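Because blocks and free are expressed in 512-byte units, a small filter can make the swap -l output easier to read. This is only a sketch under stated assumptions: swap_l_to_mb is a hypothetical helper name, it expects the five-column layout shown above, and it truncates to whole megabytes.

```shell
#!/bin/sh
# Sketch: summarize `swap -l` output in megabytes.
# swap_l_to_mb is a hypothetical helper, not a Solaris command.
# Assumes the five-column layout: swapfile dev swaplo blocks free.
swap_l_to_mb() {
    awk '$1 != "swapfile" && NF >= 5 {
        # $4 = blocks, $5 = free; 2048 blocks of 512 bytes make 1 MB
        printf "%s total=%dMB free=%dMB\n", $1, $4 / 2048, $5 / 2048
    }'
}

# Sample input copied from the swap -l output in this article.
swap_l_to_mb <<'EOF'
swapfile dev swaplo blocks free
/dev/zvol/dsk/rpool/swap 285,2 8 2097144 2097144
EOF
```

The result is slightly under 1024 MB because the first 8 blocks (swaplo) of the 1 GB volume are not counted. On a live system: swap -l | swap_l_to_mb.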

Another very good way to monitor the swap space is the following command:

root@solaris11-1:~# swap -s

total: 680180k bytes allocated + 266516k reserved = 946696k used, 2321756k available

 

From this command output, we can see the following:

  • 680180K bytes allocated indicates the amount of swap space that already has been used (that is, touched previously but not necessarily still being used at this time) and continues to be available and reserved for use. A rough comparison would be a high-watermark threshold.
  • 266516k reserved indicates swap space that has not been allocated yet, but has been claimed for possible future use. Remember that swap space is reserved when the virtual memory (heap segment or anonymous memory) for a process is created, and the reserved swap space is then allocated when the process is run. Anonymous memory is made of pages that don't have a counterpart in any file system and that are migrated to the swap space due to a shortage of physical memory (RAM)—probably because the sum of the stack, the shared memory, and the process heap (from the malloc function, for example) is larger than the amount of available memory.
  • 946696k used indicates the total amount of swap space that is either allocated or reserved.
  • 2321756k available indicates the swap space available for future allocation.

Additionally, we must remember that some swap space is reserved when the virtual memory for a process is created, but only part of this reserved space is really associated with the address space of the process. Otherwise, the swap -s output can be misinterpreted: it is telling us that 946696k is, in the end, reserved (in order to allocate space, the space must have been reserved previously) and that 680180k of that swap space has been touched.

Another very important point is that the swap -l command reports the physical swap space (on disk) while swap -s reports virtual swap space, which is the sum of the physical swap space and the physical memory. Therefore, the available swap space from swap -s is the sum of free physical swap space plus free physical memory space. That's the reason that the swap -s command is not recommended for evaluating the physical swap space; instead, swap -l should be used for this goal.
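To make the relationship between the swap -s fields explicit, the single output line can be split into its four numbers and checked. The sketch below assumes the exact wording of the swap -s line shown above; swap_s_summary is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: split the `swap -s` line into its fields and verify that
# allocated + reserved = used. swap_s_summary is a hypothetical helper.
# Assumes the field order shown in this article:
#   total: <A>k bytes allocated + <R>k reserved = <U>k used, <V>k available
swap_s_summary() {
    awk '$1 == "total:" {
        # Adding 0 makes awk keep only the leading number ("680180k" -> 680180)
        alloc = $2 + 0; resv = $6 + 0; used = $9 + 0; avail = $11 + 0
        printf "allocated=%dK reserved=%dK used=%dK available=%dK\n", alloc, resv, used, avail
        if (alloc + resv == used) print "check: allocated + reserved = used"
        else                      print "check: mismatch"
    }'
}

# Sample input copied from the swap -s output in this article.
swap_s_summary <<'EOF'
total: 680180k bytes allocated + 266516k reserved = 946696k used, 2321756k available
EOF
```

On a live system: swap -s | swap_s_summary. Remember that the available figure here is virtual swap, so it includes free physical memory.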

If we want to try another way to get the swap information, we can use the echo ::swapinfo | mdb -k command, for example:

root@solaris11-1:~# echo ::swapinfo | mdb -k

ADDR VNODE PAGES FREE NAME

ffffc10007798260 ffffc10007a7db40 262143 262143 /dev/zvol/dsk/rpool/swap

 

It's simple to confirm the size: 262143 pages x 4K = 1048572K, which equals the 2097144 blocks of 512 bytes reported by swap -l.

As mentioned earlier, it's good to remember that anonymous memory doesn't have a counterpart in the file system. Usually, anonymous pages are the private data of a process, which includes the process heap (anonymous data) and the thread structure (the stack area, for example).

Swapping is an operation in which the swapper process (sched) swaps out processes that have been sleeping for more than 20 seconds (first their thread structures and then the stack and heap data, that is, their anonymous pages). It shouldn't be confused with paging, which moves individual pages (normally 4 KB or 8 KB each) from memory to disk and usually results in very efficient memory management. However, one kind of paging has a severe effect on system performance: anonymous paging (mainly anonymous page-in), because it increases application latency while data is read back from disk.

Also, swapping shouldn't be confused with reaping, which is a technique to free memory from the kernel slab allocator caches and which is done by the function kmem_reap().

How can you verify whether a system is using anonymous pages? In the following output, the columns that are interesting are apo (anonymous page-out) and api (anonymous page-in), which both ideally should be equal to zero. The latter is responsible for an increase in application latency.

root@solaris11-1:~# vmstat -p 1

memory page executable anonymous filesystem

swap free re mf fr de sr epi epo epf api apo apf fpi fpo fpf

2973844 2609240 3 18 0 0 3 0 0 0 0 0 0 0 0 0

2895156 2544236 26 47 0 0 0 0 0 0 0 0 0 0 0 0

2895156 2544092 0 0 0 0 0 0 0 0 0 0 0 0 0 0
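Watching sixteen columns scroll by is error-prone, so the api and apo columns can be filtered automatically. This is a sketch only: anon_paging_check is a hypothetical helper name, and it assumes the 16-column vmstat -p layout shown above, where api is column 11 and apo is column 12.

```shell
#!/bin/sh
# Sketch: report vmstat -p samples with anonymous page-in/page-out activity.
# anon_paging_check is a hypothetical helper, not a Solaris command.
# Assumes the 16-column layout shown above (api = column 11, apo = column 12).
anon_paging_check() {
    awk 'NF == 16 && $1 ~ /^[0-9]+$/ && ($11 > 0 || $12 > 0) {
        printf "anonymous paging: api=%d apo=%d\n", $11, $12
        found = 1
    }
    END { if (!found) print "no anonymous paging observed" }'
}

# Sample input copied from the vmstat -p output in this article.
anon_paging_check <<'EOF'
memory page executable anonymous filesystem
swap free re mf fr de sr epi epo epf api apo apf fpi fpo fpf
2973844 2609240 3 18 0 0 3 0 0 0 0 0 0 0 0 0
2895156 2544236 26 47 0 0 0 0 0 0 0 0 0 0 0 0
2895156 2544092 0 0 0 0 0 0 0 0 0 0 0 0 0 0
EOF
```

On a live system, a bounded run such as vmstat -p 1 10 | anon_paging_check would examine ten samples. Keep in mind that the first vmstat sample reports averages since boot.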

 

To find out what process is doing anonymous page-in, use the following command:

root@solaris11-1:~# dtrace -n 'vminfo:::anonpgin { @[pid, execname] = count(); }'

 

Swapping is the last resort when paging is not able to free enough memory to meet the demands of an application, a condition that can be indicated by a high level of page scanning (searching for free memory pages).

Usually, when the amount of free memory goes below the amount specified by the desfree kernel parameter and then below the amount specified by the minfree kernel parameter, page scanning becomes more intensive. If the amount of free memory stays below the desfree value for 30 seconds or more, the system starts swapping.

The worst form of swapping is hard swapping, which is when some inactive kernel modules are unloaded and moved to the swap space.

We can monitor whether the system is hard swapping by using the following command:

root@solaris11-1:~# echo "hardswap/D" | mdb -k

hardswap:

hardswap: 0

 

Hard swapping is rare because all of the following conditions must be met:

  • The amount of free memory needs to be below desfree for more than 30 seconds, AND
  • There must constantly be two pending processes on the run queue (the r column in the vmstat output below), AND
  • freemem must be below minfree OR the number of page-ins plus page-outs must be greater than maxpgio, where maxpgio is the number of page-out requests that can be queued by the paging system.

In other words, maxpgio limits how many memory pages can be sent to swap at once, in order to avoid causing a disk I/O bottleneck. Therefore, maxpgio depends on the number of swap devices that have their own disk controller. Its default value is 40 pages.
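The three hard-swapping conditions listed above combine as A AND B AND (C OR D), which can be written down as a small predicate. This is purely an illustration of the logic as described in this article, not the kernel's implementation: the function name and its argument order are hypothetical, and the sample values are chosen only for demonstration.

```shell
#!/bin/sh
# Sketch: the hard-swapping conditions described above, as a predicate.
# Hypothetical helper; it illustrates the logic only and is not kernel code.
# usage: hard_swap_conditions_met freemem desfree minfree secs_below_desfree runq pgio maxpgio
hard_swap_conditions_met() {
    freemem=$1 desfree=$2 minfree=$3 secs=$4 runq=$5 pgio=$6 maxpgio=$7
    # Condition 1: free memory below desfree for more than 30 seconds
    [ "$freemem" -lt "$desfree" ] && [ "$secs" -gt 30 ] || return 1
    # Condition 2: at least two processes pending on the run queue
    [ "$runq" -ge 2 ] || return 1
    # Condition 3: freemem below minfree OR page-ins + page-outs above maxpgio
    [ "$freemem" -lt "$minfree" ] || [ "$pgio" -gt "$maxpgio" ]
}

# Illustrative values for a healthy system (plenty of free pages, idle run queue).
if hard_swap_conditions_met 243665 8159 4079 0 0 0 40; then
    echo "hard swap conditions met"
else
    echo "hard swap conditions not met"
fi
```

Writing the conditions this way shows why hard swapping is rare: a failure of any one of the first two tests is enough to rule it out.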

More often, we might see a lighter kind of swapping called soft swapping, which happens when the amount of free memory is below the desfree value.

We can check for soft swapping by executing the following command:

root@solaris11-1:~# echo "softswap/D" | mdb -k

softswap:

softswap: 0

 

By way of introduction (more details would be beyond the scope of this article), the minfree value equals desfree/2, and the desfree value equals lotsfree/2. The following is the formula for calculating lotsfree:

lotsfree = (memory - kernel) / (64 * page size)

 

These values can be seen by running the following commands:

root@solaris11-1:~# prtconf | grep -i memory

Memory size: 4096 Megabytes

 

root@solaris11-1:~# echo lotsfree/D | mdb -k

lotsfree:

lotsfree: 16318

 

root@solaris11-1:~# echo desfree/E | mdb -k

desfree:

desfree: 8159

 

root@solaris11-1:~# echo minfree/D | mdb -k

minfree:

minfree: 4079

 

root@solaris11-1:~# bc

16318 * 4096 * 64

4277665792

root@solaris11-1:~#

 

The best method for getting the values of lotsfree, desfree, and minfree is executing the following command:

root@solaris11-1:~# kstat -n system_pages

module: unix instance: 0

name: system_pages class: pages

availrmem 409132

crtime 0

desfree 8159

desscan 25

econtig 4229439488

fastscan 522183

freemem 243665

kernelbase 0

lotsfree 16318

minfree 4079

nalloc 110633425

nalloc_calls 31285

nfree 107403292

nfree_calls 23611

nscan 0

pagesfree 243665

pageslocked 635234

pagestotal 1044366

physmem 1044366

pp_kernel 649290

slowscan 100

snaptime 26017.87927546
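The relationships lotsfree = pages/64, desfree = lotsfree/2, and minfree = desfree/2 can be checked against the kstat values above with integer arithmetic. This sketch assumes that the pagestotal value (1044366) is the page count to divide by 64, which matches the numbers shown here but is an assumption rather than a documented rule; paging_thresholds is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: derive lotsfree, desfree, and minfree (all in pages) from a page count.
# paging_thresholds is a hypothetical helper. Using pagestotal as the input is
# an assumption that happens to match the kstat output in this article.
paging_thresholds() {
    pagestotal=$1
    lotsfree=$((pagestotal / 64))   # 1/64 of the pages
    desfree=$((lotsfree / 2))       # half of lotsfree
    minfree=$((desfree / 2))        # half of desfree
    echo "lotsfree=$lotsfree desfree=$desfree minfree=$minfree"
}

# pagestotal taken from the kstat -n system_pages output above.
paging_thresholds 1044366
```

With integer division, the result reproduces the values reported by kstat and mdb: 16318, 8159, and 4079 pages.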

 

Furthermore, returning to the page scanning subject, there are different values for page scanning that happen at different times. For example, fastscan is the number of pages scanned per second when free memory is equal to zero, desscan is the scan rate goal during page scanning, and nscan is the number of pages scanned during the last page scan action. In this example, there is enough memory and there isn't any page scanning activity (nscan equals 0).

This same information from kstat can be collected by running the following commands:

root@solaris11-1:~# echo fastscan/E | mdb -k

fastscan:

fastscan: 522183

root@solaris11-1:~# echo slowscan/E | mdb -k

slowscan:

slowscan: 100

root@solaris11-1:~# echo desscan/E | mdb -k

desscan:

desscan: 25

root@solaris11-1:~# echo nscan/E | mdb -k

nscan:

nscan: 0

 

To monitor the swap space, we can check the past and the present (real time) swapping statistics by executing this command:

root@solaris11-1:~# vmstat 1

kthr memory page disk faults cpu

r b w swap free re mf pi po fr de sr s0 s2 s3 s4 in sy cs us sy id

0 0 0 2972960 2608516 3 18 0 0 0 0 3 0 0 0 0 659 480 723 1 4 95

0 0 0 2895104 2544208 26 49 0 0 0 0 0 0 0 0 0 660 648 694 1 4 95

0 0 0 2895104 2544056 0 2 0 0 0 0 0 0 0 0 0 690 1839 847 4 4 92

 

The important column for us is w, which shows the number of swapped-out threads. A nonzero value indicates memory pressure, probably caused by the amount of free memory dropping below minfree or desfree for more than 30 seconds, which causes idle processes to be swapped out to the swap space.

The following command shows the real-time swap status:

root@solaris11-1:~# vmstat -S 1

kthr memory page disk faults cpu

r b w swap free si so pi po fr de sr s0 s2 s3 s4 in sy cs us sy id

0 0 0 2972572 2608200 0 0 0 0 0 0 3 0 0 0 0 659 480 723 1 4 95

0 0 0 2895032 2544000 0 0 0 0 0 0 0 0 0 0 0 706 875 901 2 5 93

0 0 0 2895032 2544000 0 0 0 0 0 0 0 0 0 0 0 615 511 671 1 3 96

 

Columns so and si represent swapped-out pages and swapped-in pages, respectively, in real time. Again, ideally both should be zero for good performance.
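The w, si, and so columns can be checked together with a short filter. This is a sketch under stated assumptions: swap_activity is a hypothetical helper name, and it relies on the vmstat -S column order shown above (w is column 3, si is column 6, so is column 7).

```shell
#!/bin/sh
# Sketch: flag vmstat -S samples that show swapping activity.
# swap_activity is a hypothetical helper, not a Solaris command.
# Assumes the column order shown above: w = $3, si = $6, so = $7.
swap_activity() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 7 {
        if ($3 > 0 || $6 > 0 || $7 > 0) {
            printf "swapping: w=%d si=%d so=%d\n", $3, $6, $7
            found = 1
        }
    }
    END { if (!found) print "no swapping observed" }'
}

# Sample input copied from the vmstat -S output in this article.
swap_activity <<'EOF'
kthr memory page disk faults cpu
r b w swap free si so pi po fr de sr s0 s2 s3 s4 in sy cs us sy id
0 0 0 2972572 2608200 0 0 0 0 0 0 3 0 0 0 0 659 480 723 1 4 95
0 0 0 2895032 2544000 0 0 0 0 0 0 0 0 0 0 0 706 875 901 2 5 93
0 0 0 2895032 2544000 0 0 0 0 0 0 0 0 0 0 0 615 511 671 1 3 96
EOF
```

On a live system, a bounded run such as vmstat -S 1 10 | swap_activity would examine ten samples.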

Adding or Removing Swap Space Using a ZFS Volume

Now that we know how to monitor the swap space, it's time to learn how to add and remove disk space allocated to the swap area. The Oracle Solaris 11 host we are using (solaris11-1) has the following file system-related components:

root@solaris11-1:~# zfs list -r rpool

NAME USED AVAIL REFER MOUNTPOINT

rpool 28.5G 49.7G 4.91M /rpool

rpool/ROOT 25.4G 49.7G 31K legacy

rpool/ROOT/solaris 25.4G 49.7G 24.4G /

rpool/ROOT/solaris-backup-1 138K 49.7G 24.2G /

rpool/ROOT/solaris-backup-1/var 64K 49.7G 291M /var

rpool/ROOT/solaris/var 486M 49.7G 234M /var

rpool/VARSHARE 92K 49.7G 92K /var/share

rpool/dump 2.06G 49.8G 2.00G -

rpool/export 805K 49.7G 32K /export

rpool/export/home 773K 49.7G 32K /export/home

rpool/export/home/ale 741K 49.7G 741K /export/home/ale

rpool/swap 1.03G 49.7G 1.00G -

 

The last line indicates that the swap space is 1 GB and that it's a ZFS volume. This information can be verified by executing the following:

root@solaris11-1:~# ls -l /dev/zvol/rdsk/rpool/swap

lrwxrwxrwx 1 root root 0 Dec 2 06:31 /dev/zvol/rdsk/rpool/swap -> ../../../..//devices/pseudo/zfs@0:2,raw

 

Thus, it's feasible to change its size because the rpool has some free space and the swap volume belongs to the rpool storage pool:

root@solaris11-1:~# zfs get volsize rpool/swap

NAME PROPERTY VALUE SOURCE

rpool/swap volsize 1G local

 

root@solaris11-1:~# zfs set volsize=2G rpool/swap

root@solaris11-1:~# zfs get volsize rpool/swap

NAME PROPERTY VALUE SOURCE

rpool/swap volsize 2G local

 

root@solaris11-1:~# swap -l

swapfile dev swaplo blocks free

/dev/zvol/dsk/rpool/swap 285,2 8 2097144 2097144

/dev/zvol/dsk/rpool/swap 285,2 2097160 2097144 2097144

 

root@solaris11-1:~# swap -s

total: 451556k bytes allocated + 259888k reserved = 711444k used, 3886000k available

 

root@solaris11-1:~# zfs list -r rpool/swap

NAME USED AVAIL REFER MOUNTPOINT

rpool/swap 2.06G 48.7G 2.00G -

root@solaris11-1:~#

 

However, it is not always possible to change the properties of the swap space, because it could be busy. In that case, a second volume can be created in the rpool storage pool and added as swap space. Afterwards, a line is inserted at the end of /etc/vfstab so that the new volume is activated as swap automatically at boot:

root@solaris11-1:~# zfs create -V 2G rpool/newswap

root@solaris11-1:~# swap -a /dev/zvol/dsk/rpool/newswap

root@solaris11-1:~# swap -l

swapfile dev swaplo blocks free

/dev/zvol/dsk/rpool/swap 285,2 8 2097144 2097144

/dev/zvol/dsk/rpool/swap 285,2 2097160 2097144 2097144

/dev/zvol/dsk/rpool/newswap 285,4 8 4194296 4194296
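The blocks values reported by swap -l can be predicted from the volume size: the volume size in 512-byte blocks, minus the swaplo offset. The sketch below is only an arithmetic check; expected_swap_blocks is a hypothetical helper name, and the default offset of 8 blocks is taken from the output above.

```shell
#!/bin/sh
# Sketch: predict the "blocks" column of swap -l for a swap volume.
# expected_swap_blocks is a hypothetical helper, not a Solaris command.
# usage: expected_swap_blocks size_in_GB [swaplo]
expected_swap_blocks() {
    size_gb=$1
    swaplo=${2:-8}   # offset in 512-byte blocks; 8 in the output above
    echo $(( size_gb * 1024 * 1024 * 1024 / 512 - swaplo ))
}

expected_swap_blocks 1   # size of the original 1 GB rpool/swap volume
expected_swap_blocks 2   # size of the 2 GB rpool/newswap volume
```

These two calls yield 2097144 and 4194296, matching the first rpool/swap entry and the rpool/newswap entry in the swap -l output.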

 

root@solaris11-1:~# swap -s

total: 453668k bytes allocated + 260304k reserved = 713972k used, 5962264k available

 

root@solaris11-1:~# zfs list -r rpool

NAME USED AVAIL REFER MOUNTPOINT

rpool 31.6G 46.6G 4.91M /rpool

rpool/ROOT 25.4G 46.6G 31K legacy

rpool/ROOT/solaris 25.4G 46.6G 24.4G /

rpool/ROOT/solaris-backup-1 138K 46.6G 24.2G /

rpool/ROOT/solaris-backup-1/var 64K 46.6G 291M /var

rpool/ROOT/solaris/var 486M 46.6G 234M /var

rpool/VARSHARE 92K 46.6G 92K /var/share

rpool/dump 2.06G 46.7G 2.00G -

rpool/export 805K 46.6G 32K /export

rpool/export/home 773K 46.6G 32K /export/home

rpool/export/home/ale 741K 46.6G 741K /export/home/ale

rpool/newswap 2.06G 46.7G 2.00G -

rpool/swap 2.06G 46.7G 2.00G -

 

root@solaris11-1:~# more /etc/vfstab

#device device mount FS fsck mount mount

#to mount to fsck point type pass at boot options

#

/devices - /devices devfs - no -

/proc - /proc proc - no -

ctfs - /system/contract ctfs - no -

objfs - /system/object objfs - no -

sharefs - /etc/dfs/sharetab sharefs - no -

fd - /dev/fd fd - no -

swap - /tmp tmpfs - yes -

 

/dev/zvol/dsk/rpool/swap - - swap - no -

/dev/zvol/dsk/rpool/newswap - - swap - no -
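Editing /etc/vfstab by hand is easy to get wrong, so the new line can be generated and appended only when it is missing. The following is a cautious sketch: vfstab_swap_line and add_swap_entry are hypothetical helper names, and the demonstration works on a temporary file rather than the real /etc/vfstab.

```shell
#!/bin/sh
# Sketch: append a swap entry to a vfstab-style file only if it is missing.
# vfstab_swap_line and add_swap_entry are hypothetical helpers.

# Print a vfstab line for a swap device (seven tab-separated fields).
vfstab_swap_line() {
    printf '%s\t-\t-\tswap\t-\tno\t-\n' "$1"
}

# usage: add_swap_entry vfstab_file device
add_swap_entry() {
    file=$1 dev=$2
    # Exit status 0 from awk means a line for this device already exists
    if awk '$1 == d { found = 1 } END { exit !found }' d="$dev" "$file"; then
        return 0
    fi
    vfstab_swap_line "$dev" >> "$file"
}

# Demonstration on a temporary file, not on the real /etc/vfstab.
demo=$(mktemp)
vfstab_swap_line /dev/zvol/dsk/rpool/swap > "$demo"
add_swap_entry "$demo" /dev/zvol/dsk/rpool/newswap
add_swap_entry "$demo" /dev/zvol/dsk/rpool/newswap   # second call adds nothing
cat "$demo"
rm -f "$demo"
```

Before applying something like this to the real file, it would be prudent to keep a backup copy of /etc/vfstab.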

 

Obviously, the process of removing swap space is the reverse. For example, the following command is executed and then the last line in the /etc/vfstab file is deleted:

root@solaris11-1:~# swap -d /dev/zvol/dsk/rpool/newswap
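The matching /etc/vfstab cleanup can also be scripted instead of done in an editor. This is a minimal sketch: remove_swap_entry is a hypothetical helper name, it drops every line whose first field equals the given device, and the demonstration runs on a temporary file rather than the real /etc/vfstab.

```shell
#!/bin/sh
# Sketch: remove the entry for a swap device from a vfstab-style file.
# remove_swap_entry is a hypothetical helper, not a Solaris command.
# usage: remove_swap_entry vfstab_file device
remove_swap_entry() {
    file=$1 dev=$2
    tmp="$file.tmp.$$"
    # Keep every line whose first field is not the device, then write the
    # result back in place so the original file's permissions are retained.
    awk '$1 != d' d="$dev" "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demonstration on a temporary file, not on the real /etc/vfstab.
demo=$(mktemp)
cat > "$demo" <<'EOF'
/dev/zvol/dsk/rpool/swap - - swap - no -
/dev/zvol/dsk/rpool/newswap - - swap - no -
EOF
remove_swap_entry "$demo" /dev/zvol/dsk/rpool/newswap
cat "$demo"
rm -f "$demo"
```

After the demonstration call, only the rpool/swap line remains in the temporary file.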

 

About the Author

Alexandre Borges is an Oracle ACE who worked as an employee and contracted instructor at Sun Microsystems from 2001 to 2010 teaching Oracle Solaris, Oracle Solaris Cluster, Oracle Solaris security, Java EE, Sun hardware, and MySQL courses. Nowadays, he teaches classes for Symantec, Oracle partners, Hitachi, and EC-Council, and he teaches several very specialized classes about information security. In addition, he is a regular writer and columnist at Linux Magazine Brazil.