Reducing RAM usage in pkgin
Recently I’ve had a number of users complain about pkgin running out of memory when installing packages. This turned into a nice example of how to use DTrace to show memory allocations and help track down excessive use.
My test case was `pkgin -y install gcc47`. This is usually one of the first commands I run in a new SmartOS zone anyway, and as gcc47 happens to be one of the largest packages we ship it will help to exaggerate any memory allocation.
Trace heap allocations
As a first step I wanted to answer the question of how much memory was being allocated for pkgin. A simple and naive way to do this would be to run tools such as `ps(1)` or `prstat(1)` (the SmartOS equivalent of `top(1)`) whilst pkgin is running, and monitor the memory columns. This may give you a very rough idea of how much memory is being used, but it’s not very accurate and you may miss a large allocation just before the process exits.
Instead we can use DTrace to trace the `brk()` system calls and calculate the exact amount of memory that has been allocated. `brk()` is where libc memory allocation functions such as `malloc()` end up on SmartOS, so by tracing that single system call we can see exactly what has been allocated by the process.

Tracing `brk()` has the additional advantage of only showing heap growth. If we traced all the libc `*alloc()` calls, we would have to perform additional analysis to determine whether we actually allocated more memory or whether an existing allocation was reused. For more information about the different ways to trace memory allocations, see Brendan Gregg’s excellent Memory Flame Graphs page, on which many of the DTrace scripts in this post are based.
I used the following DTrace script to output 3 pieces of information over the lifetime of the target process:

- A quantized set of `brk()` allocation sizes.
- The total heap allocation.
- The number of `brk()` calls.
Comments are inline. `pid == $target` ensures we only log `brk()` calls made by the process we specify, as opposed to all `brk()` calls across the entire system, and `arg0` is the argument to the `brk()` system call.
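A minimal sketch of such a script, modelled on Brendan’s `brk()` tracing examples (the details may differ from the actual brkquantize.d used for this post):

```
#!/usr/sbin/dtrace -s

#pragma D option quiet

/*
 * Sketch only: quantize brk() heap growth for the target process.
 * The current break is pr_brkbase + pr_brksize, and arg0 is the new
 * break being requested, so the difference is the growth.
 */
syscall::brk:entry
/pid == $target &&
 arg0 > curpsinfo->pr_brkbase + curpsinfo->pr_brksize/
{
	this->growth = arg0 -
	    (curpsinfo->pr_brkbase + curpsinfo->pr_brksize);
	@sizes = quantize(this->growth);
	@total = sum(this->growth);
	@calls = count();
}

dtrace:::END
{
	printf("brk() growth per call (bytes):\n");
	printa(@sizes);
	printa("total heap growth: %@d bytes\n", @total);
	printa("brk() calls: %@d\n", @calls);
}
```

Running it with, for example, `dtrace -s brkquantize.d -c 'pkgin -y install gcc47'` sets `$target` to the pkgin process.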
Saving the script as `brkquantize.d` and running it gives us the following output:
Wow, that’s a lot of memory. 383MB has been allocated on the heap, with one of those allocations alone being between 128MB and 256MB. No wonder users are running out of memory!
This answers the question of how much memory is being allocated, but doesn’t answer the question of what is causing it. I have my suspicions at this point (the gcc47 package tarball is 250MB - is pkgin caching the entire thing?), but in order to prove my suspicion I want to produce a flame graph.
Memory flame graph
If you didn’t read the earlier link to Brendan Gregg’s “Memory Flame Graphs” page, go and do that now. The reason for creating one is that it lets us easily see, visually, which code paths are responsible for the allocations.
To create the memory flame graph I used a slightly modified version of Brendan’s `brkbytes.d` with additional comments:
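A sketch of that script, again following the shape of Brendan’s `brk()` examples (the exact version used for this post may differ):

```
#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option ustackframes=100

/*
 * Sketch only: aggregate brk() heap growth by user stack so that the
 * output can be fed to the FlameGraph tools.
 */
syscall::brk:entry
/pid == $target/
{
	@bytes[ustack()] = sum(arg0 -
	    (curpsinfo->pr_brkbase + curpsinfo->pr_brksize));
}
```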
Again we execute the script with pkgin as our target, after ensuring a clean environment:
Now we can use a couple of tools from Brendan’s FlameGraph repository to convert the stack traces into a flame graph:
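Something along these lines, using the `stackcollapse.pl` and `flamegraph.pl` scripts from that repository (the file names here are illustrative, assuming the `brkbytes.d` output was saved to `pkgin.stacks`):

```
$ ./stackcollapse.pl pkgin.stacks > pkgin.folded
$ ./flamegraph.pl --title="pkgin heap expansion (bytes)" \
      --countname=bytes pkgin.folded > pkgin-brk.svg
```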
The resulting SVG is below; you should be able to mouse over the individual elements for further details.
From the flame graph it’s clear that the majority of allocations are coming from `download_file()`, and we now have an accurate count of how much memory is being allocated by that function.
We can further drill down on our hypothesis by comparing sizes. The command we are running is downloading and installing these two files:
That’s a total of 271,664,552 bytes. According to the flame graph, `download_file()` allocated 271,671,296 bytes. So it seems highly likely it is caching those files, with the 6,744 byte discrepancy likely due to rounding to the nearest page size (4K on SmartOS) plus an additional page for something else. Let’s go to the source to confirm.
Optimising download_file()
The `download_file()` function is reasonably straightforward, and it’s quite clear that we are indeed reading the entire file into RAM before writing it out to disk. Source edited for clarity with added comments (full version here):
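A simplified sketch of the behaviour (not the verbatim pkgin source; names, signature and error handling are approximate):

```c
#include <sys/types.h>
#include <stdlib.h>
#include <fetch.h>

char *
download_file(const char *url, off_t *sizep)
{
	struct url_stat st;
	fetchIO *f;
	char *buf;
	off_t done = 0;
	ssize_t n;

	if ((f = fetchXGetURL(url, &st, "")) == NULL)
		return NULL;

	/* Allocate a single buffer big enough for the entire file. */
	if ((buf = malloc((size_t)st.size + 1)) == NULL) {
		fetchIO_close(f);
		return NULL;
	}

	/* Read the whole download into RAM. */
	while (done < st.size &&
	    (n = fetchIO_read(f, buf + done, (size_t)(st.size - done))) > 0)
		done += n;
	buf[done] = '\0';

	fetchIO_close(f);
	*sizep = done;

	/* The caller writes this buffer out to disk and then frees it. */
	return buf;
}
```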
On return to the caller it writes the returned buffer to a file descriptor and then frees the buffer.
Optimising this is pretty straightforward. We will instead pass an open file descriptor to a new `download_pkg()` function, which will stream to it directly from each successful `fetchIO_read()` via a static 4K buffer. The commit to implement this is here.
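As a rough illustration of the shape of that change (a sketch under assumed names and signature, not the actual commit):

```c
#include <sys/types.h>
#include <unistd.h>
#include <fetch.h>

#define DOWNLOAD_BUFSIZE	4096

static char download_buf[DOWNLOAD_BUFSIZE];

off_t
download_pkg(const char *url, int fd)
{
	struct url_stat st;
	fetchIO *f;
	ssize_t n;
	off_t total = 0;

	if ((f = fetchXGetURL(url, &st, "")) == NULL)
		return -1;

	/* Stream each 4K chunk straight to the open descriptor. */
	while ((n = fetchIO_read(f, download_buf, sizeof(download_buf))) > 0) {
		if (write(fd, download_buf, (size_t)n) != n) {
			fetchIO_close(f);
			return -1;
		}
		total += n;
	}

	fetchIO_close(f);
	return total;
}
```

Heap usage now no longer scales with the size of the package being downloaded.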
Running `brkquantize.d` on the new implementation we see significantly reduced memory usage:
We’ve reduced our initial 383MB usage down to 128MB, and saved around 60 calls to `brk()` in the process - a good start.
Optimising pkg_summary handling
However, 128MB still seems a lot for what the software is doing. Can we do even better?
Let’s start with an updated memory flame graph to see where we stand with the new version:
It’s clear that our `download_*()` functions are no longer on the scene, and now the majority of the memory usage is caused by `update_db()`, accounting for 97MB. This function handles fetching the remote `pkg_summary.bz2` file and transferring its contents into pkgin’s local sqlite3 database, which is then used for local queries.
Analysing `update_db()` is a little more involved than `download_file()`, but we can use flame graphs to help us identify which functions to look at. In this case we want to take a closer look at `decompress_buffer()` and `insert_summary()`.
decompress_buffer()
After calling `download_file()` to fetch the `pkg_summary.bz2` file, the `decompress_buffer()` function is called to decompress it into memory and then free the `download_file()` allocation.
However, why uncompress the entire file before parsing it? Instead we can use libarchive to stream the decompression and process the data a chunk at a time. As it turns out pkgin already links against libarchive but doesn’t actually use it, so this is easy enough to add.
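A minimal sketch of what that streaming could look like with libarchive (the function and the parser hook here are hypothetical, not the actual pkgin change):

```c
#include <sys/types.h>
#include <archive.h>
#include <archive_entry.h>

/* Hypothetical hook that consumes a chunk of decompressed pkg_summary. */
void parse_summary_chunk(const char *, size_t);

static int
stream_summary(const char *path)
{
	struct archive *a;
	struct archive_entry *ae;
	char buf[65536];
	ssize_t len;

	a = archive_read_new();
	archive_read_support_filter_all(a);	/* bzip2, gzip, xz, ... */
	archive_read_support_format_raw(a);	/* pkg_summary is not an archive */

	if (archive_read_open_filename(a, path, sizeof(buf)) != ARCHIVE_OK ||
	    archive_read_next_header(a, &ae) != ARCHIVE_OK) {
		archive_read_free(a);
		return -1;
	}

	/* Hand back the decompressed data one chunk at a time. */
	while ((len = archive_read_data(a, buf, sizeof(buf))) > 0)
		parse_summary_chunk(buf, (size_t)len);

	archive_read_free(a);
	return (len < 0) ? -1 : 0;
}
```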
insert_summary()
While parsing the `pkg_summary` buffer, a set of `INSERT` statements are constructed by this function. However, again we are buffering the whole lot, when instead we could just stream them one by one.
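To illustrate, a sketch of streaming a single record into sqlite as soon as it has been parsed (the table and column names are made up for the example, not pkgin’s actual schema):

```c
#include <sqlite3.h>

static int
insert_record(sqlite3 *db, const char *pkgname, const char *comment)
{
	sqlite3_stmt *stmt;
	int rc;

	/* Hypothetical table/columns, for illustration only. */
	rc = sqlite3_prepare_v2(db,
	    "INSERT INTO remote_pkg (fullpkgname, comment) VALUES (?, ?);",
	    -1, &stmt, NULL);
	if (rc != SQLITE_OK)
		return rc;

	sqlite3_bind_text(stmt, 1, pkgname, -1, SQLITE_STATIC);
	sqlite3_bind_text(stmt, 2, comment, -1, SQLITE_STATIC);
	rc = sqlite3_step(stmt);
	sqlite3_finalize(stmt);

	return (rc == SQLITE_DONE) ? SQLITE_OK : rc;
}
```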
Testing streaming updates
I made some changes to implement streaming updates at each end, reading chunks of our compressed `pkg_summary` file and, once we’d read a complete record, streaming an update to the database. Here’s how the allocations look afterwards:
That’s better - now just 29MB to perform an update. However, that still seems quite a lot, so let’s generate an updated flame graph to see where the rest of the memory is being used.
Ok, so it’s clear the rest of the memory is being used by sqlite. Anything we can optimise there?
Turns out there is. I looked through pkgin to see if it was setting any non-default sqlite parameters, and the very first one immediately caught my eye:
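Working backwards from the figures quoted below (1,024-byte pages and a 976MB cache), the setting was presumably along the lines of:

```
-- Presumed value, reconstructed from the numbers below:
-- 1,000,000 pages x 1,024 bytes/page is roughly 976MB.
PRAGMA cache_size = 1000000;
```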
The manual says that this value is in pages, with a default of 2000, and that the page size defaults to 1024 bytes, so we’re setting up a 976MB cache instead of the default 2MB. This seems to be rather larger than we need, so let’s try just removing that `PRAGMA` and using the default.
That’s worked out very well, and we’re now down to just 9MB, which seems entirely reasonable to me. One final flame graph:
The majority of our usage is now handling the compressed `pkg_summary.bz2` file. For now we will stop there as it’s only 2MB, but future work could include looking at streaming it directly from libfetch to libarchive rather than having to load it all into memory first.
Final thoughts
Given we’ve changed a lot of code, and especially the options around cache sizes, how has performance been affected? We can’t be as accurate here as with our DTrace measurements, but we can perform a real-world benchmark by timing a `pkgin update` run against a localhost repository. I ran each version multiple times and took the fastest result:
Less RAM and significantly faster? I’ll take that!
Summary
By using DTrace and Flame Graphs we are able to quickly identify the code paths responsible for large allocations. By streaming data instead of caching it we are able to significantly reduce the amount of RAM required and simultaneously boost performance.
With these commits in place:
the amount of RAM required to run `pkgin install gcc47` on a clean SmartOS install reduces from 383MB to just 16MB.
I am hoping to get these changes into the version of pkgin we ship for our 2015Q2 package sets, and will also work to get them into upstream pkgin.
August 2015 update (streaming pkg_summary)
Since writing this post I revisited the improvement I mentioned where we can use libarchive to stream directly from libfetch rather than downloading the entire `pkg_summary` file first.
Let’s see where things stand with the 2015Q2 pkgin, which includes all of the fixes described above:
To integrate libfetch directly into libarchive, we split our `download_summary()` function into separate `archive_read_open()` callbacks. These callbacks are called when the archive is opened, read, and closed. Not only does this reduce our memory requirements, it also simplifies the code a little, as libarchive can handle `EOF` and detect download failures.
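A sketch of how those callbacks might fit together (names, fields and error handling are approximate, not the actual commit):

```c
#include <sys/types.h>
#include <archive.h>
#include <fetch.h>

struct fetch_ctx {
	const char	*url;
	fetchIO		*fio;
	char		 buf[8192];
};

/* Called by libarchive when the "archive" is first opened. */
static int
fetch_open_cb(struct archive *a, void *client)
{
	struct fetch_ctx *ctx = client;
	struct url_stat st;

	ctx->fio = fetchXGetURL(ctx->url, &st, "");
	return (ctx->fio != NULL) ? ARCHIVE_OK : ARCHIVE_FATAL;
}

/* Called whenever libarchive needs more compressed data. */
static la_ssize_t
fetch_read_cb(struct archive *a, void *client, const void **bufp)
{
	struct fetch_ctx *ctx = client;
	ssize_t len;

	len = fetchIO_read(ctx->fio, ctx->buf, sizeof(ctx->buf));
	*bufp = ctx->buf;

	/* 0 means EOF; a negative return tells libarchive we failed. */
	return (len < 0) ? ARCHIVE_FATAL : len;
}

/* Called when libarchive is finished with the stream. */
static int
fetch_close_cb(struct archive *a, void *client)
{
	struct fetch_ctx *ctx = client;

	if (ctx->fio != NULL)
		fetchIO_close(ctx->fio);
	return ARCHIVE_OK;
}

/*
 * Wired together with something like:
 *	archive_read_open(a, &ctx, fetch_open_cb, fetch_read_cb, fetch_close_cb);
 */
```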
The commit to implement this is here. One side-effect of this change is that, as the remote INSERTions are now performed inline, we need to remove the separate progress meter, since it conflicts with the libfetch one.
With that change applied we can see the RSS has decreased by a further 2MB, which corresponds to the size of the `pkg_summary.bz2` file we were previously caching in RAM first:
And for completeness’ sake, a final flame graph:
How does it affect runtime? Again I ran each multiple times against a localhost repository and took the fastest time:
That’s another clear win, with a 2MB reduction in RSS usage, and 1.5 seconds shaved off the runtime.