Monday, October 21, 2013

Response to "Optimizing Linux Memory Management..."

I read a fantastic article last week by some engineers from LinkedIn.  It was fantastic because it really made me think about how we want the kernel to work.  But, it also contains inaccuracies that trouble me.  I have a few thoughts on the article.
  1. The "zone reclaim" feature at the root of the authors' troubles is not enabled everywhere, not even on all NUMA systems.  Hardware vendors are essentially responsible for whether this feature is on or not.
  2. The "Linux is quite bad at cleaning up this garbage properly" comment really stings.  It's actually the opposite of what I believe and have been advising folks on for years.  Linux is actually fantastic at managing its garbage.  It is arguable that the kernel's default behavior is not obvious, and that we should use zone_reclaim_mode=0 in all but the most extreme NUMA environments.  But, the fact is that the kernel was working both as designed and as documented.
  3. Do not read too much into this article, especially if you are trying to understand how the Linux VM or the kernel works.  The authors misread the "global spinlock on the zone" source code and the interpretation in the article is dead wrong.
  4. Memory pressure is caused when someone needs a particular kind of memory.  Usually, that memory is simply any free memory.  But, not all memory is the same from the kernel's perspective: you can see pressure when there is lots of free memory of other kinds.  A few examples of these special needs would be DMA-capable memory, physical contiguity for large pages, "low" memory, and NUMA locality.  The authors made the fundamental mistake of assuming that having any free memory means that there is no pressure.
  5. There is no such thing as "NUMA memory balancing" in the kernel that was running.  The authors are observant in noticing that direct page scans and thp_splits occur at the same times, but they are wholly incorrect in assuming that these constitute any intentional rebalancing.  "Transparent HugePages do not play nice with NUMA systems" is also a dangerously broad thing to say, and it is not supported by even the data that the authors present.
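Point 4 above can be made concrete by looking at free memory per zone rather than as one global number. The following is a minimal sketch, assuming the usual /proc/buddyinfo layout ("Node N, zone NAME" followed by free-block counts for order 0, 1, 2, ...); the function name free_pages_per_zone is made up for this example.

```shell
# Sum the free pages in each zone from a buddyinfo-style file.
# Column 5 holds the count of free order-0 blocks, column 6 order-1, etc.,
# and an order-k block is 2^k pages.
# The default path is the standard procfs file; pass another path to test.
free_pages_per_zone() {
    awk '{
        pages = 0
        for (i = 5; i <= NF; i++)
            pages += $i * 2 ^ (i - 5)
        node = $2
        sub(/,/, "", node)    # "0," -> "0"
        printf "node %s zone %s: %d free pages\n", node, $4, pages
    }' "${1:-/proc/buddyinfo}"
}
```

Running free_pages_per_zone with no arguments on a NUMA machine may show one node nearly empty while another has plenty free, which is the kind of situation where an allocation that wants node-local memory can see pressure.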
The thing that most troubles me is how difficult it was for these fellow software engineers (with access to the source code and documentation for their kernel) to figure out what the kernel was doing.  How do we get end users to make the leap from "I see latency spikes in my custom database" to "I should set zone_reclaim_mode=0"?
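For anyone who wants to check their own system, here is a small sketch that reports whether zone reclaim is on. It assumes the sysctl is exposed at /proc/sys/vm/zone_reclaim_mode (it may be absent on kernels built without NUMA support); the function name zone_reclaim_status is hypothetical.

```shell
# Report whether zone reclaim is enabled.
# Takes an optional path so it can be pointed at a test file.
zone_reclaim_status() {
    file="${1:-/proc/sys/vm/zone_reclaim_mode}"
    mode=$(cat "$file") || return 1
    if [ "$mode" -eq 0 ]; then
        echo "zone reclaim is off"
    else
        echo "zone reclaim is on (mode $mode)"
    fi
}

# Usage:
#   zone_reclaim_status
# To turn it off until the next reboot (needs root):
#   sysctl -w vm.zone_reclaim_mode=0
```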

The LinkedIn folks also pointed out that very few Google searches end up pointing to the (wonderful) linux-mm wiki.  Does everybody know that it's there?  Does anybody actually use it?

Wednesday, April 10, 2013

USB (EHCI) Debugging Ports

 In my new job, I'm looking at kernel scalability on an 8-socket (160 logical CPUs!) system.  Unfortunately, I'm having some trouble getting it to even boot with 3.9-rc kernels.  It's been dying in very early boot, and it doesn't have a hardware serial port!

I've recently discovered that there is early printk support for "USB Debug Ports", which I should be able to use in place of serial ports.  Darren Hart was kind enough to let me borrow his USB EHCI debug adapter, since the manufacturer is not selling them at the moment.


The debug adapter has two sides and it is not symmetric. This can be a bit confusing since if you plug it in to both systems while they are booted, you will get a /dev/ttyUSB0 on both that behaves the same.  In my case, only one side will power the device, and that is the side which goes in to the good system (the one you are not debugging).

I tried to get it to work on three different systems.  I only got it to work on one. :(



  • Lenovo S10-3 netbook - Only one EHCI debug controller, but never got it to work. I assume the debug port is not connected to the outside world.
  • Acer Veriton X4618G
    • My system has two USB controllers, which means that I might need either "earlyprintk=dbgp" *OR* "earlyprintk=dbgp1". Theoretically, I could probably figure out which of the two physical controllers was which in lspci, then figure out the order in which earlyprintk=dbgp enumerates those, then track down how they're connected. But, I'm dumb, and I don't expect myself to perform this procedure correctly.
    • My motherboard has 4 internal 10-pin USB headers (each with two actual ports), plus another 6 ports on the back. So, I've got 14 possible ports and two possible boot options. Worst case, I'm going to have to boot the system 28 times! I ended up finding the port on my 17th boot. Take good notes!
    • I never did get "earlyprintk=dbgp" to work. Only "earlyprintk=dbgp1" worked. I'm not sure if this was my error somehow, or if that port is just not exposed.
    • Some of the internal motherboard headers, despite having 10 pins, only actually have one port on them. I _believe_ these are intended for front-panel card readers, but just beware.
  • Fujitsu Primequest 1800e2
    • This system's problems in early boot were the reason I tried EHCI debugging to begin with.
    • This is a large, expensive system.  The USB ports exposed to the outside world all appear to be on different USB buses than the ones that have the USB debug functionality.  The CD drive claimed to be on the right bus, but the USB bus was not accessible.
    • I gave up and got a PCIe serial card instead.
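As a starting point for the "which controller is which" step, the sketch below filters lspci output down to the devices that advertise a debug port. It assumes that "lspci -vvv" prints a "Debug port:" capability line under such controllers and that it is run as root so capabilities are readable. The helper name debug_port_controllers is made up, and the order in which earlyprintk=dbgpN counts these controllers is not something this script establishes.

```shell
# Read `lspci -vvv` output on stdin and print the header line of every
# device that lists a "Debug port" capability.
debug_port_controllers() {
    awk '
        /^[0-9a-f]/   { dev = $0 }    # unindented lines start a new device
        /Debug port:/ { print dev }
    '
}

# Usage (as root):
#   lspci -vvv | debug_port_controllers
```

This only narrows down the controller. Finding the physical port that is wired to it still takes trial and error.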

Friday, February 8, 2013

GFP_ATOMIC Allocation Failures

I see a good number of bug reports or complaints about GFP_ATOMIC allocations failing.  The general consensus seems to be that if there is an allocation failure in the kernel, something is wrong with the kernel (or specifically the VM).

GFP_ATOMIC allocations are special.  If you call kmalloc(GFP_KERNEL), or alloc_page(GFP_KERNEL), you actually implicitly pass a couple of other flags in:

 #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
The most important of those is __GFP_WAIT.  If you call in to the allocator with that flag, it tells the VM that it is OK for you to sleep for a bit.  The VM will actually put the allocating process to sleep in order to go off and free up other memory.

But, GFP_ATOMIC does not have the __GFP_WAIT bit set.  It effectively says to the kernel, "give me memory NOW and don't put me to sleep."  But, the implication here is that the kernel will not be given a chance to go off and free other memory.

Let's say you walk in to a fast food joint because you want french fries.  You walk up to the counter and order your fries, but they are out.  If you can wait (__GFP_WAIT), you'll get your fries in a moment once they cook another batch.  But, if you are in a hurry and cannot wait for another batch to be cooked (GFP_ATOMIC), you are going to have to walk away empty handed.  You can complain that there should have been more fries cooked in advance (increase min_free_kbytes), but that only makes them run out less often; it does not keep them from ever running out.

Inherently, being in a hurry (GFP_ATOMIC) exposes you to the possibility of failure.  If you do not want failures, then do not use GFP_ATOMIC.


Friday, January 25, 2013

Honey, please stop spamming me!


Recently, I have received many emails that look like they're from my wife and other family members, and have innocuous subjects like "great" or "hi".  However, once opened, they usually contain a single http link to some suspicious-looking site.

My first reaction was, "oh no, my wife's computer is infected!"  But, upon closer examination, the messages all say:
From: MyWife Hansen <notmywife@yahoo.com>
The From: name matches, but the email address does not.  This is basically normal spam, but with a twist: someone has found out who I know in order to craft messages which I am very likely to open.   How are they making the connection?  There were two clues.

  1. I only received these "from" people that have my same last name.  My dad, brother, and wife, but not my sister who goes by her married name.  
  2. Of the messages from my wife, some included her maiden and married names, which is highly unusual.  She only goes by one or the other... except on Facebook.

Facebook has a "feature" where you can search for friends by email address.  I believe the spammers are creating fake Facebook and Yahoo accounts.  Once they are ready to spam me, they look up my email on Facebook and look at my friends list.  They pick a friend with the same last name, and set that as the "From" name.  From Facebook, they do not know what "MyWife Hansen"'s email address is, so they use one of the fraudulent addresses.  Some lessons:

  1. Yahoo is the source of the spam, probably by letting fake accounts get created.   This has been going on for months at least.
  2. Yahoo might check that the account and email address in "From:" match, but does not check the name, or use it to help indicate spam when messages are sent.
  3. Once again, sharing information on Facebook is dangerous and has unintended consequences.  In this case, the only information the spammers needed was your email and your friends list.
Takeaway: Lock down your Facebook account so that your friends list is not available, especially to non-friends.  You can also remove the ability for people to find you by email address.

Friday, October 28, 2011

Speeding up Kernel Build/Reboot/Test Cycles with iPXE

I make the kernel crash a lot. To debug those crashes, I add a lot of printk()s, recompile the kernel, and make it crash again. I repeat this until I fix the crash. This is time consuming, especially when the crash is bad enough that the system is unusable. For every cycle, I have to:

  1. Gather all of the information I need about the crash
  2. Reboot the system
  3. Boot to a good kernel
  4. Recompile, or copy the kernel over
  5. Reboot again
  6. Load the new kernel

On some hardware, each of these reboots can be upwards of several minutes. Virtually all (>90%) of the time is spent waiting for the firmware before the bootloader and way before the kernel loads. If I could get this down to one reboot cycle instead of two, it would drastically reduce the amount of time that this takes. Ideally, I'd also like to be compiling in parallel with the system booting.

My solution to this is to use a separate build machine and iPXE (I used to use GRUB for this). In short, iPXE can boot your system from the network. I use it instead of the "boot to a good kernel" and "recompile, or copy the kernel over" steps.

Step 1: Put your kernel image on a web server

There are many ways to do this, but here's how I do it:

  1. Make sure ~/public_html exists and can serve content. If your web server uses a different prefix, be sure to adjust the paths accordingly.
  2. On my compile server, I take a copy of /sbin/installkernel and put it in ~/bin/installkernel.
  3. When I type "make install" in a kernel tree, it runs this script. My script places all of the kernel images in ~/public_html/.
  4. Ensure that you can fetch kernel images by navigating to this path on the web server in your browser.
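The modified installkernel itself is not reproduced here, so the following is only a guess at its general shape. It assumes the kernel's "make install" passes the release version as the first argument and the path to the built image as the second. The function name publish_kernel and the vmlinuz naming scheme are assumptions made for this sketch.

```shell
# Hypothetical core of a ~/bin/installkernel replacement.
# Copies the freshly built image in to the web server's directory.
publish_kernel() {
    version="$1"                      # e.g. the output of "make kernelrelease"
    image="$2"                        # path to the built kernel image
    dest="${3:-$HOME/public_html}"    # where the web server looks
    mkdir -p "$dest" || return 1
    cp "$image" "$dest/vmlinuz-$version" || return 1
    # Keep an unversioned copy so one fixed URL always serves the newest build.
    cp "$image" "$dest/vmlinuz"
}

# In an actual installkernel script, the last line would be something like:
#   publish_kernel "$1" "$2"
```

Keeping an unversioned copy means the boot configuration never has to change between builds, while the versioned copies make it possible to go back to an older image.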

Step 2: Get, Compile, and Install iPXE

Compile yourself:

git clone http://git.ipxe.org/ipxe.git
cd ipxe/src
make -j4 bin/ipxe.lkrn
cp bin/ipxe.lkrn /boot

Step 3: Get iPXE the information it needs to boot

iPXE needs two bits of information to boot:

  1. an IP address
  2. A URL to boot from

Those can be assigned at compile time, passed in by a DHCP server, or passed in at boot-time. I prefer to let the DHCP server assign the IP address, but I pass in the URL at boot-time from GRUB by putting the following in menu.lst:

title iPXE uuid
kernel /boot/ipxe.lkrn && dhcp && chain http://1.2.3.4/~dave/ipxe.script

Remember to fill in your IP address and URL with appropriate values for your server.

Step 4: Tell iPXE from where to fetch a kernel image and what arguments to pass

Note that the menu.lst entry above references a URL on a web server. We need to go make sure that file exists. The kernel command-line should be a copy of what you would normally put in menu.lst.

$ cat ~/public_html/ipxe.script
#!gpxe
kernel http://1.2.3.4/~dave/vmlinuz root= ro debug profile=2 console=ttyS0,115200
initrd http://1.2.3.4/~dave/initrd-from-boot.img
boot

Again, that will differ for your systems. You can either build the initrd each time you compile a kernel, or just ensure that the kernel image doesn't need any modules at all, and use an arbitrary initrd. Note that the whole kernel command-line is now in this file. That means that you can edit it on the web server with vim or emacs instead of having to do it on the GRUB command-line. I greatly prefer this to trying to use GRUB's console.

iPXE also has support for variables. You can do fun things like fetch a file with the system's IP address in the filename:

kernel http://9.47.67.96/~dave/vmlinuz-${net0/ip}

Monday, October 17, 2011

Working Around "Closed" Framechannel Devices

I got a nice WiFi digital picture frame (Motorola LS1000WB) for my Grandma so that she can keep tabs on the family. It was really handy since it could power on directly in to a state where it fetched pictures from a service called Framechannel. However, the economy evidently got the best of them and Framechannel shut down. The frame now powers on to a nice configuration screen (not very grandma-friendly). It's also closed-source and not very hackable.

However, some intrepid folks have dug up a certification checklist and documented the XML format that the device uses.

The hardest part in all of this is getting a hold of "rss.framechannel.com". You need to trick the frame into going to a site you control instead of the defunct Framechannel one. If you have an OpenWRT or DD-WRT router, this is fairly simple. You just put an entry in your router's /etc/hosts that says something like:
192.168.22.33 rss.framechannel.com
where 192.168.22.33 is the IP of your web server. You can do this with bind, but it's a bit more involved. After you get this part working, you need to get some pictures into Framechannel's XML format. I use this script to fetch a Picasa feed and put it in Framechannel's format. Lastly, you need a web server which can serve the XML back out. In my case, I needed a path like this:
/productId=MOT001/frameId=00FD53221ABC/language=en/firmware=20090721
I did it with this simple script:
DIR=/var/www/productId=MOT001/frameId=00FD53221ABC/language=en/
mkdir -p "$DIR"
perl picasa-to-framechannel-rss.pl [your RSS feed here] > "$DIR/firmware=20090721"
Note: if you are going to do this, remember that this makes your pictures publicly accessible. You should at least set your web server to not let folks get directory indexes on "/productId=MOT001". But, they can still guess your frameId pretty easily.

Despite the frame itself being closed, the openness of apache, bind/dnsmasq and the XML format it uses allowed the frame to be resurrected from doorstop status to a fully-working frame again.

Thursday, September 22, 2011

addr2line keeps me sane

Doing kernel work, I end up with a lot of text dumps of things. It's typical to get lots of junk that looks like gibberish:
[71818.339389] bash            S 0000000000000000     0  3829   3753 0x00000000
[71818.339389] ffff88007a5fdd38 0000000000000086 0000000000000001 0000000000011e80
[71818.339389] 0000000000000000 ffff88007af31080 ffff88007bc15040 ffff88007fc11e80
[71818.339389] ffff88007af31080 0000000000000000 ffff88007fc11e80 0000000000000000
[71818.339389] Call Trace:
[71818.339389] [] ? check_preempt_curr+0x7a/0x90
[71818.339389] [] ? try_to_wake_up+0x1e5/0x280
[71818.339389] [] schedule+0x45/0x60
[71818.339389] [] schedule_timeout+0x14f/0x250
[71818.339389] [] n_tty_read+0x2f0/0x810
[71818.339389] [] ? try_to_wake_up+0x280/0x280
[71818.339389] [] tty_read+0xa6/0xe0
[71818.339389] [] vfs_read+0xcb/0x170
[71818.339389] [] sys_read+0x55/0x90
[71818.339389] [] system_call_fastpath+0x16/0x1b
Let's say you were trying to interpret this stack trace. Sometimes, the compiler will inline function calls and they might not show up in a stack trace, so it is not immediately apparent how tty_read() might call try_to_wake_up(). You can disassemble or use a debugger, but those both require skill. I prefer to replace having skill with tools instead, which is why I love addr2line. You need to feed it a vmlinux (not a vmlinuz or bzImage mind you), but its output is wonderful:

dave@kernel:~/linux-2.6.git$ addr2line -e vmlinux ffffffff81389ff6
/home/dave/work/linux-2.6.git/drivers/tty/tty_io.c:959
dave@kernel:~/linux-2.6.git$ vi /home/dave/work/linux-2.6.git/drivers/tty/tty_io.c +959
Which points to:
                i = (ld->ops->read)(tty, file, buf, count);
        else
                i = -EIO;
        tty_ldisc_deref(ld); <------------------
        if (i > 0)
                inode->i_atime = current_fs_time(inode->i_sb);
        return i;
}
and it's fairly easy to follow the call path from there.
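Stack traces print entries as symbol+offset, while addr2line wants an absolute address. Here is a minimal sketch of the conversion, assuming a System.map-style file with lines of the form "address type symbol" and 16-digit (64-bit) addresses. The function name sym_addr is made up. It adds the offset to the low 32 bits only, because some shells cannot do arithmetic on full 64-bit kernel addresses, and it simply fails if that addition would carry.

```shell
# Print the absolute address of "symbol + offset" using a System.map-style file.
sym_addr() {
    map="$1"; sym="$2"; off="$3"
    base=$(awk -v s="$sym" '$3 == s { print $1; exit }' "$map")
    [ "${#base}" -eq 16 ] || return 1       # symbol not found, or not 64-bit
    high=${base%????????}                   # first 8 hex digits, kept as text
    low=${base#????????}                    # last 8 hex digits
    sum=$(( 0x$low + $off ))
    [ "$sum" -le 4294967295 ] || return 1   # no carry handling in this sketch
    printf '%s%08x\n' "$high" "$sum"
}

# Usage:
#   addr2line -e vmlinux "$(sym_addr System.map tty_read 0xa6)"
```

Adding -i to the addr2line command also lists any inlined callers at that address, which helps with the inlining problem mentioned above.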