We have a customer reporting a bunch of problems with our (Transparent Proxy Provider-based) suite. Looking at a sysdiagnose, I see that our GUI applet was killed:
Termination Reason: Namespace PORT_SPACE, Code 14123288431434181290 (Limit 305834 ports) Exceeded system-wide per-process Port Limit
Looking at the top.txt
file from it, I see that it has 193,086 ports -- compared to about ~250 on one of my test systems.
Has anyone run into this? Any idea what might be causing it? (I'm still poring over the sysdiagnose, but I don't see any kernel logs around that time -- except that our process does close a dozen or so ports because of cancellation.)
There were .diag files in the sysdiagnose, which showed the crash message. The other one I was looking at, which was closer to 100k, had been alive for 40 seconds, based on top.txt.
The second sysdiagnose might actually be your best data point. Leaking at the kind of rate requires your app to have been doing "something" quite frequently, and if it was running when the sysdiagnose was captured, there's a good chance the spintrace captured it. Open the sysdiagnose and look for the file named "spindump-nosymbols.txt". If you're not familiar with the format, inside you'll find a long list of entries which look something like this (this is from an iOS, but the format is the same):
Process: com.apple.WebKit.WebContent (WebContentExtension) [755]
UUID: D36E12E9-BE53-345F-BA95-69ABE5388F5C
Path: /private/preboot/Cryptexes/OS/System/Library/ExtensionKit/Extensions/WebContentExtension.appex/com.apple.WebKit.WebContent
Identifier: com.apple.WebKit.WebContent
...
Memory Limit: 1536MB
Jetsam Priority: 100
Footprint: 194.63 MB
Time Since Fork: 807s
Num samples: 8 (1-8)
CPU Time: 0.003s (1895.2K cycles, 963.8K instructions, 1.97c/i)
Num threads: 3
Note: 1 idle work queue thread omitted
...
Thread 0x9450 Thread name "JavaScriptCore libpas scavenger" 8 samples (1-8) priority 37 (base 37) cpu time 0.002s (1403.3K cycles, 784.7K instructions, 1.79c/i)
8 ??? (libsystem_pthread.dylib + 2252) [0x1fb7228cc]
8 ??? (libsystem_pthread.dylib + 17444) [0x1fb726424]
6 ??? (JavaScriptCore + 55864) [0x1b54dea38]
6 ??? (libsystem_kernel.dylib + 26068) [0x24833f5d4]
*6 ??? (pthread + 18508) [0xfffffff00a75e17c]
2 ??? (JavaScriptCore + 56084) [0x1b54deb14]
2 ??? (libsystem_kernel.dylib + 26068) [0x24833f5d4]
*2 ??? (pthread + 18508) [0xfffffff00a75e17c]
Binary Images:
...
The stack above is a "spintrace". What's going on here is that the system captured the state of every process’s stack a fixed number of times (Num samples -> "8") and the spintrace format then consolidates those samples into a condensed format to show you what the thread was actually doing in those samples. So, this is the thread entry point:
8 ??? (libsystem_pthread.dylib + 2252) [0x1fb7228cc]
8 ??? (libsystem_pthread.dylib + 17444) [0x1fb726424]
...and the "8" means that function was present in 8 samples (in this case, "all" of them). Then the execution flow "split", so that in 6 samples the thread was doing this:
6 ??? (JavaScriptCore + 55864) [0x1b54dea38]
6 ??? (libsystem_kernel.dylib + 26068) [0x24833f5d4]
*6 ??? (pthread + 18508) [0xfffffff00a75e17c]
And in two samples it was doing this:
2 ??? (JavaScriptCore + 56084) [0x1b54deb14]
2 ??? (libsystem_kernel.dylib + 26068) [0x24833f5d4]
*2 ??? (pthread + 18508) [0xfffffff00a75e17c]
You'll note the "*" next to two numbers. That indicates that this thread was not actively executing at the time the spintrace was captured. That's actually the common case (there are a lot more threads than cores), but you can actually see two different variations at the "end" of the stack. The case above is the most common case, which indicates that the thread was blocked in the kernel waiting on "something". However, the other cases you might see are:
*8 ??? (<94145C7E-F836-38C1-A6BA-1F2984CA2D7D> + 84104) [0xfffffff007f70888]
*8 ??? (<94145C7E-F836-38C1-A6BA-1F2984CA2D7D> + 86648) [0xfffffff007f71278] (running)
...which means "this thread was CPU bound and eligible to run when the trace was captured, but wasn't actually running at the time it was captured.
And finally the least common, as there are lots of threads and only so many cores):
*1 ??? (CoreSymbolication + 154068) [0x1f5db09d4]
*1 ??? (CoreSymbolication + 157500) [0x1f5db173c] (running)
...which means that particular thread was actively executing when the trace occurred.
Finally, the "Binary Images" section has the list of libraries and load addresses, which mean you can symbolicate the data using the process described in "Adding identifiable symbol names to a crash report".
__
Kevin Elliott
DTS Engineer, CoreOS/Hardware