Any recent changes to dlopen() implementation?

In some recent releases of macOS (14.x and 15.x), we have noticed what seems to be a slower dlopen() implementation. I don't have any numbers to support this theory; I happened to notice the slowness while investigating something unrelated. In one part of the code we have a call of the form:

const char *fooBarLib = ....;
dlopen(fooBarLib, RTLD_NOW | RTLD_GLOBAL);

It so happened that, due to some timing-related issues, the process was crashing. Slow execution in this part of the code would trigger an issue in another part of the code, which would then lead to a process crash. The crash itself isn't a concern, because it's an internal issue that will be addressed in the application code. What was interesting is that the call to dlopen() appears to contribute to the slowness. Specifically, whenever the slowness was observed, the crash reports showed stack frames of the form:

Thread 1:
0 dyld 0x18f08b5b4 _kernelrpc_mach_vm_protect_trap + 8
1 dyld 0x18f08f540 vm_protect + 52
2 dyld 0x18f0b87e0 lsl::MemoryManager::writeProtect(bool) + 204
3 dyld 0x18f0a7fe4 invocation function for block in dyld4::Loader::findAndRunAllInitializers(dyld4::RuntimeState&) const + 932
4 dyld 0x18f0e629c invocation function for block in dyld3::MachOAnalyzer::forEachInitializer(Diagnostics&, dyld3::MachOAnalyzer::VMAddrConverter const&, void (unsigned int) block_pointer, void const*) const + 172
5 dyld 0x18f0d9c38 invocation function for block in dyld3::MachOFile::forEachSection(void (dyld3::MachOFile::SectionInfo const&, bool, bool&) block_pointer) const + 496
6 dyld 0x18f08c2dc dyld3::MachOFile::forEachLoadCommand(Diagnostics&, void (load_command const*, bool&) block_pointer) const + 300
7 dyld 0x18f0d8bcc dyld3::MachOFile::forEachSection(void (dyld3::MachOFile::SectionInfo const&, bool, bool&) block_pointer) const + 192
8 dyld 0x18f0db5a0 dyld3::MachOFile::forEachInitializerPointerSection(Diagnostics&, void (unsigned int, unsigned int, bool&) block_pointer) const + 160
9 dyld 0x18f0e5f90 dyld3::MachOAnalyzer::forEachInitializer(Diagnostics&, dyld3::MachOAnalyzer::VMAddrConverter const&, void (unsigned int) block_pointer, void const*) const + 432
10 dyld 0x18f0a7bb4 dyld4::Loader::findAndRunAllInitializers(dyld4::RuntimeState&) const + 176
11 dyld 0x18f0af190 dyld4::JustInTimeLoader::runInitializers(dyld4::RuntimeState&) const + 36
12 dyld 0x18f0a8270 dyld4::Loader::runInitializersBottomUp(dyld4::RuntimeState&, dyld3::Array<dyld4::Loader const*>&, dyld3::Array<dyld4::Loader const*>&) const + 312
13 dyld 0x18f0ac560 dyld4::Loader::runInitializersBottomUpPlusUpwardLinks(dyld4::RuntimeState&) const::$_0::operator()() const + 180
14 dyld 0x18f0a8460 dyld4::Loader::runInitializersBottomUpPlusUpwardLinks(dyld4::RuntimeState&) const + 412
15 dyld 0x18f0c089c dyld4::APIs::dlopen_from(char const*, int, void*) + 2432
16 libjli.dylib 0x1025515b4 DoFooBar + 56
17 libjli.dylib 0x10254d2c0 Hello_World_Launch + 1160
18 helloworld 0x10250bbb4 main + 404
19 libjli.dylib 0x102552148 apple_main + 88
20 libsystem_pthread.dylib 0x18f4132e4 _pthread_start + 136
21 libsystem_pthread.dylib 0x18f40e0fc thread_start + 8 

So, out of curiosity, have there been any known changes in the implementation of dlopen() which might explain the slowness?

Like I noted, I don't have concrete numbers, but as a rough estimate I don't think it's slower by a large amount, maybe a few milliseconds. I guess what I am trying to understand is whether there's anything that needs attention here.

Answered by DTS Engineer in 827406022
Written by jaikiran in 775550021
have there been any known changes in the implementation of dlopen() which might explain the slowness?

There isn’t a good way to answer this. You’re talking about a span of two or three major OS releases, each with a bazillion changes that could introduce a performance regression in this space.

If you really want to dig into this you’d need to create an apples-to-apples comparison between an old OS release, where this was ‘fast’, and a new OS release, where it’s slow.

Having said that, dlopen is not the path to good performance |-: The system as a whole, and the dynamic linker specifically, is definitely optimised for launching native apps, where the vast majority of code is included in the closure that the dynamic linker forms from the main executable. Now, that doesn’t mean that dlopen is deliberately slow, just that it doesn’t get the level of optimisation that we give to the more common case.

I realise that your specific setup is gonna end up using dlopen a lot. I wonder if you could do something with the new mergeable libraries feature to package your runtime up with all the native modules into a single app executable. I suspect that’d seriously help with launch times.

An Apple Library Primer has links to info on mergeable libraries.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

Hello Quinn,

There isn’t a good way to answer this. You’re talking about a span of two or three major OS releases, each with a bazillion changes that could introduce a performance regression in this space.

I understand. I wasn't too sure if I wanted to raise this question in the first place, given that I don't have anything concrete to follow up on. I was merely hoping that the stack frames might provide some hints on whether or not whatever is happening in that call stack should be happening at all. But yes, right now I don't have anything concrete. I'll keep an eye out to see if any of this warrants a deeper investigation.

I realise that your specific setup is gonna end up using dlopen a lot. I wonder if you could do something with the new mergeable libraries feature to package your runtime up with all the native modules into a single app executable. I suspect that’d seriously help with launch times.

An Apple Library Primer has links to info on mergeable libraries.

Thank you for that link. I'm going to read up on it and discuss it with my teammates who are more familiar with this area, to understand if there's anything we could do with that feature.

Written by jaikiran in 827443022
I was merely hoping that the stack frames might provide some hints

Fair enough. Looking at that backtrace:

  • Frames 21 through 20 show that this is running on a thread that isn’t the main thread.

  • Frames 19 through 16 are your code, presumably ending in that dlopen call.

  • Frames 15 through 2 are all in the dynamic linker. It looks like it’s formed a closure (the tree of libraries rooted at the library you’re loading) and is in the process of calling initialisers for all the new libraries.

  • Specifically, frame 2 suggests it’s just about to run newly mapped memory as code, which means it needs to write protect it.

  • Frames 1 and 0 are it calling the kernel to do exactly that.

At that point you’ve run off user space and need to start looking at the kernel. There are various tools that can help with that, including spindump, the System Trace instrument template, DTrace [1], and so on. However, getting good results with those will require a significant investment in time.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

[1] DTrace requires you to disable SIP, which may well perturb the performance. There are undocumented csrutil flags that let you keep SIP enabled while enabling just DTrace.
