Network Push Provider Wifi Selection Behavior

In our App, we have a network extension with a NEAppPushProvider subclass running. We run the following steps:

  1. Setup a dual-band wireless router per the following:
    1. Broadcasting 2.4 GHz and 5 GHz channels
    2. Same SSID names for both channels
    3. Router connected to the production network
    4. DHCP assigning addresses in the 10.1.x.x network
  2. Connect the mobile device to the 5 GHz network (if needed, turn off the 2.4 GHz network temporarily; once the device connects to the 5 GHz network, the 2.4 GHz network can be turned back on).
  3. Create a NEAppPushManager in the App, set its matchSSIDs property to the SSID of the network above, and call saveToPreferences() on the push manager to save it.
    1. We have UI that shows the extension has been started and has connected to the server successfully.
  4. Walk out of the range of the 5 GHz channel of the router, but stay within range of the 2.4 GHz channel.
  5. Wait for the mobile device to connect to the 2.4 GHz channel.
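
For reference, step 3 looks roughly like the sketch below. This is a simplified illustration rather than our production code: the SSID string, the provider bundle identifier, and the function name are hypothetical placeholders.

```swift
import NetworkExtension

// Hypothetical placeholders: substitute the real SSID and extension bundle ID.
let sharedSSID = "ExampleNetwork"
let providerBundleID = "com.example.app.PushProvider"

/// Creates a push manager that matches the shared SSID and saves it.
func configurePushManager() {
    let manager = NEAppPushManager()
    // Both bands broadcast the same SSID, so a single entry is listed here.
    manager.matchSSIDs = [sharedSSID]
    manager.providerBundleIdentifier = providerBundleID
    manager.localizedDescription = "Local push connectivity"
    manager.isEnabled = true

    manager.saveToPreferences { error in
        if let error = error {
            print("saveToPreferences failed: \(error)")
        } else {
            print("Push manager saved for SSID \(sharedSSID)")
        }
    }
}
```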

Expected: The extension would reconnect to the 2.4 GHz network.

Observed: The extension does not reconnect. Checking the logs for the extension we see that the following was called in the push provider subclass.

stop(with:completionHandler:) > PID: 808 | 🗒️🛑 Stopped with reason 3: "noNetworkAvailable"

The expectation is that start() on the NEAppPushProvider subclass would be called. Is this an incorrect expectation?

How does the NEAppPushProvider handle same network SSID roaming among various band frequencies? I looked at the documentation and did not find any settings targeting 2.4 or 5 ghz networks. Please advise on what to do.

The expectation is that start() on the NEAppPushProvider subclass would be called. Is this an incorrect expectation?

Sort of. NEAppPushProvider doesn't support "reconnection" as such, so what I would actually expect here is:

  1. Start on the 5 GHz network -> Extension1 is running
  2. Walk off the edge of the 5 GHz network
  3. Device leaves the 5 GHz network
  4. Extension1 stops
  5. Device joins the 2.4 GHz network
  6. Extension2 starts

The big thing I'd verify here is that step #5 actually occurred. On a lot of networks the border between 2.4 GHz and 5 GHz coverage is narrow enough that, by the time the device leaves 5 GHz, the signal strength is bad enough that we don't bother joining the 2.4 GHz network. For testing purposes, I'd actually keep the device stationary and well "inside" the 5 GHz zone, then replicate #2 by just turning off the 5 GHz radio. Coming from the other direction, I'd also see what happens if you use different SSIDs (both registered with the push provider) on exactly the same radio connection.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

We've tested it and the phone connects to the 2.4 GHz network. We check the router to make sure, and we have UI in our app that shows that it's able to connect to the internet.

But we also have UI to indicate that the extension is unable to connect to the internet. When observing the log for the extension, our NEAppPushProvider subclass is called to stop, with the NEProviderStopReason being 3: no network. We do not get a subsequent call to init and start(), the extension just sits there waiting.

In your list, item 6, do you call it Extension2 because it should initialize a new push provider subclass when it joins the 2.4 GHz network? Just asking to confirm.

In your list, item 6, do you call it Extension2 because it should initialize a new push provider subclass when it joins the 2.4 GHz network? Just asking to confirm.

I expect the system to create a new process, not just a new instance of the subclass. Stepping back for a moment, this is how the push provider life cycle actually works:

  1. Device joins a matching network.
  2. Extension is launched and start() is called.
  3. Extension runs.
  4. Device leaves the matched network.
  5. Extension stops and is then terminated.
  6. <time passes>
  7. Device joins a matching network, process begins again at #2.
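
As a rough illustration of that life cycle from the extension's side, here is a minimal sketch of a provider that logs every start and stop along with its process ID. The class name and logging subsystem are hypothetical; a different pid between a stop and the next start is what tells you a new process was launched.

```swift
import Foundation
import NetworkExtension
import os

// Hypothetical provider subclass, for illustration only.
final class PushProvider: NEAppPushProvider {
    private let log = Logger(subsystem: "com.example.app", category: "push")
    private var pid: Int32 { ProcessInfo.processInfo.processIdentifier }

    // Called after the device joins a matching network and the extension launches.
    override func start() {
        log.info("[\(self.pid)] start() called")
        // Open the connection to the local server here.
    }

    // Called when the device leaves the matched network (or for another stop reason).
    // The process is expected to be terminated afterwards.
    override func stop(with reason: NEProviderStopReason,
                       completionHandler: @escaping () -> Void) {
        log.info("[\(self.pid)] stop() called, reason \(reason.rawValue)")
        // Tear down the server connection here, then signal completion.
        completionHandler()
    }
}
```

In a log produced this way, the sequence described above should show up as a stop line with one pid followed later by a start line with a different pid.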

That leads me to here:

initialize a new push provider subclass

One of the common mistakes I see developers make is not being aware that their extension is regularly going to be terminated and relaunched, and then misunderstanding what the data they're looking at actually means. For example, trying to "actively" debug a push provider extension is only useful under very specific circumstances (for example, when you're targeting a specific, known failure). In broader testing, the debugger just gets in the way, causing you to miss secondary launches or even preventing those secondary launches entirely. Once your extension basically "works", testing and debugging need to be done through log data to avoid disrupting the entire system you're trying to test.

Related to that point, I strongly recommend creating your own standalone log architecture that:

  • Logs your messages to your own file (so you can easily trace what "you" did).

  • Logs the same messages to the system console (so you can easily correlate your own activity against the broader system).

  • In your own logging, includes the current pid (process id) in every log message, typically as the first value in the log message.

On that last point, if everything is formatted properly, the pid is easy to filter out when scanning an extended log, but any change still tends to "jump out" when you're scanning. It takes up minimal space and I've seen way too many cases where it was the one detail that suddenly made everything clear.
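
A minimal sketch of that kind of logger might look like the following. The type name, line format, and file location are all assumptions made for illustration; the point is simply that every message goes to both destinations and starts with the pid.

```swift
import Foundation
#if canImport(os)
import os
#endif

/// Hypothetical helper: writes each message to a private file and to the
/// system log, with the current process ID as the first value on every line.
final class ExtensionLog {
    private let fileURL: URL
    private let pid = ProcessInfo.processInfo.processIdentifier
    private let queue = DispatchQueue(label: "ExtensionLog") // serializes file writes

    init(fileURL: URL) {
        self.fileURL = fileURL
    }

    /// Builds a line of the form "[pid] timestamp message".
    func format(_ message: String, date: Date = Date()) -> String {
        let stamp = ISO8601DateFormatter().string(from: date)
        return "[\(pid)] \(stamp) \(message)"
    }

    func log(_ message: String) {
        let line = format(message)
        queue.sync { append(line + "\n") }
        #if canImport(os)
        os_log("%{public}@", log: .default, type: .default, line)
        #else
        print(line)
        #endif
    }

    /// Appends to the file, creating it on first use.
    private func append(_ text: String) {
        let data = Data(text.utf8)
        if let handle = try? FileHandle(forWritingTo: fileURL) {
            defer { try? handle.close() }
            handle.seekToEndOfFile()
            handle.write(data)
        } else {
            try? data.write(to: fileURL)
        }
    }
}
```

Because the pid prefix has a fixed shape, it is easy to strip or filter, and a change in its value between two lines marks a process boundary.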

All that leads back to here:

We do not get a subsequent call to init and start(), the extension just sits there waiting.

How are you tracking this "waiting"? Do you just mean that your log doesn't show it starting again, or do you mean that you're doing some kind of "active" monitoring? My concern here is what I talked about with the debugger above: many forms of "active" monitoring can prevent the extension from terminating, which will then prevent the "next" launch.

Similarly, expanding on this point:

We've tested it and the phone connects to the 2.4 GHz network.

The part that can be tricky here is that it's entirely possible for a device to:

  1. Be "on" a Wifi network, in the sense that "the device is aware of the network and making some effort to use it".

  2. Have communication quality so poor that the device doesn't actually consider the network truly "working" and may not have even launched your extension.

Keep in mind that this is not a theoretical concern. I've seen many real world networks where there was a surrounding "band" of several feet where an iPhone could clearly receive from the AP but the AP was basically incapable of receiving from the iPhone, leaving the device "stuck" in the state above.

This kind of issue is why I specifically recommend this kind of testing:

For testing purposes, I'd actually keep the device stationary and well "inside" the 5 GHz zone, then replicate #2 by just turning off the 5 GHz radio.

...as it removes all radio variability from the testing process. Radio variability is not something your app can do anything about, so you need to make sure it isn't distorting any testing you do.

Another concern I have is here:

we have UI in our app that shows that it's able to connect to the internet.

One of the mistakes I've seen over and over again when looking at this kind of issue is assuming an underlying failure based on relatively vague data without having actively investigated what was ACTUALLY going on. Case in point, it's easy to see how this would "look" like a problem:

  • Device shows WiFi connectivity on the screen*.

  • App is able to connect to your server.

  • Local Push Connectivity Extension is NOT running.

...but that is NOT necessarily the case. The state above is normal and expected if:

  1. The Wifi radio is stuck in the state I described above.

  2. App is using one of our "modern" network APIs.

In that case, your app is actually connecting over cellular, not Wifi, and the system hasn't launched your extension because it has (correctly) assessed that the Wifi network does not in fact "work" (yet).

*You could argue that the device shouldn't show connectivity before the network "works". However, in practice, this is a case of trying to pick the option that will least annoy users.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Just a quick FYI: There's no cellular data on these phones, and the server that the app connects to is on a local network and not accessible from the Internet. Also, the app and extension were debugged using logs written to a file, not through a debugger. The logs already include the PID, but some improvements are being made to give clarity regarding different extension processes.

Regarding the following:

Case in point, it's easy to see how this would "look" like a problem:

  • Device shows WiFi connectivity on the screen*.
  • App is able to connect to your server.
  • Local Push Connectivity Extension is NOT running.

...but that is NOT necessarily the case. The state above is normal and expected if:

  1. The Wifi radio is stuck in the state I described above.
  2. App is using one of our "modern" network APIs.

The app being stuck is a result of the following:

  1. "The Wi-Fi radio is stuck."

The QA team uses an app to detect the BSSID, which allows them to confirm if they have connected to the 2.4 GHz network. They also opened up a web browser and were able to connect to the internet.

Given these facts would it then be correct to assume the wifi radio is not stuck in this scenario? I think there needs to be clarification on what it means for the Wifi radio to be stuck. There's the impression from the rest of the team that the Wi-Fi radio being stuck and being able to access the internet are mutually exclusive things, but from your comments it seems like they are not. Please clarify if this is true. Also would info in SCNetworkReachabilityFlags provide any additional context?

  2. "The app is using one of our "modern" network APIs."

Are you saying the current network API that I'm using is causing the extension to be stuck? Just to make sure we're all on the same page, what APIs are you referring to? The key API that is of concern is NetworkExtension, specifically NEAppPushProvider. Is there any other specific API that I should be exploring? Are you suggesting that NEAppPushProvider is acting as it is expected to?

I think there needs to be clarification on what it means for the Wifi radio to be stuck.

What I mean by "stuck" is that the radio environment is such that the device can "see" the AP and attempt to communicate with it, but is not able to actually establish an IP connection that's reliable enough to actually "work".

Given these facts would it then be correct to assume the wifi radio is not stuck in this scenario? There's the impression from the rest of the team that the Wi-Fi radio being stuck and being able to access the internet are mutually exclusive things, but from your comments it seems like they are not. Please clarify if this is true.

Yes, that's generally correct, ignoring edge cases (like cellular).

Also would info in SCNetworkReachabilityFlags provide any additional context?

First off, as a general rule, SCNetworkReachability is an old API* that no one should be using. If you want to understand how network activity is being routed, use "NWPathMonitor". Having said that, no, I would not expect that to provide any new information.

*Note that as the primary author of the Reachability sample, I'm well qualified to make that statement.
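
If you do want to look at routing, a minimal NWPathMonitor sketch might look like this. The queue label and the printed format are arbitrary choices made for illustration.

```swift
import Foundation
import Network

// Observes the system's current network path and reports which
// interface types it is actually using.
let monitor = NWPathMonitor()
let monitorQueue = DispatchQueue(label: "PathMonitor") // arbitrary label

monitor.pathUpdateHandler = { path in
    let pid = ProcessInfo.processInfo.processIdentifier
    let usable = path.status == .satisfied
    let wifi = path.usesInterfaceType(.wifi)
    let cellular = path.usesInterfaceType(.cellular)
    let names = path.availableInterfaces.map { $0.name }
    print("[\(pid)] usable=\(usable) wifi=\(wifi) cellular=\(cellular) interfaces=\(names)")
}

// Updates are delivered on the queue passed to start(queue:).
monitor.start(queue: monitorQueue)
```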

Are you suggesting that NEAppPushProvider is acting as it is expected to?

I'm specifically referring to the APIs you used inside your extension, not the extension point itself.

Are you saying the current network API that I'm using is causing the extension to be stuck?

Possibly. What networking API are you using? Ideally you'd be using the Network framework or possibly BSD sockets. If you're using an older API, particularly CFSocketStream (or its Objective-C variant), then that could be a problem.
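
For comparison, a connection made with the Network framework might look roughly like the sketch below. The host address, port, and function name are hypothetical stand-ins for your local server, and plain TCP is assumed only to keep the example short.

```swift
import Foundation
import Network

// Hypothetical address and port for the local push server.
let serverHost: NWEndpoint.Host = "10.1.0.10"
let serverPort: NWEndpoint.Port = 8443

/// Builds (but does not start) a TCP connection that logs its state changes.
func makeServerConnection() -> NWConnection {
    let connection = NWConnection(host: serverHost, port: serverPort, using: .tcp)
    connection.stateUpdateHandler = { state in
        let pid = ProcessInfo.processInfo.processIdentifier
        switch state {
        case .ready:
            print("[\(pid)] connection ready")
        case .waiting(let error):
            // No usable path yet; the framework retries when the path changes.
            print("[\(pid)] connection waiting: \(error)")
        case .failed(let error):
            print("[\(pid)] connection failed: \(error)")
        default:
            break
        }
    }
    return connection
}
```

The provider would keep a strong reference to the returned connection and call start(queue:) on it from start(). Logging the waiting state shows whether the extension is running but has no usable path.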

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
