DNSProxyProvider - High CPU usage

Hello everyone,

I am currently experiencing an issue with a network extension that we've developed, which seems to be causing problems on a select number of machines.

Problem Overview:

Our extension is designed to analyze all DNS traffic and block access to suspicious domains. Most of the time, this works seamlessly. However, I've observed repeated log messages on a few machines that hint at some network-related problems. Accompanying these logs, I've also noticed occasional spikes in CPU usage.

Logs:

Below are the logs captured from the Console that are repeated over and over again:

[C659 IPv6#3738d855.53 udp, attribution: developer] restart

[C659 IPv6#3738d855.53 waiting parent-flow (unsatisfied (No network route), dns)] event: path:restart @14383.841s, uuid: C7F27BD5-E86F-4076-A03E-1BD6A9C4C405

[C659 IPv6#3738d855.53 waiting parent-flow (unsatisfied (No network route), dns)] event: path:unsatisfied @14383.841s, uuid: C7F27BD5-E86F-4076-A03E-1BD6A9C4C405

[C659 IPv6#3738d855.53 in_progress parent-flow (unsatisfied (No network route), dns)] event: flow:start_child @14383.841s

nw_connection_report_state_with_handler_on_nw_queue [C659] reporting state preparing

[C659.157 IPv6#3738d855.53 initial path ((null))] event: path:start @14383.841s

[C659.157 IPv6#3738d855.53 waiting path (unsatisfied (No network route), dns)] event: path:unsatisfied @14383.842s, uuid: C7F27BD5-E86F-4076-A03E-1BD6A9C4C405

[C659.157 IPv6#3738d855.53 failed path (unsatisfied (No network route), dns)] event: null:null @14383.842s

[C659 IPv6#3738d855.53 waiting parent-flow (unsatisfied (No network route), dns)] event: flow:child_failed @14383.842s

nw_connection_report_state_with_handler_on_nw_queue [C659] reporting state waiting

[C659 IPv6#3738d855.53 waiting parent-flow (unsatisfied (No network route), dns)] event: path:restart @14383.842s, uuid: C7F27BD5-E86F-4076-A03E-1BD6A9C4C405

[C659 IPv6#3738d855.53 udp, attribution: developer] restart

Details & Observations:

  1. The high CPU usage appears randomly, and I haven't discerned any specific pattern.
  2. When CPU usage increases, the only way to bring it back down to 0-0.3 is to restart the computer or restart the extension.
  3. A considerable number of the above logs are generated over and over again in the Console in a very short amount of time (about 500k logs in 15 seconds).
  4. Of the 600 machines using this extension, only 4 are exhibiting this issue.
  5. I've thoroughly checked the network configuration of the problematic machines, and I haven't found any disparities when compared to the ones working seamlessly.
  6. In cases where our extension can't determine if access to a specific domain should be blocked, we forward a request to our backend for verification. The code snippet used for this is: URLRequest(url: requestUrl) ... urlSession.dataTask(with: request)...resume(). Given that this method works perfectly on other machines, I'm inclined to believe this isn't the root issue.
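For context, the forwarding step described in point 6 looks roughly like the sketch below. The URL, function names, and fail-open policy are placeholders, not our actual code; the one detail worth highlighting is the explicit timeout, since on a machine whose route is unsatisfied an unbounded request would otherwise wait on a path that never becomes viable.

```swift
import Foundation

/// Pure decision step, separated out so the policy is easy to test:
/// treat only an explicit "block" answer from the backend as a block.
func shouldBlock(responseBody: Data?, error: Error?) -> Bool {
    guard error == nil, let body = responseBody,
          let text = String(data: body, encoding: .utf8) else {
        return false // fail open on transport errors; real policy may differ
    }
    return text.trimmingCharacters(in: .whitespacesAndNewlines) == "block"
}

/// Hypothetical wrapper around the URLRequest/dataTask call mentioned above.
func verifyDomain(_ domain: String,
                  session: URLSession = .shared,
                  completion: @escaping (Bool) -> Void) {
    guard let url = URL(string: "https://backend.example.com/verify?domain=\(domain)") else {
        completion(false)
        return
    }
    var request = URLRequest(url: url)
    request.timeoutInterval = 5 // bound the request instead of letting it wait indefinitely
    session.dataTask(with: request) { data, _, error in
        completion(shouldBlock(responseBody: data, error: error))
    }.resume()
}
```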

I'm reaching out to understand if anyone has encountered a similar problem or if there are any insights into what might be causing this. Any guidance or suggestions would be greatly appreciated. It would also be helpful if anyone could help me break down the anatomy of the logs.

Here are the main classes that handle the entire UDP flow; maybe you can spot an issue in the code.

If I had to guess, a single flow fails and then automatically retries over and over again, but I don't know how to add a counter for the number of retries.

Thank you in advance for your assistance!

PS: A ScreenShot from the console to see the times between logs: https://ibb.co/pvjVD89

A few things: I'm not exactly sure what's happening here, because issues like this are very hard to track down without isolating them further, but here are a few ideas on how to do that:

  1. It may be easier for you to debug this if you remove the concurrent queue from your provider. This should make your life easier and provide a way to see the flows passing through your provider one at a time.
  2. Are you actually returning true or false from your provider's handleNewFlow function?
  3. It does sound like you are correct that you may be hitting a case where the networking on a few of your flows loses connectivity and goes into an endless loop. I would put a ceiling on the number of times a flow is retried, if possible. For example, if it fails 2 times, mark that flow and give up on it. This should at least allow you to debug the failed flows and solve your logging / CPU issue.
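The ceiling in point 3 could be sketched like this. `FlowRetryTracker`, the string key, and the default of 2 retries are all hypothetical, not part of NetworkExtension; a flow might be keyed by a description of its remote endpoint and source app.

```swift
import Foundation

/// Tracks restart attempts per flow and enforces a ceiling so a flow
/// that keeps failing is eventually given up on instead of looping.
final class FlowRetryTracker {
    private let maxRetries: Int
    private var attempts: [String: Int] = [:]
    private let lock = NSLock() // callers may come from different queues

    init(maxRetries: Int = 2) {
        self.maxRetries = maxRetries
    }

    /// Returns true if the flow identified by `flowKey` may be retried;
    /// false once the ceiling is reached, at which point the caller
    /// should close the flow rather than reconnect.
    func shouldRetry(flowKey: String) -> Bool {
        lock.lock()
        defer { lock.unlock() }
        let count = attempts[flowKey, default: 0]
        guard count < maxRetries else { return false }
        attempts[flowKey] = count + 1
        return true
    }

    /// Call on successful delivery so a healthy flow's counter resets.
    func reset(flowKey: String) {
        lock.lock()
        defer { lock.unlock() }
        attempts[flowKey] = nil
    }
}
```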

Hello meaton,

Thank you for your response. Indeed, that was the first thing I did: I removed all queues, but nothing changed.

For the second question, yes, I am returning true when I choose to handle the flow myself. In that case, if the flow is UDP I will handle it; if not, I will return false. But I don't believe there is a problem here, because if I return false, the flow is terminated and should not generate any issue. Also, looking at the logs created by HMSharedData.Logs.error, I do not have any TCP flows.

override func handleNewFlow(_ flow: NEAppProxyFlow) -> Bool {
    if let udpFlow = flow as? NEAppProxyUDPFlow {
        MainUDPFlowManager.shared.handleNewFlow(flow: udpFlow)
        return true
    }

    /// NEAppProxyTCPFlow is less likely to be here
    if flow is NEAppProxyTCPFlow {
        HMSharedData.Logs.error("We have a TCP flow!!!!! ^^^^^^^^$$$$$ Something is wrong here. We will drop the flow")
    }

    // Return true if the proxy implementation has retained the flow and intends to handle
    // the flow data; false if it has not retained the flow and will not handle the flow
    // data. In that case the flow is terminated.
    return false
}

For the 3rd point, I'm not really sure how I can accomplish this. Maybe keep a copy of each flow in, let's say, a fixed-size array/set of 100 entries, and if the same flow hits again within a 30-second interval, drop it? But how do I decide whether a flow was initiated as a result of a failed one?
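Something like this is what I have in mind: a rough sketch, where the names, the 100-entry capacity, and the 30-second window are just placeholders. Keying a flow by its remote endpoint (and possibly the source app) is a heuristic for "this new flow is a retry of a failed one"; I can't know that for certain.

```swift
import Foundation

/// Remembers recently failed flow keys for a time window; a new flow
/// whose key reappears within the window is very likely a retry of the
/// failed one and can be dropped (return false from handleNewFlow so
/// the system terminates it).
final class RecentFailureCache {
    private let window: TimeInterval
    private let capacity: Int
    private var failures: [String: Date] = [:]

    init(window: TimeInterval = 30, capacity: Int = 100) {
        self.window = window
        self.capacity = capacity
    }

    /// Record a failed flow. When full, evict the oldest entry so the
    /// cache keeps a fixed size.
    func recordFailure(flowKey: String, at now: Date = Date()) {
        if failures.count >= capacity, failures[flowKey] == nil,
           let oldest = failures.min(by: { $0.value < $1.value })?.key {
            failures[oldest] = nil
        }
        failures[flowKey] = now
    }

    /// True if the same key failed within the window and the new flow
    /// should be dropped.
    func shouldDrop(flowKey: String, at now: Date = Date()) -> Bool {
        guard let failedAt = failures[flowKey] else { return false }
        return now.timeIntervalSince(failedAt) < window
    }
}
```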

Thanks!

For the 3rd point, i'm not really sure how i can accomplish this.

Essentially, you need a way to manage these flows so that they have a given lifetime, are not allowed to retry endlessly, and time out after a set number of seconds. That way, any flows that may be causing your provider to hang unexpectedly have a bounded lifetime. However, that is just a mitigation strategy, and you may still need to solve an underlying issue.
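One way to sketch that lifetime management (names and the 30-second default are hypothetical, not NetworkExtension API): assign every tracked flow a deadline, then periodically sweep for expired flows and close them with an error.

```swift
import Foundation

/// Gives each tracked flow a fixed lifetime; a periodic sweep reports
/// which flows have expired so the provider can close them instead of
/// letting them retry forever.
final class FlowLifetimeManager {
    private let lifetime: TimeInterval
    private var deadlines: [String: Date] = [:]

    init(lifetime: TimeInterval = 30) {
        self.lifetime = lifetime
    }

    /// Start the clock on a flow when handleNewFlow accepts it.
    func track(flowKey: String, now: Date = Date()) {
        deadlines[flowKey] = now.addingTimeInterval(lifetime)
    }

    /// Call when a flow finishes normally so it is no longer swept.
    func finish(flowKey: String) {
        deadlines[flowKey] = nil
    }

    /// Returns the keys of flows past their deadline and forgets them;
    /// the caller should close the corresponding flows with an error.
    func expiredFlows(now: Date = Date()) -> [String] {
        let expired = deadlines.filter { now >= $0.value }.map { $0.key }
        for key in expired { deadlines[key] = nil }
        return expired
    }
}
```

Calling `expiredFlows()` from a timer on the provider's queue keeps the sweep serialized with flow handling.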

Hello meaton,

Thank you for your continued advice - it's proving invaluable. I've taken note of your suggestion to manage the lifetime of these flows and prevent endless retries. It sounds like a sensible mitigation strategy, and I am prepared to give it a shot to see how it performs.

However, while I'm confident this approach will at least help tackle the symptoms, I am eager to pinpoint the root cause so I can properly address this. I would truly appreciate any further insights or ideas you might have that could bring me closer to the main culprit of these problems.

Thanks once again for your assistance!
