Periodic, seemingly global APNS disruptions

Hello, I'm from the Microsoft team that maintains the push notification API behind the Teams platform.

We are seeing short, unexplained error spikes towards APNS that mostly correlate worldwide. We checked our networking and push request code but could not find a cause. The errors are all timeouts or connection resets (by the remote host, i.e. the APNS servers), and they seem to come and go randomly.

Would it be possible to check for outages or other metrics on your side, or to investigate why this is happening? Since the spikes are worldwide, it seems unlikely that something is broken on our side. We use the standard APNS HTTP/2 endpoint with up-to-date support for all RFC features, so everything should work normally.

Mind you, our API might be in a unique position because of its notification volume (billions per day).

Answered by Engineer in 831176022

We can check our logs here without the apns-id, but we will need some information you may not want to share publicly.

If you would like to discuss your specific use case in detail, and for us to share some additional information, please open a support request at https://developer.apple.com/contact/request/code-level-support/ and reference this forum thread in the "Did someone from Apple ask you to submit ..." section.

If you can share the following in your message, it would be great:

  • the apns-topic you are sending to
  • some IP addresses on your end that are seeing degradation
  • specific time ranges in the last few days

We don't see similar spikes on our service in general. So, this is likely related to your application.

The total of billions per day does not really matter, although what you send per second might. Do you know your per-second and per-minute burst rates?

Usually, issues like this stem from network capacity and reliability problems between your servers and APNs. As you state that the issue is global, perhaps we can look at what's going on with the servers.

Your burst rates could perhaps be overtaxing your host setup. It is also important to find out where the time is being spent for requests that end in timeouts. Is the outgoing connection and sending taking too long, or is the response not arriving in time?
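One way to answer the "where is the time spent" question is to record a timestamp at each phase boundary of a request and check which phase never completed when a timeout fires. The following is a minimal sketch, not tied to any particular HTTP/2 client; the `RequestTimeline` class and the phase names are hypothetical and would need to be wired into your own send path.

```python
import time
from dataclasses import dataclass, field

PHASES = ["start", "connected", "request_sent", "response_received"]

@dataclass
class RequestTimeline:
    """Monotonic timestamps for the phases of one push request (hypothetical helper)."""
    marks: dict = field(default_factory=dict)

    def mark(self, phase, now=None):
        # Record when a phase completed; `now` can be injected for testing.
        self.marks[phase] = time.monotonic() if now is None else now

    def stalled_phase(self):
        """Name the first phase that never completed, or None if all did."""
        for phase in PHASES[1:]:
            if phase not in self.marks:
                return phase
        return None

    def durations(self):
        """Seconds spent between consecutive recorded phases."""
        seen = [p for p in PHASES if p in self.marks]
        return {b: self.marks[b] - self.marks[a] for a, b in zip(seen, seen[1:])}
```

A timeout whose stalled phase is `response_received` means the request was fully sent and no response arrived; a stalled `connected` phase points at connection setup instead.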

If you can provide apns-ids of a few such requests which happened in the past few days, we can check if a specific issue is visible on our end.

As for connection resets, they may be due to network interruptions, but another cause would be that you are getting too many errors back to back and the connection is being reset. The typical cause we see is wrong tokens (or development-environment tokens used in production) that somehow made it into the production token database. These requests return token errors, and if the errors are ignored and more erroneous requests are made, APNs will drop the connection.
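Acting on token errors, rather than ignoring them, might look like the sketch below. The reason strings are the ones APNs documents for rejected tokens as far as I know; please verify them against the current APNs response documentation. The `(token, status, reason)` tuple shape and both function names are assumptions made for illustration.

```python
# Reasons that indicate the token itself is the problem (verify against APNs docs).
PERMANENT_TOKEN_REASONS = {"BadDeviceToken", "Unregistered", "DeviceTokenNotForTopic"}

def should_drop_token(status, reason):
    """True when the response indicates the token should not be sent to again."""
    return status in (400, 410) and reason in PERMANENT_TOKEN_REASONS

def prune_tokens(tokens, results):
    """Return tokens with the rejected ones removed.

    results: iterable of (token, status, reason) tuples, a hypothetical shape
    for whatever your sender records per response.
    """
    rejected = {tok for tok, status, reason in results if should_drop_token(status, reason)}
    return [t for t in tokens if t not in rejected]
```

Removing rejected tokens promptly keeps a run of bad tokens from producing the back-to-back errors described above.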

For those as well, if you have examples of push requests made just before the connections are reset, we can take a look.

It would be best to provide apns-ids for requests made in the last 2-3 days.

Hi,

Do you know your per second and minute burst rates?

At peak hours, across all our deployments, I see it going as high as 11.3 million requests per minute towards api.push.apple.com.
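For scale, 11.3 million requests per minute averages out to roughly 188,000 requests per second, though the peak second inside that minute could be higher. A minimal sketch of measuring both from send timestamps follows; `burst_rates` and its input (a list of epoch timestamps in seconds taken from your own request log) are hypothetical.

```python
from collections import Counter

def burst_rates(send_times):
    """Return (peak requests per second, peak requests per minute).

    send_times: iterable of epoch timestamps in seconds (hypothetical input).
    """
    send_times = list(send_times)
    if not send_times:
        return 0, 0
    per_second = Counter(int(t) for t in send_times)
    per_minute = Counter(int(t) // 60 for t in send_times)
    return max(per_second.values()), max(per_minute.values())

# Average implied by the figure quoted above: 11.3 million per minute.
average_per_second = 11_300_000 / 60
```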

Your burst rates could be overtaxing your host setup perhaps.

Well, the issue is happening in vastly different parts of the world, with independent networking and hardware. Some areas are off-peak and some are at peak hours. Despite this, the issue seems extremely coordinated every time (down to a specific minute).
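The "coordinated down to the minute" observation can be checked mechanically by bucketing errors per minute and counting how many independent regions appear in each bucket. This is a minimal sketch; the function name, the `(region, epoch_seconds)` log shape, and the `min_regions` threshold are all assumptions.

```python
from collections import defaultdict

def coordinated_minutes(errors, min_regions=3):
    """Return sorted minute buckets in which at least min_regions regions saw errors.

    errors: iterable of (region, epoch_seconds) pairs (hypothetical log shape).
    """
    regions_by_minute = defaultdict(set)
    for region, ts in errors:
        regions_by_minute[int(ts) // 60].add(region)
    return sorted(m for m, regions in regions_by_minute.items() if len(regions) >= min_regions)
```

Each returned value is a minute index since the epoch; multiplying by 60 gives the start of a time range to report.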

For example, a large incident occurred on 14 March around 10:21 UTC.

Is it the outgoing connection and sending taking too long, or the response is not arriving in time.

If you can provide apns-ids of a few such requests which happened in the past few days, we can check if a specific issue is visible on our end.

I'll do some digging to be able to answer these, though it might take some time. Given the nature of the errors we are seeing, I expect the timeouts occur because no response is received, i.e. the request headers and body are uploaded but no response frames ever arrive. We don't log the apns-id, but maybe it's time to add that to the code, as it could be useful for other investigations in the future as well.
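Adding apns-id logging could be as small as generating the identifier on the client and logging it next to the request. APNs accepts an `apns-id` request header containing a canonical UUID and, if it is omitted, generates one and returns it in the response. The sketch below only builds and logs the headers; `build_headers` and the logger name are hypothetical, and the actual send call depends on your HTTP/2 client.

```python
import logging
import uuid

log = logging.getLogger("push")

def build_headers(topic):
    """Build request headers with a client-generated apns-id and log it."""
    apns_id = str(uuid.uuid4())  # canonical lowercase 8-4-4-4-12 form
    headers = {"apns-topic": topic, "apns-id": apns_id}
    log.info("sending push apns-id=%s topic=%s", apns_id, topic)
    return headers
```

Generating the id client-side means it is available in the logs even when no response ever arrives, which is exactly the timeout case described here.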
