So, let me start by providing a bit of background context on why CallKit is relevant and helpful here:
And I don't talk about audio recording/playing by app, I'm talking only about sound playing by PTT Framework to indicate PTT start. That's why comparing with CallKit has no meaning, since CallKit doesn't play such sound.
Architecturally, the PTT framework is an extension of CallKit, not an independent API.  That is, the PTT session your app manages is actually a modified variant of the same "calls" callservicesd manages for CallKit.  This is particularly true for audio handling, as the PTT framework basically doesn't implement any of its "own" in-process audio handling, relying entirely on CallKit's implementation.
Now, that leads to here:
and as mentioned in comment, if remove try inputNode.setVoiceProcessingEnabled(false) PTT start sound become distorted.
Throwing out some educated guesses, are you:
I ask because those two factors can cause exactly the issue you mentioned here:
Also, the way how PTT start sound is distorted, sounds for me very similar to what I hear if I start some audio playback with .playback category and do switch to .playAndRecord category while audio is still playing...
...for exactly the same reason.  The issue with Bluetooth in particular is that Bluetooth has two different specifications for dealing with audio:
- Advanced Audio Distribution Profile (A2DP) -> This is playback only and is what speakers use when playing audio.
- Hands-Free Profile (HFP) -> This is bidirectional, allowing playback and recording.
The distinction here matters because A2DP has significantly higher fidelity than HFP and simply "sounds" better*.
*Note that this is for entirely practical reasons.  HFP has roughly "half" the playback bandwidth of A2DP, since it has to divide its bandwidth between playback and recording.  In addition, HFP uses different (and less efficient) audio compression codecs because its latency is MUCH lower (~30ms vs ~100ms+).
In any case, the net result is that the switch from A2DP to HFP caused by something like:
start some audio playback with .playback category and do switch to .playAndRecord category while audio is still playing
...will unavoidably cause a quite noticeable* (and unpleasant) decline in audio quality.  That same behavior will occur when enabling voice processing for the same reason: voice processing enables input and output, which is effectively the same as playAndRecord.
*The change will occur on wired I/O as well, but the effect is most significant on Bluetooth.
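If you want to confirm that this A2DP -> HFP switch is what you're hearing, one option is to log the output port type whenever the route changes. This is a minimal sketch, not something from your project: `BluetoothProfile`, `bluetoothProfile(for:)`, `logCurrentOutputProfiles()` and `observeRouteChanges()` are hypothetical names I've made up for illustration, and it only uses the standard `AVAudioSession` route APIs.

```swift
import AVFoundation

/// Hypothetical helper type: which Bluetooth audio profile an output port uses.
enum BluetoothProfile {
    case a2dp   // playback only, higher fidelity
    case hfp    // bidirectional, lower fidelity
    case other  // wired, built-in, etc.
}

/// Maps an audio session port type to the profile it implies.
func bluetoothProfile(for port: AVAudioSession.Port) -> BluetoothProfile {
    switch port {
    case .bluetoothA2DP: return .a2dp
    case .bluetoothHFP:  return .hfp
    default:             return .other
    }
}

/// Prints the profile of every output in the current route.
func logCurrentOutputProfiles() {
    let outputs = AVAudioSession.sharedInstance().currentRoute.outputs
    for output in outputs {
        print("\(output.portName): \(bluetoothProfile(for: output.portType))")
    }
}

/// Logs the route every time it changes. Keep the returned token alive
/// for as long as you want the logging to continue.
func observeRouteChanges() -> NSObjectProtocol {
    NotificationCenter.default.addObserver(
        forName: AVAudioSession.routeChangeNotification,
        object: nil,
        queue: .main
    ) { _ in
        logCurrentOutputProfiles()
    }
}
```

If the log flips from `a2dp` to `hfp` right as the PTT start sound plays, that would be consistent with the explanation above.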
With all that context, let me go back to here:
Is there any way to workaround it?
So, the basic answer here is that you need to either ensure that the transition only occurs when no playback is occurring (which isn't really possible) or ensure that the session "stays" in playAndRecord.
Being more specific, I have two answers:
- If you're using PTTransmissionMode.halfDuplex, stop and switch to PTTransmissionMode.fullDuplex.  In hindsight, we probably shouldn't have bothered implementing halfDuplex, as it basically forces exactly this kind of awkward transition.  More to the point, my experience has been that products which truly are halfDuplex often end up using fullDuplex anyway because the mechanics of our halfDuplex implementation don't actually match up with their implementation.  PTTransmissionMode.fullDuplex ends up just working "better", as they can use the additional flexibility to implement exactly the behavior they want*.
- If you're using PTTransmissionMode.fullDuplex, then (I think) leaving voiceChat enabled at all times will avoid the problem.  If it doesn't, then you should take a closer look at exactly what and where your app does audio configuration.  My guess is that you're doing "something" which is pushing you back to playback only, setting up the distortion when you start recording again.
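To make the two options above a bit more concrete, here is a rough sketch under some assumptions: that you already have a `PTChannelManager` and a joined channel UUID, and that your app configures the session category itself. `keepSessionInPlayAndRecord(options:)` and `useFullDuplex(on:channelUUID:)` are hypothetical helper names, and the `options` you pass should be whatever your app already uses.

```swift
import AVFoundation
import PushToTalk

/// Hypothetical helper: configure the session once as playAndRecord/voiceChat
/// and leave it there, so there's no playback-only -> playAndRecord transition
/// while audio is playing. This only sets the category; it does not activate
/// the session.
func keepSessionInPlayAndRecord(
    options: AVAudioSession.CategoryOptions = []
) throws {
    let session = AVAudioSession.sharedInstance()
    // Skip the call if the session is already configured this way.
    guard session.category != .playAndRecord || session.mode != .voiceChat else {
        return
    }
    try session.setCategory(.playAndRecord, mode: .voiceChat, options: options)
}

/// Hypothetical helper: switch an already-joined channel to full duplex.
func useFullDuplex(on channelManager: PTChannelManager, channelUUID: UUID) {
    channelManager.setTransmissionMode(.fullDuplex, channelUUID: channelUUID) { error in
        if let error {
            print("Failed to set transmission mode: \(error)")
        }
    }
}
```

The point of the guard is simply to avoid reconfiguring the session more often than necessary; anything else in the app that sets the category back to playback only would still reintroduce the transition.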
*Keep in mind that the system enabling transmission/recording does NOT mean your app actually has to "do" anything with the audio it receives from the system.  For example, you can implement floor "claiming" mechanics by requesting the floor when you receive the transmission request, then playing a sound (so the user knows it's happened) and starting to actually record and send audio once they have the floor.
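The floor "claiming" flow described above can be sketched as a small state machine. Everything here is hypothetical and independent of the PTT framework: `FloorState`, `FloorEvent`, `FloorAction` and `FloorController` are illustrative names, and how you actually send the floor request or play the tone is up to your app and server.

```swift
/// Where the local user stands with respect to the "floor".
enum FloorState {
    case idle        // not transmitting
    case requesting  // system started transmission, waiting on the server
    case granted     // server granted the floor, audio is being sent
}

/// Things that can happen, from the system or from your server.
enum FloorEvent {
    case transmissionBegan   // the system told us transmission started
    case floorGranted        // server says we have the floor
    case floorDenied         // server says someone else has it
    case transmissionEnded   // the system told us transmission stopped
}

/// What the app should do in response.
enum FloorAction {
    case none
    case sendFloorRequest
    case playGrantedToneAndStartSending
    case playDeniedTone
    case stopSending
}

struct FloorController {
    private(set) var state: FloorState = .idle

    /// Advances the state machine and returns the action the app should take.
    mutating func handle(_ event: FloorEvent) -> FloorAction {
        switch (state, event) {
        case (.idle, .transmissionBegan):
            state = .requesting
            return .sendFloorRequest
        case (.requesting, .floorGranted):
            state = .granted
            return .playGrantedToneAndStartSending
        case (.requesting, .floorDenied):
            state = .idle
            return .playDeniedTone
        case (.requesting, .transmissionEnded):
            // User let go before the server answered; nothing was sent.
            state = .idle
            return .none
        case (.granted, .transmissionEnded):
            state = .idle
            return .stopSending
        default:
            // Ignore events that don't apply in the current state.
            return .none
        }
    }
}
```

Audio captured while the controller is in `requesting` is simply dropped; the app only starts sending once it reaches `granted`.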
__
Kevin Elliott
DTS Engineer, CoreOS/Hardware