VAD and CNG
VAD and CNG overview
VAD stands for Voice Activity Detection. Its role is to distinguish between a voice and anything else, including silence. In VoIP applications it may be used as a tool to minimise the number of audio packets transmitted. If nobody is speaking it is possible to either stop the flow of audio packets, or at least change to a much lower rate of comfort noise packets. In a typical phone conversation there are short periods when both parties are talking, but most of the time only one party is talking. With VAD, transmission in each direction can be greatly reduced, or even halted, for nearly 50% of the call. VAD is usually a function within the endpoints of a VoIP path.
There are two things to note here, which often confuse people. First, VAD is not the same as silence detection. Loud music is certainly not silence, but it is also not voice, and a good VAD will declare "no voice present". Second, the use of VAD to minimise packet flow is often described as a bandwidth reduction measure. This is only the case for network links carrying large numbers of concurrent calls, where the likelihood of everyone talking at once is low. For most customer premises applications the bandwidth required of the network is still the peak, reached when all conversations are declared to be voice and packets are being transmitted at the normal rate of their voice codecs. What VAD allows in these cases is a reduction in the average data rate, which frees capacity for traffic that is not hard real-time.
CN stands for Comfort Noise. This is simulated background noise, synthesised at the receiving end of a VoIP path. This function is called comfort noise generation (CNG). In a crude form it may be a simple simulation of general room "mush" (e.g. Gaussian noise with a Hoth spectral weighting). In a more sophisticated form, packets received from the sender may contain noise modelling parameters. These may be used to produce noise which closely matches the amplitude and spectral qualities of the noise currently being picked up in the sender's environment.
CN also refers to the CN RTP packets specified by RFC 3389. CN packets are sent when the VAD function declares there is no voice present. A CN packet can convey the noise modelling parameters described above, but frequently this information is missing. Ideally CN packets should be sent as the noise in the sender's environment changes, so the CNG function at the receiver can update the noise effectively, and avoid abrupt changes in the noise when the voice signal resumes. More typically, just a single CN packet is sent as the flow of voice packets ceases.
VAD in FreeSWITCH
VAD can be set in endpoint profiles and takes one of four values:
- in - turn on VAD for incoming media,
- out - turn on VAD for outgoing media,
- both - turn on VAD for both incoming and outgoing media,
- none - VAD is completely turned off.
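As an illustration, the setting might look like this in a sofia profile. This is a sketch only: it assumes the profile parameter is named vad, as in the stock sofia profiles, and that it goes inside the profile's settings block.

```xml
<!-- Inside the <settings> block of a sofia profile (assumed location). -->
<!-- The value is one of: in, out, both, none. -->
<param name="vad" value="both"/>
```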
When FreeSWITCH does not detect speech, it stops transmitting RTP. FreeSWITCH also supports per-call VAD handling with the following channel variables:
CNG in FreeSWITCH
In FreeSWITCH the CNG options select whether or not FreeSWITCH will generate CN RTP packets. The suppress-cng sofia profile option and the suppress_cng channel variable control this setting. When both sides support RFC 3389 (they agree on it in the SDP exchange, rtpmap:13), FreeSWITCH will send CN packets.
If one of the parties in a bridge does not handle VAD and asynchronous RTP media, problems can arise: a listener who hears perfect silence may think the connection has been dropped. Another example is when one side is Asterisk.
To handle these endpoints, set the bridge_generate_comfort_noise channel variable, which makes FreeSWITCH generate fake audio.
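The two ways of controlling CN generation named above could be written as follows. This is a hedged sketch: the value true is assumed to be the accepted boolean form, and the two fragments belong in different files (the profile configuration and the dialplan).

```xml
<!-- Sofia profile setting: do not generate CN packets for calls on this profile. -->
<param name="suppress-cng" value="true"/>

<!-- Dialplan action: make the same choice for a single call. -->
<action application="set" data="suppress_cng=true"/>
```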
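A minimal dialplan sketch of using bridge_generate_comfort_noise, assuming the variable accepts true and is set on the channel before the bridge starts. The gateway name example_gw and the dialled number are hypothetical placeholders.

```xml
<!-- Ask FreeSWITCH to generate comfort noise toward an endpoint that
     cannot cope with gaps in the RTP stream. -->
<action application="set" data="bridge_generate_comfort_noise=true"/>
<!-- example_gw and 1000 are placeholders, not values from this document. -->
<action application="bridge" data="sofia/gateway/example_gw/1000"/>
```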
Useful channel variables
- suppress_cng (This can be used to stop a remote party, typically a handset, from using silence suppression. Poorly implemented silence suppression can result in lost speech, and this is a way to fix that, at the cost of higher bandwidth use.)
silence file type
To assign silence as a source of music on hold or ringback use this syntax:
The higher the level, the lower the volume. Default value is approximately 400. Set the value in the appropriate channel variable:
<action application="set" data="hold_music=silence"/>
<action application="set" data="ringback=silence"/>
<action application="set" data="transfer_ringback=silence"/>