Appendix - Latency

Latency means "delay". It is the technical term for the single most irritating problem in using computers for music, and at the same time it is a fundamental property of the way digital audio works. Wikipedia has a general article on latency.

One practical definition of latency in digital audio is the time between a sound being made and that sound being heard. This delay should be as close to zero as possible if making music online is to work. Imagine that you play a piano, but the sound is heard 5 seconds after your fingers touch the keys. You can no longer synchronise your playing with the sound that you hear, and it becomes effectively impossible to play the instrument. Imagine also that you play in a duet and that you hear the second instrument 5 seconds after the other person makes a sound. You will adjust your playing to match their timing, but then the other player will adjust to match your extra delay, then you will adjust even more ... and it again becomes impossible to play. You've probably noticed this effect if you call a cellphone and get an echo - it's very difficult to keep on talking if you keep hearing yourself delayed.

Now 5 seconds is much more than you can expect in reality, but the human ear is such a wonderful mechanism that the delays which go undetected turn out to be surprisingly small. Delays of more than 100ms (ms is an abbreviation for millisecond, which is 1/1000 of a second - so 100ms is 1/10 of a second) can be clearly audible, and latencies under 50ms are the design goal for this sort of software. Even then, if you own an electronic keyboard that produces sound 50ms after you press a key, the delay is clearly audible for most people, and if you search on digital audio forums you will see that many people claim to be able to hear far smaller delays.

Luckily there is a difference between the delay between pressing a key and hearing a sound, and the delay between two people playing together. If you are 10m away from your partner, there is already a delay of roughly 30ms between the other player producing a sound and you hearing it, simply because this is the time a sound wave needs to travel that distance. People can quite happily play together and compensate for this sort of delay in practice.
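The acoustic delay mentioned above can be estimated with a one-line calculation. The sketch below assumes a speed of sound of 343 m/s (dry air at about 20 °C); the constant and the function name are illustrative, and the exact figure shifts slightly with temperature.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumption: dry air at about 20 degrees C


def acoustic_delay_ms(distance_m: float) -> float:
    """Time in milliseconds for sound to travel distance_m through air."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0


if __name__ == "__main__":
    for distance in (1, 5, 10):
        print(f"{distance:>2} m -> {acoustic_delay_ms(distance):.1f} ms")
```

Under this assumption, two players 10m apart hear each other about 29ms late, which is the order of delay that musicians routinely cope with.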

The Solocontutti link software (Solocontutti app) is designed to keep latency to a minimum in the connection between musicians playing together over the internet, but it needs help from the user in order to function properly. The rest of this section discusses latency, its causes and what you can do to keep it to a minimum.

Sound Cards

Computer sound cards always introduce latency - they are designed that way. The reason is something called "buffering". When a sound card receives sound from a microphone or some other source, it needs to convert this into digital information. It does this by a process called sampling. Sampling in sound is similar to the way movies are made: by taking lots of discrete snapshots (samples), the illusion of continuous sound is created. Professional sound cards take 96,000 or more samples per second, but the base standard for quality recording products is 48,000 and this is the number that the Solocontutti app uses. (You may have come across 44,100 as a sample rate - this is quite common and is the standard for CD audio.) Sampling theory says that the maximum frequency captured is half the sampling rate, so 48,000 samples per second will reproduce frequencies up to 24 kHz, which is adequate for most purposes.

A sound card does not convert each individual sample, but collects a number of samples (a buffer or a frame) and converts these in one go. So a sound card that has a frame (or buffer) size of 240 samples will collect 240 samples before starting the conversion to digital data. This means that there is a delay of at least 240 sample lengths before processing starts, so the sound card generates this much latency. In time this is 240/48,000 = 0.005 seconds or 5ms. The following table shows the latency produced by sound card buffering at various frame sizes:

Frame Size (samples)    Buffer Latency (ms)
60                      1.25
120                     2.50
240                     5.00
480                     10.00
960                     20.00
1920                    40.00
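The figures in the table follow directly from dividing the frame size by the sample rate. A minimal sketch of that arithmetic, assuming the 48 kHz sample rate stated above (the function name is illustrative):

```python
SAMPLE_RATE_HZ = 48_000  # sample rate used in the text


def buffer_latency_ms(frame_size: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Time in milliseconds needed to collect one frame of samples."""
    return frame_size * 1000.0 / sample_rate_hz


if __name__ == "__main__":
    print("Frame size (samples)  Buffer latency (ms)")
    for frame_size in (60, 120, 240, 480, 960, 1920):
        print(f"{frame_size:>20}  {buffer_latency_ms(frame_size):>19.2f}")
```

At a different sample rate the same frame size gives a different latency; for example, 240 samples at 44,100 samples per second last about 5.44ms.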

Looking at this table it seems obvious that we should go for the 60 sample frame size and minimise latency, but there is a catch - the smaller the frame size, the more work the computer and network need to do, because every frame carries a fixed overhead however few samples it contains. Handling a 60 sample frame can be quite an ordeal for some computers, and older computers may even have difficulty with 240 samples. If the computer is overloaded it will be very audible: crackles and pops, gaps in the sound and distortion. In some cases reducing the frame size may, paradoxically, lead to extra latency. Finding the right frame size is an important step in setting up your computer.
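To get a feel for the per-frame overhead, the sketch below counts how many frames per second each frame size produces and how much network capacity the packet headers alone would consume. It assumes, purely for illustration, that each frame is sent as one UDP packet over IPv4 (28 header bytes: 20 for IPv4 without options plus 8 for UDP); it ignores any application header and link-layer framing, and it is not a description of the Solocontutti packet format.

```python
SAMPLE_RATE_HZ = 48_000
UDP_IPV4_HEADER_BYTES = 28  # 20-byte IPv4 header (no options) + 8-byte UDP header


def frames_per_second(frame_size: int) -> float:
    """Number of frames the computer must handle every second."""
    return SAMPLE_RATE_HZ / frame_size


def header_overhead_kbit_per_s(frame_size: int) -> float:
    """Bandwidth spent on headers alone, assuming one packet per frame."""
    return frames_per_second(frame_size) * UDP_IPV4_HEADER_BYTES * 8 / 1000.0


if __name__ == "__main__":
    for frame_size in (60, 120, 240, 480):
        print(
            f"{frame_size:>4} samples: {frames_per_second(frame_size):>5.0f} frames/s, "
            f"{header_overhead_kbit_per_s(frame_size):>6.1f} kbit/s of headers"
        )
```

Halving the frame size doubles the number of frames to process and, under these assumptions, doubles the header traffic as well.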

The next step in latency is the time taken for the sound card to process a frame. This needs to be shorter than the duration of a frame, otherwise the card can't keep up with itself, but it can still be quite long. For example, a 256 sample frame could easily have an additional 4ms added by processing time in the sound card. Not only that, but sometimes the sound card and driver use extra buffering, which increases latency. In general, the better the card the lower the latency, so built-in sound chips tend to introduce high latencies while quality or professional sound cards introduce low latencies on the order of 1ms.

Once the sound card has processed the data it needs to get it into the rest of the computer, and this is done by the driver. Older Windows drivers and the multimedia system behind them were notorious for introducing enormous latencies (hundreds of milliseconds) into the stream. Because of this, a standard for drivers in low latency music applications was developed by Steinberg and has become a universally adopted standard in the music world. The standard is called ASIO and is supported by all manufacturers of professional quality sound cards and by quite a few consumer grade ones as well.

If you are using Windows, sound devices with ASIO drivers tend to have extremely low latencies in both sound card and driver. The Solocontutti app will work with ASIO and WDM drivers, but in most cases ASIO will give the best results. If you haven't got an ASIO device, there is an alternative called ASIO4ALL (explained here) which allows non-ASIO sound cards to work with ASIO programs, but it will not overcome the intrinsic latency of the hardware. Newer versions of Microsoft Windows have greatly improved and offer lower latencies, but the ASIO standard has stuck and is still widely used. macOS and iOS handle low latency audio well, but Android has only recently added capabilities for low latency audio and many devices will not be suitable. On the Mac you can often get the best results with external audio interfaces, and these will often also work with iOS and Android.

Once the data is out of the sound chip and into the processor, the Solocontutti app takes over. The Solocontutti app converts and compresses the data, and sends it out on the network. At the same time the Solocontutti app receives the incoming streams, decompresses them, mixes them and sends them to the sound card. On a reasonably modern computer this can be done in a few milliseconds, and will work well with frame sizes of 240 samples or smaller. Once the data is put onto the network, the next step begins.

Network Latency

Sending information from your computer to another computer across the network adds a further delay. Surprisingly, this has relatively little to do with the time the signal itself needs to travel: over optical fibre a signal covers a thousand kilometres in roughly 5ms, about the time sound needs to travel a couple of metres from your instrument to a microphone. Nor does it depend on how fast your network link is - if your link is fast enough to comfortably handle the amount of data needed, then a faster link will not reduce latency. You can compare this to a water pipe. If you have a thin pipe through which the water rushes very fast, it may not deliver much water, but a small ball put in at one end will reach the other end very quickly. If, on the other hand, you have a really big pipe with the water moving very slowly, it may well deliver much more water, but the same ball takes much longer to travel the same distance. What we are used to thinking of as "network speed" is in fact "network bandwidth", which is akin to the width of the pipe and not the speed of the water.
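The point about bandwidth can be made concrete by looking at how long it takes to push a single audio packet onto links of different speeds. The sketch below uses a hypothetical packet size of 200 bytes (an illustrative figure, not the actual Solocontutti packet size) and ignores every other source of delay.

```python
def serialization_delay_ms(packet_bytes: int, link_mbit_per_s: float) -> float:
    """Time in milliseconds to clock one packet onto a link of the given bandwidth."""
    bits = packet_bytes * 8
    return bits / (link_mbit_per_s * 1_000_000) * 1000.0


if __name__ == "__main__":
    PACKET_BYTES = 200  # hypothetical packet size, for illustration only
    for link in (10, 100, 1000):
        delay = serialization_delay_ms(PACKET_BYTES, link)
        print(f"{link:>4} Mbit/s link: {delay:.4f} ms per packet")
```

Under these assumptions, moving from a 10 Mbit/s link to a 100 Mbit/s link saves well under a millisecond per packet, which is negligible next to the other delays discussed in this section.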

The main cause of latency in the network is the fact that the data travels through several intermediate stops before arriving at its final destination. Each stop takes the data and does something with it before passing it on, so in general the more stops, the more latency. The number of stops (and therefore the latency) is loosely related to the distance, though not always - there is often a single stop between Europe and America because transatlantic cables provide a fast link. Generally speaking, for short distances (less than 3000km) a network latency of 20-30ms is not unreasonable, but this can vary quite dramatically between countries and between internet service providers.

The Solocontutti app avoids some sources of network latency by using a very rudimentary type of network connection called UDP. Web browsers use HTTP over TCP/IP, but this is much too cumbersome for the Solocontutti app, which avoids HTTP and similar protocols by using its own highly efficient protocol, and uses UDP instead of TCP, which enormously reduces overhead and waiting time. There is, however, a catch. TCP is widely used because (amongst other things) it guarantees delivery of your data and guarantees that information will arrive in the order in which it was sent. This is not the case with UDP, so the Solocontutti app has to compensate for data that is lost, arrives in the wrong order, or arrives too late to be fitted into the sound stream.
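One common way for a receiver to cope with lost, reordered and late UDP packets is to number each frame and hold a small reorder buffer. The sketch below illustrates that general technique only; the class name, the `depth` parameter and the use of `None` to mark a lost frame are all assumptions made for this example and do not describe the actual Solocontutti protocol.

```python
from typing import Dict, List, Optional


class ReorderBuffer:
    """Releases frames in sequence order and gives up on frames that never arrive."""

    def __init__(self, depth: int = 2) -> None:
        self.depth = depth  # how many later frames may wait while a gap is open
        self.next_seq = 0   # sequence number of the next frame due for playback
        self.pending: Dict[int, bytes] = {}

    def push(self, seq: int, payload: bytes) -> List[Optional[bytes]]:
        """Accept one packet and return the frames that are now ready to play.

        A None entry marks a frame treated as lost, which the codec would
        have to conceal.
        """
        if seq < self.next_seq:
            return []  # too late: this slot has already been played or skipped
        self.pending[seq] = payload
        ready: List[Optional[bytes]] = []
        while self.pending:
            if self.next_seq in self.pending:
                ready.append(self.pending.pop(self.next_seq))
            elif len(self.pending) > self.depth:
                ready.append(None)  # stop waiting for the missing frame
            else:
                break  # keep waiting for the gap to fill
            self.next_seq += 1
        return ready


if __name__ == "__main__":
    buf = ReorderBuffer(depth=2)
    for seq, data in [(0, b"a"), (2, b"c"), (1, b"b"), (5, b"f"), (6, b"g"), (7, b"h"), (3, b"d")]:
        print(seq, buf.push(seq, data))
```

The `depth` setting is where the trade-off shows up: a deeper buffer tolerates more reordering but holds frames back longer, which adds latency.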

The codec (sound compression and decompression software) which the Solocontutti app uses is quite good at compensating for missing data, but has its limits. Once the network quality degrades beyond a certain point you will start noticing occasional clicks and, in extreme cases, degradation in sound quality and dropouts. The only real way to compensate for this is to use less bandwidth by reducing the bit rate, or to send fewer, longer packets by increasing the frame size. These will respectively reduce the sound quality or increase the latency, so tuning these values for an optimal result is a very important part of setting up the Solocontutti app.

Arriving at the other end

Finally the information arrives at the other computer. More latency is introduced as the computer repeats the process in reverse: decoding the data, mixing the various sources, sending the result to the sound card and finally producing sound in your headphones. There is one difference between receiving and sending which affects latency, and that is frame size. If the sender, for example, has a frame size of 240 and the receiver has a frame size of 120, then the latency due to buffering is 240 + 120 = 360 samples (7.5ms) and not 120 + 120 = 240 samples (5ms) as you might expect from the local settings.
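The buffering example above amounts to adding the sender's and the receiver's frame sizes before converting to time. A small sketch of that calculation, assuming both ends run at 48,000 samples per second (the function name is illustrative):

```python
def combined_buffer_latency_ms(
    sender_frame: int, receiver_frame: int, sample_rate_hz: int = 48_000
) -> float:
    """Buffering latency in milliseconds contributed by both ends of a connection."""
    return (sender_frame + receiver_frame) * 1000.0 / sample_rate_hz


if __name__ == "__main__":
    print(combined_buffer_latency_ms(240, 120))  # sender 240, receiver 120
    print(combined_buffer_latency_ms(120, 120))  # what local settings alone suggest
```

This is why a low frame size on your own computer only helps fully when the other player can run a low frame size too.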

In summary

The above is quite complex, so let's illustrate it with an example. Let's say that your computer has a good external sound device and can handle frame sizes of 120 samples, with a hardware latency of 2ms. Your computer needs 1.5ms to process the information and send it off to the network. The network latency is 15ms. At the other end your friend has a standard sound card with ASIO4ALL and can handle frame sizes of 240 samples. Their computer needs 3ms to process the data, and the sound card and drivers have a latency of 6ms. So, adding it all up:

  • Buffer in sound card = 120/48,000 = 2.5ms [total 2.5ms]
  • Driver & card latency = 2ms [total 4.5ms]
  • Processing on PC = 1.5ms [total 6ms]
  • Internet latency = 15ms [total 21ms]
  • Processing on other PC = 3ms [total 24ms]
  • Driver & card latency = 6ms [total 30ms]
  • Buffer in sound card = 240/48,000 = 5ms [total 35ms]
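The running total in the list above can be reproduced with a few lines of code, which also makes it easy to try out your own figures. The stage values are the ones from the example; the names `STAGES` and `running_totals` are illustrative.

```python
from itertools import accumulate
from typing import List, Tuple

SAMPLE_RATE_HZ = 48_000

# (stage, latency in ms), in the order the sound passes through them
STAGES: List[Tuple[str, float]] = [
    ("Buffer in sound card (120 samples)", 120 * 1000.0 / SAMPLE_RATE_HZ),
    ("Driver & card latency", 2.0),
    ("Processing on PC", 1.5),
    ("Internet latency", 15.0),
    ("Processing on other PC", 3.0),
    ("Driver & card latency (other PC)", 6.0),
    ("Buffer in sound card (240 samples)", 240 * 1000.0 / SAMPLE_RATE_HZ),
]


def running_totals(stages: List[Tuple[str, float]]) -> List[float]:
    """Cumulative latency in milliseconds after each stage."""
    return list(accumulate(latency for _, latency in stages))


if __name__ == "__main__":
    for (name, latency), total in zip(STAGES, running_totals(STAGES)):
        print(f"{name:<36} {latency:>5.1f} ms  [total {total:.1f} ms]")
```

In this example the network accounts for 15ms of the 35ms total, so the remaining 20ms comes from the two computers and their sound hardware.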