1. About Digital Audio
    2. As recording engineers have known for years, it takes a little art, and more science, to capture sound and store it for later playback. You may already have tried to do some recording of your own, so you've probably noticed that home recordings often don't sound as good as your favorite compact disc. Professional recordings typically sound clean and free of both distortion and background noise.

      If you'll settle for fuzzy tones, distortions, random pops, thumps, buzzes, and hums, then just about any old microphone and recording technique will work. On the other hand, if you prefer digital recordings that are pleasing to the ear, then you'll want to acquaint yourself with some essential concepts. But before jumping into digital audio, it's important to have an understanding of the physics and the perception of sound, as well as the process by which sound can get in and out of your computer.

      1. Like a pebble in a pond (...sort of)
      2. Sound consists of vibrations, either in the air or in some other medium. When a sound is created, waves of vibration spread out from a source, much the way waves spread out on the surface of a pond when you throw a pebble into it.

        Waves have an up-and-down motion in pools of water. Sound waves, however, consist of variations in air pressure. The "crest" of each wave is a region where the air molecules are packed more closely together, and the "trough" is a region where the molecules are farther apart. This idea is illustrated in Figure 1.

      3. Waves are busy, but don't go anywhere
      4. In a pond, the waves travel outward, but the water itself doesn't go anywhere. All that happens is that the water molecules near the surface of the pond move up and down. You can see this if you float a dry leaf on the surface and then toss in a pebble. The leaf will not travel away from the point where the pebble entered the water; it will only bob up and down. In the same way, when sound waves travel outward from a sound source, the air molecules only press against one another while staying pretty much where they are.

        Sound waves travel considerably faster than the waves in a pond: about 1,100 feet per second in air. And, if you've ever watched a marching band outdoors, you've probably noticed that you can see the cymbal players' cymbals crash together well before you hear them. That's because light travels a lot faster than sound. (Nearly 900,000 times faster!)

        Consider a loudspeaker. When it starts to make sounds, the speaker diaphragm is first pushed outward, compressing the air molecules. Then the diaphragm is pulled back, reducing the air pressure momentarily. As it is moved forward and backward again and again, sound is generated. The key question is this: How quickly does the sound source move back and forth? If it vibrates rapidly, the peaks of the sound waves will be close together (high notes); if it vibrates more slowly, the peaks will be farther apart (low notes), as in Figure 2.

      5. Our ears discern subtleties
      6. Human ears can very easily detect how closely spaced sound waves are when they arrive at our eardrums. When the sound source is vibrating rapidly, we say that the sound has a high frequency (high note), because lots of waves reach our ears in a short period of time. When the vibrations are slow, the sound has a low frequency (low note), because fewer waves strike our eardrums in the same amount of time. A clarinet, for example, produces higher frequency sound, while a bass viola produces much lower frequency sound.

        Frequency is measured in cycles per second. The technical term for this measurement is hertz (abbreviated Hz). One cycle per second equals one hertz. The low range of human hearing is about 20 Hz; at the upper end of the range, children and adults with acute hearing can hear sounds as high as 20,000 Hz (abbreviated as 20 kHz). The "k" is for kilo, or "thousand." (Since you use computers, you probably knew that already.)

        The range of human hearing, then, is usually considered to be about 20 Hz to 20 kHz. Many adults, however, have a reduced ability to hear high frequencies. They may not be able to perceive sounds above the 12 kHz to 15 kHz range.

      7. Music
      8. Frequency and musical pitch are closely related. Each musical note consists of vibrations at a specific frequency. For example, when you play a Middle C, you'll hear a sound whose frequency is about 261.63 Hz. The A above Middle C, which is often used as a tuning reference by orchestras, vibrates at 440 Hz. (Hence, it's called A-440.)

        The musical scale sounds the way it does because the frequency doubles each time pitch rises by an octave. For example, the A directly below Middle C has a frequency of 220 Hz, and the next A below it vibrates at 110 Hz. Moving up the keyboard, the C above Middle C has a frequency of about 523.3 Hz, and the next C has a frequency of about 1,046.5 Hz. But if a clarinet, a bass viola, and a banjo each sound a note whose pitch is Middle C, how is it that our ears can instantly tell which instrument is playing?
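
        The octave-doubling rule can be checked with a few lines of Python. This is a minimal sketch that assumes twelve-tone equal temperament tuned to A-440 (the text does not name a tuning system); the function name `note_frequency` is made up for illustration.

```python
# Sketch: note frequencies, assuming twelve-tone equal temperament tuned to A-440.
A4_HZ = 440.0  # the orchestral tuning reference mentioned in the text


def note_frequency(semitones_from_a4: int) -> float:
    """Frequency of the note a given number of semitones above (+) or below (-) A-440."""
    # Twelve semitones make one octave, and each octave doubles the frequency.
    return A4_HZ * 2.0 ** (semitones_from_a4 / 12.0)


if __name__ == "__main__":
    notes = [
        ("A two octaves below A-440", -24),
        ("A below Middle C", -12),
        ("Middle C", -9),
        ("A-440", 0),
        ("C above Middle C", 3),
    ]
    for name, offset in notes:
        print(f"{name:>26}: {note_frequency(offset):8.2f} Hz")
```

        Moving twelve semitones in either direction halves or doubles the result, which is the octave relationship described above.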

      9. Overtones
      10. Consider the body of a banjo. When a note is played on a banjo, everything vibrates: the front of the banjo vibrates, the sides and back vibrate, the neck and fingerboard vibrate, and so on. The sound of a banjo, then, doesn't consist of a pure sound wave at a single frequency. Instead, each part tends to vibrate in a different way, and the sound produced by the instrument is a composite, or blend, of a number of different vibrations at different frequencies. If the sound of the banjo is recorded and examined on a computer screen, it won't look at all like the simple sound waves shown in Figure 2. It will have a much more complex shape. Most sounds in the real world exhibit complexities of this kind.

        A mathematically pure tone is called a sine wave. Scientifically, it's possible to analyze a complex wave as being the sum of a number of different sine waves, each with its own frequency and amplitude (loudness). This is called a Fourier (pronounced "FOOR-ee-aye") analysis. The body of a banjo produces a complex wave containing a number of separate tones at different frequencies, all at the same time.

        These are called overtones, and virtually all sounds in nature contain numerous overtones. One way our ears can instantly tell the sound of a banjo from the sound of a clarinet or bass viola is by noticing the relative loudness of the overtones at various frequencies.

        Usually, the loudest of these component tones is the one vibrating at the nominal pitch of the note (440 Hz if the musician is playing the A above Middle C). This frequency is called the fundamental.
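
        To make the idea of Fourier analysis concrete, the sketch below builds a complex wave from three sine waves and then measures how much of each frequency it contains. The sample rate, the amplitudes, and the helper names `make_tone` and `amplitude_at` are illustrative assumptions, not measurements of a real banjo.

```python
import math

SAMPLE_RATE = 8000   # samples per second (illustrative choice)
NUM_SAMPLES = 800    # 0.1 second, so each test frequency completes whole cycles


def make_tone(partials):
    """Sum sine waves; partials is a list of (frequency_hz, amplitude) pairs."""
    return [
        sum(amp * math.sin(2 * math.pi * freq * n / SAMPLE_RATE) for freq, amp in partials)
        for n in range(NUM_SAMPLES)
    ]


def amplitude_at(samples, frequency_hz):
    """Estimate how strongly one frequency is present by correlating with sine and cosine."""
    step = 2 * math.pi * frequency_hz / SAMPLE_RATE
    cos_sum = sum(s * math.cos(step * n) for n, s in enumerate(samples))
    sin_sum = sum(s * math.sin(step * n) for n, s in enumerate(samples))
    return 2.0 * math.hypot(cos_sum, sin_sum) / len(samples)


if __name__ == "__main__":
    # Hypothetical instrument: fundamental at 440 Hz plus two quieter overtones.
    tone = make_tone([(440, 1.0), (880, 0.5), (1320, 0.25)])
    for freq in (440, 660, 880, 1320):
        print(f"{freq:5d} Hz: amplitude {amplitude_at(tone, freq):.3f}")
```

        The analysis recovers the amplitudes that went into the blend and reports essentially nothing at 660 Hz, a frequency that was never added. Changing the relative amplitudes changes the character of the sound without changing its pitch.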

        With this understanding of what sound is, and what happens when it's propagated through the air, let's look at the process of capturing it on your computer.

      11. Capturing Sound
      12. A microphone is a form of transducer, a device that can take energy in one form and translate it into another form. Changes in air pressure arrive at a microphone, and are translated into changes in electrical voltage. (During playback, a loudspeaker, driven by electrical voltage, transforms the voltage back into changes in air pressure, so a loudspeaker is a transducer too.)

        Mounted inside a microphone is a small, thin, sensitive piece of material called the diaphragm. As sound waves strike the diaphragm, it vibrates at the same frequencies as the sound. The diaphragm, and its associated parts, translate these movements into fluctuating voltages. The microphone must be built with precision so it can be as sensitive as possible to slight sound vibrations. Professional recording studios think nothing of spending $1,000 or more for a good microphone. You may not want to rely on the free microphone shipped with your sound card. Many of these microphones are fine for recording sounds to jazz up your computer operating system, but that's about it.

      13. Analog to Digital Conversion
      14. A microphone works much like our ear. When sound strikes the diaphragm of a microphone, it produces an electrical wave that is analogous to the air pressure that produces the sound. Because of this correspondence between the air pressure and the voltage coming out of the microphone, the wave is termed an analog signal. Analog is used to characterize some common types of audio components, such as tape recorders, etc., to distinguish that equipment from the kind that incorporates digital circuitry.

        Computers are designed to be operated in the digital realm, where everything is a series of on or off voltages, formatted as bits. To store a sound in the computer we must convert the continuously changing analog audio signal into digital data. The circuitry that transforms the signal from analog to digital is termed the ADC (analog-to-digital converter); the circuitry that performs the reverse conversion is the DAC (digital-to-analog converter).

        At regular intervals an ADC instantaneously freezes the audio signal voltage and holds it steady while another circuit selects the binary code that most closely represents the sampled voltage. The ADC outputs a number in binary format that represents the input signal at that instant in time.

        Digital-to-analog conversion (for playback) is the exact opposite. In digital-to-analog conversion, the digital data is converted to a continuously changing series of voltage levels. The shape of this continuously changing stream of voltage levels approximates the shape of the original wave. This signal is then passed through a low-pass filter, which removes the digital "switching noise."

        Once in digital form, the audio is essentially immune to degradation caused by system noise or defects in the storage or transmission medium (unlike older analog systems). The digitized audio signal is easily stored on a hard disk drive, where it can be kept indefinitely without loss of fidelity.

      15. Sampling Resolution
      16. Sampling resolution refers to the number of discrete levels that are used in the analog-to-digital (and digital-to-analog) conversion processes.

        Sampling resolution is measured in bits, which refer to the amount of memory required to store each individual sample. The number of different values that can be stored in a collection of bits is equal to 2 raised to the power of the number of bits. For example, an eight-bit sample is digitized to 2^8, or 256, different levels. A sixteen-bit sampling system, on the other hand, senses 2^16, or 65,536, different levels.

        Obviously, the more bits of resolution that are used, the more closely the sampled signal will represent the analog signal, which has an essentially infinite resolution. However, higher resolutions require more storage, so some tradeoffs must be made. Generally, 8 bits is accepted as the lowest sampling resolution that can be used to obtain reasonable results, while 16-bit resolution is preferred for professional applications.
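
        The effect of resolution on accuracy can be sketched in a few lines. The example assumes an input voltage normalized to the range -1.0 to 1.0 and uses unsigned codes starting at zero; real converters and file formats differ in both respects, and the function names are hypothetical.

```python
def quantization_levels(bits: int) -> int:
    """Number of distinct values a sample of the given width can hold."""
    return 2 ** bits


def quantize(voltage: float, bits: int) -> int:
    """Map a voltage in [-1.0, 1.0] to the nearest of 2**bits integer codes."""
    levels = quantization_levels(bits)
    clipped = max(-1.0, min(1.0, voltage))  # anything outside the range is clipped
    return round((clipped + 1.0) / 2.0 * (levels - 1))


def reconstruct(code: int, bits: int) -> float:
    """Turn an integer code back into the voltage it stands for."""
    levels = quantization_levels(bits)
    return code / (levels - 1) * 2.0 - 1.0


if __name__ == "__main__":
    voltage = 0.3
    for bits in (8, 16):
        code = quantize(voltage, bits)
        error = abs(reconstruct(code, bits) - voltage)
        print(f"{bits:2d} bits: {quantization_levels(bits):6d} levels, "
              f"code {code}, error {error:.7f}")
```

        The 16-bit code lands far closer to the original voltage than the 8-bit code, at the cost of twice the storage per sample.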

        The resolution of a sampling system is almost always determined by the hardware. The minimum resolution permitted for multimedia hardware is 8 bits. Compact discs and digital audio tapes use 16-bit samples, although many playback units only use 14 bits in their output circuitry. The telephone network typically carries the equivalent of about 14 bits of resolution in 8-bit samples, using an enhancement called compression.

      17. Compression
      18. When the analog-to-digital converter (ADC) outputs a binary number representing the input signal, this number must be stored in a convenient form for later retrieval. This conversion of a binary number for storage purposes is called encoding.

        Audio encoding techniques can be broadly categorized into two classes: those for encoding analog waveforms as faithfully as possible, and those for minimizing (or compressing) the computer storage requirements. The two most common techniques used to encode an audio waveform are pulse code modulation (PCM) and delta modulation (DM).

        Linear pulse code modulation (PCM) associates a particular binary number with every voltage level of the incoming analog signal. As the incoming signal increases, the binary number goes up in value proportionally. Similarly, as the analog voltage goes down, the binary number decreases. A multimedia 16 bit wave file is an example of linear PCM.

        Non-linear PCM allows the computer to store fewer bits per sample by dropping the bits for signals that require less sensitivity. Non-linear PCM associates a particular binary number with a voltage range of the incoming analog signal. By carefully choosing the ranges, 14 bits of resolution can be packed into an 8-bit sample with minimal loss of fidelity. 8-bit μ-law (properly pronounced "MEW-law", for the Greek letter μ, and often written "u-law") is an example of a non-linear PCM encoding technique. 8-bit μ-law is used to carry audio through the North American phone system.
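
        The sketch below illustrates the idea behind mu-law (u-law) encoding using the continuous companding curve with a compression parameter of 255. It is not the exact bit layout used on telephone lines, which approximates this curve with straight segments, and the function names are made up for illustration.

```python
import math

MU = 255.0  # compression parameter of the North American mu-law curve


def mulaw_compress(x: float) -> float:
    """Continuous mu-law curve: maps [-1, 1] to [-1, 1], stretching quiet signals."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def encode_8bit(x: float) -> int:
    """Compress, then quantize to one of 256 codes (0..255)."""
    return round((mulaw_compress(x) + 1.0) / 2.0 * 255)


def decode_8bit(code: int) -> float:
    """Recover an approximate sample value from an 8-bit code."""
    return mulaw_expand(code / 255 * 2.0 - 1.0)


if __name__ == "__main__":
    for x in (0.001, 0.01, 0.1, 0.9):
        restored = decode_8bit(encode_8bit(x))
        print(f"input {x:6.3f} -> code {encode_8bit(x):3d} -> restored {restored:8.5f}")
```

        Because the curve spends most of its 256 codes on quiet signals, small inputs come back with much less error than a linear 8-bit code would give, while loud inputs are reproduced more coarsely.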

        Delta Modulation (DM) is a data compression method where only the difference between subsequent samples is stored. Since voice signals are relatively stable from sample to sample, the number of bits required to faithfully reproduce the signal can be reduced. One type of delta modulation is Continuously Variable Slope Delta modulation (CVSD), where a single bit is used to indicate whether the signal is increasing or decreasing. Another common type of DM is Adaptive Differential Pulse Code Modulation (ADPCM), where a constantly changing table of multiplier values enables the encoder to adapt to various types of signals. Dialogic™ 4-bit files are an example of ADPCM.
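
        As a rough illustration of the one-bit-per-sample idea, here is a minimal delta modulator with a fixed step size. It is a simplification: CVSD varies the step continuously and ADPCM uses several bits with an adaptive table, neither of which is modeled here. The step size and function names are assumptions.

```python
import math

STEP = 0.1  # fixed step size; CVSD would adapt this as the signal changes


def delta_encode(samples):
    """Return one bit per sample: 1 if the signal is above the running estimate, else 0."""
    estimate, bits = 0.0, []
    for sample in samples:
        bit = 1 if sample > estimate else 0
        estimate += STEP if bit else -STEP  # the encoder tracks what the decoder will see
        bits.append(bit)
    return bits


def delta_decode(bits):
    """Rebuild an approximation of the signal by stepping up or down on each bit."""
    estimate, output = 0.0, []
    for bit in bits:
        estimate += STEP if bit else -STEP
        output.append(estimate)
    return output


if __name__ == "__main__":
    # A slowly changing test signal: one cycle of a sine wave over 200 samples.
    signal = [math.sin(2 * math.pi * n / 200) for n in range(200)]
    bits = delta_encode(signal)
    restored = delta_decode(bits)
    worst = max(abs(a - b) for a, b in zip(signal, restored))
    print(f"{len(bits)} bits stored, worst-case error {worst:.3f}")
```

        The decoder follows the signal in a staircase pattern. If the signal changes faster than one step per sample, the staircase falls behind, which is the problem that adaptive schemes are designed to address.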

        Linear predictive coding (LPC) extracts perceptually significant features of speech directly from a time domain speech waveform to produce a time varying model of the human vocal tract excitation and transfer function. A synthesizer on the decoding side reverses the process. CCITT (ITU-T) G.728 is an example of an LPC compression algorithm.

        All you really need to know is that compression techniques exist and that their sole purpose is to reduce the number of bits required to store audio while simultaneously retaining as much fidelity as possible.

      19. Frequency of Sampling
      20. An analog signal varies continuously, so capturing its value at every possible instant would produce more data than could ever be stored in memory. This problem is handled by having the analog-to-digital converter (ADC) measure the signal at a regular rate. This is called the sample rate, and is an important factor in determining the quality of digital sound. The sample rate is also called the sampling frequency, because it too can be measured in hertz.

        According to the sampling theorem, a wave needs to be sampled only a little more than twice during each cycle to get an accurate representation. This principle holds true even for very complex sound waves. As we saw earlier, even the most complex wave is composed of the sum of sinusoids at varying frequencies and amplitudes. Therefore, if you sample at a rate that is more than twice the highest frequency in your input signal, the content will be accurately captured.

        If a signal is insufficiently sampled, new and unwanted frequencies are generated and added to the sampled sound. These are related to both the input frequencies and the sample frequency in such a way that they are virtually guaranteed to sound unpleasant. This is called aliasing.
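
        Aliasing is easy to demonstrate numerically. In the sketch below, a 5 kHz tone sampled at 8 kHz produces exactly the same sample values (apart from a flipped sign) as a 3 kHz tone, so once sampled the two cannot be told apart. The frequencies and function names are chosen for illustration.

```python
import math


def alias_frequency(signal_hz, sample_rate_hz):
    """Apparent frequency of a sampled sine wave, folded into 0 .. sample_rate / 2."""
    folded = signal_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


def sample_sine(frequency_hz, sample_rate_hz, count):
    """Take count samples of a unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * frequency_hz * n / sample_rate_hz) for n in range(count)]


if __name__ == "__main__":
    rate = 8000
    too_high = sample_sine(5000, rate, 16)   # above half the sample rate
    impostor = sample_sine(3000, rate, 16)   # the frequency it will masquerade as
    print("5 kHz sampled at 8 kHz appears as", alias_frequency(5000, rate), "Hz")
    print("samples match (sign inverted):",
          all(abs(a + b) < 1e-9 for a, b in zip(too_high, impostor)))
```

        Frequencies at or below half the sample rate pass through `alias_frequency` unchanged, which is why limiting the input bandwidth to that range prevents the problem.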

        Fortunately, aliasing is not a serious problem with most modern systems. Audio equipment incorporates special circuits, called anti-aliasing filters, that automatically restrict the bandwidth (frequency content) of the input signal based on the sample rate. For example, if a telephony card is commanded to sample a sound at 8 kHz, then the anti-aliasing filter is adjusted to reject frequencies over 4 kHz. A similar filter is used on the output when the sound is played back, to smooth off the "rough edges" left by the digitizing process.

        The sample rate used by a particular sampling system can usually be set through software, though the upper value is limited by the hardware. The Multimedia PC document issued by Microsoft specifies that the sound hardware be at least capable of sampling at 11.025 kHz and 22.050 kHz. Compact discs are recorded with a fixed sample rate of 44.1 kHz. Telephone quality boards, such as Dialogic boards, are capable of sampling at 6 kHz or 8 kHz, using compression techniques that result in 4 or 8 bits per sample.

        Sometimes we need to convert files recorded at one frequency for use on another system at another frequency. This process is called re-sampling. Re-sampling requires that the original signal be reconstructed, filtered, and then chopped back up into samples at the new frequency. The techniques used to perform this process are referred to as re-sampling algorithms. The most common re-sampling algorithm is based on a linear approximation of the original signal. Faster algorithms simply skip or add samples, which results in lower fidelity. More time-consuming algorithms use Fourier analysis to more accurately reproduce the original audio waveform.
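
        A minimal sketch of the linear-approximation approach follows. It assumes integer sample rates and leaves out the filtering step, so it is only suitable as an illustration: converting to a lower rate without filtering first can introduce aliasing. The function name is hypothetical.

```python
def resample_linear(samples, source_rate, target_rate):
    """Resample by drawing straight lines between neighbouring input samples.

    Both rates are assumed to be positive integers (samples per second).
    """
    if len(samples) < 2:
        return list(samples)
    # Number of output samples that fit within the span of the input.
    out_count = (len(samples) - 1) * target_rate // source_rate + 1
    output = []
    for i in range(out_count):
        position = i * source_rate / target_rate   # fractional index into the input
        left = int(position)
        right = min(left + 1, len(samples) - 1)
        fraction = position - left
        output.append(samples[left] * (1.0 - fraction) + samples[right] * fraction)
    return output


if __name__ == "__main__":
    ramp = [0.0, 1.0, 2.0, 3.0, 4.0]
    print(resample_linear(ramp, 8000, 16000))  # twice as many samples per second
    print(resample_linear(ramp, 8000, 4000))   # half as many samples per second
```

        Doubling the rate inserts a midpoint between each pair of input samples, while halving it keeps every other sample.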

        At a relatively low sampling rate of 6 or 8 kHz (typical for telephone quality voice) far fewer code bits are produced each second than, for example, at the 44.1 kHz sampling rate used for compact discs. For a two-channel 16-bit signal at a 44.1 kHz sampling rate, about 10.6 million bytes are generated each minute. An hour of compact disc quality music therefore takes about 635 million bytes, so you'll want a hard disk of at least 800 million bytes to record it comfortably. On the other hand, a 60 second segment of compressed telephony audio, sampled at 8,000 samples per second, using 1 byte samples, translates into 480 thousand bytes of data, thereby requiring significantly less storage. Even with compression, though, you can see that most recordings take up quite a bit of disk space.
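
        The storage figures above follow from simple multiplication: samples per second, times bytes per sample, times channels, times seconds. A small sketch of the arithmetic, with a hypothetical function name:

```python
def bytes_per_minute(sample_rate_hz: int, bytes_per_sample: int, channels: int) -> int:
    """Storage needed for one minute of uncompressed audio in the given format."""
    return sample_rate_hz * bytes_per_sample * channels * 60


if __name__ == "__main__":
    cd_minute = bytes_per_minute(44100, 2, 2)       # 16-bit stereo at 44.1 kHz
    phone_minute = bytes_per_minute(8000, 1, 1)     # 1-byte mono at 8 kHz
    print(f"Compact disc quality: {cd_minute:,} bytes per minute")
    print(f"Compact disc quality: {cd_minute * 60:,} bytes per hour")
    print(f"Telephone quality:    {phone_minute:,} bytes per minute")
```

        Plugging in other sample rates or sample sizes shows quickly how each choice trades fidelity against disk space.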

      21. Putting it all Together: How Audio Editors Work
      22. Now that you've stored the audio signal onto your computer hard disk, what can you do with it? Well, just as a word processor lets you manipulate the words and pictures that make up a document, an audio editor lets you edit sound in much the same way. You can make a sound "bold" by increasing its volume; make it "italic" by changing the pitch. With a graphical audio editor you can both see and hear the audio waveform while you work. Figure 5 provides a block diagram of how an audio editor interacts with the components you've seen so far.

      23. What is VFEdit®?

VFEdit® is a graphical audio editor that lets you "see" the binary numbers that have been stored on your computer hard disk by your telephony or multimedia hardware. VFEdit knows how to translate these digitized and encoded audio signals into a picture that you can see. Similarly, VFEdit knows how to take common editing commands such as "Play" or "Cut" and perform the requested operation on the hardware and the data. VFEdit provides a convenient user interface for recording new sounds and playing back audio already recorded. And VFEdit has been specifically designed to deal with the specialized hardware and audio formats used by telephony systems.

Let's say you're running a home shopping network. Your professional voice talent costs $50 an hour (or more). Then you have to change your prices, but just the prices. With VFEdit you just erase the old prices, record the new prices, and keep the old, but good (and very expensive) words.

Or maybe you have a travel agency. You want your system to call your clients, promote a week in Hawaii, and play some exotic Hawaiian hula music to set the scene. (Studies have shown that adding music to an audio phone message can increase retention level by as much as 30%.) With just a few mouse clicks you can mix that music into the background of your sales pitch.

The basic reason to use VFEdit is that you simply want your audio presentations to sound good, just as you want your written presentations to look good. If you develop for voice processing, or if you only change, fix, and improve your voice prompts, you'll find VFEdit just about indispensable. It's fun, too!