From ms20@u.washington.edu Sat Sep 3 22:48:53 1994
Date: Sat, 3 Sep 1994 19:19:44 -0700 (PDT)
From: HIgH TeCH
To: analogue
Subject: Talking Machines (Long!)

This is an excerpt taken from J.L. Flanagan's Speech Analysis, Synthesis, and Perception, Second Edition, Pages 204-211.

The text is reproduced as it is in the book, except where references to illustrations were made. I mainly wanted to expose the readers to the history of speech synthesis preceding the Vocoder, so anything actually involving the Vocoder is not included in this text. Don't let that discourage you. This is good reading!

Enjoy,
Romeo Fahl
++++++++++
ms20@u.washington.edu

-------------------------------------------------------------------------

SPEECH SYNTHESIS
----------------

Ancient man often took his ability of speech as a symbol of divine origin. Not unnaturally, he sometimes ascribed the same ability to his gods. Pagan priests, eager to fulfill great expectations, frequently tried to make their idols speak directly to the people. Talking statues, miraculous voices and oracles were well known in the Greek and Roman civilizations - the voice usually coming to the artificial mouth via cleverly concealed speaking tubes. Throughout early times the capacity of "artificial speech" to amaze, amuse and influence its listeners was remarkably well appreciated and exploited.

As the civilized world entered the Renaissance, scientific curiosity developed and expanded. Man began to inquire more seriously into the nature of things. Human life and physiological functions were fair targets of study, and the physiological mechanism of speech belonged in this sphere. Not surprisingly, the relatively complex vocal mechanism was often considered in terms of more tractable models. These early models were invariably mechanical contrivances, and some were exceedingly clever in design.


MECHANICAL SPEAKING MACHINES: HISTORICAL EFFORTS
------------------------------------------------

One of the earliest documented efforts at speech synthesis was by Kratzenstein in 1779. The Imperial Academy of St. Petersburg offered its annual prize for explaining the physiological differences between five vowels, and for making apparatus to produce them artificially. As the winning solution, Kratzenstein constructed acoustic resonators with vibrating reeds which, in a manner analogous to the human vocal cords, interrupted an air stream.

A few years later (1791), Von Kempelen constructed and demonstrated a more elaborate machine for generating connected utterances. [Apparently Von Kempelen's efforts antedate Kratzenstein's, since Von Kempelen purportedly began work on his device in 1769 (Von Kempelen; Dudley and Tarnoczy).] Although his machine received considerable publicity, it was not taken as seriously as it should have been. Von Kempelen had earlier perpetrated a deception in the form of a mechanical chess-playing machine. The main "mechanism" of the machine was a concealed, legless man - an expert chess player.

The speaking machine, however, was a completely legitimate device. It used a bellows to supply air to a reed which, in turn, excited a single, hand-varied resonator for producing voiced sounds. Consonants, including nasals, were simulated by four separate constricted passages, controlled by the fingers of the other hand. An improved version of the machine was built from Von Kempelen's description by Sir Charles Wheatstone (of the Wheatstone Bridge, and who is credited in Britain with the invention of the telegraph).

Briefly, the device was operated in the following manner. The right arm rested on the main bellows and expelled air through a vibrating reed to produce voiced sounds. The fingers of the right hand controlled the air passages for the fricatives /ʃ/ and /s/, as well as the "nostril" openings and the reed on-off control.
For vowel sounds, all the passages were closed and the reed turned on. Control of vowel resonances was effected with the left hand by suitably deforming the leather resonator at the front of the device. Unvoiced sounds were produced with the reed off, and by a turbulent flow through a suitable passage. In the original work, Von Kempelen claimed that approximately 19 consonant sounds could be made passably well.

Von Kempelen's efforts probably had a more far-reaching influence than is generally appreciated. During Alexander Graham Bell's boyhood in Edinburgh, Scotland (latter 1800's), Bell had an opportunity to see the reproduction of Von Kempelen's machine which had been constructed by Wheatstone. He was greatly impressed with the device. With stimulation from his father (Alexander Melville Bell, an elocutionist like his own father), and his brother Melville's assistance, Bell set out to construct a speaking automaton of his own.

Following their father's advice, the boys attempted to copy the vocal organs by making a cast from a human skull and molding the vocal parts in gutta-percha. The lips, tongue, palate, teeth, pharynx, and velum were represented. The lips were a frame-work of wire, covered with rubber which had been stuffed with cotton batting. Rubber cheeks enclosed the mouth cavity, and the tongue was simulated by wooden sections - likewise covered by a rubber skin and stuffed with batting. The parts were actuated by levers controlled from a keyboard. A larynx "box" was constructed of tin and had a flexible tube for a windpipe. A vocal cord orifice was made by stretching a slotted rubber sheet over tin supports.

Bell says the device could be made to say vowels and nasals and could be manipulated to produce a few simple utterances (apparently well enough to attract the neighbors). It is tempting to speculate how this boyhood interest may have been decisive in leading to U.S. patent No.
174,465, dated February 14, 1876 - describing the telephone, and which has been perhaps one of the most valuable patents in history.

Bell's youthful interest in speech production also led him to experiment with his pet Skye terrier. He taught the dog to sit up on his hind legs and growl continuously. At the same time, Bell manipulated the dog's vocal tract by hand. The dog's repertoire of sounds finally consisted of the vowels /a/ and /u/, the diphthong /ou/ and the syllables /ma/ and /ga/. His greatest linguistic accomplishment consisted of the sentence, "How are you Grandmamma?" The dog apparently started taking a "bread and butter" interest in the project and would try to talk by himself. But on his own, he could never do better than the usual growl. This, according to Bell, is the only foundation to the rumor that he once taught a dog to speak.

Interest in mechanical analogs of the vocal system continued to the twentieth century. Among those who developed a penetrating understanding of the nature of human speech was Sir Richard Paget. Besides making accurate plaster tube models of the vocal tract, he was also adept at simulating vocal configurations with his hands. He could literally "talk with his hands" by cupping them and exciting the cavities either with a reed, or with the lips made to vibrate after the fashion of blowing a trumpet.

Around the same time, a different approach to artificial speech was taken by people like Helmholtz, D.C. Miller, Stumpf, and Koenig. Their view was more from the point of perception than from production. Helmholtz synthesized vowel sounds by causing a sufficient number of tuning forks to vibrate at selected frequencies and with prescribed amplitudes. Miller and Stumpf, on the other hand, accomplished the same thing by sounding organ pipes. Still different, Koenig synthesized vowel spectra from a siren in which air jets were directed at rotating, toothed wheels.
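
[Poster's aside, not part of the book text.] The tuning-fork method described above amounts to what is now called additive synthesis: a sum of sinusoids at chosen frequencies with chosen amplitudes. A minimal sketch in Python follows, assuming harmonically related "forks" on a common fundamental. The function name, the sample rate and the amplitude values are made up for illustration; they are not measurements from Helmholtz or from the book.

```python
import math


def additive_vowel(f0, amplitudes, duration=0.5, sample_rate=8000):
    """Sum one sinusoid per "tuning fork".

    amplitudes[k] is the relative amplitude of harmonic k + 1 of f0.
    All values here are illustrative assumptions, not historical data.
    """
    n_samples = int(duration * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        value = sum(
            a * math.sin(2 * math.pi * (k + 1) * f0 * t)
            for k, a in enumerate(amplitudes)
        )
        samples.append(value)
    # Normalise so the largest sample has magnitude 1.
    peak = max((abs(s) for s in samples), default=0.0) or 1.0
    return [s / peak for s in samples]


if __name__ == "__main__":
    # Hypothetical amplitude pattern: a few strong low harmonics.
    tone = additive_vowel(120.0, [1.0, 0.5, 0.25, 0.6, 0.3])
    print(len(tone), "samples")
```

Changing the amplitude list while holding f0 fixed changes the timbre but not the pitch, which is the property the tuning-fork, organ-pipe and siren experiments all relied on.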

At least one more-recent design for a mechanical talker has been put forward (Riesz, unpublished, 1937). Air under pressure is brought from a reservoir at the right. Two valves control the flow. The first valve admits air into a chamber in which a reed is fixed. The reed vibrates and interrupts the air flow much like the vocal cords. A spring-loaded slider varies the effective length of the reed and changes its fundamental frequency. Unvoiced sounds are produced by admitting air through the second valve. The configuration of the vocal tract is varied by means of nine movable members representing the lips, teeth, tongue, pharynx, and velar coupling.

To simplify the control, Riesz constructed the mechanical talker with finger keys to control the configuration, but with only one control each for lips and teeth (which worked in opposition to each other). The different members were covered with a soft rubber lining to accomplish realistic closures and dampings. Two keys (4 and 5) operate excitation valves (V4 and V5), arranged somewhat differently than the first two. Valve V4 admits air through a hole forward in the tract for producing unvoiced sounds. Valve V5 supplies air to the reed chamber for voiced excitation. In this case pitch is controlled by the amount of air passed by the valve V5.

When operated by a skilled person, the machine could be made to simulate connected speech. One of its particularly good utterances was reported to be "cigarette".

ELECTRICAL METHODS FOR SPEECH SYNTHESIS
---------------------------------------

With the evolution of electrical technology, interest in speech synthesis assumed a broader basis. Academic interest in the physiology and acoustics of the signal-producing mechanism was supplemented by the potential for communicating at a distance. Although "facsimile waveform" transmission of speech was the first method to be applied successfully (i.e.
in the telephone), many early inventors appreciated the resonance nature of the vocal system and the importance to intelligibility of preserving the short-time amplitude spectrum *. Analytical formulation and practical application of this knowledge were longer in coming.

SPECTRUM RECONSTRUCTION TECHNIQUES
----------------------------------

Investigators such as Helmholtz, D.C. Miller, R. Koenig and Stumpf had earlier noted that speech-like sounds could be generated by producing an harmonic spectrum with the correct fundamental frequency and relative amplitudes. In other words, the signal could be synthesized with no compelling effort at duplicating the vocal system, but mainly with the objective of producing the desired percept.

Among the first to demonstrate the principle electrically was Stewart, who excited two coupled resonant electrical circuits by a current interrupted at a rate analogous to the voice fundamental. By adjusting the circuit tuning, sustained vowels could be simulated. The apparatus was not elaborate enough to produce connected utterances. Somewhat later, Wagner devised a similar set of four electrical resonators, connected in parallel, and excited by a buzz-like source. The outputs of the four resonators were combined in the proper amplitudes to produce vowel spectra.

Probably the first electrical synthesizer which attempted to produce connected speech was the Voder (Dudley, Riesz, and Watkins). It was basically a spectrum-synthesis device operated from a finger keyboard. It did, however, duplicate one important physiological characteristic of the vocal system, namely, that the excitation can be voiced or unvoiced. The "resonance control" box of the device contains 10 contiguous band-pass filters which span the speech frequency range and are connected in parallel. All the filters receive excitation from either the noise source or the buzz (relaxation) oscillator.
The wrist bar selects the excitation source, and a foot pedal controls the pitch of the buzz oscillator. The outputs of the band-pass filters pass through potentiometer gain controls and are added. Ten finger keys operate the potentiometers. Three additional keys provide a transient excitation of selected filters to simulate stop-consonant sounds.

This speaking machine was demonstrated by trained operators at the World's Fairs of 1939 (New York) and 1940 (San Francisco). Although the training required was quite long (on the order of a year or more), the operators were able to "play" the machines - literally as though they were organs or pianos - and to produce intelligible speech **. More recently, further research studies based upon the Voder principle have been carried out (Oizumi and Kubo).

----

* Prominent among this group was Alexander Graham Bell. The events - in connection with experiments on the "harmonic telegraph" - that led Bell, in March of 1876, to apply the facsimile waveform principle are familiar to most students of communication. Less known, perhaps, is Bell's conception of a spectral transmission method remarkably similar to the channel vocoder.

Bell called the idea the "harp telephone". It consisted of an elongated electromagnet with a row of steel reeds in the magnetic circuit. The reeds were to be arranged to vibrate in proximity to the pole of the magnet, and were to be tuned successively to different frequencies. Bell suggested that "-they might be considered analogous to the rods in the harp of Corti in the human ear". Sound uttered near the reeds would cause to vibrate those reeds corresponding to the spectral structure of the sound. Each reed would induce in the magnet an electrical current which would combine with the currents produced by other reeds into a resultant complex wave.
The total current passing through a similar instrument at the receiver would, Bell thought, set identical reeds into motion and reproduce the original sound (Watson).

The device was never constructed. The reason, Watson says, was the prohibitive expense! Also, because of the lack of means for amplification, Bell thought the currents generated by such a device might be too feeble to be practicable. (Bell found with his harmonic telegraph, however, that a magnetic transducer with a diaphragm attached to the armature could, in fact, produce audible sound from such feeble currents.)

The principle of the "harp telephone" carries the implication that speech intelligibility is retained by preserving the short-time amplitude spectrum. Each reed of the device might be considered a combined electro-acoustic transducer and bandpass filter. Except for the mixing of the "filter" signals in a common conductor, and the absence of rectifying and smoothing means, the spectrum reconstruction principle bears a striking resemblance to that of the channel Vocoder.

** H.W. Dudley retired from Bell Laboratories in October 1961. On the completion of his more than 40 years in speech research, one of the Voder machines was retrieved from storage and refurbished. In addition, one of the original operators was invited to return and perform for the occasion. Amazingly, after an interlude of twenty years, the lady was able to sit down to the console and make the machine speak.
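
----

[Poster's aside, not part of the book text.] For readers who think better in code, the Voder signal path described in the SPECTRUM RECONSTRUCTION TECHNIQUES section (buzz or noise source, ten parallel band-pass filters, one gain per filter, outputs added) can be roughed out digitally. This is only a sketch under loud assumptions: the original was an analog circuit, the book excerpt gives no band edges or filter designs, and the three stop-consonant keys are left out. The 8 kHz sample rate, the ten equal 350 Hz bands from 200 Hz to 3700 Hz, the two-pole digital resonator standing in for each band-pass filter, and all function names are my own inventions.

```python
import math
import random

SAMPLE_RATE = 8000     # Hz; assumed, the Voder itself was analog
N_BANDS = 10           # from the text: 10 contiguous band-pass filters
LOW_EDGE_HZ = 200.0    # assumed lower edge of the lowest band
BAND_WIDTH_HZ = 350.0  # assumed equal band width (not from the book)


def buzz(n_samples, pitch_hz):
    """Impulse train standing in for the buzz (relaxation) oscillator."""
    period = max(1, round(SAMPLE_RATE / pitch_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def noise(n_samples, seed=0):
    """Uniform white noise standing in for the noise source."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def band_pass(signal, centre_hz, bandwidth_hz):
    """Two-pole resonator with zeros at DC and Nyquist (a stand-in filter)."""
    r = math.exp(-math.pi * bandwidth_hz / SAMPLE_RATE)
    theta = 2 * math.pi * centre_hz / SAMPLE_RATE
    a1, a2 = 2 * r * math.cos(theta), -r * r
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = (x - x2) + a1 * y1 + a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def voder_frame(gains, voiced, pitch_hz=120.0, n_samples=800):
    """One stretch of output: gains ~ ten finger keys, voiced ~ wrist bar,
    pitch_hz ~ foot pedal."""
    if len(gains) != N_BANDS:
        raise ValueError("expected one gain per band")
    source = buzz(n_samples, pitch_hz) if voiced else noise(n_samples)
    mixed = [0.0] * n_samples
    for band, gain in enumerate(gains):
        centre = LOW_EDGE_HZ + BAND_WIDTH_HZ * (band + 0.5)
        filtered = band_pass(source, centre, BAND_WIDTH_HZ)
        for n, value in enumerate(filtered):
            mixed[n] += gain * value  # potentiometer gain, then summation
    return mixed
```

Calling voder_frame([1.0, 0.8, 0.2, 0, 0, 0, 0, 0, 0, 0], voiced=True) gives a buzzy tone weighted toward the low bands, and voiced=False swaps in the noise source with the same spectral weighting. An operator's year of training was, in effect, learning to move those ten gains fast enough to form speech.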