Howmanywordsisthis

By Nancy Huynh

March 18, 2012

As you read this sentence, it is easy to tell where one word ends and another begins — it is shown by the spaces between them. But what about in spoken language? Although pauses exist in speech, usually at the ends of sentences, the vast majority of oral communication occurs without stopping. Turn on the television or radio to a station in an unfamiliar language and listen. It will likely sound like a flood of miscellaneous noises running together. An acoustic signal, in fact, will show no pauses. However, to a native speaker, that continuous stream sounds perfectly understandable.

Certain cues in the speech stream help identify the separation between words. While adults may learn to pick out words that translate to meaningful terms in their own language, young children cannot depend on this method when learning their first language. Dr. Gaja Jarosz, assistant professor of linguistics at Yale, investigates the various types of cues used for word segmentation. Previous linguistic studies have artificially manipulated characteristics of a speech stream to identify cues to which children may be sensitive, but Jarosz seeks to discover to what extent these properties are present in natural language input, as well as how informative they are for children learning languages.

What is Computational Linguistics?

While linguists can take various approaches, including fieldwork and the detailed study of individual language systems, the subfield of computational linguistics seems more akin to computer science than to any so-called “softer” science. In particular, computational linguistics involves the creation of statistical and computer models to simulate certain aspects of language such as grammar or, in Jarosz’s case, word boundaries. It is the study of the “machinery,” as Jarosz describes it, “or the formal system, that underlies our knowledge of language … how we acquire it and how we process it.” This encompasses sound structure, syntax, combinations of words, and meanings of utterances.

In her most recent project, Jarosz and former Yale undergraduate J. Alex Johnson employed computational linguistics to look at various properties of speech when adults speak to children, because this is the input for children first acquiring language. Computer models helped to discover how well these signals could predict and identify word boundaries. By analyzing phonetic transcriptions of English, Polish, and Turkish, Jarosz compared the different cues present across these three languages and their involvement in learning.

Differences in Language

Whether two languages count as similar or different depends on how they are compared. Jarosz chose to focus on Turkish, Polish, and English based on certain distinctive structural characteristics, particularly their morphology and syllable complexity. The morpheme, the smallest meaningful unit of a language, is the basic unit of morphology, the internal grammatical structure of words. For example, in English, the morpheme “–s” distinguishes plural from singular (e.g. dog vs. dogs), and the morpheme “–ed” indicates the past tense (e.g. play vs. played), but these indicators vary across languages. Polish and Turkish, for example, both have more complicated morphologies than English, involving cases, verb endings, and other changes dependent on grammatical context, and Turkish has the most complicated morphology of the three. Because children first learning a language have to parse these complexities and determine the distinction between words and morphemes, it would hypothetically be more difficult to learn a morphologically complicated language like Turkish.

In syllable structure, by contrast, Turkish is the simplest of the three, which should make its syllables easier to pronounce. Mandarin, Japanese, and most African languages are alike in this respect. English contains rather complex syllables, with multiple consonants at the beginnings or ends of words, while Polish, like Slavic languages in general, has even larger, more difficult ones.

Within languages, child-directed speech contains important differences from adult-to-adult speech. The formation of vowel sounds, for example, can be plotted acoustically in a 2-D plane to portray “vowel space.” This plot is made by putting a measurement of how far forward the tongue is (from the front to back of the mouth) on one axis and its height on the other. Jarosz explains, “When we talk to kids, the space gets stretched out more toward the extremes.” She studies this modified speech, as opposed to ordinary speech, because it is the specific input to which children are exposed during the language acquisition process.

In spite of all these possible variations in morphology, syllable complexity, and other linguistic properties of languages, multilingual children seem to have no problem learning multiple languages, even if they are morphologically different. After a certain point, however, it becomes harder to learn a language fluently. Jarosz attributes this to what she calls “unlearning.” To illustrate this point, Jarosz pronounced a pair of very similar-sounding Polish syllables that young children can distinguish easily. “Children learning English after about a year or even less will learn not to distinguish them,” she says. “They will unlearn this difference and start to put them together in a single category.” In contrast, children learning Polish will maintain the ability to differentiate between the two sounds because they exist as two different categories in that language.

Distributional Cues

Across the three markedly different languages of English, Turkish, and Polish, Jarosz studied the predictive capability of 176 cues for word boundaries, such as stress patterns and transitional probability. Languages tend to put stresses near the edges of words (on the first or last syllable, or on the second syllable from either edge), but children are not aware of this relationship when they first start learning languages. They have to learn, for instance, that the stressed syllable is at the beginning of the word in English but at the end in French. “The sorts of cues that children ultimately use to figure out where the word boundaries are create a kind of chicken and egg problem for the learner,” says Jarosz. While children need to know the word boundaries to know where the stress is, it seems that they also need to know where stress is to identify word boundaries.

Another boundary cue, the transitional probability, typically assesses the probability of seeing a certain phoneme (the basic, distinctive sound unit in language) given the previous one. For instance, what is the likelihood of seeing “o” next, given “d,” as in the word “dog”? The probability should be higher within words because the sounds/phonemes that make up the word always go together, but it should be lower at word boundaries since the next word can begin with any sound. Thus, dips in probability should predict the pattern of word segmentation. This property, among others, can also be calculated in either the forward or reverse direction, the latter instead looking at the likelihood of seeing a certain phoneme given the subsequent one. This time, given “o,” what is the chance of seeing “d” before it, as in “dog”?
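To make the idea concrete, here is a minimal sketch of forward and backward transitional probabilities, assuming a toy corpus in which letters stand in for phonemes and utterances carry no spaces. The function names and the "local minimum" rule for placing boundaries are illustrative assumptions, not the procedure used in Jarosz's study.

```python
from collections import Counter

def transitional_probabilities(utterances):
    """Estimate bigram transitional probabilities from unsegmented utterances.

    forward[(a, b)]  estimates P(b comes next | a)
    backward[(a, b)] estimates P(a came before | b)
    Letters stand in for phonemes here (an assumption for illustration).
    """
    bigrams, firsts, seconds = Counter(), Counter(), Counter()
    for utt in utterances:
        for a, b in zip(utt, utt[1:]):
            bigrams[(a, b)] += 1
            firsts[a] += 1    # times a appears as the first member of a pair
            seconds[b] += 1   # times b appears as the second member of a pair
    forward = {bg: n / firsts[bg[0]] for bg, n in bigrams.items()}
    backward = {bg: n / seconds[bg[1]] for bg, n in bigrams.items()}
    return forward, backward

def segment_at_dips(utt, probs):
    """Insert a space wherever the probability is lower than at both neighbors."""
    if len(utt) < 2:
        return utt
    # Unseen pairs get probability 0.0 (a simplifying assumption).
    tps = [probs.get((a, b), 0.0) for a, b in zip(utt, utt[1:])]
    out = [utt[0]]
    for i, tp in enumerate(tps):
        left = tps[i - 1] if i > 0 else float("inf")
        right = tps[i + 1] if i + 1 < len(tps) else float("inf")
        if tp < left and tp < right:
            out.append(" ")
        out.append(utt[i + 1])
    return "".join(out)

corpus = ["thedog", "bigdog", "thecat"]  # hypothetical toy data
forward, backward = transitional_probabilities(corpus)
print(forward[("d", "o")])                 # "o" always follows "d" here: 1.0
print(forward[("e", "d")])                 # "e" is followed by "d" or "c": 0.5
print(segment_at_dips("thedog", forward))  # prints "the dog"
```

In this tiny corpus the only dip falls between "e" and "d", which is exactly the word boundary. Real child-directed speech is far noisier, which is why the study compares many cues rather than relying on one.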

Across all three languages, Jarosz found that the best cues were calculated in the reverse direction, which corroborates recent experimental findings that infants can track this information. However, the cues were informative to different extents in the different languages. The greatest predictor in English was boundary-predicting backwards phoneme-level trigram probability — the probability that the previous phoneme would be an indicator of word boundary, given two subsequent phonemes, instead of one. However, the same trend was not evident in Polish or Turkish.
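One possible reading of this boundary-predicting backward cue can be sketched as follows: from a corpus whose word boundaries are already known, estimate how often a boundary immediately precedes a given pair of phonemes. The function name, the toy data, and the use of letters as phonemes are all assumptions made for illustration; the exact definition used in the study may differ.

```python
from collections import Counter

def boundary_given_next_two(utterances):
    """Estimate P(word boundary here | the next two phonemes).

    `utterances` is a list of utterances, each a list of words, each word a
    string of phoneme symbols (letters stand in for phonemes here).
    Utterance-initial positions are skipped, since a boundary there is given.
    """
    total, at_boundary = Counter(), Counter()
    for words in utterances:
        phones, starts = [], set()
        for word in words:
            starts.add(len(phones))  # index where this word begins
            phones.extend(word)
        for i in range(1, len(phones) - 1):
            context = (phones[i], phones[i + 1])  # the two following phonemes
            total[context] += 1
            if i in starts:
                at_boundary[context] += 1
    return {ctx: at_boundary[ctx] / n for ctx, n in total.items()}

# Hypothetical toy corpus with known word boundaries.
corpus = [["the", "dog"], ["big", "dog"], ["the", "dogs"], ["a", "window"]]
probs = boundary_given_next_two(corpus)
print(probs[("d", "o")])  # boundary before "do" in 3 of 4 cases: 0.75
print(probs[("o", "g")])  # never a boundary before "og": 0.0
```

The pair "do" usually starts a word in this corpus, but the "do" inside "window" pulls the estimate below 1, illustrating why a single cue is informative without being decisive.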

Jarosz emphasizes the most important finding: while at least some individual cues in English are decently informative, no single one in Turkish or Polish is good by itself for learning about word segmentation. In other words, some of the 176 cues (stress, transitional probability, etc.) by themselves are somewhat useful in determining breaks between words in English, but none of them are individually helpful in the other two languages studied. When multiple cues are combined, though, such a clear advantage for English disappears. Thus, children must be paying attention to more than one word segmentation cue.

With this finding in mind, the next step is to figure out how children integrate multiple cues to locate word boundaries in languages such as Turkish and Polish. Understanding this process could also help improve computational models of word segmentation for languages other than English. “A lot of models have been tested on English in particular but don’t work quite so well on other languages,” says Jarosz. “We have to make sure that we know how children do this in other languages as well, because obviously they do learn other languages, too!”

A language world map such as this one depicts the location of various language families. Courtesy of Freelang.

About the Author
Nancy Huynh is a junior Molecular, Cellular, and Developmental Biology major in Silliman College. She works in Dr. Barbara Kazmierczak’s lab studying the impact of antibiotics on the gut microbiome and vaccination response.

Acknowledgments
The author would like to thank Professor Jarosz for taking the time to explain her linguistics research.

