Uriel Cohen Priva has been awarded a grant from NSF.
Human language use reflects the nature of human communication. For instance, frequent words tend to have fewer sounds than infrequent ones, which facilitates quick production and understanding. However, little is known about more fine-grained distinctions. For instance, English has more /k/ than /p/ sounds. Does that reflect a property of human language and its physiological and perceptual nature, or is it a historical accident? Answering such questions requires comparative data on the frequency and phonological makeup of words in many languages. This project will build on existing textual sources and word frequency lists to provide the phonological makeup of words in close to 200 low-resource languages. The phonological word lists will be an invaluable resource for understanding human language and will supply much-needed linguistic resources for low-resource languages. The outputs of the project will be made public and easily accessible, thereby assisting in documenting and teaching the processed languages, and in building computational linguistic resources such as text-to-speech engines.
The research team, including trained undergraduate and graduate students, will create rules that translate alphabets into phonemic representations for multiple languages. The team will then collect textual resources and word frequency lists from publicly available sources such as online Bibles, newspapers, and movie subtitles. The rules will be applied separately to each source, and the resulting phonological representations will be made publicly available, so that not only researchers but also the general public will be able to use and interact with the data. The researchers will then use the data to investigate whether the information-theoretic properties of sounds are distributionally universal: do sounds tend to provide similar amounts of information cross-linguistically, and if so, does their information content correlate with their phonetic properties? Universality is an age-old question, and the similarities and differences of properties across languages can provide new insights into language use. Specifically, the researchers will use information-theoretic properties to predict whether low information or other previously studied phonological properties are likely to promote consonant weakening in those languages.
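As a rough illustration of the pipeline described above, the following minimal Python sketch applies spelling-to-phoneme rules to a word frequency list and then estimates how much information each sound carries. Everything in it is an assumption made for illustration: the rule table is a toy example that does not describe any real language, the function names are hypothetical, and the measure used (frequency-weighted unigram surprisal, -log2 of a sound's relative frequency) is a deliberately simple stand-in rather than the measure the project itself will use.

```python
import math
from collections import Counter

# Hypothetical toy rules mapping spellings (graphemes) to phonemes.
# They are illustrative only and do not describe any real language.
TOY_RULES = {
    "sh": "ʃ", "ch": "tʃ",
    "a": "a", "i": "i",
    "k": "k", "p": "p", "t": "t", "s": "s", "h": "h", "c": "k",
}


def to_phonemes(word, rules=TOY_RULES):
    """Convert a spelled word to a phoneme list by greedy longest match."""
    max_len = max(len(grapheme) for grapheme in rules)
    phonemes = []
    i = 0
    while i < len(word):
        # Try the longest candidate grapheme first, then shorter ones.
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                phonemes.append(rules[chunk])
                i += size
                break
        else:
            raise ValueError(f"no rule covers {word[i]!r} in {word!r}")
    return phonemes


def phoneme_surprisal(word_freqs, rules=TOY_RULES):
    """Return -log2 p(phoneme), weighting each word by its frequency."""
    counts = Counter()
    for word, freq in word_freqs.items():
        for phoneme in to_phonemes(word, rules):
            counts[phoneme] += freq
    total = sum(counts.values())
    return {ph: -math.log2(n / total) for ph, n in counts.items()}


if __name__ == "__main__":
    # Made-up frequency list, standing in for counts from one text source.
    toy_freqs = {"ka": 1, "ki": 1}
    print(to_phonemes("ship"))
    print(phoneme_surprisal(toy_freqs))
```

In this toy list, /k/ accounts for half of all sound tokens and so carries 1 bit, while /a/ and /i/ each account for a quarter and carry 2 bits. Comparing such values for the same sound across many languages is one simple way to frame the universality question; a context-sensitive measure would need conditional probabilities rather than raw frequencies.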
This award reflects NSF’s statutory mission and has been deemed worthy of support through evaluation using the Foundation’s intellectual merit and broader impacts review criteria.