STATISTICAL MEASURES IN CORPUS LINGUISTICS

Authors

  • Tilolova Farangiz Murodovna, Jizzakh, Uzbekistan

Abstract

Statistical measures turn raw text into interpretable evidence in corpus linguistics by quantifying prevalence, association, distribution, and diversity. This article describes the key statistics used by corpus researchers — raw and normalized frequency, dispersion indices, association measures (MI/PMI, T-score, log-likelihood), and lexical diversity indices (TTR, MTLD) — explaining formulas, interpretive strengths, and common pitfalls. I present a compact methodological guide for selecting and triangulating measures based on corpus size and research goals, and include a worked example comparing MI and log-likelihood rankings to show how different measures prioritize collocations. A summary table lists each measure, purpose, strengths, and weaknesses, while an illustrative figure contrasts MI and T-score behavior across frequency ranges. The literature review situates these measures historically and recommends best practices for reliable reporting: normalize frequencies, check dispersion, and report at least two complementary association scores. Applications to lexicography, language teaching, and NLP are outlined. The article concludes that robust corpus analysis depends on transparent choice and combination of statistics rather than any single metric.
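As a rough illustration of how the association measures named above can rank collocations differently, the following minimal Python sketch computes MI, T-score, and log-likelihood (G2) from a 2x2 contingency table, plus a per-million normalized frequency. The function names (`association_scores`, `per_million`) and all counts are hypothetical and invented for illustration; they are not taken from the article's worked example. The formulas follow the common textbook definitions, which may differ in detail from the variants the article uses.

```python
import math


def per_million(count, corpus_size):
    """Normalized frequency: occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def association_scores(o11, f1, f2, n):
    """Return (MI, T-score, G2) for one word pair.

    o11 : observed co-occurrences of the pair
    f1  : corpus frequency of the first word
    f2  : corpus frequency of the second word
    n   : corpus size in tokens
    """
    e11 = f1 * f2 / n                      # expected co-occurrences under independence
    mi = math.log2(o11 / e11)              # mutual information (PMI)
    t_score = (o11 - e11) / math.sqrt(o11)

    # Remaining cells of the 2x2 contingency table and their expected values
    observed = [o11, f1 - o11, f2 - o11, n - f1 - f2 + o11]
    expected = [
        e11,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    ]
    # Log-likelihood ratio; empty cells contribute nothing to the sum
    g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    return mi, t_score, g2


# Hypothetical counts in a one-million-token corpus (illustrative only)
N = 1_000_000
rare = association_scores(o11=5, f1=10, f2=10, n=N)            # rare, exclusive pair
frequent = association_scores(o11=500, f1=5000, f2=20000, n=N)  # frequent, looser pair

for label, (mi, t, g2) in [("rare", rare), ("frequent", frequent)]:
    print(f"{label:9s} MI={mi:6.2f}  T={t:6.2f}  G2={g2:8.2f}")
```

With these invented counts, the rare pair receives the higher MI while the frequent pair receives the higher T-score and G2, which is the kind of divergence that motivates reporting at least two complementary association scores.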

 

Published

2025-12-15

How to Cite

Tilolova Farangiz Murodovna. (2025). STATISTICAL MEASURES IN CORPUS LINGUISTICS. ZAMONAVIY TA’LIMDA FAN VA INNOVATSION TADQIQOTLAR, 3(13), 193–199. Retrieved from http://zamtadqiqot.uz/index.php/zt/article/view/1736