
Abstract

Text analysis is an active research area in data science, with applications in artificial intelligence, biomedical research, and engineering. We review popular methods for text analysis, ranging from topic modeling to recent neural language models. In particular, we review Topic-SCORE, a statistical approach to topic modeling, and discuss how to use it to analyze the Multi-Attribute Data Set on Statisticians (MADStat), a data set on statistical publications that we collected and cleaned. The application of Topic-SCORE and other methods to MADStat leads to interesting findings. For example, we identify 11 representative topics in statistics. For each journal, the evolution of topic weights over time can be visualized, and these results are used to analyze the trends in statistical research. We also propose a new statistical model for ranking the citation impacts of the 11 topics, and we build a cross-topic citation graph to illustrate how research results on different topics spread to one another. The results on MADStat provide a data-driven picture of statistical research from 1975 to 2015, from a text analysis perspective.
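
Topic-SCORE, the method highlighted above, estimates topics from a singular value decomposition of the word-document frequency matrix. The following is a minimal, illustrative sketch of that idea, not the authors' implementation: the toy corpus, the choice K = 3, and the use of k-means as a stand-in for the vertex-hunting step are all simplifying assumptions, and the final topic-matrix recovery step is omitted.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy corpus: p words, n documents, K topics (hypothetical sizes).
p, n, K = 50, 200, 3
A = rng.dirichlet(np.ones(p), size=K).T   # p x K topic matrix (columns sum to 1)
W = rng.dirichlet(np.ones(K), size=n).T   # K x n topic-weight matrix
M = A @ W                                 # expected word frequencies per document
D = np.vstack([rng.multinomial(100, M[:, j] / M[:, j].sum())
               for j in range(n)]).T      # p x n word-document count matrix

# Step 1: normalize each document's counts into word frequencies.
F = D / D.sum(axis=0, keepdims=True)

# Step 2: SVD of the frequency matrix; keep the K leading left singular vectors.
U, _, _ = np.linalg.svd(F, full_matrices=False)
xi = U[:, :K]
xi[:, 0] = np.abs(xi[:, 0])   # the leading singular vector is entrywise one-signed

# Step 3: SCORE normalization -- entrywise ratios against the leading vector
# remove word-frequency heterogeneity; each word becomes a point in
# (K-1)-dimensional ratio space, lying near a simplex with K vertices.
R = xi[:, 1:] / xi[:, [0]]

# Step 4: crude vertex hunting -- k-means centers stand in for the
# estimated simplex vertices (one per topic).
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(R)
print("estimated simplex vertices:\n", km.cluster_centers_)

A real analysis would replace the k-means step with a dedicated vertex-hunting algorithm and then map each word's barycentric coordinates back into an estimate of the word-topic matrix.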

