
Abstract

Text analysis is an active research area in data science, with applications in artificial intelligence, biomedical research, and engineering. We review popular methods for text analysis, ranging from topic modeling to recent neural language models. In particular, we review Topic-SCORE, a statistical approach to topic modeling, and discuss how to use it to analyze the Multi-Attribute Data Set on Statisticians (MADStat), a data set on statistical publications that we collected and cleaned. The application of Topic-SCORE and other methods to MADStat leads to interesting findings. For example, we identify 11 representative topics in statistics. For each journal, the evolution of topic weights over time can be visualized, and these results are used to analyze the trends in statistical research. We also propose a new statistical model for ranking the citation impacts of the 11 topics, and we build a cross-topic citation graph to illustrate how research results on different topics spread to one another. The results on MADStat provide a data-driven picture of statistical research from 1975 to 2015, from a text analysis perspective.
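
Topic-SCORE, the method highlighted above, estimates topics from a singular value decomposition of the word-document frequency matrix. The following is a minimal, illustrative sketch of that idea, not the authors' implementation: the toy corpus, the choice K = 3, and the use of k-means as a stand-in for the vertex-hunting step are all simplifying assumptions, and the final topic-matrix recovery step is omitted.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy corpus: p words, n documents, K topics (hypothetical sizes).
p, n, K = 50, 200, 3
A = rng.dirichlet(np.ones(p), size=K).T   # p x K topic matrix (columns sum to 1)
W = rng.dirichlet(np.ones(K), size=n).T   # K x n topic-weight matrix
M = A @ W                                 # expected word frequencies per document
D = np.vstack([rng.multinomial(100, M[:, j] / M[:, j].sum())
               for j in range(n)]).T      # p x n word-document count matrix

# Step 1: normalize each document's counts into word frequencies.
F = D / D.sum(axis=0, keepdims=True)

# Step 2: SVD of the frequency matrix; keep the K leading left singular vectors.
U, _, _ = np.linalg.svd(F, full_matrices=False)
xi = U[:, :K]
xi[:, 0] = np.abs(xi[:, 0])   # the leading singular vector is entrywise one-signed

# Step 3: SCORE normalization -- entrywise ratios against the leading vector
# remove word-frequency heterogeneity; each word becomes a point in
# (K-1)-dimensional ratio space, lying near a simplex with K vertices.
R = xi[:, 1:] / xi[:, [0]]

# Step 4: crude vertex hunting -- k-means centers stand in for the
# estimated simplex vertices (one per topic).
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(R)
print("estimated simplex vertices:\n", km.cluster_centers_)

A real analysis would replace the k-means step with a dedicated vertex-hunting algorithm and then map each word's barycentric coordinates back into an estimate of the word-topic matrix.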

