1932

Abstract

Entity resolution is the process of merging and removing duplicate records from multiple data sources, often in the absence of unique identifiers. Bayesian models for entity resolution allow one to include a priori information, quantify uncertainty in important applications, and directly estimate a partition of the records. Markov chain Monte Carlo (MCMC) sampling is the primary computational method for approximate posterior inference in this setting, but due to the high dimensionality of the space of partitions, there are no agreed upon standards for diagnosing nonconvergence of MCMC sampling. In this article, we review Bayesian entity resolution, with a focus on the specific challenges that it poses for the convergence of a Markov chain. We review prior methods for convergence diagnostics, discussing their weaknesses. We provide recommendations for using MCMC sampling for Bayesian entity resolution, focusing on the use of modern diagnostics that are commonplace in applied Bayesian statistics. Using simulated data, we find that a commonly used Gibbs sampler performs poorly compared with two alternatives.

Loading

Article metrics loading...

/content/journals/10.1146/annurev-statistics-040522-114848
2024-04-22
2024-05-07
Loading full text...

Full text loading...

/deliver/fulltext/statistics/11/1/annurev-statistics-040522-114848.html?itemId=/content/journals/10.1146/annurev-statistics-040522-114848&mimeType=html&fmt=ahah

Literature Cited

  1. Aleshin-Guendel S. 2023.. multilink: Multifile record linkage and duplicate detection. . R Package, version 0.1.1. https://cran.r-project.org/package=multilink
    [Google Scholar]
  2. Aleshin-Guendel S, Sadinle M. 2023.. Multifile partitioning for record linkage and duplicate detection. . J. Am. Stat. Assoc. 118:(543):178695
    [Crossref] [Google Scholar]
  3. Avoundjian T, Dombrowski JC, Golden MR, Hughes JP, Guthrie BL, et al. 2020.. Comparing methods for record linkage for public health action: matching algorithm validation study. . JMIR Public Health Surveill. 6:(2):e15917
    [Crossref] [Google Scholar]
  4. Ball P, Price M. 2019.. Using statistics to assess lethal violence in civil and inter-state war. . Annu. Rev. Stat. Appl. 6::6384
    [Crossref] [Google Scholar]
  5. Besag J, Green PJ. 1993.. Spatial statistics and Bayesian computation. . J. R. Stat. Soc. Ser. B 55:(1):2537
    [Crossref] [Google Scholar]
  6. Betancourt B, Sosa J, Rodríguez A. 2022a.. A prior for record linkage based on allelic partitions. . Comput. Stat. Data Anal. 172::107474
    [Crossref] [Google Scholar]
  7. Betancourt B, Zanella G, Steorts RC. 2022b.. Random partition models for microclustering tasks. . J. Am. Stat. Assoc. 117:(539):121527
    [Crossref] [Google Scholar]
  8. Binder DA. 1978.. Bayesian cluster analysis. . Biometrika 65:(1):3138
    [Crossref] [Google Scholar]
  9. Binette O, Steorts RC. 2022.. (Almost) all of entity resolution. . Sci. Adv. 8:(12):eabi8021
    [Crossref] [Google Scholar]
  10. Bird SM, King R. 2018.. Multiple systems estimation (or capture-recapture estimation) to inform public policy. . Annu. Rev. Stat. Appl. 5::95118
    [Crossref] [Google Scholar]
  11. Brooks SP, Gelman A, Jones G, Meng XL 2011.. Handbook of Markov Chain Monte Carlo. Boca Raton, FL:: CRC
  12. Bürkner PC, Gabry J, Kay M, Vehtari A. 2020.. posterior: Tools for working with posterior distributions. . R Package, version 1.4.1. https://mc-stan.org/posterior/
    [Google Scholar]
  13. Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, et al. 2017.. Stan: a probabilistic programming language. . J. Stat. Softw. 76:(1):132
    [Crossref] [Google Scholar]
  14. Chen B, Shrivastava A, Steorts RC. 2018.. Unique entity estimation with application to the Syrian conflict. . Ann. Appl. Stat. 12:(2):103967
    [Google Scholar]
  15. Christen P. 2012.. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Heidelberg, Ger.:: Springer
  16. Christophides V, Efthymiou V, Palpanas T, Papadakis G, Stefanidis K. 2021.. An overview of end-to-end entity resolution for big data. . ACM Comput. Surv. 53:(6):127
    [Crossref] [Google Scholar]
  17. Cowles MK, Carlin BP. 1996.. Markov chain Monte Carlo convergence diagnostics: a comparative review. . J. Am. Stat. Assoc. 91:(434):883904
    [Crossref] [Google Scholar]
  18. Dahl DB, Johnson DJ, Müller P. 2022.. Search algorithms and loss functions for Bayesian clustering. . J. Comput. Graph. Stat. 31:(4):1189201
    [Crossref] [Google Scholar]
  19. Dalzell NM, Reiter JP. 2018.. Regression modeling and file matching using possibly erroneous matching variables. . J. Comput. Graph. Stat. 27:(4):72838
    [Crossref] [Google Scholar]
  20. Dubey P, Müller HG. 2019.. Fréchet analysis of variance for random objects. . Biometrika 106:(4):80321
    [Crossref] [Google Scholar]
  21. Enamorado T, Fifield B, Imai K. 2019.. Using a probabilistic model to assist merging of large-scale administrative records. . Am. Political Sci. Rev. 113:(2):35371
    [Crossref] [Google Scholar]
  22. Fellegi IP, Sunter AB. 1969.. A theory for record linkage. . J. Am. Stat. Assoc. 64:(328):1183210
    [Crossref] [Google Scholar]
  23. Fortini M, Liseo B, Nuccitelli A, Scanu M. 2001.. On Bayesian record linkage. . Res. Off. Stat. 4:(1):18598
    [Google Scholar]
  24. Fortini M, Nuccitelli A, Liseo B, Scanu M. 2002.. Modeling issues in record linkage: a Bayesian perspective. . In Proceedings of the Section on Survey Research Methods, pp. 100813 Alexandria, VA:: Am. Stat. Assoc.
    [Google Scholar]
  25. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. 2013.. Bayesian Data Analysis. Boca Raton, FL:: CRC, 3rd ed.
  26. Gelman A, Rubin DB. 1992.. Inference from iterative simulation using multiple sequences. . Stat. Sci. 7:(4):45772
    [Google Scholar]
  27. Geweke J. 1992.. Evaluating the accuracy of sampling-based approaches to the calculations of posterior moments. . Bayesian Stat. 4::64149
    [Google Scholar]
  28. Geyer CJ. 1991.. Markov chain Monte Carlo maximum likelihood. . In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, ed. EM Keramidas, SM Kaufman , pp. 15663 Fairfax Station, VA:: Interface Found. N. Am.
    [Google Scholar]
  29. Geyer CJ, Thompson EA. 1995.. Annealing Markov chain Monte Carlo with applications to ancestral inference. . J. Am. Stat. Assoc. 90:(431):90920
    [Crossref] [Google Scholar]
  30. Gong L, Flegal JM. 2016.. A practical sequential stopping rule for high-dimensional Markov chain Monte Carlo. . J. Comput. Graph. Stat. 25:(3):684700
    [Crossref] [Google Scholar]
  31. Gutman R, Afendulis CC, Zaslavsky AM. 2013.. A Bayesian procedure for file linking to analyze end-of-life medical costs. . J. Am. Stat. Assoc. 108:(501):3447
    [Crossref] [Google Scholar]
  32. Hof MH, Ravelli AC, Zwinderman AH. 2017.. A probabilistic record linkage model for survival data. . J. Am. Stat. Assoc. 112:(520):150415
    [Crossref] [Google Scholar]
  33. Jain S, Neal RM. 2004.. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. . J. Comput. Graph. Stat. 13:(1):15882
    [Crossref] [Google Scholar]
  34. Jaro MA. 1989.. Advances in record-linkage methodology as applied to matching the 1985 Census of Tampa, Florida. . J. Am. Stat. Assoc. 84:(406):41420
    [Crossref] [Google Scholar]
  35. Jasra A, Holmes C, Stephens D. 2005.. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. . Stat. Sci. 20:(1):5067
    [Crossref] [Google Scholar]
  36. Kaplan A, Betancourt B, Steorts RC. 2022.. A practical approach to proper inference with linked data. . Am. Stat. 76:(4):38493
    [Crossref] [Google Scholar]
  37. Larsen MD. 2005.. Advances in record linkage theory: hierarchical Bayesian record linkage theory. . In Proceedings of the Section on Survey Research Methods, pp. 327784 Alexandria, VA:: Am. Stat. Assoc.
    [Google Scholar]
  38. Larsen MD, Rubin DB. 2001.. Iterative automated record linkage using mixture models. . J. Am. Stat. Assoc. 96:(453):3241
    [Crossref] [Google Scholar]
  39. Liseo B, Tancredi A. 2011.. Bayesian estimation of population size via linkage of multivariate normal data sets. . J. Off. Stat. 27:(3):491505
    [Google Scholar]
  40. Manrique-Vallier D, Ball P, Sadinle M. 2022.. Capture-recapture for casualty estimation and beyond: recent advances and research directions. . In Statistics in the Public Interest: In Memory of Stephen E. Fienberg, ed. AL Carriquiry, JM Tanur, WF Eddy , pp. 1531 New York:: Springer
    [Google Scholar]
  41. Marchant NG, Kaplan A, Elazar DN, Rubinstein BI, Steorts RC. 2021.. d-blink: Distributed end-to-end Bayesian entity resolution. . J. Comput. Graph. Stat. 30:(2):40621
    [Crossref] [Google Scholar]
  42. Marchant NG, Rubinstein BIP, Steorts RC. 2023.. Bayesian graphical entity resolution using exchangeable random partition priors. . J. Surv. Stat. Methodol. 11:(3):56996
    [Crossref] [Google Scholar]
  43. Marinari E, Parisi G. 1992.. Simulated tempering: a new Monte Carlo scheme. . Europhys. Lett. 19:(6):45158
    [Crossref] [Google Scholar]
  44. Matsakis NE. 2010.. Active duplicate detection with Bayesian nonparametric models. PhD Thesis , Mass. Inst. Technol., Cambridge, MA:
  45. McVeigh BS, Spahn BT, Murray JS. 2020.. Scaling Bayesian probabilistic record linkage with post-hoc blocking: an application to the California Great Registers. . arXiv:1905.05337 [stat.ME]
  46. Meilă M. 2007.. Comparing clusterings—an information based distance. . J. Multivariate Anal. 98:(5):87395
    [Crossref] [Google Scholar]
  47. Miller J, Betancourt B, Zaidi A, Wallach H, Steorts RC. 2015.. Microclustering: When the cluster sizes grow sublinearly with the size of the data set. . arXiv:1512.00792 [stat.ME]
  48. Murray JS. 2015.. Probabilistic record linkage and deduplication after indexing, blocking, and filtering. . J. Priv. Confid. 7:(1):324
    [Google Scholar]
  49. Neal RM. 2000.. Markov chain sampling methods for Dirichlet process mixture models. . J. Comput. Graph. Stat. 9:(2):24965
    [Crossref] [Google Scholar]
  50. Newcombe HB, Kennedy JM, Axford S, James AP. 1959.. Automatic linkage of vital records: computers can be used to extract “follow-up” statistics of families from files of routine records. . Science 130:(3381):95459
    [Crossref] [Google Scholar]
  51. Papadakis G, Skoutas D, Thanos E, Palpanas T. 2020.. Blocking and filtering techniques for entity resolution: a survey. . ACM Comput. Surv. 53:(2):31
    [Google Scholar]
  52. Qin Q, Hobert JP. 2021.. On the limitations of single-step drift and minorization in Markov chain convergence analysis. . Ann. Appl. Probability 31:(4):163359
    [Crossref] [Google Scholar]
  53. Raftery AE, Lewis SM. 1996.. Implementing MCMC. In Markov Chain Monte Carlo in Practice, ed. WR Gilks, S Richardson, D Spiegelhalter , pp. 11530 London:: Chapman and Hall
    [Google Scholar]
  54. Rastelli R, Friel N. 2018.. Optimal Bayesian estimators for latent variable cluster models. . Stat. Comput. 28::116986
    [Crossref] [Google Scholar]
  55. Robert CP, Casella G. 2004.. Monte Carlo Statistical Methods. New York:: Springer, 2nd ed.
  56. Robert CP, Elvira V, Tawn N, Wu C. 2018.. Accelerating MCMC algorithms. . Wiley Interdiscip. Rev. Comput. Stat. 10:(5):e1435
    [Crossref] [Google Scholar]
  57. Roberts GO, Rosenthal JS. 2004.. General state space Markov chains and MCMC algorithms. . Probability Surv. 1::2071
    [Crossref] [Google Scholar]
  58. Rosenthal JS. 1995.. Minorization conditions and convergence rates for Markov chain Monte Carlo. . J. Am. Stat. Assoc. 90:(430):55866
    [Crossref] [Google Scholar]
  59. Roy V. 2020.. Convergence diagnostics for Markov chain Monte Carlo. . Annu. Rev. Stat. Appl. 7::387412
    [Crossref] [Google Scholar]
  60. Sadinle M. 2014.. Detecting duplicates in a homicide registry using a Bayesian partitioning approach. . Ann. Appl. Stat. 8:(4):240434
    [Crossref] [Google Scholar]
  61. Sadinle M. 2017.. Bayesian estimation of bipartite matchings for record linkage. . J. Am. Stat. Assoc. 112:(518):60012
    [Crossref] [Google Scholar]
  62. Sadinle M. 2018.. Bayesian propagation of record linkage uncertainty into population size estimation of human rights violations. . Ann. Appl. Stat. 12:(2):101338
    [Crossref] [Google Scholar]
  63. Stam A. 1983.. Generation of a random partition of a finite set by an urn model. . J. Comb. Theory Ser. A 35:(2):23140
    [Crossref] [Google Scholar]
  64. Steorts RC. 2015.. Entity resolution with empirically motivated priors. . Bayesian Anal. 10:(4):84975
    [Crossref] [Google Scholar]
  65. Steorts RC, Barnes M, Neiswanger W. 2017.. Performance bounds for graphical record linkage. . Proc. Mach. Learn. Res. 54::298306
    [Google Scholar]
  66. Steorts RC, Hall R, Fienberg SE. 2014a.. SMERED: a Bayesian approach to graphical record linkage and de-duplication. . J. Mach. Learn. Res. 33::92230
    [Google Scholar]
  67. Steorts RC, Hall R, Fienberg SE. 2016.. A Bayesian approach to graphical record linkage and deduplication. . J. Am. Stat. Assoc. 111:(516):166072
    [Crossref] [Google Scholar]
  68. Steorts RC, Shrivastava A. 2018.. Probabilistic blocking with an application to the Syrian conflict. . In Privacy in Statistical Databases: UNESCO Chair in Data Privacy, International Conference, PSD 2018, Valencia, Spain, September 26–28, 2018, Proceedings, ed. J Domingo-Ferrer , pp. 31427. New York:: Springer
    [Google Scholar]
  69. Steorts RC, Tancredi A, Liseo B. 2018.. Generalized Bayesian record linkage and regression with exact error propagation. . In Privacy in Statistical Databases: UNESCO Chair in Data Privacy, International Conference, PSD 2018, Valencia, Spain, September 26–28, 2018, Proceedings, ed. J Domingo-Ferrer , pp. 297313 New York:: Springer
    [Google Scholar]
  70. Steorts RC, Ventura SL, Sadinle M, Fienberg SE. 2014b.. A comparison of blocking methods for record linkage. . In Privacy in Statistical Databases: UNESCO Chair in Data Privacy, International Conference, PSD 2014, Ibiza, Spain, September 17–19, 2014, Proceedings, ed. J Domingo-Ferrer , pp. 25368 New York:: Springer
    [Google Scholar]
  71. Tancredi A, Liseo B. 2011.. A hierarchical Bayesian approach to record linkage and size population problems. . Ann. Appl. Stat. 5:(2B):155385
    [Crossref] [Google Scholar]
  72. Tancredi A, Liseo B. 2015.. Regression analysis with linked data: problems and possible solutions. . Statistica 75:(1):1935
    [Google Scholar]
  73. Tancredi A, Steorts R, Liseo B. 2020.. A unified framework for de-duplication and population size estimation (with discussion). . Bayesian Anal. 15:(2):63382
    [Crossref] [Google Scholar]
  74. Tang J, Reiter J, Steorts RC. 2020.. Bayesian modeling for simultaneous regression and record linkage. . In Privacy in Statistical Databases: UNESCO Chair in Data Privacy, International Conference, PSD 2020, Tarragona, Spain, September 23–25, 2020, Proceedings, ed. J Domingo-Ferrer, K Muralidhar , pp. 20923 New York:: Springer
    [Google Scholar]
  75. van Dyk DA, Park T. 2008.. Partially collapsed Gibbs samplers. . J. Am. Stat. Assoc. 103:(482):79096
    [Crossref] [Google Scholar]
  76. Vats D, Jones GL. 2021.. Invited discussion: “Rank-normalization, folding, and localization: an improved R for assessing convergence of MCMC. Bayesian Anal. 16:(2):695702
    [Google Scholar]
  77. Vehtari A, Gelman A, Simpson D, Carpenter B, Bürkner PC. 2021.. Rank-normalization, folding, and localization: an improved R for assessing convergence of MCMC (with discussion). . Bayesian Anal. 16:(2):667718
    [Crossref] [Google Scholar]
  78. Wade S, Ghahramani Z. 2018.. Bayesian cluster analysis: point estimation and credible balls (with discussion). . Bayesian Anal. 13:(2):559626
    [Crossref] [Google Scholar]
  79. Winkler WE. 1994.. Advanced methods for record linkage. . In Proceedings of the Section on Survey Research Methods, pp. 46772 Alexandria, VA:: Am. Stat. Assoc.
    [Google Scholar]
  80. Winkler WE. 2006.. Overview of record linkage and current research directions. Work. Pap. , US Census Bur., US Dep. Commer., Washington, DC:
  81. Winkler WE. 2014.. Matching and record linkage. . Wiley Interdiscip. Rev. Comput. Stat. 6:(5):31325
    [Crossref] [Google Scholar]
  82. Winkler WE, Thibaudeau Y. 1991.. An application of the Fellegi–Sunter model of record linkage to the 1990 U.S. decennial Census. Work. Pap. 91-09 , US Census Bur., US Dep. Commer., Washington, DC:
  83. Zanella G. 2020.. Informed proposals for local MCMC in discrete spaces. . J. Am. Stat. Assoc. 115:(530):85265
    [Crossref] [Google Scholar]
  84. Zanella G, Betancourt B, Wallach H, Miller J, Zaidi A, Steorts RC. 2016.. Flexible models for microclustering with application to entity resolution. . In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, ed. DD Lee, U von Luxburg, R Garnett, M Sugiyama, I Guyon , pp. 142533 Red Hook, NY:: Curran
    [Google Scholar]
/content/journals/10.1146/annurev-statistics-040522-114848
Loading
/content/journals/10.1146/annurev-statistics-040522-114848
Loading

Data & Media loading...

Supplemental Material

Supplemental Material

  • Article Type: Review Article
This is a required field
Please enter a valid email address
Approval was a Success
Invalid data
An Error Occurred
Approval was partially successful, following selected items could not be processed due to error