Michał Kot https://orcid.org/0000-0002-7885-4028 , Bogumił Kamiński https://orcid.org/0000-0002-0678-282X
ARTICLE

(English) PDF

ABSTRACT

In this article, we investigate contingency tables where the entries containing small counts are unknown for data privacy reasons. We propose and test two competitive methods for estimating the unknown entries: our modification of the Iterative Proportional Fitting Procedure (IPFP), and one of the Monte Carlo Markov Chain methods called Shake-and-Bake. We use simulation experiments to test these methods in terms of time complexity and the accuracy of searching the space of feasible solutions. To simplify the estimation procedure, we propose to pre-process partially unknown contingency tables with simple heuristics and dimensionality-reduction techniques to find and fill all trivial entries. Our results demonstrate that if the number of missing cells is not very large, the pre-processing is often enough to find fillings for the unknown values in contingency tables. In the cases where simple heuristics are insufficient, the Shake-and-Bake technique outperforms the modified IPFP in terms of time complexity and the accuracy of searching the space of feasible solutions.

KEYWORDS

contingency tables, Markov Chain Monte Carlo, Iterative Proportional Fitting Procedure

JEL

C15, C44

REFERENCES

Aoki, S., Hara, H., & Takemura, A. (2012). Running Markov Chain Without Markov Bases. In S. Aoki, H. Hara & A. Takemura (Eds.), Markov Bases in Algebraic Statistics (pp. 275–286). Springer. https://doi.org/10.1007/978-1-4614-3719-2_16.

Bacharach, M. (1965). Estimating Nonnegative Matrices from Marginal Data. International Economic Review, 6(3), 294–310. https://doi.org/10.2307/2525582.

Bailey, N. T. J. (1995). Statistical methods in biology (3rd edition). Cambridge University Press. https://doi.org/10.1017/CBO9781139170840.

Bezanson, J., Edelman, A., Karpinski, S., & Shah, V. B. (2017). Julia: A Fresh Approach to Numerical Computing. SIAM Review, 59(1), 65–98. https://doi.org/10.1137/141000671.

Boender, C. G. E., Caron, R. J., McDonald, J. F., Kan, A. H. G. R., Romeijn, H. E., Smith, R. L., Telgen, J., & Vorst, A. C. F. (1991). Shake-and-Bake Algorithms for Generating Uniform Points on the Boundary of Bounded Polyhedra. Operations Research, 39(6), 945–954. https://doi.org/10.1287/opre.39.6.945.

Bretto, A. (2013). Hypergraph Theory. Springer. https://doi.org/10.1007/978-3-319-00080-0.

Csiszár, I. (1975). I-Divergence Geometry of Probability Distributions and Minimization Problems. The Annals of Probability, 3(1), 146–158. https://doi.org/10.1214/aop/1176996454.

Deming, W. E., & Stephan, F. F. (1940). On a Least Squares Adjustment of a Sampled Frequency Table When the Expected Marginal Totals are Known. The Annals of Mathematical Statistics, 11(4), 427–444. https://doi.org/10.1214/aoms/1177731829.

Diaconis, P., & Sturmfels, B. (1998). Algebraic Algorithms for Sampling from Conditional Distributions. Annals of Statistics, 26(1), 363–397. https://doi.org/10.1214/aos/1030563990.

Dobra, A. (2003). Markov bases for decomposable graphical models. Bernoulli, 9(6), 1093–1108. https://doi.org/10.3150/bj/1072215202.

Dobra, A. (2012). Dynamic Markov Bases. Journal of Computational and Graphical Statistics, 21(2), 496–517. https://doi.org/10.1080/10618600.2012.663285.

Dobra, A., & Fienberg, S. E. (2010). The Generalized Shuttle Algorithm. In P. Gibilisco, E. Riccomagno, M. P. Rogantin & H. P. Wynn (Eds.), Algebraic and Geometric Methods in Statistics (pp. 135–156). Cambridge University Press. https://doi.org/10.1017/CBO9780511642401.

Eastman, S. T., & Ferguson, D. A. (2012). Media Programming: Strategies and Practice (9th edition). Wadsworth, Cengage Learning.

Facebook.com. (n.d.). Facebook lookalike audiences. Retrieved January 23, 2021, from https://www.facebook.com/business/a/custom-to-lookalike-audiences.

Fienberg, S. E. (1970). An Iterative Procedure for Estimation in Contingency Tables. The Annals of Mathematical Statistics, 41(3), 907–917. https://doi.org/10.1214/aoms/1177696968.

Google.com. (n.d.). About similar segments on the Display Network. Retrieved January 23, 2021, from https://support.google.com/google-ads/answer/2676774?hl=en.

Kot, M., & Kamiński, B. (2021). Agent Based Model of Cross Media Reach of Advertising. In P. Ahrweiler & M. Neumann (Eds.), Advances in Social Simulation: Proceedings of the 15th Social Simulation Conference: 23–27 September 2019 (pp. 45–57). Springer. https://doi.org/10.1007/978-3-030-61503-1_5.

Kroese, D. P., Taimre, T., & Botev, Z. I. (2011). Handbook of Monte Carlo Methods. John Wiley & Sons. https://doi.org/10.1002/9781118014967.

Levin, D. A., Peres, Y., & Wilmer, E. L. (2009). Markov Chains and Mixing Times (2nd edition). American Mathematical Society. http://dx.doi.org/10.1090/mbk/058.

Lovelace, R., Birkin, M., Ballas, D., & van Leeuwen, E. (2015). Evaluating the performance of Iterative Proportional Fitting for spatial microsimulation: new tests for an established technique. Journal of Artificial Societies and Social Simulation, 18(2), 1–25. http://dx.doi.org/%2010.18564/jasss.2768.

Nethercote, N., Stuckey, P. J., Becket, R., Brand, S., Duck, G. J., & Tack, G. (2007). MiniZinc: Towards a Standard CP Modelling Language. In C. Bessiere (Ed.), Principles and Practice of Constraint Programming – CP 2007 (pp. 529–543). Springer. https://doi.org/10.1007/978-3-540-74970-7_38.

Payne, G., & Payne, J. (2011). Key Concepts in Social Research. SAGE Publications. https://dx.doi.org/10.4135/9781849209397.

R Core Team. (2018). R: A language and environment for statistical computing. Retrieved January 23, 2021, from https://www.R-project.org.

Slavković, A. B. (2010). Partial Information Releases for Confidential Contingency Table Entries: Present and Future Research Efforts. Journal of Privacy and Confidentiality, 1(2), 253–264. https://doi.org/10.29012/jpc.v1i2.577.

Slavković, A. B., & Lee, J. (2010). Synthetic two-way contingency tables that preserve conditional frequencies. Statistical Methodology, 7(3), 225–239. https://doi.org/10.1016/j.stamet.2009.11.002.

van Valkenhoef, G., & Tervonen, T. (2019, January 8). ‘Hit and Run’ and ‘Shake and Bake’ for Sampling Uniformly from Convex Shapes. https://cran.r-project.org/web/packages/hitandrun/hitandrun.pdf.

Xie, C., Zhong, W., & Mueller, K. (2017). A Visual Analytics Approach for Categorical Joint Distribution Reconstruction from Marginal Projections. IEEE Transactions on Visualization and Computer Graphics, 23(1), 51–60. https://doi.org/10.1109/TVCG.2016.2598479.

Zelterman, D., & Louis, T. A. (2019). Contingency Tables in Medical Studies. In J. C. Bailar & F. Mosteller (Eds.), Medical Uses of Statistics (2nd edition; pp. 293–310). CRC Press. https://doi.org/10.1201/9780429187445.

Back to top
© 2019–2022 Copyright by Statistics Poland, some rights reserved. Creative Commons Attribution-ShareAlike 4.0 International Public License (CC BY-SA 4.0)