Simple Techniques for Efficient Top-k Batch Query Processing

Zhixuan Li; Joel Mackenzie

doi:10.54195/irrj.23893

Authors

Zhixuan Li The University of Queensland Author https://orcid.org/0009-0002-3736-7925
Joel Mackenzie The University of Queensland Author

DOI:

https://doi.org/10.54195/irrj.23893

Keywords:

Batch Processing, Caching, Dynamic Pruning, Efficient Retrieval

Abstract

When faced with a large number of related tasks, like clearing unread emails or tackling a mountain of laundry, it's often most efficient to handle them together in a single batch processing operation. The simple observation is that grouping similar tasks together can ultimately streamline the entire process.
In the context of Information Retrieval, batch processing has been applied to the problem of conjunctive querying by re-using partially computed results to avoid redundant list intersections or disk reads, reducing the total computational effort required to answer the given query batch. However, there is yet to be any work on exploiting batch query processing in the context of disjunctive querying semantics, which are often implemented by highly optimized top-k dynamic pruning algorithms. In this work, we explore two simple yet efficient approaches to reduce the computational cost of top-k querying in the batch processing regime. Our experimentation on two collections, and two unique query batches, demonstrates end-to-end cost reductions of up to 60% over standard query processing algorithms, and lays the foundation for future work towards more sophisticated top-k batch querying techniques.

Downloads

Download data is not yet available.

References

T. G. Armstrong, A. Moffat, W. Webber, and J. Zobel. Improvements that don’t add up: Ad-Hoc retrieval results since 1998. In Proc. ACM Int. Conf. on Information and Knowledge Management (CIKM), pages 601–610, 2009.

X. Bai, I. Arapakis, B. B. Cambazoglu, and A. Freire. Understanding and leveraging the impact of response latency on user behaviour in web search. ACM Trans. on Information Systems, 36(2):1–42, 2017.

P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, and T. Wang. MS MARCO: A human generated MAchine Reading COmprehension dataset. arXiv:1611.09268v3, 2018.

R. Benham, J. Mackenzie, A. Moffat, and J. S. Culpepper. Boosting search performance using query variations. ACM Trans. on Information Systems, 37(4):41.1–41.25, 2019.

R. Blanco, M. Catena, and N. Tonellotto. Exploiting green energy to reduce the operational costs of multi-center web search engines. In Proc. Conf. on the World Wide Web (WWW), pages 1237–1247, 2016.

A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In Proc. ACM Int. Conf. on Information and Knowledge Management (CIKM), pages 426–434, 2003.

S. Bruch. Foundations of Vector Retrieval. Springer, 2024. ISBN 9783031551826.

C. Buckley and A. F. Lewit. Optimization of inverted vector searches. In Proc. ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR), pages 97–110, 1985.

M. Catena, C. Macdonald, and N. Tonellotto. Load-sensitive CPU power management for web search engines. In Proc. ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR), pages 751–754, 2016.

Q. Chen, X. Geng, C. Rosset, C. Buractaon, J. Lu, T. Shen, K. Zhou, C. Xiong, Y. Gong, P. Bennett, N. Craswell, X. Xie, F. Yang, B. Tower, N. Rao, A. Dong, W. Jiang, Z. Liu, M. Li, C. Liu, Z. Li, R. Majumder, J. Neville, A. Oakley, K. M. Risvik, H. V. Simhadri, M. Varma, Y. Wang, L. Yang, M. Yang, and C. Zhang. MS MARCO Web Search: A large-scale information-rich web dataset with millions of real click labels. In Proc. Conf. on the World Wide Web (WWW), pages 292–301, 2024.

F. M. Choudhury, J. S. Culpepper, Z. Bao, and T. Sellis. Batch processing of top-k spatial-textual queries. ACM Trans. on Spatial Algorithms and Systems, 3(4):13.1–13.40, 2018.

G. Chowdhury. An agenda for green information retrieval research. Information Processing & Management, 48(6):1067–1077, 2012.

G. V. Cormack, M. D. Smucker, and C. L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Information Retrieval, 14(5):441–465, 2011.

M. Crane, A. Trotman, and R. O’Keefe. Maintaining discriminatory power in quantized indexes. In Proc. ACM Int. Conf. on Information and Knowledge Management (CIKM), pages 1221–1224, 2013.

M. Crane, J. S. Culpepper, J. Lin, J. Mackenzie, and A. Trotman. A comparison of Document-at-a-Time and Score-at-a-Time query evaluation. In Proc. ACM Int. Conf. on Web Search and Data Mining (WSDM), pages 201–210, 2017.

N. Craswell, D. Campos, B. Mitra, E. Yilmaz, and B. Billerbeck. ORCAS: 20 million clicked query-document pairs for analyzing search. In Proc. ACM Int. Conf. on Information and Knowledge Management (CIKM), pages 2983–2989, 2020.

L. L. S. de Carvalho, E. S. de Moura, C. M. Daoud, and A. S. da Silva. Heuristics to improve the BMW method and its variants. Journ. Information and Data Management, 6(3):178–191, 2015.

L. Dhulipala, I. Kabiljo, B. Karrer, G. Ottaviano, S. Pupyrev, and A. Shalita. Compressing graphs and indexes with recursive graph bisection. In Proc. Conf. on Knowledge Discovery and Data Mining (KDD), pages 1535–1544, 2016.

S. Ding and T. Suel. Faster top-k document retrieval using block-max indexes. In Proc. ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR), pages 993–1002, 2011.

S. Ding, J. Attenberg, R. Baeza-Yates, and T. Suel. Batch query processing for web search engines. In Proc. ACM Int. Conf. on Web Search and Data Mining (WSDM), pages 137–146, 2011.

L. Dunn, L. Gallagher, and J. Mackenzie. Approximate bag-of-words top-k corpus graphs. In Proc. European Conf. on Information Retrieval (ECIR), pages 174–182, 2025.

M. Fontoura, V. Josifovski, J. Liu, S. Venkatesan, X. Zhu, and J. Zien. Evaluation strategies for top-k queries over memory-resident inverted indexes. Proc. Conf. on Very Large Databases (VLDB), 4(12):1213–1224, 2011.

L. Gallagher, R-C. Chen, R. Blanco, and J. S. Culpepper. Joint optimization of cascade ranking models. In Proc. ACM Int. Conf. on Web Search and Data Mining (WSDM), pages 15–23, 2019.

L. Gallagher, J. Mackenzie, and A. Moffat. Empirical asymptotic growth of dynamic pruning mechanisms. In Proc. Int. ACM SIGIR Conference on Information Retrieval in the Asia Pacific (SIGIR-AP), pages 325–330, 2025.

C. Kamphuis, A. P. de Vries, L. Boytsov, and J. Lin. Which BM25 do you mean? A large-scale reproducibility study of scoring variants. In Proc. European Conf. on Information Retrieval (ECIR), pages 28–34, 2020.

A. Kane and F. W. Tompa. Split-lists and initial thresholds for WAND-based search. In Proc. ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR), pages 877–880, 2018.

E. Kayaaslan, B. B. Cambazoglu, R. Blanco, F. P. Junqueira, and C. Aykanat. Energy-price-driven query processing in multi-center web search engines. In Proc. ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR), pages 983–992, 2011.

D. Lemire and L. Boytsov. Decoding billions of integers per second through vectorization. Software Practice & Experience, 41(1):1–29, 2015.

J. Lin, J. Mackenzie, C. Kamphuis, C. Macdonald, A. Mallia, M. Siedlaczek, A. Trotman, and A. de Vries. Supporting interoperability between open-source search engines with the common index file format. In Proc. ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR), pages 2149–2152, 2020.

J. Mackenzie and A. Moffat. Examining the additivity of top-k query processing innovations. In Proc. ACM Int. Conf. on Information and Knowledge Management (CIKM), pages 1085–1094, 2020.

J. Mackenzie and A. Moffat. Index-based batch query processing revisited. In Proc. European Conf. on Information Retrieval (ECIR), pages 86–100, 2023.

J. Mackenzie, M. Petri, and A. Moffat. Tradeoff options for bipartite graph partitioning. IEEE Trans. on Knowledge and Data Engineering, 35(8):8644–8657, 2022.

A. Mallia, G. Ottaviano, E. Porciani, N. Tonellotto, and R. Venturini. Faster BlockMax WAND with variable-sized blocks. In Proc. ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR), pages 625–634, 2017.

A. Mallia, M. Siedlaczek, J. Mackenzie, and T. Suel. PISA: Performant indexes and search for academia. In Proc. OSIRRC at SIGIR 2019, pages 50–56, 2019a.

A. Mallia, M. Siedlaczek, and T. Suel. An experimental study of index compression and DAAT query processing methods. In Proc. European Conf. on Information Retrieval (ECIR), pages 353–368, 2019b.

A. Mallia, M. Siedlaczek, M. Sun, and T. Suel. A comparison of top-k threshold estimation techniques for disjunctive query processing. In Proc. ACM Int. Conf. on Information and Knowledge Management (CIKM), pages 2141–2144, 2020.

L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web, 1999. URL http://ilpubs.stanford.edu:8090/422/. Technical Report.

M. Petri, J. S. Culpepper, and A. Moffat. Exploring the magic of WAND. In Proc. Australasian Document Computing Symp. (ADCS), pages 58–65, 2013.

M. Petri, A. Moffat, J. Mackenzie, J. S. Culpepper, and D. Beck. Accelerated query processing via similarity score prediction. In Proc. ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR), pages 485–494, 2019.

S. E. Robertson and H. Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations & Trends in Information Retrieval, 3:333–389, 2009.

H. Scells, S. Zhuang, and G. Zuccon. Reduce, reuse, recycle: Green information retrieval research. In Proc. ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR), pages 2825–2837, 2022.

E. Schurman and J. Brutlag. Performance related changes and their user impact. Velocity, 2009.

D. Shan, S. Ding, J. He, H. Yan, and X. Li. Optimized top-k processing with global page scores on block-max indexes. In Proc. ACM Int. Conf. on Web Search and Data Mining (WSDM), pages 423–432, 2012.

L. H. Thiel and H. S. Heaps. Program design for retrospective searches on large data bases. Information Storage and Retrieval, 8(1):1–20, 1972.

N. Tonellotto, C. Macdonald, and I. Ounis. Efficient query processing for scalable web search. Foundations & Trends in Information Retrieval, 12(4-5):319–500, 2018.

H. R. Turtle and J. Flood. Query evaluation: Strategies and optimizations. Information Processing & Management, 31(6):831–850, 1995.

L. Wang, J. Lin, and D. Metzler. Learning to efficiently rank. In Proc. ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR), pages 138–145, 2010.

E. Yafay and I. S. Altingovde. Caching scores for faster query processing with dynamic pruning in search engines. In Proc. ACM Int. Conf. on Information and Knowledge Management (CIKM), pages 2457–2460, 2019.

P. Yang, H. Fang, and J. Lin. Anserini: Reproducible ranking baselines using Lucene. Journal of Information Quality, 10(4):1–20, 2018.

A. Yates, C. Lassance, S. MacAvaney, T. Nguyen, and Y. Lei. Neural lexical search with learned sparse retrieval. In Proc. Int. ACM SIGIR Conference on Information Retrieval in the Asia Pacific (SIGIR-AP), pages 303–306, 2024.

J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys, 38(2):6.1–6.56, 2006.

G. Zuccon, H. Scells, and S. Zhuang. Beyond CO2 emissions: The overlooked impact of water consumption of information retrieval models. In Proc. Int. Conf. on Theory of Information Retrieval (ICTIR), pages 283–289, 2023.

Simple Techniques for Efficient Top-k Batch Query Processing

Authors

DOI:

Keywords:

Abstract

Downloads

References

Downloads

Published

Issue

Section

License

How to Cite

Make a Submission

Indexed by:

Latest publications