CRAWLDoc: A System for Contextual Ranking and Bibliographic Metadata Extraction from Web Resources

Authors

F. Karl and A. Scherp

DOI:

https://doi.org/10.54195/irrj.23861

Keywords:

Large Language Model, Document Ranking, Bibliographic Metadata Extraction, Scholarly Dataset

Abstract

Publication databases rely on accurate metadata extraction from diverse web sources, yet variations in web layouts and data formats present challenges for metadata providers. This paper introduces CRAWLDoc, a novel system for contextual ranking of linked documents and metadata extraction from web resources. Using a publication’s DOI, CRAWLDoc retrieves the landing page and associated linked documents, including PDFs, ORCID profiles, and supplementary materials. It embeds these documents, along with anchor texts and URLs, into a unified representation. Our layout-independent embedding and ranking system ensures robustness across various web layouts and formats. Experimental results demonstrate CRAWLDoc’s superior performance in extracting bibliographic metadata compared to relying solely on landing pages. A leave-one-out experiment across six publishers shows CRAWLDoc’s resilience to layout changes and consistent extraction accuracy. Our source code and dataset can be accessed at https://github.com/FKarl/crawldoc-metadata-extraction
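The abstract outlines CRAWLDoc's first step: starting from a publication's DOI, fetch the landing page and collect its linked documents together with their anchor texts and URLs. The sketch below illustrates only that link-collection step with Python's standard-library HTML parser; it is a hypothetical simplification, not the authors' implementation, and the class and function names (`LinkCollector`, `collect_links`) are illustrative.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (anchor_text, url) pairs from a landing page's HTML.

    CRAWLDoc-style systems keep the anchor text and URL alongside each
    linked document (PDF, ORCID profile, supplementary material) as
    context for ranking; this parser gathers those pairs.
    """

    def __init__(self):
        super().__init__()
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that <a> tag
        self.links = []     # resulting (anchor_text, url) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalize internal whitespace in the anchor text.
            anchor = " ".join("".join(self._text).split())
            self.links.append((anchor, self._href))
            self._href = None


def collect_links(html: str):
    """Return all (anchor_text, url) pairs found in the given HTML."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

In a full pipeline, the HTML would come from resolving the DOI (e.g. an HTTP GET on `https://doi.org/<DOI>` following redirects), and each collected URL would then be fetched and embedded together with its anchor text for ranking.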



Published

2026-02-04

Section

Articles

How to Cite

Karl, F., & Scherp, A. (2026). CRAWLDoc: A System for Contextual Ranking and Bibliographic Metadata Extraction from Web Resources. Information Retrieval Research, 2(1), 1-33. https://doi.org/10.54195/irrj.23861