A SURVEY OF RETRIEVAL ALGORITHMS AND THEIR PARALLELIZATION IN LARGE-SCALE SYSTEMS

Volume 4 (2), December 2021, Pages 126-131

Suleyman Suleymanzade


Institute of Information Technology, Azerbaijan National Academy of Sciences, Baku, Azerbaijan


Abstract

This article presents a survey of two well-known document-ranking algorithms, TF-IDF and BM-25, run on a single CPU and in parallel processes via HPC. An Amazon review dataset with more than two million reviews was used to measure the ranking parameters. During the experiment, the number of workers for parallel processing was set to one and to three. Four benchmarks were evaluated: preprocessing and reading time, vectorization time, TF-IDF transformation time, and overall time. The resulting metrics show a significant difference in speed.

Keywords:

TF-IDF, BM-25, Apache Spark, Information retrieval, HPC.

DOI: https://doi.org/10.32010/26166127.2021.4.2.263.266

 

 
