
An Introduction to Noor Corpus and its Language Model

Authors
Mohammad Hossein Elahimanesh, Behrouz Minaei-Bidgoli, Mohammad Javad Gholami, Hossein Juzi
Abstract
In linguistics, a text corpus is a large collection of text documents. Text corpora are used to uncover the underlying regularities of languages. One application of such statistical research is the construction of language models, which are used in information retrieval. In this paper we introduce the Noor Corpus, one of the largest text corpora in Islamic sciences, and then present its language model. The Noor Corpus is the result of a decade of effort by theological researchers and computer engineers at the Computer Research Center of Islamic Sciences (CRCIS). The corpus includes thousands of Islamic books classified into different categories. Most of the texts are in Arabic and Persian: there are 1.2 billion Arabic words and 616 million Persian words. The bigram language models of this corpus contain 80 million distinct bigrams in Arabic and 44 million distinct bigrams in Persian.
Keywords
Islamic Corpus; Language Model; Natural Language Processing
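
The abstract reports the number of distinct bigrams in each language's model. As a rough illustration of what that figure measures, the sketch below counts bigrams in a toy English corpus and derives maximum-likelihood bigram probabilities. It is not the authors' implementation: the function names, the whitespace tokenization, and the sample sentences are all assumptions made for the example, and real Arabic or Persian text would need language-specific tokenization and normalization.

```python
from collections import Counter


def build_bigram_model(sentences):
    """Count bigrams and their left-hand (history) words.

    Hypothetical helper: tokenization is plain whitespace splitting,
    which is an assumption, not the method used for the Noor Corpus.
    """
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # Every token except the last one starts a bigram.
        history_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return history_counts, bigram_counts


def bigram_probability(history_counts, bigram_counts, first, second):
    """Maximum-likelihood estimate of P(second | first), unsmoothed."""
    if history_counts[first] == 0:
        return 0.0
    return bigram_counts[(first, second)] / history_counts[first]


# Toy corpus, invented for illustration only.
toy_corpus = [
    "the book of light",
    "the book of science",
    "the light of science",
]

history_counts, bigram_counts = build_bigram_model(toy_corpus)
distinct_bigrams = len(bigram_counts)
p_book_given_the = bigram_probability(history_counts, bigram_counts, "the", "book")
```

Here `distinct_bigrams` is the toy counterpart of the 80 million (Arabic) and 44 million (Persian) figures in the abstract. A model meant for retrieval would normally add smoothing so that unseen bigrams do not receive zero probability.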