The Impact of Text Pre-processing and Term Weighting on Al-Hadith Al-Shareef Classification
Abstract: Al-Hadith is a collection of the words, actions, provisions, and silent approvals of Prophet Mohammed (peace and blessings of Allah be upon him), and it is the second religious source for all Muslims after Al-Qur’an. One of the venerable Imams who narrated Al-Hadith is Imam Muslim, who spent nearly fifteen years compiling over 3000 Hadiths without repetition. This paper studies the impact of text pre-processing and different term weighting schemes on Al-Hadith Al-Shareef classification. In addition, it presents and compares the effectiveness of three distinct learning algorithms for classifying Al-Hadith Al-Shareef into eight selected books from Sahih Muslim. The learning algorithms are Naïve Bayes (NB), Support Vector Machines (SVM), and Complement Naïve Bayes (CNB), evaluated with 10-fold cross-validation. We used the Term Frequency-Inverse Document Frequency (TF-IDF), Term Frequency (TF), Term Occurrences (TO), and Binary Term Occurrences (BTO) techniques to compute the weight of every word in a particular document. The results indicate that term stemming and pruning, document normalization, and term weighting dramatically reduce dimensionality, enhance text representation, and directly impact text mining performance. Moreover, the classification results show that CNB achieved promising results compared with the other supervised methods in classifying Al-Hadith: CNB obtains 91.22% accuracy and 91.86% F-measure.
Keywords: Al-Hadith Al-Shareef, Arabic Text Mining, Arabic Text Classification, Classification Techniques, Term weighting, Arabic morphological analysis.
1. Introduction
Islam is based on two fundamental sources of law: Al-Qur’an, the words of Allah, and Al-Hadith, which documents the words, deeds, provisions, and approvals of Mohammed as the prophet of Allah. Hadith was compiled and classified by many Imams, such as Imam Bukhari, Imam Muslim, and Imam Tirmidzi. All of them relied on one source, Prophet Mohammed (peace and blessings of Allah be upon him). Imam Muslim is one of the best-known Imams according to the Ulama. He spent nearly fifteen years compiling over 3000 Hadiths without repetition [25]. Following [28], Figure 1 shows the components of a Hadith.
The Sanad is the chain of narrators of a Hadith and appears at its beginning. The Matan is the content of the Hadith and follows the Sanad. Finally, the Rawy is the person or Imam who compiled the Hadith, such as Imam Muslim.
With the exponential growth of digitized documents, the need has emerged for systems that can extract high-quality information, which is why automatic Text Classification (TC) has become widespread.
A TC task goes through three main steps: text pre-processing, text classification, and evaluation. The text pre-processing phase makes the text documents suitable for training the classifier. Then, the classifier is built and tuned by applying a learning technique to the training dataset. Finally, the classifier is evaluated with evaluation measures such as recall and precision. A detailed description of these steps can be found in [29, 30, 31].
Several existing classification algorithms have been used to classify English text corpora, such as SVM [6, 33], NB [6, 7, 33], Decision Trees (DTs) [6, 7], k-Nearest Neighbor (KNN) [33], Artificial Neural Networks (ANNs) [33], and others. However, little research has been conducted on Arabic corpora, chiefly because the Arabic language is very rich and needs special treatment such as handling verb order and morphological analysis. Notably, in Arabic morphology, words have rich meanings and carry a great deal of grammatical and lexical information [32]. Additionally, Arabic sentence formation differs from English in its grammatical structure. In this regard, Arabic text documents require significant processing to build an accurate classification model. Consequently, only a few scholars have applied classification approaches to the problem of Arabic text classification, e.g., NB [3, 10, 13], SVM [2, 15, 22], KNN [22], and DTs [2, 16]. Even so, researchers conclude that Arabic text classification is a very difficult task because of the complexity of the language.
This paper studies the impact of text pre-processing techniques and different term weighting schemes on an Arabic corpus collected manually from Islam’s lawsuit and indicative website. In addition, it presents and compares several classification methods applied to the problem of Arabic text classification. Specifically, the NB, SVM, and CNB learning methods are applied to classify the Sahih Muslim Arabic corpus into one of the predefined categories (books), and their performance and effectiveness are measured with different text evaluation metrics: accuracy, precision, recall, and F-measure. Experiments are conducted on a specific set of Hadiths from Sahih Muslim, where eight selected books were chosen as categories.
The subsequent sections are organized as follows. Section 2 reviews related work. Section 3 introduces the corpus used to test our learning methods and the pre-processing applied to the text. Finally, the experimental results and evaluation, and the conclusions, are presented in Section 4 and Section 5, respectively.
2. Related Works
The Arabic language is the mother tongue of more than 300 million people; for religious reasons it is considered the language of Islam, and it is ranked as the fifth most spoken language in the world [26]. Unlike Latin-based alphabets, Arabic is written from right to left, and the Arabic alphabet consists of 28 letters. Arabic is, in general, a challenging language because it has a very complex morphology compared to English. This is due to the unique nature of Arabic morphological principles, which are highly inflectional and derivational [9, 11, 14].
El-Kourdi et al. [10] used an NB classifier to classify an in-house collection of Arabic documents. The collection includes five classes with three hundred web documents per class, and several partitions of the data set were used. In addition to their own root extraction algorithm, they used other root extraction algorithms, and they concluded that there is some indication that the performance of the NB algorithm in classifying Arabic documents is not sensitive to the choice of root extraction algorithm. The average accuracy reported was about 68.78%.
Duwairi [8] compared the performance of NB, KNN, and distance-based classifiers for Arabic text categorization. The collected corpus contains a thousand documents that vary in length and writing style and comprises ten classes of a hundred documents each. The author used stemming to reduce the number of features extracted from the documents. Recall, precision, error rate, and fallout were used to compare the classifiers. The results showed that the NB classifier outperformed the other two classifiers.
Al-Harbi et al. [2] evaluated the performance of two popular classification algorithms, the C5.0 decision tree and SVM, in classifying Arabic text using seven different Arabic corpora (e.g., Saudi newspapers, web sites, and Arabic poems). They implemented a tool for Arabic text classification that performs feature extraction and selection. They concluded that the C5.0 decision tree algorithm outperformed SVM in terms of accuracy: the average accuracy of SVM was 68.65%, while the average accuracy of C5.0 was 78.42%.
Hattab et al. [17] applied the SVM model to classify Arabic text documents and compared the results with other traditional classifiers: NB, KNN, and Rocchio. Their experiments, performed on a set of 1132 documents, showed that the Rocchio classifier gave better results when the feature set was small, while SVM outperformed the other classifiers when the feature set was large enough. The classification rate exceeded 90% when more than 4000 features were used.
Al-Khatib [4] compared the effectiveness of four different learning algorithms for classifying Al-Hadith Al-Shareef into eight selected books from Sahih Bukhari. The testing corpus has 1500 Hadiths of varying length distributed over eight books. The learning algorithms are the Rocchio algorithm, KNN, NB, and SVM. He used the TF-IDF technique to compute the weight of each word in a particular document. His results showed that the best accuracy in Al-Hadith classification was reported for the SVM algorithm, since its precision value is the smallest one among all results. The KNN and NB algorithms had good accuracy in Al-Hadith classification, and the worst accuracy was reported for the Rocchio algorithm, since its precision value is the largest one.
Jbara [19] examined knowledge discovery from Al-Hadith through a classification algorithm that assigns each Hadith to one of thirteen predefined classes (books) from Sahih Al-Bukhari. The testing corpus has 1321 Hadiths of varying length distributed over thirteen books. The author used a supervised method called Stem Expansion (SEC) to discover knowledge from Al-Hadith by assigning each Hadith to one book (class) of the predefined classes. His results showed that SEC performed better in classifying Al-Hadith than existing classification methods (WBC and AL-Kabi) according to the most reliable measurements in the text classification field (recall, precision, and F-measure).
We found a significant shortcoming in the Arabic classification studies in this field: each study is restricted to a limited range of classification algorithms. This research studies the impact of text pre-processing and different term weighting schemes on Arabic text classification. In addition, it presents and compares distinct classification methods on the same corpus in order to evaluate these algorithms and choose the one most suited to the considered case study. This guarantees that the various algorithms are run under the same conditions and the same settings in all experiments.
3. The Corpus and The Text Pre-processing
3.1. The Corpus
In this work, we built an in-house corpus of Arabic texts collected from [18], referred to as MHAC, to perform our experiments. The corpus includes 1,306 text documents classified into eight classes chosen from Sahih Muslim, and it contains about 24,127 distinct features after stop word removal. We generate all text representations for the MHAC corpus in order to evaluate the obtained classification results. The generated text representations combine a morphological analysis method (light stemming or stemming) and percentual term pruning (min threshold = 3%, max threshold = 30%) with a term weighting scheme (TF-IDF, TF, TO, or BTO). Table 1 shows statistical information about the books included in the experiments, along with each book's name in English and in Arabic as used in Sahih Muslim.
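Percentual term pruning can be illustrated with a minimal sketch. It assumes that the thresholds refer to the fraction of documents in which a term occurs and that both bounds are inclusive; the function name `prune_terms` and these conventions are illustrative assumptions, not the exact behaviour of the tool used in this work.

```python
from collections import Counter

def prune_terms(docs, min_frac=0.03, max_frac=0.30):
    """Keep only terms whose document frequency, as a fraction of the
    corpus size, lies in [min_frac, max_frac] (bounds assumed inclusive).

    docs: list of token lists, one list per document.
    Returns the pruned documents and the set of retained terms.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    keep = {t for t, c in df.items() if min_frac <= c / n_docs <= max_frac}
    pruned = [[t for t in tokens if t in keep] for tokens in docs]
    return pruned, keep
```

Terms that occur in almost every document carry little discriminative information, and very rare terms are mostly noise, so removing both ends shrinks the feature space.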
3.2. The Text Pre-processing
One of the widely used text representations in text mining views the text as a Bag of Tokens (BOT), where tokens are words or n-grams. Under that model, text can already be classified [5].
Before applying any algorithm, each Hadith in both the training and testing data is pre-processed. Pre-processing includes removing the Sanad, tokenizing the string into words, removing punctuation and diacritic marks, removing stop words, applying the appropriate term stemming and pruning methods as feature reduction techniques, normalizing the tokenized words, and finally applying a suitable term weighting scheme to enhance the representation of text documents as feature vectors. We use the open-source machine learning tool RapidMiner for text pre-processing. Table 2 shows all pre-processing steps for Al-Hadith.
In linguistics, morphology is the identification, analysis, and description of the structure of morphemes and other units of meaning in a language, such as words, affixes, and parts of speech. For the Arabic language, there are two different morphological analysis techniques: stemming and light stemming. Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form. The stemming algorithm by Khoja [21] is one of the well-known Arabic stemmers. Light stemming, in contrast, removes common affixes from words without reducing them to their stems and keeps the words’ meanings unaffected [1, 12, 24]. The light stemmer of [23] is a standard Arabic light stemmer.
The aim of term weighting is to enhance the representation of text documents as feature vectors. Popular term weighting schemes are TF-IDF, TF, TO, and BTO. BTO indicates the absence or presence of a word with the Boolean values 0 or 1, respectively. TO(t, d) is the absolute number of occurrences of term t in document d, and TF(t, d) is the frequency of term t in document d relative to the length of the document. TF-IDF is a weight often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection or corpus. The term frequency tf(t, d) measures how often term t occurs in document d. The document frequency df(t) is the number of documents in which term t occurs at least once. The inverse document frequency is calculated from the document frequency as idf(t) = log(N / df(t)), where N is the total number of documents. A reasonable measure of term importance is then obtained as the product of the term frequency and the inverse document frequency, tf(t, d) * idf(t) [12, 20, 24, 27].
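The first cleaning steps listed above (diacritic removal, punctuation removal, tokenization, and stop word removal) can be sketched as follows. This is a minimal illustration and not the RapidMiner process used in this work: the Unicode ranges, the tiny stop word list, and the function name `preprocess` are assumptions, and Sanad removal, stemming, pruning, and normalization are left out.

```python
import re

# Arabic diacritic marks (U+064B..U+0652) and the tatweel elongation mark (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Anything that is neither an Arabic letter (U+0621..U+064A) nor whitespace
# is treated as punctuation and replaced by a space.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")

# Tiny illustrative stop word list; a real list is much longer.
STOP_WORDS = {"من", "في", "على", "عن", "ما", "لا"}

def preprocess(text, stop_words=STOP_WORDS):
    """Return the list of cleaned tokens of one document."""
    # Diacritics are removed first so that words are not split at the marks.
    text = DIACRITICS.sub("", text)
    text = NON_ARABIC.sub(" ", text)
    tokens = text.split()
    return [t for t in tokens if t not in stop_words]
```

The resulting token lists are what the stemming, pruning, and weighting steps would operate on.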
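To make the idea of light stemming concrete, the sketch below strips at most one common prefix and one common suffix from a word. The affix lists and the minimum remaining length are illustrative assumptions in the spirit of common Arabic light stemmers; they are not the exact rules of the stemmers cited above.

```python
# Illustrative affix lists (assumption), ordered longest first so that the
# longest matching affix is removed.
PREFIXES = ["وال", "بال", "كال", "فال", "لل", "ال"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي"]

def light_stem(word, min_len=3):
    """Remove one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word
```

Unlike root-based stemming, this leaves the core of the word intact, which is why light stemming tends to preserve word meaning.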
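The four weighting schemes can be compared side by side with a small sketch. It assumes that TF is the occurrence count divided by the document length, that TF-IDF multiplies this TF by the natural logarithm of N / df(t), and that no further vector normalization is applied; the function name `weight_documents` is hypothetical, and the tool used in this work may differ in these details.

```python
import math
from collections import Counter

def weight_documents(docs, scheme="tfidf"):
    """Turn token lists into sparse {term: weight} vectors.

    scheme: "bto" (binary), "to" (raw count), "tf" (count / document
    length, an assumption), or "tfidf" (tf * log(N / df)).
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency of each term
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        vec = {}
        for term, count in counts.items():
            if scheme == "bto":
                vec[term] = 1
            elif scheme == "to":
                vec[term] = count
            elif scheme == "tf":
                vec[term] = count / total
            elif scheme == "tfidf":
                vec[term] = (count / total) * math.log(n_docs / df[term])
            else:
                raise ValueError(f"unknown scheme: {scheme}")
        vectors.append(vec)
    return vectors
```

Note that under TF-IDF a term occurring in every document receives weight zero, since log(N / N) = 0.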
4. Experimental Results and Evaluation
We perform experiments on the Arabic MHAC corpus collected manually from Islam’s lawsuit and indicative website [18]. The corpus includes 1,306 text documents, each belonging to one of eight categories chosen from Sahih Muslim (The Book of Prayers, The Book of Zakat, The Book of Fasting, The Book of Marriage, The Book of Transactions, The Book of Musaqah, The Book of Drinks, and The Book of Greetings). For text classification, we use NB, SVM, and CNB with 10-fold cross-validation. We split the corpus into two parts (80% of the corpus for training and the remaining 20% for testing) using stratified sampling, which keeps the class distribution the same after splitting. We split the corpus in this way to achieve better classification results.
To evaluate the classification results, we use confusion matrices, which are the primary source of performance measurement for classification problems. We evaluated the obtained classification results using the most common classification measures: accuracy, precision, recall, and F-measure.
The average classification results are depicted in Figure 2. The morphological analysis (stemming, light stemming), term pruning, and term weighting schemes (TF-IDF, TF, TO, BTO) have an obvious impact on classifier performance, as shown in Figure 2. The figure shows that light stemming with the TO representation for the CNB classifier gives the best classification results (the accuracy is 91.22%, and the F-measure is 91.86%).
Several observations can be made by analyzing the results in Figure 2. First, pre-processing techniques such as Arabic stop word removal and Arabic stemming enhance the accuracy and the F-measure of the classifiers. Second, light stemming gives the best classification results, because light stemming is more appropriate than stemming from a linguistic and semantic viewpoint and keeps the word meanings unaffected. Furthermore, the classifiers are very sensitive to the term weighting scheme, because the weights define the feature values from which each classifier learns. For example, the BTO weighting scheme gives the worst classification results because each term is represented only as 0 or 1.
Figure 3 shows the classification results for the optimal text representation of the MHAC corpus (light stemming + TO for CNB) in each of the domain categories. From Figure 3, we can see that the best F-measure is recorded for The Book of Musaqah, because this book has a limited and clearly delimited vocabulary compared with the other books. Moreover, the figure shows that The Book of Zakat has the lowest F-measure, possibly because this book covers a large vocabulary domain.
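Stratified sampling, as used for the split and for cross-validation, can be sketched in a few lines. The function names, the rounding of the per-class test size, and the round-robin assignment to folds are illustrative assumptions rather than the exact procedure of the tool used in this work.

```python
import random
from collections import defaultdict

def _indices_by_class(labels, seed):
    """Group document indices by label and shuffle each group."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for idxs in by_class.values():
        rng.shuffle(idxs)
    return by_class

def stratified_split(labels, test_frac=0.2, seed=0):
    """Split indices so that each class contributes about test_frac
    of its documents to the test set."""
    train, test = [], []
    for idxs in _indices_by_class(labels, seed).values():
        n_test = round(len(idxs) * test_frac)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

def stratified_folds(labels, k=10, seed=0):
    """Distribute the indices of each class round-robin over k folds."""
    folds = [[] for _ in range(k)]
    for idxs in _indices_by_class(labels, seed).values():
        for i, idx in enumerate(idxs):
            folds[i % k].append(idx)
    return folds
```

In k-fold cross-validation, each fold serves once as the test set while the remaining k - 1 folds are used for training, and the reported measures are averaged over the k runs.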
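The four measures can be derived from a confusion matrix as in the sketch below. It assumes that rows are true classes and columns are predicted classes, and that the multi-class precision, recall, and F-measure are macro-averaged over the classes; the averaging used for the figures reported in this paper is not specified, so this choice and the function name `evaluate` are assumptions.

```python
def evaluate(confusion):
    """confusion[i][j] = number of documents of true class i predicted as j.

    Returns (accuracy, per_class, macro), where per_class is a list of
    (precision, recall, f_measure) tuples and macro is their average.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n)) / total
    per_class = []
    for i in range(n):
        tp = confusion[i][i]
        predicted = sum(confusion[r][i] for r in range(n))  # column sum
        actual = sum(confusion[i])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        per_class.append((precision, recall, f_measure))
    macro = tuple(sum(m[k] for m in per_class) / n for k in range(3))
    return accuracy, per_class, macro
```

For a two-class matrix [[8, 2], [1, 9]], for instance, the accuracy is 17/20 = 0.85 and the F-measure of the first class is 16/19.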
5. Conclusion and Future Works
This paper studies the impact of text pre-processing and different term weighting schemes on Arabic text classification. In addition, it presents and compares the effectiveness of three distinct learning algorithms for classifying Al-Hadith Al-Shareef into eight selected books from Sahih Muslim. The classifiers were tested on an Arabic text corpus that we collected manually from Sahih Muslim and that covers eight books: The Book of Prayers, The Book of Zakat, The Book of Fasting, The Book of Marriage, The Book of Transactions, The Book of Musaqah, The Book of Drinks, and The Book of Greetings. The learning algorithms NB, SVM, and CNB, with 10-fold cross-validation, were applied to classify the Sahih Muslim Arabic corpus. Moreover, we used the TF-IDF, TF, TO, and BTO techniques to compute the weight of each word in a particular document. The results indicate that term stemming and pruning, document normalization, and term weighting dramatically reduce dimensionality, enhance text representation, and directly impact text mining performance. Furthermore, the classification results show that CNB achieved promising results compared with the other supervised methods in classifying Al-Hadith: CNB obtains 91.22% accuracy and 91.86% F-measure.
Possible directions for future work include conducting additional experiments on further text collections to confirm the results we obtained. We also plan to use other feature selection and weighting methods and compare them with the methods already used. In addition, more than one classification method could be combined to increase the accuracy of the system. Finally, it is possible to build a system that accepts as input an archive of texts, such as an archive of Islamic books, together with a category (subject), and returns all the texts related to this category.