Arabic Question Answering System
1.0 Introduction
Question answering (QA) systems are designed to answer questions posed in natural language, in a variety of languages. They use natural language processing and information retrieval techniques to generate an answer from the input question. QA systems go one step further than document retrieval by analyzing the data related to the question before producing an answer. Arabic QA systems are among the most popular of these systems today. This is supported by the fact that Arabic is the 6th most widely spoken language in the world, with over 350 million speakers [1], and by the growing volume of Arabic content available on the web.
Most QA systems rest on two major components: information retrieval and natural language processing. Some systems retrieve information from the web to generate the required answer [2], while others rely on a knowledge base constructed using semantics [3]. The process of answering an Arabic question can be divided into three distinct phases: question analysis, passage retrieval, and answer extraction. QA systems handle various types of questions, depending on the structure and nature of the data required to extract an answer [4]. Questions can be factoid, answerable with a single word or short phrase, or open-ended, requiring the retrieval of a large amount of information to generate the answer.
Many systems have been developed with the aim of improving the question answering paradigm. They vary in structure and mode of operation, but the shared aim is to provide as accurate an answer as possible [5]. The purpose of this paper is to propose a question answering system specifically targeting the Arabic language. The proposed system not only retrieves the relevant documents but also points to the target paragraph within a document that is likely to contain the answer.
2.0 Related Works
Hammo, Abu-Salem, and Lytinen proposed a question answering system called QARAB [6]. The system takes natural language queries expressed in Arabic and aims to return short, accurate answers. Its primary source of knowledge is a collection of Arabic newspaper text obtained from Al-Raya, Qatar. Its approach combines traditional Information Retrieval (IR) techniques with Natural Language Processing (NLP).
QARAB's aim is to identify text passages that answer natural language queries [6]. The task can be summarized as follows: given a set of questions in Arabic, return answers under two assumptions. First, an answer should not span multiple documents. Second, it should originate from the Arabic newspaper text obtained from Qatar's Al-Raya newspaper. QARAB's QA processing has three steps [6]. First, it processes the query. Second, it retrieves from the IR system the documents likely to contain the answer. Lastly, it analyzes those documents in the same way it processes the query and returns the sentences that may contain the answer.
QARAB consists of an IR system and an NLP system. The IR system is based on Salton's vector space model. Initially, the text obtained from the Al-Raya newspaper is processed to build an inverted file system over the collection from which answers are drawn. The role of the IR system is to retrieve documents containing information relevant to the question [6].
Trigui, Belguith, and Rosso are concerned with the scarcity of QA systems for the Arabic language [7]. They therefore propose the Arabic Definition Question Answering System (DefArabicQA). It uses a pattern-based approach to identify exact and accurate information about an organization from the web. The approach relies on linguistic analysis without any language understanding capability. DefArabicQA identifies candidate definitions with the help of lexical patterns, filters them with heuristic rules, and ranks them using a statistical approach. With Google as the web source, about 54% of 50 test questions were answered; with both Wikipedia and Google as web sources, about 64% of the same 50 questions were answered. However, some words are missing from the definition answers because the snippets are truncated.
Bekhti and Al-Harbi proposed an Arabic question answering system called AQuASys (Arabic Question Answering System) [8]. AQuASys is designed to let users pose queries in Arabic and retrieve accurate answers in the same language. The system responds to queries about named entities of any kind: quantity, organization, person, time, location, and so on. It therefore accepts queries beginning with how, when, where, what, and who. Because questions in Arabic can take several interrogative forms, the accepted question words are extended with questioning nouns that play a role similar to the interrogative nouns handled by earlier Arabic QA systems. The performance of AQuASys was measured on a set of questions provided by native Arabic speakers during the testing stage. Its architecture, composed of query analysis, sentence filtering, and ranking modules, determines the accuracy of the answers returned.
Benajiba, Rosso, and Lyhyaoui [9] observe that most components of a QA system are language-dependent, so the peculiarities of the target language should be taken into account when building the system. To this end, they proposed an Arabic QA system called ArabiQA. ArabiQA comprises several modules. The question analysis module determines the type of question and the relevant keywords. The passage retrieval module identifies the passages most likely to contain the answer, while the answer extraction module extracts the relevant answers. A precision of 83% is reported on the test set, which suggests that the approach is effective for extracting answers to factoid questions. The accuracy of the entire QA system has not been reported, because the system was still at the implementation stage [9]. Further improvements are needed to answer questions more complex than factoid ones.
Bouma et al. [10] describe Joost, a QA system entered in the CLEF 2007 evaluation, for the English and Arabic QA tasks. The system depends heavily on syntactic information. It was extended with two innovations. First, Wikipedia, an online encyclopedia available in many languages, was added to the document collection. Its XML files are preprocessed to index the collection for more accurate information retrieval, and the essential plain text is extracted and parsed. Second, the test questions are grouped by topic [10]. Within a topic, a question may presuppose or refer to information from earlier questions or their answers. An anaphora resolution system was therefore built to recognize anaphoric elements and to identify the appropriate antecedent in the first question of the topic or in its answer. Finally, the information retrieval component was also improved: query expansion based on blind relevance feedback and synonym lists raises the reciprocal rank of the IR module. The system also has a question classification module that assigns a question class to the English source questions and to the automatically translated Arabic questions.
Finally, the internet is a huge source of knowledge, which makes it difficult to locate accurate information. Current search engines return relevant results rather than the exact answer to the user's query [10]. QA systems address this by providing exact answers to questions asked in the user's natural, native language. The QA systems mentioned above, such as QARAB, ArabiQA, and AQuASys, are effective in providing precise answers in Arabic without restricting the rules for forming a query, the exact wording of the question, or the knowledge domain.
3.0 Methodology
This section briefly describes the methodology of our work. It is decomposed into two phases: the first phase classifies the query into its corresponding class using a Support Vector Machine (SVM), while the second phase uses Latent Semantic Indexing (LSI) to retrieve the relevant documents together with the paragraphs that contain the answer.
3.1 Query Classification
Query classification is the task of automatically labeling a query with respect to a target taxonomy. The SVM algorithm learns to distinguish between a set of classes from a training set of labeled examples for the target classes. The SVM represents each document as a point in a high-dimensional space; for a given class, the documents in that class serve as positive examples, while all other documents serve as negative examples.
The steps for building the SVM model are as follows:
In the first step, the documents are tokenized into words based on the spaces between them. After tokenization, stop word elimination is applied to remove uninformative words, based on an Arabic stop word list. Then, a light Arabic stemmer is applied to the remaining words in order to strip word suffixes and to reduce the number of distinct words.
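The preprocessing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the stemmer used in this work: the stop word set and the affix lists are small hypothetical samples, the function names are our own, and the sketch also strips a few common prefixes (as light stemmers typically do) even though the text above only mentions suffixes.

```python
# Hypothetical sample lists, for illustration only. A real system would load
# a complete Arabic stop word list and use an established light stemmer.
STOP_WORDS = {"في", "من", "على", "إلى", "عن", "هو", "هي"}
PREFIXES = ("وال", "بال", "ال")            # checked longest first
SUFFIXES = ("ات", "ون", "ين", "ها", "ة")


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def preprocess(text):
    """Tokenize on whitespace, drop stop words, then stem what remains."""
    tokens = text.split()
    return [light_stem(token) for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("المدرسة في المدينة"))
```

The `min_len` guard is a design choice of this sketch: it keeps very short words from being reduced to one or two letters.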
In the second step, a feature vector is built for each document. The feature set consists of all distinct terms in the document set, and the values are the Term Frequency-Inverse Document Frequency (TF-IDF) weights, which reflect how important a word is to a document within the corpus. Equation (1) shows the TF-IDF calculation.
$$\text{TF-IDF} = \text{TF} \times \text{IDF} \quad (1)$$
where TF is the term frequency of the word in the document, and IDF is the inverse document frequency of the word across all documents, given in Equation (2).
$$\text{IDF} = \log\left(\frac{N}{n_t}\right) \quad (2)$$
where $n_t$ is the number of documents containing the term, and $N$ is the number of documents in the corpus.
SVM Model
The input query must be represented in the same way as the documents in the corpus, by applying the previous steps. The SVM classifier uses the inner product (linear) kernel, and the input query is assigned to one of the classes. Figure 1 shows the steps of the SVM classification method.
Figure 1: SVM classifier Steps.
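The feature vector and classification steps can be illustrated with a small self-contained sketch. Several things here are assumptions rather than details of our system: the natural logarithm is used for IDF, the linear SVM is trained one-vs-rest by sub-gradient descent on the regularised hinge loss without a bias term, the hyperparameters are arbitrary, and the English tokens are stand-ins for stemmed Arabic terms. A production system would normally rely on an established SVM library instead.

```python
import math
from collections import Counter


def vectorize(tokens, vocab, idf):
    """TF-IDF vector of one document or query over a fixed vocabulary."""
    tf = Counter(tokens)
    return [tf[term] * idf[term] for term in vocab]


def build_tfidf(docs):
    """docs: list of token lists. Returns (vocab, idf, document vectors)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    return vocab, idf, [vectorize(doc, vocab, idf) for doc in docs]


def dot(u, v):
    """Inner product, used here as the (linear) kernel."""
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(vectors, signs, epochs=50, lam=0.01):
    """Binary linear SVM (labels +1/-1), sub-gradient descent on hinge loss."""
    w = [0.0] * len(vectors[0])
    step = 0
    for _ in range(epochs):
        for x, y in zip(vectors, signs):
            step += 1
            eta = 1.0 / (lam * step)                # decaying learning rate
            margin = y * dot(w, x)
            w = [(1 - eta * lam) * wi for wi in w]  # regularisation shrink
            if margin < 1:                          # margin violated
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w


def train_one_vs_rest(vectors, labels):
    """One weight vector per class: that class positive, all others negative."""
    models = {}
    for cls in sorted(set(labels)):
        signs = [1 if label == cls else -1 for label in labels]
        models[cls] = train_linear_svm(vectors, signs)
    return models


def classify(query_vector, models):
    """Return the class whose weight vector scores the query highest."""
    return max(models, key=lambda cls: dot(models[cls], query_vector))


# Toy corpus with placeholder tokens and hypothetical class names.
docs = [["price", "market", "bank"], ["bank", "loan", "market"],
        ["match", "team", "goal"], ["team", "player", "goal"]]
labels = ["economy", "economy", "sport", "sport"]

vocab, idf, vectors = build_tfidf(docs)
models = train_one_vs_rest(vectors, labels)
print(classify(vectorize(["bank", "loan"], vocab, idf), models))
```

Note that the query is vectorized with the vocabulary and IDF values learned from the corpus, so query terms that never occur in the corpus are ignored.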
3.2 Latent Semantic Indexing
After the classification step, a Term Document Matrix (TDM) is built for each class and its Singular Value Decomposition (SVD) is computed. Building the TDM is the basic step for constructing the term vectors, which in turn underlie LSI. The TDM is indexed by the distinct terms and the documents in the class, and its entries are TF-IDF weights. The SVD decomposes a matrix into a product of three matrices, exposing the main structural properties of the original matrix. The SVD is calculated for each class; Equation (3) presents the SVD formula.
$$A = U S V^{T} \quad (3)$$
where $A$ is the $m \times n$ term-document matrix and:
• U is an m × k matrix whose columns are the eigenvectors of $A A^{T}$ (the left singular vectors);
• S is a k × k diagonal matrix whose diagonal elements are the singular values of A; all off-diagonal elements are zero by definition;
• V is an n × k matrix whose columns are the eigenvectors of $A^{T} A$ (the right singular vectors).
The output of the SVD calculation is a set of term vectors and document vectors. A term vector reflects the relatedness of a term to all other terms, while a document vector reflects the relatedness of a document to all other documents.
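To make the decomposition concrete, the sketch below computes the top k singular triplets of a small matrix in plain Python. It is illustrative only: power iteration with deflation is a technique we substitute here for readability (a numerical linear algebra library would be used in practice), the function names are hypothetical, and representing document j by its coordinates in V and folding a query in as $q^{T} U S^{-1}$ is a common LSI convention that we assume rather than a detail stated above.

```python
import math


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def truncated_svd(A, k, iters=300):
    """Top-k singular triplets of A by power iteration with deflation.

    Returns (U, S, V) where U[i] and V[i] are the i-th left and right
    singular vectors, so A is approximated by sum_i S[i] * outer(U[i], V[i]).
    """
    m, n = len(A), len(A[0])
    R = [[float(x) for x in row] for row in A]     # residual matrix
    U, S, V = [], [], []
    for _ in range(k):
        Rt = transpose(R)
        v = [float(j + 1) for j in range(n)]       # fixed, non-random start
        length = norm(v)
        v = [x / length for x in v]
        for _ in range(iters):
            w = matvec(Rt, matvec(R, v))           # one step on R^T R
            length = norm(w)
            if length == 0.0:
                break
            v = [x / length for x in w]
        Rv = matvec(R, v)
        s = norm(Rv)
        if s < 1e-12:                              # no rank left in residual
            break
        u = [x / s for x in Rv]
        U.append(u)
        S.append(s)
        V.append(v)
        # Deflate: remove the component just found from the residual.
        R = [[R[i][j] - s * u[i] * v[j] for j in range(n)] for i in range(m)]
    return U, S, V


def document_vector(V, j):
    """Coordinates of document j in the k-dimensional latent space."""
    return [v[j] for v in V]


def fold_in_query(q, U, S):
    """Project a term-space query vector into the latent space (q^T U S^-1)."""
    return [sum(qi * ui for qi, ui in zip(q, u)) / s for u, s in zip(U, S)]


if __name__ == "__main__":
    A = [[2.0, 1.0], [1.0, 2.0]]                   # 2 terms x 2 documents
    U, S, V = truncated_svd(A, k=2)
    print(S)
    print(document_vector(V, 0), fold_in_query([2.0, 1.0], U, S))
```

In the usage example the query equals the first column of A, so its folded-in vector should coincide with the vector of the first document.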
LSI uses cosine similarity to measure the relation between the term vectors of the query and the document vectors. Equation (4) shows the cosine similarity formula, where A and B denote the two vectors being compared.
$$\text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} \quad (4)$$
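Equation (4) translates directly into code. The sketch below ranks a set of document (or paragraph) vectors against a query vector; the function names and the toy vectors are illustrative assumptions, and returning 0 for a zero-length vector is a convention chosen here to avoid division by zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, as in Equation (4)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0                      # convention for empty vectors
    return dot / (norm_a * norm_b)


def rank_documents(query_vector, document_vectors):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(index, cosine_similarity(query_vector, vector))
              for index, vector in enumerate(document_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    query = [1.0, 0.0]
    documents = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]
    for index, score in rank_documents(query, documents):
        print(index, round(score, 4))
```

Because cosine similarity depends only on direction, a document vector that is a scaled copy of the query still receives the maximum score of 1.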
References
[1] S. K. Ray and K. Shaalan, “A review and future perspectives of Arabic Question Answering systems,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 12, pp. 3169-3190, 2016.
[2] A. Albarghothi and K. Shaalan, “Arabic question answering using ontology,” Procedia Computer Science, vol. 117, pp. 183-191, 2017.
[3] M. Shaheen and A. M. Ezzeldin, “Arabic question answering: Systems, tools, and future trends,” Arabian Journal for Science and Engineering, vol. 39, no. 6, pp. 4541-4564, 2014.
[4] H. Al-Chalabi, S. Ray, and K. Shaalan, “Semantic based query expansion for Arabic question answering systems,” in Arabic Computational Linguistics, pp. 127-132, 2015.
[5] H. A. Chalabi, Question Processing for Arabic Question Answering System, Dubai: The British University, 2015.
[6] B. Hammo, H. Abu-Salem, and S. Lytinen, “QARAB: A Question Answering System to Support the Arabic Language,” pp. 1-11, 2009.
[7] O. Trigui, L. H. Belguith, and P. Rosso, “DefArabicQA: Arabic Definition Question Answering System,” Natural Language Engineering, pp. 1-5, 2013.
[8] S. Bekhti and M. Al-Harbi, “AQuASys: A Question-Answering System for Arabic,” Recent Advances in Applied Computer Science, pp. 130-140, 2011.
[9] Y. Benajiba, P. Rosso, and A. Lyhyaoui, “Implementation of the ArabiQA question answering system’s components,” in Proc. Workshop on Arabic Natural Language Processing, 2nd Information Communication Technologies Int. Symposium, ICTIS-2007, Fez, Morocco, pp. 3-5, 2007.
[10] G. Bouma, J. Mur, G. Kloosterman, and G. van Noord, “Question Answering with Joost at CLEF 2007,” pp. 1-9, 2007.