Publications
The publications are grouped into two sections: conference and journal papers.
* Denotes Equal Contribution
Conference Papers

- Latent Concept-based Explanation of NLP Models
  Xuemin Yu, Fahim Dalvi, Nadir Durrani, Marzia Nouri, and Hassan Sajjad
  In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), 2024
Interpreting and understanding the predictions made by deep learning models poses a formidable challenge due to their inherently opaque nature. Many previous efforts aimed at explaining these predictions rely on input features, specifically, the words within NLP models. However, such explanations are often less informative due to the discrete nature of these words and their lack of contextual verbosity. To address this limitation, we introduce the Latent Concept Attribution method (LACOAT), which generates explanations for predictions based on latent concepts. Our foundational intuition is that a word can exhibit multiple facets, contingent upon the context in which it is used. Therefore, given a word in context, the latent space derived from our training process reflects a specific facet of that word. LACOAT functions by mapping the representations of salient input words into the training latent space, allowing it to provide latent context-based explanations of the prediction.
@inproceedings{yu2024latentconceptbasedexplanationnlp,
  title = {Latent Concept-based Explanation of NLP Models},
  author = {Yu, Xuemin and Dalvi, Fahim and Durrani, Nadir and Nouri, Marzia and Sajjad, Hassan},
  booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024)},
  year = {2024},
  url = {https://arxiv.org/abs/2404.12545},
  archiveprefix = {arXiv},
  eprint = {2404.12545},
  primaryclass = {cs.CL},
  keyword = {conference}
}
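A minimal sketch of the latent-concept idea the abstract above describes, not the authors' implementation: cluster contextualized token representations from training data into "latent concepts", then explain a prediction by mapping its most salient input tokens to their nearest concept. The clustering method (k-means) and the precomputed saliency scores are assumptions here.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_concept_space(train_token_reprs: np.ndarray, n_concepts: int = 100) -> KMeans:
    """Cluster training-time token representations into latent concepts."""
    return KMeans(n_clusters=n_concepts, n_init=10, random_state=0).fit(train_token_reprs)

def explain_prediction(concepts: KMeans, input_reprs: np.ndarray,
                       saliency: np.ndarray, top_k: int = 3) -> list[int]:
    """Map the top-k most salient input tokens to their nearest latent concept."""
    salient_idx = np.argsort(saliency)[::-1][:top_k]  # indices of most salient tokens
    return concepts.predict(input_reprs[salient_idx]).tolist()

# Toy usage, with random vectors standing in for hidden states:
rng = np.random.default_rng(0)
concepts = build_concept_space(rng.normal(size=(1000, 768)), n_concepts=10)
print(explain_prediction(concepts, rng.normal(size=(12, 768)), rng.random(12)))
```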
- Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language?
  Omid Ghahroodi, Marzia Nouri*, Mohammad Vali Sanian*, Alireza Sahebi*, Doratossadat Dastgheib, Ehsaneddin Asgari, Mahdieh Soleymani Baghshah, and Mohammad Hossein Rohban
  In First Conference on Language Modeling, 2024
Evaluating Large Language Models (LLMs) is challenging due to their generative nature, necessitating precise evaluation methodologies. Additionally, non-English LLM evaluation lags behind English, resulting in the absence or weakness of LLMs for many languages. In response to this necessity, we introduce the Khayyam Challenge (also known as PersianMMLU), a meticulously curated collection comprising 20,805 four-choice questions sourced from 38 diverse tasks extracted from Persian examinations, spanning a wide spectrum of subjects, complexities, and ages. The primary objective of the Khayyam Challenge is to facilitate the rigorous evaluation of LLMs that support the Persian language. Distinctive features of the Khayyam Challenge are (i) its comprehensive coverage of various topics, including literary comprehension, mathematics, sciences, logic, and intelligence testing, aimed at assessing different facets of LLMs such as language comprehension, reasoning, and information retrieval across various educational stages, from lower primary school to upper secondary school; (ii) its inclusion of rich metadata such as human response rates, difficulty levels, and descriptive answers; (iii) its use of new data to avoid the data-contamination issues prevalent in existing frameworks; (iv) its use of original, non-translated data tailored for Persian speakers, ensuring the framework is free from translation challenges and errors while encompassing cultural nuances; and (v) its inherent scalability for future data updates and evaluations without requiring special human effort. Previous works lacked an evaluation framework that combined all of these features into a single comprehensive benchmark. Furthermore, we evaluate a wide range of existing LLMs that support the Persian language, with statistical analyses and interpretations of their outputs. We believe that the Khayyam Challenge will spur advancements in LLMs for the Persian language by highlighting the existing limitations of current models, while also enhancing the precision and depth of evaluations of LLMs, even within the English-language context.
@inproceedings{ghahroodi2024khayyam,
  title = {Khayyam Challenge (Persian{MMLU}): Is Your {LLM} Truly Wise to The Persian Language?},
  author = {Ghahroodi, Omid and Nouri, Marzia and Sanian, Mohammad Vali and Sahebi, Alireza and Dastgheib, Doratossadat and Asgari, Ehsaneddin and Baghshah, Mahdieh Soleymani and Rohban, Mohammad Hossein},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=yIEyHP7AvH},
  keyword = {conference}
}
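A minimal sketch of how one might score an LLM on a four-choice benchmark like the one above, exploiting the difficulty metadata the abstract mentions. The record fields below (question, choices, answer, difficulty) are hypothetical placeholders, not the dataset's actual schema.

```python
from collections import defaultdict

def accuracy_by_difficulty(records, predict):
    """`predict(question, choices) -> choice index`; returns per-difficulty accuracy."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["difficulty"]] += 1
        if predict(r["question"], r["choices"]) == r["answer"]:
            correct[r["difficulty"]] += 1
    return {d: correct[d] / total[d] for d in total}

# Toy usage with a trivial baseline that always picks the first choice:
records = [
    {"question": "2+2?", "choices": ["4", "5", "6", "7"], "answer": 0, "difficulty": "easy"},
    {"question": "17*3?", "choices": ["41", "51", "61", "71"], "answer": 1, "difficulty": "hard"},
]
print(accuracy_by_difficulty(records, lambda q, c: 0))  # {'easy': 1.0, 'hard': 0.0}
```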
- The Language Model, Resources, and Computational Pipelines for the Under-Resourced Iranian Azerbaijani
  Marzia Nouri*, Mahsa Amani*, Reihaneh Zohrabi, and Ehsaneddin Asgari
  In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), Nov 2023
Iranian Azerbaijani is a dialect of the Azerbaijani language spoken by more than 16% of the population in Iran (>14 million). Unfortunately, a lack of computational resources is one of the factors that puts this language and its rich culture at risk of extinction. This work aims to create fundamental natural language processing (NLP) resources and pipelines for the processing and analysis of Iranian Azerbaijani, introducing standard datasets and starter models for various NLP tasks such as language modeling, text classification, part-of-speech (POS) tagging, and machine translation. The proposed resources have been curated and preprocessed to facilitate the development of NLP models for Iranian Azerbaijani and provide a strong baseline for further research and development. This study is an example of bridging the gap in NLP for low-resource languages and promoting the advancement of language technologies in underrepresented languages. To the best of our knowledge, for the first time, this paper presents major infrastructures for the processing and analysis of Iranian Azerbaijani, with the ultimate goal of improving communication and information access for millions of individuals. Furthermore, our translation model’s online demo is accessible at https://azeri.parsi.ai/.
@inproceedings{nouri-etal-2023-language,
  title = {The Language Model, Resources, and Computational Pipelines for the Under-Resourced {I}ranian {A}zerbaijani},
  author = {Nouri, Marzia and Amani, Mahsa and Zohrabi, Reihaneh and Asgari, Ehsaneddin},
  editor = {Park, Jong C. and Arase, Yuki and Hu, Baotian and Lu, Wei and Wijaya, Derry and Purwarianti, Ayu and Krisnadhi, Adila Alfa},
  booktitle = {Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)},
  month = nov,
  year = {2023},
  address = {Nusa Dua, Bali},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2023.ijcnlp-short.19},
  doi = {10.18653/v1/2023.ijcnlp-short.19},
  pages = {166--174},
  keyword = {conference}
}
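A minimal sketch of querying a machine-translation model such as the one behind the demo linked above, using the Hugging Face `transformers` library. The model identifier and the input sentence are placeholders, not the authors' published checkpoint.

```python
from transformers import pipeline

# Placeholder model id; substitute an actual Azerbaijani-Persian checkpoint.
translator = pipeline("translation", model="path/or/hub-id-of-azerbaijani-persian-model")

result = translator("<Iranian Azerbaijani sentence here>", max_length=128)
print(result[0]["translation_text"])
```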
Journal Papers

- Linguistic Resources and Transformer-based Models for the Machine Translations between Luri and Yazdi Dialects versus Standard Persian
  Zahra Bahmani, Mohaddeseh Mirbeygi, Negin Hashemi Dijujin, Marzia Nouri, Mahsa Amani, Ehsan Asgari, Mahdieh Soleymani Baghshah, Hamid Beigy, Ali Movaghar, and Afzal Moghimi
  Language and Linguistics, 2022
Despite recent advances in developing language technologies for the standard Persian dialect, the official Iranian language, a large number of Iranian language variations remain computationally unexplored. Iranian languages, e.g., Kurdi, Azeri, and many Persian dialects, are examples of low-resource language variations lacking significant linguistic resources such as machine-readable lexicons or part-of-speech (POS) taggers. Efforts in developing language technologies for such languages can significantly contribute to language survival in the digital era and promote cultural diversity. To the best of our knowledge, for the first time, we created linguistic resources for the Luri and Yazdi dialects by introducing the first parallel corpora between these language variations and the modern Persian language. In this study, we train neural encoder-decoder (1) recurrent sequence-to-sequence and (2) transformer-based machine translation models and evaluate the trained models using the BLEU score on an unseen test dataset.
Availability of datasets and models: the datasets are available at https://github.com/language-ml/dataset_yazdi_luri.git
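A minimal sketch of the BLEU evaluation step mentioned above, using sacrebleu (an assumption; the paper does not name its BLEU tooling). The hypotheses and references below are toy placeholders.

```python
import sacrebleu

# One system output per test sentence, and one reference stream
# (a list with one reference per hypothesis).
hypotheses = ["the model output for sentence one", "output for sentence two"]
references = [["the reference for sentence one", "reference for sentence two"]]

print(sacrebleu.corpus_bleu(hypotheses, references).score)
```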