Submitted on 2024-10-28, 10:26 and posted on 2024-10-30, 09:55; authored by Haya Al-Thani
Conversational search (CS) presents unique challenges, particularly context awareness and limited training data. This dissertation investigates these challenges and proposes novel solutions to address them. We explore conversational query reformulation using a text-to-text model and a binary-term classifier, highlight the advantages of each, and apply techniques such as the query clarity score and multi-model fusion to improve performance. By combining the two reformulation models, we achieved state-of-the-art results. Additionally, we developed a system that selectively incorporates responses into the conversation history, improving the CS system’s ability to retrieve passages for ambiguous queries. To tackle the issue of limited training data, we introduced paraphrasing as a data augmentation method, increasing the size of our CS dataset by over 665% and enhancing its language diversity. We combined automatic paraphrase generation with human-in-the-loop techniques to produce a high-quality dataset, Expanded-CAsT (ECAsT), which serves as a resource for the CS research community. We used ECAsT to assess how robust CS evaluation is to language diversity and to train two novel multi-turn paraphrasing models with potential applications in query expansion, data augmentation, and passage retrieval. This dissertation contributes to the advancement of conversational search by addressing its main challenges and providing techniques and resources for the research community.