S. Mesbah | TU Delft Repository

LOREM

Language-consistent Open Relation Extraction from Unstructured Text

Conference paper (2020) - Tom Harting, Sepideh Mesbah, Christoph Lofi

We introduce a Language-consistent multi-lingual Open Relation Extraction Model (LOREM) for finding relation tuples of any type between entities in unstructured texts. LOREM does not rely on language-specific knowledge or external NLP tools such as translators or PoS-taggers, and exploits information and structures that are consistent across different languages. This allows our model to be easily extended to new languages with only limited training effort, and also boosts performance for a given single language. An extensive evaluation performed on 5 languages shows that LOREM outperforms state-of-the-art mono-lingual and cross-lingual open relation extractors. Moreover, experiments on languages with little or no training data indicate that LOREM generalizes to languages other than those it is trained on. ...

Semantic-Enhanced Training Data Augmentation Methods for Long-Tail Entity Recognition Models

Doctoral thesis (2020) - Sepideh Mesbah, Geert-Jan Houben, Alessandro Bozzon, Christoph Lofi

Named Entity Recognition (NER) is an essential information retrieval task. It enables a wide range of natural language processing applications such as semantic search and machine translation. NER can be formulated as the task of identifying and typing words or phrases in a text that refer to certain classes of interest (e.g., disease, Adverse Drug Reactions). There are different techniques to tackle NER, such as dictionary-based, rule-based, and machine learning-based ones. Machine learning-based NER techniques have been shown to perform best for entities with large amounts of human-labeled training data.
However, their performance is limited when dealing with long-tail entities. Long-tail entities are entities that have a low frequency in the document collections and usually have no reference to existing Knowledge Bases. Obtaining human-labeled datasets is expensive and time-consuming, especially for long-tail entities that are scarcely available in document collections. This dissertation focuses on the problem of the lack of training data, arguably the largest bottleneck in training machine learning-based NER techniques. We investigated efficient and effective ways to augment training data by enhancing their size and quality automatically. Our work aimed at showing how, by enhancing the size and quality of the training data using different techniques, it will be possible to improve the performance of Long-tail Entity Recognition (L-tER). ...

Normalization of Long-tail Adverse Drug Reactions in Social Media

Conference paper (2020) - E. Manousogiannis, Sepideh Mesbah, Alessandro Bozzon, Robert-Jan Sips, Zoltán Szlávik, Selene Baez Santamaria

The automatic mapping of Adverse Drug Reaction (ADR) reports from user-generated content to concepts in a controlled medical vocabulary provides valuable insights for monitoring public health. While state-of-the-art deep learning-based sequence classification techniques achieve impressive performance for medical concepts with large amounts of training data, they show their limit with long-tail concepts that have a low number of training samples. The above hinders their adaptability to the changes of layman’s terminology and the constant emergence of new informal medical terms. Our objective in this paper is to tackle the problem of normalizing long-tail ADR mentions in user-generated content. In this paper, we exploit the implicit semantics of rare ADRs for which we have few training samples, in order to detect the most similar class for the given ADR. The evaluation results demonstrate that our proposed approach addresses the limitations of the existing techniques when the amount of training data is limited. ...

Give it a shot: Few-shot learning to normalize ADR mentions in Social Media posts

Conference paper (2019) - E. Manousogiannis, Sepideh Mesbah, Selene Baez Santamaria, Alessandro Bozzon, Robert-Jan Sips

This paper describes the system that team MYTOMORROWS-TU DELFT developed for the 2019 Social Media Mining for Health Applications (SMM4H) Shared Task 3, for the end-to-end normalization of ADR tweet mentions to their corresponding MEDDRA codes. For the first two steps, we reuse a state-of-the-art approach, focusing our contribution on the final entity-linking step. For that we propose a simple Few-Shot learning approach, based on pre-trained word embeddings and data from the UMLS, combined with the provided training data. Our system (relaxed F1: 0.337-0.345) outperforms the average (relaxed F1: 0.2972) of the participants in this task, demonstrating the potential feasibility of few-shot learning in the context of medical text normalization. ...

Training Data Augmentation for Detecting Adverse Drug Reactions in User-Generated Content

Conference paper (2019) - Sepideh Mesbah, Jie Yang, Robert-Jan Sips, Manuel Valle Torre, Christoph Lofi, Alessandro Bozzon, Geert-Jan Houben

Social media provides a timely yet challenging data source for adverse drug reaction (ADR) detection. Existing dictionary-based, semi-supervised learning approaches are intrinsically limited by the coverage and maintainability of laymen health vocabularies. In this paper, we introduce a data augmentation approach that leverages variational autoencoders to learn high-quality data distributions from a large unlabeled dataset, and subsequently, to automatically generate a large labeled training set from a small set of labeled samples. This allows for efficient social-media ADR detection with low training and re-training costs to adapt to the changes and emergence of informal medical laymen terms. An extensive evaluation performed on Twitter and Reddit data shows that our approach matches the performance of fully-supervised approaches while requiring only 25% of training data. ...

Coner

A Collaborative Approach for Long-Tail Named Entity Recognition in Scientific Publications

Conference paper (2019) - Daniel Vliegenthart, Sepideh Mesbah, Christoph Lofi, Akiko Aizawa, Alessandro Bozzon

Named Entity Recognition (NER) for rare long-tail entities, as often found in domain-specific scientific publications, is a challenging task, as the extensive training and test data needed for fine-tuning NER algorithms is typically lacking. Recent approaches presented promising solutions relying on training NER algorithms in an iterative weakly-supervised fashion, thus limiting human interaction to only providing a small set of seed terms. Such approaches heavily rely on heuristics in order to cope with the limited training data size. As these heuristics are prone to failure, the overall achievable performance is limited. In this paper, we therefore introduce a collaborative approach which incrementally incorporates human feedback on the relevance of extracted entities into the training cycle of such iterative NER algorithms. This approach, called Coner, still allows new domain-specific rare long-tail NER extractors to be trained at low cost, but with ever-increasing performance while the algorithm is actively used in an application. ...

SmartPub

A Platform for Long-Tail Entity Extraction from Scientific Publications

Conference paper (2018) - Sepideh Mesbah, Alessandro Bozzon, Christoph Lofi, Geert-Jan Houben

This demo presents SmartPub, a novel web-based platform that supports the exploration and visualization of shallow meta-data (e.g., author list, keywords) and deep meta-data (long-tail named entities which are rare, and often relevant only in specific knowledge domains) from scientific publications. The platform collects documents from different sources (e.g. DBLP and arXiv), and extracts the domain-specific named entities from the text of the publications using Named Entity Recognizers (NERs) which we can train with minimal human supervision even for rare entity types. The platform further enables interaction with the Crowd for filtering purposes or training data generation, and provides extended visualization and exploration capabilities. SmartPub will be demonstrated using a sample collection of scientific publications focusing on the computer science domain and will address the entity types Dataset (i.e. dataset presented or used in a publication), and Methods (i.e. algorithms used to create/enrich/analyse a data set) ...

Concept Focus

Semantic Meta-Data For Describing MOOC Content

Conference paper (2018) - Sepideh Mesbah, Guanliang Chen, Manuel Valle Torre, Alessandro Bozzon, Christoph Lofi, Geert-Jan Houben

MOOCs promised to herald a new age of open education. However, efficient access to MOOC content is still hard, thus unnecessarily complicating many use cases like efficient re-use of material, or tailored access for life-long learning scenarios. One of the reasons for this lack of accessibility is the shortage of meaningful semantic meta-data describing MOOC content and the resulting learning experience. In this paper, we explore Concept Focus, a new type of meta-data for describing a perceptual facet of modern video-based MOOCs, capturing how focused a learning resource is topic-wise, which is often an indicator of clarity and understandability. We provide the theoretical foundations of Concept Focus and outline a methodical workflow of how to automatically compute it for MOOC lectures. Furthermore, we show that the learners’ consumption behavior is correlated with a MOOC lecture’s Concept Focus, thus underlining that this type of meta-data is indeed relevant for user-centric querying, personalizing or even designing the MOOC experience. For showing this, we performed an extensive study with real-life MOOCs and 12,849 learners over the duration of three months. ...

TSE-NER

An Iterative Approach for Long-Tail Entity Extraction in Scientific Publications

Conference paper (2018) - Sepideh Mesbah, Christoph Lofi, Manuel Valle Torre, Alessandro Bozzon, Geert-Jan Houben

Named Entity Recognition and Typing (NER/NET) is a challenging task, especially with long-tail entities such as the ones found in scientific publications. These entities (e.g. “WebKB”, “StatSnowball”) are rare, often relevant only in specific knowledge domains, yet important for retrieval and exploration purposes. State-of-the-art NER approaches employ supervised machine learning models, trained on expensive type-labeled data laboriously produced by human annotators. A common workaround is the generation of labeled training data from knowledge bases; this approach is not suitable for long-tail entity types that are, by definition, scarcely represented in KBs.
This paper presents an iterative approach for training NER and NET classifiers in scientific publications that relies on minimal human input, namely a small seed set of instances for the targeted entity type. We introduce different strategies for training data extraction, semantic expansion, and result entity filtering. We evaluate our approach on scientific publications, focusing on the long-tail entity types Datasets and Methods in computer science publications, and Proteins in biomedical publications. ...

Nudge your Workforce

A Study on the Effectiveness of Task Notification Strategies in Enterprise Mobile Crowdsourcing

Conference paper (2017) - Sarah Bashirieh, Sepideh Mesbah, Judith Redi, Alessandro Bozzon, Zoltán Szlávik, Robert-Jan Sips

As crowdsourcing gains popularity, organisations seek ways to systematically and reliably involve their workforce with data processing pipelines. Mobile crowdsourcing allows for opportunistic task executions and thus, potentially, for higher throughput. However, how to engage and to retain employees in enterprise crowdsourcing campaigns is still an open research topic. This paper discusses the results of a study performed in IBM Benelux. We surveyed 93 employees to discover the factors that might affect engagement in mobile enterprise crowdsourcing. The survey informed the design of an experiment that aimed at investigating the effectiveness of different task notification strategies. We studied how factors such as time and context of notification can affect the participation and retention of employees. Results show that break times are the most suitable for crowd work, and that "aggressive" notification strategies act as deterrent for participation, while moderate yet regular nudges are the most likely to retain contributors. ...

Semantic Annotation of Data Processing Pipelines in Scientific Publications

Conference paper (2017) - Sepideh Mesbah, Kyriakos Fragkeskos, Christoph Lofi, Alessandro Bozzon, Geert-Jan Houben

Data processing pipelines are a core object of interest for data scientists and practitioners operating in a variety of data-related application domains. To effectively capitalise on the experience gained in the creation and adoption of such pipelines, the need arises for mechanisms able to capture knowledge about datasets of interest, data processing methods designed to achieve a given goal, and the performance achieved when applying such methods to the considered datasets. However, due to its distributed and often unstructured nature, this knowledge is not easily accessible. In this paper, we use (scientific) publications as a source of knowledge about Data Processing Pipelines. We describe a method designed to classify sentences according to the nature of the contained information (i.e. scientific objective, dataset, method, software, result), and to extract relevant named entities. The extracted information is then semantically annotated and published as linked data in open knowledge repositories according to the DMS ontology for data processing metadata. To demonstrate the effectiveness and performance of our approach, we present the results of a quantitative and qualitative analysis performed on four different conference series. ...

Describing Data Processing Pipelines in Scientific Publications for Big Data Injection

Conference paper (2017) - Sepideh Mesbah, Alessandro Bozzon, Christoph Lofi, Geert-Jan Houben

The rise of Big Data analytics has been a disruptive game changer for many application domains, allowing the integration into domain-specific applications and systems of insights and knowledge extracted from external big data sets. The effective “injection” of external Big Data demands an understanding of the properties of available data sets, and expertise on the available and most suitable methods for data collection, enrichment and analysis. A prominent knowledge source is scientific literature, where data processing pipelines are described, discussed, and evaluated. Such knowledge is however not readily accessible, due to its distributed and unstructured nature. In this paper, we propose a novel ontology aimed at modeling properties of data processing pipelines, and their related artifacts, as described in scientific publications. The ontology is the result of a requirement analysis that involved experts from both academia and industry. We showcase the effectiveness of our ontology by manually applying it to a collection of publications describing data processing methods. ...

Facet Embeddings for Explorative Analytics in Digital Libraries

Conference paper (2017) - Sepideh Mesbah, Kyriakos Fragkeskos, Christoph Lofi, Alessandro Bozzon, Geert-Jan Houben

With the increasing amount of scientific publications in digital libraries, it is crucial to capture “deep meta-data” to facilitate more effective search and discovery, like search by topics, research methods, or data sets used in a publication. Such meta-data can also help to better understand and visualize the evolution of research topics or research venues over time. The automatic generation of meaningful deep meta-data from natural-language documents is challenged by the unstructured and often ambiguous nature of publications’ content. In this paper, we propose a domain-aware topic modeling technique called Facet Embedding which can generate such deep meta-data in an efficient way. We automatically extract a set of terms according to the key facets relevant to a specific domain (i.e. scientific objective, used data sets, methods, or software, obtained results), relying only on limited manual training. We then cluster and subsume similar facet terms according to their semantic similarity into facet topics. To showcase the effectiveness and performance of our approach, we present the results of a quantitative and qualitative analysis performed on ten different conference series in a Digital Library setting, focusing on the effectiveness for document search, but also for visualizing scientific trends. ...