S. Mesbah
13 records found
LOREM
Language-consistent Open Relation Extraction from Unstructured Text
However, their performance is limited when dealing with long-tail entities. Long-tail entities are entities that have a low frequency in the document collections and usually have no reference to existing Knowledge Bases. Obtaining human-labeled datasets is expensive and time-consuming, especially for long-tail entities that are scarcely available in document collections. This dissertation focuses on the problem of the lack of training data, arguably the largest bottleneck in training machine learning-based NER techniques. We investigated efficient and effective ways to augment training data by enhancing their size and quality automatically. Our work aimed at showing how, by enhancing the size and quality of the training data using different techniques, it will be possible to improve the performance of Long-tail Entity Recognition (L-tER).
Coner
A Collaborative Approach for Long-Tail Named Entity Recognition in Scientific Publications
Named Entity Recognition (NER) for rare long-tail entities, such as those often found in domain-specific scientific publications, is a challenging task, as the extensive training and test data needed for fine-tuning NER algorithms is typically lacking. Recent approaches presented promising solutions that train NER algorithms in an iterative, weakly supervised fashion, thus limiting human interaction to providing a small set of seed terms. Such approaches rely heavily on heuristics to cope with the limited training data size. As these heuristics are prone to failure, the overall achievable performance is limited. In this paper, we therefore introduce a collaborative approach which incrementally incorporates human feedback on the relevance of extracted entities into the training cycle of such iterative NER algorithms. This approach, called Coner, still allows new domain-specific, rare long-tail NER extractors to be trained at low cost, but with ever-increasing performance while the algorithm is actively used in an application.
Concept Focus
Semantic Meta-Data For Describing MOOC Content
However, efficient access to MOOC content is still hard, thus unnecessarily complicating many use cases like efficient re-use of material, or tailored access for life-long learning scenarios. One of the reasons for this lack of accessibility is the shortage of meaningful semantic meta-data describing MOOC content and the resulting learning experience. In this paper, we explore Concept Focus, a new type of meta-data for describing a perceptual facet of modern video-based MOOCs, capturing how focused a learning resource is topic-wise, which is often an indicator of clarity and understandability. We provide the theoretical foundations of Concept Focus and outline a methodical workflow of how to automatically compute it for MOOC lectures. Furthermore, we show that the learners’ consumption behavior is correlated with a MOOC lecture’s Concept Focus, thus underlining that this type of meta-data is indeed relevant for user-centric querying, personalizing or even designing the MOOC experience. For showing this, we performed an extensive study with real-life MOOCs and 12,849 learners over the duration of three months.
SmartPub
A Platform for Long-Tail Entity Extraction from Scientific Publications
TSE-NER
An Iterative Approach for Long-Tail Entity Extraction in Scientific Publications
This paper presents an iterative approach for training NER and NET classifiers in scientific publications that relies on minimal human input, namely a small seed set of instances for the targeted entity type. We introduce different strategies for training data extraction, semantic expansion, and result entity filtering. We evaluate our approach on scientific publications, focusing on the long-tail entity types Datasets and Methods in computer science publications, and Proteins in biomedical publications.
Nudge your Workforce
A Study on the Effectiveness of Task Notification Strategies in Enterprise Mobile Crowdsourcing
As crowdsourcing gains popularity, organisations seek ways to systematically and reliably involve their workforce with data processing pipelines. Mobile crowdsourcing allows for opportunistic task executions and thus, potentially, for higher throughput. However, how to engage and to retain employees in enterprise crowdsourcing campaigns is still an open research topic. This paper discusses the results of a study performed at IBM Benelux. We surveyed 93 employees to discover the factors that might affect engagement in mobile enterprise crowdsourcing. The survey informed the design of an experiment that aimed at investigating the effectiveness of different task notification strategies. We studied how factors such as time and context of notification can affect the participation and retention of employees. Results show that break times are the most suitable for crowd work, and that "aggressive" notification strategies act as a deterrent for participation, while moderate yet regular nudges are the most likely to retain contributors.
With the increasing amount of scientific publications in digital libraries, it is crucial to capture “deep meta-data” to facilitate more effective search and discovery, like search by topics, research methods, or data sets used in a publication. Such meta-data can also help to better understand and visualize the evolution of research topics or research venues over time. The automatic generation of meaningful deep meta-data from natural-language documents is challenged by the unstructured and often ambiguous nature of publications’ content. In this paper, we propose a domain-aware topic modeling technique called Facet Embedding which can generate such deep meta-data in an efficient way. We automatically extract a set of terms according to the key facets relevant to a specific domain (i.e. scientific objective, used data sets, methods, or software, obtained results), relying only on limited manual training. We then cluster and subsume similar facet terms according to their semantic similarity into facet topics. To showcase the effectiveness and performance of our approach, we present the results of a quantitative and qualitative analysis performed on ten different conference series in a Digital Library setting, focusing on the effectiveness for document search, but also for visualizing scientific trends.