Analysis of a Data Processing Pipeline for Generating Knowledge Graphs from Unstructured Data

Data Processing Pipeline for Knowledge Graphs


Abstract

The rapid growth of unstructured data across different media poses new challenges for its analysis. To address them, data processing pipelines are built from different tools and technologies that analyze the data at different stages. One application that we find useful for our company is the creation of knowledge graphs, which offer a better representation and understanding of the relations in the data. A knowledge graph is a structure for representing information in which nodes represent entities and edges define the relationships among them. Constructing a knowledge graph involves extracting meaningful information about entities and relations from unstructured textual data and storing it in a graph database. In this project, we use Neo4j as the graph database for efficient storage of data in the form of nodes and relations. Our first research question therefore addresses the architecture and implementation of a data processing pipeline that constructs knowledge graphs from unstructured textual data. The pipeline has three major stages, and each component is implemented in a microservice architecture. The first stage parses textual documents in two formats, PDF and PPT. The second stage applies natural language processing techniques to extract meaningful information from the raw text. The final stage stores the key pieces of data in a graph database (Neo4j) to construct the knowledge graph. We run the pipeline on a local machine to evaluate the performance and results of each component. The core task of retrieving insights from the unstructured data is accomplished with natural language processing.
To investigate this component further, our second research question examines cloud-based natural language processing services from three well-known providers: Amazon, Google, and IBM. To choose a suitable service, we evaluate their performance on a common data set drawn from the Marketing category of Wikipedia. Based on our experimental analysis, IBM stands out in terms of output quality, execution time, features, and cost. Adopting a cloud-based service not only speeds up the development of business solutions but also reduces the engineering effort, cost, and maintenance that a custom implementation would require, in exchange for a small usage-based fee.
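
As an illustration of the three-stage pipeline described in this abstract (parse, extract, store), the following is a minimal sketch and not the implementation used in the project. It assumes the text has already been pulled out of a PDF or PPT file, replaces the natural language processing stage with a toy regular expression, and emits parameterized Cypher `MERGE` statements instead of writing to a live Neo4j instance. All names (`Triple`, `parse_document`, `extract_triples`, `to_cypher`, `run_pipeline`) and the `Entity` label are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Triple:
    """One (subject, relation, object) fact destined for the graph."""
    subject: str
    relation: str
    obj: str


def parse_document(raw: str) -> str:
    """Stage 1 stand-in: normalise whitespace in already-extracted text.

    A real parser would read PDF or PPT files; that step is assumed here.
    """
    return " ".join(raw.split())


# Toy pattern (an assumption): "<Subject> <verb> <Object>." for three verbs.
# A real pipeline would use an NLP library or a cloud NLP service instead.
PATTERN = re.compile(r"([A-Z][\w ]*?) (uses|stores|extracts) ([\w ]+?)\.")


def extract_triples(text: str) -> List[Triple]:
    """Stage 2 stand-in: pull simple subject-verb-object facts from text."""
    return [Triple(s.strip(), r, o.strip()) for s, r, o in PATTERN.findall(text)]


def to_cypher(triple: Triple) -> Tuple[str, Dict[str, str]]:
    """Stage 3 stand-in: build a parameterized Cypher statement.

    Relationship types cannot be passed as Cypher parameters, so the
    relation is sanitised and embedded in the query text.
    """
    rel_type = re.sub(r"\W", "_", triple.relation).upper()
    query = (
        "MERGE (a:Entity {name: $subject}) "
        "MERGE (b:Entity {name: $object}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )
    return query, {"subject": triple.subject, "object": triple.obj}


def run_pipeline(raw: str) -> List[Tuple[str, Dict[str, str]]]:
    """Chain the three stages: parse, extract, build graph statements."""
    return [to_cypher(t) for t in extract_triples(parse_document(raw))]


if __name__ == "__main__":
    sample = "The pipeline  uses Neo4j.\nNeo4j stores nodes."
    for query, params in run_pipeline(sample):
        print(query, params)
```

Using `MERGE` rather than `CREATE` keeps the sketch idempotent: rerunning the pipeline on the same document would not duplicate nodes or relationships. In a microservice setting, each of the three functions would sit behind its own service boundary, and the query and parameter pairs would be handed to a Neo4j client.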