Reverse Engineering Relational Data for Entity Type Recognition in Enterprise Solutions at ING


Abstract

Database entity type recognition is the task of identifying the conceptual entity types that a given data set contains data about. In big data or data lake settings, it is not always known which conceptual entity types are represented in each data set, which makes it difficult to extract value from the data. Depending on the logical schema, a conceptual entity type can also be represented in the data instances in several different ways. This phenomenon, called semantic heterogeneity, poses a challenge for database entity type recognition. Narrowing the problem space to a specific organization makes these problems easier to cope with. An organization knows which entity types it uses and requires only those to be recognized. And while representations are heterogeneous, the logical schemas likely adhere to a common set of rules, which can be exploited to recognize semantic heterogeneity. Furthermore, experts at the organization can provide example data instances for each conceptual entity type of interest, which serve as ground truth for the proposed database entity type recognition solution. The proposed solution builds data profiles of the example data instances and then attempts to recognize entity types in previously unseen data instances using a rule-based approach. Rules are used because they make results easy to explain, as is often desired at a bank, and because they can easily be added to or removed from the solution, which keeps it adaptable. Experiments with the proposed solution show promising results, with up to 90 percent of entity types correctly recognized over a total of 170,000 entities.
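To make the profile-then-rules idea concrete, the following is a minimal sketch in Python, not the implementation evaluated in this work. The profile features (`mean_length`, `digit_fraction`, `alpha_fraction`), the three rules with their thresholds, the entity type names, and the sample values are all assumptions chosen for illustration. The sketch profiles expert-provided example values per entity type, profiles an unseen column the same way, and reports the entity type whose rules fire most often, together with the names of the rules that fired.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Profile:
    """A tiny, hypothetical data profile of a column of string values."""
    mean_length: float
    digit_fraction: float
    alpha_fraction: float


def make_profile(values):
    """Summarize a non-empty list of strings as a Profile."""
    chars = "".join(values)
    total = len(chars) or 1  # avoid division by zero when all strings are empty
    return Profile(
        mean_length=mean(len(v) for v in values),
        digit_fraction=sum(c.isdigit() for c in chars) / total,
        alpha_fraction=sum(c.isalpha() for c in chars) / total,
    )


# Each rule compares a reference profile with the profile of unseen data.
# Rules live in a dict so they can be added or removed independently.
# The thresholds are illustrative assumptions.
RULES = {
    "similar_length": lambda ref, new: abs(ref.mean_length - new.mean_length) <= 2.0,
    "similar_digit_share": lambda ref, new: abs(ref.digit_fraction - new.digit_fraction) <= 0.2,
    "similar_alpha_share": lambda ref, new: abs(ref.alpha_fraction - new.alpha_fraction) <= 0.2,
}


def recognize(values, reference_profiles):
    """Return (entity_type, fired_rule_names) for the best-matching type.

    Returns (None, []) when no rule fires for any entity type.
    """
    new = make_profile(values)
    best_type, best_fired = None, []
    for entity_type, ref in reference_profiles.items():
        fired = [name for name, rule in RULES.items() if rule(ref, new)]
        if len(fired) > len(best_fired):
            best_type, best_fired = entity_type, fired
    return best_type, best_fired


# Made-up example instances standing in for expert-provided ground truth.
examples = {
    "iban": ["NL91ABNA0417164300", "NL02RABO0123456789"],
    "customer_name": ["Jan de Vries", "Anna Bakker"],
}
reference_profiles = {t: make_profile(v) for t, v in examples.items()}

iban_result = recognize(["NL18INGB0001234567"], reference_profiles)
name_result = recognize(["Piet Jansen"], reference_profiles)
print(iban_result)
print(name_result)
```

Because the result carries the names of the rules that fired, each recognition can be traced back to specific, human-readable checks, which is the explainability property the rule-based approach aims for.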
