Artificial Intelligence in Customs Risk Management for e-Commerce

Design of a Web-crawling Architecture for the Dutch Customs Administration

Abstract

The last decade saw the rise of e-commerce trade and the shift of the manufacturing industry to emerging economies, China first of all. In this context, the European customs authorities experienced an explosion of small parcels coming from e-commerce websites, often from China, and faced difficulties in detecting fiscal fraud and security threats with their conventional risk management systems. To address this problem, the European project PROFILE brings together the customs administrations of the Netherlands, Belgium, Sweden, Norway, and Estonia, aiming to provide the EU with a shared platform for: (1) accurately assessing customs risks; (2) optimizing operations and logistics by integrating multiple sources of information; (3) developing a shared data platform for exchanging customs risk management (CRM) practices.

As part of this project, the Dutch Customs Administration (DCA) and International Business Machines (IBM) Corporation are collaborating to deploy cutting-edge artificial intelligence technologies to automatically cross-check customs declarations from Chinese e-commerce against online information. Following a Design Science approach, I carried out this research at the Delft University of Technology in collaboration with IBM Netherlands, aiming to deliver a preparatory study for the development team before the PROFILE project begins. This includes knowledge brokering between the Dutch Customs Administration and IBM Netherlands so that a more precise problem scope can be defined and the requirements elicited. In particular, this research focuses on the first part of the project: the development of an adaptive web-crawler for e-commerce, able to compare declaration documents against online information.

According to the Dutch Customs Administration, the web-crawling system should gather the description of the goods from the declaration, search for the product on the web, find its sale price on e-commerce platforms, compare it with the declared value, and return a green/red risk indicator to the targeting officer. The design process of this system follows approaches from the systems engineering discipline, starting with the requirements analysis, addressing the requirements with state-of-the-art big data analytics, and finally deriving the logical components of the system, whose design is presented through a logical architecture.
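
To make this flow concrete, the sketch below illustrates the value-comparison step in Python. The median-based reference price, the 50% undervaluation threshold, and the function names are illustrative assumptions made for this summary, not part of the DCA specification.

    from statistics import median

    def assess_declaration(declared_value, online_prices, undervaluation_threshold=0.5):
        """Return a 'green' or 'red' risk indicator for a single declaration.

        The declaration is flagged red when the declared value falls below a
        fraction of the median price observed on e-commerce platforms.
        """
        if not online_prices:
            return "red"  # no reference price found; escalate to the targeting officer
        reference_price = median(online_prices)
        return "red" if declared_value < undervaluation_threshold * reference_price else "green"

    # Example: an item declared at EUR 4 while comparable listings sell for about EUR 20.
    print(assess_declaration(4.0, [19.5, 21.0, 20.0]))  # -> "red"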

First, the application domain is investigated. Goods entering the Netherlands require an entry declaration. These goods arrive at the port of Rotterdam or Schiphol airport, where some are imported into the country (import/export) and others stop temporarily as transit goods waiting to be shipped elsewhere. The Dutch Customs Administration monitors these processes through risk management systems aimed at stopping non-compliant goods. This research describes these practices, with a particular focus on e-commerce risk targeting. On the e-commerce side, a study of the processes behind an online purchase is also carried out through a real purchase on a Chinese e-commerce platform. This was used to observe how the Chinese sender described the item and how the Dutch Customs assessed the risk and decided on the duties to be paid, which prompted a reflection on possible fraud scenarios and how to address them. Finally, the Dutch Customs also reported that product descriptions are often vague and ambiguous, leading to a more accurate formulation of the problem.

Secondly, an in-depth literature review on web-crawling and big data analytics techniques is carried out, investigating the technologies that could address the requirements and the problem formulation. Starting from the existing literature on big data analytics, this review also covers the recent trends in machine learning and artificial intelligence. To keep the review focused, the topics have been carefully selected, for instance covering only techniques for web analytics and text analytics.

The literature on big data analytics is further broken into two sub-topics: a theoretical one, which classifies the types of analytics methods and defines machine learning and natural language processing, including the recent paradigms of deep learning and reinforcement learning; and a practical one, which proposes guidelines for the design, development, and implementation of machine learning techniques. It is here that a theoretical framework for systematically reflecting on the challenges of big data analytics has been identified. This framework is then used to systematically collect the main technological challenges of the use case under analysis and translate them into non-functional requirements.

Finally, the last part of the literature review describes what a web-crawler is and what web-crawling/web-scraping means. This then extends to the concepts of focused web-crawling and smart, intelligent, adaptive web-crawling, where machine learning techniques are deployed to improve performance. The review concludes by presenting related work on machine learning techniques applied to smart web-crawling of e-commerce websites and by stating the knowledge gap that needs to be bridged to address the use case under analysis.
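
As a minimal illustration of focused web-crawling, the sketch below expands the crawl frontier only from pages judged relevant to a query. The keyword-overlap relevance score stands in for the machine learning models discussed in the literature, and the libraries used (requests, BeautifulSoup) are common choices rather than the technology stack of the PROFILE project.

    from collections import deque
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def relevance_score(text, query):
        """Toy relevance heuristic: fraction of query terms found in the page text."""
        terms = query.lower().split()
        return sum(term in text.lower() for term in terms) / len(terms) if terms else 0.0

    def focused_crawl(seed_urls, query, max_pages=50, threshold=0.5):
        """Breadth-first crawl that only follows links found on relevant pages."""
        frontier = deque(seed_urls)
        visited, relevant_pages = set(), []
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
            try:
                html = requests.get(url, timeout=5).text
            except requests.RequestException:
                continue
            soup = BeautifulSoup(html, "html.parser")
            text = soup.get_text(" ", strip=True)
            if relevance_score(text, query) >= threshold:
                relevant_pages.append(url)
                # Focused crawling: only expand the frontier from relevant pages.
                for link in soup.find_all("a", href=True):
                    frontier.append(urljoin(url, link["href"]))
        return relevant_pages

An adaptive crawler, as discussed in the literature above, would replace the fixed heuristic and threshold with models that learn which pages are worth visiting.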

After the application domain study and the literature review, the knowledge from these phases is combined in a continuous iterative process according to the design science methodology (Hevner, 2014). The requirements elicitation is carried out through unstructured interviews with DCA and IBM experts, using the approach by Armstrong and Sage (2000) from the field of systems engineering. The main objective of the system to be developed is broken down into a series of sub-activities that must be carefully structured to formulate the requirements. For the non-functional requirements, instead of reflecting on the different domains (technological, environmental, legal compliance, etc.) as proposed by the same systems engineering approach, this research uses the framework identified in the literature review on the main challenges of big data projects (Sivarajah, 2016).

To derive the components of the architecture from the requirements and customer needs, the Axiomatic Design methodology proposed by Suh (1998) has been used, mapping the requirements into architectural components in a rigorous manner. The design domains proposed by this methodology (customer, functional, physical, and process domains) are taken as the reference point for the design process: first, the business needs are identified, then these are translated into requirements, which are in turn mapped into design features. The process domain is left out of this research and will be addressed by the IBM development team in Ireland.
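
As a purely hypothetical illustration of this mapping (the actual needs, requirements, and components are derived in the body of the research), the traceability between domains can be pictured as a simple data structure:

    # Hypothetical traceability from customer needs (CNs) to functional
    # requirements (FRs) to design parameters (DPs); the entries below are
    # illustrative examples, not the actual PROFILE requirements.
    customer_needs = {"CN1": "Detect undervalued e-commerce declarations"}
    functional_requirements = {
        "FR1": ("CN1", "Retrieve online prices for a declared product"),
        "FR2": ("CN1", "Compare the declared value against retrieved prices"),
    }
    design_parameters = {
        "DP1": ("FR1", "Web-crawling and search service"),
        "DP2": ("FR2", "Risk-indicator service"),
    }
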
The design cycle leads to a web-crawling system represented through a service-oriented architecture (SOA). A block diagram and a black-box description of each application service are provided. Furthermore, the functionality of the architecture is described with an architecture walk-through and a sequence diagram in the Unified Modeling Language (UML). The result is an innovative real-time web-crawling system that identifies the value of a given product on e-commerce websites. It deploys natural language processing models to filter out non-relevant search results and other machine learning models to best match the remaining relevant results with a given item description.
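
The matching step can be pictured with the sketch below, which ranks candidate e-commerce listings against a declared goods description by text similarity. TF-IDF cosine similarity (via scikit-learn) is used here purely as a stand-in for the natural language processing and machine learning models of the actual architecture.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def rank_listings(declared_description, listings):
        """Return (listing, score) pairs sorted by similarity to the description."""
        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform([declared_description] + listings)
        scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
        return sorted(zip(listings, scores), key=lambda pair: pair[1], reverse=True)

    declared = "wireless bluetooth earphones with charging case"
    candidates = [
        "Bluetooth 5.0 wireless earbuds, charging case included",
        "USB-C charging cable, 1 m",
        "Over-ear wired headphones",
    ]
    for listing, score in rank_listings(declared, candidates):
        print(f"{score:.2f}  {listing}")

Low-scoring results would be filtered out as non-relevant, and the prices of the best matches would feed the value comparison described earlier.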

The design and architecture description of this innovative web-crawling system is the main artifact of this research, while the combination of systems engineering methods with big data analytics frameworks constitutes another important scientific contribution.