S.S. Dijkstra

2 records found

Master thesis (2022) - S.S. Dijkstra, P. van Buuren, P. Chen, G. Jongbloed, A. Papapantoleon

Improving data quality is of the utmost importance for any data-driven company, as data quality is unmistakably tied to business analytics and processes. One method to improve data quality is to restore missing and wrong data entries.

The goal of this research is to construct an algorithm that makes it possible to restore missing and wrong data entries, while making use of a human adaptive framework. This algorithm has been constructed in a modular fashion and consists of three main modules: Data Transformation, Data Structure Analysis and Model Selection. Data Transformation converts raw data into data types and forms that the other modules can use.

Data Structure Analysis has been designed to deal with correctly missing data and dichotomy in the target feature by making use of three clustering algorithms: DBSCAN, K-Means and Diffusion Maps. DBSCAN is used to determine the necessity of clustering as well as the initialisation of the K-Means algorithm. K-Means and Diffusion Maps have been used as clustering methods in the one-dimensional target feature and the two-dimensional input-target feature pairs, respectively. Data Structure Analysis has further been designed to perform feature selection through three filter methods: CorrCoef, FCBF and Treelet.
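A correlation-based filter of the kind the name CorrCoef suggests can be sketched as follows. This is a minimal illustration, assuming the filter ranks features by the absolute Pearson correlation with the target and keeps those above a cut-off; the function names, the 0.5 threshold and the toy columns are hypothetical and not taken from the thesis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def corr_filter(features, target, threshold=0.5):
    """Keep features whose |correlation| with the target meets the threshold.

    `features` maps a (hypothetical) column name to its list of values.
    Returns the kept names, strongest correlation first.
    """
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    kept = [name for name, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda name: scores[name], reverse=True)


# Toy data, invented for illustration only.
features = {
    "age": [20, 30, 40, 50, 60],
    "noise": [3, 1, 4, 1, 5],
    "constant": [7, 7, 7, 7, 7],
}
target = [1.0, 2.1, 2.9, 4.2, 5.0]
selected = corr_filter(features, target)
print(selected)
```

A filter method like this scores each feature independently of any downstream model, which is what makes it cheap to run before model selection.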

Model Selection has proposed a novel approach to selecting the best model from a candidate set through the optimisation of a conditional model ranking strategy based on the prior construction of theoretical testing. Our candidate set consisted of Expectation Maximisation, K-Means, Multi-Layer Perceptron, Nearest Neighbor, Random Forest, Linear Regression, Polynomial Regression and ElasticNet Regression.
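The general idea of choosing one model from a candidate set can be illustrated with a much simpler stand-in. The sketch below does not reproduce the conditional ranking strategy or the theoretical testing described above; it assumes plain hold-out validation instead, ranks three toy one-dimensional candidates by mean absolute error, and uses hypothetical names and invented data throughout. The mean predictor is an added baseline, not part of the thesis candidate set.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mu = mean(ys)
    return lambda x: mu


def fit_nearest_neighbour(xs, ys):
    """Candidate: copy the target of the closest training input."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def fit_linear(xs, ys):
    """Candidate: ordinary least squares line through the training data."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def select_model(candidates, train, valid):
    """Fit each candidate on `train`, rank by mean absolute error on `valid`."""
    (train_x, train_y), (valid_x, valid_y) = train, valid
    errors = {}
    for name, fit in candidates.items():
        predict = fit(train_x, train_y)
        errors[name] = mean(abs(predict(x) - y) for x, y in zip(valid_x, valid_y))
    ranking = sorted(errors, key=errors.get)  # lowest error first
    return ranking[0], ranking, errors


# Toy data, invented for illustration: the target is roughly 2 * x.
train = ([0, 1, 2, 3, 4, 5], [0.1, 1.9, 4.1, 5.9, 8.1, 9.9])
valid = ([0.5, 2.5, 4.5], [1.0, 5.0, 9.0])
candidates = {
    "mean": fit_mean,
    "nearest_neighbour": fit_nearest_neighbour,
    "linear": fit_linear,
}
best, ranking, errors = select_model(candidates, train, valid)
print(best, ranking)
```

Keeping the candidates in a dictionary of fitting functions means a further model can be added without touching the selection logic.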

In terms of restorability, it was shown that the optimal configuration of the Cleansing Algorithm for the restoration of missing data was obtained by opting not to use clustering, using a custom alteration to the Treelet algorithm for feature selection and making use of the model selection strategy. This not only led to the greatest restorability of 56.90% on Aegon data sets, which was an improvement of 44.83% when compared to not using the Cleansing Algorithm, but also to the reduction of computation time by over 400%. A more realistic restorability, given the presence of correctly missing data, was obtained with the same configuration making use of one-dimensional output clustering. This resulted in a restorability on Aegon data sets of 43.10%. As such, it was deemed possible to restore missing data on Aegon data sets.

With respect to the human adaptive framework, it was determined that the construction of the algorithm be modular in the sense that any alternate feature selection or clustering approach can be implemented with ease. Furthermore, the model selection module allows us to customize the theoretical testing and choice of regression or classification models for the restoration of missing data. In doing so, the algorithm has laid the foundations for human adaptivity of the Cleansing Algorithm. ...

In this work we set out to determine the impact, if any, of news analysis on stock price prediction: can we predict stock movements more accurately, and on a consistent basis, than a proposed baseline or random guessing by analysing the text of news? We considered a methodology to be more accurate if its success rate is greater than that of a baseline or random guess. We considered a methodology to be consistently more accurate if the average of the success rates over a specified number of runs, say one hundred, is greater than that of a baseline or random guess. We found that the analysis of news, though news is readily available thanks to modern technology, comes paired with some problems. 1. The widespread availability of news has made it more difficult to find the news that matters to us, since news can cover anything and everything. 2. News can discuss events anywhere from the far past to the far future, making consistent analysis difficult. 3. Most financial news sources tend to block any mass data mining attempts. These problems can mostly be solved by making use of so-called 8-K reports. These reports only cover major company events, sorted into nine different categories. The 8-K reports reduce the time interval the news impacts from the far past and far future to an interval of five business days, as the reports ought to be published within four business days. Finally, since companies are obligated by the U.S. Securities and Exchange Commission to publish these reports, the reports are readily available and easily accessible through the Commission's website. We can then analyze these texts using a rule-based or automatic text analysis approach. However, the rule-based approach, which uses lists of positive and negative words for the analysis, tends to be unreliable, as text contains a plethora of challenging cases.

This problem is solved by using automatic text analysis, which uses predetermined scores for texts. The form of automatic text analysis used is a decision tree approach. Though the single decision trees we construct tend to over-fit, we can construct random forests of decision trees on subsets of our input data to solve this problem. For our analysis we looked at the stock prices of Tesla, Microsoft, EA and Amazon, due to their varying values. We gave scores to the texts of the 8-K reports using the stock price movement of the day of publishing. We did this for up to 4 business days prior to publishing as well. We also compensated for the market movements using the variable for days zero to four. We gave scores of -1, 0 or 1 depending on the price movement. This generally resulted in success rates greater than our considered baseline of 33.33% for random guessing. The highest success rates for Tesla, Microsoft, EA and Amazon were, in order: 73.09%, 100.00%, 88.64% and 84.95%. ...
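The scoring and evaluation criteria in this abstract can be made concrete with a short sketch. It assumes that a report is scored from the stock return minus the market return and that movements within a 1% band count as 0; both the subtraction and the threshold are illustrative assumptions, as are all function names and numbers. Only the three labels, the 33.33% random-guess baseline and the averaging of success rates over runs come from the text.

```python
from statistics import mean


def label_movement(stock_return, market_return=0.0, threshold=0.01):
    """Score a report as -1, 0 or 1 from the market-adjusted price movement.

    The 1% threshold and the plain subtraction of the market return are
    assumptions made for illustration, not values taken from the abstract.
    """
    adjusted = stock_return - market_return
    if adjusted > threshold:
        return 1
    if adjusted < -threshold:
        return -1
    return 0


def success_rate(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


def consistently_more_accurate(rates, baseline=1 / 3):
    """True if the average success rate over all runs exceeds the baseline."""
    return mean(rates) > baseline


# Invented returns: (stock, market) pairs for three hypothetical reports.
actual = [label_movement(s, m) for s, m in [(0.03, 0.005), (-0.02, 0.01), (0.004, 0.001)]]
predicted = [1, -1, 1]  # hypothetical classifier output
rate = success_rate(predicted, actual)

# Hypothetical success rates from three runs instead of one hundred.
good_runs = consistently_more_accurate([0.40, 0.30, 0.45])
bad_runs = consistently_more_accurate([0.30, 0.35, 0.32])
print(actual, rate, good_runs, bad_runs)
```

Averaging over runs matters because a random forest is trained on random subsets of the data, so a single run can beat the baseline by chance.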