W. Hajer
Please Note
<p>This page displays the records of the person named above and is not linked to a unique person identifier. This record may need to be merged to a profile.</p>
2 records found
Authorship attribution is the task of determining the unknown author of a text. In forensic authorship attribution, the likelihood that a suspect has written a specific text of unknown origin is computed based on reference texts from both the suspect and a background population. The current method used at the Netherlands Forensic Institute contains a manual and a computational part. In this thesis, we attempted to improve the computational part of this process. We study this problem from three directions.
Firstly, the performance of state-of-the-art computational authorship attribution methods was assessed on Dutch, forensically relevant corpora. The compared methods were support vector machines combined with masking, using either word or character n-grams as features; BERT-based models using a mean pooling strategy to handle long texts; and the baseline, which consists of a logistic regression model with the 100 most frequent Dutch words as features. We observed performance differences between the state-of-the-art methods similar to those reported in the literature. The best-performing method was a support vector machine without masking using character n-grams as features. In comparison, both the baseline and the BERT-based models performed worse on our corpora.
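The masking and character n-gram features mentioned above can be illustrated with a minimal sketch. It assumes one common masking variant, in which every letter of a word outside a list of frequent words is replaced by an asterisk; the abstract does not state the exact scheme used in the thesis, and the function names `most_frequent_words`, `mask_infrequent_words` and `char_ngrams` are hypothetical.

```python
import re
from collections import Counter

# Matches runs of Unicode letters (no digits, no underscore).
WORD = re.compile(r"[^\W\d_]+")

def most_frequent_words(texts, k):
    """Return the set of the k most frequent (lower-cased) words in `texts`."""
    counts = Counter(w.lower() for t in texts for w in WORD.findall(t))
    return {w for w, _ in counts.most_common(k)}

def mask_infrequent_words(text, frequent_words):
    """Replace each letter of a word not in `frequent_words` with '*'.

    Assumed masking variant: word length, spacing and punctuation are kept.
    """
    def repl(match):
        word = match.group(0)
        return word if word.lower() in frequent_words else "*" * len(word)
    return WORD.sub(repl, text)

def char_ngrams(text, n=3):
    """Count overlapping character n-grams of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```

Lowering `k` masks more words, which is the knob behind the trade-off between performance and topic-robustness described further on. The resulting n-gram counts would then be fed to a classifier such as a support vector machine.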
Secondly, a score-based likelihood ratio system was created to adapt the computational authorship attribution methods for use in forensics. This system is based on kernel density estimators and uses cross-calibration to handle the small number of training and calibration texts of the suspect. For most methods, the performance is in line with their performance outside the likelihood ratio system, except for the BERT-based methods, which significantly underperform when part of a likelihood ratio system. This is likely caused by the combination of cross-calibration and the randomness in fine-tuning BERT models.
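The core idea of a score-based likelihood ratio can be sketched as follows. This is a minimal illustration, assuming Gaussian kernels with a fixed bandwidth; the kernel, the bandwidth choice and the cross-calibration procedure of the thesis are not given in the abstract, so cross-calibration is omitted here and the names `gaussian_kde` and `likelihood_ratio` are hypothetical.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

def likelihood_ratio(score, same_author_scores, diff_author_scores, bandwidth=0.1):
    """Ratio of the score density under 'same author' vs 'different author'.

    The calibration scores are classifier outputs for texts whose
    authorship is known; the bandwidth of 0.1 is an arbitrary assumption.
    """
    f_same = gaussian_kde(same_author_scores, bandwidth)
    f_diff = gaussian_kde(diff_author_scores, bandwidth)
    # Floor the denominator to avoid division by zero far from all samples.
    return f_same(score) / max(f_diff(score), 1e-300)
```

A ratio above 1 supports the hypothesis that the suspect wrote the text, and a ratio below 1 supports the alternative. With few calibration texts the density estimates become unstable, which is the problem that cross-calibration is meant to address.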
Additionally, authorship attribution methods should be topic-robust, such that their attribution is not biased by the topic of a text. We introduced two new metrics to measure the topic-robustness of authorship attribution methods, ‘topic impact’ and ‘conversation impact’. Each metric can only be used on a specific type of corpus: the topic impact can be computed on topic-controlled corpora and the conversation impact on conversational corpora. To study whether both metrics measure the topic-robustness of authorship attribution methods for their respective corpus type, we computed the correlation between the results of the two metrics across a range of authorship attribution methods.
We found a correlation of 0.68. As a result, we cannot conclude that the conversation impact is a perfect metric to measure the topic-robustness of methods using conversational corpora, but it does give a good indication of large differences between methods.
Using this new metric, we found that our best-performing methods suffered from a high conversation impact and, as a result, might be more likely to have a low topic-robustness. If more of the infrequent words were masked, the conversation impact decreased, but so did the performance. A trade-off between high performance and high topic-robustness must be made when a model is chosen for real forensic case work. The conversation impact metric we proposed can help quantify these effects on forensically relevant corpora and therefore assist in making better choices.
Over the past century, various discrepancies between the expected and observed behaviour of galaxies and galaxy clusters have been found. Together, these are called the missing mass problem, and the best-known theory that tries to explain them states that there is additional undetectable mass in the form of dark matter. Modified Newtonian Dynamics (MOND) is another theory, which tries to explain these discrepancies without introducing dark matter. Instead, the theory changes Newton's law of gravity for low accelerations, smaller than Milgrom's constant a0. In this bachelor thesis we look at a discrete model to simulate MOND in galaxy clusters. To simplify calculations we do not look at full MOND, which holds for all accelerations, but at the so-called deep MOND, which only holds for accelerations much smaller than a0. We use two different versions of the Poisson equation, the standard version for Newtonian dynamics and a modified version for MOND, to compute the gravitational potential fields caused by a mass density distribution. This is done by using the discrete Fourier transform on a discretized region of space. The initial mass density distribution consists of a number of galaxies ranging between 50 and 1000 per cluster. Each galaxy is modelled as a sphere of constant density. To calculate the potential with MOND we use an iterative process starting from the potential we get using Newtonian dynamics. In this iterative process we made use of the Helmholtz decomposition. From the MOND potential we can compute an apparent mass distribution, which is the mass distribution that would result in the same MOND potential using Newtonian dynamics. We use this apparent mass distribution to predict at what distance from a galaxy most apparent dark matter is located. Lastly, we also look at the apparent mass distributions when the galaxy cluster is projected onto a 2D plane.
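For reference, the two field equations can be written out as follows. The abstract does not state which formulation of MOND the thesis uses, so the modified equation below is an assumption: it is the deep-MOND limit of the commonly used non-linear (Bekenstein-Milgrom) Poisson equation, with $\Phi_N$ and $\Phi_M$ denoting the Newtonian and MOND potentials and $\rho$ the mass density.

```latex
\begin{align}
  \nabla^2 \Phi_N &= 4\pi G \rho
    && \text{(Newtonian)} \\
  \nabla \cdot \left( \frac{|\nabla \Phi_M|}{a_0} \, \nabla \Phi_M \right) &= 4\pi G \rho
    && \text{(deep MOND, } |\nabla \Phi_M| \ll a_0 \text{)}
\end{align}
```

The first equation is linear and can be solved directly in Fourier space. The second is non-linear in $\nabla \Phi_M$, which is why an iterative scheme is needed.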
All of these calculations were made in Python on a discrete grid of 256 × 256 × 256 points.
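The Newtonian step of such a calculation can be sketched with NumPy. This is a minimal illustration rather than the thesis code: it assumes periodic boundary conditions (implied by the plain discrete Fourier transform), a cubic box and units with G = 1 by default, and the function name `solve_poisson_fft` is hypothetical. The iterative MOND step is not shown.

```python
import numpy as np

def solve_poisson_fft(rho, box_size, G=1.0):
    """Solve laplacian(phi) = 4*pi*G*rho on a periodic cubic grid.

    rho      : 3D array of shape (n, n, n) with the mass density
    box_size : physical side length of the cubic box
    Returns the potential phi with zero mean.
    """
    n = rho.shape[0]
    # Angular wavenumbers along one axis.
    k1d = 2.0 * np.pi * np.fft.fftfreq(n, d=box_size / n)
    kx, ky, kz = np.meshgrid(k1d, k1d, k1d, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0  # placeholder to avoid division by zero at k = 0

    rho_k = np.fft.fftn(rho)
    # In Fourier space the Laplacian becomes multiplication by -k^2.
    phi_k = -4.0 * np.pi * G * rho_k / k2
    phi_k[0, 0, 0] = 0.0  # fix the free constant: zero-mean potential
    return np.real(np.fft.ifftn(phi_k))
```

Setting the k = 0 mode to zero effectively subtracts the mean density, which is a consequence of the periodic assumption. An isolated cluster would need a different treatment of the boundaries, for example zero padding, and the abstract does not say which approach was taken.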
When looking at the total amount of apparent mass in concentric spheres around galaxies in our cluster, we saw that it increases in three distinct phases. The middle phase, where a linear increase was seen, had a slope of the same order of magnitude as the theoretical value. We found that the average apparent mass density is highest in the centre of galaxies and decreases very quickly with increasing distance from the galaxy. When the distance becomes large enough to reach neighbouring galaxies in the same cluster, the average apparent mass density stabilizes and becomes almost constant, although it still decreases slightly. For the projected mass density a similar pattern was found.
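A linear increase of the enclosed apparent mass is what one would expect around an isolated point mass in the deep-MOND regime. The abstract does not say which theoretical value the slope was compared with, so the following short derivation is only an assumption about it. Here $M$ is the true mass, $g_N$ the Newtonian acceleration, $g$ the MOND acceleration and $M_{\mathrm{app}}(<r)$ the apparent mass within radius $r$.

```latex
\begin{align}
  \rho_{\mathrm{app}} &= \frac{\nabla^2 \Phi_M}{4\pi G}
    && \text{(apparent density from the MOND potential)} \\
  \frac{g^2}{a_0} &= g_N = \frac{G M}{r^2}
    \quad\Longrightarrow\quad g(r) = \frac{\sqrt{G M a_0}}{r}
    && \text{(deep MOND, point mass)} \\
  M_{\mathrm{app}}(<r) &= \frac{g(r)\, r^2}{G} = \sqrt{\frac{M a_0}{G}}\; r
    && \text{(linear in } r \text{)}
\end{align}
```

Under these assumptions the slope of the linear phase would be $\sqrt{M a_0 / G}$, and the corresponding apparent density falls off as $1/r^2$, consistent with the rapid decrease away from the galaxy centre described above.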