B. van den Berg
Please Note
This page displays the records of the person named above and is not linked to a unique person identifier. This record may need to be merged to a profile.
2 records found
Efficient Query Estimation by Vector Averaging in Dual-Encoder Re-Ranking
Estimating Query Embeddings as Weighted Average of Document Embeddings and Lightweight Query Encoding
A central problem in information retrieval (IR) is passage ranking, where the task is to retrieve passages from a corpus and order them in decreasing relevance to an arbitrary search query.
Traditional lexical retrieval methods are susceptible to the vocabulary mismatch problem, where relevant passages are overlooked if they do not contain the exact query terms (e.g., synonyms), despite being semantically relevant.
A recent trend in IR is to address this issue by utilizing neural network models (dense rankers) which embed text sequences into dense vector representations that effectively capture their semantics through complex attention mechanisms.
For efficiency, dense rankers are often employed in a retrieve-and-re-rank setting, where a lexical ranker initially retrieves a subset of candidate passages, which are then reordered more accurately by a dense ranker.
In this thesis, we focus on the task of passage re-ranking.
We use a dual-encoder architecture as the re-ranker, with two independent encoders for queries and documents, which allows document embeddings to be pre-computed. Dense query-passage similarity is computed as the dot product of their representations.
We then combine the lexical and dense scores from the two stages using score interpolation.
We identify query encoding latency as a bottleneck and propose an Average Embedding (AvgEmb) estimator. This novel model can efficiently predict an accurate query representation, without requiring any attention-based encoding.
It operates solely by looking up embeddings and computing their weighted average.
Our model is distilled from a TCT-ColBERT teacher and achieves 98.6% of the teacher's accuracy while being 13.4x more efficient in query latency and 1.6x better in the full interpolated passage re-ranking pipeline on CPU.
Our code is publicly available at https://github.com/BovdBerg/fast-forward-indexes.
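The pipeline described in this abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the toy two-dimensional embeddings, the averaging weights, and the interpolation parameter `alpha` are all assumptions made for illustration. The sketch estimates a query embedding as a weighted average of pre-computed passage embeddings, scores each candidate by dot product, and interpolates that dense score with the lexical score.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def weighted_average(vectors: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Return sum_i(w_i * v_i) / sum_i(w_i), computed per dimension."""
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need at least one vector and exactly one weight per vector")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total for d in range(dim)]


def dot(a: Vector, b: Vector) -> float:
    """Dense similarity: dot product of two embeddings."""
    return sum(x * y for x, y in zip(a, b))


def rerank(
    lexical_scores: Dict[str, float],
    doc_embeddings: Dict[str, Vector],
    query_embedding: Vector,
    alpha: float,
) -> List[Tuple[str, float]]:
    """Interpolate scores as alpha * lexical + (1 - alpha) * dense, best first."""
    scored = [
        (doc_id, alpha * lex + (1 - alpha) * dot(query_embedding, doc_embeddings[doc_id]))
        for doc_id, lex in lexical_scores.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: three candidates returned by a lexical first stage.
doc_embeddings = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [1.0, 1.0]}
lexical_scores = {"d1": 3.0, "d2": 1.0, "d3": 2.0}

# Assumed weights; how the actual model obtains its weights is not shown here.
weights = [0.5, 0.25, 0.25]
query_estimate = weighted_average(list(doc_embeddings.values()), weights)

ranking = rerank(lexical_scores, doc_embeddings, query_estimate, alpha=0.5)
print(query_estimate)                  # [0.75, 0.5]
print([doc_id for doc_id, _ in ranking])  # ['d1', 'd3', 'd2']
```

No attention-based encoder runs at query time in this sketch: the query representation comes only from embedding lookups and an average, which is the property the abstract credits for the lower query latency.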
Big data applications are becoming increasingly popular, and the importance of testing them grows accordingly. BigFuzz, a recently proposed big data fuzzing tool, applies automated testing to such applications and shows very promising results. The aim of this research is to inspect how coverage guidance affects the performance of big data fuzzing. The current use of coverage is described first; an extension is then proposed and compared to the original. This work extends the BigFuzz tool with branch coverage guidance: the existing black-box fuzzer is replaced with a grey-box fuzzer, which is then extended to a boosted grey-box fuzzer. Both extensions enable branch discovery. Boosted grey-box fuzzing proves to be the most efficient branch exploration mechanism. Furthermore, both extensions outperform the original tool in error detection.
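To make the idea of branch coverage guidance concrete, here is a minimal grey-box fuzzing loop. It is a generic sketch, not BigFuzz code: `toy_target`, `mutate`, the branch names, and the comma-separated input format are hypothetical, and the boosted variant mentioned in the abstract is not shown. A mutated input is kept in the corpus only when it reaches a branch that no earlier input reached.

```python
import random
from typing import Callable, List, Set, Tuple


def toy_target(row: str) -> Set[str]:
    """Hypothetical program under test; returns the IDs of the branches it took."""
    branches = {"start"}
    fields = row.split(",")
    if len(fields) == 2:
        branches.add("two_fields")
        if fields[1].isdigit():
            branches.add("numeric_second_field")
            if int(fields[1]) > 100:
                branches.add("large_value")
    else:
        branches.add("malformed")
    return branches


def mutate(row: str, rng: random.Random) -> str:
    """Insert, replace, or delete one character at a random position."""
    alphabet = "0123456789,ab"
    pos = rng.randrange(len(row) + 1)
    op = rng.choice(["insert", "replace", "delete"])
    if op == "insert" or not row:
        return row[:pos] + rng.choice(alphabet) + row[pos:]
    pos = min(pos, len(row) - 1)
    if op == "replace":
        return row[:pos] + rng.choice(alphabet) + row[pos + 1:]
    return row[:pos] + row[pos + 1:]


def greybox_fuzz(
    target: Callable[[str], Set[str]],
    seed: str,
    iterations: int,
    rng: random.Random,
) -> Tuple[List[str], Set[str]]:
    """Mutate corpus entries and keep only inputs that cover new branches."""
    corpus = [seed]
    covered = set(target(seed))
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        new_branches = target(candidate) - covered
        if new_branches:  # coverage feedback: this input reached something new
            corpus.append(candidate)
            covered |= new_branches
    return corpus, covered


corpus, covered = greybox_fuzz(toy_target, "a,1", 500, random.Random(0))
print(len(corpus), sorted(covered))
```

A black-box fuzzer would skip the `new_branches` check and mutate only the original seed, so it could not build on inputs that had already reached deeper branches.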