[1]
Pooled Evaluation Over Query Variations: Users are as Diverse as Systems
Short Papers: Information Retrieval
/
Moffat, Alistair
/
Scholer, Falk
/
Thomas, Paul
/
Bailey, Peter
Proceedings of the 2015 ACM Conference on Information and Knowledge
Management
2015-10-19
p.1759-1762
© Copyright 2015 ACM
Summary: Evaluation of information retrieval systems with test collections makes use
of a suite of fixed resources: a document corpus; a set of topics; and
associated judgments of the relevance of each document to each topic. With
large modern collections, exhaustive judging is not feasible. Therefore an
approach called pooling is typically used where, for example, the documents to
be judged are determined by taking the union of the documents appearing in the
top positions of the answer lists generated by a range of systems.
Conventionally, pooling uses system variations to provide diverse documents to
be judged for a topic; different user queries are not considered. We explore
the ramifications of user query variability on pooling, and demonstrate that
conventional test collections do not cover this source of variation. The effect
of user query variation on the size of the judging pool is just as strong as
the effect of retrieval system variation. We conclude that user query variation
should be incorporated early in test collection construction, and cannot be
considered effectively post hoc.
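As a rough illustration of the pooling process described above, the following
Python sketch forms a judging pool as the union of top-ranked documents over
both systems and query variations. The run data, pool depth, and structure are
illustrative assumptions, not details taken from the paper.

    # Sketch: depth-k pooling over systems and query variations.
    # runs[system][query] is a ranked list of document ids (hypothetical).

    def build_pool(runs, depth=100):
        """Union of the top-`depth` documents over all (system, query) runs."""
        pool = set()
        for system_runs in runs.values():
            for ranked_docs in system_runs.values():
                pool.update(ranked_docs[:depth])
        return pool

    runs = {
        "bm25": {"q1a": ["d3", "d7", "d1"], "q1b": ["d9", "d3", "d4"]},
        "lm":   {"q1a": ["d7", "d2", "d3"], "q1b": ["d9", "d8", "d3"]},
    }
    print(sorted(build_pool(runs, depth=2)))  # the documents needing judgment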
[2]
The Benefits of Magnitude Estimation Relevance Assessments for Information
Retrieval Evaluation
Session 7A: Assessing
/
Turpin, Andrew
/
Scholer, Falk
/
Mizzaro, Stefano
/
Maddalena, Eddy
Proceedings of the 2015 Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2015-08-09
p.565-574
© Copyright 2015 ACM
Summary: Magnitude estimation is a psychophysical scaling technique for the
measurement of sensation, where observers assign numbers to stimuli in response
to their perceived intensity. We investigate the use of magnitude estimation
for judging the relevance of documents in the context of information retrieval
evaluation, carrying out a large-scale user study across 18 TREC topics and
collecting more than 50,000 magnitude estimation judgments. Our analysis shows
that on average magnitude estimation judgments are rank-aligned with ordinal
judgments made by expert relevance assessors. An advantage of magnitude
estimation is that users can choose their own scale for judgments, allowing
deeper investigations of user perceptions than when categorical scales are
used.
We explore the application of magnitude estimation for IR evaluation,
calibrating two gain-based effectiveness metrics, nDCG and ERR, directly from
user-reported perceptions of relevance. A comparison of TREC system
effectiveness rankings based on binary, ordinal, and magnitude estimation
relevance shows substantial variation; in particular, the top-ranked systems
under magnitude estimation differ markedly from those under ordinal judgments.
Analysis
of the magnitude estimation scores shows that this effect is due in part to
varying perceptions of relevance, in terms of how impactful relative
differences in document relevance are perceived to be. We further use magnitude
estimation to investigate gain profiles, comparing the currently assumed linear
and exponential approaches with actual user-reported relevance perceptions.
This indicates that the currently used exponential gain profiles in nDCG and
ERR are mismatched with an average user, but perhaps more importantly that
individual perceptions are highly variable. These results have direct
implications for IR evaluation, suggesting that current assumptions about a
single view of relevance being sufficient to represent a population of users
are unlikely to hold. Finally, we demonstrate that magnitude estimation
judgments can be reliably collected using crowdsourcing, and are competitive in
terms of assessor cost.
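To make the gain-profile comparison concrete, the sketch below computes nDCG
under a linear profile, the common exponential profile, and gains taken
directly from hypothetical median magnitude estimation scores. The grade
values and scores are illustrative assumptions, not the paper's calibration
data.

    import math

    def dcg(gains):
        # Discounted cumulative gain with the usual log2 position discount.
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    def ndcg(gains):
        ideal = dcg(sorted(gains, reverse=True))
        return dcg(gains) / ideal if ideal > 0 else 0.0

    grades = [2, 0, 1, 2]              # ordinal relevance down a ranked list
    linear = grades                    # linear gain profile: g = rel
    expo = [2**g - 1 for g in grades]  # exponential profile: g = 2^rel - 1
    me = [7.5, 1.0, 2.0, 9.0]          # hypothetical median magnitude scores

    for name, gains in [("linear", linear), ("exp", expo), ("ME", me)]:
        print(name, round(ndcg(gains), 3))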
[3]
User Variability and IR System Evaluation
Session 8A: Variability in test collections
/
Bailey, Peter
/
Moffat, Alistair
/
Scholer, Falk
/
Thomas, Paul
Proceedings of the 2015 Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2015-08-09
p.625-634
© Copyright 2015 ACM
Summary: Test collection design eliminates sources of user variability to make
statistical comparisons among information retrieval (IR) systems more
affordable. Does this choice unnecessarily limit generalizability of the
outcomes to real usage scenarios? We explore two aspects of user variability
with regard to evaluating the relative performance of IR systems, assessing
effectiveness in the context of a subset of topics from three TREC collections,
with the embodied information needs categorized against three levels of
increasing task complexity. First, we explore the impact of widely differing
queries that searchers construct for the same information need description. By
executing those queries, we demonstrate that query formulation is critical to
query effectiveness. The results also show that the range of scores
characterizing effectiveness for a single system arising from these queries is
comparable to, or greater than, the range of scores arising from variation among
systems using only a single query per topic. Second, our experiments reveal
that searchers display substantial individual variation in the numbers of
documents and queries they anticipate needing to issue, and there are
significant underlying differences in these numbers in line with increasing
task complexity levels. Our conclusion is that test collection design would be
improved by the use of multiple query variations per topic, and could be
further improved by the use of metrics which are sensitive to the expected
numbers of useful documents.
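A minimal sketch of the score-range comparison described above, using
hypothetical average precision values: the spread for a single system across
query variations, against the spread across systems for one fixed query.

    # scores[system][query_variation]: hypothetical AP values for one topic.
    scores = {
        "sysA": {"q1": 0.42, "q2": 0.18, "q3": 0.61},
        "sysB": {"q1": 0.39, "q2": 0.25, "q3": 0.48},
    }

    def spread(values):
        return max(values) - min(values)

    # Variation for one system across query formulations:
    for system, per_query in scores.items():
        print(system, "query-variation range:",
              round(spread(list(per_query.values())), 2))

    # Variation across systems when all use the same single query:
    single_query = [per_query["q1"] for per_query in scores.values()]
    print("system-variation range (q1):", round(spread(single_query), 2))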
[4]
Features of Disagreement Between Retrieval Effectiveness Measures
Short Papers
/
Jones, Timothy
/
Thomas, Paul
/
Scholer, Falk
/
Sanderson, Mark
Proceedings of the 2015 Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2015-08-09
p.847-850
© Copyright 2015 ACM
Summary: Many IR effectiveness measures are motivated by intuition, theory, or user
studies. In general, most effectiveness measures are well correlated with each
other. But what about the cases where they do not correlate? Which rankings cause
measures to disagree? Are these rankings predictable for particular pairs of
measures? In this work, we examine how and where metrics disagree, and identify
differences that should be considered when selecting metrics for use in
evaluating retrieval systems.
[5]
Different Rankers on Different Subcollections
Evaluation
/
Jones, Timothy
/
Scholer, Falk
/
Turpin, Andrew
/
Mizzaro, Stefano
/
Sanderson, Mark
Proceedings of ECIR 2015, the 2015 European Conference on Information
Retrieval
2015-03-29
p.203-208
Keywords: Collection Partitioning; Subcollections; Retrieval effectiveness
© Copyright 2015 Springer International Publishing Switzerland
Summary: Recent work has shown that when documents in a TREC ad hoc collection are
partitioned, different rankers will perform optimally on different partitions.
This result suggests that choosing different highly effective rankers for each
partition and merging the results should improve overall
effectiveness. Analyzing results from a novel oracle merge process, we
demonstrate that this is not the case: selecting the best performing ranker on
each subcollection is very unlikely to outperform just using a single best
ranker across the whole collection.
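The oracle merge can be pictured roughly as follows: per partition, take the
run of the ranker that performs best there, then interleave the runs by
retrieval score. The data structures and scores in this sketch are
hypothetical, not the paper's exact procedure.

    import heapq

    def oracle_merge(partition_runs, best_ranker):
        # partition_runs[partition][ranker] -> (doc_id, score) pairs,
        # sorted by decreasing score.
        chosen = [partition_runs[p][best_ranker[p]] for p in partition_runs]
        merged = heapq.merge(*chosen, key=lambda pair: -pair[1])
        return [doc for doc, _ in merged]

    partition_runs = {
        "p1": {"bm25": [("d1", 9.1), ("d2", 7.0)],
               "lm":   [("d2", 8.0), ("d1", 6.5)]},
        "p2": {"bm25": [("d5", 5.0)],
               "lm":   [("d6", 8.8), ("d5", 4.0)]},
    }
    best_ranker = {"p1": "bm25", "p2": "lm"}  # the oracle's per-partition pick
    print(oracle_merge(partition_runs, best_ranker))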
[6]
Judging Relevance Using Magnitude Estimation
Evaluation
/
Maddalena, Eddy
/
Mizzaro, Stefano
/
Scholer, Falk
/
Turpin, Andrew
Proceedings of ECIR 2015, the 2015 European Conference on Information
Retrieval
2015-03-29
p.215-220
© Copyright 2015 Springer International Publishing Switzerland
Summary: Magnitude estimation is a psychophysical scaling technique whereby numbers
are assigned to stimuli to reflect the ratios of their perceived intensity. We
report on a crowdsourcing experiment aimed at understanding if magnitude
estimation can be used to gather reliable relevance judgements for documents,
as is commonly required for test collection-based evaluation of information
retrieval systems. Results on a small dataset show that: (i) magnitude
estimation can produce relevance rankings that are consistent with more
classical ordinal judgements; (ii) both an upper-bounded and an unbounded scale
can be used effectively, though with some differences; (iii) the presentation
order of the documents being judged has a limited effect, if any; and (iv) only
a small number of repeat judgements are required to obtain reliable magnitude
estimation scores.
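Raw magnitude estimation scores are usually rescaled before being compared
across assessors; the sketch below applies a common geometric (log-mean)
normalisation, a standard psychophysics technique rather than necessarily the
authors' exact pipeline. The scores are hypothetical.

    import math

    def normalise(assessor_scores):
        # Rescale one assessor so their log-scores have zero mean; this
        # removes scale differences while preserving score ratios.
        logs = {doc: math.log(s) for doc, s in assessor_scores.items()}
        shift = sum(logs.values()) / len(logs)
        return {doc: round(math.exp(v - shift), 3) for doc, v in logs.items()}

    a1 = {"d1": 10.0, "d2": 50.0, "d3": 100.0}  # assessor on a wide scale
    a2 = {"d1": 1.0,  "d2": 5.0,  "d3": 10.0}   # same ratios, narrow scale
    print(normalise(a1))
    print(normalise(a2))  # identical output: only the ratios matter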
[7]
Predicting Re-finding Activity and Difficulty
User Behavior
/
Sadeghi, Sargol
/
Blanco, Roi
/
Mika, Peter
/
Sanderson, Mark
/
Scholer, Falk
/
Vallet, David
Proceedings of ECIR 2015, the 2015 European Conference on Information
Retrieval
2015-03-29
p.715-727
Keywords: Re-finding Identification; Difficulty Detection; Behavioral Features
© Copyright 2015 Springer International Publishing Switzerland
Summary: In this study, we address the problem of identifying if users are attempting
to re-find information and estimating the level of difficulty of the re-finding
task. We propose to consider the task information (e.g. multiple queries and
click information) rather than only queries. Our resultant prediction models
are shown to be significantly more accurate (by 2%) than the current state of
the art. While past research assumes that previous search history of the user
is available to the prediction model, we examine if re-finding detection is
possible without access to this information. Our evaluation indicates that such
detection is possible, but more challenging. We further describe the first
predictive model for detecting re-finding difficulty, showing it to be
significantly better than existing approaches for detecting general search
difficulty.
[8]
A Study of Querying Behaviour of Expert and Non-expert Users of Biomedical
Search Systems
Papers
/
Kharazmi, Sadegh
/
Karimi, Sarvnaz
/
Scholer, Falk
/
Clark, Adam
Proceedings of the 2014 Australasian Document Computing Symposium
2014-11-27
p.10-17
© Copyright 2014 ACM
Summary: The amount of biomedical literature, and the popularity of health-related
searches, are both growing rapidly. While most biomedical search systems offer
a range of advanced features, there is limited understanding of user
preferences, and how searcher expertise relates to the use and perception of
different search features in this domain. Through a controlled user study where
both medical experts and non-medical participants were asked to carry out
informational searches in a task-based environment, we seek to understand how
querying behaviour differs, both in the formulation of query strings, and in
the use of advanced querying features. Our results suggest that preferences
vary substantially between these groups of users, and that biomedical search
systems need to offer a range of tools in order to effectively support both
types of searchers.
[9]
Assessing the Cognitive Complexity of Information Needs
Posters
/
Moffat, Alistair
/
Bailey, Peter
/
Scholer, Falk
/
Thomas, Paul
Proceedings of the 2014 Australasian Document Computing Symposium
2014-11-27
p.97-100
© Copyright 2014 ACM
Summary: Information retrieval systems can be evaluated in laboratory settings
through user studies, and through the use of test collections and
effectiveness metrics. In a larger investigation we are exploring the extent to
which individual user differences and behaviours can affect the scores
generated by a retrieval system.
Our objective in the first phase of that project is to define information
need statements corresponding to a range of TREC search tasks, and to
categorise those statements in terms of task complexity. The goal is to reach a
position from which we can determine whether user actions while searching are
influenced by the way the information need is expressed, and by the fundamental
nature of the information need. We describe the process used to create
information need statements, and then report inter- and intra-assessor
agreements across four annotators. We conclude that assessing the relative
cognitive complexity of tasks is a complex activity, even for experienced
annotators.
[10]
Identifying Re-finding Difficulty from User Query Logs
Posters
/
Sadeghi, Sargol
/
Blanco, Roi
/
Mika, Peter
/
Sanderson, Mark
/
Scholer, Falk
/
Vallet, David
Proceedings of the 2014 Australasian Document Computing Symposium
2014-11-27
p.105-108
© Copyright 2014 ACM
Summary: This paper presents a first study of how consistently human assessors are
able to identify, from query logs, when searchers are facing difficulties
re-finding documents. Using 12 assessors, we investigate the effect of two
variables on assessor agreement: the assessment guideline detail, and assessor
experience. The results indicate significantly better agreement when detailed
guidelines are used. An upper agreement of 78.9% was achieved, which
is comparable to the levels of agreement in other information retrieval
contexts. The effects of two contextual factors, representative of system
performance and user effort, were studied. Significant differences between
agreement levels were found for both factors, suggesting that contextual
factors may play an important role in obtaining higher agreement levels. The
findings contribute to a better understanding of how to generate ground truth
data both in the re-finding and other labeling contexts, and have further
implications for building automatic re-finding difficulty prediction models.
[11]
Size and Source Matter: Understanding Inconsistencies in Test
Collection-Based Evaluation
IR Track Posters
/
Jones, Timothy
/
Turpin, Andrew
/
Mizzaro, Stefano
/
Scholer, Falk
/
Sanderson, Mark
Proceedings of the 2014 ACM Conference on Information and Knowledge
Management
2014-11-03
p.1843-1846
© Copyright 2014 ACM
Summary: Past work showed that significant inconsistencies between retrieval results
occurred on different test collections, even when one of the test collections
contained only a subset of the documents in the other. However, the
experimental methodologies in that paper made it hard to determine the cause of
the inconsistencies. Using a novel methodology that eliminates the problems
with uneven distribution of relevant documents, we confirm that observing a
statistically significant improvement between two IR systems can be strongly
influenced by the choice of documents in the test collection. We investigate
two possible causes of this problem of test collections. Our results show that
collection size and document source have a strong influence on the way that a
test collection will rank one retrieval system relative to another. This is of
particular interest when constructing test collections, as we show that using
different subsets of a collection produces differing evaluation results.
[12]
Modeling decision points in user search behavior
Short papers
/
Thomas, Paul
/
Moffat, Alistair
/
Bailey, Peter
/
Scholer, Falk
Proceedings of the 2014 Symposium on Information Interaction in Context
2014-08-26
p.239-242
© Copyright 2014 ACM
Summary: Understanding and modeling user behavior is critical to designing search
systems: it allows us to drive batch evaluations, predict how users would
respond to changes in systems or interfaces, and suggest ideas for improvement.
In this work we present a comprehensive model of the interactions between a
searcher and a search engine, and the decisions users make in these
interactions. The model is designed to deal only with observable phenomena.
Based on data from a user study, we are therefore able to make initial
estimates of the probabilities associated with various decision points.
More sophisticated estimates of these decision points could include
probabilities conditioned on some amount of search activity state. In
particular, we suggest that one important part of this state is the amount of
utility a user is seeking, and how much of this they have collected so far. We
propose an experiment to test this, and to elucidate other factors which
influence user actions.
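As a toy illustration of such a model, the following sketch simulates a
session in which, at each rank, the user decides whether to click and whether
to continue scanning. The probabilities are hypothetical placeholders, not the
estimates obtained from the paper's user study.

    import random

    def simulate_session(p_click=0.4, p_continue=0.7, max_rank=10, seed=1):
        random.seed(seed)
        actions = []
        for rank in range(1, max_rank + 1):
            if random.random() < p_click:        # decision: click this result
                actions.append(("click", rank))
            if random.random() >= p_continue:    # decision: stop scanning
                break
        return actions

    print(simulate_session())  # a short list of (action, rank) events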
[13]
Using score differences for search result diversification
Poster session (short papers)
/
Kharazmi, Sadegh
/
Sanderson, Mark
/
Scholer, Falk
/
Vallet, David
Proceedings of the 2014 Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2014-07-06
p.1143-1146
© Copyright 2014 ACM
Summary: We investigate the application of a light-weight approach to result list
clustering for the purposes of diversifying search results. We introduce a
novel post-retrieval approach, which is independent of external information or
even the full-text content of retrieved documents; only the retrieval score of
a document is used. Our experiments show that this novel approach is beneficial
to effectiveness, albeit only on certain baseline systems. The fact that the
method works indicates that the retrieval score is potentially exploitable for
search result diversification.
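One plausible way to picture a score-only clustering, sketched with
hypothetical retrieval scores: split the ranked list wherever consecutive
scores drop by more than a threshold. The gap heuristic here is an assumption
for illustration, not the paper's exact method.

    def score_gap_clusters(scored_docs, gap=1.0):
        # scored_docs: (doc_id, score) pairs in decreasing score order.
        clusters, current = [], [scored_docs[0]]
        for prev, cur in zip(scored_docs, scored_docs[1:]):
            if prev[1] - cur[1] > gap:   # large drop: start a new cluster
                clusters.append(current)
                current = []
            current.append(cur)
        clusters.append(current)
        return clusters

    ranked = [("d1", 9.3), ("d2", 9.1), ("d3", 6.2), ("d4", 6.0), ("d5", 2.5)]
    print(score_gap_clusters(ranked))  # three clusters, split at the big gaps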
[14]
TREC: topic engineering exercise
Poster session (short papers)
/
Culpepper, J Shane
/
Mizzaro, Stefano
/
Sanderson, Mark
/
Scholer, Falk
Proceedings of the 2014 Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2014-07-06
p.1147-1150
© Copyright 2014 ACM
Summary: In this work, we investigate approaches to engineer better topic sets in
information retrieval test collections. By recasting the TREC evaluation
exercise from one of building more effective systems to one of building
better topics, we present two possible approaches to quantify topic "goodness":
topic ease and topic set predictivity. A novel interpretation of a well-known
result and a twofold analysis of data from several TREC editions lead to a
result that has been neglected so far: both topic ease and topic set
predictivity have changed significantly across the years, sometimes in a
perhaps undesirable way.
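The two notions of topic "goodness" can be sketched from a hypothetical
system-by-topic score matrix, as below; the predictivity proxy used here (does
a single topic reproduce the overall system ordering?) is an illustrative
assumption, not the paper's definition.

    import statistics

    scores = {  # scores[system][topic], e.g. average precision (hypothetical)
        "sysA": {"t1": 0.70, "t2": 0.10, "t3": 0.40},
        "sysB": {"t1": 0.60, "t2": 0.20, "t3": 0.35},
    }
    topics = ["t1", "t2", "t3"]

    # Topic ease: the mean score that systems achieve on the topic.
    ease = {t: statistics.mean(s[t] for s in scores.values()) for t in topics}
    print("ease:", ease)

    # Predictivity proxy: does the topic's system ordering match the
    # ordering by mean score over all topics?
    mean_by_sys = {sys: statistics.mean(s.values()) for sys, s in scores.items()}
    overall = sorted(scores, key=mean_by_sys.get, reverse=True)
    for t in topics:
        by_topic = sorted(scores, key=lambda sys: scores[sys][t], reverse=True)
        print(t, "reproduces overall ranking:", by_topic == overall)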
[15]
Using eye tracking for evaluating web search interfaces
/
Al Maqbali, Hilal
/
Scholer, Falk
/
Thom, James A.
/
Wu, Mingfang
Proceedings of ADCS'13, Australasian Document Computing Symposium
2013-12-05
p.2-9
© Copyright 2013 ACM
Summary: Using eye tracking in the evaluation of web search interfaces can provide
rich information on users' information search behaviour, particularly in the
matter of user interaction with different informative components on a search
results screen. One of the main issues affecting the use of eye tracking in
research is the quality of captured eye movements (calibration). In this paper,
we therefore propose a method for determining the quality of calibration, since
the existing eye tracking system (Tobii Studio) does not provide any criteria
for this aspect. Another issue addressed in this paper is the adaptation of
gaze direction: we display a black screen for 3 seconds between screens, to
avoid the previous screen influencing the user's gaze direction on the
following screen. A further issue when employing eye tracking in the evaluation
of web search interfaces is the selection of an appropriate filter for the raw
gaze-point data. In our studies, we filtered this data by removing noise,
identifying gaze points that occur in Areas of Interest (AOIs), optimising gaze
data, and identifying viewed AOIs.
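The AOI identification step can be pictured as simple point-in-rectangle
filtering; the coordinates and AOI bounds in this sketch are hypothetical.

    def in_aoi(point, aoi):
        # aoi = (left, top, right, bottom) in screen pixels; y grows downward.
        x, y = point
        left, top, right, bottom = aoi
        return left <= x <= right and top <= y <= bottom

    gaze_points = [(120, 340), (800, 90), (150, 360), (40, 900)]
    snippet_aoi = (100, 300, 600, 420)   # e.g. the first result's snippet area
    viewed = [p for p in gaze_points if in_aoi(p, snippet_aoi)]
    print(viewed)  # the gaze points attributed to this AOI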
[16]
Choices in batch information retrieval evaluation
/
Scholer, Falk
/
Moffat, Alistair
/
Thomas, Paul
Proceedings of ADCS'13, Australasian Document Computing Symposium
2013-12-05
p.74-81
© Copyright 2013 ACM
Summary: Web search tools are used on a daily basis by billions of people. The
commercial providers of these services spend large amounts of money measuring
their own effectiveness and benchmarking against their competitors; nothing
less than their corporate survival is at stake. Techniques for offline or
"batch" evaluation of search quality have received considerable attention,
spanning ways of constructing relevance judgments; ways of using them to
generate numeric scores; and ways of inferring system "superiority" from sets
of such scores.
Our purpose in this paper is to consider these mechanisms as a chain of
inter-dependent activities, in order to explore some of the ramifications of
alternative components. By disaggregating the different activities, and asking
what the ultimate objective of the measurement process is, we provide new
insights into evaluation approaches, and are able to suggest new combinations
that might prove fruitful avenues for exploration. Our observations are
examined with reference to data collected from a user study covering 34 users
undertaking a total of six search tasks each, using two systems of markedly
different quality.
We hope to encourage broader awareness of the many factors that go into an
evaluation of search effectiveness, and of the implications of these choices,
and to prompt researchers to report all aspects of the evaluation process
carefully when describing their system performance experiments.
[17]
Augmenting web search surrogates with images
IR track: search engines
/
Capra, Robert
/
Arguello, Jaime
/
Scholer, Falk
Proceedings of the 2013 ACM Conference on Information and Knowledge
Management
2013-10-27
p.399-408
© Copyright 2013 ACM
Summary: While images are commonly used in search result presentation for vertical
domains such as shopping and news, web search result surrogates remain
primarily text-based. In this paper, we present results of two large-scale user
studies to examine the effects of augmenting text-based surrogates with images
extracted from the underlying webpage. We evaluate effectiveness and efficiency
at both the individual surrogate level and at the results page level.
Additionally, we investigate the influence of two factors: the goodness of the
image in terms of representing the underlying page content, and the diversity
of the results on a results page. Our results show that at the individual
surrogate level, good images provide only a small benefit in judgment accuracy
versus text-only surrogates, with a slight increase in judgment time. At the
results page level, surrogates with good images had similar effectiveness and
efficiency compared to the text-only condition. However, in situations where
the results page items had diverse senses, surrogates with images had higher
click precision versus text-only ones. Results of these studies show tradeoffs
in the use of images in web search surrogates, and highlight particular
situations where they can provide benefits.
[18]
Users versus models: what observation tells us about effectiveness metrics
IR track: evaluation
/
Moffat, Alistair
/
Thomas, Paul
/
Scholer, Falk
Proceedings of the 2013 ACM Conference on Information and Knowledge
Management
2013-10-27
p.659-668
© Copyright 2013 ACM
Summary: Retrieval system effectiveness can be measured in two quite different ways:
by monitoring the behavior of users and gathering data about the ease and
accuracy with which they accomplish certain specified information-seeking
tasks; or by using numeric effectiveness metrics to score system runs in
reference to a set of relevance judgments. In the second approach, the
effectiveness metric is chosen in the belief that user task performance, if it
were to be measured by the first approach, should be linked to the score
provided by the metric.
This work explores that link, by analyzing the assumptions and implications
of a number of effectiveness metrics, and exploring how these relate to
observable user behaviors. Data recorded as part of a user study included user
self-assessment of search task difficulty; gaze position; and click activity.
Our results show that user behavior is influenced by a blend of many factors,
including the extent to which relevant documents are encountered, the stage of
the search process, and task difficulty. These insights can be used to guide
development of batch effectiveness metrics.
[19]
The effect of threshold priming and need for cognition on relevance
calibration and assessment
Evaluation II
/
Scholer, Falk
/
Kelly, Diane
/
Wu, Wan-Ching
/
Lee, Hanseul S.
/
Webber, William
Proceedings of the 2013 Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2013-07-28
p.623-632
© Copyright 2013 ACM
Summary: Human assessments of document relevance are needed for the construction of
test collections, for ad-hoc evaluation, and for training text classifiers.
Showing documents to assessors in different orderings, however, may lead to
different assessment outcomes. We examine the effect that threshold priming
(exposure to documents of varying degrees of relevance) has on people's
calibration of
relevance. Participants judged the relevance of a prologue of documents
containing highly relevant, moderately relevant, or non-relevant documents,
followed by a common epilogue of documents of mixed relevance. We observe that
participants exposed to only non-relevant documents in the prologue assigned
significantly higher average relevance scores to prologue and epilogue
documents than participants exposed to moderately or highly relevant documents
in the prologue. We also examine how need for cognition, an individual
difference measure of the extent to which a person enjoys engaging in effortful
cognitive activity, impacts relevance assessments. High need for cognition
participants had a significantly higher level of agreement with expert
assessors than low need for cognition participants did. Our findings indicate
that assessors should be exposed to documents from multiple relevance levels
early in the judging process, in order to calibrate their relevance thresholds
in a balanced way, and that individual difference measures might be a useful
way to screen assessors.
[20]
Models and metrics: IR evaluation as a user process
/
Moffat, Alistair
/
Scholer, Falk
/
Thomas, Paul
Proceedings of ADCS'12, Australasian Document Computing Symposium
2012-12-05
p.47-54
© Copyright 2012 ACM
Summary: Retrieval system effectiveness can be measured in two quite different ways:
by monitoring the behavior of users and gathering data about the ease and
accuracy with which they accomplish certain specified information-seeking
tasks; or by using numeric effectiveness metrics to score system runs in
reference to a set of relevance judgments. The former has the benefit of
directly assessing the actual goal of the system, namely the user's ability to
complete a search task; whereas the latter approach has the benefit of being
quantitative and repeatable. Each given effectiveness metric is an attempt to
bridge the gap between these two evaluation approaches, since the implicit
belief supporting the use of any particular metric is that user task
performance should be correlated with the numeric score provided by the metric.
In this work we explore that linkage, considering a range of effectiveness
metrics, and the user search behavior that each of them implies. We then
examine more complex user models, as a guide to the development of new
effectiveness metrics. We conclude by summarizing an experiment that we believe
will help establish the strength of the linkage between models and metrics.
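As one concrete example of a metric embodying an explicit user model,
rank-biased precision (RBP) assumes a user who always views rank 1 and moves
from each result to the next with fixed persistence p. RBP is used here only
to illustrate the model-metric linkage; the relevance vector is hypothetical.

    def rbp(rels, p=0.8):
        # rels: per-rank relevance in [0, 1]; RBP = (1-p) * sum r_i * p^(i-1).
        return (1 - p) * sum(r * p**i for i, r in enumerate(rels))

    rels = [1, 0, 1, 1, 0]
    for p in (0.5, 0.8, 0.95):   # from an impatient to a persistent user
        print(p, round(rbp(rels, p), 3))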
[21]
Sentence length bias in TREC novelty track judgements
/
Bando, Lorena Leal
/
Scholer, Falk
/
Turpin, Andrew
Proceedings of ADCS'12, Australasian Document Computing Symposium
2012-12-05
p.55-61
© Copyright 2012 ACM
Summary: The Cranfield methodology for comparing document ranking systems has
recently also been applied to comparing sentence ranking methods, which are
used as pre-processors for summary generation. In particular, the TREC Novelty
track data has been used to assess whether one sentence ranking system is
better than another. This paper demonstrates that there is a strong bias in the
Novelty track data for relevant sentences to also be longer sentences. Thus,
systems that simply choose the longest sentences will often appear to perform
better in terms of identifying "relevant" sentences than systems that use other
methods. We demonstrate, by example, how this can lead to misleading
conclusions about the comparative effectiveness of sentence ranking systems. We
then demonstrate that if the Novelty track data is split into subcollections
based on sentence length, comparing systems on each of the subcollections leads
to conclusions that avoid the bias.
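A quick way to probe for this kind of bias, sketched on hypothetical stand-in
data: compare the lengths of relevant and non-relevant sentences, then score a
"longest sentence first" baseline.

    import statistics

    sentences = [  # (length_in_words, judged_relevant) -- hypothetical
        (34, True), (8, False), (29, True), (12, False), (31, True), (10, False),
    ]

    rel = [n for n, r in sentences if r]
    non = [n for n, r in sentences if not r]
    print("mean length, relevant:", round(statistics.mean(rel), 1),
          "non-relevant:", round(statistics.mean(non), 1))

    # Precision at 3 for a system that simply ranks longest sentences first:
    by_length = sorted(sentences, key=lambda s: s[0], reverse=True)
    print("P@3 (longest-first):", sum(r for _, r in by_length[:3]) / 3)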
[22]
Differences in effectiveness across sub-collections
Information retrieval short paper session
/
Sanderson, Mark
/
Turpin, Andrew
/
Zhang, Ying
/
Scholer, Falk
Proceedings of the 2012 ACM Conference on Information and Knowledge
Management
2012-10-29
p.1965-1969
© Copyright 2012 ACM
Summary: The relative performance of retrieval systems when evaluated on one part of
a test collection may bear little or no similarity to the relative performance
measured on a different part of the collection. In this paper we report the
results of a detailed study of the impact that different sub-collections have
on retrieval effectiveness, analyzing the effect over many collections, and
with different approaches to sub-dividing the collections. The effect is shown
to be substantial, impacting comparisons between retrieval runs that are
statistically significant. Some possible causes for the effect are
investigated, and the implications of this work are examined for test
collection design and for the strength of conclusions one can draw from
experimental results.
[23]
Efficient in-memory top-k document retrieval
Architectures 1
/
Culpepper, J. Shane
/
Petri, Matthias
/
Scholer, Falk
Proceedings of the 35th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2012-08-12
p.225-234
© Copyright 2012 ACM
Summary: For over forty years the dominant data structure for ranked document
retrieval has been the inverted index. Inverted indexes are effective for a
variety of document retrieval tasks, and particularly efficient for large data
collection scenarios that require disk access and storage. However, many
efficiency-bound search tasks can now easily be supported entirely in memory as
a result of recent hardware advances. In this paper we present a hybrid
algorithmic framework for in-memory bag-of-words ranked document retrieval
using a self-index derived from the FM-Index, wavelet tree, and the compressed
suffix tree data structures, and evaluate the various algorithmic trade-offs
for performing efficient queries entirely in-memory. We compare our approach
with two classic approaches to bag-of-words queries using inverted indexes,
term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing. We show
that our framework is competitive with state-of-the-art indexing structures,
and describe new capabilities provided by our algorithms that can be leveraged
by future systems to improve effectiveness and efficiency for a variety of
fundamental search operations.
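For contrast with the self-index framework, here is a minimal term-at-a-time
(TAAT) scorer over a toy inverted index; the postings and the additive scoring
are simplified assumptions, not the paper's implementation.

    from collections import defaultdict
    import heapq

    index = {  # term -> postings list of (doc_id, term_weight), hypothetical
        "ranked":    [(1, 0.4), (3, 0.9), (7, 0.2)],
        "retrieval": [(1, 0.7), (2, 0.3), (3, 0.5)],
    }

    def taat_topk(query_terms, k=2):
        accumulators = defaultdict(float)
        for term in query_terms:          # process one postings list at a time
            for doc, weight in index.get(term, []):
                accumulators[doc] += weight
        return heapq.nlargest(k, accumulators.items(), key=lambda kv: kv[1])

    print(taat_topk(["ranked", "retrieval"]))  # top-2 docs by summed weight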
[24]
Making Personal Retrieval Systems Comparable Using Self-Assigned Task
Attributes
Posters
/
Sadeghi, Seyedeh Sargol
/
Sanderson, Mark
/
Scholer, Falk
Proceedings of the Workshop on Human-Computer Interaction and Information
Retrieval
2011-10-20
p.40
Summary: Evaluating personal search systems is challenging due to the lack of common
and shareable test collections in the personal context. Documents and search
task requirements associated with this context are inherently personal and can
vary widely among users. These characteristics make it difficult to gather
documents and devise search tasks in order to build controllable test
environments. This consequently leads to slow progress in the development of
effective personal retrieval systems.
In this position paper, we propose an approach to classifying search tasks
based on their general attributes, which encourages users to classify the tasks
themselves, as well as use tasks produced by others. To this end, we introduce
a new model for the extraction of general task attributes which we call the
Push-Pull Model. This approach can help to create comparable test environments
across the tasks of different users. Furthermore, we highlight some of the key
challenges for further investigation in this area.
[25]
Quantifying test collection quality based on the consistency of relevance
judgements
Test collections
/
Scholer, Falk
/
Turpin, Andrew
/
Sanderson, Mark
Proceedings of the 34th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2011-07-25
p.1063-1072
© Copyright 2011 ACM
Summary: Relevance assessments are a key component for test collection-based
evaluation of information retrieval systems. This paper reports on a feature of
such collections that is used as a form of ground truth data to allow analysis
of human assessment error. A wide range of test collections are retrospectively
examined to determine how accurately assessors judge the relevance of
documents. Our results demonstrate a high level of inconsistency across the
collections studied. The level of irregularity is shown to vary across topics,
with some showing a very high level of assessment error. We investigate
possible influences on the error, and demonstrate that inconsistency in judging
increases with time. While the level of detail in a topic specification does
not appear to influence the errors that assessors make, judgements are
significantly affected by the decisions made on previously seen similar
documents. Assessors also display assessment inertia. Alternative approaches
to generating relevance judgements appear to reduce errors. A further
investigation of the way that retrieval systems are ranked using sets of
relevance judgements produced early and late in the judgement process reveals a
consistent influence measured across the majority of examined test collections.
We conclude that there is a clear value in examining, even inserting, ground
truth data in test collections, and propose ways to help minimise the sources
of inconsistency when creating future test collections.