HCI Bibliography : Search Results
Database updated: 2016-05-10. Searches since 2006-12-01: 32,542,956.
Hosted by ACM SIGCHI
The HCI Bibliography was moved to a new server on 2015-05-12 and again on 2016-01-05, substantially degrading the environment for making updates.
There are no plans to add to the database.
Please send questions or comments to director@hcibib.org.
Query: Scholer_F* Results: 41 Sorted by: Date
Records: 1 to 25 of 41
[1] Pooled Evaluation Over Query Variations: Users are as Diverse as Systems Short Papers: Information Retrieval / Moffat, Alistair / Scholer, Falk / Thomas, Paul / Bailey, Peter Proceedings of the 2015 ACM Conference on Information and Knowledge Management 2015-10-19 p.1759-1762
ACM Digital Library Link
Summary: Evaluation of information retrieval systems with test collections makes use of a suite of fixed resources: a document corpus; a set of topics; and associated judgments of the relevance of each document to each topic. With large modern collections, exhaustive judging is not feasible. Therefore an approach called pooling is typically used where, for example, the documents to be judged can be determined by taking the union of all documents returned in the top positions of the answer lists returned by a range of systems. Conventionally, pooling uses system variations to provide diverse documents to be judged for a topic; different user queries are not considered. We explore the ramifications of user query variability on pooling, and demonstrate that conventional test collections do not cover this source of variation. The effect of user query variation on the size of the judging pool is just as strong as the effect of retrieval system variation. We conclude that user query variation should be incorporated early in test collection construction, and cannot be considered effectively post hoc.
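    As an illustration of the pooling approach described in this abstract, here is a minimal depth-k sketch in Python (function and data names are ours, not the paper's):

        # Depth-k pooling: the judging pool for a topic is the union of the
        # top-k documents from each contributing ranked answer list.
        def depth_k_pool(runs, k=10):
            """runs: ranked document-id lists, one per system run (or,
            following the paper, one per user query variation)."""
            pool = set()
            for ranking in runs:
                pool.update(ranking[:k])
            return pool

        # Two system runs plus one query variation for the same topic.
        runs = [["d1", "d2", "d3"], ["d2", "d4", "d5"], ["d6", "d1", "d7"]]
        print(sorted(depth_k_pool(runs, k=2)))  # ['d1', 'd2', 'd4', 'd6']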

[2] The Benefits of Magnitude Estimation Relevance Assessments for Information Retrieval Evaluation Session 7A: Assessing / Turpin, Andrew / Scholer, Falk / Mizzaro, Stefano / Maddalena, Eddy Proceedings of the 2015 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2015-08-09 p.565-574
ACM Digital Library Link
Summary: Magnitude estimation is a psychophysical scaling technique for the measurement of sensation, where observers assign numbers to stimuli in response to their perceived intensity. We investigate the use of magnitude estimation for judging the relevance of documents in the context of information retrieval evaluation, carrying out a large-scale user study across 18 TREC topics and collecting more than 50,000 magnitude estimation judgments. Our analysis shows that on average magnitude estimation judgments are rank-aligned with ordinal judgments made by expert relevance assessors. An advantage of magnitude estimation is that users can choose their own scale for judgments, allowing deeper investigations of user perceptions than when categorical scales are used.
    We explore the application of magnitude estimation for IR evaluation, calibrating two gain-based effectiveness metrics, nDCG and ERR, directly from user-reported perceptions of relevance. A comparison of TREC system effectiveness rankings based on binary, ordinal, and magnitude estimation relevance shows substantial variation; in particular, the top systems ranked using magnitude estimation and ordinal judgments differ substantially. Analysis of the magnitude estimation scores shows that this effect is due in part to varying perceptions of relevance, in terms of how impactful relative differences in document relevance are perceived to be. We further use magnitude estimation to investigate gain profiles, comparing the currently assumed linear and exponential approaches with actual user-reported relevance perceptions. This indicates that the currently used exponential gain profiles in nDCG and ERR are mismatched with an average user, but perhaps more importantly that individual perceptions are highly variable. These results have direct implications for IR evaluation, suggesting that current assumptions about a single view of relevance being sufficient to represent a population of users are unlikely to hold. Finally, we demonstrate that magnitude estimation judgments can be reliably collected using crowdsourcing, and are competitive in terms of assessor cost.
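    To make the gain-profile discussion concrete, here is a small Python sketch of nDCG under the conventional linear and exponential profiles; the user-calibrated gains the paper derives would simply replace the gain function (the code is illustrative, not from the paper):

        import math

        def dcg(rels, gain):
            # rels: graded relevance labels in ranked order; rank i is
            # discounted by log2(i + 1), the standard nDCG discount.
            return sum(gain(r) / math.log2(i + 2) for i, r in enumerate(rels))

        def ndcg(rels, gain):
            # Assumes at least one relevant document in the ranking.
            return dcg(rels, gain) / dcg(sorted(rels, reverse=True), gain)

        ranking = [2, 3, 0, 1]                # labels for ranks 1..4
        linear = lambda r: r                  # linear gain profile
        exponential = lambda r: 2**r - 1      # conventional exponential profile
        print(ndcg(ranking, linear), ndcg(ranking, exponential))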

[3] User Variability and IR System Evaluation Session 8A: Variability in test collections / Bailey, Peter / Moffat, Alistair / Scholer, Falk / Thomas, Paul Proceedings of the 2015 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2015-08-09 p.625-634
ACM Digital Library Link
Summary: Test collection design eliminates sources of user variability to make statistical comparisons among information retrieval (IR) systems more affordable. Does this choice unnecessarily limit generalizability of the outcomes to real usage scenarios? We explore two aspects of user variability with regard to evaluating the relative performance of IR systems, assessing effectiveness in the context of a subset of topics from three TREC collections, with the embodied information needs categorized against three levels of increasing task complexity. First, we explore the impact of widely differing queries that searchers construct for the same information need description. By executing those queries, we demonstrate that query formulation is critical to query effectiveness. The results also show that the range of scores characterizing effectiveness for a single system arising from these queries is comparable to, or greater than, the range of scores arising from variation among systems using only a single query per topic. Second, our experiments reveal that searchers display substantial individual variation in the numbers of documents and queries they anticipate needing to issue, and there are underlying significant differences in these numbers in line with increasing task complexity levels. Our conclusion is that test collection design would be improved by the use of multiple query variations per topic, and could be further improved by the use of metrics which are sensitive to the expected numbers of useful documents.

[4] Features of Disagreement Between Retrieval Effectiveness Measures Short Papers / Jones, Timothy / Thomas, Paul / Scholer, Falk / Sanderson, Mark Proceedings of the 2015 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2015-08-09 p.847-850
ACM Digital Library Link
Summary: Many IR effectiveness measures are motivated from intuition, theory, or user studies. In general, most effectiveness measures are well correlated with each other. But, what about where they don't correlate? Which rankings cause measures to disagree? Are these rankings predictable for particular pairs of measures? In this work, we examine how and where metrics disagree, and identify differences that should be considered when selecting metrics for use in evaluating retrieval systems.
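    One standard way to quantify such disagreement is the rank correlation between the orderings two metrics induce over the same set of rankings; a minimal Kendall's tau sketch in Python, over invented scores:

        from itertools import combinations

        def kendall_tau(xs, ys):
            """Tau-a between two paired score lists (ties simply dilute
            the correlation rather than being handled specially)."""
            concordant = discordant = 0
            for i, j in combinations(range(len(xs)), 2):
                s = (xs[i] - xs[j]) * (ys[i] - ys[j])
                if s > 0:
                    concordant += 1
                elif s < 0:
                    discordant += 1
            return (concordant - discordant) / (len(xs) * (len(xs) - 1) / 2)

        # Hypothetical scores for five rankings under two metrics.
        metric_a = [0.90, 0.75, 0.60, 0.40, 0.20]
        metric_b = [0.85, 0.50, 0.70, 0.45, 0.10]
        print(kendall_tau(metric_a, metric_b))  # 0.8: one discordant pair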

[5] Different Rankers on Different Subcollections Evaluation / Jones, Timothy / Scholer, Falk / Turpin, Andrew / Mizzaro, Stefano / Sanderson, Mark Proceedings of ECIR 2015, the 2015 European Conference on Information Retrieval 2015-03-29 p.203-208
Keywords: Collection Partitioning; Subcollections; Retrieval effectiveness
Link to Digital Content at Springer
Summary: Recent work has shown that when documents in a TREC ad hoc collection are partitioned, different rankers will perform optimally on different partitions. This result suggests that choosing different highly effective rankers for each partition and merging the results should be able to improve overall effectiveness. Analyzing results from a novel oracle merge process, we demonstrate that this is not the case: selecting the best performing ranker on each subcollection is very unlikely to outperform just using a single best ranker across the whole collection.
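    A rough Python sketch of the selection step behind such an oracle, assuming per-subcollection effectiveness scores are already available; note that the paper's oracle merges actual result lists, which is where the apparent gains evaporate (names and numbers here are invented):

        def oracle_selection_score(scores_by_ranker):
            """scores_by_ranker: {ranker: [score on subcollection 1, 2, ...]}.
            For each subcollection, take the best ranker's score there;
            return the mean, an upper bound on per-partition selection."""
            parts = len(next(iter(scores_by_ranker.values())))
            best = [max(s[i] for s in scores_by_ranker.values()) for i in range(parts)]
            return sum(best) / parts

        scores = {"rankerA": [0.31, 0.22], "rankerB": [0.28, 0.25]}
        single_best = max(sum(s) / len(s) for s in scores.values())
        print(oracle_selection_score(scores), single_best)  # 0.28 vs 0.265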

[6] Judging Relevance Using Magnitude Estimation Evaluation / Maddalena, Eddy / Mizzaro, Stefano / Scholer, Falk / Turpin, Andrew Proceedings of ECIR 2015, the 2015 European Conference on Information Retrieval 2015-03-29 p.215-220
Link to Digital Content at Springer
Summary: Magnitude estimation is a psychophysical scaling technique whereby numbers are assigned to stimuli to reflect the ratios of their perceived intensity. We report on a crowdsourcing experiment aimed at understanding if magnitude estimation can be used to gather reliable relevance judgements for documents, as is commonly required for test collection-based evaluation of information retrieval systems. Results on a small dataset show that: (i) magnitude estimation can produce relevance rankings that are consistent with more classical ordinal judgements; (ii) both an upper-bounded and an unbounded scale can be used effectively, though with some differences; (iii) the presentation order of the documents being judged has a limited effect, if any; and (iv) only a small number of repeat judgements are required to obtain reliable magnitude estimation scores.
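    Magnitude estimates from different judges, who are free to choose their own number ranges, are conventionally brought onto a common scale by normalising each judge's scores by their geometric mean; a minimal Python sketch of that standard psychophysics step (the data are invented):

        import math

        def normalize_judge(scores):
            """Divide one judge's (strictly positive) magnitude estimates
            by their geometric mean, making judges comparable."""
            gmean = math.exp(sum(math.log(s) for s in scores) / len(scores))
            return [s / gmean for s in scores]

        judge_a = [10, 50, 100]   # one judge's scale for three documents
        judge_b = [1, 4, 9]       # another judge's much smaller scale
        print(normalize_judge(judge_a))  # relative magnitudes now comparable
        print(normalize_judge(judge_b))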

[7] Predicting Re-finding Activity and Difficulty User Behavior / Sadeghi, Sargol / Blanco, Roi / Mika, Peter / Sanderson, Mark / Scholer, Falk / Vallet, David Proceedings of ECIR 2015, the 2015 European Conference on Information Retrieval 2015-03-29 p.715-727
Keywords: Re-finding Identification; Difficulty Detection; Behavioral Features
Link to Digital Content at Springer
Summary: In this study, we address the problem of identifying if users are attempting to re-find information and estimating the level of difficulty of the re-finding task. We propose to consider the task information (e.g. multiple queries and click information) rather than only queries. Our resultant prediction models are shown to be significantly more accurate (by 2%) than the current state of the art. While past research assumes that previous search history of the user is available to the prediction model, we examine if re-finding detection is possible without access to this information. Our evaluation indicates that such detection is possible, but more challenging. We further describe the first predictive model in detecting re-finding difficulty, showing it to be significantly better than existing approaches for detecting general search difficulty.

[8] A Study of Querying Behaviour of Expert and Non-expert Users of Biomedical Search Systems Papers / Kharazmi, Sadegh / Karimi, Sarvnaz / Scholer, Falk / Clark, Adam Proceedings of the 2014 Australasian Document Computing Symposium 2014-11-27 p.10-17
ACM Digital Library Link
Summary: The amount of biomedical literature, and the popularity of health-related searches, are both growing rapidly. While most biomedical search systems offer a range of advanced features, there is limited understanding of user preferences, and how searcher expertise relates to the use and perception of different search features in this domain. Through a controlled user study where both medical experts and non-medical participants were asked to carry out informational searches in a task-based environment, we seek to understand how querying behaviour differs, both in the formulation of query strings, and in the use of advanced querying features. Our results suggest that preferences vary substantially between these groups of users, and that biomedical search systems need to offer a range of tools in order to effectively support both types of searchers.

[9] Assessing the Cognitive Complexity of Information Needs Posters / Moffat, Alistair / Bailey, Peter / Scholer, Falk / Thomas, Paul Proceedings of the 2014 Australasian Document Computing Symposium 2014-11-27 p.97-100
ACM Digital Library Link
Summary: Information retrieval systems can be evaluated in laboratory settings through the use of user studies, and through the use of test collections and effectiveness metrics. In a larger investigation we are exploring the extent to which individual user differences and behaviours can affect the scores generated by a retrieval system.
    Our objective in the first phase of that project is to define information need statements corresponding to a range of TREC search tasks, and to categorise those statements in terms of task complexity. The goal is to reach a position from which we can determine whether user actions while searching are influenced by the way the information need is expressed, and by the fundamental nature of the information need. We describe the process used to create information need statements, and then report inter- and intra-assessor agreements across four annotators. We conclude that assessing the relative cognitive complexity of tasks is a complex activity, even for experienced annotators.
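    Agreement figures like these are usually chance-corrected; a minimal Cohen's kappa sketch in Python for two annotators (the complexity labels are invented for illustration):

        from collections import Counter

        def cohens_kappa(a, b):
            """Chance-corrected agreement between two annotators' labels."""
            n = len(a)
            observed = sum(x == y for x, y in zip(a, b)) / n
            ca, cb = Counter(a), Counter(b)
            expected = sum(ca[l] * cb[l] for l in ca.keys() | cb.keys()) / (n * n)
            return (observed - expected) / (1 - expected)

        annotator_1 = ["remember", "understand", "analyse", "analyse"]
        annotator_2 = ["remember", "analyse", "analyse", "analyse"]
        print(cohens_kappa(annotator_1, annotator_2))  # about 0.56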

[10] Identifying Re-finding Difficulty from User Query Logs Posters / Sadeghi, Sargol / Blanco, Roi / Mika, Peter / Sanderson, Mark / Scholer, Falk / Vallet, David Proceedings of the 2014 Australasian Document Computing Symposium 2014-11-27 p.105-108
ACM Digital Library Link
Summary: This paper presents a first study of how consistently human assessors are able to identify, from query logs, when searchers are facing difficulties re-finding documents. Using 12 assessors, we investigate the effect of two variables on assessor agreement: the assessment guideline detail, and assessor experience. The results indicate statistically significant better agreement when using detailed guidelines. An upper agreement of 78.9% was achieved, which is comparable to the levels of agreement in other information retrieval contexts. The effects of two contextual factors, representative of system performance and user effort, were studied. Significant differences between agreement levels were found for both factors, suggesting that contextual factors may play an important role in obtaining higher agreement levels. The findings contribute to a better understanding of how to generate ground truth data both in the re-finding and other labeling contexts, and have further implications for building automatic re-finding difficulty prediction models.

[11] Size and Source Matter: Understanding Inconsistencies in Test Collection-Based Evaluation IR Track Posters / Jones, Timothy / Turpin, Andrew / Mizzaro, Stefano / Scholer, Falk / Sanderson, Mark Proceedings of the 2014 ACM Conference on Information and Knowledge Management 2014-11-03 p.1843-1846
ACM Digital Library Link
Summary: Past work showed that significant inconsistencies between retrieval results occurred on different test collections, even when one of the test collections contained only a subset of the documents in the other. However, the experimental methodologies in that paper made it hard to determine the cause of the inconsistencies. Using a novel methodology that eliminates the problems with uneven distribution of relevant documents, we confirm that observing a statistically significant improvement between two IR systems can be strongly influenced by the choice of documents in the test collection. We investigate two possible causes of this problem in test collections. Our results show that collection size and document source have a strong influence on the way that a test collection will rank one retrieval system relative to another. This is of particular interest when constructing test collections, as we show that using different subsets of a collection produces differing evaluation results.

[12] Modeling decision points in user search behavior Short papers / Thomas, Paul / Moffat, Alistair / Bailey, Peter / Scholer, Falk Proceedings of the 2014 Symposium on Information Interaction in Context 2014-08-26 p.239-242
ACM Digital Library Link
Summary: Understanding and modeling user behavior is critical to designing search systems: it allows us to drive batch evaluations, predict how users would respond to changes in systems or interfaces, and suggest ideas for improvement. In this work we present a comprehensive model of the interactions between a searcher and a search engine, and the decisions users make in these interactions. The model is designed to deal only with observable phenomena. Based on data from a user study, we are therefore able to make initial estimates of the probabilities associated with various decision points.
    More sophisticated estimates of these decision points could include probabilities conditioned on some amount of search activity state. In particular, we suggest that one important part of this state is the amount of utility a user is seeking, and how much of this they have collected so far. We propose an experiment to test this, and to elucidate other factors which influence user actions.
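    As a sketch of the kind of estimate involved, the maximum-likelihood probability of each next action at one observable decision point can be computed directly from logged action sequences (the event names are invented):

        from collections import Counter

        def decision_probabilities(sessions, state):
            """What users do next when at `state`, estimated from
            per-session action sequences by simple counting."""
            nxt = Counter()
            for actions in sessions:
                for cur, following in zip(actions, actions[1:]):
                    if cur == state:
                        nxt[following] += 1
            total = sum(nxt.values())
            return {a: c / total for a, c in nxt.items()}

        logs = [["read_snippet", "click", "read_snippet", "stop"],
                ["read_snippet", "read_snippet", "click", "stop"]]
        print(decision_probabilities(logs, "read_snippet"))
        # {'click': 0.5, 'stop': 0.25, 'read_snippet': 0.25}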

[13] Using score differences for search result diversification Poster session (short papers) / Kharazmi, Sadegh / Sanderson, Mark / Scholer, Falk / Vallet, David Proceedings of the 2014 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2014-07-06 p.1143-1146
ACM Digital Library Link
Summary: We investigate the application of a light-weight approach to result list clustering for the purposes of diversifying search results. We introduce a novel post-retrieval approach, which is independent of external information or even the full-text content of retrieved documents; only the retrieval score of a document is used. Our experiments show that this novel approach is beneficial to effectiveness, albeit only on certain baseline systems. The fact that the method works indicates that the retrieval score is potentially exploitable for diversification.
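    Since the abstract only outlines the approach, the following Python sketch is speculative: it groups a ranked list wherever consecutive retrieval scores drop sharply, one plausible reading of score-based clustering (the threshold and names are our assumptions, not the paper's method):

        def cluster_by_score_gaps(scores, gap=0.1):
            """Split rank positions into groups at points where the score
            drops by more than `gap` from the previous document."""
            clusters, current = [], [0]
            for i in range(1, len(scores)):
                if scores[i - 1] - scores[i] > gap:
                    clusters.append(current)
                    current = []
                current.append(i)
            clusters.append(current)
            return clusters

        print(cluster_by_score_gaps([0.95, 0.93, 0.70, 0.68, 0.40]))
        # [[0, 1], [2, 3], [4]] -- candidate groups for diversification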

[14] TREC: topic engineering exercise Poster session (short papers) / Culpepper, J Shane / Mizzaro, Stefano / Sanderson, Mark / Scholer, Falk Proceedings of the 2014 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2014-07-06 p.1147-1150
ACM Digital Library Link
Summary: In this work, we investigate approaches to engineer better topic sets in information retrieval test collections. By recasting the TREC evaluation exercise from one of building more effective systems to an exercise in building better topics, we present two possible approaches to quantify topic "goodness": topic ease and topic set predictivity. A novel interpretation of a well-known result and a twofold analysis of data from several TREC editions lead to a result that has been neglected so far: both topic ease and topic set predictivity have changed significantly across the years, sometimes in a perhaps undesirable way.

[15] Using eye tracking for evaluating web search interfaces / Al Maqbali, Hilal / Scholer, Falk / Thom, James A. / Wu, Mingfang Proceedings of ADCS'13, Australasian Document Computing Symposium 2013-12-05 p.2-9
ACM Digital Library Link
Summary: Using eye tracking in the evaluation of web search interfaces can provide rich information on users' information search behaviour, particularly in the matter of user interaction with different informative components on a search results screen. One of the main issues affecting the use of eye tracking in research is the quality of captured eye movements (calibration); therefore, in this paper, we propose a method that allows us to determine the quality of calibration, since the existing eye tracking system (Tobii Studio) does not provide any criteria for this aspect. Another issue addressed in this paper is the adaptation of gaze direction. We display a black screen for 3 seconds between screens to avoid the effect of the previous screen on the user's gaze direction on the following screen. A further issue when employing eye tracking in the evaluation of web search interfaces is the selection of the appropriate filter for the raw gaze-points data. In our studies, we filtered this data by removing noise, identifying gaze points that occur in Areas of Interest (AOIs), optimising gaze data and identifying viewed AOIs.

[16] Choices in batch information retrieval evaluation / Scholer, Falk / Moffat, Alistair / Thomas, Paul Proceedings of ADCS'13, Australasian Document Computing Symposium 2013-12-05 p.74-81
ACM Digital Library Link
Summary: Web search tools are used on a daily basis by billions of people. The commercial providers of these services spend large amounts of money measuring their own effectiveness and benchmarking against their competitors; nothing less than their corporate survival is at stake. Techniques for offline or "batch" evaluation of search quality have received considerable attention, spanning ways of constructing relevance judgments; ways of using them to generate numeric scores; and ways of inferring system "superiority" from sets of such scores.
    Our purpose in this paper is to consider these mechanisms as a chain of inter-dependent activities, in order to explore some of the ramifications of alternative components. By disaggregating the different activities, and asking what the ultimate objective of the measurement process is, we provide new insights into evaluation approaches, and are able to suggest new combinations that might prove fruitful avenues for exploration. Our observations are examined with reference to data collected from a user study covering 34 users undertaking a total of six search tasks each, using two systems of markedly different quality.
    We hope to encourage broader awareness of the many factors that go into an evaluation of search effectiveness, and of the implications of these choices, and encourage researchers to carefully report all aspects of the evaluation process when describing their system performance experiments.

[17] Augmenting web search surrogates with images IR track: search engines / Capra, Robert / Arguello, Jaime / Scholer, Falk Proceedings of the 2013 ACM Conference on Information and Knowledge Management 2013-10-27 p.399-408
ACM Digital Library Link
Summary: While images are commonly used in search result presentation for vertical domains such as shopping and news, web search result surrogates remain primarily text-based. In this paper, we present results of two large-scale user studies to examine the effects of augmenting text-based surrogates with images extracted from the underlying webpage. We evaluate effectiveness and efficiency at both the individual surrogate level and at the results page level. Additionally, we investigate the influence of two factors: the goodness of the image in terms of representing the underlying page content, and the diversity of the results on a results page. Our results show that at the individual surrogate level, good images provide only a small benefit in judgment accuracy versus text-only surrogates, with a slight increase in judgment time. At the results page level, surrogates with good images had similar effectiveness and efficiency compared to the text-only condition. However, in situations where the results page items had diverse senses, surrogates with images had higher click precision versus text-only ones. Results of these studies show tradeoffs in the use of images in web search surrogates, and highlight particular situations where they can provide benefits.

[18] Users versus models: what observation tells us about effectiveness metrics IR track: evaluation / Moffat, Alistair / Thomas, Paul / Scholer, Falk Proceedings of the 2013 ACM Conference on Information and Knowledge Management 2013-10-27 p.659-668
ACM Digital Library Link
Summary: Retrieval system effectiveness can be measured in two quite different ways: by monitoring the behavior of users and gathering data about the ease and accuracy with which they accomplish certain specified information-seeking tasks; or by using numeric effectiveness metrics to score system runs in reference to a set of relevance judgments. In the second approach, the effectiveness metric is chosen in the belief that user task performance, if it were to be measured by the first approach, should be linked to the score provided by the metric.
    This work explores that link, by analyzing the assumptions and implications of a number of effectiveness metrics, and exploring how these relate to observable user behaviors. Data recorded as part of a user study included user self-assessment of search task difficulty; gaze position; and click activity. Our results show that user behavior is influenced by a blend of many factors, including the extent to which relevant documents are encountered, the stage of the search process, and task difficulty. These insights can be used to guide development of batch effectiveness metrics.

[19] The effect of threshold priming and need for cognition on relevance calibration and assessment Evaluation II / Scholer, Falk / Kelly, Diane / Wu, Wan-Ching / Lee, Hanseul S. / Webber, William Proceedings of the 2013 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2013-07-28 p.623-632
ACM Digital Library Link
Summary: Human assessments of document relevance are needed for the construction of test collections, for ad-hoc evaluation, and for training text classifiers. Showing documents to assessors in different orderings, however, may lead to different assessment outcomes. We examine the effect that threshold priming, seeing varying degrees of relevant documents, has on people's calibration of relevance. Participants judged the relevance of a prologue of documents containing highly relevant, moderately relevant, or non-relevant documents, followed by a common epilogue of documents of mixed relevance. We observe that participants exposed to only non-relevant documents in the prologue assigned significantly higher average relevance scores to prologue and epilogue documents than participants exposed to moderately or highly relevant documents in the prologue. We also examine how need for cognition, an individual difference measure of the extent to which a person enjoys engaging in effortful cognitive activity, impacts relevance assessments. High need for cognition participants had a significantly higher level of agreement with expert assessors than low need for cognition participants did. Our findings indicate that assessors should be exposed to documents from multiple relevance levels early in the judging process, in order to calibrate their relevance thresholds in a balanced way, and that individual difference measures might be a useful way to screen assessors.

[20] Models and metrics: IR evaluation as a user process / Moffat, Alistair / Scholer, Falk / Thomas, Paul Proceedings of ADCS'12, Australasian Document Computing Symposium 2012-12-05 p.47-54
ACM Digital Library Link
Summary: Retrieval system effectiveness can be measured in two quite different ways: by monitoring the behavior of users and gathering data about the ease and accuracy with which they accomplish certain specified information-seeking tasks; or by using numeric effectiveness metrics to score system runs in reference to a set of relevance judgments. The former has the benefit of directly assessing the actual goal of the system, namely the user's ability to complete a search task; whereas the latter approach has the benefit of being quantitative and repeatable. Each given effectiveness metric is an attempt to bridge the gap between these two evaluation approaches, since the implicit belief supporting the use of any particular metric is that user task performance should be correlated with the numeric score provided by the metric. In this work we explore that linkage, considering a range of effectiveness metrics, and the user search behavior that each of them implies. We then examine more complex user models, as a guide to the development of new effectiveness metrics. We conclude by summarizing an experiment that we believe will help establish the strength of the linkage between models and metrics.
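    As one concrete example of a metric that embodies an explicit user model, rank-biased precision (RBP, due to Moffat and Zobel) assumes a searcher who continues from each rank to the next with fixed persistence probability p; a minimal Python sketch (the abstract does not single out RBP, it is used here purely as an illustration):

        def rbp(rels, p=0.8):
            """Rank-biased precision: rank i (1-based) is inspected with
            probability p**(i-1), so RBP = (1-p) * sum(r_i * p**(i-1))."""
            return (1 - p) * sum(r * p**i for i, r in enumerate(rels))

        run = [1, 0, 1, 1, 0]                 # binary relevance, ranks 1..5
        print(rbp(run, p=0.5))                # impatient user: ~0.69
        print(rbp(run, p=0.95))               # persistent user: ~0.14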

[21] Sentence length bias in TREC novelty track judgements / Bando, Lorena Leal / Scholer, Falk / Turpin, Andrew Proceedings of ADCS'12, Australasian Document Computing Symposium 2012-12-05 p.55-61
ACM Digital Library Link
Summary: The Cranfield methodology for comparing document ranking systems has also been applied recently to comparing sentence ranking methods, which are used as pre-processors for summary generation methods. In particular, the TREC Novelty track data has been used to assess whether one sentence ranking system is better than another. This paper demonstrates that there is a strong bias in the Novelty track data for relevant sentences to also be longer sentences. Thus, systems that simply choose the longest sentences will often appear to perform better in terms of identifying "relevant" sentences than systems that use other methods. We demonstrate, by example, how this can lead to misleading conclusions about the comparative effectiveness of sentence ranking systems. We then demonstrate that if the Novelty track data is split into subcollections based on sentence length, comparing systems on each of the subcollections leads to conclusions that avoid the bias.
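    A small Python sketch of the degenerate baseline the paper warns about, ranking sentences purely by length (the sentences and relevance labels are invented):

        def precision_at_k(ranked, relevant, k):
            return sum(1 for s in ranked[:k] if s in relevant) / k

        sentences = {
            "s1": "Short filler.",
            "s2": "A much longer sentence that happens to carry the key facts of the story.",
            "s3": "Another lengthy sentence with plenty of detail judged relevant by assessors.",
            "s4": "Brief aside.",
        }
        relevant = {"s2", "s3"}  # in biased data, simply the long sentences
        by_length = sorted(sentences, key=lambda s: len(sentences[s]), reverse=True)
        print(precision_at_k(by_length, relevant, k=2))  # 1.0 with no real ranking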

[22] Differences in effectiveness across sub-collections Information retrieval short paper session / Sanderson, Mark / Turpin, Andrew / Zhang, Ying / Scholer, Falk Proceedings of the 2012 ACM Conference on Information and Knowledge Management 2012-10-29 p.1965-1969
ACM Digital Library Link
Summary: The relative performance of retrieval systems when evaluated on one part of a test collection may bear little or no similarity to the relative performance measured on a different part of the collection. In this paper we report the results of a detailed study of the impact that different sub-collections have on retrieval effectiveness, analyzing the effect over many collections, and with different approaches to sub-dividing the collections. The effect is shown to be substantial, impacting on comparisons between retrieval runs that are statistically significant. Some possible causes for the effect are investigated, and the implications of this work are examined for test collection design and for the strength of conclusions one can draw from experimental results.

[23] Efficient in-memory top-k document retrieval Architectures 1 / Culpepper, J. Shane / Petri, Matthias / Scholer, Falk Proceedings of the 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2012-08-12 p.225-234
ACM Digital Library Link
Summary: For over forty years the dominant data structure for ranked document retrieval has been the inverted index. Inverted indexes are effective for a variety of document retrieval tasks, and particularly efficient for large data collection scenarios that require disk access and storage. However, many efficiency-bound search tasks can now easily be supported entirely in memory as a result of recent hardware advances. In this paper we present a hybrid algorithmic framework for in-memory bag-of-words ranked document retrieval using a self-index derived from the FM-Index, wavelet tree, and the compressed suffix tree data structures, and evaluate the various algorithmic trade-offs for performing efficient queries entirely in-memory. We compare our approach with two classic approaches to bag-of-words queries using inverted indexes, term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing. We show that our framework is competitive with state-of-the-art indexing structures, and describe new capabilities provided by our algorithms that can be leveraged by future systems to improve effectiveness and efficiency for a variety of fundamental search operations.
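    For context, a minimal term-at-a-time (TAAT) sketch in Python over a toy inverted index, one of the two classic baselines the paper compares against (index contents and names are illustrative):

        import heapq
        from collections import defaultdict

        # Toy inverted index: term -> list of (doc_id, term_weight) postings.
        index = {"cat": [(1, 1.2), (3, 0.4)], "hat": [(1, 0.5), (2, 0.9)]}

        def taat_topk(query_terms, index, k=2):
            """Term-at-a-time: consume each term's postings list in full,
            accumulating per-document scores, then extract the top k."""
            accumulators = defaultdict(float)
            for term in query_terms:
                for doc, weight in index.get(term, []):
                    accumulators[doc] += weight
            return heapq.nlargest(k, accumulators.items(), key=lambda kv: kv[1])

        print(taat_topk(["cat", "hat"], index))  # [(1, 1.7), (2, 0.9)]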

[24] Making Personal Retrieval Systems Comparable Using Self-Assigned Task Attributes Posters / Sadeghi, Seyedeh Sargol / Sanderson, Mark / Scholer, Falk Proceedings of the Workshop on Human-Computer Interaction and Information Retrieval 2011-10-20 p.40
sites.google.com/site/hcirworkshop/P_Sadeghi_Sanderson_Scholer_hcir2011_submission_42.pdf
Summary: Evaluating Personal Search Systems is challenging due to the lack of common and shareable test collections in the personal context. Documents and search task requirements associated with this context are inherently personal and can vary widely among users. These characteristics make it difficult to gather documents and devise search tasks in order to build controllable test environments. This consequently leads to slow progress in the development of effective personal retrieval systems.
    In this position paper, we propose an approach to classifying search tasks based on their general attributes, which encourages users to classify the tasks themselves, as well as use tasks produced by others. To this end, we introduce a new model for the extraction of general task attributes which we call the Push-Pull Model. This approach can help to create comparable test environments across the tasks of different users. Furthermore, we highlight some of the key challenges for further investigation in this area.

[25] Quantifying test collection quality based on the consistency of relevance judgements Test collections / Scholer, Falk / Turpin, Andrew / Sanderson, Mark Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2011-07-25 p.1063-1072
ACM Digital Library Link
Summary: Relevance assessments are a key component for test collection-based evaluation of information retrieval systems. This paper reports on a feature of such collections that is used as a form of ground truth data to allow analysis of human assessment error. A wide range of test collections are retrospectively examined to determine how accurately assessors judge the relevance of documents. Our results demonstrate a high level of inconsistency across the collections studied. The level of irregularity is shown to vary across topics, with some showing a very high level of assessment error. We investigate possible influences on the error, and demonstrate that inconsistency in judging increases with time. While the level of detail in a topic specification does not appear to influence the errors that assessors make, judgements are significantly affected by the decisions made on previously seen similar documents. Assessors also display an assessment inertia. Alternate approaches to generating relevance judgements appear to reduce errors. A further investigation of the way that retrieval systems are ranked using sets of relevance judgements produced early and late in the judgement process reveals a consistent influence measured across the majority of examined test collections.
    We conclude that there is a clear value in examining, even inserting, ground truth data in test collections, and propose ways to help minimise the sources of inconsistency when creating future test collections.