HCI Bibliography : Search Results
Database updated: 2016-05-10 Searches since 2006-12-01: 32,646,487
Hosted by ACM SIGCHI
The HCI Bibliography was moved to a new server on 2015-05-12 and again on 2016-01-05, substantially degrading the environment for making updates.
There are no plans to add to the database.
Please send questions or comments to director@hcibib.org.
Query: Radlinski_F* Results: 23 Sorted by: Date
[1] Predicting Search Satisfaction Metrics with Interleaved Comparisons Session 6A: Experiment Design / Schuth, Anne / Hofmann, Katja / Radlinski, Filip Proceedings of the 2015 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2015-08-09 p.463-472
ACM Digital Library Link
Summary: The gold standard for online retrieval evaluation is AB testing. Rooted in the idea of a controlled experiment, AB tests compare the performance of an experimental system (treatment) on one sample of the user population to that of a baseline system (control) on another sample. Given an online evaluation metric that accurately reflects user satisfaction, these tests enjoy high validity. However, due to the high variance across users, these comparisons often have low sensitivity, requiring millions of queries to detect statistically significant differences between systems. Interleaving is an alternative online evaluation approach, where each user is presented with a combination of results from both the control and treatment systems. Compared to AB tests, interleaving has been shown to be substantially more sensitive. However, interleaving methods have so far focused on user clicks only, and lack support for more sophisticated user satisfaction metrics as used in AB testing. In this paper we present the first method for integrating user satisfaction metrics with interleaving. We show (1) how interleaving can be extended to directly match the user signals and parameters of AB metrics, and (2) how parameterized interleaving credit functions can be automatically calibrated to predict AB outcomes. We also develop a new method for estimating the relative sensitivity of interleaving and AB metrics, and show that our interleaving credit functions improve agreement with AB metrics without sacrificing sensitivity. Our results, using 38 large-scale online experiments encompassing over 3 billion clicks in a web search setting, demonstrate up to a 22% improvement in agreement with AB metrics (constituting over a 50% error reduction), while maintaining a sensitivity one to two orders of magnitude higher than that of the AB tests. This paves the way towards more sensitive and accurate online evaluation.
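
To make the interleaving mechanism concrete, here is a minimal sketch of team-draft interleaving with a pluggable, parameterized credit function of the kind the paper calibrates against AB metrics. The function names and the default weighting are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of team-draft interleaving with a parameterized credit
# function. Names and the default weight are illustrative placeholders.
import random

def team_draft_interleave(ranking_a, ranking_b):
    """Interleave two rankings; return the merged list and team assignments."""
    interleaved, teams = [], []
    ia = ib = 0
    while ia < len(ranking_a) and ib < len(ranking_b):
        # Randomize which team picks first in each round (coin flip).
        for team in random.sample(["A", "B"], 2):
            ranking, idx = (ranking_a, ia) if team == "A" else (ranking_b, ib)
            while idx < len(ranking) and ranking[idx] in interleaved:
                idx += 1  # skip documents already contributed by the other team
            if idx < len(ranking):
                interleaved.append(ranking[idx])
                teams.append(team)
            if team == "A":
                ia = idx + 1
            else:
                ib = idx + 1
    return interleaved, teams

def credit(clicked_positions, teams, weight=lambda rank: 1.0):
    """Assign per-team credit for clicks. `weight` is the tunable credit
    function that could, per the paper, be calibrated against AB metrics."""
    score = {"A": 0.0, "B": 0.0}
    for pos in clicked_positions:
        score[teams[pos]] += weight(pos + 1)
    return score
```

The ranker whose team accumulates more credit over many queries wins the comparison; the choice of `weight` is exactly the degree of freedom the paper exploits.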

[2] Online Search Evaluation with Interleaving OOEW 2015 / Radlinski, Filip Companion Proceedings of the 2015 International Conference on the World Wide Web 2015-05-18 v.2 p.917
ACM Digital Library Link
Summary: Online evaluation allows information retrieval systems to be assessed based on how real users respond to search results presented. Compared with traditional offline evaluation based on manual relevance assessments, online evaluation is particularly attractive in settings where reliable assessments are difficult or too expensive to obtain. However, the successful use of online evaluation requires the right metrics to be used, as real user behaviour is often difficult to interpret. I will present interleaving, a sensitive online evaluation approach that creates paired comparisons for every user query, and compare it with alternative A/B online evaluation approaches. I will also show how interleaving can be parameterized to create a family of evaluation metrics that can be chosen to best match the goals of an evaluation.

[3] Relevance and Effort: An Analysis of Document Utility IR Session 1: IR Evaluation / Yilmaz, Emine / Verma, Manisha / Craswell, Nick / Radlinski, Filip / Bailey, Peter Proceedings of the 2014 ACM Conference on Information and Knowledge Management 2014-11-03 p.91-100
ACM Digital Library Link
Summary: In this paper, we study one important source of the mismatch between user data and relevance judgments: the high degree of effort required by users to identify and consume the information in a document. Information retrieval relevance judges are trained to search for evidence of relevance when assessing documents. For complex documents, this can lead to judges spending substantial time considering each document. However, in practice, search users are often much more impatient: if they do not see evidence of relevance quickly, they tend to give up.
    Relevance judgments sit at the core of test collection construction, and are assumed to model the utility of documents to real users. However, comparisons of judgments with signals of relevance obtained from real users, such as click counts and dwell time, have demonstrated a systematic mismatch.
    Our results demonstrate that the amount of effort required to find the relevant information in a document plays an important role in the utility of that document to a real user. This effort is ignored in the way relevance judgments are currently obtained, despite the expectation that judges inform us about real users. We propose that if the goal is to evaluate the likelihood of utility to the user, effort as well as relevance should be taken into consideration, and possibly characterized independently, when judgments are obtained.

[4] An Eye-tracking Study of User Interactions with Query Auto Completion IR Session 5: Users / Hofmann, Katja / Mitra, Bhaskar / Radlinski, Filip / Shokouhi, Milad Proceedings of the 2014 ACM Conference on Information and Knowledge Management 2014-11-03 p.549-558
ACM Digital Library Link
Summary: Query Auto Completion (QAC) suggests possible queries to web search users from the moment they start entering a query. This popular feature of web search engines is thought to reduce physical and cognitive effort when formulating a query.
    Perhaps surprisingly, despite QAC being widely used, users' interactions with it are poorly understood. This paper begins to address this gap. We present the results of an in-depth user study of user interactions with QAC in web search. While study participants completed web search tasks, we recorded their interactions using eye-tracking and client-side logging. This allows us to provide a first look at how users interact with QAC. We specifically focus on the effects of QAC ranking, by controlling the quality of the ranking in a within-subject design.
    We identify a strong position bias that is consistent across ranking conditions. Due to this strong position bias, ranking quality affects QAC usage. We also find an effect on task completion, in particular on the number of result pages visited. We show how these effects can be explained by a combination of searchers' behavior patterns, namely monitoring or ignoring QAC, and searching for spelling support or complete queries to express a search intent. We conclude the paper with a discussion of the important implications of our findings for QAC evaluation.

[5] On user interactions with query auto-completion Poster session (short papers) / Mitra, Bhaskar / Shokouhi, Milad / Radlinski, Filip / Hofmann, Katja Proceedings of the 2014 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2014-07-06 p.1055-1058
ACM Digital Library Link
Summary: Query Auto-Completion (QAC) is a popular feature of web search engines that aims to assist users to formulate queries faster and avoid spelling mistakes by presenting them with possible completions as soon as they start typing. However, despite the wide adoption of auto-completion in search systems, there is little published on how users interact with such services.
    In this paper, we present the first large-scale study of user interactions with auto-completion based on query logs of Bing, a commercial search engine. Our results confirm that lower-ranked auto-completion suggestions receive substantially lower engagement than those ranked higher. We also observe that users are most likely to engage with auto-completion after typing about half of the query, and in particular at word boundaries. Interestingly, we also noticed that the likelihood of using auto-completion varies with the distance of query characters on the keyboard.
    Overall, we believe that the results reported in our study provide valuable insights for understanding user engagement with auto-completion, and are likely to inform the design of more effective QAC systems.

[6] On correlation of absence time and search effectiveness Poster session (short papers) / Chakraborty, Sunandan / Radlinski, Filip / Shokouhi, Milad / Baecke, Paul Proceedings of the 2014 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2014-07-06 p.1163-1166
ACM Digital Library Link
Summary: Online search evaluation metrics are typically derived from implicit user feedback, for instance by counting page clicks or queries, or by measuring dwell time on a search result. In a recent paper, Dupret and Lalmas introduced a new metric called absence time, which uses the time interval between successive sessions of users to measure their satisfaction with the system. They evaluated this metric on a version of Yahoo! Answers. In this paper, we investigate the effectiveness of absence time in evaluating new features in a web search engine, such as a new ranking algorithm or a new user interface. We measured how absence time responded to the effects of 21 experiments performed on a search engine. Our findings show that the outcomes of absence time agreed with the judgement of human experts performing a thorough analysis of a wide range of online and offline metrics in 14 out of these 21 cases.
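
For reference, absence time can be computed directly from session logs. A minimal sketch, assuming each log row is a (user_id, session_start_timestamp) pair; the field layout is illustrative, not from the paper:

```python
# Compute mean absence time (gap between successive sessions) per the
# metric described above; data layout is an assumption for illustration.
from collections import defaultdict

def mean_absence_time(session_starts):
    """session_starts: iterable of (user_id, unix_timestamp) tuples.
    Returns the mean gap in seconds between a user's successive sessions."""
    by_user = defaultdict(list)
    for user, ts in session_starts:
        by_user[user].append(ts)
    gaps = []
    for times in by_user.values():
        times.sort()
        gaps.extend(b - a for a, b in zip(times, times[1:]))
    return sum(gaps) / len(gaps) if gaps else None
```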
    We also investigated the relationship between absence time and a set of commonly-used covariates (features) such as the number of queries and clicks in the session. Our results suggest that users are likely to return to the search engine sooner when their previous session has more queries and more clicks.

[7] Choices and constraints: research goals and approaches in information retrieval (part 1) Tutorials / Kelly, Diane / Radlinski, Filip / Teevan, Jaime Proceedings of the 2014 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2014-07-06 p.1283
ACM Digital Library Link
Summary: All research projects begin with a goal, for instance to describe search behavior, to predict when a person will enter a second query, or to discover which IR system performs the best. Different research goals suggest different research approaches, ranging from field studies to lab studies to online experimentation. This tutorial will provide an overview of the different types of research goals, common evaluation approaches used to address each type, and the constraints each approach entails. Participants will come away with a broad perspective of research goals and approaches in IR, and an understanding of the benefits and limitations of these research approaches. The tutorial will take place in two independent, but interrelated parts, each focusing on a unique set of research approaches but with the same intended tutorial outcomes. These outcomes will be accomplished by deconstructing and analyzing our own published research papers, with further illustrations of each technique using the broader literature. By using our own research as anchors, we will provide insight about the research process, revealing the difficult choices and trade-offs researchers make when designing and conducting IR studies.

[8] Choices and constraints: research goals and approaches in information retrieval (part 2) Tutorials / Kelly, Diane / Radlinski, Filip / Teevan, Jaime Proceedings of the 2014 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2014-07-06 p.1284
ACM Digital Library Link
Summary: Identical to part 1 (entry [7]) above; the two parts of the tutorial share a single abstract.

[9] Fighting search engine amnesia: reranking repeated results Users and interactive IR II / Shokouhi, Milad / White, Ryen W. / Bennett, Paul / Radlinski, Filip Proceedings of the 2013 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2013-07-28 p.273-282
ACM Digital Library Link
Summary: Web search engines frequently show the same documents repeatedly for different queries within the same search session, in essence forgetting that the same documents were already shown to users. Depending on previous user interaction with the repeated results, and on the details of the session, we show that sometimes the repeated results should be promoted, while at other times they should be demoted.
    Analysing search logs from two different commercial search engines, we find that results are repeated in about 40% of multi-query search sessions, and that users engage differently with repeats than with results shown for the first time. We demonstrate how statistics about result repetition within search sessions can be incorporated into ranking for personalizing search results. Our results on query logs of two large-scale commercial search engines suggest that we successfully promote documents that are more likely to be clicked by the user in the future while maintaining performance over standard measures of non-personalized relevance.
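
The promote/demote intuition can be sketched as a simple score adjustment over the session's history. The weights below are made-up placeholders rather than the paper's learned personalization model:

```python
# Illustrative sketch: adjust a base relevance score for results already
# shown in this session, depending on whether the user clicked them.
# `promote` and `demote` are hypothetical constants, not learned weights.
def rerank_with_repeats(results, shown_before, clicked_before,
                        promote=0.2, demote=0.2):
    """results: list of (doc_id, base_score) pairs for the current query."""
    def adjusted(item):
        doc, score = item
        if doc in clicked_before:
            return score + promote   # re-finding: promote the repeat
        if doc in shown_before:
            return score - demote    # previously seen and skipped: demote
        return score
    return sorted(results, key=adjusted, reverse=True)
```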

[10] Practical Online Retrieval Evaluation Tutorials / Radlinski, Filip / Hofmann, Katja Proceedings of ECIR'13, the 2013 European Conference on Information Retrieval 2013-03-24 p.878-881
Keywords: Interleaving; Clicks; Search Engine; Online Evaluation
Link to Digital Content at Springer
Summary: Online evaluation allows the assessment of information retrieval (IR) techniques based on how real users respond to them. Because this technique is directly based on observed user behavior, it is a promising alternative to traditional offline evaluation, which is based on manual relevance assessments. In particular, online evaluation can enable comparisons in settings where reliable assessments are difficult to obtain (e.g., personalized search) or expensive (e.g., for search by trained experts in specialized collections).
    Despite its advantages, and its successful use in commercial settings, online evaluation is rarely employed outside of large commercial search engines due to a perception that it is impractical at small scales. The goal of this tutorial is to show how online evaluations can be conducted in such settings, demonstrate software to facilitate its use, and promote further research in the area. We will also contrast online evaluation with standard offline evaluation, and provide an overview of online approaches.

[11] On caption bias in interleaving experiments IR track: evaluation methodologies / Hofmann, Katja / Behr, Fritz / Radlinski, Filip Proceedings of the 2012 ACM Conference on Information and Knowledge Management 2012-10-29 p.115-124
ACM Digital Library Link
Summary: Information retrieval evaluation most often involves manually assessing the relevance of particular query-document pairs. In cases where this is difficult (such as personalized search), interleaved comparison methods are becoming increasingly common. These methods compare pairs of ranking functions based on user clicks on search results, thus better reflecting true user preferences. However, by depending on clicks, there is a potential for bias. For example, users have been previously shown to be more likely to click on results with attractive titles and snippets. An interleaving evaluation where one ranker tends to generate results that attract more clicks (without being more relevant) may thus be biased.
    We present an approach for detecting and compensating for this type of bias in interleaving evaluations. Introducing a new model of caption bias, we propose features that model bias based on (1) per-document effects, and (2) the (pairwise) relationships between a document and surrounding documents. We show that our model can effectively capture click behavior, with best results achieved by a model that combines both per-document and pairwise features. Applying this model to re-weight observed user clicks, we find a small overall effect on real interleaving comparisons, but also identify a case where initially detected preferences vanish after caption bias re-weighting is applied. Our results indicate that our model of caption bias is effective and can successfully identify interleaving experiments affected by caption bias.
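
The re-weighting step can be illustrated with a small sketch: each click's credit in an interleaving comparison is scaled down by an estimated caption attractiveness, standing in for the paper's per-document and pairwise feature model. The attractiveness scores here are hypothetical inputs:

```python
# Sketch of bias-corrected click credit for an interleaving comparison.
# `attractiveness[pos]` is a hypothetical model estimate of how much the
# caption at that position draws clicks independently of relevance.
def debiased_credit(clicked_positions, teams, attractiveness):
    """teams: per-position team labels ("A"/"B") from the interleaved list."""
    score = {"A": 0.0, "B": 0.0}
    for pos in clicked_positions:
        # Clicks the model attributes mostly to an attractive caption
        # (rather than relevance) count for less.
        score[teams[pos]] += 1.0 / max(attractiveness[pos], 1e-6)
    return score
```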

[12] Large-scale validation and analysis of interleaved search evaluation / Chapelle, Olivier / Joachims, Thorsten / Radlinski, Filip / Yue, Yisong ACM Transactions on Information Systems 2012-02 v.30 n.1 p.6
ACM Digital Library Link
Summary: Interleaving is an increasingly popular technique for evaluating information retrieval systems based on implicit user feedback. While a number of isolated studies have analyzed how this technique agrees with conventional offline evaluation approaches and other online techniques, a complete picture of its efficiency and effectiveness is still lacking. In this paper we extend and combine the body of empirical evidence regarding interleaving, and provide a comprehensive analysis of interleaving using data from two major commercial search engines and a retrieval system for scientific literature. In particular, we analyze the agreement of interleaving with manual relevance judgments and observational implicit feedback measures, estimate the statistical efficiency of interleaving, and explore the relative performance of different interleaving variants. We also show how to learn improved credit-assignment functions for clicks that further increase the sensitivity of interleaving.

[13] Inferring and using location metadata to personalize web search Personalization / Bennett, Paul N. / Radlinski, Filip / White, Ryen W. / Yilmaz, Emine Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2011-07-25 p.135-144
ACM Digital Library Link
Summary: Personalization of search results offers the potential for significant improvements in Web search. Among the many observable user attributes, approximate user location is particularly simple for search engines to obtain and allows personalization even for a first-time Web search user. However, acting on user location information is difficult, since few Web documents include an address that can be interpreted as constraining the locations where the document is relevant. Furthermore, many Web documents -- such as local news stories, lottery results, and sports team fan pages -- may not correspond to physical addresses, but the location of the user still plays an important role in document relevance. In this paper, we show how to infer a more general location relevance which uses not only physical location but a more general notion of locations of interest for Web pages. We compute this information using implicit user behavioral data, characterize the most location-centric pages, and show how location information can be incorporated into Web search ranking. Our results show that a substantial fraction of Web search queries can be significantly improved by incorporating location-based features.

[14] Practical online retrieval evaluation Tutorials / Radlinski, Filip / Yue, Yisong Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2011-07-25 p.1301-1302
ACM Digital Library Link
Summary: Online evaluation is amongst the few evaluation techniques available to the information retrieval community that is guaranteed to reflect how users actually respond to improvements developed by the community. Broadly speaking, online evaluation refers to any evaluation of retrieval quality conducted while observing user behavior in a natural context. However, it is rarely employed outside of large commercial search engines due primarily to a perception that it is impractical at small scales. The goal of this tutorial is to familiarize information retrieval researchers with state-of-the-art techniques in evaluating information retrieval systems based on natural user clicking behavior, as well as to show how such methods can be practically deployed. In particular, our focus will be on demonstrating how the Interleaving approach and other click based techniques contrast with traditional offline evaluation, and how these online methods can be effectively used in academic-scale research. In addition to lecture notes, we will also provide sample software and code walk-throughs to showcase the ease with which Interleaving and other click-based methods can be employed by students, academics and other researchers.

[15] Comparing the sensitivity of information retrieval metrics Non-English IR & evaluation / Radlinski, Filip / Craswell, Nick Proceedings of the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2010-07-19 p.667-674
Keywords: evaluation, interleaving, search
ACM Digital Library Link
Summary: Information retrieval effectiveness is usually evaluated using measures such as Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP) and Precision at some cutoff (Precision@k) on a set of judged queries. Recent research has suggested an alternative, evaluating information retrieval systems based on user behavior. Particularly promising are experiments that interleave two rankings and track user clicks. According to a recent study, interleaving experiments can identify large differences in retrieval effectiveness with much better reliability than other click-based methods.
    We study interleaving in more detail, comparing it with traditional measures in terms of reliability, sensitivity and agreement. To detect very small differences in retrieval effectiveness, a reliable outcome with standard metrics requires about 5,000 judged queries, and this is about as reliable as interleaving with 50,000 user impressions. Amongst the traditional measures, NDCG has the strongest correlation with interleaving. Finally, we present some new forms of analysis, including an approach to enhance interleaving sensitivity.
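
For reference, NDCG, the traditional measure found here to correlate most strongly with interleaving, is computed as follows. This is the conventional definition, not code from the paper:

```python
# Standard NDCG@k: discounted cumulative gain of the ranking, normalized
# by the DCG of the ideal (gain-sorted) ordering.
import math

def dcg(gains, k):
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(gains, k):
    """gains: relevance gains in ranked order, e.g. graded judgments."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0
```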

[16] Metrics for assessing sets of subtopics Poster presentations / Radlinski, Filip / Szummer, Martin / Craswell, Nick Proceedings of the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2010-07-19 p.853-854
Keywords: diversity, novelty, subtopic
ACM Digital Library Link
Summary: To evaluate the diversity of search results, test collections have been developed that identify multiple intents for each query. Intents are the different meanings or facets that should be covered in a search results list. This means that topic development involves proposing a set of intents for each query. We propose four measurable properties of query-to-intent mappings, allowing for more principled topic development for such test collections.

[17] Inferring query intent from reformulations and clicks WWW posters / Radlinski, Filip / Szummer, Martin / Craswell, Nick Proceedings of the 2010 International Conference on the World Wide Web 2010-04-26 v.1 p.1171-1172
Keywords: diversity, intents, subtopics
ACM Digital Library Link
Summary: Many researchers have noted that web search queries are often ambiguous or unclear. We present an approach for identifying the popular meanings of queries using web search logs and user click behavior. Evaluating on TREC queries, we show that our approach produces more complete and user-centric intents than expert judges. This approach was also used by the TREC 2009 Web Track judges to obtain more representative topic descriptions from real queries.

[18] How does clickthrough data reflect retrieval quality? IR: web search 1 / Radlinski, Filip / Kurup, Madhu / Joachims, Thorsten Proceedings of the 2008 ACM Conference on Information and Knowledge Management 2008-10-26 p.43-52
ACM Digital Library Link
Summary: Automatically judging the quality of retrieval functions based on observable user behavior holds promise for making retrieval evaluation faster, cheaper, and more user centered. However, the relationship between observable user behavior and retrieval quality is not yet fully understood. We present a sequence of studies investigating this relationship for an operational search engine on the arXiv.org e-print archive. We find that none of the eight absolute usage metrics we explore (e.g., number of clicks, frequency of query reformulations, abandonment) reliably reflect retrieval quality for the sample sizes we consider. However, we find that paired experiment designs adapted from sensory analysis produce accurate and reliable statements about the relative quality of two retrieval functions. In particular, we investigate two paired comparison tests that analyze clickthrough data from an interleaved presentation of ranking pairs, and we find that both give accurate and consistent results. We conclude that both paired comparison tests give substantially more accurate and sensitive evaluation results than absolute usage metrics in our domain.
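
A paired comparison of this kind is typically analyzed by counting per-query wins for each ranker and applying a sign test; a conventional sketch (not necessarily the authors' exact procedure), with tied queries excluded:

```python
# Two-sided binomial sign test over per-query interleaving outcomes.
# Under the null, each non-tied query is a fair coin flip between rankers.
from math import comb

def sign_test(wins_a, wins_b):
    n = wins_a + wins_b           # ties are excluded before this point
    k = max(wins_a, wins_b)
    p = 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(p, 1.0)            # two-sided p-value
```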

[19] Optimizing relevance and revenue in ad search: a query substitution approach Non-topicality / Radlinski, Filip / Broder, Andrei / Ciccolo, Peter / Gabrilovich, Evgeniy / Josifovski, Vanja / Riedel, Lance Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2008-07-20 p.403-410
ACM Digital Library Link
Summary: The primary business model behind Web search is based on textual advertising, where contextually relevant ads are displayed alongside search results. We address the problem of selecting these ads so that they are both relevant to the queries and profitable to the search engine, showing that optimizing for ad relevance and optimizing for revenue are not equivalent. Selecting the best ads that satisfy these constraints also naturally incurs high computational costs, and time constraints can lead to reduced relevance and profitability. We propose a novel two-stage approach, which conducts most of the analysis ahead of time. An offline preprocessing phase leverages additional knowledge that is impractical to use in real time, and rewrites frequent queries in a way that subsequently facilitates fast and accurate online matching. Empirical evaluation shows that our method optimized for relevance matches a state-of-the-art method while improving expected revenue. When optimizing for revenue, we see even more substantial improvements in expected revenue.
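
The two-stage architecture can be sketched schematically: an offline pass precomputes rewrites for frequent queries, and the online path reduces to a cheap lookup. All names and the scoring placeholder are illustrative assumptions, not the paper's system:

```python
# Schematic sketch of the offline-rewrite / online-lookup split.
def build_rewrite_table(frequent_queries, rewrite_fn):
    """Offline: run the expensive query analysis ahead of time.
    rewrite_fn maps a query to a list of substitute queries."""
    return {q: rewrite_fn(q) for q in frequent_queries}

def select_ads(query, rewrite_table, ad_index, k=3):
    """Online: gather ads matching the precomputed rewrites and rank by a
    combined relevance/revenue score (placeholder field `score`)."""
    candidates = []
    for q in rewrite_table.get(query, [query]):
        candidates.extend(ad_index.get(q, []))
    return sorted(candidates, key=lambda ad: ad["score"], reverse=True)[:k]
```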

[20] A support vector method for optimizing average precision Learning to rank I / Yue, Yisong / Finley, Thomas / Radlinski, Filip / Joachims, Thorsten Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007-07-23 p.271-278
ACM Digital Library Link
Summary: Machine learning is commonly used to improve ranked retrieval systems. Due to computational difficulties, few learning techniques have been developed to directly optimize for mean average precision (MAP), despite its widespread use in evaluating such systems. Existing approaches optimizing MAP either do not find a globally optimal solution, or are computationally expensive. In contrast, we present a general SVM learning algorithm that efficiently finds a globally optimal solution to a straightforward relaxation of MAP. We evaluate our approach using the TREC 9 and TREC 10 Web Track corpora (WT10g), comparing against SVMs optimized for accuracy and ROCArea. In most cases we show our method to produce statistically significant improvements in MAP scores.
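
For reference, the average precision quantity being optimized can be computed directly; this is the conventional definition rather than the paper's SVM relaxation:

```python
# Average precision of a single ranked list: mean of precision@i over the
# rank positions i of relevant documents. Assumes the list covers all
# relevant documents, so the final hit count equals the relevant total.
def average_precision(relevance):
    """relevance: 0/1 labels in ranked order."""
    hits, precisions = 0, []
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / hits if hits else 0.0
```

MAP is then the mean of this value over a set of queries.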

[21] Recommending related papers based on digital library access records Historical digital libraries / Pohl, Stefan / Radlinski, Filip / Joachims, Thorsten JCDL'07: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries 2007-06-18 p.417-418
ACM Digital Library Link
Summary: An important goal for digital libraries is to enable researchers to more easily explore related work. While citation data is often used as an indicator of relatedness, in this paper we demonstrate that digital access records (e.g. http-server logs) can be used as indicators as well. In particular, we show that measures based on co-access provide better coverage than co-citation, that they are available much sooner, and that they are more accurate for recent papers.
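
A co-access measure of the kind described can be sketched by counting pairs of papers accessed within the same session; the data layout is an assumption for illustration:

```python
# Count co-accesses: papers downloaded in the same session are treated
# as related, analogously to co-citation counts.
from collections import Counter
from itertools import combinations

def co_access_counts(sessions):
    """sessions: iterable of sets of paper ids accessed together."""
    counts = Counter()
    for papers in sessions:
        for a, b in combinations(sorted(papers), 2):
            counts[(a, b)] += 1
    return counts
```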

[22] Evaluating the accuracy of implicit feedback from clicks and query reformulations in Web search / Joachims, Thorsten / Granka, Laura / Pan, Bing / Hembrooke, Helene / Radlinski, Filip / Gay, Geri ACM Transactions on Information Systems 2007 v.25 n.2 p.7
ACM Digital Library Link
Summary: This article examines the reliability of implicit feedback generated from clickthrough data and query reformulations in World Wide Web (WWW) search. Analyzing the users' decision process using eyetracking and comparing implicit feedback against manual relevance judgments, we conclude that clicks are informative but biased. While this makes the interpretation of clicks as absolute relevance judgments difficult, we show that relative preferences derived from clicks are reasonably accurate on average. We find that such relative preferences are accurate not only between results from an individual query, but across multiple sets of results within chains of query reformulations.
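
The relative preferences referred to here follow the "Click > Skip Above" style of heuristic associated with this line of work: a clicked result is inferred to be preferred over unclicked results ranked above it. A minimal sketch:

```python
# Extract pairwise preferences from a single result page using the
# Click > Skip Above heuristic described in this line of work.
def click_skip_above(ranking, clicked):
    """ranking: doc ids in ranked order; clicked: set of clicked doc ids.
    Returns (preferred, less_preferred) pairs."""
    prefs = []
    for i, doc in enumerate(ranking):
        if doc in clicked:
            prefs.extend((doc, ranking[j])
                         for j in range(i) if ranking[j] not in clicked)
    return prefs
```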

[23] Improving personalized web search using result diversification Posters / Radlinski, Filip / Dumais, Susan Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2006-08-06 p.691-692
ACM Digital Library Link
Summary: We present and evaluate methods for diversifying search results to improve personalized web search. A common personalization approach involves reranking the top N search results such that documents likely to be preferred by the user are presented higher. The usefulness of reranking is limited in part by the number and diversity of results considered. We propose three methods to increase the diversity of the top results and evaluate the effectiveness of these methods.
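
Since the three proposed methods are not detailed in this abstract, here is a generic greedy diversification sketch in the MMR style, conveying the reranking idea: trade off a document's personalized score against its similarity to results already selected. The parameter values and function names are illustrative:

```python
# Greedy MMR-style diversification of the top-N results; `score` is a
# hypothetical personalized relevance function and `sim` a hypothetical
# pairwise document similarity, with trade-off parameter `lam`.
def diversify(candidates, score, sim, top_n=10, lam=0.7):
    selected = []
    pool = list(candidates)
    while pool and len(selected) < top_n:
        best = max(pool, key=lambda d: lam * score(d)
                   - (1 - lam) * max((sim(d, s) for s in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected
```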