HCI Bibliography : Search Results
Database updated: 2016-05-10
Hosted by ACM SIGCHI
The HCI Bibliography was moved to a new server 2015-05-12 and again 2016-01-05, substantially degrading the environment for making updates.
There are no plans to add to the database.
Please send questions or comments to director@hcibib.org.
Query: Hawking_D* Results: 23 Sorted by: Date
[1] On Term Selection Techniques for Patent Prior Art Search Short Papers / Far, Mona Golestan / Sanner, Scott / Bouadjenek, Mohamed Reda / Ferraro, Gabriela / Hawking, David Proceedings of the 2015 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2015-08-09 p.803-806
ACM Digital Library Link
Summary: In this paper, we investigate the influence of term selection on retrieval performance on the CLEF-IP prior art test collection, using the Description section of the patent query with Language Model (LM) and BM25 scoring functions. We find that an oracular relevance feedback system that extracts terms from the judged relevant documents far outperforms the baseline and performs twice as well on MAP as the best competitor in CLEF-IP 2010. We find a very clear term selection value threshold for use when choosing terms. We also noticed that most of the useful feedback terms are actually present in the original query and hypothesized that the baseline system could be substantially improved by removing negative query terms. We tried four simple automated approaches to identify negative terms for query reduction but we were unable to notably improve on the baseline performance with any of them. However, we show that a simple, minimal interactive relevance feedback approach where terms are selected from only the first retrieved relevant document outperforms the best result from CLEF-IP 2010 suggesting the promise of interactive methods for term selection in patent prior art search.
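
A short, purely illustrative Python sketch of threshold-based feedback term selection, using an assumed tf-idf-style "term selection value" and a hypothetical threshold (the paper's actual scoring is not reproduced here):

# Hypothetical sketch: keep terms from a judged-relevant document whose
# assumed selection value (tf x idf here) exceeds a threshold.
import math
from collections import Counter

def select_feedback_terms(relevant_doc_terms, doc_freq, num_docs, threshold=2.0):
    tf = Counter(relevant_doc_terms)
    selected = {}
    for term, f in tf.items():
        idf = math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1))
        value = f * idf  # assumed selection value, not the paper's definition
        if value >= threshold:
            selected[term] = value
    return sorted(selected, key=selected.get, reverse=True)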

[2] If SIGIR had an Academic Track, What Would Be In It? Industry Track Invited Talks / Hawking, David Proceedings of the 2015 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2015-08-09 p.1077
ACM Digital Library Link
Summary: It used to be the case that very little industry research was presented at SIGIR. Now the balance has radically changed -- many accepted papers have industry authors and many rely on industry data sets -- to the extent that a leading academic member of the SIGIR community has light-heartedly proposed the creation of an Academic Track.
    Behind the levity lies the important question of how a researcher can make a meaningful contribution to the field, in the absence of petabyte-scale sets of documents and massive user-interaction logs. Theoretical contributions can revolutionize thinking, but have greatest impact when applicable in practice, and when empirically validated.
    In my years at Funnelback and more recently at Microsoft I have been very aware of high-impact but not-well-solved IR problems involving relatively tiny datasets. Many of them are characterized by sparsity of user interaction data and are hence not well-suited to simple machine learning approaches or to large scale A/B testing. My talk will illustrate and attempt to characterize these problems and to suggest fruitful areas for academic research.
    If time permits, I will mention some areas in which academic research has contributed to current large-scale industry practice.

[3] An enterprise search paradigm based on extended query auto-completion: do we still need search and navigation? / Hawking, David / Griffiths, Kathy Proceedings of ADCS'13, Australasian Document Computing Symposium 2013-12-05 p.18-25
ACM Digital Library Link
Summary: Enterprise query auto-completion (QAC) can allow website or intranet visitors to satisfy a need more efficiently than traditional searching and browsing. The limited scope of an enterprise makes it possible to satisfy a high proportion of information needs through completion. Further, the availability of structured sources of completions such as product catalogues compensates for sparsity of log data. Extended forms (X-QAC) can give access to information that is inaccessible via a conventional crawled index.
    We show that it can be guaranteed that for every suggestion there is a prefix which causes it to appear in the top k suggestions. Using university query logs and structured lists, we quantify the significant keystroke savings attributable to this guarantee (worst case). Such savings may be of particular value for mobile devices. A user experiment showed that a staff lookup task took an average of 61% longer with a conventional search interface than with an X-QAC system.
    Using wine catalogue data we demonstrate a further extension which allows a user to home in on desired items in faceted-navigation style. We also note that advertisements can be triggered from QAC.
    Given the advantages and power of X-QAC systems, we envisage that websites and intranets of the [near] future will provide less navigation and rely less on conventional search.
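
As a rough illustration of the top-k guarantee discussed above (a brute-force sketch, not the paper's algorithm), the shortest prefix that surfaces a given suggestion can be found as follows; the worst-case keystroke count is the length of that prefix:

def top_k(suggestions, scores, prefix, k):
    # Rank all suggestions matching the prefix by an assumed popularity score.
    matches = [s for s in suggestions if s.startswith(prefix)]
    return sorted(matches, key=lambda s: scores[s], reverse=True)[:k]

def shortest_guaranteeing_prefix(target, suggestions, scores, k=10):
    # Try progressively longer prefixes of the target suggestion.
    for i in range(1, len(target) + 1):
        if target in top_k(suggestions, scores, target[:i], k):
            return target[:i]
    return target  # typing the full suggestion always retrieves it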

[4] Merging algorithms for enterprise search / Li, PengFei (Vincent) / Thomas, Paul / Hawking, David Proceedings of ADCS'13, Australasian Document Computing Symposium 2013-12-05 p.42-49
ACM Digital Library Link
Summary: Effective enterprise search must draw on a number of sources -- for example web pages, telephone directories, and databases. Doing this means we need a way to make a single sorted list from results of very different types.
    Many merging algorithms have been proposed but none have been applied to this, realistic, application. We report the results of an experiment which simulates heterogeneous enterprise retrieval, in a university setting, and uses multi-grade expert judgements to compare merging algorithms. Merging algorithms considered include several variants of round-robin, several methods proposed by Rasolofo et al. in the Current News Metasearcher, and four novel variations including a learned multi-weight method.
    We find that the round-robin methods and one of the Rasolofo methods perform significantly worse than others. The GDS_TS method of Rasolofo achieves the highest average NDCG@10 score but the differences between it and the other GDS methods, local reranking, and the multi-weight method were not significant.
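
For context, a minimal Python sketch of the round-robin baseline family mentioned above (illustrative only; the compared variants differ in tie-breaking and source ordering):

def round_robin_merge(result_lists, depth=10):
    # Take one result from each source in turn, skipping duplicates.
    merged, seen = [], set()
    for rank in range(max(len(r) for r in result_lists)):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                merged.append(results[rank])
                seen.add(results[rank])
            if len(merged) == depth:
                return merged
    return merged

# e.g. round_robin_merge([["w1", "w2"], ["p1"], ["d1", "d2"]], depth=4)
# -> ["w1", "p1", "d1", "w2"]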

[5] Reordering an index to speed query processing without loss of effectiveness / Hawking, David / Jones, Timothy Proceedings of ADCS'12, Australasian Document Computing Symposium 2012-12-05 p.17-24
ACM Digital Library Link
Summary: Following Long and Suel, we empirically investigate the importance of document order in search engines which rank documents using a combination of dynamic (query-dependent) and static (query-independent) scores, and use document-at-a-time (DAAT) processing. When inverted file postings are in collection order, assigning document numbers in order of descending static score supports lossless early termination while maintaining good compression.
    Since static scores may not be available until all documents have been gathered and indexed, we build a tool for reordering an existing index and show that it operates in less than 20% of the original indexing time. We note that this additional cost is easily recouped by savings at query processing time. We compare best early-termination points for several different index orders on three enterprise search collections (a whole-of-government index with two very different query sets, and a collection from a UK university). We also present results for the same orders for ClueWeb09-CatB. Our evaluation focuses on finding results likely to be clicked on by users of Web or website search engines -- Nav and Key results in the TREC 2011 Web Track judging scheme.
    The orderings tested are Original, Reverse, Random, and QIE (descending order of static score). For three enterprise search test sets we find that QIE order can achieve close-to-maximal search effectiveness with much lower computational cost than for other orderings. Additionally, reordering has negligible impact on compressed index size for indexes that contain position information. Our results for an artificial query set against the TREC ClueWeb09 Category B collection are much more equivocal and we canvass possible explanations for future investigation.
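
A minimal sketch of QIE-style reordering under assumed in-memory data structures (the authors' tool operates on an existing on-disk index):

def reorder_by_static_score(postings, static_score):
    # postings: term -> sorted list of old doc ids; static_score: old doc id -> score.
    # Renumber documents in descending static-score order, then rewrite postings.
    old_ids = sorted(static_score, key=static_score.get, reverse=True)
    remap = {old: new for new, old in enumerate(old_ids)}
    return {term: sorted(remap[d] for d in plist)
            for term, plist in postings.items()}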

[6] Relative effect of spam and irrelevant documents on user interaction with search engines Poster session: information retrieval / Jones, Timothy / Hawking, David / Thomas, Paul / Sankaranarayana, Ramesh Proceedings of the 2011 ACM Conference on Information and Knowledge Management 2011-10-24 p.2113-2116
ACM Digital Library Link
Summary: Meaningful evaluation of web search must take account of spam. Here we conduct a user experiment to investigate whether satisfaction with search engine result pages as a whole is harmed more by spam or by irrelevant documents. On some measures, search result pages are differentially harmed by the insertion of spam and irrelevant documents. Additionally we find that when users are given two documents of equal utility, the one with the lower spam score will be preferred; a result page without any spam documents will be preferred to one with spam; and an irrelevant document high in a result list is surprisingly more damaging to user satisfaction than a spam document. We conclude that web ranking and evaluation should consider both utility (relevance) and "spamminess" of documents.

[7] What deliberately degrading search quality tells us about discount functions Posters presentations / Thomas, Paul / Jones, Timothy / Hawking, David Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2011-07-25 p.1107-1108
ACM Digital Library Link
Summary: Deliberate degradation of search results is a common tool in user experiments. We degrade high-quality search results by inserting non-relevant documents at different ranks. The effect of these manipulations, on a number of commonly-used metrics, is counter-intuitive: the discount functions implicit in P@k, MRR, NDCG, and others do not account for the true relationship between rank and value to the user. We propose an alternative, based on visibility data.
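
For reference (standard definitions, not taken from the poster), the per-rank discounts implicit in the metrics named above can be written as

\[
d_{P@k}(i) = \frac{1}{k}\,[\,i \le k\,], \qquad
d_{\mathrm{MRR}}(i) = \frac{1}{i}, \qquad
d_{\mathrm{DCG}}(i) = \frac{1}{\log_2(i+1)},
\]

so P@k treats all ranks up to k as equally valuable, MRR credits only the first relevant result, and DCG decays logarithmically; the poster argues that none of these matches the relationship between rank and value actually observed for users.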

[8] Live web search experiments for the rest of us WWW 2010 demos / Jones, Timothy / Hawking, David / Sankaranarayana, Ramesh Proceedings of the 2010 International Conference on the World Wide Web 2010-04-26 v.1 p.1265-1268
Keywords: browser extensions, implicit measures, web search
ACM Digital Library Link
Summary: There are significant barriers to academic research into user Web search preferences. Academic researchers are unable to manipulate the results shown by a major search engine to users and would have no access to the interaction data collected by the engine. Our initial approach to overcoming this was to ask participants to submit queries to an experimental search engine rather than their usual search tool. Over several different experiments we found that initial user buy-in was high but that people quickly drifted back to their old habits and stopped contributing data. Here, we report our investigation of possible reasons why this occurs. An alternative approach is exemplified by the Lemur browser toolbar, which allows local collection of user interaction data from search engine sessions, but does not allow result pages to be modified. We will demonstrate a new Firefox toolbar that we have developed to support experiments in which search results may be arbitrarily manipulated. Using our toolbar, academics can set up the experiments they want to conduct, while collecting (subject to human experimentation guidelines) queries, clicks and dwell times as well as optional explicit judgments.

[9] New-web search with microblog annotations WWW 2010 demos / Rowlands, Tom / Hawking, David / Sankaranarayana, Ramesh Proceedings of the 2010 International Conference on the World Wide Web 2010-04-26 v.1 p.1293-1296
Keywords: demonstration, information retrieval, microblogging, search, twitter, web search
ACM Digital Library Link
Summary: Web search engines discover indexable documents by recursively 'crawling' from a seed URL. Their rankings take into account link popularity. While this works well, it introduces biases towards older documents. Older documents are more likely to be the target of links, while new documents with few, or no, incoming links are unlikely to rank highly in search results.
    We describe a novel system for 'new-Web' search based on links retrieved from the Twitter micro-blogging service. The Twitter service allows individuals, organisations and governments to rapidly disseminate very short messages to a wide variety of interested parties. When a Twitter message contains a URL, we use the Twitter message as a description of the URL's target. As Twitter is frequently used for discussion of current events, these messages offer useful, up-to-date annotations and instantaneous popularity readings for a small, but timely, portion of the Web.
    Our working system is simple and fast and we believe may offer a significant advantage in revealing new information on the Web that would otherwise be hidden from searchers. Beyond the basic system, we anticipate the Twitter messages may add supplementary terms for a URL, or add weight to existing terms, and that the reputation or authority of each message sender may serve to weight both annotations and query-independent popularity.
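
A minimal sketch of the annotation step, with assumed data shapes (the demo system's handling of tweets and URLs is richer):

import re

URL_RE = re.compile(r"https?://\S+")

def annotations_from_tweets(tweets):
    # tweets: iterable of (author, text); returns url -> list of annotation texts.
    # The author is retained because the summary suggests weighting annotations
    # by sender reputation.
    annotations = {}
    for author, text in tweets:
        for url in URL_RE.findall(text):
            annotations.setdefault(url, []).append(text)
    return annotations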

[10] Similarity-aware indexing for real-time entity resolution Poster session 3: IR track / Christen, Peter / Gayler, Ross / Hawking, David Proceedings of the 2009 ACM Conference on Information and Knowledge Management 2009-11-02 p.1565-1568
ACM Digital Library Link
Summary: Entity resolution, also known as data matching or record linkage, is the task of identifying and matching records from several databases that refer to the same entities. Traditionally, entity resolution has been applied in batch-mode and on static databases. However, many organisations are increasingly faced with the challenge of having large databases containing entities that need to be matched in real-time with a stream of query records also containing entities, such that the best matching records are retrieved. Example applications include online law enforcement and national security databases, public health surveillance and emergency response systems, financial verification systems, online retail stores, eGovernment services, and digital libraries.
    A novel inverted index based approach for real-time entity resolution is presented in this paper. At build time, similarities between attribute values are computed and stored to support the fast matching of records at query time. The presented approach differs from other approaches to approximate query matching in that it allows any similarity comparison function, and any 'blocking' (encoding) function, both possibly domain specific, to be incorporated.
    Experimental results on a real-world database indicate that the total size of all data structures of this novel index approach grows sub-linearly with the size of the database, and that it allows matching of query records in sub-second time, more than two orders of magnitude faster than a traditional entity resolution index approach. The interested reader is referred to the longer version of this paper [5].
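
An illustrative sketch of the build-time step, with the blocking (encoding) and similarity functions supplied by the caller as the summary describes (the data shapes here are assumptions):

from itertools import combinations

def build_similarity_index(values, block_key, similarity):
    # Group attribute values into blocks, then pre-compute and store pairwise
    # similarities within each block for fast lookup at query time.
    blocks = {}
    for v in values:
        blocks.setdefault(block_key(v), set()).add(v)
    sim_index = {}
    for block in blocks.values():
        for a, b in combinations(sorted(block), 2):
            s = similarity(a, b)
            if s > 0:
                sim_index.setdefault(a, {})[b] = s
                sim_index.setdefault(b, {})[a] = s
    return sim_index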

[11] Quality-Oriented Search for Depression Portals Short Papers / Tang, Thanh / Hawking, David / Sankaranarayana, Ramesh / Griffiths, Kathleen M. / Craswell, Nick Proceedings of ECIR'09, the 2009 European Conference on Information Retrieval 2009-04-06 p.637-644
Keywords: Health portal search; Quality filtering of search results
Link to Digital Content at Springer
Summary: The problem of low-quality information on the Web is nowhere more important than in the domain of health, where unsound information and misleading advice can have serious consequences. The quality of health web sites can be rated by subject experts against evidence-based guidelines. We previously developed an automated quality rating technique (AQA) for depression websites and showed that it correlated 0.85 with such expert ratings.
    In this paper, we use AQA to filter or rerank Google results returned in response to queries relating to depression. We compare this to an unrestricted quality-oriented (AQA based) focused crawl starting from an Open Directory category and a conventional crawl with manually constructed seedlist and inclusion rules. The results show that post-processed Google outperforms other forms of search engine restricted to the domain of depressive illness on both relevance and quality.
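
One simple form of the filtering/reranking step, sketched with a hypothetical AQA-style scoring interface (how scores were actually combined in the paper may differ):

def rerank_by_quality(results, quality_score, min_quality=None):
    # results: ranked list of URLs from the underlying engine;
    # quality_score: URL -> automated quality rating (hypothetical interface).
    score = quality_score if callable(quality_score) else (lambda u: quality_score.get(u, 0.0))
    kept = [r for r in results if min_quality is None or score(r) >= min_quality]
    return sorted(kept, key=score, reverse=True)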

[12] Experiences evaluating personal metasearch Evaluation & relevance II / Thomas, Paul / Hawking, David Proceedings of the 2008 Symposium on Information Interaction in Context 2008-10-14 p.136-138
ACM Digital Library Link
Summary: Many current evaluation techniques for information retrieval, such as test collections and simulations, are difficult to apply in situations where queries and preferred results are context-dependent. This is particularly true in personal metasearch applications, which provide a person with unified search access to all their usual online sources. A recently-proposed technique, based on presenting two or more search results sets in a single comparison interface, offers an alternative.
    We have embedded this technique in a working personal metasearch tool which we have distributed to volunteers. Initial experiments with server selection suggest that the technique is accepted by users, can operate over diverse and unarticulated contexts, and that the data it provides can provide a useful comparison to that from test collections. Further experimentation with the technique is continuing.

[13] Fast generation of result snippets in web search Summaries / Turpin, Andrew / Tsegay, Yohannes / Hawking, David / Williams, Hugh E. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007-07-23 p.127-134
ACM Digital Library Link
Summary: The presentation of query biased document snippets as part of results pages presented by search engines has become an expectation of search engine users. In this paper we explore the algorithms and data structures required as part of a search engine to allow efficient generation of query biased snippets. We begin by proposing and analysing a document compression method that reduces snippet generation time by 58% over a baseline using the zlib compression library. These experiments reveal that finding documents on secondary storage dominates the total cost of generating snippets, and so caching documents in RAM is essential for a fast snippet generation process. Using simulation, we examine snippet generation performance for different size RAM caches. Finally we propose and analyse document reordering and compaction, revealing a scheme that increases the number of document cache hits with only a marginal effect on snippet quality. This scheme effectively doubles the number of documents that can fit in a fixed size cache.
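
For orientation only, a minimal query-biased sentence-selection sketch; the paper's contribution is the compression, caching, and reordering machinery around this step rather than the scoring itself:

import re

def snippet(document, query, max_sentences=2):
    # Score each sentence by the number of distinct query terms it contains.
    q_terms = set(query.lower().split())
    sentences = re.split(r"(?<=[.!?])\s+", document)
    best = sorted(sentences,
                  key=lambda s: len(q_terms & set(s.lower().split())),
                  reverse=True)
    return " ... ".join(best[:max_sentences])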

[14] Evaluating sampling methods for uncooperative collections Collection representation in distributed IR / Thomas, Paul / Hawking, David Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007-07-23 p.503-510
ACM Digital Library Link
Summary: Many server selection methods suitable for distributed information retrieval applications rely, in the absence of cooperation, on the availability of unbiased samples of documents from the constituent collections. We describe a number of sampling methods which depend only on the normal query-response mechanism of the applicable search facilities. We evaluate these methods on a number of collections typical of a personal metasearch application. Results demonstrate that biases exist for all methods, particularly toward longer documents, and that in some cases these biases can be reduced but not eliminated by choice of parameters.
    We also introduce a new sampling technique, "multiple queries", which produces samples of similar quality to the best current techniques but with significantly reduced cost.
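
A generic query-based sampling sketch, assuming (as the summary states) that only the normal query-response interface is available; the paper's "multiple queries" technique differs in how queries are chosen and combined:

import random

def sample_collection(search, seed_terms, sample_size=300, max_queries=100, k=10):
    # search: query -> list of documents (the only interface assumed available).
    sample, seen = [], set()
    for _ in range(max_queries):
        if len(sample) >= sample_size:
            break
        for doc in search(random.choice(seed_terms))[:k]:
            if doc not in seen:
                seen.add(doc)
                sample.append(doc)
    return sample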

[15] Workload sampling for enterprise search evaluation Posters / Rowlands, Tom / Hawking, David / Sankaranarayana, Ramesh Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2007-07-23 p.887-888
ACM Digital Library Link
Summary: In real world use of test collection methods, it is essential that the query test set be representative of the work load expected in the actual application. Using a random sample of queries from a media company's query log as a 'gold standard' test set we demonstrate that biases in sitemap-derived and top n query sets can lead to significant perturbations in engine rankings and big differences in estimated performance levels.

[16] Enterprise Search -- The New Frontier? Progress in Information Retrieval / Hawking, David Proceedings of ECIR'06, the 2006 European Conference on Information Retrieval 2006-04-10 p.12
Link to Digital Content at Springer
Summary: The advent of the current generation of Web search engines around 1998 challenged the relevance of academic information retrieval research -- established evaluation methodologies didn't scale, nor did they reflect the diverse purposes to which search engines are now put. Academic ranking algorithms of the time almost completely ignored the features which underpin modern web search: query-independent evidence and evidence external to the document. Unlike their commercial counterparts, academic researchers have for years been unable to access Web scale collections and their corresponding link graphs and search logs.

[17] Server selection methods in hybrid portal search Distributed / Hawking, David / Thomas, Paul Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2005-08-15 p.75-82
ACM Digital Library Link
Summary: The TREC .GOV collection makes a valuable web testbed for distributed information retrieval methods because it is naturally partitioned and includes 725 web-oriented queries with judged answers. It can usefully model aspects of government and large corporate portals. Analysis of the .gov data shows that a purely distributed approach would not be feasible for providing search on a .gov portal because of the large number (17,000+) of web sites and the high proportion that do not provide a search interface. An alternative hybrid approach, combining both distributed and centralized techniques, is proposed and server selection methods are evaluated within this framework using web-oriented evaluation methodology. A number of well-known algorithms are compared against representatives (highest anchor ranked page (HARP) and anchor weighted sum (AWSUM)) of a family of new selection methods which use link anchortext extracted from an auxiliary crawl to provide descriptions of sites which are not themselves crawled. Of the previously published methods, ReDDE substantially outperformed three variants of CORI and also outperformed a method based on Kullback-Leibler Divergence (extended) except on topic distillation. HARP and AWSUM performed best overall but were outperformed on the topic distillation task by extended KL Divergence.

[18] Toward better weighting of anchors Posters / Hawking, David / Upstill, Trystan / Craswell, Nick Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2004-07-25 p.512-513
ACM Digital Library Link
Summary: Okapi BM25 scoring of anchor text surrogate documents has been shown to facilitate effective ranking in navigational search tasks over web data. We hypothesize that even better ranking can be achieved in certain important cases, particularly when anchor scores must be fused with content scores, by avoiding length normalisation and by reducing the attenuation of scores associated with high tf. Preliminary results are presented.
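
For context, the standard Okapi BM25 term weight (standard form, not quoted from the poster) is

\[
w(t,d) = \mathrm{idf}(t)\cdot\frac{tf_{t,d}\,(k_1+1)}{tf_{t,d} + k_1\!\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)},
\]

so the modifications suggested above correspond roughly to setting b = 0 (no length normalisation) and raising k_1 (weaker attenuation of high tf); this mapping onto BM25 parameters is an assumption, not stated in the summary.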

[19] Query-independent evidence in home page finding / Upstill, Trystan / Craswell, Nick / Hawking, David ACM Transactions on Information Systems 2003 v.21 n.3 p.286-313
Keywords: Web information retrieval, citation and link analysis, connectivity
ACM Digital Library Link
Summary: Hyperlink recommendation evidence, that is, evidence based on the structure of a web's link graph, is widely exploited by commercial Web search systems. However there is little published work to support its popularity. Another form of query-independent evidence, URL-type, has been shown to be beneficial on a home page finding task. We compared the usefulness of these types of evidence on the home page finding task, combined with both content and anchor text baselines. Our experiments made use of five query sets spanning three corpora -- one enterprise crawl, and the WT10g and VLC2 Web test collections. We found that, in optimal conditions, all of the query-independent methods studied (in-degree, URL-type, and two variants of PageRank) offered a better than random improvement on a content-only baseline. However, only URL-type offered a better than random improvement on an anchor text baseline. In realistic settings, for either baseline, only URL-type offered consistent gains. In combination with URL-type the anchor text baseline was more useful for finding popular home pages, but URL-type with content was more useful for finding randomly selected home pages. We conclude that a general home page finding system should combine evidence from document content, anchor text, and URL-type classification.
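
A sketch of the URL-type feature in its usual root/subroot/path/file form (the paper's exact classification rules may differ):

from urllib.parse import urlparse

def url_type(url):
    # root: site entry page; subroot: one directory deep; path: deeper directory;
    # file: everything else.
    path = urlparse(url).path
    if path in ("", "/") or path.lower().startswith("/index."):
        return "root"
    segments = [s for s in path.split("/") if s]
    if path.endswith("/"):
        return "subroot" if len(segments) == 1 else "path"
    return "file"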

[20] Effective site finding using link anchor information / Craswell, Nick / Hawking, David / Robertson, Stephen Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2001-09-09 p.250-257
ACM Digital Library Link
Summary: Link-based ranking methods have been described in the literature and applied in commercial Web search engines. However, according to recent TREC experiments, they are no better than traditional content-based methods. We conduct a different type of experiment, in which the task is to find the main entry point of a specific Web site. In our experiments, ranking based on link anchor text is twice as effective as ranking based on document content, even though both methods used the same BM25 formula. We obtained these results using two sets of 100 queries on an 18.5 million document set and another set of 100 on a 0.4 million document set. This site finding effectiveness begins to explain why many search engines have adopted link methods. It also opens a rich new area for effectiveness improvement, where traditional methods fail.
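
As an illustration of the anchor-text method (data shapes assumed), a sketch that builds one surrogate document per target page from the text of its incoming links; the surrogates are then ranked with the same BM25 formula used for content:

def build_anchor_surrogates(links):
    # links: iterable of (source_url, target_url, anchor_text).
    surrogates = {}
    for _source, target, anchor in links:
        surrogates.setdefault(target, []).append(anchor)
    return {target: " ".join(texts) for target, texts in surrogates.items()}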

[21] Server Selection on the World Wide Web Full Papers / Craswell, Nick / Bailey, Peter / Hawking, David DL'00: Proceedings of the 5th ACM International Conference on Digital Libraries 2000-06-02 p.37-46
Keywords: Information Systems - Information Storage and Retrieval - Digital Libraries (H.3.7); Information Systems - Information Storage and Retrieval - Online Information Services (H.3.5): Web-based services; Information Systems - Information Interfaces and Presentation - Group and Organization Interfaces (H.5.3): Web-based interaction; Information Systems - Database Management - Systems (H.2.4): Distributed databases; Information Systems - Information Storage and Retrieval - Systems and Software (H.3.4): Distributed systems; Design, Documentation, Experimentation, Human Factors, Measurement, Management, Performance, Theory; World Wide Web, distributed information retrieval, effectiveness evaluation, server selection
Broken Link to ACM Digital Library

[22] Methods for information server selection / Hawking, David / Thistlewaite, Paul ACM Transactions on Information Systems 1999 v.17 n.1 p.40-76
Keywords: Lightweight Probe queries, information servers, network servers, server ranking, server selection, text retrieval
ACM Digital Library Link
Summary: The problem of using a broker to select a subset of available information servers in order to achieve a good trade-off between document retrieval effectiveness and cost is addressed. Server selection methods which are capable of operating in the absence of global information, and where servers have no knowledge of brokers, are investigated. A novel method using Lightweight Probe queries (LWP method) is compared with several methods based on data from past query processing, while Random and Optimal server rankings serve as controls. Methods are evaluated, using TREC data and relevance judgments, by computing ratios, both empirical and ideal, of recall and early precision for the subset versus the complete set of available servers. Estimates are also made of the best-possible performance of each of the methods. LWP and Topic Similarity methods achieved best results, each being capable of retrieving about 60% of the relevant documents for only one-third of the cost of querying all servers. Subject to the applicable cost model, the LWP method is likely to be preferred because it is suited to dynamic environments. The good results obtained with a simple automatic LWP implementation were replicated using different data and a larger set of query topics.
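
A hedged sketch of LWP-style selection with a hypothetical server interface; the probe construction and scoring in the paper are considerably more elaborate:

def rank_servers_by_probe(servers, probe_query, probe_k=5):
    # servers: name -> search function(query, k) returning (doc, score) pairs.
    # Rank servers by how strongly they respond to a single cheap probe query.
    scores = {name: sum(score for _doc, score in search(probe_query, probe_k))
              for name, search in servers.items()}
    return sorted(scores, key=scores.get, reverse=True)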

[23] Scalable Text Retrieval for Large Digital Libraries Information Retrieval I / Hawking, David ECDL'97: Proceedings of the European Conference on Digital Libraries 1997-09-01 p.127-145
Link to Digital Content at Springer
Summary: It is argued that digital libraries of the future will contain terabyte-scale collections of digital text and that full-text searching techniques will be required to operate over collections of this magnitude. Algorithms expected to be capable of scaling to these data sizes using clusters of modern workstations are described. First, basic indexing and retrieval algorithms operating at performance levels comparable to other leading systems over gigabytes of text on a single workstation are presented. Next, simple mechanisms for extending query processing capacity to much greater collection sizes are presented, to tens of gigabytes for single workstations and to terabytes for clusters of such workstations. Query-processing efficiency on a single workstation is shown to deteriorate dramatically when data size is increased above a certain multiple of physical memory size. By contrast, the number of clustered workstations necessary to maintain a constant level of service increases linearly with increasing data size. Experiments using clusters of up to 16 workstations are reported. A non-replicated 20 gigabyte collection was indexed in just over 5 hours using a ten workstation cluster and scalability results are presented for query processing over replicated collections of up to 102 gigabytes.
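
A minimal sketch of the document-partitioned cluster arrangement described above, with hypothetical node interfaces and assuming scores are comparable across partitions:

def cluster_search(nodes, query, k=10):
    # nodes: list of per-workstation search functions(query, k) -> [(doc_id, score)].
    # Each node searches its own partition; the broker merges the top-k lists.
    hits = [hit for node in nodes for hit in node(query, k)]
    return sorted(hits, key=lambda h: h[1], reverse=True)[:k]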