[1]
On Term Selection Techniques for Patent Prior Art Search
Short Papers
/
Far, Mona Golestan
/
Sanner, Scott
/
Bouadjenek, Mohamed Reda
/
Ferraro, Gabriela
/
Hawking, David
Proceedings of the 2015 Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2015-08-09
p.803-806
© Copyright 2015 ACM
Summary: In this paper, we investigate the influence of term selection on retrieval
performance on the CLEF-IP prior art test collection, using the Description
section of the patent query with Language Model (LM) and BM25 scoring
functions. We find that an oracular relevance feedback system that extracts
terms from the judged relevant documents far outperforms the baseline and
performs twice as well on MAP as the best competitor in CLEF-IP 2010. We find a
very clear term selection value threshold for use when choosing terms. We also
find that most of the useful feedback terms are already present in the
original query and hypothesize that the baseline system could be substantially
improved by removing negative query terms. We tried four simple automated
approaches to identifying negative terms for query reduction, but none of them
notably improved on the baseline performance. However, we show
that a simple, minimal interactive relevance feedback approach where terms are
selected from only the first retrieved relevant document outperforms the best
result from CLEF-IP 2010, suggesting the promise of interactive methods for term
selection in patent prior art search.
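As a rough illustration of the query-reduction idea described in this summary, the sketch below scores each query term by the change in average precision when it is removed and drops terms whose removal helps. The `search` and relevance-judgment inputs, and the greedy loop itself, are illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch of greedy removal of "negative" query terms.
# search(terms) -> ranked list of doc ids; relevant -> set of judged doc ids.
# Both are assumed to be supplied by the caller.

def average_precision(ranking, relevant):
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / max(len(relevant), 1)

def reduce_query(terms, relevant, search):
    """Drop any term whose removal improves average precision."""
    kept = list(terms)
    baseline = average_precision(search(kept), relevant)
    for term in list(terms):
        trial = [t for t in kept if t != term]
        score = average_precision(search(trial), relevant)
        if score > baseline:          # the term hurts retrieval: remove it
            kept, baseline = trial, score
    return kept
```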
[2]
If SIGIR had an Academic Track, What Would Be In It?
Industry Track Invited Talks
/
Hawking, David
Proceedings of the 2015 Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2015-08-09
p.1077
© Copyright 2015 ACM
Summary: It used to be the case that very little industry research was presented at
SIGIR. Now the balance has radically changed -- many accepted papers have
industry authors and many rely on industry data sets -- to the extent that a
leading academic member of the SIGIR community has light-heartedly proposed the
creation of an Academic Track.
Behind the levity lies the important question of how a researcher can make a
meaningful contribution to the field, in the absence of petabyte-scale sets of
documents and massive user-interaction logs. Theoretical contributions can
revolutionize thinking, but have greatest impact when applicable in practice,
and when empirically validated.
In my years at Funnelback and more recently at Microsoft I have been very
aware of high-impact but not-well-solved IR problems involving relatively tiny
datasets. Many of them are characterized by sparsity of user interaction data
and are hence not well-suited to simple machine learning approaches or to large
scale A/B testing. My talk will illustrate and attempt to characterize these
problems and to suggest fruitful areas for academic research.
If time permits, I will mention some areas in which academic research has
contributed to current large-scale industry practice.
[3]
An enterprise search paradigm based on extended query auto-completion: do we
still need search and navigation?
/
Hawking, David
/
Griffiths, Kathy
Proceedings of ADCS'13, Australasian Document Computing Symposium
2013-12-05
p.18-25
© Copyright 2013 ACM
Summary: Enterprise query auto-completion (QAC) can allow website or intranet
visitors to satisfy a need more efficiently than traditional searching and
browsing. The limited scope of an enterprise makes it possible to satisfy a
high proportion of information needs through completion. Further, the
availability of structured sources of completions such as product catalogues
compensates for sparsity of log data. Extended forms (X-QAC) can give access to
information that is inaccessible via a conventional crawled index.
We show that it can be guaranteed that for every suggestion there is a
prefix which causes it to appear in the top k suggestions. Using university
query logs and structured lists, we quantify the significant keystroke savings
attributable to this guarantee (worst case). Such savings may be of particular
value for mobile devices. A user experiment showed that a staff lookup task
took an average of 61% longer with a conventional search interface than with an
X-QAC system.
Using wine catalogue data we demonstrate a further extension which allows a
user to home in on desired items in faceted-navigation style. We also note that
advertisements can be triggered from QAC.
Given the advantages and power of X-QAC systems, we envisage that websites
and intranets of the [near] future will provide less navigation and rely less
on conventional search.
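The keystroke-saving guarantee mentioned above can be illustrated with a small sketch: for each suggestion, find the shortest prefix under which it appears in the top-k completions. Ranking completions by a popularity score is an assumption made here for illustration, not the paper's ranking function.

```python
# Hypothetical sketch of the worst-case keystroke saving in QAC.

def top_k(suggestions, popularity, prefix, k):
    matches = [s for s in suggestions if s.lower().startswith(prefix)]
    matches.sort(key=lambda s: -popularity.get(s, 0))
    return matches[:k]

def keystrokes_needed(target, suggestions, popularity, k=10):
    """Length of the shortest prefix that puts `target` in the top-k list."""
    for n in range(1, len(target) + 1):
        if target in top_k(suggestions, popularity, target[:n].lower(), k):
            return n
    return len(target)                      # worst case: no saving at all

names = ["Kathy Griffiths", "David Hawking", "Kathleen M. Griffiths"]
pop = {n: 1 for n in names}
saved = len("David Hawking") - keystrokes_needed("David Hawking", names, pop)
print(saved)                                # keystrokes saved for this lookup
```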
[4]
Merging algorithms for enterprise search
/
Li, PengFei (Vincent)
/
Thomas, Paul
/
Hawking, David
Proceedings of ADCS'13, Australasian Document Computing Symposium
2013-12-05
p.42-49
© Copyright 2013 ACM
Summary: Effective enterprise search must draw on a number of sources -- for example
web pages, telephone directories, and databases. Doing this means we need a way
to make a single sorted list from results of very different types.
Many merging algorithms have been proposed, but none have been applied to
this realistic application. We report the results of an experiment which
simulates heterogeneous enterprise retrieval, in a university setting, and uses
multi-grade expert judgements to compare merging algorithms. Merging algorithms
considered include several variants of round-robin, several methods proposed by
Rasolofo et al. in the Current News Metasearcher, and four novel variations
including a learned multi-weight method.
We find that the round-robin methods and one of the Rasolofo methods perform
significantly worse than others. The GDS_TS method of Rasolofo achieves the
highest average NDCG@10 score, but the differences between it and the other
GDS methods, local reranking, and the multi-weight method were not significant.
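For reference, the simplest of the merging baselines named above is round-robin interleaving; a minimal sketch (not the experimental code from the paper) follows.

```python
# Round-robin merging: take one result from each source in turn,
# skipping duplicates, until the merged list is long enough.

from itertools import zip_longest

def round_robin(result_lists, depth=10):
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):       # one result per source
        for doc in tier:
            if doc is not None and doc not in seen:
                merged.append(doc)
                seen.add(doc)
    return merged[:depth]

web, phone, db = ["w1", "w2", "w3"], ["p1", "p2"], ["d1"]
print(round_robin([web, phone, db]))    # ['w1', 'p1', 'd1', 'w2', 'p2', 'w3']
```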
[5]
Reordering an index to speed query processing without loss of effectiveness
/
Hawking, David
/
Jones, Timothy
Proceedings of ADCS'12, Australasian Document Computing Symposium
2012-12-05
p.17-24
© Copyright 2012 ACM
Summary: Following Long and Suel, we empirically investigate the importance of
document order in search engines which rank documents using a combination of
dynamic (query-dependent) and static (query-independent) scores, and use
document-at-a-time (DAAT) processing. When inverted file postings are in
collection order, assigning document numbers in order of descending static
score supports lossless early termination while maintaining good compression.
Since static scores may not be available until all documents have been
gathered and indexed, we build a tool for reordering an existing index and show
that it operates in less than 20% of the original indexing time. We note that
this additional cost is easily recouped by savings at query processing time. We
compare best early-termination points for several different index orders on
three enterprise search collections (a whole-of-government index with two very
different query sets, and a collection from a UK university). We also present
results for the same orders for ClueWeb09-CatB. Our evaluation focuses on
finding results likely to be clicked on by users of Web or website search
engines -- Nav and Key results in the TREC 2011 Web Track judging scheme.
The orderings tested are Original, Reverse, Random, and QIE (descending
order of static score). For three enterprise search test sets we find that QIE
order can achieve close-to-maximal search effectiveness with much lower
computational cost than for other orderings. Additionally, reordering has
negligible impact on compressed index size for indexes that contain position
information. Our results for an artificial query set against the TREC ClueWeb09
Category B collection are much more equivocal and we canvass possible
explanations for future investigation.
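The QIE ordering can be sketched as a simple docid remapping: assign new document numbers in descending order of static score and rewrite the posting lists accordingly. This is a minimal illustration of the idea, not the reordering tool described in the paper.

```python
# Renumber documents by descending static (query-independent) score so that
# DAAT query processing can terminate early without missing strong documents.

def qie_remap(static_scores):
    """Map old docid -> new docid, new ids assigned by descending score."""
    order = sorted(static_scores, key=static_scores.get, reverse=True)
    return {old: new for new, old in enumerate(order)}

def reorder_postings(postings, remap):
    """Rewrite each term's posting list under the new numbering."""
    return {t: sorted(remap[d] for d in docs) for t, docs in postings.items()}

scores = {0: 0.2, 1: 0.9, 2: 0.5}
postings = {"enterprise": [0, 1], "search": [1, 2]}
print(reorder_postings(postings, qie_remap(scores)))
# {'enterprise': [0, 2], 'search': [0, 1]}
```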
[6]
Relative effect of spam and irrelevant documents on user interaction with
search engines
Poster session: information retrieval
/
Jones, Timothy
/
Hawking, David
/
Thomas, Paul
/
Sankaranarayana, Ramesh
Proceedings of the 2011 ACM Conference on Information and Knowledge
Management
2011-10-24
p.2113-2116
© Copyright 2011 ACM
Summary: Meaningful evaluation of web search must take account of spam. Here we
conduct a user experiment to investigate whether satisfaction with search
engine result pages as a whole is harmed more by spam or by irrelevant
documents. On some measures, search result pages are differentially harmed by
the insertion of spam and irrelevant documents. Additionally we find that when
users are given two documents of equal utility, the one with the lower spam
score will be preferred; a result page without any spam documents will be
preferred to one with spam; and an irrelevant document high in a result list is
surprisingly more damaging to user satisfaction than a spam document. We
conclude that web ranking and evaluation should consider both utility
(relevance) and "spamminess" of documents.
[7]
What deliberately degrading search quality tells us about discount functions
Posters presentations
/
Thomas, Paul
/
Jones, Timothy
/
Hawking, David
Proceedings of the 34th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2011-07-25
p.1107-1108
© Copyright 2011 ACM
Summary: Deliberate degradation of search results is a common tool in user
experiments. We degrade high-quality search results by inserting non-relevant
documents at different ranks. The effect of these manipulations, on a number of
commonly-used metrics, is counter-intuitive: the discount functions implicit in
P@k, MRR, NDCG, and others do not account for the true relationship between
rank and value to the user. We propose an alternative, based on visibility
data.
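For reference, the rank discounts implicit in the metrics named above can be written as follows (standard definitions; the alternative discount proposed in the paper is not reproduced here).

```latex
% Implicit discount applied to the gain of a relevant document at rank r
% (RR credits only the first relevant document).
\[
  d_{\mathrm{P@}k}(r) = \begin{cases} 1/k & r \le k \\ 0 & r > k, \end{cases}
  \qquad
  d_{\mathrm{RR}}(r) = \frac{1}{r},
  \qquad
  d_{\mathrm{DCG}}(r) = \frac{1}{\log_2(r+1)}.
\]
```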
[8]
Live web search experiments for the rest of us
WWW 2010 demos
/
Jones, Timothy
/
Hawking, David
/
Sankaranarayana, Ramesh
Proceedings of the 2010 International Conference on the World Wide Web
2010-04-26
v.1
p.1265-1268
Keywords: browser extensions, implicit measures, web search
© Copyright 2010 ACM
Summary: There are significant barriers to academic research into user Web search
preferences. Academic researchers are unable to manipulate the results shown by
a major search engine to users and would have no access to the interaction data
collected by the engine. Our initial approach to overcoming this was to ask
participants to submit queries to an experimental search engine rather than
their usual search tool. Over several different experiments we found that
initial user buy-in was high but that people quickly drifted back to their old
habits and stopped contributing data. Here, we report our investigation of
possible reasons why this occurs. An alternative approach is exemplified by the
Lemur browser toolbar, which allows local collection of user interaction data
from search engine sessions, but does not allow result pages to be modified. We
will demonstrate a new Firefox toolbar that we have developed to support
experiments in which search results may be arbitrarily manipulated. Using our
toolbar, academics can set up the experiments they want to conduct, while
collecting (subject to human experimentation guidelines) queries, clicks and
dwell times as well as optional explicit judgments.
[9]
New-web search with microblog annotations
WWW 2010 demos
/
Rowlands, Tom
/
Hawking, David
/
Sankaranarayana, Ramesh
Proceedings of the 2010 International Conference on the World Wide Web
2010-04-26
v.1
p.1293-1296
Keywords: demonstration, information retrieval, microblogging, search, twitter, web
search
© Copyright 2010 ACM
Summary: Web search engines discover indexable documents by recursively 'crawling'
from a seed URL. Their rankings take into account link popularity. While this
works well, it introduces biases towards older documents. Older documents are
more likely to be the target of links, while new documents with few, or no,
incoming links are unlikely to rank highly in search results.
We describe a novel system for 'new-Web' search based on links retrieved
from the Twitter micro-blogging service. The Twitter service allows
individuals, organisations and governments to rapidly disseminate very short
messages to a wide variety of interested parties. When a Twitter message
contains a URL, we use the Twitter message as a description of the URL's
target. As Twitter is frequently used for discussion of current events, these
messages offer useful, up-to-date annotations and instantaneous popularity
readings for a small, but timely, portion of the Web.
Our working system is simple and fast, and we believe it may offer a significant
advantage in revealing new information on the Web that would otherwise be
hidden from searchers. Beyond the basic system, we anticipate the Twitter
messages may add supplementary terms for a URL, or add weight to existing
terms, and that the reputation or authority of each message sender may serve to
weight both annotations and query-independent popularity.
[10]
Similarity-aware indexing for real-time entity resolution
Poster session 3: IR track
/
Christen, Peter
/
Gayler, Ross
/
Hawking, David
Proceedings of the 2009 ACM Conference on Information and Knowledge
Management
2009-11-02
p.1565-1568
© Copyright 2009 ACM
Summary: Entity resolution, also known as data matching or record linkage, is the
task of identifying and matching records from several databases that refer to
the same entities. Traditionally, entity resolution has been applied in
batch-mode and on static databases. However, many organisations are
increasingly faced with the challenge of having large databases containing
entities that need to be matched in real-time with a stream of query records
also containing entities, such that the best matching records are retrieved.
Example applications include online law enforcement and national security
databases, public health surveillance and emergency response systems, financial
verification systems, online retail stores, eGovernment services, and digital
libraries.
A novel inverted index based approach for real-time entity resolution is
presented in this paper. At build time, similarities between attribute values
are computed and stored to support the fast matching of records at query time.
The presented approach differs from other approaches to approximate query
matching in that it allows any similarity comparison function, and any
'blocking' (encoding) function, both possibly domain specific, to be
incorporated.
Experimental results on a real-world database indicate that the total size
of all data structures of this novel index approach grows sub-linearly with the
size of the database, and that it allows matching of query records in
sub-second time, more than two orders of magnitude faster than a traditional
entity resolution index approach. The interested reader is referred to the
longer version of this paper [5].
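A rough sketch of the build-time idea follows: within each block of attribute values, pre-compute and store pairwise similarities so that a query value can be matched in real time by lookup. First-letter blocking and a generic string-similarity ratio stand in here for "any similarity comparison function and any 'blocking' (encoding) function"; the paper's index structures are more elaborate.

```python
# Similarity-aware index sketch: similarities are computed at build time,
# not at query time, so matching a query value is a dictionary lookup.

from collections import defaultdict
from difflib import SequenceMatcher

def block(value):                        # placeholder blocking/encoding function
    return value[:1].lower()

def similarity(a, b):                    # placeholder similarity function
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def build_index(values, threshold=0.7):
    blocks = defaultdict(set)
    for v in values:
        blocks[block(v)].add(v)
    sim_index = defaultdict(dict)        # value -> {similar value: score}
    for members in blocks.values():
        members = sorted(members)
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                s = similarity(a, b)
                if s >= threshold:
                    sim_index[a][b] = sim_index[b][a] = s
    return sim_index

idx = build_index(["smith", "smyth", "smithe", "jones"])
print(idx["smith"])                      # matches available at query time
```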
[11]
Quality-Oriented Search for Depression Portals
Short Papers
/
Tang, Thanh
/
Hawking, David
/
Sankaranarayana, Ramesh
/
Griffiths, Kathleen M.
/
Craswell, Nick
Proceedings of ECIR'09, the 2009 European Conference on Information
Retrieval
2009-04-06
p.637-644
Keywords: Health portal search; Quality filtering of search results
© Copyright 2009 Springer-Verlag
Summary: The problem of low-quality information on the Web is nowhere more important
than in the domain of health, where unsound information and misleading advice
can have serious consequences. The quality of health web sites can be rated by
subject experts against evidence-based guidelines. We previously developed an
automated quality rating technique (AQA) for depression websites and showed
that it correlated 0.85 with such expert ratings.
In this paper, we use AQA to filter or rerank Google results returned in
response to queries relating to depression. We compare this to an unrestricted
quality-oriented (AQA-based) focused crawl starting from an Open Directory
category and a conventional crawl with manually constructed seedlist and
inclusion rules. The results show that post-processed Google outperforms other
forms of search engine restricted to the domain of depressive illness on both
relevance and quality.
[12]
Experiences evaluating personal metasearch
Evaluation & relevance II
/
Thomas, Paul
/
Hawking, David
Proceedings of the 2008 Symposium on Information Interaction in Context
2008-10-14
p.136-138
© Copyright 2008 ACM
Summary: Many current evaluation techniques for information retrieval, such as test
collections and simulations, are difficult to apply in situations where queries
and preferred results are context-dependent. This is particularly true in
personal metasearch applications, which provide a person with unified search
access to all their usual online sources. A recently-proposed technique, based
on presenting two or more search results sets in a single comparison interface,
offers an alternative.
We have embedded this technique in a working personal metasearch tool which
we have distributed to volunteers. Initial experiments with server selection
suggest that the technique is accepted by users, can operate over diverse and
unarticulated contexts, and that the data it provides can provide a useful
comparison to that from test collections. Further experimentation with the
technique is continuing.
[13]
Fast generation of result snippets in web search
Summaries
/
Turpin, Andrew
/
Tsegay, Yohannes
/
Hawking, David
/
Williams, Hugh E.
Proceedings of the 30th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2007-07-23
p.127-134
© Copyright 2007 ACM
Summary: The presentation of query biased document snippets as part of results pages
presented by search engines has become an expectation of search engine users.
In this paper we explore the algorithms and data structures required as part of
a search engine to allow efficient generation of query biased snippets. We
begin by proposing and analysing a document compression method that reduces
snippet generation time by 58% over a baseline using the zlib compression
library. These experiments reveal that finding documents on secondary storage
dominates the total cost of generating snippets, and so caching documents in
RAM is essential for a fast snippet generation process. Using simulation, we
examine snippet generation performance for different size RAM caches. Finally
we propose and analyse document reordering and compaction, revealing a scheme
that increases the number of document cache hits with only a marginal effect on
snippet quality. This scheme effectively doubles the number of documents that
can fit in a fixed size cache.
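A minimal sketch of the query-biased selection step over a zlib-compressed in-memory document cache follows; the sentence scoring and cache layout are illustrative assumptions, and the compression scheme analysed in the paper is not reproduced.

```python
import re
import zlib

cache = {}                                   # docid -> zlib-compressed text

def add_document(docid, text):
    cache[docid] = zlib.compress(text.encode("utf-8"))

def snippet(docid, query, max_sentences=2):
    """Pick the sentences sharing the most terms with the query."""
    text = zlib.decompress(cache[docid]).decode("utf-8")
    q = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", text)
    scored = sorted(sentences,
                    key=lambda s: len(q & set(re.findall(r"\w+", s.lower()))),
                    reverse=True)
    return " ".join(scored[:max_sentences])

add_document(1, "Snippets summarise documents. They are query biased. "
                "Users expect them on every results page.")
print(snippet(1, "query biased snippets"))
```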
[14]
Evaluating sampling methods for uncooperative collections
Collection representation in distributed IR
/
Thomas, Paul
/
Hawking, David
Proceedings of the 30th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2007-07-23
p.503-510
© Copyright 2007 ACM
Summary: Many server selection methods suitable for distributed information retrieval
applications rely, in the absence of cooperation, on the availability of
unbiased samples of documents from the constituent collections. We describe a
number of sampling methods which depend only on the normal query-response
mechanism of the applicable search facilities. We evaluate these methods on a
number of collections typical of a personal metasearch application. Results
demonstrate that biases exist for all methods, particularly toward longer
documents, and that in some cases these biases can be reduced but not
eliminated by choice of parameters.
We also introduce a new sampling technique, "multiple queries", which
produces samples of similar quality to the best current techniques but with
significantly reduced cost.
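One simple reading of sampling through the normal query-response mechanism is sketched below: issue probe queries, keep the top results, and grow the probe vocabulary from the sampled text. This is broadly the classic query-based sampling approach; it is not claimed to match any of the evaluated variants exactly, nor the 'multiple queries' technique, and the parameters are illustrative only.

```python
import random

def query_based_sample(search, seed_terms, sample_size=300,
                       per_query=4, max_queries=1000):
    """search(term) -> list of (docid, text); returns docid -> text."""
    sample, vocabulary = {}, list(seed_terms)
    for _ in range(max_queries):
        if len(sample) >= sample_size or not vocabulary:
            break
        term = random.choice(vocabulary)
        for docid, text in search(term)[:per_query]:
            if docid not in sample:
                sample[docid] = text
                vocabulary.extend(text.lower().split())   # grow probe vocabulary
    return sample
```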
[15]
Workload sampling for enterprise search evaluation
Posters
/
Rowlands, Tom
/
Hawking, David
/
Sankaranarayana, Ramesh
Proceedings of the 30th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2007-07-23
p.887-888
© Copyright 2007 ACM
Summary: In real world use of test collection methods, it is essential that the query
test set be representative of the work load expected in the actual application.
Using a random sample of queries from a media company's query log as a 'gold
standard' test set we demonstrate that biases in sitemap-derived and top n
query sets can lead to significant perturbations in engine rankings and big
differences in estimated performance levels.
[16]
Enterprise Search -- The New Frontier?
Progress in Information Retrieval
/
Hawking, David
Proceedings of ECIR'06, the 2006 European Conference on Information
Retrieval
2006-04-10
p.12
© Copyright 2006 Springer-Verlag
Summary: The advent of the current generation of Web search engines around 1998
challenged the relevance of academic information retrieval research --
established evaluation methodologies didn't scale, nor did they reflect the
diverse purposes to which search engines are now put. Academic ranking
algorithms of the time almost completely ignored the features which underpin
modern web search: query-independent evidence and evidence external to the
document. Unlike their commercial counterparts, academic researchers have for
years been unable to access Web scale collections and their corresponding link
graphs and search logs.
[17]
Server selection methods in hybrid portal search
Distributed
/
Hawking, David
/
Thomas, Paul
Proceedings of the 28th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2005-08-15
p.75-82
© Copyright 2005 ACM
Summary: The TREC .GOV collection makes a valuable web testbed for distributed
information retrieval methods because it is naturally partitioned and includes
725 web-oriented queries with judged answers. It can usefully model aspects of
government and large corporate portals. Analysis of the .gov data shows that a
purely distributed approach would not be feasible for providing search on a .gov
portal because of the large number (17,000+) of web sites and the high
proportion that do not provide a search interface. An alternative hybrid
approach, combining both distributed and centralized techniques, is proposed
and server selection methods are evaluated within this framework using
web-oriented evaluation methodology. A number of well-known algorithms are
compared against representatives (highest anchor ranked page (HARP) and anchor
weighted sum (AWSUM)) of a family of new selection methods which use link
anchortext extracted from an auxiliary crawl to provide descriptions of sites
which are not themselves crawled. Of the previously published methods, ReDDE
substantially outperformed three variants of CORI and also outperformed a
method based on Kullback-Leibler Divergence (extended) except on topic
distillation. HARP and AWSUM performed best overall but were outperformed on
the topic distillation task by extended KL Divergence.
[18]
Toward better weighting of anchors
Posters
/
Hawking, David
/
Upstill, Trystan
/
Craswell, Nick
Proceedings of the 27th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2004-07-25
p.512-513
© Copyright 2004 ACM
Summary: Okapi BM25 scoring of anchor text surrogate documents has been shown to
facilitate effective ranking in navigational search tasks over web data. We
hypothesize that even better ranking can be achieved in certain important
cases, particularly when anchor scores must be fused with content scores, by
avoiding length normalisation and by reducing the attenuation of scores
associated with high tf. Preliminary results are presented.
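For orientation, the standard BM25 term weight is reproduced below to locate the two components discussed above: the b parameter controls document-length normalisation, and k1 controls how quickly the contribution of high tf saturates. The anchor-text variants explored in the paper are not shown.

```latex
\[
  w(t, d) \;=\; \mathrm{idf}(t)\cdot
  \frac{tf_{t,d}\,(k_1 + 1)}
       {tf_{t,d} + k_1\bigl(1 - b + b\,\tfrac{|d|}{\mathrm{avgdl}}\bigr)}
\]
% Setting b = 0 removes length normalisation; increasing k_1 weakens the
% attenuation (saturation) of the score contribution from high tf.
```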
[19]
Query-independent evidence in home page finding
/
Upstill, Trystan
/
Craswell, Nick
/
Hawking, David
ACM Transactions on Information Systems
2003
v.21
n.3
p.286-313
Keywords: Web information retrieval, citation and link analysis, connectivity
© Copyright 2003 ACM
Summary: Hyperlink recommendation evidence, that is, evidence based on the structure
of a web's link graph, is widely exploited by commercial Web search systems.
However there is little published work to support its popularity. Another form
of query-independent evidence, URL-type, has been shown to be beneficial on a
home page finding task. We compared the usefulness of these types of evidence
on the home page finding task, combined with both content and anchor text
baselines. Our experiments made use of five query sets spanning three corpora
-- one enterprise crawl, and the WT10g and VLC2 Web test collections. We found
that, in optimal conditions, all of the query-independent methods studied
(in-degree, URL-type, and two variants of PageRank) offered a better than
random improvement on a content-only baseline. However, only URL-type offered a
better than random improvement on an anchor text baseline. In realistic
settings, for either baseline, only URL-type offered consistent gains. In
combination with URL-type the anchor text baseline was more useful for finding
popular home pages, but URL-type with content was more useful for finding
randomly selected home pages. We conclude that a general home page finding
system should combine evidence from document content, anchor text, and URL-type
classification.
[20]
Effective site finding using link anchor information
/
Craswell, Nick
/
Hawking, David
/
Robertson, Stephen
Proceedings of the 24th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2001-09-09
p.250-257
Summary: Link-based ranking methods have been described in the literature and applied
in commercial Web search engines. However, according to recent TREC
experiments, they are no better than traditional content-based methods. We
conduct a different type of experiment, in which the task is to find the main
entry point of a specific Web site. In our experiments, ranking based on link
anchor text is twice as effective as ranking based on document content, even
though both methods used the same BM25 formula. We obtained these results using
two sets of 100 queries on an 18.5 million document set and another set of 100
on a 0.4 million document set. This site finding effectiveness begins to
explain why many search engines have adopted link methods. It also opens a rich
new area for effectiveness improvement, where traditional methods fail.
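The anchor-text ranking described above rests on surrogate documents; a minimal sketch of how such surrogates can be assembled is given below. The link-tuple format is an assumption of this sketch, and the paper's crawling and indexing details are not reproduced.

```python
from collections import defaultdict

def build_anchor_surrogates(links):
    """links: iterable of (source_url, target_url, anchor_text) tuples.
    Returns target_url -> surrogate document made of incoming anchor text."""
    surrogates = defaultdict(list)
    for _source, target, anchor in links:
        surrogates[target].append(anchor)
    return {url: " ".join(texts) for url, texts in surrogates.items()}

links = [
    ("http://a.example/", "http://site.example/", "ACME home page"),
    ("http://b.example/", "http://site.example/", "the ACME corporation"),
]
print(build_anchor_surrogates(links)["http://site.example/"])
# 'ACME home page the ACME corporation' -- scored with BM25 like any document
```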
[21]
Server Selection on the World Wide Web
Full Papers
/
Craswell, Nick
/
Bailey, Peter
/
Hawking, David
DL'00: Proceedings of the 5th ACM International Conference on Digital
Libraries
2000-06-02
p.37-46
Keywords: Information Systems - Information Storage and Retrieval - Digital
Libraries (H.3.7); Information Systems - Information Storage and Retrieval -
Online Information Services (H.3.5): Web-based services; Information Systems -
Information Interfaces and Presentation - Group and Organization Interfaces
(H.5.3): Web-based interaction; Information Systems - Database Management -
Systems (H.2.4): Distributed databases; Information Systems - Information
Storage and Retrieval - Systems and Software (H.3.4): Distributed systems;
Design, Documentation, Experimentation, Human Factors, Measurement, Management,
Performance, Theory; World Wide Web, distributed information retrieval,
effectiveness evaluation, server selection
© Copyright 2000 ACM
[22]
Methods for information server selection
/
Hawking, David
/
Thistlewaite, Paul
ACM Transactions on Information Systems
1999
v.17
n.1
p.40-76
Keywords: Lightweight Probe queries, information servers, network servers, server
ranking, server selection, text retrieval
© Copyright 1999 ACM
Summary: The problem of using a broker to select a subset of available information
servers in order to achieve a good trade-off between document retrieval
effectiveness and cost is addressed. Server selection methods which are capable
of operating in the absence of global information, and where servers have no
knowledge of brokers, are investigated. A novel method using Lightweight Probe
queries (LWP method) is compared with several methods based on data from past
query processing, while Random and Optimal server rankings serve as controls.
Methods are evaluated, using TREC data and relevance judgments, by computing
ratios, both empirical and ideal, of recall and early precision for the subset
versus the complete set of available servers. Estimates are also made of the
best-possible performance of each of the methods. LWP and Topic Similarity
methods achieved best results, each being capable of retrieving about 60% of
the relevant documents for only one-third of the cost of querying all servers.
Subject to the applicable cost model, the LWP method is likely to be preferred
because it is suited to dynamic environments. The good results obtained with a
simple automatic LWP implementation were replicated using different data and a
larger set of query topics.
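The subset-versus-all evaluation ratio described in this summary can be sketched as follows; how documents are pooled across servers is an assumption of the sketch, not a detail taken from the paper.

```python
def recall_ratio(results_by_server, selected, relevant):
    """Relevant documents reachable via the selected servers, as a fraction
    of those reachable when every available server is queried."""
    def reachable(servers):
        docs = set()
        for s in servers:
            docs.update(results_by_server.get(s, []))
        return len(docs & relevant)
    full = reachable(results_by_server)          # dict keys = all servers
    return reachable(selected) / full if full else 0.0

results = {"A": {"d1", "d2"}, "B": {"d3"}, "C": {"d4"}}
print(recall_ratio(results, ["A", "B"], relevant={"d1", "d3", "d4"}))  # ~0.67
```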
[23]
Scalable Text Retrieval for Large Digital Libraries
Information Retrieval I
/
Hawking, David
ECDL'97: Proceedings of the European Conference on Digital Libraries
1997-09-01
p.127-145
© Copyright 1997 Springer-Verlag
Summary: It is argued that digital libraries of the future will contain
terabyte-scale collections of digital text and that full-text searching
techniques will be required to operate over collections of this magnitude.
Algorithms expected to be capable of scaling to these data sizes using clusters
of modern workstations are described. First, basic indexing and retrieval
algorithms operating at performance levels comparable to other leading systems
over gigabytes of text on a single workstation are presented. Next, simple
mechanisms for extending query processing capacity to much greater collection
sizes are presented, to tens of gigabytes for single workstations and to
terabytes for clusters of such workstations. Query-processing efficiency on a
single workstation is shown to deteriorate dramatically when data size is
increased above a certain multiple of physical memory size. By contrast, the
number of clustered workstations necessary to maintain a constant level of
service increases linearly with increasing data size. Experiments using
clusters of up to 16 workstations are reported. A non-replicated 20 gigabyte
collection was indexed in just over 5 hours using a ten workstation cluster and
scalability results are presented for query processing over replicated
collections of up to 102 gigabytes.