| PicASHOW: pictorial authority search by hyperlinks on the web | | BIBAK | Full-Text | 1-24 | |
| Ronny Lempel; Aya Soffer | |||
| We describe PicASHOW, a fully automated WWW image retrieval system that is
based on several link-structure analyzing algorithms. Our basic premise is that
a page p displays (or links to) an image when the author of p considers the
image to be of value to the viewers of the page. We thus extend some well-known
link-based WWW page retrieval schemes to the context of image retrieval.
PicASHOW's analysis of the link structure enables it to retrieve relevant
images even when they are stored in files with meaningless names. The
same analysis also allows it to identify image containers and image hubs. We
define these as Web pages that are rich in relevant images, or from which many
images are readily accessible. PicASHOW requires no image analysis whatsoever
and no creation of taxonomies for preclassification of the Web's images. It can
be implemented by standard WWW search engines with reasonable overhead, in
terms of both computations and storage, and with no change to user query
formats. It can thus be used to easily add image retrieval capabilities to
standard search engines. Our results demonstrate that PicASHOW, while relying
almost exclusively on link analysis, compares well with dedicated WWW image
retrieval systems. We conclude that link analysis, a proven effective technique
for Web page search, can improve the performance of Web image retrieval, as
well as extend its definition to include the retrieval of image hubs and
containers. Keywords: Image retrieval, hubs and authorities, image hubs, link structure analysis | |||
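The abstract above describes extending link-based page ranking to images. The sketch below is only a rough illustration of that idea, not the paper's algorithm: pages are scored with a plain HITS-style iteration, and each image then inherits the hub scores of the pages that display or link to it. The input dictionaries (`page_links`, `page_images`) are assumed stand-ins for a query-focused page set.

```python
# Rough sketch (not the paper's exact method): score images by the hub
# quality of the pages that display or link to them, HITS-style.
from collections import defaultdict

def image_authorities(page_links, page_images, iters=20):
    pages = set(page_links) | {q for ps in page_links.values() for q in ps}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):                      # standard HITS iterations on pages
        auth = {p: sum(hub[q] for q in pages if p in page_links.get(q, ())) for p in pages}
        hub = {p: sum(auth[q] for q in page_links.get(p, ())) for p in pages}
        for d in (auth, hub):                   # L2-normalize to keep scores bounded
            norm = sum(v * v for v in d.values()) ** 0.5 or 1.0
            for k in d:
                d[k] /= norm
    img_score = defaultdict(float)              # an image inherits the hub scores
    for p, imgs in page_images.items():         # of every page that shows it
        for img in imgs:
            img_score[img] += hub.get(p, 0.0)
    return sorted(img_score.items(), key=lambda kv: -kv[1])
```

Pages whose displayed images collect high aggregate scores would be the natural candidates for the image containers and image hubs mentioned in the abstract.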
| Knowledge encapsulation for focused search from pervasive devices | | BIBAK | Full-Text | 25-46 | |
| Yariv Aridor; David Carmel; Yoelle S. Maarek; Aya Soffer; Ronny Lempel | |||
| Mobile knowledge seekers often need access to information on the Web during
a meeting or on the road, while away from their desktop. A common practice
today is to use pervasive devices such as Personal Digital Assistants or mobile
phones. However, these devices have inherent constraints (e.g., slow
communication, form factor) which often make information discovery tasks
impractical. In this paper, we present a new focused-search approach
tailored to the mode of work and the constraints dictated by pervasive
devices. It combines focused search within specific topics with
encapsulation of topic-specific information in a persistent repository. One key
characteristic of these persistent repositories is that their footprint is
small enough to fit on local devices, and yet they are rich enough to support
many information discovery tasks in disconnected mode. More specifically, we
suggest a representation for topic-specific information based on
"knowledge-agent bases" that comprise all the information necessary to access
information about a topic (under the form of key concepts and key Web pages)
and assist in the full search process from query formulation assistance to
result scanning on the device itself. The key contribution of our work is the
coupling of focused search with encapsulated knowledge representation, which
makes information discovery from pervasive devices both practical and efficient. We
describe our model in detail and demonstrate its aspects through sample
scenarios. Keywords: Focused searches, disconnected search, knowledge agents, pervasive devices | |||
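As a loose illustration of the encapsulation idea, the sketch below models a topic-specific "knowledge-agent base" as a small local object holding key concepts and key pages; the field names and the query-expansion heuristic are assumptions for illustration, not the paper's representation.

```python
# A minimal sketch of a "knowledge-agent base"; field names and the
# expansion heuristic are assumptions, not the paper's format.
from dataclasses import dataclass, field

@dataclass
class KnowledgeAgentBase:
    topic: str
    key_concepts: dict = field(default_factory=dict)   # term -> importance weight
    key_pages: list = field(default_factory=list)      # URLs of topical hub pages

    def expand_query(self, query: str, k: int = 3) -> str:
        """Augment a short query with the agent's top-weighted concepts so a
        focused search can run against the small local repository offline."""
        extra = sorted(self.key_concepts, key=self.key_concepts.get, reverse=True)
        extra = [t for t in extra if t not in query.split()][:k]
        return " ".join([query] + extra)
```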
| When experts agree: using non-affiliated experts to rank popular topics | | BIBAK | Full-Text | 47-58 | |
| Krishna Bharat; George A. Mihaila | |||
| In response to a query, a search engine returns a ranked list of documents.
If the query is about a popular topic (i.e., it matches many documents), then
the returned list is usually too long to view fully. Studies show that users
usually look at only the top 10 to 20 results. However, we can exploit the fact
that the best targets for popular topics are usually linked to by enthusiasts
in the same domain. In this paper, we propose a novel ranking scheme for
popular topics that places the most authoritative pages on the query topic at
the top of the ranking. Our algorithm operates on a special index of "expert
documents." These are a subset of the pages on the WWW identified as
directories of links to non-affiliated sources on specific topics. Results are
ranked based on the match between the query and relevant descriptive text for
hyperlinks on expert pages pointing to a given result page. We present a
prototype search engine that implements our ranking scheme and discuss its
performance. With a relatively small (2.5 million page) expert index, our
algorithm performed comparably to the best of the mainstream search engines
on popular queries. Keywords: WWW search, authorities, connectivity, host affiliation, link analysis,
ranking, topic experts | |||
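A rough sketch of the ranking idea, under assumptions about the input format: each expert page contributes credit to a non-affiliated target page when the descriptive text of its hyperlink matches the query. The affiliation test here is a simple same-host check, which is cruder than the paper's notion of host affiliation, and the scoring is a placeholder.

```python
# Sketch: rank target pages by how well the descriptive text of links on
# non-affiliated expert pages matches the query. `experts` maps an expert
# URL to a list of (target_url, descriptive_text) pairs (assumed format).
from urllib.parse import urlparse
from collections import defaultdict

def rank_targets(query, experts):
    terms = set(query.lower().split())
    scores = defaultdict(float)
    for expert_url, links in experts.items():
        expert_host = urlparse(expert_url).netloc
        for target_url, text in links:
            if urlparse(target_url).netloc == expert_host:
                continue                      # skip affiliated (same-host) links
            matched = terms & set(text.lower().split())
            if matched:
                scores[target_url] += len(matched) / len(terms)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```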
| Query clustering using user logs | | BIBAK | Full-Text | 59-81 | |
| J. R. Wen; J. Y. Nie; H. J. Zhang | |||
| Query clustering is a process used to discover frequently asked questions or
most popular topics on a search engine. This process is crucial for search
engines based on question-answering. Because of the short lengths of queries,
approaches based on keywords are not suitable for query clustering. This paper
describes a new query clustering method that makes use of user logs, which
allow us to identify the documents that users have selected for a query. The
similarity between two queries may be deduced from the common documents the
users selected for them. Our experiments show that a combination of both
keywords and user logs is better than using either method alone. Keywords: Query clustering, search engine, user log, web data mining | |||
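The core similarity signal lends itself to a short sketch. Below, two queries are compared by the overlap of the documents users clicked for them, optionally blended with keyword overlap; the Jaccard form and the equal weights are assumptions, not the paper's exact measures.

```python
# Sketch: two queries are similar when users clicked many of the same
# documents for them; blending with keyword overlap follows the abstract's
# finding that the combination beats either signal alone (weights assumed).
def click_similarity(docs_a: set, docs_b: set) -> float:
    """Jaccard overlap of the document sets clicked for two queries."""
    if not docs_a or not docs_b:
        return 0.0
    return len(docs_a & docs_b) / len(docs_a | docs_b)

def combined_similarity(query_a, query_b, docs_a, docs_b, alpha=0.5):
    terms_a, terms_b = set(query_a.lower().split()), set(query_b.lower().split())
    kw = len(terms_a & terms_b) / max(len(terms_a | terms_b), 1)
    return alpha * kw + (1 - alpha) * click_similarity(docs_a, docs_b)
```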
| Efficient web browsing on handheld devices using page and form summarization | | BIBAK | Full-Text | 82-115 | |
| Orkut Buyukkokten; Oliver Kaljuvee; Hector Garcia-Molina; Andreas Paepcke; Terry Winograd | |||
| We present a design and implementation for displaying and manipulating HTML
pages on small handheld devices such as personal digital assistants (PDAs), or
cellular phones. We introduce methods for summarizing parts of Web pages and
HTML forms. Each Web page is broken into text units that can each be hidden,
partially displayed, made fully visible, or summarized. A variety of methods
are introduced that summarize the text units. In addition, HTML forms are
summarized by displaying just the text labels that prompt the user for input. We
tested the relative performance of the summarization methods by asking human
subjects to accomplish single-page information search tasks. We found that the
combination of keywords and single-sentence summaries provides significant
improvements in access times and number of required pen actions, as compared to
other schemes. Our experiments also show that our algorithms can identify the
appropriate labels for forms in 95% of the cases, allowing effective form
support for small screens. Keywords: PDA, Personal digital assistant, WAP, WML, forms, handheld computers, mobile
computing, summarization, ubiquitous computing, wireless computing | |||
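As a toy illustration of the form-summarization step only (not the paper's system), the sketch below extracts the text labels that precede input fields so a form can be shown compactly on a small screen; real pages need a more robust label-to-field matching heuristic than this nearest-preceding-text rule.

```python
# Toy sketch: collect the text immediately preceding each form field as its
# "label", using only the standard-library HTML parser.
from html.parser import HTMLParser

class FormLabelExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_form = False
        self.pending_text = ""
        self.labels = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif self.in_form and tag in ("input", "select", "textarea"):
            if self.pending_text.strip():
                self.labels.append(self.pending_text.strip())
            self.pending_text = ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

    def handle_data(self, data):
        if self.in_form:
            self.pending_text += data

def summarize_form(html: str):
    """Return the prompt labels of a form as its compact summary."""
    parser = FormLabelExtractor()
    parser.feed(html)
    return parser.labels
```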
| Placing search in context: the concept revisited | | BIBAK | Full-Text | 116-131 | |
| Lev Finkelstein; Evgeniy Gabrilovich; Yossi Matias; Ehud Rivlin; Zach Solan; Gadi Wolfman; Eytan Ruppin | |||
| Keyword-based search engines are in widespread use today as a popular means
for Web-based information retrieval. Although such systems seem deceptively
simple, a considerable amount of skill is required in order to satisfy
non-trivial information needs. This paper presents a new conceptual paradigm
for performing search in context that largely automates the search process,
providing even non-professional users with highly relevant results. This
paradigm is implemented in practice in the IntelliZap system, where search is
initiated from a text query marked by the user in a document she views, and is
guided by the text surrounding the marked query in that document ("the
context"). The context-driven information retrieval process involves semantic
keyword extraction and clustering to automatically generate new, augmented
queries. The latter are submitted to a host of general and domain-specific
search engines. Search results are then semantically reranked, using context.
Experimental results testify that using context to guide search effectively
offers even inexperienced users an advanced search tool on the Web. Keywords: Search, context, invisible web, semantic processing, statistical natural
language processing | |||
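A rough sketch of context-driven augmentation and reranking in this spirit follows; the frequency-based term scoring, the tiny stopword list, and the overlap-based reranking are assumptions standing in for IntelliZap's semantic keyword extraction, clustering, and reranking.

```python
# Sketch: augment the marked query with salient terms from its surrounding
# context, then rerank results by their overlap with that context.
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def augment_query(marked_query: str, context: str, k: int = 3) -> str:
    """Add the k most frequent non-stopword context terms to the query."""
    counts = Counter(w for w in context.lower().split()
                     if w.isalpha() and w not in STOPWORDS
                     and w not in marked_query.lower().split())
    return " ".join([marked_query] + [w for w, _ in counts.most_common(k)])

def rerank(results, context):
    """Reorder result snippets by their term overlap with the context."""
    ctx = set(context.lower().split()) - STOPWORDS
    overlap = lambda snippet: len(ctx & set(snippet.lower().split()))
    return sorted(results, key=overlap, reverse=True)
```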
| Peer-to-peer data trading to preserve information | | BIBAK | Full-Text | 133-170 | |
| Brian F. Cooper; Hector Garcia-Molina | |||
| Data archiving systems rely on replication to preserve information. This
paper discusses how a network of autonomous archiving sites can trade data to
achieve the most reliable replication. A series of binary trades among sites
produces a peer-to-peer archiving network. Two trading algorithms are examined,
one based on trading collections (even if they are of different sizes) and
another based on trading equal-sized blocks of space (which can then store
collections). The concept of deeds is introduced; deeds track the blocks of
space owned by one site at another. Policies for tuning these algorithms to
provide the highest reliability, for example by changing the order in which
sites are contacted and offered trades, are discussed. Finally, simulation
results are presented that reveal which policies are best. The experiments
indicate that a digital archive can achieve the best reliability by trading
blocks of space (deeds), and that following certain policies allows a
site to maximize its reliability. Keywords: Data replication, digital archiving, digital library, fault tolerance,
resource negotiation | |||
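The block-trading idea can be illustrated with a toy model: a binary trade grants each site a deed for an equal number of blocks at the other, and a site replicates a collection by trading with candidate sites in some order. The data model and the greedy order below are assumptions, not the paper's policies.

```python
# Toy sketch of block trading with deeds; block sizes, the greedy site
# order, and the data model are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class Site:
    name: str
    free_blocks: int                              # unallocated local blocks
    deeds: dict = field(default_factory=dict)     # remote site -> blocks owned there
    replicas: list = field(default_factory=list)  # (collection, host site) pairs

def trade_blocks(a: Site, b: Site, n: int) -> bool:
    """A binary trade: each site grants the other a deed for n blocks."""
    if a.free_blocks < n or b.free_blocks < n:
        return False
    a.free_blocks -= n; b.free_blocks -= n
    a.deeds[b.name] = a.deeds.get(b.name, 0) + n
    b.deeds[a.name] = b.deeds.get(a.name, 0) + n
    return True

def replicate(owner: Site, collection: str, size: int, candidates: list) -> int:
    """Greedily trade with candidate sites and place one replica per
    successful trade; returns the number of replicas placed."""
    placed = 0
    for site in candidates:
        if trade_blocks(owner, site, size):
            owner.replicas.append((collection, site.name))
            placed += 1
    return placed
```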
| Collection statistics for fast duplicate document detection | | BIBA | Full-Text | 171-191 | |
| Abdur Chowdhury; Ophir Frieder; David Grossman; Mary Catherine McCabe | |||
We present a new algorithm for duplicate document detection that uses collection statistics. We compare our approach with the state-of-the-art approach using multiple collections. These collections include a 30 MB, 18,577-document web collection developed by Excite@Home and three NIST collections. The first NIST collection consists of 100 MB of 18,232 LA-Times documents, which is roughly similar in the number of documents to the Excite@Home collection. The other two collections are both 2 GB: a 247,491-document web collection and the 528,023-document collection from TREC disks 4 and 5. We show that our approach, called I-Match, scales in terms of the number of documents and works well for documents of all sizes. We compared our solution to the state of the art and found that, in addition to improved accuracy of detection, our approach executed in roughly one-fifth the time. | |||
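A simplified sketch consistent with published descriptions of I-Match follows: terms are filtered by their collection-wide idf, the surviving terms are sorted, and a single hash of that term set serves as the document fingerprint; documents sharing a fingerprint are flagged as duplicates. The particular idf band kept here (the top half of terms by idf) is an assumption.

```python
# Simplified I-Match-style fingerprinting: filter terms by collection idf,
# sort the survivors, hash them into one fingerprint per document.
import hashlib
import math
from collections import defaultdict

def idf_table(docs):
    """docs: {doc_id: text}. Returns term -> idf over the collection."""
    df = defaultdict(int)
    for text in docs.values():
        for term in set(text.lower().split()):
            df[term] += 1
    n = len(docs)
    return {t: math.log(n / d) for t, d in df.items()}

def fingerprint(text, idf, keep_fraction=0.5):
    terms = sorted(set(text.lower().split()), key=lambda t: idf.get(t, 0.0), reverse=True)
    kept = sorted(terms[: max(1, int(len(terms) * keep_fraction))])
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()

def find_duplicates(docs):
    idf = idf_table(docs)
    clusters = defaultdict(list)
    for doc_id, text in docs.items():
        clusters[fingerprint(text, idf)].append(doc_id)
    return [ids for ids in clusters.values() if len(ids) > 1]
```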
| Burst tries: a fast, efficient data structure for string keys | | BIBAK | Full-Text | 192-223 | |
| Steffen Heinz; Justin Zobel; Hugh E. Williams | |||
| Many applications depend on efficient management of large sets of distinct
strings in memory. For example, during index construction for text databases a
record is held for each distinct word in the text, containing the word itself
and information such as counters. We propose a new data structure, the burst
trie, that has significant advantages over existing options for such
applications: it uses about the same memory as a binary search tree; it is as
fast as a trie; and, while not as fast as a hash table, a burst trie maintains
the strings in sorted or near-sorted order. In this paper we describe burst
tries and explore the parameters that govern their performance. We
experimentally determine good choices of parameters, and compare burst tries to
other structures used for the same task, with a variety of data sets. These
experiments show that the burst trie is particularly effective for the skewed
frequency distributions common in text collections, and dramatically
outperforms all other data structures for the task of managing strings while
maintaining sort order. Keywords: Binary trees, splay trees, string data structures, text databases, tries,
vocabulary accumulation | |||
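A compact sketch of the data structure follows: trie nodes index leading characters, leaves are small containers of (suffix, count) records, and a container "bursts" into a new trie node once it exceeds a threshold. The container type (a plain list) and the burst limit are simplifications of the design space the paper explores.

```python
# Sketch of a burst trie for vocabulary accumulation; container type and
# burst threshold are simplified assumptions.
BURST_LIMIT = 4   # assumed container capacity before bursting

class BurstTrie:
    def __init__(self):
        self.children = {}     # char -> BurstTrie (trie node)
        self.containers = {}   # char -> list of (suffix, count) records
        self.end_count = 0     # counter for the string ending at this node

    def insert(self, word):
        node, i = self, 0
        while True:
            if i == len(word):
                node.end_count += 1
                return
            c = word[i]
            if c in node.children:               # descend an existing trie node
                node, i = node.children[c], i + 1
                continue
            bucket = node.containers.setdefault(c, [])
            suffix = word[i + 1:]
            for j, (s, n) in enumerate(bucket):  # record already in container?
                if s == suffix:
                    bucket[j] = (s, n + 1)
                    return
            bucket.append((suffix, 1))
            if len(bucket) > BURST_LIMIT:        # burst: container becomes a node
                child = BurstTrie()
                node.children[c] = child
                for s, n in bucket:
                    for _ in range(n):
                        child.insert(s)
                del node.containers[c]
            return
```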
| Theory of keyblock-based image retrieval | | BIBAK | Full-Text | 224-257 | |
| Lei Zhu; Aibing Rao; Aidong Zhang | |||
| The success of text-based retrieval motivates us to investigate analogous
techniques which can support the querying and browsing of image data. However,
images differ significantly from text both syntactically and semantically in
their mode of representing and expressing information. Thus, the generalization
of information retrieval from the text domain to the image domain is
non-trivial. This paper presents a framework for information retrieval in the
image domain which supports content-based querying and browsing of images. A
critical first step to establishing such a framework is to construct a codebook
of "keywords" for images which is analogous to the dictionary for text
documents. We refer to such "keywords" in the image domain as "keyblocks." In
this paper, we first present various approaches to generating a codebook
containing keyblocks at different resolutions. Then we present a keyblock-based
approach to content-based image retrieval. In this approach, each image is
encoded as a set of one-dimensional index codes linked to the keyblocks in the
codebook, analogous to considering a text document as a linear list of
keywords. Generalizing upon text-based information retrieval methods, we then
offer various techniques for image-based information retrieval. By comparing
the performance of this approach with conventional techniques using color and
texture features, we demonstrate the effectiveness of the keyblock-based
approach to content-based image retrieval. Keywords: clustering, codebook, content-based image retrieval, keyblock | |||
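A minimal sketch of keyblock encoding and retrieval: images are partitioned into blocks, each block is mapped to its nearest codebook keyblock, and images are compared by their keyblock histograms. The flat pixel-vector block representation, Euclidean distance, and cosine matching are assumptions for illustration; the paper generalizes text-retrieval models more broadly.

```python
# Sketch: encode each image as a histogram over codebook keyblocks and
# retrieve by histogram similarity (representation and metric assumed).
import math

def nearest_keyblock(block, codebook):
    """block and codebook entries are equal-length lists of pixel values."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist(block, codebook[i]))

def encode_image(blocks, codebook):
    """Encode an image (list of blocks) as a keyblock histogram."""
    hist = [0] * len(codebook)
    for block in blocks:
        hist[nearest_keyblock(block, codebook)] += 1
    return hist

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def retrieve(query_blocks, image_index, codebook, top_k=10):
    """image_index: {image_id: list of blocks}. Rank by histogram similarity."""
    q = encode_image(query_blocks, codebook)
    scored = [(cosine(q, encode_image(b, codebook)), img) for img, b in image_index.items()]
    return sorted(scored, reverse=True)[:top_k]
```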
| Improving retrieval feedback with multiple term-ranking function combination | | BIBAK | Full-Text | 259-290 | |
| Claudio Carpineto; Giovanni Romano; Vittorio Giannini | |||
| In this article we consider methods for automatic query expansion from top
retrieved documents (i.e., retrieval feedback) that make use of various
functions for scoring expansion terms within Rocchio's classical reweighting
scheme. An analytical comparison shows that the retrieval performance of
methods based on distinct term-scoring functions is comparable on the whole
query set but differs considerably on single queries, consistent with the fact
that the ordered sets of expansion terms suggested for each query by the
different functions are largely uncorrelated. Motivated by these findings, we
argue that the results of multiple functions can be merged, by analogy with
ensembling classifiers, and present a simple combination technique based on the
rank values of the suggested terms. The combined retrieval feedback method is
effective not only with respect to unexpanded queries but also with respect
to any individual method, with notable improvements in the system's precision.
Furthermore, the combined method is robust with respect to variation of
experimental parameters and it is beneficial even when the same information
needs are expressed with shorter queries. Keywords: automatic query expansion, information retrieval, method combination,
retrieval feedback, short queries | |||
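The rank-based combination lends itself to a short sketch: each term-scoring function ranks its candidate expansion terms, and terms are merged by their rank sum, with a penalty rank for terms a function does not suggest. The penalty scheme and the number of expansion terms are assumptions, not the article's settings.

```python
# Sketch: merge expansion-term suggestions from several scoring functions
# by summing rank positions (lower rank sum = better candidate).
def rank_map(scores):
    """scores: {term: score}. Returns term -> rank (1 = best)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {term: r + 1 for r, term in enumerate(ordered)}

def combine_by_rank(score_maps, n_terms=10):
    """Merge candidate terms; terms missing from a function get a penalty rank."""
    terms = set().union(*score_maps)
    penalty = max(len(m) for m in score_maps) + 1
    rank_maps = [rank_map(m) for m in score_maps]
    total = {t: sum(r.get(t, penalty) for r in rank_maps) for t in terms}
    return sorted(total, key=total.get)[:n_terms]   # lowest rank sum first

# Usage: feed the merged terms into a Rocchio-style reweighting of the query,
# e.g. expansion = combine_by_rank([kl_scores, chi2_scores, rocchio_scores])
```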
| An intelligent approach to handling imperfect information in concept-based natural language queries | | BIBAK | Full-Text | 291-328 | |
| Vesper Owei | |||
| Missing information, imprecision, inconsistency, vagueness, uncertainty, and
ignorance abound in information systems. Such imperfection is a fact of life in
database systems. Although these problems are widely studied in relational
database systems, this is not the case in conceptual query systems. And yet,
concept-based query languages have been proposed and some are already
commercial products. It is therefore imperative to study these problems in
concept-based query languages, with a view to prescribing formal approaches to
dealing with them. In this article, we have done just that for a
concept-based natural language query system that we developed. A methodology
for handling and resolving each type of imperfection is developed. The proposed
approaches are automated as much as possible, with the user mainly serving an
assistive function. Keywords: ambiguous query, anaphoric query, concept-based query, conceptual query
language, elliptical query, imperfect queries, incomplete information,
inconsistency, inexplicit query, missing information, natural language
interface, natural language query, semantically mismatched query | |||
| A general-purpose compression scheme for large collections | | BIBAK | Full-Text | 329-355 | |
| Adam Cannane; Hugh E. Williams | |||
Compression of large collections can lead to improvements in retrieval times,
because the savings in seeking and retrieving data from disk can outweigh the
CPU cost of decompression. We propose a semistatic phrase-based approach called
xray that builds a model offline using sample training data extracted from a
collection, and then compresses the entire collection online in a single pass.
The particular benefits of xray are that it can be used in applications where
individual records or documents must be decompressed, and that decompression is
fast. The xray scheme also allows new data to be added to a collection without
modifying the semistatic model. Moreover, xray can be used to compress
general-purpose data such as genomic, scientific, image, and geographic
collections without prior knowledge of the structure of the data. We show that
xray is effective on both text and general-purpose collections. In general,
xray is more effective than the popular gzip and compress schemes, while being
marginally less effective than bzip2. We also show that xray is efficient: of
the popular schemes we tested, it is typically only slower than gzip in
decompression. Moreover, the query evaluation cost of retrieving documents
from a large collection with our search engine is improved by more than 30%
when xray is incorporated, compared to an uncompressed approach. We use simple
techniques for obtaining the training data from the collection to be compressed
and show that with just over 4% of the data the entire collection can be
effectively compressed. We also propose four schemes for phrase-match selection
during the single pass compression of the collection. We conclude that with
these novel approaches xray is a fast and effective scheme for compression and
decompression of large general-purpose collections. Keywords: phrase-based compression, random access, sampling | |||
| Probabilistic models of information retrieval based on measuring the divergence from randomness | | BIBAK | Full-Text | 357-389 | |
| Gianni Amati; Cornelis Joost Van Rijsbergen | |||
We introduce a framework for deriving probabilistic models of information
retrieval. The models are nonparametric models of IR obtained in the
the language model approach. We derive term-weighting models by measuring the
divergence of the actual term distribution from that obtained under a random
process. Among the random processes we study the binomial distribution and
Bose--Einstein statistics. We define two types of term frequency normalization
for tuning term weights in the document--query matching process. The first
normalization assumes that documents have the same length and measures the
information gain with the observed term once it has been accepted as a good
descriptor of the observed document. The second normalization is related to the
document length and to other statistics. These two normalization methods are
applied to the basic models in succession to obtain weighting formulae. Results
show that our framework produces different nonparametric models forming
baseline alternatives to the standard tf-idf model. Keywords: Aftereffect model, BM25, Bose--Einstein statistics, Laplace, Poisson,
binomial law, document length normalization, eliteness, idf, information
retrieval, probabilistic models, randomness, succession law, term frequency
normalization, term weighting | |||
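The weighting form at the heart of the framework can be summarized as follows; the Laplace aftereffect and the length normalization shown are one common instantiation and should be checked against the article for the exact variants and constants.

```latex
% Sketch of the general divergence-from-randomness weighting form; the
% aftereffect and length normalization below are assumed instantiations.
\[
  w(t,d) \;=\; \underbrace{-\log_2 \mathrm{Prob}_1(\mathit{tf})}_{\text{divergence from randomness}}
  \;\cdot\;
  \underbrace{\bigl(1 - \mathrm{Prob}_2(\mathit{tf})\bigr)}_{\text{first normalization (aftereffect)}}
\]
\[
  \text{e.g. Laplace aftereffect: } \mathrm{Prob}_2(\mathit{tf}) = \frac{\mathit{tf}}{\mathit{tf}+1},
  \qquad
  \text{second normalization: } \mathit{tfn} = \mathit{tf}\cdot\log_2\!\Bigl(1 + c\,\frac{\mathrm{avg\_dl}}{\mathrm{dl}(d)}\Bigr).
\]
```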
| A semantic network-based design methodology for XML documents | | BIBAK | Full-Text | 390-421 | |
| Ling Feng; Elizabeth Chang; Tharam Dillon | |||
| The eXtensible Markup Language (XML) is fast emerging as the dominant
standard for describing and interchanging data among various systems and
databases on the Internet. It offers the Document Type Definition (DTD) as a
formalism for defining the syntax and structure of XML documents. The XML
Schema definition language, as a replacement for the DTD, provides richer
facilities for defining and constraining the content of XML documents. However,
it does not concentrate on the semantics that underlies these documents,
representing a logical data model rather than a conceptual model. To enable
efficient business application development in large-scale electronic commerce
environments, it is necessary to describe and model real-world data semantics
and their complex interrelationships. In this article, we describe a design
methodology for XML documents. The aim is to strengthen XML's conceptual
modeling power and bridge the gap between software development and XML
document structures. The proposed methodology comprises two design levels: the
semantic level and the schema level. The first level is based on a semantic
network, which provides semantic modeling of XML through four major components:
a set of atomic and complex nodes, representing real-world objects; a set of
directed edges, representing semantic relationships between the objects; a set
of labels denoting different types of semantic relationships, including
aggregation, generalization, association, and of-property relationships; and
finally a set of constraints defined over nodes and edges to constrain semantic
relationships and object domains. The other level of the proposed methodology
is concerned with detailed XML schema design, including element/attribute
declarations and simple/complex type definitions. The mapping between the two
design levels is proposed to transform the XML semantic model into the XML
Schema, based on which XML documents can be systematically created, managed,
and validated. Keywords: XML, XML Schema, conceptual modeling, design methodology, semantic network | |||
| Cumulated gain-based evaluation of IR techniques | | BIBAK | Full-Text | 422-446 | |
| Kalervo Järvelin; Jaana Kekäläinen | |||
Modern large retrieval environments tend to overwhelm their users with their
large output. Since not all documents are equally relevant to their users,
highly relevant documents should be identified and ranked first for
presentation. In order to develop IR techniques in this direction, it is
necessary to develop evaluation approaches and methods that credit IR methods
for their ability to retrieve highly relevant documents. This can be done by
extending traditional evaluation methods, that is, recall and precision based
on binary relevance judgments, to graded relevance judgments. Alternatively,
novel measures based on graded relevance judgments may be developed. This
article proposes several novel measures that compute the cumulative gain the
user obtains by examining the retrieval result up to a given ranked position.
The first one accumulates the relevance scores of retrieved documents along the
ranked result list. The second one is similar but applies a discount factor to
the relevance scores in order to devaluate late-retrieved documents. The third
one computes the relative-to-the-ideal performance of IR techniques, based on
the cumulative gain they are able to yield. These novel measures are defined
and discussed and their use is demonstrated in a case study using TREC data:
sample system run results for 20 queries in TREC-7. As a relevance base we used
novel graded relevance judgments on a four-point scale. The test results
indicate that the proposed measures credit IR methods for their ability to
retrieve highly relevant documents and allow testing of statistical
significance of effectiveness differences. The graphs based on the measures
also provide insight into the performance of IR techniques and allow
interpretation, for example, from the user point of view. Keywords: Graded relevance judgments, cumulated gain | |||
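The three measure families described above reduce to a short sketch: cumulated gain, discounted cumulated gain with a logarithmic discount, and the relative-to-the-ideal (normalized) variant. The discount's log base is a parameter of the measure; base 2 is assumed here.

```python
# Sketch of CG, DCG, and the relative-to-the-ideal (normalized) curves for
# graded relevance judgments; the log base of the discount is assumed to be 2.
import math

def cg(gains):
    """Cumulated gain at each rank for a list of graded relevance scores."""
    out, total = [], 0.0
    for g in gains:
        total += g
        out.append(total)
    return out

def dcg(gains, base=2):
    """Discount late-retrieved documents by the log of their rank."""
    out, total = [], 0.0
    for rank, g in enumerate(gains, start=1):
        total += g if rank < base else g / math.log(rank, base)
        out.append(total)
    return out

def normalized(measure, gains, ideal_gains):
    """Relative-to-the-ideal performance: divide by the ideal ranking's curve."""
    actual, ideal = measure(gains), measure(ideal_gains)
    return [a / i if i else 0.0 for a, i in zip(actual, ideal)]

# Usage with graded judgments on a 0-3 scale (a four-point scale, as above):
# run   = [3, 2, 3, 0, 1, 2]
# ideal = sorted(run, reverse=True)
# ndcg_curve = normalized(dcg, run, ideal)
```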