| QProber: A system for automatic classification of hidden-Web databases | BIBAK | Full-Text | 1-41 |
| Luis Gravano; Panagiotis G. Ipeirotis; Mehran Sahami |
| The contents of many valuable Web-accessible databases are only available
through search interfaces and are hence invisible to traditional Web
"crawlers." Recently, commercial Web sites have started to manually organize
Web-accessible databases into Yahoo!-like hierarchical classification schemes.
Here we introduce QProber, a modular system that automates this classification
process by using a small number of query probes, generated by document
classifiers. QProber can use a variety of types of classifiers to generate the
probes. To classify a database, QProber does not retrieve or inspect any
documents or pages from the database, but rather just exploits the number of
matches that each query probe generates at the database in question. We have
conducted an extensive experimental evaluation of QProber over collections of
real documents, experimenting with different types of document classifiers and
retrieval models. We have also tested our system with over one hundred
Web-accessible databases. Our experiments show that our system has low overhead
and achieves high classification accuracy across a variety of databases. |
| Keywords: Database classification, Web databases, hidden Web |
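A minimal sketch of the probing idea described in this entry, assuming a toy match-counting interface; the probe queries, category names, and the decision rule are illustrative, not QProber's actual classifier-derived probes or thresholds:

```python
# Hypothetical sketch: classify a hidden-Web database from query-probe match
# counts alone. The probe sets and decision rule are illustrative; QProber
# derives its probes from trained document classifiers and uses
# coverage/specificity-style thresholds.

# Each category maps to probe queries (here, hand-picked placeholders).
PROBES = {
    "Health":    ["cancer treatment", "cardiology", "immunology"],
    "Computers": ["operating system", "compiler", "router firmware"],
    "Sports":    ["world cup", "marathon training", "free throw"],
}

def count_matches(database, query):
    """Stand-in for sending `query` to the database's search interface and
    reading the reported number of matches (no documents are retrieved)."""
    return database.get(query, 0)

def classify(database, probes=PROBES, threshold=0.5):
    # Aggregate match counts per category and normalize them into shares.
    totals = {cat: sum(count_matches(database, q) for q in qs)
              for cat, qs in probes.items()}
    grand = sum(totals.values()) or 1
    shares = {cat: t / grand for cat, t in totals.items()}
    # Assign the categories whose share clears the threshold.
    return [cat for cat, s in shares.items() if s >= threshold] or ["Root"]

# Toy "database" that only reports match counts per query.
db = {"cancer treatment": 1200, "cardiology": 800, "compiler": 3}
print(classify(db))   # -> ['Health']
```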
| Local versus global link information in the Web | BIBAK | Full-Text | 42-63 |
| Pavel Calado; Berthier Ribeiro-Neto; Nivio Ziviani; Edleno Moura; Ilmerio Silva |
| Information derived from the cross-references among the documents in a
hyperlinked environment, usually referred to as link information, is considered
important since it can be used to effectively improve document retrieval.
Depending on the retrieval strategy, link information can be local or global.
Local link information is derived from the set of documents returned as answers
to the current user query. Global link information is derived from all the
documents in the collection. In this work, we investigate how the use of local
link information compares to the use of global link information. For the
comparison, we run a series of experiments using a large document collection
extracted from the Web. For our reference collection, the results indicate that
the use of local link information improves precision by 74%. When global link
information is used, precision improves by 35%. However, when only the first 10
documents in the ranking are considered, the average gain in precision obtained
with the use of global link information is higher than the gain obtained with
the use of local link information. This is an interesting result since it
provides insight and justification for the use of global link information in
major Web search engines, where users are mostly interested in the first 10
answers. Furthermore, global information can be computed in the background,
which speeds up query processing. |
| Keywords: Belief networks, World Wide Web, link analysis, local and global information |
| Exploiting hierarchical domain structure to compute similarity | BIBAK | Full-Text | 64-93 |
| Prasanna Ganesan; Hector Garcia-Molina; Jennifer Widom |
| The notion of similarity between objects finds use in many contexts, for
example, in search engines, collaborative filtering, and clustering. Objects
being compared often are modeled as sets, with their similarity traditionally
determined based on set intersection. Intersection-based measures do not
accurately capture similarity in certain domains, such as when the data is
sparse or when there are known relationships between items within sets. We
propose new measures that exploit a hierarchical domain structure in order to
produce more intuitive similarity scores. We extend our similarity measures to
provide appropriate results in the presence of multisets (also handled
unsatisfactorily by traditional measures), for example, to correctly compute
the similarity between customers who buy several instances of the same product
(say milk), or who buy several products in the same category (say dairy
products). We also provide an experimental comparison of our measures against
traditional similarity measures, and report on a user study that evaluated how
well our measures match human intuition. |
| Keywords: Similarity measures, collaborative filtering, data mining, hierarchy |
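A minimal sketch of the general idea, under assumptions of our own: each purchased item is expanded to its ancestors in a hypothetical product hierarchy with decaying weights before a cosine comparison, so customers with no items in common can still look similar. The article's actual measures differ in detail:

```python
from math import sqrt

# Hypothetical product hierarchy: child -> parent.
PARENT = {"milk": "dairy", "cheese": "dairy", "dairy": "food",
          "apple": "produce", "produce": "food", "food": None}

def ancestors(item):
    """Yield the item and all of its ancestors up the hierarchy."""
    while item is not None:
        yield item
        item = PARENT.get(item)

def expand(basket, decay=0.5):
    """Map a multiset of purchases to weighted hierarchy nodes; ancestors
    receive geometrically decaying weight."""
    vec = {}
    for item, count in basket.items():
        w = 1.0
        for node in ancestors(item):
            vec[node] = vec.get(node, 0.0) + count * w
            w *= decay
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Two customers with no common items still overlap through "dairy".
a = {"milk": 3}
b = {"cheese": 2}
print(cosine(expand(a), expand(b)))   # > 0, unlike plain set intersection
```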
| Early user-system interaction for database selection in massive domain-specific online environments | BIBAK | Full-Text | 94-131 |
| Jack G. Conrad; Joanne R. S. Claussen |
| The continued growth of very large data environments such as Westlaw and
Dialog, in addition to the World Wide Web, increases the importance of
effective and efficient database selection and searching. Current research
focuses largely on completely autonomous and automatic selection, searching,
and results merging in distributed environments. This fully automatic approach
has significant deficiencies, including reliance upon thresholds below which
databases with relevant documents are not searched (compromised recall). It
also merges documents, often from disparate data sources that users may have
discarded before their source selection task proceeded (diluted precision). We
examine the impact that early user interaction can have on the process of
database selection. After analyzing thousands of real user queries, we show
that precision can be significantly increased when queries are categorized by
the users themselves, then handled effectively by the system. Such query
categorization strategies may eliminate limitations of fully automated query
processing approaches. Our system harnesses the WIN search engine, a sibling to
INQUERY, run against one or more authority sources when search is required. We
compare our approach to one that does not recognize or utilize distinct
features associated with user queries. We show that by avoiding a
one-size-fits-all approach that restricts the role users can play in
information discovery, database selection effectiveness can be appreciably
improved. |
| Keywords: Database selection, metadata for retrieval, structuring information to aid search and navigation, user interaction |
| Performance issues and error analysis in an open-domain question answering system | BIBAK | Full-Text | 133-154 |
| Dan Moldovan; Marius Pasca; Sanda Harabagiu; Mihai Surdeanu |
| This paper presents an in-depth analysis of a state-of-the-art Question
Answering system. Several scenarios are examined: (1) the performance of each
module in a serial baseline system, (2) the impact of feedback loops and the
insertion of a logic prover, and (3) the impact of various retrieval strategies
and lexical resources. The main conclusion is that the overall performance
depends on the depth of natural language processing resources and the tools
used for answer finding. |
| Keywords: Question answering, natural language applications, performance analysis, text retrieval |
| A hierarchical access control model for video database systems | BIBAK | Full-Text | 155-191 |
| Elisa Bertino; Jianping Fan; Elena Ferrari; Mohand-Said Hacid; Ahmed K. Elmagarmid; Xingquan Zhu |
| Content-based video database access control is becoming very important, but
it depends on progress on the following related research issues: (a)
efficient video analysis for supporting semantic visual concept representation;
(b) effective video database indexing structure; (c) the development of
suitable video database models; and (d) the development of access control
models tailored to the characteristics of video data. In this paper, we propose
a novel approach to support multilevel access control in video databases. Our
access control technique combines a video database indexing mechanism with a
hierarchical organization of visual concepts (i.e., video database indexing
units), so that different classes of users can access different video elements
or even the same video element with different quality levels according to their
permissions. These video elements, which, in our access control mechanism, are
used for specifying the authorization objects, can be a semantic cluster, a
subcluster, a video scene, a video shot, a video frame, or even a salient
object (i.e., region of interest). In the paper, we first introduce our
techniques for obtaining these multilevel video access units. We also propose a
hierarchical video database indexing technique to support our multilevel video
access control mechanism. Then, we present an innovative access control model
which is able to support flexible multilevel access control to video elements.
Moreover, the application of our multilevel video database modeling,
representation, and indexing for MPEG-7 is discussed. |
| Keywords: Video database models, access control, indexing schemes |
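A minimal sketch of one way a hierarchical check like this could work, assuming a toy element hierarchy and per-user grants; the element names, quality levels, and propagation rule are illustrative, not the article's model:

```python
# Hypothetical sketch of a multilevel check: an authorization granted on a
# node of the video indexing hierarchy (cluster, subcluster, scene, shot,
# frame, salient object) implicitly covers its descendants.

PARENT = {"object_7": "frame_12", "frame_12": "shot_3", "shot_3": "scene_1",
          "scene_1": "subcluster_A", "subcluster_A": "cluster_news",
          "cluster_news": None}

# user -> {granted element: maximum quality level allowed}
GRANTS = {"alice": {"cluster_news": "full"}, "bob": {"shot_3": "preview"}}

def permitted_quality(user, element):
    """Walk up the hierarchy; the first granted ancestor decides the quality."""
    grants = GRANTS.get(user, {})
    node = element
    while node is not None:
        if node in grants:
            return grants[node]
        node = PARENT.get(node)
    return None  # no authorization on any enclosing element

print(permitted_quality("alice", "frame_12"))  # 'full'
print(permitted_quality("bob", "object_7"))    # 'preview'
print(permitted_quality("bob", "scene_1"))     # None
```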
| Region proximity in metric spaces and its use for approximate similarity search | BIBAK | Full-Text | 192-227 |
| Giuseppe Amato; Fausto Rabitti; Pasquale Savino; Pavel Zezula |
| Similarity search structures for metric data typically bound object
partitions by ball regions. Since regions can overlap, a relevant issue is to
estimate the proximity of regions in order to predict the number of objects in
the regions' intersection. This paper analyzes the problem using a
probabilistic approach and provides a solution that effectively computes the
proximity through realistic heuristics that only require small amounts of
auxiliary data. An extensive simulation to validate the technique is provided.
An application is developed to demonstrate how the proximity measure can be
successfully applied to approximate similarity search. Search speedup is
achieved by ignoring data regions whose proximity to the query region is
smaller than a user-defined threshold. This idea is implemented in a metric
tree environment for similarity range and "nearest neighbor" queries.
Several measures of efficiency and effectiveness are applied to evaluate
proposed approximate search algorithms on real-life data sets. An analytical
model is developed to relate proximity parameters and the quality of search.
Improvements of two orders of magnitude are achieved for moderately
approximated search results. We demonstrate that the precision of proximity
measures can significantly influence the quality of the approximate algorithms. |
| Keywords: Approximation algorithms, approximate similarity search, metric data, metric trees, performance evaluation |
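A minimal sketch of the pruning idea, assuming ball regions in Euclidean space and using a crude geometric overlap ratio as a stand-in for the article's probabilistic proximity estimate:

```python
# Hypothetical sketch of approximate range search over ball regions: skip any
# region whose (crudely estimated) proximity to the query ball falls below a
# user-defined threshold. The article's proximity is a probabilistic estimate;
# the overlap ratio below is only a geometric stand-in.

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def proximity(center, radius, q, r):
    """Crude proxy: how much the two balls overlap along the center line."""
    overlap = (radius + r) - dist(center, q)
    return max(0.0, min(1.0, overlap / (2 * min(radius, r))))

def approx_range_search(regions, q, r, threshold):
    """regions: list of (center, radius, objects). Returns matches, possibly
    missing some because low-proximity regions are never opened."""
    hits = []
    for center, radius, objects in regions:
        if proximity(center, radius, q, r) < threshold:
            continue  # approximation: ignore this region entirely
        hits.extend(o for o in objects if dist(o, q) <= r)
    return hits

regions = [((0.0, 0.0), 1.0, [(0.1, 0.2), (0.8, 0.1)]),
           ((5.0, 5.0), 1.0, [(5.2, 5.1)])]
print(approx_range_search(regions, (0.0, 0.0), 0.5, threshold=0.2))
```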
| The use of dynamic contexts to improve casual Internet searching | BIBAK | Full-Text | 229-253 |
| Gondy Leroy; Ann M. Lally; Hsinchun Chen |
| Research has shown that most users' online information searches are
suboptimal. Query optimization based on a relevance feedback or genetic
algorithm using dynamic query contexts can help casual users search the
Internet. These algorithms can draw on implicit user feedback based on the
surrounding links and text in a search engine result set to expand user queries
with a variable number of keywords in two ways. Positive expansion adds
terms to a user's keywords with a Boolean "and"; negative expansion adds terms
to the user's keywords with a Boolean "not." Each algorithm was examined for
three user groups, high, middle, and low achievers, who were classified
according to their overall performance. The interactions of users with
different levels of expertise with different expansion types or algorithms were
evaluated. The genetic algorithm with negative expansion tripled recall and
doubled precision for low achievers, but high achievers displayed the opposite
trend and seemed to be hindered in this condition. The effect of other
conditions was less substantial. |
| Keywords: Information retrieval, Internet, automatic query expansion, genetic algorithm, implicit user feedback, personalization, relevance feedback |
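A minimal sketch of the two expansion modes, assuming the context terms have already been extracted from the links and text surrounding the result set:

```python
# Minimal sketch of the two expansion modes described above. In the real
# system the expansion terms come from implicit feedback (links and text
# surrounding results); here they are passed in directly.

def positive_expansion(user_keywords, context_terms):
    """Conjoin context terms with the user's keywords (Boolean AND)."""
    return " AND ".join(user_keywords + context_terms)

def negative_expansion(user_keywords, context_terms):
    """Exclude context terms from the user's keywords (Boolean NOT)."""
    query = " AND ".join(user_keywords)
    return query + "".join(f" NOT {t}" for t in context_terms)

print(positive_expansion(["jaguar"], ["cars", "dealer"]))
# jaguar AND cars AND dealer
print(negative_expansion(["jaguar"], ["cars", "dealer"]))
# jaguar NOT cars NOT dealer
```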
| Logical and physical design issues for smart card databases | BIBAK | Full-Text | 254-285 |
| Cristiana Bolchini; Fabio Salice; Fabio A. Schreiber; Letizia Tanca |
| The design of very small databases for smart cards and for portable embedded
systems is deeply constrained by the peculiar features of the physical medium.
We propose a joint approach to the logical and physical database design phases
and evaluate several data structures with respect to the performance, power
consumption, and endurance parameters of read/program operations on the
Flash-EEPROM storage medium. |
| Keywords: Design methodology, access methods, data structures, flash memory, personal information systems, smart card |
| Query-independent evidence in home page finding | BIBAK | Full-Text | 286-313 |
| Trystan Upstill; Nick Craswell; David Hawking |
| Hyperlink recommendation evidence, that is, evidence based on the structure
of a web's link graph, is widely exploited by commercial Web search systems.
However, there is little published work to support its popularity. Another form
of query-independent evidence, URL-type, has been shown to be beneficial on a
home page finding task. We compared the usefulness of these types of evidence
on the home page finding task, combined with both content and anchor text
baselines. Our experiments made use of five query sets spanning three corpora
-- one enterprise crawl, and the WT10g and VLC2 Web test collections. We found
that, in optimal conditions, all of the query-independent methods studied
(in-degree, URL-type, and two variants of PageRank) offered a better than
random improvement on a content-only baseline. However, only URL-type offered a
better than random improvement on an anchor text baseline. In realistic
settings, for either baseline, only URL-type offered consistent gains. In
combination with URL-type the anchor text baseline was more useful for finding
popular home pages, but URL-type with content was more useful for finding
randomly selected home pages. We conclude that a general home page finding
system should combine evidence from document content, anchor text, and URL-type
classification. |
| Keywords: Web information retrieval, citation and link analysis, connectivity |
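A minimal sketch of query-independent URL-type evidence, assuming a root/subroot/path/file style classification and made-up prior weights; the article's actual classification rules and combination method may differ:

```python
from urllib.parse import urlsplit

# Hypothetical sketch of URL-type evidence for home-page finding: URLs that
# look like site roots get a larger query-independent prior. The four classes
# and the prior values below are illustrative assumptions.

PRIOR = {"root": 1.0, "subroot": 0.5, "path": 0.2, "file": 0.05}

def url_type(url):
    path = urlsplit(url).path
    trimmed = path[:-len("index.html")] if path.endswith("index.html") else path
    segments = [s for s in trimmed.split("/") if s]
    if not segments:
        return "root"
    if "." in segments[-1]:
        return "file"
    return "subroot" if len(segments) == 1 else "path"

def rerank(scored_urls):
    """Combine a content/anchor-text score with the URL-type prior."""
    return sorted(((score * PRIOR[url_type(u)], u) for score, u in scored_urls),
                  reverse=True)

results = [(0.9, "http://example.org/people/smith/papers/p1.pdf"),
           (0.6, "http://example.org/~smith/"),
           (0.5, "http://example.org/")]
print(rerank(results))   # the site root rises to the top
```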
| Measuring praise and criticism: Inference of semantic orientation from association | BIBAK | Full-Text | 315-346 |
| Peter D. Turney; Michael L. Littman |
| The evaluative character of a word is called its semantic orientation.
Positive semantic orientation indicates praise (e.g., "honest", "intrepid") and
negative semantic orientation indicates criticism (e.g., "disturbing",
"superfluous"). Semantic orientation varies in both direction (positive or
negative) and degree (mild to strong). An automated system for measuring
semantic orientation would have application in text classification, text
filtering, tracking opinions in online discussions, analysis of survey
responses, and automated chat systems (chatbots). This article introduces a
method for inferring the semantic orientation of a word from its statistical
association with a set of positive and negative paradigm words. Two instances
of this approach are evaluated, based on two different statistical measures of
word association: pointwise mutual information (PMI) and latent semantic
analysis (LSA). The method is experimentally tested with 3,596 words (including
adjectives, adverbs, nouns, and verbs) that have been manually labeled positive
(1,614 words) and negative (1,982 words). The method attains an accuracy of
82.8% on the full test set, but the accuracy rises above 95% when the algorithm
is allowed to abstain from classifying mild words. |
| Keywords: latent semantic analysis, mutual information, semantic association, semantic orientation, text classification, text mining, unsupervised learning, web mining |
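A minimal sketch of the PMI-based instance, assuming toy co-occurrence counts and shortened paradigm word lists; the article estimates association from much larger corpora and uses longer paradigm lists:

```python
from math import log2

# Minimal sketch of semantic orientation from association: a word's
# orientation is its total PMI with positive paradigm words minus its total
# PMI with negative paradigm words. Counts below are a toy stand-in.

POSITIVE = ["good", "nice", "excellent"]
NEGATIVE = ["bad", "nasty", "poor"]

def pmi(word, paradigm, cooc, count, total):
    """Pointwise mutual information from raw counts (with light smoothing)."""
    p_joint = (cooc.get((word, paradigm), 0) + 0.01) / total
    p_word = count.get(word, 0.01) / total
    p_para = count.get(paradigm, 0.01) / total
    return log2(p_joint / (p_word * p_para))

def semantic_orientation(word, cooc, count, total):
    pos = sum(pmi(word, p, cooc, count, total) for p in POSITIVE)
    neg = sum(pmi(word, n, cooc, count, total) for n in NEGATIVE)
    return pos - neg   # > 0 suggests praise, < 0 suggests criticism

# Toy statistics: "honest" co-occurs mostly with positive paradigm words.
count = {"honest": 50, "good": 400, "nice": 300, "excellent": 120,
         "bad": 380, "nasty": 90, "poor": 200}
cooc = {("honest", "good"): 30, ("honest", "nice"): 15,
        ("honest", "excellent"): 10, ("honest", "bad"): 2, ("honest", "poor"): 1}
print(semantic_orientation("honest", cooc, count, total=100_000))  # positive
```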
| MEGA -- the maximizing expected generalization algorithm for learning complex query concepts | BIBAK | Full-Text | 347-382 |
| Edward Chang; Beitao Li |
| Specifying exact query concepts has become increasingly challenging to
end-users. This is because many query concepts (e.g., those for looking up a
multimedia object) can be hard to articulate, and articulation can be
subjective. In this study, we propose a query-concept learner that learns query
criteria through an intelligent sampling process. Our concept learner aims to
fulfill two primary design objectives: (1) it has to be expressive in order to
model most practical query concepts and (2) it must learn a concept quickly and
with a small amount of labeled data, since online users tend to be too impatient
to provide much feedback. To fulfill the first goal, we model query concepts in
k-CNF, which can express almost all practical query concepts. To fulfill the
second design goal, we propose our maximizing expected generalization algorithm
(MEGA), which converges to target concepts quickly by its two complementary
steps: sample selection and concept refinement. We also propose a
divide-and-conquer method that divides the concept-learning task into G
subtasks to achieve speedup. We notice that a task must be divided carefully,
or search accuracy may suffer. Through analysis and mining results, we observe
that organizing image features in a multiresolution manner, and minimizing
intragroup feature correlation, can speed up query-concept learning
substantially while maintaining high search accuracy. Through examples,
analysis, experiments, and a prototype implementation, we show that MEGA
converges to query concepts significantly faster than traditional methods. |
| Keywords: Active learning, data mining, query concept, relevance feedback |
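A minimal sketch of refining a k-CNF hypothesis from positively labeled samples by clause elimination, with illustrative image predicates; the actual MEGA couples this refinement with its sample-selection step and also exploits negative feedback:

```python
from itertools import combinations

# Minimal sketch of learning a query concept expressed in k-CNF by
# elimination: start from every disjunction ("clause") of at most k
# predicates and drop the clauses a positively labeled sample fails to
# satisfy. Predicates below are illustrative assumptions.

PREDICATES = ["bright", "warm_tone", "has_faces", "outdoor"]
K = 2

def initial_concept(predicates=PREDICATES, k=K):
    """Start from all clauses with up to k predicates."""
    return [frozenset(c) for r in range(1, k + 1)
            for c in combinations(predicates, r)]

def satisfies(sample, clause):
    """A clause (disjunction) holds if any of its predicates is true."""
    return any(sample[p] for p in clause)

def refine(concept, positive_sample):
    """Keep only the clauses the positive sample satisfies."""
    return [c for c in concept if satisfies(positive_sample, c)]

concept = initial_concept()
for sample in [{"bright": 1, "warm_tone": 1, "has_faces": 0, "outdoor": 1},
               {"bright": 1, "warm_tone": 0, "has_faces": 0, "outdoor": 1}]:
    concept = refine(concept, sample)

print([tuple(sorted(c)) for c in concept])
# Every remaining clause mentions 'bright' or 'outdoor'.
```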
| Coverage, relevance, and ranking: The impact of query operators on Web search engine results | BIBAK | Full-Text | 383-411 |
| Caroline M. Eastman; Bernard J. Jansen |
| Research has reported that about 10% of Web searchers utilize advanced query
operators, with the other 90% using extremely simple queries. It is often
assumed that the use of query operators, such as Boolean operators and phrase
searching, improves the effectiveness of Web searching. We test this assumption
by examining the effects of query operators on the performance of three major
Web search engines. We selected one hundred queries from the transaction log of
a Web search service. Each of these original queries contained query operators
such as AND, OR, MUST APPEAR (+), or PHRASE (" "). We then removed the
operators from these one hundred advanced queries. We submitted both the
original and modified queries to three major Web search engines; a total of 600
queries were submitted and 5,748 documents evaluated. We compared the results
from the original queries with the operators to the results from the modified
queries without the operators. We examined the results for changes in coverage,
relative precision, and ranking of relevant documents. The use of most query
operators had no significant effect on coverage, relative precision, or
ranking, although the effect varied depending on the search engine. We discuss
implications for the effectiveness of searching techniques as currently taught,
for future information retrieval system design, and for future research. |
| Keywords: Boolean operators, Relative precision, Web results, coverage, query operators, ranking, search engines |
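A minimal sketch of the manipulation described above, turning a logged advanced query into its operator-free counterpart; the operator syntax handled here is illustrative, since each engine accepts slightly different forms:

```python
import re

# Sketch of stripping advanced operators from a logged query to obtain its
# "simple" counterpart, as in the experiment described above.

def strip_operators(query):
    """Remove PHRASE quotes, MUST APPEAR plus signs, grouping parentheses,
    and Boolean AND/OR, leaving plain keywords."""
    q = re.sub(r'["+()]', ' ', query)
    q = re.sub(r'\b(AND|OR)\b', ' ', q)
    return ' '.join(q.split())

print(strip_operators('+"thai food" AND (recipes OR restaurants)'))
# -> thai food recipes restaurants
```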
| Comparing the performance of collection selection algorithms | BIBAK | Full-Text | 412-456 |
| Allison L. Powell; James C. French |
| The proliferation of online information resources increases the importance
of effective and efficient information retrieval in a multicollection
environment. Multicollection searching is cast in three parts: collection
selection (also referred to as database selection), query processing and
results merging. In this work, we focus our attention on the evaluation of the
first step, collection selection. In this article, we present a detailed
discussion of the methodology that we used to evaluate and compare collection
selection approaches, covering both test environments and evaluation measures.
We compare the CORI, CVV and gGLOSS collection selection approaches using six
test environments utilizing three document testbeds. We note similar trends in
performance among the collection selection approaches, but the CORI approach
consistently outperforms the other approaches, suggesting that effective
collection selection can be achieved using limited information about each
collection. The contributions of this work are both the assembled evaluation
methodology and the application of that methodology to compare
collection selection approaches in a standardized environment. |
| Keywords: Collection selection, database selection, distributed information retrieval, distributed text retrieval, metasearch engine, resource discovery, resource ranking, resource selection, server ranking, server selection, text retrieval |
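A hedged sketch of CORI-style collection scoring in the form it is commonly reported (document-frequency and inverse-collection-frequency beliefs averaged over query terms); the constants and toy statistics below are assumptions, and the article's testbeds and evaluation setup are not reproduced here:

```python
from math import log

# Sketch of CORI-style ranking of collections for a query, as commonly
# reported in the distributed IR literature. Constants and statistics are
# illustrative assumptions.

def cori_score(query_terms, coll, all_colls, b=0.4):
    """coll: dict with 'df' (term -> document frequency) and 'cw' (word count)."""
    n = len(all_colls)
    avg_cw = sum(c["cw"] for c in all_colls) / n
    score = 0.0
    for t in query_terms:
        df = coll["df"].get(t, 0)
        cf = sum(1 for c in all_colls if c["df"].get(t, 0) > 0)  # #collections with t
        T = df / (df + 50 + 150 * coll["cw"] / avg_cw)
        I = log((n + 0.5) / max(cf, 1)) / log(n + 1.0)
        score += b + (1 - b) * T * I
    return score / len(query_terms)

colls = [{"df": {"jaguar": 120, "car": 900}, "cw": 2_000_000},
         {"df": {"jaguar": 3}, "cw": 1_500_000}]
ranked = sorted(colls, key=lambda c: cori_score(["jaguar", "car"], c, colls),
                reverse=True)
print(ranked[0]["df"])   # the first collection ranks higher for this query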
| A semisupervised learning method to merge search engine results | BIBAK | Full-Text | 457-491 |
| Luo Si; Jamie Callan |
| The proliferation of searchable text databases on local area networks and
the Internet gives rise to the problem of finding information that may be distributed
among many disjoint text databases (distributed information retrieval). How to
merge the results returned by selected databases is an important subproblem of
the distributed information retrieval task. Previous research assumed that
either resource providers cooperate to provide normalizing statistics or search
clients download all retrieved documents and compute normalized scores without
cooperation from resource providers. This article presents a semisupervised
learning solution to the result merging problem. The key contribution is the
observation that information used to create resource descriptions for resource
selection can also be used to create a centralized sample database to guide the
normalization of document scores returned by different databases. At retrieval
time, the query is sent to the selected databases, which return
database-specific document scores, and to a centralized sample database, which
returns database-independent document scores. Documents that have both a
database-specific score and a database-independent score serve as training data
for learning to normalize the scores of other documents. An extensive set of
experiments demonstrates that this method is more effective than the well-known
CORI result-merging algorithm under a variety of conditions. |
| Keywords: Distributed information retrieval, resource ranking, resource selection, results merging, semisupervised learning method, server selection |
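A minimal sketch of the merging idea, assuming a simple per-database linear mapping fitted by least squares on the overlap documents; the article's actual model and training details may differ:

```python
# Minimal sketch: documents that receive both a database-specific score and a
# score from the centralized sample database act as training pairs for a
# per-database linear mapping; the learned mapping then normalizes every
# score that database returned, so results can be merged into one ranking.

def fit_linear(pairs):
    """Least-squares fit of y ~ a*x + b from (x, y) training pairs."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs); sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs); sxy = sum(x * y for x, y in pairs)
    denom = n * sxx - sx * sx
    a = (n * sxy - sx * sy) / denom if denom else 1.0
    b = (sy - a * sx) / n
    return a, b

def merge(per_db_scores, sample_scores):
    """per_db_scores: {db: {doc: database-specific score}};
    sample_scores: {doc: centralized-sample score}. Returns one merged ranking."""
    merged = {}
    for db, scores in per_db_scores.items():
        overlap = [(s, sample_scores[d]) for d, s in scores.items()
                   if d in sample_scores]
        a, b = fit_linear(overlap) if len(overlap) >= 2 else (1.0, 0.0)
        for d, s in scores.items():
            merged[d] = a * s + b
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

per_db = {"dbA": {"a1": 12.0, "a2": 9.0, "a3": 7.5},
          "dbB": {"b1": 0.83, "b2": 0.55}}
sample = {"a1": 0.9, "a3": 0.4, "b1": 0.7, "b2": 0.3}
print(merge(per_db, sample))   # scores from dbA and dbB on a common scale
```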