HCI Bibliography Home | HCI Journals | About TOIS | Journal Info | TOIS Journal Volumes | Detailed Records | RefWorks | EndNote | Hide Abstracts
TOIS Tables of Contents: 111213141516171819202122232425262728293031

ACM Transactions on Information Systems 21

Editors:W. Bruce Croft
Standard No:ISSN 1046-8188; HF S548.125 A33
Links:Table of Contents
  1. TOIS 2003 Volume 21 Issue 1
  2. TOIS 2003 Volume 21 Issue 2
  3. TOIS 2003 Volume 21 Issue 3
  4. TOIS 2003 Volume 21 Issue 4

TOIS 2003 Volume 21 Issue 1

QProber: A system for automatic classification of hidden-Web databases BIBAKFull-Text 1-41
  Luis Gravano; Panagiotis G. Ipeirotis; Mehran Sahami
The contents of many valuable Web-accessible databases are only available through search interfaces and are hence invisible to traditional Web "crawlers." Recently, commercial Web sites have started to manually organize Web-accessible databases into Yahoo!-like hierarchical classification schemes. Here we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred Web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.
Keywords: Database classification, Web databases, hidden Web
Local versus global link information in the Web BIBAKFull-Text 42-63
  Pavel Calado; Berthier Ribeiro-Neto; Nivio Ziviani; Edleno Moura; Ilmerio Silva
Information derived from the cross-references among the documents in a hyperlinked environment, usually referred to as link information, is considered important since it can be used to effectively improve document retrieval. Depending on the retrieval strategy, link information can be local or global. Local link information is derived from the set of documents returned as answers to the current user query. Global link information is derived from all the documents in the collection. In this work, we investigate how the use of local link information compares to the use of global link information. For the comparison, we run a series of experiments using a large document collection extracted from the Web. For our reference collection, the results indicate that the use of local link information improves precision by 74%. When global link information is used, precision improves by 35%. However, when only the first 10 documents in the ranking are considered, the average gain in precision obtained with the use of global link information is higher than the gain obtained with the use of local link information. This is an interesting result since it provides insight and justification for the use of global link information in major Web search engines, where users are mostly interested in the first 10 answers. Furthermore, global information can be computed in the background, which allows speeding up query processing.
Keywords: Belief networks, World Wide Web, link analysis, local and global information
Exploiting hierarchical domain structure to compute similarity BIBAKFull-Text 64-93
  Prasanna Ganesan; Hector Garcia-Molina; Jennifer Widom
The notion of similarity between objects finds use in many contexts, for example, in search engines, collaborative filtering, and clustering. Objects being compared often are modeled as sets, with their similarity traditionally determined based on set intersection. Intersection-based measures do not accurately capture similarity in certain domains, such as when the data is sparse or when there are known relationships between items within sets. We propose new measures that exploit a hierarchical domain structure in order to produce more intuitive similarity scores. We extend our similarity measures to provide appropriate results in the presence of multisets (also handled unsatisfactorily by traditional measures), for example, to correctly compute the similarity between customers who buy several instances of the same product (say milk), or who buy several products in the same category (say dairy products). We also provide an experimental comparison of our measures against traditional similarity measures, and report on a user study that evaluated how well our measures match human intuition.
Keywords: Similarity measures, collaborative filtering, data mining, hierarchy
Early user-system interaction for database selection in massive domain-specific online environments BIBAKFull-Text 94-131
  Jack G. Conrad; Joanne R. S. Claussen
The continued growth of very large data environments such as Westlaw and Dialog, in addition to the World Wide Web, increases the importance of effective and efficient database selection and searching. Current research focuses largely on completely autonomous and automatic selection, searching, and results merging in distributed environments. This fully automatic approach has significant deficiencies, including reliance upon thresholds below which databases with relevant documents are not searched (compromised recall). It also merges documents, often from disparate data sources that users may have discarded before their source selection task proceeded (diluted precision). We examine the impact that early user interaction can have on the process of database selection. After analyzing thousands of real user queries, we show that precision can be significantly increased when queries are categorized by the users themselves, then handled effectively by the system. Such query categorization strategies may eliminate limitations of fully automated query processing approaches. Our system harnesses the WIN search engine, a sibling to INQUERY, run against one or more authority sources when search is required. We compare our approach to one that does not recognize or utilize distinct features associated with user queries. We show that by avoiding a one-size-fits-all approach that restricts the role users can play in information discovery, database selection effectiveness can be appreciably improved.
Keywords: Database selection, metadata for retrieval, structuring information to aid search and navigation, user interaction

TOIS 2003 Volume 21 Issue 2

Performance issues and error analysis in an open-domain question answering system BIBAKFull-Text 133-154
  Dan Moldovan; Marius Pasca; Sanda Harabagiu; Mihai Surdeanu
This paper presents an in-depth analysis of a state-of-the-art Question Answering system. Several scenarios are examined: (1) the performance of each module in a serial baseline system, (2) the impact of feedbacks and the insertion of a logic prover, and (3) the impact of various retrieval strategies and lexical resources. The main conclusion is that the overall performance depends on the depth of natural language processing resources and the tools used for answer finding.
Keywords: Question answering, natural language applications, performance analysis, text retrieval
A hierarchical access control model for video database systems BIBAKFull-Text 155-191
  Elisa Bertino; Jianping Fan; Elena Ferrari; Mohand-Said Hacid; Ahmed K. Elmagarmid; Xingquan Zhu
Content-based video database access control is becoming very important, but it depends on the progresses of the following related research issues: (a) efficient video analysis for supporting semantic visual concept representation; (b) effective video database indexing structure; (c) the development of suitable video database models; and (d) the development of access control models tailored to the characteristics of video data. In this paper, we propose a novel approach to support multilevel access control in video databases. Our access control technique combines a video database indexing mechanism with a hierarchical organization of visual concepts (i.e., video database indexing units), so that different classes of users can access different video elements or even the same video element with different quality levels according to their permissions. These video elements, which, in our access control mechanism, are used for specifying the authorization objects, can be a semantic cluster, a subcluster, a video scene, a video shot, a video frame, or even a salient object (i.e., region of interest). In the paper, we first introduce our techniques for obtaining these multilevel video access units. We also propose a hierarchical video database indexing technique to support our multilevel video access control mechanism. Then, we present an innovative access control model which is able to support flexible multilevel access control to video elements. Moreover, the application of our multilevel video database modeling, representation, and indexing for MPEG-7 is discussed.
Keywords: Video database models, access control, indexing schemes
Region proximity in metric spaces and its use for approximate similarity search BIBAKFull-Text 192-227
  Giuseppe Amato; Fausto Rabitti; Pasquale Savino; Pavel Zezula
Similarity search structures for metric data typically bound object partitions by ball regions. Since regions can overlap, a relevant issue is to estimate the proximity of regions in order to predict the number of objects in the regions' intersection. This paper analyzes the problem using a probabilistic approach and provides a solution that effectively computes the proximity through realistic heuristics that only require small amounts of auxiliary data. An extensive simulation to validate the technique is provided. An application is developed to demonstrate how the proximity measure can be successfully applied to the approximate similarity search. Search speedup is achieved by ignoring data regions whose proximity to the query region is smaller than a user-defined threshold. This idea is implemented in a metric tree environment for the similarity range and "nearest neighbors" queries. Several measures of efficiency and effectiveness are applied to evaluate proposed approximate search algorithms on real-life data sets. An analytical model is developed to relate proximity parameters and the quality of search. Improvements of two orders of magnitude are achieved for moderately approximated search results. We demonstrate that the precision of proximity measures can significantly influence the quality of approximated algorithms.
Keywords: Approximation algorithms, approximate similarity search, metric data, metric trees, performance evaluation

TOIS 2003 Volume 21 Issue 3

The use of dynamic contexts to improve casual internet searching BIBAKFull-Text 229-253
  Gondy Leroy; Ann M. Lally; Hsinchun Chen
Research has shown that most users' online information searches are suboptimal. Query optimization based on a relevance feedback or genetic algorithm using dynamic query contexts can help casual users search the Internet. These algorithms can draw on implicit user feedback based on the surrounding links and text in a search engine result set to expand user queries with a variable number of keywords in two manners. Positive expansion adds terms to a user's keywords with a Boolean "and," negative expansion adds terms to the user's keywords with a Boolean "not." Each algorithm was examined for three user groups, high, middle, and low achievers, who were classified according to their overall performance. The interactions of users with different levels of expertise with different expansion types or algorithms were evaluated. The genetic algorithm with negative expansion tripled recall and doubled precision for low achievers, but high achievers displayed an opposed trend and seemed to be hindered in this condition. The effect of other conditions was less substantial.
Keywords: Information retrieval, Internet, automatic query expansion, genetic algorithm, implicit user feedback, personalization, relevance feedback
Logical and physical design issues for smart card databases BIBAKFull-Text 254-285
  Cristiana Bolchini; Fabio Salice; Fabio A. Schreiber; Letizia Tanca
The design of very small databases for smart cards and for portable embedded systems is deeply constrained by the peculiar features of the physical medium. We propose a joint approach to the logical and physical database design phases and evaluate several data structures with respect to the performance, power consumption, and endurance parameters of read/program operations on the Flash-EEPROM storage medium.
Keywords: Design methodology, access methods, data structures, flash memory, personal information systems, smart card
Query-independent evidence in home page finding BIBAKFull-Text 286-313
  Trystan Upstill; Nick Craswell; David Hawking
Hyperlink recommendation evidence, that is, evidence based on the structure of a web's link graph, is widely exploited by commercial Web search systems. However there is little published work to support its popularity. Another form of query-independent evidence, URL-type, has been shown to be beneficial on a home page finding task. We compared the usefulness of these types of evidence on the home page finding task, combined with both content and anchor text baselines. Our experiments made use of five query sets spanning three corpora -- one enterprise crawl, and the WT10g and VLC2 Web test collections. We found that, in optimal conditions, all of the query-independent methods studied (in-degree, URL-type, and two variants of PageRank) offered a better than random improvement on a content-only baseline. However, only URL-type offered a better than random improvement on an anchor text baseline. In realistic settings, for either baseline, only URL-type offered consistent gains. In combination with URL-type the anchor text baseline was more useful for finding popular home pages, but URL-type with content was more useful for finding randomly selected home pages. We conclude that a general home page finding system should combine evidence from document content, anchor text, and URL-type classification.
Keywords: Web information retrieval, citation and link analysis, connectivity

TOIS 2003 Volume 21 Issue 4

Measuring praise and criticism: Inference of semantic orientation from association BIBAKFull-Text 315-346
  Peter D. Turney; Michael L. Littman
The evaluative character of a word is called its semantic orientation. Positive semantic orientation indicates praise (e.g., "honest", "intrepid") and negative semantic orientation indicates criticism (e.g., "disturbing", "superfluous"). Semantic orientation varies in both direction (positive or negative) and degree (mild to strong). An automated system for measuring semantic orientation would have application in text classification, text filtering, tracking opinions in online discussions, analysis of survey responses, and automated chat systems (chatbots). This article introduces a method for inferring the semantic orientation of a word from its statistical association with a set of positive and negative paradigm words. Two instances of this approach are evaluated, based on two different statistical measures of word association: pointwise mutual information (PMI) and latent semantic analysis (LSA). The method is experimentally tested with 3,596 words (including adjectives, adverbs, nouns, and verbs) that have been manually labeled positive (1,614 words) and negative (1,982 words). The method attains an accuracy of 82.8% on the full test set, but the accuracy rises above 95% when the algorithm is allowed to abstain from classifying mild words.
Keywords: latent semantic analysis, mutual information, semantic association, semantic orientation, text classification, text mining, unsupervised learning, web mining
MEGA -- the maximizing expected generalization algorithm for learning complex query concepts BIBAKFull-Text 347-382
  Edward Chang; Beitao Li
Specifying exact query concepts has become increasingly challenging to end-users. This is because many query concepts (e.g., those for looking up a multimedia object) can be hard to articulate, and articulation can be subjective. In this study, we propose a query-concept learner that learns query criteria through an intelligent sampling process. Our concept learner aims to fulfill two primary design objectives: (1) it has to be expressive in order to model most practical query concepts and (2) it must learn a concept quickly and with a small number of labeled data since online users tend to be too impatient to provide much feedback. To fulfill the first goal, we model query concepts in k-CNF, which can express almost all practical query concepts. To fulfill the second design goal, we propose our maximizing expected generalization algorithm (MEGA), which converges to target concepts quickly by its two complementary steps: sample selection and concept refinement. We also propose a divide-and-conquer method that divides the concept-learning task into G subtasks to achieve speedup. We notice that a task must be divided carefully, or search accuracy may suffer. Through analysis and mining results, we observe that organizing image features in a multiresolution manner, and minimizing intragroup feature correlation, can speed up query-concept learning substantially while maintaining high search accuracy. Through examples, analysis, experiments, and a prototype implementation, we show that MEGA converges to query concepts significantly faster than traditional methods.
Keywords: Active learning, data mining, query concept, relevance feedback
Coverage, relevance, and ranking: The impact of query operators on Web search engine results BIBAKFull-Text 383-411
  Caroline M. Eastman; Bernard J. Jansen
Research has reported that about 10% of Web searchers utilize advanced query operators, with the other 90% using extremely simple queries. It is often assumed that the use of query operators, such as Boolean operators and phrase searching, improves the effectiveness of Web searching. We test this assumption by examining the effects of query operators on the performance of three major Web search engines. We selected one hundred queries from the transaction log of a Web search service. Each of these original queries contained query operators such as AND, OR, MUST APPEAR (+), or PHRASE (" "). We then removed the operators from these one hundred advanced queries. We submitted both the original and modified queries to three major Web search engines; a total of 600 queries were submitted and 5,748 documents evaluated. We compared the results from the original queries with the operators to the results from the modified queries without the operators. We examined the results for changes in coverage, relative precision, and ranking of relevant documents. The use of most query operators had no significant effect on coverage, relative precision, or ranking, although the effect varied depending on the search engine. We discuss implications for the effectiveness of searching techniques as currently taught, for future information retrieval system design, and for future research.
Keywords: Boolean operators, Relative precision, Web results, coverage, query operators, ranking, search engines
Comparing the performance of collection selection algorithms BIBAKFull-Text 412-456
  Allison L. Powell; James C. French
The proliferation of online information resources increases the importance of effective and efficient information retrieval in a multicollection environment. Multicollection searching is cast in three parts: collection selection (also referred to as database selection), query processing and results merging. In this work, we focus our attention on the evaluation of the first step, collection selection. In this article, we present a detailed discussion of the methodology that we used to evaluate and compare collection selection approaches, covering both test environments and evaluation measures. We compare the CORI, CVV and gGLOSS collection selection approaches using six test environments utilizing three document testbeds. We note similar trends in performance among the collection selection approaches, but the CORI approach consistently outperforms the other approaches, suggesting that effective collection selection can be achieved using limited information about each collection. The contributions of this work are both the assembled evaluation methodology as well as the application of that methodology to compare collection selection approaches in a standardized environment.
Keywords: Collection selection, database selection, distributed information retrieval, distributed text retrieval, metasearch engine, resource discovery, resource ranking, resource selection, server ranking, server selection, text retrieval
A semisupervised learning method to merge search engine results BIBAKFull-Text 457-491
  Luo Si; Jamie Callan
The proliferation of searchable text databases on local area networks and the Internet causes the problem of finding information that may be distributed among many disjoint text databases (distributed information retrieval). How to merge the results returned by selected databases is an important subproblem of the distributed information retrieval task. Previous research assumed that either resource providers cooperate to provide normalizing statistics or search clients download all retrieved documents and compute normalized scores without cooperation from resource providers. This article presents a semisupervised learning solution to the result merging problem. The key contribution is the observation that information used to create resource descriptions for resource selection can also be used to create a centralized sample database to guide the normalization of document scores returned by different databases. At retrieval time, the query is sent to the selected databases, which return database-specific document scores, and to a centralized sample database, which returns database-independent document scores. Documents that have both a database-specific score and a database-independent score serve as training data for learning to normalize the scores of other documents. An extensive set of experiments demonstrates that this method is more effective than the well-known CORI result-merging algorithm under a variety of conditions.
Keywords: Distributed information retrieval, resource ranking, resource selection, results merging, semisupervised learning method, server selection