The new economy: an engineer's perspective BIBAFull-Text 1
  David Brown
From his twin perspectives as a career-long telecommunications engineer and Chairman of one of the UK's largest electronics companies, Sir David Brown will reflect on whether and when the New Economy, seemingly so long coming, will finally arrive. He will begin by exploring how the prospect of everything being digital; everyone having broadband; and intelligence being everywhere is changing our understanding of mobility. Then he will comment on the economic effects of that changed understanding under three headings -- the macroeconomy, microeconomy and socioeconomy -- before suggesting the criteria we might use to decide when the New Economy has arrived.


Position paper: a comparison of two modelling paradigms in the Semantic Web BIBAKFull-Text 3-12
  Peter F. Patel-Schneider; Ian Horrocks
Classical logics and Datalog-related logics have both been proposed as underlying formalisms for the Semantic Web. Although these two different formalism groups have some commonalities, and look similar in the context of expressively-impoverished languages like RDF, their differences become apparent at more expressive language levels. After considering some of these differences, we argue that, although some of the characteristics of Datalog have their utility, the open environment of the Semantic Web is better served by standard logics.
Keywords: Semantic Web, modelling, philosophical foundations, representation
Web ontology segmentation: analysis, classification and use BIBAKFull-Text 13-22
  Julian Seidenberg; Alan Rector
Ontologies are at the heart of the semantic web. They define the concepts and relationships that make global interoperability possible. However, as these ontologies grow in size they become more and more difficult to create, use, understand, maintain, transform and classify. We present and evaluate several algorithms for extracting relevant segments out of large description logic ontologies for the purposes of increasing tractability for both humans and computers. The segments are not mere fragments, but stand alone as ontologies in their own right. This technique takes advantage of the detailed semantics captured within an OWL ontology to produce highly relevant segments. The research was evaluated using the GALEN ontology of medical terms and procedures.
Keywords: OWL, Semantic Web, ontology, scalability, segmentation
Constructing virtual documents for ontology matching BIBAKFull-Text 23-31
  Yuzhong Qu; Wei Hu; Gong Cheng
In investigating the linguistic techniques used in ontology matching, we propose virtual documents as a cost-effective approach to linguistic matching. A virtual document is a collection of weighted words: the virtual document of a URIref declared in an ontology contains not only its local descriptions but also neighboring information, reflecting the intended meaning of the URIref. Document similarity can be computed by traditional vector space techniques and then used in similarity-based approaches to ontology matching. In particular, the RDF graph structure is exploited to define the description formulations and the neighboring operations. Experimental results show that linguistic matching based on virtual documents achieves a higher average F-Measure than three other approaches. Our experiments also demonstrate that the virtual documents approach is cost-effective compared to other linguistic matching approaches.
Keywords: description, formulation, linguistic matching, neighboring operation, ontology matching, vector space model
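The vector space comparison described in the abstract above can be illustrated with a minimal sketch. It is not the authors' formulation: the function names, the single `neighbor_weight` parameter, and the word lists are assumptions chosen for illustration. It only shows how a weighted bag of words built from local and neighboring descriptions can be compared by cosine similarity.

```python
import math
from collections import Counter

def virtual_document(local_words, neighbor_words, neighbor_weight=0.5):
    """Build a weighted bag of words (hypothetical weighting scheme).

    Local description words count fully; words taken from neighboring
    nodes are down-weighted by neighbor_weight (an assumed parameter).
    """
    doc = Counter()
    for word in local_words:
        doc[word] += 1.0
    for word in neighbor_words:
        doc[word] += neighbor_weight
    return doc

def cosine(a, b):
    """Cosine similarity between two weighted bags of words."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two hypothetical URIrefs from different ontologies.
author_1 = virtual_document(["author", "writer"], ["book", "publication"])
author_2 = virtual_document(["author"], ["book", "paper"])
unrelated = virtual_document(["engine"], ["vehicle"])

similarity_related = cosine(author_1, author_2)
similarity_unrelated = cosine(author_1, unrelated)
```

In this toy setting the two author documents share words and score above zero, while the unrelated pair shares none and scores zero.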

Adaptivity & mobility

Browsing on small screens: recasting web-page segmentation into an efficient machine learning framework BIBAKFull-Text 33-42
  Shumeet Baluja
Fitting enough information from webpages to make browsing on small screens compelling is a challenging task. One approach is to present the user with a thumbnail image of the full web page and allow the user to simply press a single key to zoom into a region (which may then be transcoded into wml/xhtml, summarized, etc). However, if regions for zooming are presented naively, this yields a frustrating experience because of the number of coherent regions, sentences, images, and words that may be inadvertently separated. Here, we cast the web page segmentation problem into a machine learning framework, where we re-examine this task through the lens of entropy reduction and decision tree learning. This yields an efficient and effective page segmentation algorithm. We demonstrate how simple techniques from computer vision can be used to fine-tune the results. The resulting segmentation keeps coherent regions together when tested on a broad set of complex webpages.
Keywords: browser, machine learning, mobile browsing, mobile devices, small screen, thumbnail browsing, web page segmentation
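As a rough illustration of the entropy-reduction idea named in the abstract above, the sketch below picks the single cut position in a sequence of labeled page elements that most reduces label entropy, as a decision tree split would. The labels, the one-dimensional layout, and the function names are assumptions made for illustration; the paper's actual features and algorithm are not given in the abstract.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(labels):
    """Return (index, gain) for the cut that most reduces entropy.

    A cut at index i separates labels[:i] from labels[i:].
    Returns (None, 0.0) when no cut reduces entropy.
    """
    n = len(labels)
    base = entropy(labels)
    best_index, best_gain = None, 0.0
    for i in range(1, n):
        left, right = labels[:i], labels[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        gain = base - weighted
        if gain > best_gain:
            best_index, best_gain = i, gain
    return best_index, best_gain

# Hypothetical element labels read top to bottom along one page axis.
rows = ["nav", "nav", "nav", "body", "body"]
cut_index, cut_gain = best_cut(rows)
```

Here the best cut falls between the navigation rows and the body rows, which keeps each coherent region on one side of the split.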
Image classification for mobile web browsing BIBAKFull-Text 43-52
  Takuya Maekawa; Takahiro Hara; Shojiro Nishio
It is difficult for users of mobile devices such as cellular phones equipped with a small screen and a poor input interface to browse Web pages designed for desktop PCs with large displays. Many studies and commercial products have tried to solve this problem. Web pages include images that have various roles such as site menus, line headers for itemization, and page titles. However, most studies of mobile Web browsing haven't paid much attention to the roles of Web images. In this paper, we define eleven Web image categories according to their roles and use these categories for proper Web image handling. We manually categorized 3,901 Web images collected from forty Web sites and extracted image features of each category according to the classification. By making use of the extracted features, we devised an automatic Web image classification method. Furthermore, we evaluated the automatic classification of real Web pages and achieved up to 83.1% classification accuracy. We also implemented an automatic Web page scrolling system as an application of our automatic image classification method.
Keywords: mobile computing, web browsing, web images
Fine grained content-based adaptation mechanism for providing high end-user quality of experience with adaptive hypermedia systems BIBAKFull-Text 53-62
  Cristina Hava Muntean; Jennifer McManis
New communication technologies can enable Web users to access personalised information "anytime, anywhere". However, the network environments allowing this "anytime, anywhere" access may have widely varying performance characteristics such as bandwidth, level of congestion, mobility support, and cost of transmission. It is unrealistic to expect that the quality of delivery of the same content can be maintained in this variable environment, but rather an effort must be made to fit the content served to the current delivery conditions, thus ensuring high Quality of Experience (QoE) to the users. This paper introduces an end-user QoE-aware adaptive hypermedia framework that extends the adaptation functionality of adaptive hypermedia systems with a fine-grained content-based adaptation mechanism. The proposed mechanism attempts to take into account multiple factors affecting QoE in relation to the delivery of Web content. Various simulation tests investigate the performance improvements provided by this mechanism, in a home-like, low bit rate operational environment, in terms of access time per page, aggregate access time per browsing session and quantity of transmitted information.
Keywords: adaptive hypermedia, content-based adaptation mechanism, distance education, end-user quality of experience

Fighting search spam

Topical TrustRank: using topicality to combat web spam BIBAKFull-Text 63-72
  Baoning Wu; Vinay Goel; Brian D. Davison
Web spam is behavior that attempts to deceive search engine ranking algorithms. TrustRank is a recent algorithm that can combat web spam. However, TrustRank is vulnerable in the sense that the seed set used by TrustRank may not be sufficiently representative to cover well the different topics on the Web. Also, for a given seed set, TrustRank has a bias towards larger communities. We propose the use of topical information to partition the seed set and calculate trust scores for each topic separately to address the above issues. A combination of these trust scores for a page is used to determine its ranking. Experimental results on two large datasets show that our Topical TrustRank has a better performance than TrustRank in demoting spam sites or pages. Compared to TrustRank, our best technique can decrease spam from the top ranked sites by as much as 43.1%.
Keywords: PageRank, TrustRank, spam, web search engine
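The idea of partitioning the seed set by topic and combining per-topic trust scores can be sketched as follows. This is a simplified illustration, not the authors' implementation: the trust propagation is a plain seed-biased PageRank iteration, the combination is an unweighted sum (the abstract does not say which combination is used), and the graph, topic names, and function names are hypothetical.

```python
def trust_rank(graph, seeds, damping=0.85, iterations=50):
    """Seed-biased PageRank: trust flows only from the seed pages.

    graph maps each node to a list of nodes it links to; every link
    target must also appear as a key. Trust reaching a node with no
    out-links is dropped (a simplification).
    """
    nodes = list(graph)
    seed_mass = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(seed_mass)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * seed_mass[n] for n in nodes}
        for n in nodes:
            out_links = graph[n]
            if not out_links:
                continue
            share = damping * score[n] / len(out_links)
            for target in out_links:
                nxt[target] += share
        score = nxt
    return score

def topical_trust_rank(graph, seeds_by_topic):
    """Compute trust per topic, then combine by simple summation (assumed)."""
    total = {n: 0.0 for n in graph}
    for seeds in seeds_by_topic.values():
        for node, value in trust_rank(graph, seeds).items():
            total[node] += value
    return total

# Toy web graph: two trusted topical neighborhoods and an isolated spam pair.
web = {
    "news_seed": ["news_page"], "news_page": [],
    "sports_seed": ["sports_page"], "sports_page": [],
    "spam_a": ["spam_b"], "spam_b": ["spam_a"],
}
scores = topical_trust_rank(
    web, {"news": {"news_seed"}, "sports": {"sports_seed"}}
)
```

Pages reachable from a topical seed receive trust, while the spam pair, which no seed links to, stays at zero.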
Site level noise removal for search engines BIBAKFull-Text 73-82
  André Luiz da Costa Carvalho; Paul-Alexandru Chirita; Edleno Silva de Moura; Pável Calado; Wolfgang Nejdl
The currently booming search engine industry has led many online organizations to attempt to artificially increase their ranking in order to attract more visitors to their web sites. At the same time, the growth of the web has also inherently generated several navigational hyperlink structures that have a negative impact on the importance measures employed by current search engines. In this paper we propose and evaluate algorithms for identifying all these noisy links on the web graph, be they spam or simple relationships between real world entities represented by sites, replication of content, etc. Unlike prior work, we target a different type of noisy link structure, residing at the site level instead of the page level. We thus investigate and annihilate site level mutual reinforcement relationships, abnormal support coming from one site towards another, as well as complex link alliances between web sites. Our experiments with the link database of the TodoBR search engine show a very strong increase in the quality of the output rankings after having applied our techniques.
Keywords: PageRank, link analysis, noise reduction, spam
Detecting spam web pages through content analysis BIBAKFull-Text 83-92
  Alexandros Ntoulas; Marc Najork; Mark Manasse; Dennis Fetterly
In this paper, we continue our investigations of "web spam": the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).
Keywords: data mining, web characterization, web pages, web spam
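The percentages quoted in the abstract above follow directly from the page counts it reports. The short check below recomputes them; the variable names are illustrative, and the figures are those stated in the abstract.

```python
total_pages = 17168      # judged collection
spam_pages = 2364        # pages judged to be spam
spam_caught = 2037       # spam pages correctly identified
misidentified = 526      # spam and non-spam pages misidentified

# Share of spam pages that the combined heuristics identify.
detection_rate = spam_caught / spam_pages
# Share of the judged collection that is spam.
spam_share = spam_pages / total_pages
# Share of the judged collection that is misidentified.
error_rate = misidentified / total_pages

print(f"{detection_rate:.1%} {spam_share:.1%} {error_rate:.1%}")
```

Rounded to one decimal place, the three ratios are 86.2%, 13.8%, and 3.1%, matching the abstract.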


XML screamer: an integrated approach to high performance XML parsing, validation and deserialization BIBAKFull-Text 93-102
  Margaret G. Kostoulas; Morris Matsa; Noah Mendelsohn; Eric Perkins; Abraham Heifets; Martha Mercaldi
This paper describes an experimental system in which customized high performance XML parsers are prepared using parser generation and compilation techniques. Parsing is integrated with Schema-based validation and deserialization, and the resulting validating processors are shown to be as fast as or in many cases significantly faster than traditional nonvalidating parsers. High performance is achieved by integration across layers of software that are traditionally separate, by avoiding unnecessary data copying and transformation, and by careful attention to detail in the generated code. The effect of API design on XML performance is also briefly discussed.
Keywords: JAX-RPC, SAX, XML, XML schema, parsing, performance, schema compilation, validation
Symmetrically exploiting XML BIBAKFull-Text 103-111
  Shuohao Zhang; Curtis Dyreson
Path expressions are the principal means of locating data in a hierarchical model. But path expressions are brittle because they often depend on the structure of data and break if the data is structured differently. The structure of data could be unfamiliar to a user, may differ within a data collection, or may change over time as the schema evolves. This paper proposes a novel construct that locates related nodes in an instance of an XML data model, independent of a specific structure. It can augment many XPath expressions and can be seamlessly incorporated in XQuery or XSLT.
Keywords: XML, XPath, XQuery, path expressions
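The brittleness of path expressions noted in the abstract above is easy to reproduce. The sketch below uses Python's standard `xml.etree.ElementTree` with two hypothetical documents that hold the same data in different shapes. The descendant search shown at the end is ordinary XPath, used here only as a contrast; it is not the construct the paper proposes.

```python
import xml.etree.ElementTree as ET

# The same facts, nested book-first in one document and author-first in the other.
by_book = ET.fromstring(
    "<data><book><title>T</title><author><name>A</name></author></book></data>"
)
by_author = ET.fromstring(
    "<data><author><name>A</name><book><title>T</title></book></author></data>"
)

def names(root, path):
    """Return the text of every element matched by the path."""
    return [element.text for element in root.findall(path)]

structured_path = "./book/author/name"   # assumes the book-first layout
found_in_book_layout = names(by_book, structured_path)
found_in_author_layout = names(by_author, structured_path)

# A descendant search ignores nesting order, at the cost of precision.
loose_in_book_layout = names(by_book, ".//name")
loose_in_author_layout = names(by_author, ".//name")
```

The structured path finds the author name only in the layout it was written for, while the looser descendant search finds it in both.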

Developing regions & peer-to-peer

FeedEx: collaborative exchange of news feeds BIBAKFull-Text 113-122
  Seung Jun; Mustaque Ahamad
As most blogs and traditional media support RSS or Atom feeds, the news feed technology becomes increasingly prevalent. Taking advantage of ubiquitous news feeds, we design FeedEx, a news feed exchange system. Forming a distribution overlay network, nodes in FeedEx not only fetch feed documents from the servers but also exchange them with neighbors. Among many benefits of collaborative feed exchange, we focus on the low-overhead, scalable delivery mechanism that increases the availability of news feeds. Our design of FeedEx is incentive-compatible so that nodes are encouraged into cooperating rather than free riding. In addition, for a better design of FeedEx, we analyze the data collected from 245 feeds for 10 days and present relevant statistics about news feed publishing, including the distributions of feed size, entry lifetime, and publishing rate.
   Our experimental evaluation using 189 PlanetLab machines, which fetch from real-world feed servers, shows that FeedEx is an efficient system in many respects. Even when a node fetches feed documents as infrequently as every 16 hours, it captures more than 90% of the total entries published, and those captured entries are available within 22 minutes on average after being published at the servers. By contrast, stand-alone applications in the same condition show 36% of entry coverage and 5.7 hours of time lag. The efficient delivery of FeedEx is achieved with low communication overhead as each node receives only 0.9 document exchange calls and 6.3 document checking calls per minute on average.
Keywords: FeedEx, RSS, atom, collaborative exchange, news feeds


Examining the content and privacy of web browsing incidental information BIBAKFull-Text 123-132
  Kirstie Hawkey; Kori M. Inkpen
This research examines the privacy comfort levels of participants if others can view traces of their web browsing activity. During a week-long field study, participants used an electronic diary daily to annotate each web page visited with a privacy level. Content categories were used by participants to theoretically specify their privacy comfort for each category and by researchers to partition participants' actual browsing. The content categories were clustered into groups based on the dominant privacy levels applied to the pages. Inconsistencies between participants in their privacy ratings of categories suggest that a general privacy management scheme is inappropriate. Participants' consistency within categories suggests that a personalized scheme may be feasible; however a more fine-grained approach to classification is required to improve results for sites that tend to be general, of multiple task purposes, or dynamic in content.
Keywords: ad hoc collaboration, client-side logging, field study, personalization, privacy, web browsing behaviour, web page content
Off the beaten tracks: exploring three aspects of web navigation BIBAKFull-Text 133-142
  Harald Weinreich; Hartmut Obendorf; Eelco Herder; Matthias Mayer
This paper presents results of a long-term client-side Web usage study, updating previous studies that range in age from five to ten years. We focus on three aspects of Web navigation: changes in the distribution of navigation actions, speed of navigation and within-page navigation. "Navigation actions" corresponding to users' individual page requests are discussed by type. We reconfirm links to be the most important navigation element, while backtracking has lost more than half of its previously reported share and form submission has become far more common. Changes of the Web and the browser interfaces are candidates for causing these changes.
   Analyzing the time users stayed on pages, we confirm Web navigation to be a rapidly interactive activity. A breakdown of page characteristics shows that users often do not take the time to read the available text or consider all links. The performance of the Web is analyzed and reassessed against the resulting requirements.
   Finally, habits of within-page navigation are presented. Although most selected hyperlinks are located in the top left corner of the screen, in nearly a quarter of all cases people choose links that require scrolling. We analyzed the available browser real estate to gain insights for the design of non-scrolling Web pages.
Keywords: browser interfaces, clickstream study, hypertext, navigation, user modeling
pTHINC: a thin-client architecture for mobile wireless web BIBAKFull-Text 143-152
  Joeng Kim; Ricardo A. Baratto; Jason Nieh
Although web applications are gaining popularity on mobile wireless PDAs, web browsers on these systems can be quite slow and often lack adequate functionality to access many web sites. We have developed pTHINC, a PDA thin-client solution that leverages more powerful servers to run full-function web browsers and other application logic, then sends simple screen updates to the PDA for display. pTHINC uses server-side screen scaling to provide high-fidelity display and seamless mobility across a broad range of different clients and screen sizes, including both portrait and landscape viewing modes. pTHINC also leverages existing PDA control buttons to improve system usability and maximize available screen resolution for application display. We have implemented pTHINC on Windows Mobile and evaluated its performance on mobile wireless devices. Our results compared to local PDA web browsers and other thin-client approaches demonstrate that pTHINC provides superior web browsing performance and is the only PDA thin client that effectively supports crucial browser helper applications such as video playback.
Keywords: mobility, pervasive web, remote display, thin-client computing


Bringing communities to the semantic web and the semantic web to communities BIBAKFull-Text 153-162
  K. Faith Lawrence; m. c. schraefel
In this paper we consider the types of community networks that are most often codified within the Semantic Web. We propose the recognition of a new structure which fulfils the definition of community used outside the Semantic Web. We argue that the properties inherent in a community allow additional processing to be done with the described relationships existing between entities within the community network. Taking an existing online community as a case study we describe the ontologies and applications that we developed to support this community in the Semantic Web environment and discuss what lessons can be learnt from this exercise and applied in more general settings.
Keywords: case study, communities, e-applications, semantic web
Invisible participants: how cultural capital relates to lurking behavior BIBAKFull-Text 163-172
  Vladimir Soroka; Sheizaf Rafaeli
The asymmetry of activity in virtual communities is of great interest. While participation in the activities of virtual communities is crucial for a community's survival and development, many people prefer lurking, that is passive attention over active participation. Lurking can be measured and perhaps affected by both dispositional and situational variables. This work investigates the concept of cultural capital as situational antecedent of lurking and de-lurking (the decision to start posting after a certain amount of lurking time). Cultural capital is defined as the knowledge that enables an individual to interpret various cultural codes. The main hypothesis states that a user's cultural capital affects her level of activity in a community and her decision to de-lurk and cease to exist in very active communities because of information overload. This hypothesis is analyzed by mathematically defining a social communication network (SCN) of activities in authenticated discussion forums. We validate this model by examining the SCN using data collected in a sample of 636 online forums in Open University in Israel and 2 work based communities from IBM. The hypotheses verified here make it clear that fostering receptive participation may be as important and constructive as encouraging active contributions in online communities.
Keywords: Web forums, cultural capital, e-learning, lurking
Probabilistic models for discovering e-communities BIBAKFull-Text 173-182
  Ding Zhou; Eren Manavoglu; Jia Li; C. Lee Giles; Hongyuan Zha
The increasing amount of communication between individuals in e-formats (e.g. email, instant messaging and the Web) has motivated computational research in social network analysis (SNA). Previous work in SNA has emphasized the social network (SN) topology measured by communication frequencies while ignoring the semantic information in SNs. In this paper, we propose two generative Bayesian models for semantic community discovery in SNs, combining probabilistic modeling with community detection in SNs. To simulate the generative models, an EnF-Gibbs sampling algorithm is proposed to address the efficiency and performance problems of traditional methods. Experimental studies on the Enron email corpus show that our approach successfully detects the communities of individuals and in addition provides semantic topic descriptions of these communities.
Keywords: Gibbs sampling, clustering, data mining, email, social network, statistical modeling

User interfaces: semantic tagging

The web beyond popularity: a really simple system for web scale RSS BIBAKFull-Text 183-192
  Daniel Gruhl; Daniel N. Meredith; Jan H. Pieper; Alex Cozzi; Stephen Dill
Popularity based search engines have served to stagnate information retrieval from the web. Developed to deal with the very real problem of degrading quality within keyword based search they have had the unintended side effect of creating "icebergs" around topics, where only a small minority of the information is above the popularity water-line. This problem is especially pronounced with emerging information -- new sites are often hidden until they become popular enough to be considered above the water-line. In domains new to a user this is often helpful -- they can focus on popular sites first. Unfortunately it is not the best tool for a professional seeking to keep up-to-date with a topic as it emerges and evolves.
   We present a tool focused on this audience -- a system that addresses the very large scale information gathering, filtering and routing, and presentation problems associated with creating a useful incremental stream of information from the web as a whole. Utilizing the WebFountain platform as the primary data engine and Really Simple Syndication (RSS) as the delivery mechanism, our "Daily Deltas" (Delta) application is able to provide an informative feed of relevant content directly to a user. Individuals receive a personalized, incremental feed of pages related to their topic allowing them to track their interests independent of the overall popularity of the topic.
Keywords: Daily Delta, RSS, WebFountain, crawler, document routing, internet
Visualizing tags over time BIBAKFull-Text 193-202
  Micah Dubinko; Ravi Kumar; Joseph Magnani; Jasmine Novak; Prabhakar Raghavan; Andrew Tomkins
We consider the problem of visualizing the evolution of tags within the Flickr (flickr.com) online image sharing community. Any user of the Flickr service may append a tag to any photo in the system. Over the past year, users have on average added over a million tags each week. Understanding the evolution of these tags over time is therefore a challenging task. We present a new approach based on a characterization of the most interesting tags associated with a sliding interval of time. An animation provided via Flash in a web browser allows the user to observe and interact with the interesting tags as they evolve over time.
   New algorithms and data structures are required to support the efficient generation of this visualization. We combine a novel solution to an interval covering problem with extensions to previous work on score aggregation in order to create an efficient backend system capable of producing visualizations at arbitrary scales on this large dataset in real time.
Keywords: Flickr, interval covering, social media, tags, temporal evolution, visualization
Knowing the user's every move: user activity tracking for website usability evaluation and implicit interaction BIBAKFull-Text 203-212
  Richard Atterer; Monika Wnuk; Albrecht Schmidt
In this paper, we investigate how detailed tracking of user interaction can be monitored using standard web technologies. Our motivation is to enable implicit interaction and to ease usability evaluation of web applications outside the lab. To obtain meaningful statements on how users interact with a web application, the collected information needs to be more detailed and fine-grained than that provided by classical log files. We focus on tasks such as classifying the user with regard to computer usage proficiency or making a detailed assessment of how long it took users to fill in fields of a form. Additionally, it is important in the context of our work that usage tracking should not alter the user's experience and that it should work with existing server and browser setups. We present an implementation for detailed tracking of user actions on web pages. An HTTP proxy modifies HTML pages by adding JavaScript code before delivering them to the client. This JavaScript tracking code collects data about mouse movements, keyboard input and more. We demonstrate the usefulness of our approach in a case study.
Keywords: HTTP proxy, implicit interaction, mouse tracking, user activity tracking, website usability evaluation

Mining the web

Finding advertising keywords on web pages BIBAKFull-Text 213-222
  Wen-tau Yih; Joshua Goodman; Vitor R. Carvalho
A large and growing number of web pages display contextual advertising based on keywords automatically extracted from the text of the page, and this is a substantial source of revenue supporting the web today. Despite the importance of this area, little formal, published research exists. We describe a system that learns how to extract keywords from web pages for advertisement targeting. The system uses a number of features, such as term frequency of each potential keyword, inverse document frequency, presence in meta-data, and how often the term occurs in search query logs. The system is trained with a set of example pages that have been hand-labeled with "relevant" keywords. Based on this training, it can then extract new keywords from previously unseen pages. Accuracy is substantially better than several baseline systems.
Keywords: advertising, information extraction, keyword extraction
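The features listed in the abstract above (term frequency, inverse document frequency, presence in meta-data, and query log frequency) can be assembled into a per-candidate feature record along the following lines. This is only a sketch: the function name, the smoothed IDF formula, and the sample data are assumptions, and the learning step that the paper trains on hand-labeled pages is not shown.

```python
import math

def keyword_features(term, page_tokens, meta_tokens, doc_freq, num_docs,
                     query_log_counts):
    """Build a feature record for one candidate keyword (hypothetical layout).

    doc_freq maps a term to the number of documents containing it;
    query_log_counts maps a term to how often it appears in query logs.
    """
    tf = page_tokens.count(term)
    # Add-one smoothing in the denominator avoids division by zero (assumed).
    idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))
    return {
        "tf": tf,
        "idf": idf,
        "in_meta": term in meta_tokens,
        "query_freq": query_log_counts.get(term, 0),
    }

# Hypothetical page and corpus statistics.
page = ["cheap", "flights", "to", "paris", "cheap"]
meta = {"flights", "cheap"}
features = keyword_features(
    "cheap", page, meta,
    doc_freq={"cheap": 9, "to": 99},
    num_docs=100,
    query_log_counts={"cheap": 420},
)
```

A classifier trained on labeled pages would then score each candidate's record; that step depends on details the abstract does not give.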
Communities from seed sets BIBAKFull-Text 223-232
  Reid Andersen; Kevin J. Lang
Expanding a seed set into a larger community is a common procedure in link-based analysis. We show how to adapt recent results from theoretical computer science to expand a seed set into a community with small conductance and a strong relationship to the seed, while examining only a small neighborhood of the entire graph. We extend existing results to give theoretical guarantees that apply to a variety of seed sets from specified communities. We also describe simple and flexible heuristics for applying these methods in practice, and present early experiments showing that these methods compare favorably with existing approaches.
Keywords: community finding, graph conductance, link analysis, random walks, seed sets
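Conductance, the quality measure named in the abstract above, is the number of edges leaving a set divided by the smaller of the total degree inside the set and the total degree outside it. The sketch below computes it for an undirected graph stored as an adjacency dictionary; the graph and function name are illustrative, and the paper's expansion procedure itself is not shown.

```python
def conductance(adj, community):
    """Conductance of a node set in an undirected graph.

    adj maps each node to a list of its neighbors, with every edge
    listed in both directions. Returns 0.0 if either side has no edges.
    """
    inside = set(community)
    # Edges with exactly one endpoint inside the set.
    cut = sum(1 for u in inside for v in adj[u] if v not in inside)
    # Volume: sum of degrees on each side.
    vol_inside = sum(len(adj[u]) for u in inside)
    vol_outside = sum(len(adj[u]) for u in adj if u not in inside)
    denom = min(vol_inside, vol_outside)
    return cut / denom if denom else 0.0

# Two triangles joined by the single edge c-d.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
tight = conductance(graph, {"a", "b", "c"})
loose = conductance(graph, {"a", "d"})
```

The triangle has one outgoing edge against a volume of seven, so its conductance is 1/7. The scattered pair `{"a", "d"}` has every one of its five edge endpoints crossing the boundary, giving conductance 1.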
What's really new on the web?: identifying new pages from a series of unstable web snapshots BIBAKFull-Text 233-241
  Masashi Toyoda; Masaru Kitsuregawa
Identifying and tracking new information on the Web is important in sociology, marketing, and survey research, since new trends might be apparent in the new information. Such changes can be observed by crawling the Web periodically. In practice, however, it is impossible to crawl the entire expanding Web repeatedly. This means that the novelty of a page remains unknown, even if that page did not exist in previous snapshots. In this paper, we propose a novelty measure for estimating the certainty that a newly crawled page appeared between the previous and current crawls. Using this novelty measure, new pages can be extracted from a series of unstable snapshots for further analysis and mining to identify new trends on the Web. We evaluated the precision, recall, and miss rate of the novelty measure using our Japanese web archive, and applied it to a Web archive search engine.
Keywords: information retrieval, link analysis, novelty, web evolution
A case for software assurance BIBAFull-Text 243
  Mary Ann Davidson
Information technology has become "infrastructure technology," as most sectors of critical infrastructure rest on an IT backbone. Yet IT systems are not yet designed to be as safe, secure and reliable as physical infrastructure. Improving the security worthiness of commercial software requires a significant change in the development and product delivery process across the board. The security worthiness of all commercial software -- from all vendors -- demands that assurance become a critical focus for both providers and customers of IT. During Oracle's long history of building and delivering secure software, we continue to invest heavily in building security into each component of the product lifecycle. This is also an "organic" process which is regularly being enhanced to improve overall security practices. Our efforts have evolved from a formal development process to now additionally include secure coding standards, intensive developer training, innovative "bug finding" tools and working with leading vendors to "raise the bar" for all of industry as it pertains to security.
'e-science and cyberinfrastructure: a middleware perspective' BIBAFull-Text 245
  Tony Hey
The Internet was the inspiration of J.C.R. Licklider when he was at the Advanced Research Projects Agency in the 1960's. In those pre-Moore's Law days, Licklider imagined a future in which researchers could access and use computers and data from anywhere in the world. Today, as everyone knows, the killer applications for the Internet were email in the 1970's and the World Wide Web in the 1990's which was developed initially as a collaboration tool for the particle physics academic community. In the future, frontier research in many fields will increasingly require the collaboration of globally distributed groups of researchers needing access to distributed computing, data resources and support for remote access to expensive, multi-national specialized facilities such as telescopes and accelerators or specialist data archives. In the context of science and engineering, this is the 'e-Science' agenda. Robust middleware services deployed on top of research networks will constitute a powerful 'Cyberinfrastructure' for collaborative science and engineering.
   This talk will review the elements of this vision and describe the present status of efforts to build such an internet-scale distributed infrastructure based on Web Services. The goal is to provide robust middleware components that will allow scientists and engineers to routinely construct the inter-organizational 'Virtual Organizations'. Given the present state of Web Services, we argue for the need to define such Virtual Organization 'Grid' services on well-established Web Service specifications that are widely supported by the IT industry. Only industry can provide the necessary tooling and development environments to enable widespread adoption of such Grid services. Extensions to these basic Grid services can be added as more Web Services mature and the research community has had the opportunity to experiment with new services providing potentially useful new functionalities. The new Cyberinfrastructure will be of relevance to more than just the research community: it will impact both the e-learning and digital library communities and allow the creation of scientific 'mash-ups' of services giving significant added value.

Correctness & security

SecuBat: a web vulnerability scanner BIBAKFull-Text 247-256
  Stefan Kals; Engin Kirda; Christopher Kruegel; Nenad Jovanovic
As the popularity of the web increases and web applications become tools of everyday use, the role of web security has been gaining importance as well. Recent years have shown a significant increase in the number of web-based attacks. For example, there has been extensive press coverage of recent security incidents involving the loss of sensitive credit card information belonging to millions of customers.
   Many web application security vulnerabilities result from generic input validation problems. Examples of such vulnerabilities are SQL injection and Cross-Site Scripting (XSS). Although the majority of web vulnerabilities are easy to understand and to avoid, many web developers are, unfortunately, not security-aware. As a result, there exist many web sites on the Internet that are vulnerable.
   This paper demonstrates how easy it is for attackers to automatically discover and exploit application-level vulnerabilities in a large number of web applications. To this end, we developed SecuBat, a generic and modular web vulnerability scanner that, similar to a port scanner, automatically analyzes web sites with the aim of finding exploitable SQL injection and XSS vulnerabilities. Using SecuBat, we were able to find many potentially vulnerable web sites. To verify the accuracy of SecuBat, we picked one hundred interesting web sites from the potential victim list for further analysis and confirmed exploitable flaws in the identified web pages. Among our victims were well-known global companies and a finance ministry. Of course, we notified the administrators of vulnerable sites about potential security problems. More than fifty responded to request additional information or to report that the security hole was closed.
Keywords: SQL injection, XSS, automated vulnerability detection, crawling, cross-site scripting, scanner, security
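   As an illustration of the generic input validation problem the abstract describes, the following minimal sketch contrasts a query built by string concatenation with a parameterized one. It is not SecuBat's code; the table, the function names `lookup_unsafe` and `lookup_safe`, and the payload are hypothetical and chosen only for the example.

```python
import sqlite3

def make_db():
    # Hypothetical in-memory table used only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [("alice", "a-secret"), ("bob", "b-secret")])
    return conn

def lookup_unsafe(conn, name):
    # Vulnerable: user input is spliced directly into the SQL text.
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]

def lookup_safe(conn, name):
    # Parameterized: the driver treats the input purely as data.
    cursor = conn.execute("SELECT secret FROM users WHERE name = ?", (name,))
    return [row[0] for row in cursor]

conn = make_db()
payload = "x' OR '1'='1"  # classic injection string, assumed for the demo
leaked = lookup_unsafe(conn, payload)   # matches every row
blocked = lookup_safe(conn, payload)    # matches no row
```

   With the concatenated query the payload turns the WHERE clause into a condition that is always true, so every secret is returned; the parameterized version looks for a user literally named by the payload and finds none.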
Access control enforcement for conversation-based web services BIBAKFull-Text 257-266
  Massimo Mecella; Mourad Ouzzani; Federica Paci; Elisa Bertino
Service Oriented Computing is emerging as the main approach to build distributed enterprise applications on the Web. The widespread use of Web services is hindered by the lack of adequate security and privacy support. In this paper, we present a novel framework for enforcing access control in conversation-based Web services. Our approach takes into account the conversational nature of Web services. This is in contrast with existing approaches to access control enforcement that treat a Web service as a set of independent operations. Furthermore, our approach achieves a tradeoff between the need to protect Web services' access control policies and the need to disclose to clients the portion of access control policies related to the conversations they are interested in. This is important to avoid situations where the client cannot progress in the conversation because it lacks the required credentials. We introduce the concept of k-trustworthiness that defines the conversations for which a client can provide credentials maximizing the likelihood that it will eventually hit a final state.
Keywords: access control, conversations, transition systems, web services
Analysis of communication models in web service compositions BIBAKFull-Text 267-276
  Raman Kazhamiakin; Marco Pistore; Luca Santuari
In this paper we describe an approach for the verification of Web service compositions defined by sets of BPEL processes. The key aspect of such a verification is the model adopted for representing the communications among the services participating in the composition. Indeed, these communications are asynchronous and buffered in the existing execution frameworks, while most verification approaches assume a synchronous communication model for efficiency reasons. In our approach, we develop a parametric model for describing Web service compositions, which allows us to capture a hierarchy of communication models, ranging from synchronous communications to asynchronous communications with complex buffer structures. Moreover, we develop a technique to associate with a Web service composition the most adequate communication model, i.e., the simplest model that is sufficient to capture all the behaviors of the composition. This way, we can provide an accurate model of a wider class of service composition scenarios, while preserving as much as possible an efficient performance in verification.
Keywords: BPEL, asynchronous communications, formal verification, web service composition

Search engine engineering

Toward tighter integration of web search with a geographic information system BIBAKFull-Text 277-286
  Taro Tezuka; Takeshi Kurashima; Katsumi Tanaka
Integration of Web search with geographic information has recently attracted much attention. There are a number of local Web search systems enabling users to find location-specific Web content. In this paper, however, we point out that this integration is still at a superficial level. Most local Web search systems today only link local Web content to a map interface. They are extensions of a conventional stand-alone geographic information system (GIS), applied to a Web-based client-server architecture. In this paper, we discuss the directions available for tighter integration of Web search with a GIS, in terms of extraction, knowledge discovery, and presentation. We also describe implementations to support our argument that the integration must go beyond the simple map-and-hyperlink architecture.
Keywords: local web search, web mining, web-GIS integration
Geographically focused collaborative crawling BIBAKFull-Text 287-296
  Weizheng Gao; Hyun Chul Lee; Yingbo Miao
A collaborative crawler is a group of crawling nodes, in which each crawling node is responsible for a specific portion of the web. We study the problem of collecting geographically-aware pages using collaborative crawling strategies. We first propose several collaborative crawling strategies for geographically focused crawling, whose goal is to collect web pages about specified geographic locations, by considering features like URL address of page, content of page, extended anchor text of link, and others. Later, we propose various evaluation criteria to qualify the performance of such crawling strategies. Finally, we experimentally study our crawling strategies by crawling real web data, showing that some of our crawling strategies greatly outperform the simple URL-hash based partition collaborative crawling, in which the crawling assignments are determined according to the hash-value computation over URLs. More precisely, features like URL address of page and extended anchor text of link are shown to yield the best overall performance for geographically focused crawling.
Keywords: collaborative crawling, geographic entities, geographically focused crawling
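   The URL-hash based partition used as the baseline in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `assign_node` is hypothetical, and hashing only the host part of the URL (so that a whole site lands on one node) is an assumption; hashing the full URL would be an equally valid reading.

```python
import hashlib
from urllib.parse import urlsplit

def assign_node(url, num_nodes):
    """Map a URL to a crawling node by hashing its host name.

    Hashing the host rather than the full URL is an assumption of this sketch.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer hash.
    return int.from_bytes(digest[:8], "big") % num_nodes

# Pages from the same host are always routed to the same node.
node_a = assign_node("http://example.org/a", 4)
node_b = assign_node("http://example.org/b/c", 4)
```

   Such an assignment is independent of page content, which is why the abstract contrasts it with strategies that look at geographic cues in the URL, the page text, or the anchor text.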
To randomize or not to randomize: space optimal summaries for hyperlink analysis BIBAKFull-Text 297-306
  Tamás Sarlós; András A. Benczúr; Károly Csalogány; Dániel Fogaras; Balázs Rácz
Personalized PageRank expresses link-based page quality around user selected pages. The only previous personalized PageRank algorithm that can serve on-line queries for an unrestricted choice of pages on large graphs is our Monte Carlo algorithm [WAW 2004]. In this paper we achieve unrestricted personalization by combining rounding and randomized sketching techniques in the dynamic programming algorithm of Jeh and Widom [WWW 2003]. We evaluate the precision of approximation experimentally on large scale real-world data and find significant improvement over previous results. As a key theoretical contribution we show that our algorithms use an optimal amount of space by also improving earlier asymptotic worst-case lower bounds. Our lower bounds and algorithms apply to SimRank as well; of independent interest is the reduction of the SimRank computation to personalized PageRank.
Keywords: data streams, link-analysis, scalability, similarity search
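   For readers unfamiliar with personalized PageRank, the quantity being approximated can be sketched with a plain power iteration in which the random surfer teleports only to the user-selected pages. This is the textbook definition, not the rounding and sketching algorithm of the paper; the function name, the damping value 0.85, and the toy graph are assumptions of the example.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank personalized to `seeds`.

    `graph` maps each node to its list of out-links; every link target
    is assumed to appear as a key of `graph`.
    """
    nodes = list(graph)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: send its mass back to the seed pages.
                for s in nodes:
                    nxt[s] += damping * rank[n] * teleport[s]
        rank = nxt
    return rank

toy_graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = personalized_pagerank(toy_graph, seeds={"a"})
```

   Storing one such vector per possible seed set is what becomes infeasible on large graphs, which motivates the space-efficient summaries studied in the paper.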

E-learning & scientific applications

Addressing the testing challenge with a web-based e-assessment system that tutors as it assesses BIBAKFull-Text 307-316
  Mingyu Feng; Neil T. Heffernan; Kenneth R. Koedinger
Secondary teachers across the country are being asked to use formative assessment data to inform their classroom instruction. At the same time, critics of No Child Left Behind are calling the bill "No Child Left Untested" emphasizing the negative side of assessment, in that every hour spent assessing students is an hour lost from instruction. Or does it have to be? What if we better integrated assessment into the classroom, and we allowed students to learn during the test? Maybe we could even provide tutoring on the steps of solving problems. Our hypothesis is that we can achieve more accurate assessment by not only using data on whether students get test items right or wrong, but by also using data on the effort required for students to learn how to solve a test item. We provide evidence for this hypothesis using data collected with our E-ASSISTment system by more than 600 students over the course of the 2004-2005 school year. We also show that we can track student knowledge over time using modern longitudinal data analysis techniques. In a separate paper [9], we report on the ASSISTment system's architecture and scalability, while this paper is focused on how we can reliably assess student learning.
Keywords: ASSISTment, MCAS, intelligent tutoring system, learning, predict
Knowledge modeling and its application in life sciences: a tale of two ontologies BIBAKFull-Text 317-326
  Satya S. Sahoo; Christopher Thomas; Amit Sheth; William S. York; Samir Tartir
High throughput glycoproteomics, similar to genomics and proteomics, involves extremely large volumes of distributed, heterogeneous data as a basis for identification and quantification of a structurally diverse collection of biomolecules. The ability to share, compare, query for and most critically correlate datasets using the native biological relationships are some of the challenges being faced by glycobiology researchers. As a solution for these challenges, we are building a semantic structure, using a suite of ontologies, which supports management of data and information at each step of the experimental lifecycle. This framework will enable researchers to leverage the large scale of glycoproteomics data to their benefit.
   In this paper, we focus on the design of these biological ontology schemas with an emphasis on relationships between biological concepts, on the use of novel approaches to populate these complex ontologies including integrating extremely large datasets (500MB) as part of the instance base and on the evaluation of ontologies using OntoQA [38] metrics. The application of these ontologies in providing informatics solutions for the high throughput glycoproteomics experimental domain is also discussed. We present our experience as a use case of developing two ontologies in one domain, to be part of a set of use cases, which are used in the development of an emergent framework for building and deploying biological ontologies.
Keywords: ProPreO, bioinformatics ontology, biological ontology development, glycO, glycoproteomics, ontology population, ontology structural metrics, semantic bioinformatics
Reappraising cognitive styles in adaptive web applications BIBAKFull-Text 327-335
  Elizabeth Brown; Tim Brailsford; Tony Fisher; Adam Moore; Helen Ashman
The mechanisms for personalisation used in web applications are currently the subject of much debate amongst researchers from many diverse subject areas. One of the most contemporary ideas for user modelling in web applications is that of cognitive styles, where a user's psychological preferences are assessed, stored in a database, and then used to provide personalised content and/or links. We describe user trials of a case study that utilises visual-verbal preferences in an adaptive web-based educational system (AWBES). Students in this trial were assessed by the Felder-Solomon Inventory of Learning Styles (ILS) instrument, and their preferences were used as a means of content personalisation.
   Contrary to previous findings by other researchers, we found no significant differences in performance between matched and mismatched students. Conclusions are drawn about the value and validity of using cognitive styles as a way of modelling user preferences in educational web applications.
Keywords: adaptive hypermedia, cognitive styles, user modelling, user trials, web applications

High availability & performance

Cat and mouse: content delivery tradeoffs in web access BIBAKFull-Text 337-346
  Balachander Krishnamurthy; Craig E. Wills
Web pages include extraneous material that may be viewed as undesirable by a user. Increasingly many Web sites also require users to register to access either all or portions of the site. Such tension between content owners and users has resulted in a "cat and mouse" game between content provided and how users access it.
   We carried out a measurement-based study to understand the nature of extraneous content and its impact on performance as perceived by users. We characterize how this content is distributed and the effectiveness of blocking mechanisms to stop it as well as countermeasures taken by content owners to negate such mechanisms. We also examine sites that require some form of registration to control access and the attempts made to circumvent it.
   Results from our study show that extraneous content exists on a majority of popular pages and that a 25-30% reduction in downloaded objects and bytes with corresponding latency reduction can be attained by blocking such content. The top ten advertisement delivering companies delivered 40% of all URLs matched as ads in our study. Both the server name and the remainder of the URL are important in matching a URL as an ad. A majority of popular sites require some form of registration and for such sites users can obtain an account from a shared public database. We discuss future measures and countermeasures on the part of each side.
Keywords: anonymity, content blocking, privacy, web registration
WAP5: black-box performance debugging for wide-area systems BIBAKFull-Text 347-356
  Patrick Reynolds; Janet L. Wiener; Jeffrey C. Mogul; Marcos K. Aguilera; Amin Vahdat
Wide-area distributed applications are challenging to debug, optimize, and maintain. We present Wide-Area Project 5 (WAP5), which aims to make these tasks easier by exposing the causal structure of communication within an application and by exposing delays that imply bottlenecks. These bottlenecks might not otherwise be obvious, with or without the application's source code. Previous research projects have presented algorithms to reconstruct application structure and the corresponding timing information from black-box message traces of local-area systems. In this paper we present (1) a new algorithm for reconstructing application structure in both local- and wide-area distributed systems, (2) an infrastructure for gathering application traces in PlanetLab, and (3) our experiences tracing and analyzing three systems: CoDeeN and Coral, two content-distribution networks in PlanetLab; and Slurpee, an enterprise-scale incident-monitoring system.
Keywords: black box systems, distributed systems, performance analysis, performance debugging
WS-replication: a framework for highly available web services BIBAKFull-Text 357-366
  Jorge Salas; Francisco Perez-Sorrosal; Marta Patiño-Martínez; Ricardo Jiménez-Peris
Due to the rapid acceptance and spread of web services, a number of mission-critical systems will be deployed as web services in the coming years. The availability of those systems must be guaranteed in case of failures and network disconnections. Examples of web services for which availability will be a crucial issue are those belonging to coordination web service infrastructure, such as web services for transactional coordination (e.g., WS-CAF and WS-Transaction). These services should remain available despite site and connectivity failures to enable business interactions on a 24x7 basis. Some of the common techniques for attaining availability consist in the use of a clustering approach. However, in an Internet setting a domain can get partitioned from the network due to a link overload or some other connectivity problem. The unavailability of a coordination service impacts the availability of all the partners in the business process. That is, coordination services are an example of critical components that need higher provisions for availability. In this paper, we address this problem by providing an infrastructure, WS-Replication, for WAN replication of web services. The infrastructure is based on a group communication web service, WS-Multicast, that respects the web service autonomy. The transport of WS-Multicast is based on SOAP and relies exclusively on web service technology for interaction across organizations. We have replicated WS-CAF using our WS-Replication framework and evaluated its performance.
Keywords: WS-CAF, availability, group communication, transactions, web services

Web mining with search engines

Random sampling from a search engine's index BIBAKFull-Text 367-376
  Ziv Bar-Yossef; Maxim Gurevich
We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from a search engine's index using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines.
   The technique of Bharat and Broder suffers from two well recorded biases: it favors long documents and highly ranked documents. In this paper we introduce two novel sampling techniques: a lexicon-based technique and a random walk technique. Our methods produce biased sample documents, but each sample is accompanied by a corresponding "weight", which represents the probability of this document to be selected in the sample. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to three well known Monte Carlo simulation methods: rejection sampling, importance sampling and the Metropolis-Hastings algorithm.
   We analyze our methods rigorously and prove that under plausible assumptions, our techniques are guaranteed to produce near-uniform samples from the search engine's index. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long or highly ranked documents. We use our algorithms to collect fresh data about the relative sizes of Google, MSN Search, and Yahoo!.
Keywords: benchmarks, sampling, search engines, size estimation
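   The way biased samples and their weights can be combined to simulate near-uniform samples, as described in the abstract above, can be illustrated with rejection sampling, one of the three Monte Carlo methods it names. The sketch below is not the authors' sampler: the three-document index, its weights, and the function names are hypothetical, and the weights are assumed to be known exactly.

```python
import random
from collections import Counter

# Hypothetical toy index: document name -> selection weight of the biased
# sampler (for instance, a sampler that favors long documents).
WEIGHTS = {"short": 1.0, "medium": 5.0, "long": 20.0}

def draw_biased(rng):
    """Return a document with probability proportional to its weight."""
    names = list(WEIGHTS)
    return rng.choices(names, weights=[WEIGHTS[n] for n in names])[0]

def draw_near_uniform(rng):
    """Rejection sampling: accept a biased draw with probability
    min_weight / weight, which cancels the bias of the proposal."""
    floor = min(WEIGHTS.values())
    while True:
        doc = draw_biased(rng)
        if rng.random() < floor / WEIGHTS[doc]:
            return doc

rng = random.Random(0)
counts = Counter(draw_near_uniform(rng) for _ in range(6000))
```

   A document drawn with probability proportional to its weight w and then kept with probability proportional to 1/w ends up selected with the same probability as every other document, so the three counts come out roughly equal even though the proposal strongly favors the long document.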
A web-based kernel function for measuring the similarity of short text snippets BIBAKFull-Text 377-386
  Mehran Sahami; Timothy D. Heilman
Determining the similarity of short text snippets, such as search queries, works poorly with traditional document similarity measures (e.g., cosine), since there are often few, if any, terms in common between two short text snippets. We address this problem by introducing a novel method for measuring the similarity between short text snippets (even those without any overlapping terms) by leveraging web search results to provide greater context for the short texts. In this paper, we define such a similarity kernel function, mathematically analyze some of its properties, and provide examples of its efficacy. We also show the use of this kernel function in a large-scale system for suggesting related queries to search engine users.
Keywords: information retrieval, kernel functions, query suggestion, text similarity measures, web search
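   The idea of comparing two short snippets through the search results they retrieve, rather than through their own terms, can be sketched as follows. This is a simplified illustration and not the kernel defined in the paper: the search engine is replaced by a hypothetical in-memory table `TOY_RESULTS`, and raw term counts are used for the context vectors, both of which are assumptions of the example.

```python
import math
from collections import Counter

# Hypothetical stand-in for a web search engine: snippet -> result texts.
TOY_RESULTS = {
    "ai": ["artificial intelligence research machine learning",
           "machine learning and artificial intelligence systems"],
    "artificial intelligence": ["artificial intelligence machine learning research",
                                "intelligence systems learning"],
    "pizza": ["pizza dough tomato cheese", "cheese pizza oven recipe"],
}

def expand(snippet):
    """Build a unit-length term vector from the snippet's search results."""
    total = Counter()
    for doc in TOY_RESULTS.get(snippet, []):
        total.update(doc.split())
    norm = math.sqrt(sum(v * v for v in total.values()))
    return {t: v / norm for t, v in total.items()} if norm else {}

def kernel(x, y):
    """Inner product of the two expanded (context) vectors."""
    ex, ey = expand(x), expand(y)
    return sum(w * ey.get(t, 0.0) for t, w in ex.items())

related = kernel("ai", "artificial intelligence")
unrelated = kernel("ai", "pizza")
```

   The snippets "ai" and "artificial intelligence" share no terms, so a cosine measure on the snippets themselves would score them as 0, while the context vectors built from their results overlap heavily.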
Generating query substitutions BIBAKFull-Text 387-396
  Rosie Jones; Benjamin Rey; Omid Madani; Wiley Greiner
We introduce the notion of query substitution, that is, generating a new query to replace a user's original search query. Our technique uses modifications based on typical substitutions web searchers make to their queries. In this way the new query is strongly related to the original query, containing terms closely related to all of the original terms. This contrasts with query expansion through pseudo-relevance feedback, which is costly and can lead to query drift. This also contrasts with query relaxation through boolean or TFIDF retrieval, which reduces the specificity of the query. We define a scale for evaluating query substitution, and show that our method performs well at generating new queries related to the original queries. We build a model for selecting between candidates, by using a number of features relating the query-candidate pair, and by fitting the model to human judgments of relevance of query suggestions. This further improves the quality of the candidates generated. Experiments show that our techniques significantly increase coverage and effectiveness in the setting of sponsored search.
Keywords: paraphrasing, query rewriting, query substitution, sponsored search

Social networks

POLYPHONET: an advanced social network extraction system from the web BIBAKFull-Text 397-406
  Yutaka Matsuo; Junichiro Mori; Masahiro Hamasaki; Keisuke Ishida; Takuichi Nishimura; Hideaki Takeda; Koiti Hasida; Mitsuru Ishizuka
Social networks play important roles in the Semantic Web: knowledge management, information retrieval, ubiquitous computing, and so on. We propose a social network extraction system called POLYPHONET, which employs several advanced techniques to extract relations of persons, detect groups of persons, and obtain keywords for a person. Search engines, especially Google, are used to measure co-occurrence of information and obtain Web documents.
   Several studies have used search engines to extract social networks from the Web, but our research advances the following points: First, we reduce the related methods into simple pseudocode using Google so that we can build up integrated systems. Second, we develop several new algorithms for social network mining such as those to classify relations into categories, to make extraction scalable, and to obtain and utilize person-to-word relations. Third, every module is implemented in POLYPHONET, which has been used at four academic conferences, each with more than 500 participants. We give an overview of that system. Finally, a novel architecture called Super Social Network Mining is proposed; it utilizes simple modules using Google and is characterized by scalability and Relate-Identify processes: Identification of each entity and extraction of relations are repeated to obtain a more precise social network.
Keywords: search engine, social network, web mining
Semantic analytics on social networks: experiences in addressing the problem of conflict of interest detection BIBAKFull-Text 407-416
  Boanerges Aleman-Meza; Meenakshi Nagarajan; Cartic Ramakrishnan; Li Ding; Pranam Kolari; Amit P. Sheth; I. Budak Arpinar; Anupam Joshi; Tim Finin
In this paper, we describe a Semantic Web application that detects Conflict of Interest (COI) relationships among potential reviewers and authors of scientific papers. This application discovers various 'semantic associations' between the reviewers and authors in a populated ontology to determine a degree of Conflict of Interest. This ontology was created by integrating entities and relationships from two social networks, namely "knows," from a FOAF (Friend-of-a-Friend) social network and "co-author," from the underlying co-authorship network of the DBLP bibliography. We describe our experiences developing this application in the context of a class of Semantic Web applications, which have important research and engineering challenges in common. In addition, we present an evaluation of our approach for real-life COI detection.
Keywords: RDF, conflict of interest, data fusion, entity disambiguation, ontologies, peer review process, semantic analytics, semantic associations, semantic web, social networks
Exploring social annotations for the semantic web BIBAKFull-Text 417-426
  Xian Wu; Lei Zhang; Yong Yu
In order to obtain machine-understandable semantics for web resources, research on the Semantic Web tries to annotate web resources with concepts and relations from explicitly defined formal ontologies. This kind of formal annotation is usually done manually or semi-automatically. In this paper, we explore a complementary approach that focuses on the "social annotations of the web", which are annotations manually made by normal web users without a pre-defined formal ontology. Compared to the formal annotations, although social annotations are coarse-grained, informal and vague, they are also more accessible to more people and better reflect the web resources' meaning from the users' point of view during their actual usage of the web resources. Using a social bookmark service as an example, we show how emergent semantics [2] can be statistically derived from the social annotations. Furthermore, we apply the derived emergent semantics to discover and search shared web bookmarks. The initial evaluation of our implementation shows that our method can effectively discover semantically related web bookmarks that current social bookmark services cannot discover easily.
Keywords: emergent semantics, semantic web, social annotation, social bookmarks

Web engineering: validation

Relaxed: on the way towards true validation of compound documents BIBAKFull-Text 427-436
  Jirka Kosek; Petr Nálevka
To maintain interoperability in the Web environment it is necessary to comply with Web standards. Current specifications of HTML and XHTML languages define conformance conditions both in specification prose and in a formalized way utilizing DTD. Unfortunately DTD is a very limited schema language and cannot express many constraints that are specified in the free text parts of the specification. This means that a page which validates against DTD is not necessarily conforming to the specification. In this article we analyze features of modern schema languages that can improve validation of Web pages by covering more (X)HTML language constraints than DTD. Our schemas use a combination of RELAX NG and Schematron to check not only the structure of the Web pages, but also datatypes of attributes and elements, more complex relations between elements and some WCAG checkpoints. A modular approach for schema composition is presented together with usage examples, including sample schemas for various compound documents (e.g. XHTML combined with MathML and SVG). The second part of this article contains a description of the Relaxed validator application we have developed. Relaxed is an extensible and powerful validation engine offering a convenient Web interface, a Web-service API, Java API and command-line interface. Combined with our RELAX NG + Schematron schemas, Relaxed offers very valuable validation results that surpass the W3C validator in many aspects.
Keywords: RELAX NG, Schematron, XHTML, XML, compound documents, validation
Model-based version and configuration management for a web engineering lifecycle BIBAKFull-Text 437-446
  Tien N. Nguyen
During the lifecycle of a large-scale Web application, Web developers produce a wide variety of inter-related Web objects. Following good Web engineering practice, developers often create them based on a Web application development method, which requires certain logical models for the development and maintenance process. Web development is dynamic; thus, those logical models as well as Web artifacts evolve over time. However, the task of managing their evolution is still very inefficient because design decisions in models are not directly accessible in existing file-based software configuration management repositories. Key limitations of existing Web version control tools include their inadequacy in representing semantics of design models and inability to manage the evolution of model-based objects and their logical connections to Web documents. This paper presents a framework that allows developers to manage versions and configurations of models and to capture changes to model-to-model relations among Web objects. Model-based objects, Web documents, and relations are directly represented and versioned in a structure-oriented manner.
Keywords: model-based configuration management, versioned hypermedia, web engineering
Model-directed web transactions under constrained modalities BIBAKFull-Text 447-456
  Zan Sun; Jalal Mahmud; Saikat Mukherjee; I. V. Ramakrishnan
Online transactions (e.g., buying a book on the Web) typically involve a number of steps spanning several pages. Conducting such transactions under constrained interaction modalities as exemplified by small screen handhelds or interactive speech interfaces -- the primary mode of communication for visually impaired individuals -- is a strenuous, fatigue-inducing activity. But usually one needs to browse only a small fragment of a Web page to perform a transactional step such as a form fillout, selecting an item from a search results list, etc. We exploit this observation to develop an automata-based process model that delivers only the "relevant" page fragments at each transactional step, thereby reducing information overload on such narrow interaction bandwidths. We realize this model by coupling techniques from content analysis of Web documents, automata learning and statistical classification. The process model and associated techniques have been incorporated into Guide-O, a prototype system that facilitates online transactions using speech/keyboard interface (Guide-O-Speech), or with limited-display size handhelds (Guide-O-Mobile). Performance of Guide-O and its user experience are reported.
Keywords: assistive device, content adaption, web transaction

New search paradigms

Retroactive answering of search queries BIBAKFull-Text 457-466
  Beverly Yang; Glen Jeh
Major search engines currently use the history of a user's actions (e.g., queries, clicks) to personalize search results. In this paper, we present a new personalized service, query-specific web recommendations (QSRs), that retroactively answers queries from a user's history as new results arise. The QSR system addresses two important subproblems with applications beyond the system itself: (1) Automatic identification of queries in a user's history that represent standing interests and unfulfilled needs. (2) Effective detection of interesting new results to these queries. We develop a variety of heuristics and algorithms to address these problems, and evaluate them through a study of Google history users. Our results strongly motivate the need for automatic detection of standing interests from a user's history, and identify the algorithms that are most useful in doing so. Our results also identify the algorithms, some of which are counter-intuitive, that are most useful in identifying interesting new results for past queries, allowing us to achieve very high precision over our data set.
Keywords: automatic identification of user intent, personalized search, recommendations
CWS: a comparative web search system BIBAKFull-Text 467-476
  Jian-Tao Sun; Xuanhui Wang; Dou Shen; Hua-Jun Zeng; Zheng Chen
In this paper, we define and study a novel search problem: Comparative Web Search (CWS). The task of CWS is to seek relevant and comparative information from the Web to help users conduct comparisons among a set of topics. A system called CWS is developed to effectively facilitate Web users' comparison needs. Given a set of queries, which represent the topics that a user wants to compare, the system is characterized by: (1) automatic retrieval and ranking of Web pages by incorporating both their relevance to the queries and the comparative contents they contain; (2) automatic clustering of the comparative contents into semantically meaningful themes; (3) extraction of representative keyphrases to summarize the commonness and differences of the comparative contents in each theme. We developed a novel interface which supports two types of view modes: a pair-view which displays the result at the page level, and a cluster-view which organizes the comparative pages into the themes and displays the extracted phrases to facilitate users' comparison. Experimental results show the CWS system is effective and efficient.
Keywords: clustering, comparative web search, keyphrase extraction, search engine
Searching with context BIBAKFull-Text 477-486
  Reiner Kraft; Chi Chao Chang; Farzin Maghoul; Ravi Kumar
Contextual search refers to proactively capturing the information need of a user by automatically augmenting the user query with information extracted from the search context; for example, by using terms from the web page the user is currently browsing or a file the user is currently editing.
   We present three different algorithms to implement contextual search for the Web. The first, query rewriting (QR), augments each query with appropriate terms from the search context and uses an off-the-shelf web search engine to answer this augmented query. The second, rank-biasing (RB), generates a representation of the context and answers queries using a custom-built search engine that exploits this representation. The third, iterative filtering meta-search (IFM), generates multiple subqueries based on the user query and appropriate terms from the search context, uses an off-the-shelf search engine to answer these subqueries, and re-ranks the results of the subqueries using rank aggregation methods.
   We extensively evaluate the three methods using 200 contexts and over 24,000 human relevance judgments of search results. We show that while QR works surprisingly well, the relevance and recall can be improved using RB and substantially more using IFM. Thus, QR, RB, and IFM represent a cost-effective design spectrum for contextual search.
Keywords: contextual search, meta-search, rank aggregation, specialized search engines, web search
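The re-ranking step described for IFM can be sketched with a Borda count, one standard rank aggregation method. The abstract does not say which aggregation method the authors use, so the choice of Borda count, the function name `borda_aggregate`, and the sample result lists are assumptions made purely for illustration.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Merge several ranked result lists into one list with a Borda count.

    Each item earns (n - position) points per list, where n is the length
    of the longest list; an item missing from a list earns nothing from it.
    """
    n = max((len(r) for r in rankings), default=0)
    scores = defaultdict(int)
    for ranking in rankings:
        for position, item in enumerate(ranking):
            scores[item] += n - position
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical results returned for three subqueries of one user query.
subquery_results = [
    ["a.example", "b.example", "c.example"],
    ["b.example", "a.example", "d.example"],
    ["b.example", "c.example", "a.example"],
]
merged = borda_aggregate(subquery_results)
print(merged)
```

A positional method such as this needs only the ranked lists themselves, which fits a meta-search setting where the underlying engine's scores may not be available.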
Keynote talk BIBAFull-Text 487
  Richard Granger
Richard Granger will be providing an update on the deployment of information technology at a national scale in the NHS in England. Particular topics that will be covered include variability of performance and user organizations and suppliers; access/channel strategies for NHS users and members of the public; and take-up rates for new technologies, including internet adoption. Data on number of users and transactions to date will also be provided.
Broken links on the web: local laws and the global free flow of information BIBAFull-Text 489
  Daniel Weitzner
Across the World Wide Web there is government censorship and monitoring of political messages and "morally-corrupting" material. Google have been in the news recently for capitulating to the Chinese government's demands to ban certain kinds of content, and also for refusing to pass logs of browsing habits to the US government (while Microsoft and Yahoo complied with the request). How can the Web survive as a unified, global information environment in the face of government censorship? Can governments and the private sector come to an agreement on international legal standards for the free flow of information and privacy?

Semantic web: ontology construction

Position paper: ontology construction from online ontologies BIBAKFull-Text 491-495
  Harith Alani
One of the main hurdles towards a wide endorsement of ontologies is the high cost of constructing them. Reuse of existing ontologies offers a much cheaper alternative than building new ones from scratch, yet tools to support such reuse are still in their infancy. However, more ontologies are becoming available on the web, and online libraries for storing and indexing ontologies are increasing in number and demand. Search engines have also started to appear, to facilitate search and retrieval of online ontologies. This paper presents a fresh view on constructing ontologies automatically, by identifying, ranking, and merging fragments of online ontologies.
Keywords: automatic ontology construction, ontology reuse
Position paper: towards the notion of gloss, and the adoption of linguistic resources in formal ontology engineering BIBAKFull-Text 497-503
  Mustafa Jarrar
In this paper, we first introduce the notion of gloss for ontology engineering purposes. We propose that each vocabulary in an ontology should have a gloss. A gloss basically is an informal description of the meaning of a vocabulary that is supposed to render factual and critical knowledge to understanding a concept, but that is unreasonable or very difficult to formalize and/or articulate formally. We present a set of guidelines on what should and should not be provided in a gloss. Second, we propose to incorporate linguistic resources in the ontology engineering process. We clarify the importance of using lexical resources as a "consensus reference" in ontology engineering, and so enabling the adoption of the glosses found in these resources. A linguistic resource (i.e. its list of terms and their definitions) shall be seen as a shared vocabulary space for ontologies. We present an ontology engineering software tool (called DogmaModeler), and illustrate its support for reusing WordNet's terms and glosses in ontology modeling.
Keywords: dogmamodeler, formal ontology engineering, gloss, lexical semantics, ontologies and wordnet, ontology, wordnet
Bootstrapping semantics on the web: meaning elicitation from schemas BIBAKFull-Text 505-512
  Paolo Bouquet; Luciano Serafini; Stefano Zanobini; Simone Sceffer
In most web sites, web-based applications (such as web portals, e-marketplaces, search engines), and in the file systems of personal computers, a wide variety of schemas (such as taxonomies, directory trees, thesauri, Entity-Relationship schemas, RDF Schemas) are published which (i) convey a clear meaning to humans (e.g. help in the navigation of large collections of documents), but (ii) convey only a small fraction (if any) of their meaning to machines, as their intended meaning is not formally/explicitly represented. In this paper we present a general methodology for automatically eliciting and representing the intended meaning of these structures, and for making this meaning available in domains like information integration and interoperability, web service discovery and composition, peer-to-peer knowledge management, and semantic browsers. We also present an implementation (called CtxMatch2) of how such a method can be used for semantic interoperability.
Keywords: meaning elicitation, schema matching, semantic web

Security, privacy & ethics

Designing ethical phishing experiments: a study of (ROT13) rOnl query features BIBAKFull-Text 513-522
  Markus Jakobsson; Jacob Ratkiewicz
We study how to design experiments to measure the success rates of phishing attacks that are ethical and accurate, which are two requirements of contradictory forces. Namely, an ethical experiment must not expose the participants to any risk; it should be possible to locally verify by the participants or representatives thereof that this was the case. At the same time, an experiment is accurate if it is possible to argue why its success rate is not an upper or lower bound of that of a real attack -- this may be difficult if the ethics considerations make the user perception of the experiment different from the user perception of the attack. We introduce several experimental techniques allowing us to achieve a balance between these two requirements, and demonstrate how to apply these, using a context aware phishing experiment on a popular online auction site which we call "rOnl". Our experiments exhibit a measured average yield of 11% per collection of unique users. This study was authorized by the Human Subjects Committee at Indiana University (Study #05-10306).
Keywords: accurate, ethical, experiment, phishing, security
Invasive browser sniffing and countermeasures BIBAKFull-Text 523-532
  Markus Jakobsson; Sid Stamm
We describe the detrimental effects of browser cache/history sniffing in the context of phishing attacks, and detail an approach that neutralizes the threat by means of URL personalization; we report on an implementation performing such personalization on the fly, and analyze the costs of and security properties of our proposed solution.
Keywords: browser cache, cascading style sheets, personalization, phishing, sniffing

Data mining

A probabilistic approach to spatiotemporal theme pattern mining on weblogs BIBAKFull-Text 533-542
  Qiaozhu Mei; Chao Liu; Hang Su; ChengXiang Zhai
Mining subtopics from weblogs and analyzing their spatiotemporal patterns have applications in multiple domains. In this paper, we define the novel problem of mining spatiotemporal theme patterns from weblogs and propose a novel probabilistic approach to model the subtopic themes and spatiotemporal theme patterns simultaneously. The proposed model discovers spatiotemporal theme patterns by (1) extracting common themes from weblogs; (2) generating theme life cycles for each given location; and (3) generating theme snapshots for each given time period. Evolution of patterns can be discovered by comparative analysis of theme life cycles and theme snapshots. Experiments on three different data sets show that the proposed approach can discover interesting spatiotemporal theme patterns effectively. The proposed probabilistic model is general and can be used for spatiotemporal text mining on any domain with time and location information.
Keywords: mixture model, spatiotemporal text mining, theme pattern, weblog
Time-dependent semantic similarity measure of queries using historical click-through data BIBAKFull-Text 543-552
  Qiankun Zhao; Steven C. H. Hoi; Tie-Yan Liu; Sourav S. Bhowmick; Michael R. Lyu; Wei-Ying Ma
It has become a promising direction to measure similarity of Web search queries by mining the increasing amount of click-through data logged by Web search engines, which record the interactions between users and the search engines. Most existing approaches employ the click-through data for similarity measure of queries with little consideration of the temporal factor, while the click-through data is often dynamic and contains rich temporal information. In this paper we present a new framework for a time-dependent query semantic similarity model that exploits the temporal characteristics of historical click-through data. The intuition is that more accurate semantic similarity values between queries can be obtained by taking into account the timestamps of the log data. With a set of user-defined calendar schema and calendar patterns, our time-dependent query similarity model is constructed using the marginalized kernel technique, which can exploit both explicit similarity and implicit semantics from the click-through data effectively. Experimental results on a large set of click-through data acquired from a commercial search engine show that our time-dependent query similarity model is more accurate than the existing approaches. Moreover, we observe that our time-dependent query similarity model can, to some extent, reflect real-world semantics such as real-world events that are happening over time.
Keywords: click-through data, event detection, evolution pattern, marginalized kernel, semantic similarity measure
Interactive wrapper generation with minimal user effort BIBAKFull-Text 553-563
  Utku Irmak; Torsten Suel
While much of the data on the web is unstructured in nature, there is also a significant amount of embedded structured data, such as product information on e-commerce sites or stock data on financial sites. A large amount of research has focused on the problem of generating wrappers, i.e., software tools that allow easy and robust extraction of structured data from text and HTML sources. In many applications, such as comparison shopping, data has to be extracted from many different sources, making manual coding of a wrapper for each source impractical. On the other hand, fully automatic approaches are often not reliable enough, resulting in low quality of the extracted data.
   We describe a complete system for semi-automatic wrapper generation that can be trained on different data sources in a simple interactive manner. Our goal is to minimize the amount of user effort for training reliable wrappers through design of a suitable training interface that is implemented based on a powerful underlying extraction language and a set of training and ranking algorithms. Our experiments show that our system achieves reliable extraction with a very small amount of user effort.
Keywords: active learning, data extraction, wrapper generation

Semi-structured semantic data

Towards content trust of web resources BIBAKFull-Text 565-574
  Yolanda Gil; Donovan Artz
Trust is an integral part of the Semantic Web architecture. While most prior work focuses on entity-centered issues such as authentication and reputation, it does not model the content, i.e. the nature and use of the information being exchanged. This paper discusses content trust as an aggregate of other trust measures that have been previously studied. The paper introduces several factors that users consider in deciding whether to trust the content provided by a Web resource. Many of these factors are hard to capture in practice, since they would require a large amount of user input. Our goal is to discern which of these factors could be captured in practice with minimal user interaction in order to maximize the system's trust estimates. The paper also describes a simulation environment that we have designed to study alternative models of content trust.
Keywords: semantic web, trust, web of trust
Supporting online problem-solving communities with the semantic web BIBAKFull-Text 575-584
  Anupriya Ankolekar; Katia Sycara; James Herbsleb; Robert Kraut; Chris Welty
The Web plays a critical role in hosting Web communities, their content and interactions. A prime example is the open source software (OSS) community, whose members, including software developers and users, interact almost exclusively over the Web, constantly generating, sharing and refining content in the form of software code through active interaction over the Web on code design and bug resolution processes. The Semantic Web is an envisaged extension of the current Web, in which content is given a well-defined meaning, through the specification of metadata and ontologies, increasing the utility of the content and enabling information from heterogeneous sources to be integrated. We developed a prototype Semantic Web system for OSS communities, Dhruv. Dhruv provides an enhanced semantic interface to bug resolution messages and recommends related software objects and artifacts. Dhruv uses an integrated model of the OpenACS community, the software, and the Web interactions, which is semi-automatically populated from the existing artifacts of the community.
Keywords: computer-supported cooperative work, human-computer interaction, open source software communities, semantic web applications
Semantic Wikipedia BIBAKFull-Text 585-594
  Max Völkel; Markus Krötzsch; Denny Vrandecic; Heiko Haller; Rudi Studer
Wikipedia is the world's largest collaboratively edited source of encyclopaedic knowledge. But in spite of its utility, its contents are barely machine-interpretable. Structural knowledge, e.g., about how concepts are interrelated, can neither be formally stated nor automatically processed. Also the wealth of numerical data is only available as plain text and thus cannot be processed by its actual meaning.
   We provide an extension to be integrated in Wikipedia, that allows the typing of links between articles and the specification of typed data inside the articles in an easy-to-use manner.
   Enabling even casual users to participate in the creation of an open semantic knowledge base, Wikipedia has the chance to become a resource of semantic statements, hitherto unknown regarding size, scope, openness, and internationalisation. These semantic enhancements bring to Wikipedia benefits of today's semantic technologies: more specific ways of searching and browsing. Also, the RDF export, that gives direct access to the formalised knowledge, opens Wikipedia up to a wide range of external applications, that will be able to use it as a background knowledge base.
   In this paper, we present the design, implementation, and possible uses of this extension.
Keywords: RDF, Semantic Web, Wiki, Wikipedia

Performance, reliability & scalability

Dynamic placement for clustered web applications BIBAKFull-Text 595-604
  A. Karve; T. Kimbrel; G. Pacifici; M. Spreitzer; M. Steinder; M. Sviridenko; A. Tantawi
We introduce and evaluate a middleware clustering technology capable of allocating resources to web applications through dynamic application instance placement. We define application instance placement as the problem of placing application instances on a given set of server machines to adjust the amount of resources available to applications in response to varying resource demands of application clusters. The objective is to maximize the amount of demand that may be satisfied using a configured placement. To limit the disturbance to the system caused by starting and stopping application instances, the placement algorithm attempts to minimize the number of placement changes. It also strives to keep resource utilization balanced across all server machines. Two types of resources are managed, one load-dependent and one load-independent. When putting the chosen placement in effect our controller schedules placement changes in a manner that limits the disruption to the system.
Keywords: dynamic application placement, performance management
Selective early request termination for busy internet services BIBAKFull-Text 605-614
  Jingyu Zhou; Tao Yang
Internet traffic is bursty and network servers are often overloaded with surprising events or abnormal client request patterns. This paper studies a load shedding mechanism called selective early request termination (SERT) for network services that use threads to handle multiple incoming requests continuously and concurrently. Our investigation with applications from Ask.com shows that during overloaded situations, a relatively small percentage of long requests that require excessive computing resource can dramatically affect other short requests and reduce the overall system throughput. By actively detecting and aborting overdue long requests, services can perform significantly better to achieve QoS objectives compared to a purely admission based approach. We have proposed a termination scheme that monitors running time of requests, accounts for their resource usage, adaptively adjusts the selection threshold, and performs a safe termination for a class of requests. This paper presents the design and implementation of this scheme and describes experimental results to validate the proposed approach.
Keywords: internet services, load shedding, request termination
SCTP: an innovative transport layer protocol for the web BIBAKFull-Text 615-624
  Preethi Natarajan; Janardhan R. Iyengar; Paul D. Amer; Randall Stewart
We propose using the Stream Control Transmission Protocol (SCTP), a recent IETF transport layer protocol, for reliable web transport. Although TCP has traditionally been used, we argue that SCTP better matches the needs of HTTP-based network applications. This position paper discusses SCTP features that address: (i) head-of-line blocking within a single TCP connection, (ii) vulnerability to network failures, and (iii) vulnerability to denial-of-service SYN attacks. We discuss our experience in modifying the Apache server and the Firefox browser to benefit from SCTP, and demonstrate our HTTP over SCTP design via simple experiments. We also discuss the benefits of using SCTP in other web domains through two example scenarios -- multiplexing user requests, and multiplexing resource access. Finally, we highlight several SCTP features that will be valuable to the design and implementation of current HTTP-based client-server applications.
Keywords: SCTP, fault-tolerance, head-of-line blocking, stream control transmission protocol, transport layer service, web applications, web transport

Data mining classification

Improved annotation of the blogosphere via autotagging and hierarchical clustering BIBAKFull-Text 625-632
  Christopher H. Brooks; Nancy Montanez
Tags have recently become popular as a means of annotating and organizing Web pages and blog entries. Advocates of tagging argue that the use of tags produces a 'folksonomy', a system in which the meaning of a tag is determined by its use among the community as a whole. We analyze the effectiveness of tags for classifying blog entries by gathering the top 350 tags from Technorati and measuring the similarity of all articles that share a tag. We find that tags are useful for grouping articles into broad categories, but less effective in indicating the particular content of an article. We then show that automatically extracting words deemed to be highly relevant can produce a more focused categorization of articles. We also show that clustering algorithms can be used to reconstruct a topical hierarchy among tags, and suggest that these approaches may be used to address some of the weaknesses in current tagging systems.
Keywords: automated annotation, blogs, hierarchical clustering, tagging
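The idea of measuring how similar the articles sharing a tag are can be sketched as an average pairwise cosine similarity. The abstract does not specify the similarity measure or term weighting, so the use of cosine similarity over raw word counts, the helper names `cosine` and `tag_cohesion`, and the toy articles are all assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tag_cohesion(articles):
    """Average pairwise cosine similarity of all articles sharing one tag."""
    vectors = [Counter(text.lower().split()) for text in articles]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Hypothetical articles grouped by tag.
same_topic = ["web search ranking", "web search ranking"]
unrelated = ["holiday photos", "election results"]
print(tag_cohesion(same_topic), tag_cohesion(unrelated))
```

A tag whose articles score close to zero groups content that shares almost no vocabulary, which is one way to read the finding that tags indicate broad categories better than particular content.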
Large-scale text categorization by batch mode active learning BIBAKFull-Text 633-642
  Steven C. H. Hoi; Rong Jin; Michael R. Lyu
Large-scale text categorization is an important research topic for Web data mining. One of the challenges in large-scale text categorization is how to reduce the human efforts in labeling text documents for building reliable classification models. In the past, there have been many studies on applying active learning methods to automatic text categorization, which try to select the most informative documents for labeling manually. Most of these studies focused on selecting a single unlabeled document in each iteration. As a result, the text categorization model has to be retrained after each labeled document is solicited. In this paper, we present a novel active learning algorithm that selects a batch of text documents for labeling manually in each iteration. The key of the batch mode active learning is how to reduce the redundancy among the selected examples such that each example provides unique information for model updating. To this end, we use the Fisher information matrix as the measurement of model uncertainty and choose the set of documents to effectively maximize the Fisher information of a classification model. Extensive experiments with three different datasets have shown that our algorithm is more effective than the state-of-the-art active learning techniques for text categorization and can be a promising tool toward large-scale text categorization for World Wide Web documents.
Keywords: Fisher information, active learning, convex optimization, logistic regression, text categorization
A comparison of implicit and explicit links for web page classification BIBAKFull-Text 643-650
  Dou Shen; Jian-Tao Sun; Qiang Yang; Zheng Chen
It is well known that Web-page classification can be enhanced by using hyperlinks that provide linkages between Web pages. However, in the Web space, hyperlinks are usually sparse, noisy and thus in many situations can only provide limited help in classification. In this paper, we extend the concept of linkages from explicit hyperlinks to implicit links built between Web pages. By observing that people who search the Web with the same queries often click on different, but related documents together, we draw implicit links between Web pages that are clicked after the same queries. Those pages are implicitly linked. We provide an approach for automatically building the implicit links between Web pages using Web query logs, together with a thorough comparison between the uses of implicit and explicit links in Web page classification. Our experimental results on a large dataset confirm that the use of the implicit links is better than using explicit links in classification performance, with an increase of more than 10.5% in terms of the Macro-F1 measurement.
Keywords: explicit link, implicit link, query log, virtual document, web page classification
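The construction of implicit links from a query log can be sketched as follows: two pages are linked whenever both were clicked after the same query. The log format (query, clicked URL), the function name `build_implicit_links`, and the use of a shared-query count as link weight are assumptions for illustration; the abstract does not describe these details.

```python
from collections import defaultdict
from itertools import combinations


def build_implicit_links(click_log):
    """Build implicit links from (query, clicked_url) pairs.

    Returns a dict mapping a sorted (url_a, url_b) pair to the number of
    distinct queries after which both pages were clicked.
    """
    pages_by_query = defaultdict(set)
    for query, url in click_log:
        pages_by_query[query].add(url)

    links = defaultdict(int)
    for urls in pages_by_query.values():
        for u, v in combinations(sorted(urls), 2):
            links[(u, v)] += 1
    return dict(links)


# Hypothetical click-through records.
log = [
    ("jaguar", "cars.example"),
    ("jaguar", "cats.example"),
    ("big cats", "cats.example"),
    ("big cats", "zoo.example"),
    ("jaguar", "zoo.example"),
]
links = build_implicit_links(log)
print(links)
```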

E-commerce & e-government

An e-market framework for informed trading BIBAKFull-Text 651-658
  John Debenham; Simeon Simoff
Fully automated trading, such as e-procurement, using the Internet is virtually unheard of today. Three core technologies are needed to fully automate the trading process: data mining, intelligent trading agents and virtual institutions in which informed trading agents can trade securely both with each other and with human agents in a natural way. This paper describes a demonstrable prototype e-trading system that integrates these three technologies and is available on the World Wide Web. This is part of a larger project that aims to make informed automated trading a reality.
Keywords: data mining, electronic markets, market reliability, trading agents, virtual institutions
The impact of online music services on the demand for stars in the music industry BIBAKFull-Text 659-667
  Ian Pascal Volz
The music industry's business model is to produce stars. In order to do so, musicians producing music that fits into well defined clusters of factors explaining the demand of the majority of music consumers are disproportionately promoted. This leads to a limitation of available diversity and therefore to a limitation of the end user's benefit from listening to music. This paper analyses online music consumers' needs and preferences. These factors are used in order to explain the demand for stars and the impact of different online music services on promoting a more diverse music market.
Keywords: JEL-classification, MP3, iTunes, music, peer-to-peer, stardom, virtual community
The web structure of e-government -- developing a methodology for quantitative evaluation BIBAKFull-Text 669-678
  Vaclav Petricek; Tobias Escher; Ingemar J. Cox; Helen Margetts
In this paper we describe preliminary work that examines whether statistical properties of the structure of websites can be an informative measure of their quality. We aim to develop a new method for evaluating e-government. E-government websites are evaluated regularly by consulting companies, international organizations and academic researchers using a variety of subjective measures. We aim to improve on these evaluations using a range of techniques from webmetric and social network analysis. To pilot our methodology, we examine the structure of government audit office sites in Canada, the USA, the UK, New Zealand and the Czech Republic.
   We report experimental values for a variety of characteristics, including the connected components, the average distance between nodes, the distribution of path lengths, and the indegree and outdegree. These measures are expected to correlate with (i) the navigability of a website and (ii) its "nodality", which is a combination of hubness and authority. Comparison of websites based on these characteristics raised a number of issues, related to the proportion of non-hyperlinked content (e.g. pdf and doc files) within a site, and both the very significant differences in the size of the websites and their respective national populations. Methods to account for these issues are proposed and discussed.
   There appears to be some correlation between the values measured and the league tables reported in the literature. However, this multidimensional analysis provides a richer source of evaluative techniques than previous work. Our analysis indicates that the US and Canada provide better navigability, much better than the UK; however, the UK site is shown to have the strongest "nodality" on the Web.
Keywords: e-government, national audit offices, network, ranking, webmetric
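Two of the structural characteristics named above, the indegree/outdegree and the average distance between nodes, can be sketched on a small link graph. The adjacency-list representation, the function names `degrees` and `average_distance`, and the toy site are assumptions for illustration; the sketch also assumes every page appears as a key, and it averages only over ordered pairs where the target is reachable.

```python
from collections import deque


def degrees(graph):
    """Return (indegree, outdegree) dicts for a directed link graph."""
    outdeg = {node: len(set(targets)) for node, targets in graph.items()}
    indeg = {node: 0 for node in graph}
    for targets in graph.values():
        for target in set(targets):
            indeg[target] = indeg.get(target, 0) + 1
    return indeg, outdeg


def average_distance(graph):
    """Mean shortest-path length over all ordered reachable page pairs."""
    total = pairs = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            node = queue.popleft()
            for nxt in graph.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0


# Hypothetical four-page site: page -> pages it links to.
site = {
    "home": ["about", "reports"],
    "about": ["home"],
    "reports": ["report1"],
    "report1": [],
}
indeg, outdeg = degrees(site)
avg = average_distance(site)
print(indeg, outdeg, avg)
```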

XML & web services

One document to bind them: combining XML, web services, and the semantic web BIBAKFull-Text 679-686
  Harry Halpin; Henry S. Thompson
We present a paradigm for uniting the diverse strands of XML-based Web technologies by allowing them to be incorporated within a single document. This overcomes the distinction between programs and data to make XML truly "self-describing." A proposal for a lightweight yet powerful functional XML vocabulary called "Semantic fXML" is detailed, based on the well-understood functional programming paradigm and resembling the embedding of Lisp directly in XML. Infosets are made "dynamic," since documents can now directly embed local processes or Web Services into their Infoset. An optional typing regime for Infosets is provided by Semantic Web ontologies. By regarding Web Services as functions and the Semantic Web as providing types, and tying it all together within a single XML vocabulary, the Web can compute. In this light, the real Web 2.0 can be considered the transformation of the Web from a universal information space to a universal computation space.
Keywords: XML, functional programming, pipelining, semantic web, web services
ASDL: a wide spectrum language for designing web services BIBAKFull-Text 687-696
  Monika Solanki; Antonio Cau; Hussein Zedan
A service-oriented system emerges from composition of services. Dynamically composed reactive Web services form a special class of service-oriented system, where the delays associated with communication, unreliability and unavailability of services, and competition for resources from multiple service requesters are dominant concerns. As the complexity of services increases, an abstract design language for the specification of services and interaction between them is desired. In this paper, we present ASDL (Abstract Service Design Language), a wide spectrum language for modelling Web services. We initially provide an informal description of our computational model for service-oriented systems. We then present ASDL along with its specification-oriented semantics defined in Interval Temporal Logic (ITL): a sound formalism for specifying and reasoning about temporal properties of systems. The objective of ASDL is to provide a notation for the design of service composition and interaction protocols at an abstract level.
Keywords: ASDL, computational model, web services, wide spectrum
Semantic WS-agreement partner selection BIBAKFull-Text 697-706
  Nicole Oldham; Kunal Verma; Amit Sheth; Farshad Hakimpour
In a dynamic service-oriented environment it is desirable for service consumers and providers to offer and obtain guarantees regarding their capabilities and requirements. WS-Agreement defines a language and protocol for establishing agreements between two parties. The agreements are complex and expressive to the extent that the manual matching of these agreements would be expensive both in time and resources. It is essential to develop a method for matching agreements automatically. This work presents the framework and implementation of an innovative tool for matching providers and consumers based on WS-Agreements. The approach utilizes Semantic Web technologies to achieve rich and accurate matches. A key feature is the novel and flexible approach for achieving user personalized matches.
Keywords: ARL, OWL, WS-agreement, WSDL-S, agreement matching, dynamic service selection, multi-ontology service annotation, ontologies, semantic policy matching, semantic web service, snobase

Improved search ranking

Beyond PageRank: machine learning for static ranking BIBAKFull-Text 707-715
  Matthew Richardson; Amit Prakash; Eric Brill
Since the publication of Brin and Page's paper on PageRank, many in the Web community have depended on PageRank for the static (query-independent) ordering of Web pages. We show that we can significantly outperform PageRank using features that are independent of the link structure of the Web. We gain a further boost in accuracy by using data on the frequency at which users visit Web pages. We use RankNet, a ranking machine learning algorithm, to combine these and other static features based on anchor text and domain characteristics. The resulting model achieves a static ranking pairwise accuracy of 67.3% (vs. 56.7% for PageRank or 50% for random).
Keywords: PageRank, RankNet, relevance, search engines, static ranking
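As an illustration of the pairwise accuracy figures quoted in the abstract above, the following minimal sketch assumes one common definition: the fraction of page pairs with different human quality labels that a scoring function orders the same way as the labels. The function name, labels, and scores are hypothetical and are not taken from the paper.

```python
from itertools import combinations

def pairwise_accuracy(labels, scores):
    """Fraction of pairs with differing labels that the scores order correctly."""
    correct = 0
    total = 0
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            continue  # pairs with equal labels carry no ordering information
        total += 1
        # the pair is ordered correctly when label and score differences share a sign
        if (labels[i] - labels[j]) * (scores[i] - scores[j]) > 0:
            correct += 1
    return correct / total if total else 0.0

# hypothetical data: human quality labels and model scores for four pages
labels = [3, 1, 2, 0]
scores = [0.9, 0.4, 0.3, 0.1]
accuracy = pairwise_accuracy(labels, scores)  # 5 of 6 informative pairs agree
```

Under this definition a scorer that orders pairs at random would be expected to land near 50%, which is consistent with the random baseline mentioned in the abstract.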
Optimizing scoring functions and indexes for proximity search in type-annotated corpora BIBAKFull-Text 717-726
  Soumen Chakrabarti; Kriti Puniyani; Sujatha Das
We introduce a new, powerful class of text proximity queries: find an instance of a given "answer type" (person, place, distance) near "selector" tokens matching given literals or satisfying given ground predicates. An example query is type=distance NEAR Hamburg Munich. Nearness is defined as a flexible, trainable parameterized aggregation function of the selectors, their frequency in the corpus, and their distance from the candidate answer. Such queries provide a key data reduction step for information extraction, data integration, question answering, and other text-processing applications. We describe the architecture of a next-generation information retrieval engine for such applications, and investigate two key technical problems faced in building it. First, we propose a new algorithm that estimates a scoring function from past logs of queries and answer spans. Plugging the scoring function into the query processor gives high accuracy: typically, an answer is found at rank 2-4. Second, we exploit the skew in the distribution over types seen in query logs to optimize the space required by the new index structures required by our system. Extensive performance studies with a 10GB, 2-million document TREC corpus and several hundred TREC queries show both the accuracy and the efficiency of our system. From an initial 4.3GB index using 18,000 types from WordNet, we can discard 88% of the space, while inflating query times by a factor of only 1.9. Our final index overhead is only 20% of the total index space needed.
Keywords: indexing annotated text
Automatic identification of user interest for personalized search BIBAKFull-Text 727-736
  Feng Qiu; Junghoo Cho
One hundred users, one hundred needs. As more and more topics are being discussed on the web and our vocabulary remains relatively stable, it is increasingly difficult to let the search engine know what we want. Coping with ambiguous queries has long been an important part of the research on Information Retrieval, but still remains a challenging task. Personalized search has recently got significant attention in addressing this challenge in the web search community, based on the premise that a user's general preference may help the search engine disambiguate the true intention of a query. However, studies have shown that users are reluctant to provide any explicit input on their personal preference. In this paper, we study how a search engine can learn a user's preference automatically based on her past click history and how it can use the user preference to personalize search results. Our experiments show that users' preferences can be learned accurately even from little click-history data and personalized search based on user preference yields significant improvements over the best existing ranking mechanism in the literature.
Keywords: personalized search, user profile, user search behavior, web search
Protecting browser state from web privacy attacks BIBAKFull-Text 737-744
  Collin Jackson; Andrew Bortz; Dan Boneh; John C. Mitchell
Through a variety of means, including a range of browser cache methods and inspecting the color of a visited hyperlink, client-side browser state can be exploited to track users against their wishes. This tracking is possible because persistent, client-side browser state is not properly partitioned on a per-site basis in current browsers. We address this problem by refining the general notion of a "same-origin" policy and implementing two browser extensions that enforce this policy on the browser cache and visited links.
   We also analyze various degrees of cooperation between sites to track users, and show that even if long-term browser state is properly partitioned, it is still possible for sites to use modern web features to bounce users between sites and invisibly engage in cross-domain tracking of their visitors. Cooperative privacy attacks are an unavoidable consequence of all persistent browser state that affects the behavior of the browser, and disabling or frequently expiring this state is the only way to achieve true privacy against colluding parties.
Keywords: phishing, privacy, web browser design, web spoofing
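The idea of partitioning browser state per site can be sketched as a cache whose key includes the origin of the embedding page, so that one site cannot observe whether another site has already caused a resource to be cached. This is a toy model under stated assumptions, not the authors' browser extensions; the class name, helper, and example URLs are hypothetical.

```python
from urllib.parse import urlsplit

def origin(url):
    """Return the (scheme, host, port) origin of a URL."""
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port)

class PartitionedCache:
    """Toy cache keyed by (origin of embedding page, resource URL)."""

    def __init__(self):
        self._store = {}

    def get(self, top_level_url, resource_url):
        # a lookup only hits entries stored under the same embedding origin
        return self._store.get((origin(top_level_url), resource_url))

    def put(self, top_level_url, resource_url, body):
        self._store[(origin(top_level_url), resource_url)] = body

cache = PartitionedCache()
cache.put("https://a.example/page", "https://cdn.example/logo.png", b"bytes")
hit_same_site = cache.get("https://a.example/other", "https://cdn.example/logo.png")
hit_other_site = cache.get("https://b.example/page", "https://cdn.example/logo.png")
```

Here the second page on a.example reuses the cached entry, while b.example sees a miss and therefore cannot infer the earlier visit from cache behavior.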


Meaning on the web: evolution vs intelligent design? BIBAFull-Text 745
  Ron Brachman; Dan Connolly; Rohit Khare; Frank Smadja; Frank van Harmelen
It is a truism that as the Web grows in size and scope, it becomes harder to find what we want, to identify like-minded people and communities, to find the best ads to offer, and to have applications work together smoothly. Services don't interoperate; queries yield long lists of results, most of which seem to miss the point. If the Web were a person, we would expect richer and more successful interactions with it -- interactions that were, quite literally, more meaningful. That's because in human discourse, it is shared meaning that gives us real communication. Yet with the current Web, meaning cannot be found.
   Much recent work has aspired to change this, both for human-machine interchange and machine-machine synchronization. Certainly the "semantic web" looks to add meaning to our current simplistic matching of mere strings of characters against mere "bags" of words. But can we legislate meaning from on high? Isn't meaning organic and determined by use, a moving and context-dependent target? But if meaning is an evolving organic soup, how are humans able to get anything done with one another? Don't we love to "define our terms"? But then again, is real definition even possible?
   These questions have daunted philosophers for years, and we probably won't solve them here. But we'll try to understand what's at the root of our own current religious debate: should meaning on the Web be evolutionary, driven organically through the bottom-up human assignment of tags? Or does it need to be carefully crafted and managed by a higher authority, using structured representations with defined semantics? Without picket signs or violence (we hope), our panelists will explore the two extreme ends of the spectrum -- and several points in between.
Identity management on converged networks: a reality check BIBAFull-Text 747
  Arnaud Sahuguet; Stefan Brands; Kim Cameron; Cahill Conor; Aude Pichelin; Fulup Ar Foll; Mike Neuenschwander
Since the early days of the Web, identity management has been a big issue. As the famous cartoon from the New Yorker reminds us, "on the internet, nobody knows you are a dog". This was true back in July 1993. This is true today. For the last few years, numerous initiatives have emerged to tackle this issue: Microsoft Passport, Liberty Alliance, 3GPP GUP, Shibboleth, to name a few. Major investments are being made in this area and this is foreseen as a multi-billion dollar market. Yet, as of this writing, there is still no widespread identity management infrastructure in place ready to be used by the general public on converged networks.
   The goal of this panel is to do a reality check and try to answer the following five questions:
  • What is identity management?
  • Who needs identity management and why?
  • What will the identity management ecosystem look like?
  • What's agreed upon?
  • What's next?
    Phoiling phishing BIBAFull-Text 749
      Rachna Dhamija; Peter Cassidy; Phillip Hallam-Baker; Markus Jacobsson
    In the last few years, Internet users have seen the rapid expansion of "phishing", the use of spoofed e-mails and fraudulent websites designed to trick users into divulging sensitive data. More recently, we have seen the growth of "pharming", the use of malware or DNS-based attacks to misdirect users to rogue websites. In this panel, we will examine the state of the art in anti-phishing solutions and explore promising directions for future research.
    The next wave of the web BIBAFull-Text 750
      Nigel Shadbolt; Tim Berners-Lee; Jim Hendler; Claire Hart; Richard Benjamins
    The World Wide Web has been revolutionary in terms of impact, scale and outreach. At every level society has been changed in some way by the Web. This Panel will consider likely developments in this extraordinary human construct as we attempt to realise the Next Wave of the Web -- a Semantic Web.
       Nigel Shadbolt will Chair a discussion that will focus on the prospects for the Semantic Web, its likely form and the challenges it faces. Can we achieve the necessary agreements on shared meaning for the Semantic Web? Can we achieve a critical mass of semantically annotated data and content? How are we to trust such content? Do the scientific and commercial drivers really demand a Semantic Web? How will the move to a mobile and ubiquitous Web affect the Semantic Web? How does Web 2.0 relate to the Semantic Web?


    Compressing and searching XML data via two zips BIBAKFull-Text 751-760
      P. Ferragina; F. Luccio; G. Manzini; S. Muthukrishnan
    XML is fast becoming the standard format to store, exchange and publish data over the web, and is getting embedded in applications. Two challenges in handling XML are its size (the XML representation of a document is significantly larger than its native state) and the complexity of its search (XML search involves path and content searches on labeled tree structures). We address the basic problems of compression, navigation and searching of XML documents. In particular, we adopt recently proposed theoretical algorithms [11] for succinct tree representations to design and implement a compressed index for XML, called XBZIPiNDEX, in which the XML document is maintained in a highly compressed format, and both navigation and searching can be done by uncompressing only a tiny fraction of the data. This solution relies on compressing and indexing two arrays derived from the XML data. With detailed experiments we compare this with other compressed XML indexing and searching engines to show that XBZIPiNDEX has a compression ratio up to 35% better than those achievable by the other tools, and its time performance on some path and content search operations is orders of magnitude faster: a few milliseconds over hundreds of MBs of XML files versus tens of seconds, on standard XML data sources.
    Keywords: XML compression and indexing, labeled trees

    Developing regions & peer-to-peer

    Wake-on-WLAN BIBAKFull-Text 761-769
      Nilesh Mishra; Kameswari Chebrolu; Bhaskaran Raman; Abhinav Pathak
    In bridging the digital divide, two important criteria are cost-effectiveness and power optimization. While 802.11 is cost-effective and is being used in several installations in the developing world, typical system configurations are not really power efficient. In this paper, we propose a novel "Wake-on-WLAN" mechanism for coarse-grained, on-demand power on/off of the networking equipment at a remote site. The novelty also lies in our implementation of a prototype system using low-power 802.15.4-based sensor motes. We describe the prototype, as well as its evaluation in the field in a WiFi testbed. Preliminary estimates indicate that the proposed mechanism can save significant power in typical rural networking settings.
    Keywords: 802.11 mesh network, 802.15.4, power management, rural networking, wake-on-WLAN
    Analysis of WWW traffic in Cambodia and Ghana BIBAKFull-Text 771-780
      Bowei Du; Michael Demmer; Eric Brewer
    In this paper we present an analysis of HTTP traffic captured from Internet cafés and kiosks in two different developing countries -- Cambodia and Ghana. This paper has two main contributions. The first contribution is an analysis of the characteristics of the web trace, including the distribution and classification of the web objects requested by the users. We outline notable features of the data set which affect the performance of the web for users in developing regions. Using the trace data, we also perform several simulation analyses of cache performance, including both traditional caching and more novel off-line caching proposals. The second contribution is a set of suggestions on mechanisms to improve the user experience of the web in these regions. These mechanisms include both applications of well-known research techniques and some less well-studied suggestions based on intermittent connectivity.
    Keywords: Cambodia, Ghana, HTTP, WWW, caching, classification, delay tolerant networking, developing regions, dynamic content, hypertext transfer protocol, measurement, performance analysis, proxy, redundant transfers, trace, world wide web

    Developing regions 2

    The case for multi-user design for computer aided learning in developing regions BIBAKFull-Text 781-789
      Joyojeet Pal; Udai Singh Pawar; Eric A. Brewer; Kentaro Toyama
    Computer-aided learning is fast gaining traction in developing regions as a means to augment classroom instruction. Reasons for using computer-aided learning range from supplementing teacher shortages to starting underprivileged children off in technology, and funding for such initiatives ranges from state education funds to international agencies and private groups interested in child development. The interaction of children with computers is seen at various levels, from unsupervised self-guided learning at public booths without specific curriculum to highly regulated in-class computer applications with modules designed to go with school curriculum. Such learning is used at various levels, from children as young as 5 years old to high-schoolers. This paper uses field observations of primary school children in India using computer-aided learning modules, and finds patterns by which children who perform better in classroom activities seat themselves in front of computer monitors, and control the mouse, in cases where children are required to share computer resources. We find that in such circumstances, there emerges a pattern of learning, unique to multi-user environments -- wherein certain children tend to learn better because of their control of the mouse. This research also shows that while computer-aided learning software for children is primarily designed for single users, the implementation realities of resource-strapped learning environments in developing regions present a strong case for multi-user design.
    Keywords: developing regions
    Designing an architecture for delivering mobile information services to the rural developing world BIBAKFull-Text 791-800
      Tapan S. Parikh; Edward D. Lazowska
    Implementing successful rural computing applications requires addressing a number of significant challenges. Recent advances in mobile phone computing capabilities make this device a likely candidate to address the client hardware constraints. Long battery life, wireless connectivity, solid-state memory, low price and immediate utility all make it better suited to rural conditions than a PC. However, current mobile software platforms are not as appropriate. Web-based mobile applications are hard to use, do not take advantage of the mobile phone's media capabilities and require an online connection. Custom mobile applications are difficult to develop and distribute. To address these limitations we present CAM -- a new framework for developing and deploying mobile computing applications in the rural developing world. CAM applications are accessed by capturing barcodes using the mobile phone camera, or entering numeric strings with the keypad. Supporting minimal navigation, direct linkage to paper practices and offline multi-media interaction, CAM is uniquely adapted to rural device, user and infrastructure constraints. To illustrate the breadth of the framework, we list a number of CAM-based applications that we have implemented or are planning. These include processing microfinance loans, facilitating rural supply chains, documenting grassroots innovation and accessing electronic medical histories.
    Keywords: ICT, client-server distributed systems, mobile computing, mobile phones, paper user interface, rural development
    WebKhoj: Indian language IR from multiple character encodings BIBAKFull-Text 801-809
      Prasad Pingali; Jagadeesh Jagarlamudi; Vasudeva Varma
    Today web search engines provide the easiest way to reach information on the web. In this scenario, more than 95% of Indian language content on the web is not searchable due to multiple encodings of web pages.
       Most of these encodings are proprietary and hence need some kind of standardization for making the content accessible via a search engine. In this paper we present a search engine called WebKhoj which is capable of searching multi-script and multi-encoded Indian language content on the web. We describe a language-focused crawler and the transcoding processes involved to achieve accessibility of Indian language content. In the end we report some of the experiments that were conducted along with results on Indian language web content.
    Keywords: Indian languages, non-standard encodings, web search

    Industrial practice & experience

    Using annotations in enterprise search BIBAKFull-Text 811-817
      Pavel A. Dmitriev; Nadav Eiron; Marcus Fontoura; Eugene Shekita
    A major difference between corporate intranets and the Internet is that in intranets the barrier for users to create web pages is much higher. This limits the amount and quality of anchor text, one of the major factors used by Internet search engines, making intranet search more difficult. The social phenomenon at play also means that spam is relatively rare. Both on the Internet and in intranets, users are often willing to cooperate with the search engine in improving the search experience. These characteristics naturally lead to considering using user feedback to improve search quality in intranets. In this paper we show how a particular form of feedback, namely user annotations, can be used to improve the quality of intranet search. An annotation is a short description of the contents of a web page, which can be considered a substitute for anchor text. We propose two ways to obtain user annotations, using explicit and implicit feedback, and show how they can be integrated into a search engine. Preliminary experiments on the IBM intranet demonstrate that using annotations improves the search quality.
    Keywords: anchortext, community ranking, enterprise search
    Detecting semantic cloaking on the web BIBAKFull-Text 819-828
      Baoning Wu; Brian D. Davison
    By supplying different versions of a web page to search engines and to browsers, a content provider attempts to cloak the real content from the view of the search engine. Semantic cloaking refers to differences in meaning between pages which have the effect of deceiving search engine ranking algorithms. In this paper, we propose an automated two-step method to detect semantic cloaking pages based on different copies of the same page downloaded by a web crawler and a web browser. The first step is a filtering step, which generates a candidate list of semantic cloaking pages. In the second step, a classifier is used to detect semantic cloaking pages from the candidates generated by the filtering step. Experiments on manually labeled data sets show that we can generate a classifier with a precision of 93% and a recall of 85%. We apply our approach to links from the dmoz Open Directory Project and estimate that more than 50,000 of these pages employ semantic cloaking.
    Keywords: spam, web search engine
    Detecting online commercial intention (OCI) BIBAKFull-Text 829-837
      Honghua (Kathy) Dai; Lingzhi Zhao; Zaiqing Nie; Ji-Rong Wen; Lee Wang; Ying Li
    Understanding the goals and preferences behind a user's online activities can greatly help information providers, such as search engines and E-Commerce web sites, to personalize contents and thus improve user satisfaction. Understanding a user's intention could also provide other business advantages to information providers. For example, information providers can decide whether to display commercial content based on a user's intent to purchase. Previous work on Web search defines three major types of user search goals for search queries: navigational, informational and transactional or resource [1][7]. In this paper, we focus our attention on capturing commercial intention from search queries and Web pages, i.e., when a user submits a query or browses a Web page, whether he/she is about to commit to or is in the middle of a commercial activity, such as purchase, auction, selling, paid service, etc. We call the commercial intention behind a user's online activities OCI (Online Commercial Intention). We also propose the notion of "Commercial Activity Phase" (CAP), which identifies which phase a user is in during his/her commercial activities: Research or Commit. We present a framework for building machine learning models to learn OCI based on any Web page content. Based on that framework, we build models to detect OCI from search queries and Web pages. We train machine learning models from two types of data sources for a given search query: content of algorithmic search result page(s) and contents of top sites returned by a search engine. Our experiments show that the model based on the first data source achieved better performance. We also discover that frequent queries are more likely to have commercial intention. Finally we propose our future work in learning richer commercial intention behind users' online activities.
    Keywords: OCI, SVM, intention, online commercial intention, search intention

    Browsers and UI, web engineering, hypermedia & multimedia, security, and accessibility

    Temporal rules for mobile web personalization BIBAKFull-Text 839-840
      Martin Halvey; Mark T. Keane; Barry Smyth
    Many systems use past behavior, preferences and environmental factors to attempt to predict user navigation on the Internet. However we believe that many of these models have shortcomings, in that they do not take into account that users may have many different sets of preferences. Here we investigate an environmental factor, namely time, in making predictions about user navigation. We present methods for creating temporal rules that describe user navigation patterns. We also show the benefit of using these rules to predict user navigation and also show the benefits of these models over traditional methods. An analysis is carried out on a sample of usage logs for Wireless Application Protocol (WAP) browsing, and the results of this analysis verify our hypothesis.
    Keywords: WAP, WWW, mobile, temporal models, user modeling
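The abstract above does not describe how its temporal rules are constructed, so the following is only a minimal sketch of the general idea of conditioning navigation predictions on time. It assumes a hypothetical log of (hour, current page, next page) records and arbitrary period boundaries, and predicts the most frequent next page seen in the same period of the day.

```python
from collections import Counter, defaultdict

def time_bucket(hour):
    """Map an hour (0-23) to a coarse period; the boundaries are arbitrary."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    return "evening"

def build_rules(log):
    """Count next-page transitions per (period, current page)."""
    counts = defaultdict(Counter)
    for hour, current, nxt in log:
        counts[(time_bucket(hour), current)][nxt] += 1
    return counts

def predict(rules, hour, current):
    """Return the most frequent next page for this period, or None if unseen."""
    seen = rules.get((time_bucket(hour), current))
    return seen.most_common(1)[0][0] if seen else None

# hypothetical usage log: (hour of day, current page, next page)
log = [
    (8, "home", "news"),
    (9, "home", "news"),
    (20, "home", "games"),
    (21, "home", "games"),
    (22, "home", "news"),
]
rules = build_rules(log)
morning_guess = predict(rules, 7, "home")
evening_guess = predict(rules, 23, "home")
```

In this toy log the same starting page leads to different predictions in the morning and in the evening, which is the kind of effect a time-independent model would average away.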
    Behavior-based web page evaluation BIBAKFull-Text 841-842
      Ganesan Velayathan; Seiji Yamada
    This paper describes our efforts to factor in a user's browsing behavior to automatically evaluate the web pages that the user shows interest in. To evaluate a web page automatically, we have developed a client-side logging tool: the GINIS Framework. We do not focus just on clicking, scrolling, navigation, or duration of visit alone, but we propose integrating these patterns of interaction to recognize and evaluate a user's response to a given web page.
    Keywords: automatic profiling, information extraction, web browsing behavior, web usage mining, web-human interaction
    Using web browser interactions to predict task BIBAKFull-Text 843-844
      Melanie Kellar; Carolyn Watters
    The automatic identification of a user's task has the potential to improve information filtering systems that rely on implicit measures of interest and whose effectiveness may be dependent upon the task at hand. Knowledge of a user's current task type would allow information filtering systems to apply the most useful measures of user interest. We recently conducted a field study in which we logged all participants' interactions with their web browsers and asked participants to categorize their web usage according to a high-level task schema. Using the data collected during this study, we have conducted a preliminary exploration of the usefulness of logged web browser interactions to predict users' tasks. The results of this initial analysis suggest that individual models of users' web browser interactions may be useful in predicting task type.
    Keywords: decision tree, field study, information filtering, task, task prediction, web
    An integrated method for social network extraction BIBAKFull-Text 845-846
      Tom Hope; Takuichi Nishimura; Hideaki Takeda
    A social network can become a basis for information infrastructure in the future. It is important to extract social networks that are not biased. Providing a simple means for users to register their social relations is also important. We propose a method that combines various approaches to extract social networks. In particular, three kinds of networks are extracted: a user-registered Know link network, a Web-mined Web link network, and a face-to-face Touch link network. In this paper, the combination of social network extraction approaches for communities is described, and an analysis of the extracted social networks is shown.
    Keywords: social network, user interaction, web mining
    Integrating semantic web and language technologies to improve the online public administrations services BIBAKFull-Text 847-848
      Marta Gatius; Meritxell González; Sheyla Militello; Pablo Hernández
    In this paper, we describe how domain ontologies are used in a dialogue system guiding the user to access web public administration contents. The current implementation of the system supports speech (through the telephone) and text mode in different languages (English, Spanish, Catalan and Italian).
    Keywords: dialogue systems, e-government, ontologies, web usability
    DemIL: an online interaction language between citizen and government BIBAKFull-Text 849-850
      Cristiano Maciel; Ana Cristina Bicharra Garcia
    Electronic democracy should provide information and services for citizens on the Internet, allowing room for debate, participation and electronic voting. The languages being adopted by mass communication means, especially Reality Shows, are efficient and encourage public participation in decision-making. This paper discusses a citizen-government interaction language intended to facilitate citizen participation in the government's decisions. An e-Democracy Model for people participation through web-based technologies is conceived. This model specifies the syntax of a Democracy Interaction Language, DemIL. Such a language incorporates characteristics of Reality Show Formats, and it is the back-end of a web-interface project in the domain researched. A case study of the Participative Budget of Brazil illustrates the proposed language.
    Keywords: e-democracy, e-government, interaction, interface
    Web annotation sharing using P2P BIBAKFull-Text 851-852
      Osamu Segawa
    We have developed a system that allows users to add annotations immediately onto a Web page they are viewing, and share the information via a network. A novel feature of our method is that P2P nodes in the system determine their roles autonomously, and share the annotation data. Our method is based on P2P; however, P2P nodes in the system change their roles and data transfer procedures, depending on their network topology or the status of other nodes.
       Our method is robust to node or network problems, and has flexible scalability.
    Keywords: P2P, annotation
    Generating summaries for large collections of geo-referenced photographs BIBAKFull-Text 853-854
      Alexander Jaffe; Mor Naaman; Tamir Tassa; Marc Davis
    We describe a framework for automatically selecting a summary set of photographs from a large collection of geo-referenced photos. The summary algorithm is based on spatial patterns in photo sets, but can be expanded to support social, temporal, as well as textual-topical factors of the photo set. The summary set can be biased by the user, the content of the user's query, and the context in which the query is made. An initial evaluation on a set of geo-referenced photos shows that our algorithm performs well, producing results that are highly rated by users.
    Keywords: collection summary, geo-referenced information, geo-referenced photographs, photo browsing, photo collections, semantic zoom
    Determining user interests about museum collections BIBAKFull-Text 855-856
      Lloyd Rutledge; Lora Aroyo; Natalia Stash
    Currently, there is an increasing effort to provide various personalized services on museum web sites. This paper presents an approach for determining user interests in a museum collection with the help of an interactive dialog. It uses a semantically annotated collection of the Rijksmuseum Amsterdam to elicit a specific user's interests in artists, periods, genres and themes and uses these values to recommend relevant artefacts and related concepts from the museum collection. In the presented prototype, we show how constructing a user profile and applying recommender strategies in this way enables the dynamic generation of personalized museum tours for different users.
    Keywords: museum collections, personalization, recommender systems, semantic browsing, user profiling
    GIO: a semantic web application using the information grid framework BIBAKFull-Text 857-858
      Omar Alonso; Sandeepan Banerjee; Mark Drake
    It is well understood that the success of Semantic Web applications depends on the availability of machine-understandable meta-data. We describe the Information Grid, a practical approach to the Semantic Web, and show a prototype implementation. Information grid resources span all the data in the organization and all the metadata required to make it meaningful. The final goal is to let organizations view their assets in a smooth continuum from the Internet to the Intranet, with uniform semantically rich access.
    Keywords: RDF, browsing, clustering, databases, information visualization, meta-data, search, semantic web, tools, user interface
    Graphical representation of RDF queries BIBAKFull-Text 859-860
      Andreas Harth; Sebastian Ryszard Kruk; Stefan Decker
    In this poster we discuss a graphical notation for representing queries for semistructured data. We try to strike a balance between expressiveness of the query language and simplicity and understandability of the graphical notation. We present the primitives of the notation by means of examples.
    Keywords: RDF, metadata, query, semistructured data
    Question answering on top of the BT digital library BIBAKFull-Text 861-862
      Philipp Cimiano; Peter Haase; York Sure; Johanna Völker; Yimin Wang
    In this poster we present an approach to query answering over knowledge sources that makes use of different ontology management components within an application scenario of the BT Digital Library. The novelty of the approach lies in the combination of different semantic technologies providing a clear benefit for the application scenario considered.
    Keywords: natural language processing, ontology learning, question answering, web ontologies
    XPath filename expansion in a Unix shell BIBAFull-Text 863-864
      Kaspar Giger; Erik Wilde
    Locating files based on file system structure, file properties, and maybe even file contents is a core task of the user interface of operating systems. By adapting XPath's power to the environment of a Unix shell, it is possible to greatly increase the expressive power of the command line language. We present a concept for integrating an XPath view of the file system into a shell, the XPath Shell (XPsh), which can be used to find files based on file attributes and contents in a very flexible way. The syntax of the command line language is backwards compatible with traditional shells, and the new XPath-based expressions can be easily mastered with a little bit of XPath knowledge.
    Microformats: a pragmatic path to the semantic web BIBAKFull-Text 865-866
      Rohit Khare; Tantek Çelik
    Microformats are a clever adaptation of semantic XHTML that makes it easier to publish, index, and extract semi-structured information such as tags, calendar entries, contact information, and reviews on the Web. This makes it a pragmatic path towards achieving the vision set forth for the Semantic Web.
       Even though it sidesteps the existing "technology stack" of RDF, ontologies, and Artificial Intelligence-inspired processing tools, various microformats have emerged that parallel the goals of several well-known Semantic Web projects. This poster compares their prospects to the Semantic Web according to Rogers' Diffusion of Innovation model.
    Keywords: CSS, HTML, decentralization, microformats, semantic web
    SGSDesigner: a graphical interface for annotating and designing semantic grid services BIBAKFull-Text 867-868
      Asunción Gómez-Pérez; Rafael González-Cabero
    In this paper, we describe SGSDesigner, the ODESGS Environment user interface. ODESGS Environment (the realization of the ODESGS Framework [1]) is an environment for supporting both a) the annotation of pre-existing Grid Services (GSs) and b) the design of new complex Semantic Grid Services (SGSs) in a (semi) automatic way.
    Keywords: problem-solving methods, semantic grid services
    Status of the African Web BIBAKFull-Text 869-870
      Rizza Camus Caminero; Pavol Zavarsky; Yoshiki Mikami
    As part of the Language Observatory Project [4], we have been crawling all the web space since 2004. We have collected terabytes of data mostly from Asian and African ccTLDs. In this paper, we present results of the current status of the African web and compare it with its status in 2004 and 2002. This paper focuses on the accessibility of the web pages, the web tree growth, web technology, privacy protection, and web interconnection.
    Keywords: Africa, ccTLD, interconnection, internet statistics, privacy protection, web accessibility, web graph, web tree
    Personalization and accessibility: integration of library and web approaches BIBAKFull-Text 871-872
      Ann Chapman; Brian Kelly; Liddy Nevile; Andy Heath
    This paper describes personalization metadata standards that can be used to enable individuals to access and use resources based on a user's particular requirements. The paper describes two approaches which are being developed in the library and Web worlds and highlights some of the potential challenges which will need to be addressed in order to maximise interoperability. The paper concludes by arguing the need for greater dialogue across these two communities.
    Keywords: IMS, MARC, accessibility, metadata
    Testing google interfaces modified for the blind BIBAKFull-Text 873-874
      Patrizia Andronico; Marina Buzzi; Barbara Leporini; Carlos Castillo
    We present the results of a research project focused on improving the usability of web search tools for blind users who interact via screen reader and voice synthesizer. In the first stage of our study, we proposed eight specific guidelines for simplifying this interaction with search engines. Next, we evaluated these criteria by applying them to Google UIs, re-implementing the simple search and the result page. Finally, we prepared the environment for a remote test with 12 totally blind users. The results highlight how Google interfaces could be improved in order to simplify interaction for the blind.
    Keywords: accessibility, blind, search engine, usability, user interface design
    Verifying genre-based clustering approach to content extraction BIBAKFull-Text 875-876
      Suhit Gupta; Hila Becker; Gail Kaiser; Salvatore Stolfo
    The content of a webpage is usually contained within a small body of text and images, or perhaps several articles on the same page; however, the content may be lost in the clutter, particularly hurting users browsing on small cell phone and PDA screens and visually impaired users relying on speech rendering of web pages. Using the genre of a web page, we have created a solution, Crunch, that automatically identifies clutter and removes it, thus leaving a clean content-full page. In order to evaluate the improvement in the applications for this technology, we identified a number of experiments. In this paper, we present those experiments, the associated results and their evaluation.
    Keywords: HTML, accessibility, clustering, content extraction, context, reformatting, speech rendering, website classification
    A browser for browsing the past web BIBAKFull-Text 877-878
      Adam Jatowt; Yukiko Kawai; Satoshi Nakamura; Yutaka Kidawara; Katsumi Tanaka
    We describe a browser for the past web. It can retrieve data from multiple past web resources and features a passive browsing style based on change detection and presentation. The browser shows past pages one by one along a time line. The parts that were changed between consecutive page versions are animated to reflect their deletion or insertion, thereby drawing the user's attention to them. The browser enables automatic skipping of changeless periods and filtered browsing based on user specified query.
    Keywords: past web, web archive browsing, web archives
    Live URLs: breathing life into URLs BIBAKFull-Text 879-880
      Natarajan Kannan; Toufeeq Hussain
    This paper provides a novel approach to use URI fragment identifiers to enable HTTP clients to address and process content, independent of its original representation.
    Keywords: ACM proceedings, HTML, HTTP, URL, browsers, fragment identifier, web addressing, web content
    Structuring namespace descriptions BIBAFull-Text 881-882
      Erik Wilde
    Namespaces are a central building block of XML technologies today; they provide the identification mechanism for many XML-related vocabularies. Despite their ubiquity, there is no established mechanism for describing namespaces, and in particular for describing the dependencies of namespaces. We propose a simple model for describing namespaces and their dependencies. Using these descriptions, it is possible to compile directories of namespaces providing searchable and browsable namespace descriptions.
    CiteSeerx: an architecture and web service design for an academic document search engine BIBAKFull-Text 883-884
      Huajing Li; Isaac Councill; Wang-Chien Lee; C. Lee Giles
    CiteSeer is a scientific literature digital library and search engine which automatically crawls and indexes scientific documents in the field of computer and information science. After serving as a public search engine for nearly ten years, CiteSeer is starting to have scaling problems in handling more documents, new features and more users. Its monolithic architecture design prevents it from effectively making use of new web technologies and providing new services. After analyzing the current system problems, we propose a new architecture and data model, CiteSeerx, that will overcome the existing problems as well as provide scalability and better performance plus new services and system features.
    Keywords: data model, scalability, system architecture
    Tables and trees don't mix (very well) BIBAFull-Text 885-886
      Erik Wilde
    There are principal differences between the relational model and XML's tree model. This causes problems in all cases where information from these two worlds has to be brought together. Using a few rules for mapping the incompatible aspects of the two models, it becomes easier to process data in systems which need to work with relational and tree data. The most important requirement for a good mapping is that the conceptual model is available and can thus be used for making mapping decisions.
    Robust web content extraction BIBAKFull-Text 887-888
      Marek Kowalkiewicz; Maria E. Orlowska; Tomasz Kaczmarek; Witold Abramowicz
    We present an empirical evaluation and comparison of two content extraction methods in HTML: absolute XPath expressions and relative XPath expressions. We argue that the relative XPath expressions, although not widely used, should be used in preference to absolute XPath expressions in extracting content from human-created Web documents. Evaluation of robustness covers four thousand queries executed on several hundred webpages. We show that in referencing parts of real world dynamic HTML documents, relative XPath expressions are on average significantly more robust than absolute XPath ones.
    Keywords: content extraction, evaluation, robustness, wrappers
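The robustness difference can be illustrated with a small Python sketch. The two page versions, the expressions `ABSOLUTE` and `RELATIVE`, and the helper `extract` are hypothetical; in particular, reading "relative" as "anchored on a stable attribute rather than on position from the root" is an assumption of this sketch and may differ from the paper's exact definition.

```python
import xml.etree.ElementTree as ET

OLD_PAGE = """<html><body>
  <div id="nav">menu</div>
  <div id="content"><p>Headline</p></div>
</body></html>"""

# The same page after a banner was inserted above the navigation.
NEW_PAGE = """<html><body>
  <div id="banner">ad</div>
  <div id="nav">menu</div>
  <div id="content"><p>Headline</p></div>
</body></html>"""

ABSOLUTE = "./body/div[2]/p"          # position-based path from the root
RELATIVE = ".//div[@id='content']/p"  # anchored on a stable attribute


def extract(page, path):
    """Return the text of the first node matching path, or None if nothing matches."""
    node = ET.fromstring(page).find(path)
    return None if node is None else node.text
```

On `OLD_PAGE` both expressions find the headline. On `NEW_PAGE` the absolute path now points at the navigation block and finds no paragraph, while the anchored path still matches.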
    Rapid prototyping of web applications combining domain specific languages and model driven design BIBAKFull-Text 889-890
      Demetrius Arraes Nunes; Daniel Schwabe
    There have been several authoring methods proposed in the literature that are model based, essentially following the Model Driven Design philosophy. While useful, such methods need an effective way to allow the application designer to somehow synthesize the actual running application from the specification. In this paper, we describe HyperDE, an environment that combines Model Driven Design and Domain Specific Languages to enable rapid prototyping of Web applications.
    Keywords: hypermedia authoring, model-based designs
    A pruning-based approach for supporting Top-K join queries BIBAKFull-Text 891-892
      Jie Liu; Liang Feng; Yunpeng Xing
    An important issue arising from large scale data integration is how to efficiently select the top-K ranking answers from multiple sources while minimizing the transmission cost. This paper resolves this issue by proposing an efficient pruning-based approach to answer top-K join queries. The total amount of transmitted data can be greatly reduced by pruning tuples that can not produce the desired join results with a rank value greater than or equal to the rank value generated so far.
    Keywords: join query, prune, top-K
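The pruning idea can be sketched in Python under simplifying assumptions that are not taken from the paper: two sources of `(key, score)` tuples, a join on equal keys, a rank value equal to the sum of the two scores, and a known maximum score on the left side. The function name `top_k_join` and the `transmitted` counter are hypothetical.

```python
import heapq


def top_k_join(left, right, k):
    """Top-k join of (key, score) lists; rank of a joined pair = sum of scores.

    Returns the k best (rank, key) results and the number of right-hand
    tuples that had to be transmitted before pruning stopped the scan.
    """
    max_left = max(score for _, score in left)
    by_key = {}
    for key, score in left:
        by_key.setdefault(key, []).append(score)

    heap = []          # min-heap holding the k best results so far
    transmitted = 0
    for key, score in sorted(right, key=lambda t: -t[1]):
        # Upper bound on any join result this tuple could still produce.
        if len(heap) == k and score + max_left < heap[0][0]:
            break      # scores are descending, so later tuples cannot qualify
        transmitted += 1
        for left_score in by_key.get(key, []):
            item = (score + left_score, key)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True), transmitted
```

The bound `score + max_left` is the best rank a right-hand tuple could reach, so once it falls below the current k-th rank the remaining tuples need not be sent at all.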
    Towards DSL-based web engineering BIBAKFull-Text 893-894
      Martin Nussbaumer; Patrick Freudenstein; Martin Gaedke
    Strong user involvement and clear business objectives, both relying on efficient communication between the developers and the business, are key factors for a project's success. Domain-Specific Languages (DSLs), being simple, highly focused and tailored to a clear problem domain, are a promising alternative to heavy-weight modeling approaches in the field of Web Engineering. Thus, they enable stakeholders to validate, modify and even develop parts of a distributed Web-based solution.
    Keywords: DSL, conceptual modeling, web engineering, web services
    Capturing the essentials of federated systems BIBAKFull-Text 895-896
      Johannes Meinecke; Martin Gaedke; Frederic Majer; Alexander Brändle
    Today, the Web is increasingly used as a platform for distributed services, which transcend organizational boundaries to form federated applications. Consequently, there is a growing interest in the architectural aspect of Web-based systems, i.e. the composition of the overall solution into individual Web applications and Web services from different parties. The design and evolution of federated systems calls for models that give an overview of the structural as well as trust-specific composition and reflect the technical details of the various accesses. We introduce the WebComposition Architecture Model (WAM) as an overall modeling approach tailored to aspects of highly distributed systems with federation as an integral factor.
    Keywords: architecture, federation, modeling, security, web services
    From adaptation engineering to aspect-oriented context-dependency BIBAKFull-Text 897-898
      Sven Casteleyn; Zoltán Fiala; Geert-Jan Houben; Kees van der Sluijs
    The evolution of the Web requires considering an increasing number of context-dependency issues. Therefore, in our research we focus on how to extend a Web application with additional adaptation concerns without having to redesign the entire application. Based on a generic transcoding tool we illustrate here how we can add adaptation functionality to an existing Web application. Furthermore, we consider how an aspect-oriented approach can support the high-level specification of such additional concerns in the design of the Web application.
    Keywords: adaptation, aspect-oriented programming, component-based web engineering, web engineering
    Living the TV revolution: unite MHP to the web or face IDTV irrelevance! BIBAKFull-Text 899-900
      Stefano Ferretti; Marco Roccetti; Johannes Andrich
    The union of Interactive Digital TV (IDTV) and Web promotes the development of new interactive multimedia services, enjoyable while watching TV even on the new handheld digital TV receivers. Yet, several design constraints complicate the deployment of this new pattern of services. Indeed, for a suitable presentation on a TV set, Web contents must be structured in such a way that they can be effectively displayed on TV screens via low-end Set Top Boxes (STBs). Moreover, usable interfaces for IDTV platforms are needed which ensure smooth access to contents. Our claim is that the distribution of Web contents over the IDTV broadcast channels may bring IDTV to a new life. A failure of this attempt may put IDTV on a progressive track towards irrelevance. We propose a system for the distribution of Web contents towards IDTV under the Digital Video Broadcasting -- Multimedia Home Platform (DVB-MHP) standard. Our system is able to automatically transcode Web contents and ensure a proper visualization on IDTV. The system is endowed with a client application which permits easy browsing of contents on the TV via a remote control. Real assessments have confirmed the effectiveness of such an automatic online service able to reconfigure Web contents for an appropriate distribution and presentation on IDTV.
    Keywords: DVB, IDTV, MHP, web contents transcoding
    Using graph matching techniques to wrap data from PDF documents BIBAKFull-Text 901-902
      Tamir Hassan; Robert Baumgartner
    Wrapping is the process of navigating a data source, semi-automatically extracting data and transforming it into a form suitable for data processing applications. There are currently a number of established products on the market for wrapping data from web pages. One such approach is Lixto [1], a product of research performed at our institute.
       Our work is concerned with extending the wrapping functionality of Lixto to PDF documents. As the PDF format is relatively unstructured, this is a challenging task. We have developed a method to segment the page into blocks, which are represented as nodes in a relational graph. This paper describes our current research in the use of relational matching techniques on this graph to locate wrapping instances.
    Keywords: PDF, document understanding, graph matching, logical structure, wrapping
    Requirements for multimedia document enrichment BIBAKFull-Text 903-904
      Ajay Chakravarthy; Vitaveska Lanfranchi; Fabio Ciravegna
    Nowadays a large and growing percentage of information is stored in various multimedia formats. In order for multimedia information to be efficiently utilised by users, it is very important to add suitable metadata. In this paper we will present AKTiveMedia, a tool for enriching multimedia documents with semantic information.
    Keywords: multimedia enrichment, semantic annotation interfaces
    DiTaBBu: automating the production of time-based hypermedia content BIBAKFull-Text 905-906
      Rui Lopes; Luís Carriço; Carlos Duarte
    We present DiTaBBu, Digital Talking Books Builder, a framework for automatic production of time-based hypermedia for the Web, focusing on the Digital Talking Books domain. Delivering Digital Talking Books collections to a wide range of users is an expensive task, as it must take into account each user profile's different needs; therefore, authoring should be dismissed in favour of automation. With DiTaBBu, we enable automated content delivery in several playback platforms, targeted to specific user needs, featuring powerful navigation capabilities over the content. DiTaBBu can also be used as a testbed for prototyping novel capabilities, through its flexible extension mechanisms.
    Keywords: accessibility, automatic presentation generation, digital talking books, ditabbu, hypermedia, multimodality
    Capturing RIA concepts in a web modeling language BIBAKFull-Text 907-908
      Alessandro Bozzon; Sara Comai; Piero Fraternali; Giovanni Toffetti Carughi
    This work addresses conceptual modeling and automatic code generation for Rich Internet Applications, a variant of Web-based systems bridging the gap between desktop and Web interfaces. The approach we propose is a first step towards a full integration of RIA paradigms into the Web development process, enabling the specification of complex Web solutions mixing HTTP+HTML and Rich Internet Applications, using a single modeling language and tool.
    Keywords: rich internet applications, web engineering, web site design
    Generation of multimedia TV news contents for WWW BIBAKFull-Text 909-910
      Hsin Chia Fu; Yeong Y. Xu; C. L. Tseng
    In this paper, we present a system we have developed for automatic TV News video indexing that successfully combines results from the fields of speaker verification, acoustic analysis, very large vocabulary video OCR, content based sampling of video, information retrieval, dialogue systems, and ASF media delivery over IP. The prototype of TV news content processing Web was completed in July 2003. Since then, the system has been up and running continuously. Up to the date when this message was written (March 27, 2006), the system has recorded and analyzed the prime time evening news program in Taiwan every day, except for a few power failure shutdowns. The TV news web is at
    Keywords: TV news, content analysis, information retrieval, video OCR
    Proposal of integrated search engine of web and TV contents BIBAKFull-Text 911-912
      Hisashi Miyamori; Mitsuru Minakuchi; Zoran Stejic; Qiang Ma; Tadashi Araki; Katsumi Tanaka
    A search engine that can handle TV programs and Web content in an integrated way is proposed. Conventional search engines have been able to handle Web content and/or data stored in a PC desktop as target information. In the future, however, the target information is expected to be stored in various places such as in hard-disk (HD)/DVD recorders, digital cameras, mobile devices, and even in real space as ubiquitous content, and a search engine that can search across such heterogeneous resources will become essential. Therefore, as a first step towards developing such a next-generation search engine, a prototype search system for Web and TV programs is developed that performs integrated search of that content, and that allows chain search where related content can be accessed from each search result. The integrated search is achieved by generating integrated indices for Web and TV content based on the vector space model and by computing similarity between the query and all the content described by the indices. The chain search of related content is done by computing similarity between the selected result and all other content based on the integrated indices. Also, the zoom-based display of the search results enables control of media transition and level of detail of the contents to acquire information efficiently. In this paper, testing of a prototype of the integrated search engine validated the approach taken by the proposed method.
    Keywords: TV programs, chain search, information integration, information retrieval, integrated search, search engine, web content
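A minimal Python sketch of vector-space search over a mixed index follows. It assumes raw term-frequency vectors and cosine similarity, which the abstract does not specify; the helpers `tf_vector`, `cosine` and `integrated_search`, and the `(source, text)` item layout, are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector for a piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def integrated_search(query, items):
    """Rank (source, text) items from any medium against one query."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(text)), source, text) for source, text in items]
    return sorted(scored, reverse=True)
```

Because Web pages and TV program descriptions share one vector space, a chain search can reuse the same function by passing the text of a selected result as the query.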
    Using semantic rules to determine access control for web services BIBAKFull-Text 913-914
      Brian Shields; Owen Molloy; Gerard Lyons; Jim Duggan
    Semantic Web technologies are being increasingly employed to solve knowledge management issues in traditional Web technologies. This paper follows that trend and proposes using Semantic rule languages to construct access control rules for Web Services. Using these rules, a system will be able to manage access to Web Services and also the information accessed via these services.
    Keywords: OWL, SWRL, authorisation, web service security
    Strong authentication in web proxies BIBAKFull-Text 915-916
      Domenico Rotiroti
    In this paper we present a way to integrate web proxies with smart card based authentication systems.
    Keywords: HTTP, proxy, smart card
    Safeguard against unicode attacks: generation and applications of UC-simlist BIBAKFull-Text 917-918
      Anthony Y. Fu; Wan Zhang; Xiaotie Deng; Liu Wenyin
    A severe potential security problem in the use of Unicode on the Web is identified, which results from the fact that there are many similar characters in the Universal Character Set (UCS). The foundation of our solution relies on evaluating the similarity of characters in UCS. We develop a solution based on the renowned Kernel Density Estimation (KDE) method to establish such a Unicode Similarity List (UC-SimList).
    Keywords: phishing, secure web identity, unicode
    Efficient edge-services for colorblind users BIBKFull-Text 919-920
      Gennaro Iaccarino; Delfina Malandrino; Marco Del Percio; Vittorio Scarano
    Keywords: colorblindness, edge services, vision, web accessibility
    A user profile-based approach for personal information access: shaping your information portfolio BIBAKFull-Text 921-922
      Lo Ka Kan; Xiang Peng; Irwin King
    With the spread of the internet, internet-based information service business has started to become profitable. One of the key technologies is personalization. Successful internet information services must realize personalized information delivery, by which the users can automatically receive highly tuned information according to their personal needs and preferences. In order to realize such personalized information services, we have developed an automatic user preference capture and an automatic information clipping function based on a Personalized Information Access technique. In this paper, those techniques will be demonstrated by showing a deployed personalized webpage service application.
    Keywords: information retrieval, internet behavior, personal information access, system, user profile
    Finding visual concepts by web image mining BIBAKFull-Text 923-924
      Keiji Yanai; Kobus Barnard
    We propose measuring "visualness" of concepts with images on the Web, that is, to what extent concepts have visual characteristics. This is a new application of "Web image mining". Knowing which concepts have visually discriminative power is important for image recognition, since not all concepts are related to visual contents. Mining image data on the Web with our method enables it. Our method performs probabilistic region selection for images and computes an entropy measure which represents "visualness" of concepts. In the experiments, we collected about forty thousand images from the Web for 150 concepts. We examined which concepts are suitable for annotation of image contents.
    Keywords: image recognition, probabilistic method, web image mining
    Deriving wishlists from blogs: show us your blog, and we'll tell you what books to buy BIBAKFull-Text 925-926
      Gilad Mishne; Maarten de Rijke
    We use a combination of text analysis and external knowledge sources to estimate the commercial taste of bloggers from their text; our methods are evaluated using product wishlists found in the blogs. Initial results are promising, showing that valuable insights can be mined from blogs, not just at the aggregate but also at the individual blog level.
    Keywords: amazon, blogs, wishlists
    Relationship between web links and trade BIBAKFull-Text 927-928
      Ricardo Baeza-Yates; Carlos Castillo
    We report on observations on Web characterization studies that suggest that the amount of Web links among sites under different country-code top-level domains is related to the amount of trade between the corresponding countries.
    Keywords: national web domains, world trade graph
    System for spatio-temporal analysis of online news and blogs BIBAKFull-Text 929-930
      Angelo Dalli
    Previous work on spatio-temporal analysis of news items and other documents has largely focused on broad categorization of small text collections by region or country. A system for large-scale spatio-temporal analysis of online news media and blogs is presented, together with an analysis of global news media coverage over a nine year period. We demonstrate the benefits of using a hierarchical geospatial database to disambiguate between geographical named entities, and provide results for an extremely fine-grained analysis of news items. Aggregate maps of media attention for particular places around the world are compared with geographical and socio-economic data. Our analysis suggests that GDP per capita is the best indicator for media attention.
    Keywords: blogs, disambiguation of geographical named entities, geolocation, media attention, news, social behavior, spatio-temporal
    Extracting news-related queries from web query log BIBAKFull-Text 931-932
      Michael Maslov; Alexander Golovko; Ilya Segalovich; Pavel Braslavski
    In this poster, we present a method for extracting queries related to real-life events, or news-related queries, from large web query logs. The method employs query frequencies and search over a collection of recent news. News-related queries can be helpful for disambiguating user information needs, as well as for effective online news processing. The performed evaluation proves that the method yields good precision.
    Keywords: query log analysis, web search
    Visually guided bottom-up table detection and segmentation in web documents BIBAKFull-Text 933-934
      Bernhard Krüpl; Marcus Herzog
    In the AllRight project, we are developing an algorithm for unsupervised table detection and segmentation that uses the visual rendition of a Web page rather than the HTML code. Our algorithm works bottom-up by grouping word bounding boxes into larger groups and uses a set of heuristics. It has already been implemented and a preliminary evaluation on about 6000 Web documents has been carried out.
    Keywords: table detection, web information extraction
    Generating maps of web pages using cellular automata BIBAKFull-Text 935-936
      Hanene Azzag; David Ratsimba; David Da Costa; Gilles Venturini; Christiane Guinot
    The aim of web page visualization is to present a set of web documents to the user in a very informative and interactive way in order to let him or her navigate through these documents. In the web context, this may correspond to several user tasks: displaying the results of a search engine, or visualizing a graph of pages such as a hypertext or a surf map. In addition to web page visualization, web page clustering also greatly improves the amount of information presented to the user by highlighting the similarities between the documents [6]. In this paper we explore the use of a cellular automaton (CA) to generate such maps of web pages.
    Keywords: cellular automata, unsupervised clustering, visualization, web pages
    BuzzRank ... and the trend is your friend BIBAKFull-Text 937-938
      Klaus Berberich; Srikanta Bedathur; Michalis Vazirgiannis; Gerhard Weikum
    Ranking methods like PageRank assess the importance of Web pages based on the current state of the rapidly evolving Web graph. The dynamics of the resulting importance scores, however, have not been considered yet, although they provide the key to an understanding of the Zeitgeist on the Web. This paper proposes the BuzzRank method that quantifies trends in time series of importance scores and is based on a relevant growth model of importance scores. We experimentally demonstrate the usefulness of BuzzRank on a bibliographic dataset.
    Keywords: pagerank, web dynamics, web graph
    Detecting nepotistic links by language model disagreement BIBAKFull-Text 939-940
      András A. Benczúr; István Bíró; Károly Csalogány; Máté Uher
    In this short note we demonstrate the applicability of hyperlink downweighting by means of language model disagreement. The method filters out hyperlinks with no relevance to the target page without the need of white and blacklists or human interaction. We fight various forms of nepotism such as common maintainers, ads, link exchanges or misused affiliate programs. Our method is tested on a 31 M page crawl of the .de domain with a manually classified 1000-page random sample.
    Keywords: anchor text, language modeling, link Spam
    The distribution of PageRank follows a power-law only for particular values of the damping factor BIBAKFull-Text 941-942
      Luca Becchetti; Carlos Castillo
    We show that the empirical distribution of the PageRank values in a large set of Web pages does not follow a power-law except for some particular choices of the damping factor. We argue that for a graph with an in-degree distribution following a power-law with exponent between 2.1 and 2.2, choosing a damping factor around 0.85 for PageRank yields a power-law distribution of its values. We suggest that power-law distributions of PageRank in Web graphs have been observed because the typical damping factor used in practice is between 0.85 and 0.90.
    Keywords: pagerank distribution, web graph
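For readers unfamiliar with the damping factor, the following Python sketch shows where it enters a standard power-iteration PageRank. It is a generic textbook formulation, not the authors' experimental code; the function name `pagerank`, the dictionary graph layout, and the uniform treatment of dangling pages are assumptions of this sketch.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict mapping page -> list of link targets."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every page receives the teleportation share (1 - damping) / n.
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                share = damping * rank[v] / n
                for t in nodes:
                    new[t] += share
        rank = new
    return rank
```

Varying `damping` changes how much of each score comes from incoming links versus the uniform teleportation term, which is the parameter the abstract relates to the shape of the score distribution.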
    Mining related queries from search engine query logs BIBAKFull-Text 943-944
      Xiaodong Shi; Christopher C. Yang
    In this work we propose a method that retrieves a list of related queries given an initial input query. The related queries are based on the query log of previously issued queries by human users, which can be discovered using our improved association rule mining model. Users can use the suggested related queries to tune or redirect the search process. Our method not only discovers the related queries, but also ranks them according to the degree of their relatedness. Unlike many other rival techniques, it exploits only limited query log information and performs relatively better on queries in all frequency divisions.
    Keywords: association rule, edit distance, query log, related query, web searching
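As background, a plain association-rule baseline for related queries can be sketched in Python. This is not the authors' improved model: grouping the log into sessions, the `min_support` threshold, and the function name `related_queries` are assumptions made for illustration, and relatedness is measured simply as rule confidence.

```python
from collections import Counter
from itertools import permutations


def related_queries(sessions, query, min_support=2):
    """Rank queries that co-occur with `query`, by confidence of query -> other.

    sessions: iterable of collections of queries issued together (assumed grouping).
    """
    single = Counter()
    pair = Counter()
    for session in sessions:
        queries = set(session)
        single.update(queries)
        pair.update(permutations(queries, 2))

    out = []
    for (a, b), count in pair.items():
        if a == query and count >= min_support:
            out.append((b, count / single[a]))   # confidence of the rule a -> b
    return sorted(out, key=lambda t: (-t[1], t[0]))
```

The returned list is ordered by confidence, so it both discovers related queries and ranks them by a simple degree of relatedness.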
    Discovering event evolution graphs from newswires BIBAKFull-Text 945-946
      Christopher C. Yang; Xiaodong Shi
    In this paper, we propose an approach to automatically mine event evolution graphs from newswires on the Web. An event evolution graph is a directed graph in which the vertices and edges denote, respectively, the news events and the evolutions between events in a news affair. Our model utilizes the content similarity between events and incorporates temporal proximity and document distributional proximity as decaying functions. Our approach is effective in presenting the inside developments of news affairs along the timeline, which can facilitate users' information browsing tasks.
    Keywords: event evolution, event evolution graph, knowledge management, web content mining
    Mining clickthrough data for collaborative web search BIBAKFull-Text 947-948
      Jian-Tao Sun; Xuanhui Wang; Dou Shen; Hua-Jun Zeng; Zheng Chen
    This paper investigates the group behavior patterns of search activities based on Web search history data, i.e., clickthrough data, to boost search performance. We propose a Collaborative Web Search (CWS) framework based on the probabilistic modeling of the co-occurrence relationship among the heterogeneous web objects: users, queries, and Web pages. The CWS framework consists of two steps: (1) a cube-clustering approach is put forward to estimate the semantic cluster structures of the Web objects; (2) Web search activities are conducted by leveraging the probabilistic relations among the estimated cluster structures. Experiments on a real-world clickthrough data set validate the effectiveness of our CWS approach.
    Keywords: clickthrough data, collaborative web search, cube-clustering
    Background knowledge for ontology construction BIBAKFull-Text 949-950
      Blaž Fortuna; Marko Grobelnik; Dunja Mladenić
    In this paper we describe a solution for incorporating background knowledge into the OntoGen system for semi-automatic ontology construction. This makes it easier for different users to construct different and more personalized ontologies for the same domain. To achieve this we introduce a word weighting schema to be used in the document representation. The weighting schema is learned based on the background knowledge provided by the user. It is then used by OntoGen's machine learning and text mining algorithms.
    Keywords: background knowledge, semi-automatic ontology construction
    Mining RDF metadata for generalized association rules: knowledge discovery in the semantic web era BIBAKFull-Text 951-952
      Tao Jiang; Ah-Hwee Tan
    In this paper, we present a novel frequent generalized pattern mining algorithm, called GP-Close, for mining generalized associations from RDF metadata. To solve the over-generalization problem encountered by existing methods, GP-Close employs the notion of generalization closure for systematic over-generalization reduction.
    Keywords: RDF mining, association rule mining
    AutoTag: a collaborative approach to automated tag assignment for weblog posts BIBAKFull-Text 953-954
      Gilad Mishne
    This paper describes AutoTag, a tool which suggests tags for weblog posts using collaborative filtering methods. An evaluation of AutoTag on a large collection of posts shows good accuracy; coupled with the blogger's final quality control, AutoTag assists both in simplifying the tagging process and in improving its quality.
    Keywords: blogs, tags
    Merging trees: file system and content integration BIBAFull-Text 955-956
      Erik Wilde
    XML is the predominant format for representing structured information inside documents, but it stops at the level of files. This makes it hard to use XML-oriented tools to process information which is scattered over multiple documents within a file system. File System XML (FSX) and its content integration provide a unified view of file system structure and content. FSX's adaptors map file contents to XML, which means that any file format can be integrated with an XML view in the integrated view of the file system.
    A content and structure website mining model BIBAKFull-Text 957-958
      Barbara Poblete; Ricardo Baeza-Yates
    We present a novel model for validating and improving the content and structure organization of a website. This model studies the website as a graph and evaluates its interconnectivity in relation to the similarity of its documents. The aim of this model is to provide a simple way for improving the overall structure, contents and interconnectivity of a website. This model has been implemented as a prototype and applied to several websites, showing very interesting results. Our model is complementary to other methods of website personalization and improvement.
    Keywords: web mining, website improvement
    Online mining of frequent query trees over XML data streams BIBAKFull-Text 959-960
      Hua-Fu Li; Man-Kwan Shan; Suh-Yin Lee
    In this paper, we propose an online algorithm, called FQT-Stream (Frequent Query Trees of Streams), to mine the set of all frequent tree patterns over a continuous XML data stream. A new numbering method is proposed to represent the tree structure of an XML query tree. An effective sub-tree numeration approach is developed to extract the essential information from the XML data stream. The extracted information is stored in an effective summary data structure. Frequent query trees are mined from the current summary data structure in a depth-first-search manner.
    Keywords: XML, data streams, frequent query trees, online mining, web mining
    Using proportional transportation similarity with learned element semantics for XML document clustering BIBAKFull-Text 961-962
      Xiaojun Wan; Jianwu Yang
    This paper proposes a novel approach to measuring XML document similarity by taking into account the semantics between XML elements. The motivation of the proposed approach is to overcome the problems of "under-contribution" and "over-contribution" existing in previous work. The element semantics are learned in an unsupervised way and the Proportional Transportation Similarity is proposed to evaluate XML document similarity by modeling the similarity calculation as a transportation problem. Experiments of clustering are performed on three ACM SIGMOD data sets and results show the favorable performance of the proposed approach.
    Keywords: XML document clustering, proportional transportation similarity
    Template guided association rule mining from XML documents BIBAKFull-Text 963-964
      Rahman AliMohammadzadeh; Sadegh Soltan; Masoud Rahgozar
    Compared with traditional association rule mining in the structured world (e.g. Relational Databases), mining from XML data is confronted with more challenges due to the inherent flexibilities of XML in both structure and semantics. The major challenges include 1) a more complicated hierarchical data structure; 2) an ordered data context; and 3) a much bigger size for each data element. In order to make XML-enabled association rule mining truly practical and computationally tractable, we propose a practical model for mining association rules from XML documents and demonstrate the usability and effectiveness of the model through a set of experiments on real-life data.
    Keywords: XML, association rule mining, data mining
    Automatic geotagging of Russian web sites BIBAKFull-Text 965-966
      Alexei Pyalling; Michael Maslov; Pavel Braslavski
    The poster describes a fast, simple, yet accurate method to associate large amounts of web resources stored in a search engine database with geographic locations. The method uses location-by-IP data, domain names, and content-related features: ZIP and area codes. The novelty of the approach lies in building a location-by-IP database using a continuous IP blocks method. Another contribution is domain name analysis. The method uses search engine infrastructure and makes it possible to effectively associate large amounts of search engine data with geography on a regular basis. Experiments were run on the Yandex search engine index; the evaluation has proved the efficacy of the approach.
    Keywords: geographic information retrieval, geotagging
    Using symbolic objects to cluster web documents BIBAKFull-Text 967-968
      Esteban Meneses; Oldemar Rodríguez-Rojas
    Web Clustering is useful for several activities in the WWW, from automatically building web directories to improving retrieval performance. Nevertheless, due to the huge size of the web, a linear mechanism must be employed to cluster web documents. k-means is one classic algorithm used for this problem. We present a variant of the vector model to be used with the k-means algorithm. Our representation uses symbolic objects for clustering web documents. Some experiments were done with positive results and future work is optimistic.
    Keywords: symbolic data analysis, web clustering
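    For readers unfamiliar with the baseline, the following is a minimal sketch of standard k-means on plain numeric vectors. It does not implement the symbolic-object representation described in the abstract; the `kmeans` name, the tuple-based input, and the deterministic "first k points" initialization are assumptions chosen to keep the example reproducible.

```python
def kmeans(points, k, iterations=20):
    """Standard k-means on a list of equal-length numeric tuples.

    Centroids start as the first k points (an assumption made for
    reproducibility; random initialization is more common in practice).
    """
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


if __name__ == "__main__":
    docs = [(0, 0), (10, 10), (0, 1), (10, 11)]
    print(kmeans(docs, k=2))
```

    Each pass touches every point once per centroid, so the cost per iteration grows linearly with the number of documents, which is the property the abstract appeals to.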
    Estimating required recall for successful knowledge acquisition from the web BIBAKFull-Text 969-970
      Wolfgang Gatterbauer
    Information on the Web is not only abundant but also redundant. This redundancy of information has an important consequence on the relation between the recall of an information gathering system and its capacity to harvest the core information of a certain domain of knowledge. This paper provides a new idea for estimating the necessary Web coverage of a knowledge acquisition system in order to achieve a certain desired coverage of the contained core information.
    Keywords: information extraction, quantitative performance measures, recall, redundancy, web metrics
    Text-based video blogging BIBAKFull-Text 971-972
      Narichika Hamaguchi; Mamoru Doke; Masaki Hayashi; Nobuyuki Yagi
    A video blogging system has been developed for easily producing your own video programs that can be made available to the public in much the same way that blogs are created. The user merely types a program script on a webpage, the same as creating a blog, selects a direction style, and pastes in some additional material content to create a CG-based video program that can be openly distributed to the general public. The script, direction style, and material content are automatically combined to create a movie file on the server side. The movie file can then be accessed by referring to an RSS feed and viewed on the screens of various devices.
    Keywords: APE, TVML, blog, vlog, web-casting
    A decentralized CF approach based on cooperative agents BIBAKFull-Text 973-974
      Byeong Man Kim; Qing Li; Adele E. Howe
    In this paper, we propose a decentralized collaborative filtering (CF) approach based on P2P overlay network for the autonomous agents' environment. Experiments show that our approach is more scalable than traditional centralized CF filtering systems and alleviates the sparsity problem in distributed CF.
    Keywords: P2P system, distributed collaborative filtering, friend network
    Adaptive web sites: user studies and simulation BIBAKFull-Text 975-976
      Doug Warner; Stephen D. Durbin; J. Neal Richter; Zuzana Gedeon
    Adaptive web sites have been proposed to enhance ease of navigation and information retrieval. A variety of approaches are described in the literature, but consideration of interface presentation issues and realistic user studies are generally lacking. We report here a large-scale study of sites with dynamic information collections and user interests, where adaptation is based on an Ant Colony Optimization technique. We find that most users were able to locate information effectively without needing to perform explicit searches. The behavior of users who did search was similar to that on Internet search engines. Simulations based on site and user models give insight into the adaptive behavior and correspond to observations.
    Keywords: adaptive web site, ant colony optimization
    On a service-oriented approach for an engineering knowledge desktop BIBAKFull-Text 977-978
      Sylvia C. Wong; Richard M. Crowder; Gary B. Wills
    Increasingly, manufacturing companies are shifting their focus from selling products to providing services. As a result, when designing new products, engineers must increasingly consider the life cycle costs in addition to any design requirements. To identify possible areas of concern, designers are required to consult existing maintenance information from identical products. However, in a large engineering company, the amount of information available is significant and in a wide range of formats. This paper presents a prototype knowledge desktop suitable for the design engineer. The Engineering Knowledge Desktop analyses and suggests relevant information from ontologically marked-up heterogeneous web resources. It is designed using a Service-Oriented Architecture, with an ontology to mediate between Web Services. It has been delivered to the user community for evaluation.
    Keywords: semantic web, service-oriented architecture, web services
    Design and development of learning management system at Universiti Putra Malaysia: a case study of e-SPRINT BIBAKFull-Text 979-980
      Sidek H. A. Aziz; Aida Suraya; M. Yunus; Kamariah A. Bakar; Hamidah B. Meseran
    This paper reports the design and development of e-SPRINT, a Learning Management System whose name is derived from Sistem Pengurusan Rangkaian Integrasi Notakuliah dalam Talian -- mod Elektronik, and which is currently being implemented at Universiti Putra Malaysia (UPM). e-SPRINT was developed using PERL (Practical Extraction and Report Language) and is supported by a standard database in a Linux/UNIX operating system environment. The system is currently being used to supplement and complement part of the classroom-based teaching. This paper covers the architecture and features of the e-SPRINT system, which consists of five main modules. Some general issues and challenges in implementing such e-learning initiatives will also be discussed.
    Keywords: internet, learning management system
    Providing SCORM with adaptivity BIBAKFull-Text 981-982
      M. Rey-López; A. Fernández-Vilas; R. Díaz-Redondo; J. Pazos-Arias
    Content personalization is a very important aspect in the field of e-learning, although current standards do not fully support it. In this paper, we outline an extension to the ADL SCORM (Sharable Content Object Reference Model) standard in an effort to permit suitable adaptivity based on the user's characteristics. Applying this extension, we can create adaptable courses, which should be personalized before being shown to the student.
    Keywords: AH, SCORM, adaptivity, e-learning
    A framework for XML data streams history checking and monitoring BIBAKFull-Text 983-984
      Alessandro Campi; Paola Spoletini
    The need for formal verification is a problem that involves all the fields in which sensitive data are managed. In this context the verification of data streams has become a fundamental task. The purpose of this paper is to present a framework, based on the model checker SPIN, for the verification of data streams.
       The proposed method uses a linear temporal logic, called TRIO, to describe data constraints and properties. Constraints are automatically translated into Promela, the input language of the model checker SPIN, in order to verify them.
    Keywords: XML, semi-structured data, verification
    The credibility of the posted information in a recommendation system based on a map BIBAKFull-Text 985-986
      Koji Yamamoto; Daisuke Katagami; Katsumi Nitta; Akira Aiba; Hitoshi Kuwata
    We propose a method for estimating the credibility of the information posted by users. The system displays this information on the map. Since posted information can include subjective information from various perspectives, we can't trust all of the postings as they are. We propose and integrate factors of the user's geographic posting tendency and votes by other users.
    Keywords: GIS, credibility, navigation, posting, recommendation
    Archiving web site resources: a records management view BIBAKFull-Text 987-988
      Maureen Pennock; Brian Kelly
    In this paper, we propose the use of records management principles to identify and manage Web site resources with enduring value as records. Current Web archiving activities, collaborative or organisational, whilst extremely valuable in their own right, often do not and cannot incorporate requirements for proper records management. Material collected under such initiatives therefore may not be reliable or authentic from a legal or archival perspective, with insufficient metadata collected about the object during its active life, and valuable materials destroyed whilst ephemeral items are maintained. Education, training, and collaboration between stakeholders are integral to avoiding these risks and successfully preserving valuable Web-based materials.
    Keywords: archiving web sites, best practices, records management
    Geographic locations of web servers BIBAKFull-Text 989-990
      Katsuko T. Nakahira; Tetsuya Hoshino; Yoshiki Mikami
    The ccTLD (country code Top Level Domain) in a URL does not necessarily point to the geographic location of the server concerned. The authors have surveyed sample servers belonging to 60 ccTLDs in Africa, with regard to the number of hops required to reach the target site from Japan, the response time, and the NIC registration information of each domain. The survey has revealed the geographical distribution of server sites as well as their connection environments. It has been found that the percentage of offshore (out of home country) servers is as high as 80% and more than half of these are located in Europe. Offshore servers not only provide little benefit to the people of the country to which each ccTLD rightly belongs but their existence also heightens the risk of a country being unable to control them with its own policies and regulations. Offshore servers constitute a significant aspect of the digital divide problem.
    Keywords: Africa, NIC registration information, ccTLD, digital-divide, geographic location of servers, number of hops, offshore server, response time, traceroute
    Why is connectivity in developing regions expensive: policy challenges more than technical limitations? BIBAKFull-Text 991-992
      Rahul Tongia
    I present analysis examining some of the causes of poor connectivity in developing countries. Based on a techno-economic analysis and design, I show that technical limitations per se are not the bottleneck for widespread connectivity; rather, design, policy, and regulatory challenges dominate.
    Keywords: Africa, broadband, digital divide, internet and telecom access, open access, optical fibers, techno-economics, wireless
    Bilingual web page and site readability assessment BIBAKFull-Text 993-994
      Tak Pang Lau; Irwin King
    Readability assessment is a method to measure the difficulty of a piece of text material, and it is widely used in the educational field to help instructors prepare appropriate materials for students. In this paper, we investigate the applications of readability assessment in Web development, such that users can retrieve information which is appropriate to their levels. We propose a bilingual (English and Chinese) assessment scheme for Web page and Web site readability based on textual features, and conduct a series of experiments with real Web data to evaluate our scheme. Experimental results show that, apart from just indicating the readability level, the estimated score acts as a good heuristic to identify pages with low textual content. Furthermore, we can obtain the overall content distribution in a Web site by studying the variation of its readability.
    Keywords: Chinese, English, readability, web pages, web sites
    Mobile web publishing and surfing based on environmental sensing data BIBKFull-Text 995-996
      Daisuke Morikawa; Masaru Honjo; Satoshi Nishiyama; Masayoshi Ohashi
    Keywords: GPS, RFID, location, personalization, sensor, web browsing, web publishing
    DoNet: a semantic domotic framework BIBAKFull-Text 997-998
      Malcolm Attard; Matthew Montebello
    In the very near future complete households will be entirely networked as a de facto standard. In this poster we briefly describe our work in the area of domotics, where personalization, semantics and agent technology come together. We illustrate a home system oriented ontology and an intelligent agent based framework for the rapid development of home control and automation. The ever-changing nature of the home places the user in a position where he needs to be involved and become, through DoNet, a part of an ongoing home system optimization process.
    Keywords: agents, domotics, semantic web
    Web based device independent mobile map applications: the m-CHARTIS system BIBAKFull-Text 999-1000
      John Garofalakis; Theofanis-Aristofanis Michail; Athanasios Plessas
    A map is one of the most useful media in disseminating spatial information. As mobile devices are becoming increasingly powerful and ubiquitous, new possibilities to access map information are created. However, mobile devices still face severe constraints that limit the possibilities that a mobile map application may offer. We present the m-CHARTIS system, a device independent mobile map application that enables mobile users to access map information from their device.
    Keywords: handheld devices, mobile cartography, mobile devices, mobile map application
    Context-orientated news filtering for Web 2.0 and beyond BIBAKFull-Text 1001-1002
      David Webster; Weihong Huang; Darren Mundy; Paul Warren
    How can we solve the problem of information overload in news syndication? This poster outlines the path from keyword-based body text matching to distance-measurable taxonomic tag matching, on to context scale and practical uses.
    Keywords: RSS, aggregation, context, tags, Web 2.0, word senses
    Efficient search for peer-to-peer information retrieval using semantic small world BIBAKFull-Text 1003-1004
      Hai Jin; Xiaomin Ning; Hanhua Chen
    This paper proposes a semantic overlay based on the small world phenomenon that facilitates efficient search for information retrieval in unstructured P2P systems. In the semantic overlay, each node maintains a number of short-range links which are semantically similar to each other, together with a small collection of long-range links that help increase the recall rate of information retrieval and reduce network traffic as well. Experimental results show that our model can improve performance by 150% compared to Gnutella and by up to 60% compared to the Interest-based model -- a similar shortcut-based search technique.
    Keywords: information retrieval, peer-to-peer, semantic, small world
    Semantic link based top-K join queries in P2P networks BIBAKFull-Text 1005-1006
      Jie Liu; Liang Feng; Chao He
    An important issue arising from Peer-to-Peer applications is how to accurately and efficiently retrieve a set of K best matching data objects from different sources while minimizing the number of objects that have to be accessed. This paper resolves this issue by organizing peers in a Semantic Link Network Overlay, where semantic links are established to denote the semantic relationship between peers' data schemas. A query request will be routed to appropriate peers according to the semantic link type and a lower bound of rank function. Optimization strategies are proposed to reduce the total amount of data transmitted.
    Keywords: join query, peer-to-peer, semantic link, top-K
    Ontology-based legal information retrieval to improve the information access in e-government BIBAKFull-Text 1007-1008
      Asunción Gómez-Pérez; Fernando Ortiz-Rodriguez; Boris Villazón-Terrazas
    In this paper, we present EgoIR, an approach for retrieving legal information based on ontologies; this approach has been developed with Legal Ontologies to be deployed within the e-government context.
    Keywords: information retrieval, ontology
    Oyster: sharing and re-using ontologies in a peer-to-peer community BIBAKFull-Text 1009-1010
      Raul Palma; Peter Haase; Asunción Gómez-Pérez
    In this paper, we present Oyster, a Peer-to-Peer system for exchanging ontology metadata among communities in the Semantic Web. Oyster exploits semantic web techniques in data representation, query formulation and query result presentation to provide an online solution for sharing ontologies, thus assisting researchers in re-using existing ontologies.
    Keywords: metadata, ontology, peer-to-peer, repository
    GoGetIt!: a tool for generating structure-driven web crawlers BIBAKFull-Text 1011-1012
      Márcio L. A. Vidal; Altigran S. da Silva; Edleno S. de Moura; João M. B. Cavalcanti
    We present GoGetIt!, a tool for generating structure-driven crawlers that requires a minimum effort from the users. The tool takes as input a sample page and an entry point to a Web site and generates a structure-driven crawler based on navigation patterns, sequences of patterns for the links a crawler has to follow to reach the pages structurally similar to the sample page. In the experiments we have performed, structure-driven crawlers generated by GoGetIt! were able to collect all pages that match the samples given, including those pages added after their generation.
    Keywords: tree edit distance, web crawlers, web data extraction
    Towards practical genre classification of web documents BIBAKFull-Text 1013-1014
      George Ferizis; Peter Bailey
    Classification of documents by genre is typically done either using linguistic analysis or term frequency based techniques. The former provides better classification accuracy than the latter but at the cost of two orders of magnitude more computation time. While term frequency analysis requires much less computational resources than linguistic analysis, it returns poor classification accuracy when the genres are not sufficiently distinct. A method that removes or approximates the expensive portions of linguistic analysis is presented.
       The accuracy and computation time of this method are then compared with both linguistic analysis and term frequency analysis. The results in this paper show that this method can significantly reduce the computation time of both linguistic analysis and term frequency analysis, while retaining an accuracy that is higher than that of term frequency analysis.
    Keywords: genre classification, linguistic, term frequency
    Do not crawl in the DUST: different URLs with similar text BIBAKFull-Text 1015-1016
      Uri Schonfeld; Ziv Bar-Yossef; Idit Keidar
    We consider the problem of dust: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, translates URLs to some canonical form, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering dust; that is, for discovering rules for transforming a given URL to others that are likely to have similar content. DustBuster is able to detect dust effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching few actual web pages. Search engines can benefit from this information to increase the effectiveness of crawling, reduce indexing overhead as well as improve the quality of popularity statistics such as PageRank.
    Keywords: duplicates, mining, rules, similarity
    Community discovery and analysis in blogspace BIBAKFull-Text 1017-1018
      Ying Zhou; Joseph Davis
    The weblog has quickly evolved into a new information and knowledge dissemination channel. Yet it is not easy to discover weblog communities through keyword search. The main contribution of this paper is the study of weblog communities from the perspective of social network analysis. We propose a new way of collecting and preparing data for weblog community discovery. The data collection stage focuses on gaining knowledge of the strength of social ties between weblogs. The strength of social ties and the clustering feature of social networks guide the discovery of weblog communities.
    Keywords: community, social network, social tie, weblog
    PageSim: a novel link-based measure of web page similarity BIBAKFull-Text 1019-1020
      Zhenjiang Lin; Michael R. Lyu; Irwin King
    To find web pages similar to a query page on the Web, this paper introduces a novel link-based similarity measure, called PageSim. In contrast to SimRank, a recursive refinement of cocitation, PageSim can measure similarity between any two web pages, whereas SimRank cannot in some cases. We give some intuitions behind the PageSim model, and outline the model with mathematical definitions. Finally, we give an example to illustrate its effectiveness.
    Keywords: link analysis, pagerank, search engine, similarity measure, simrank
    Finding specification pages according to attributes BIBAKFull-Text 1021-1022
      Naoki Yoshinaga; Kentaro Torisawa
    This paper presents a method for finding a specification page on the web for a given object (e.g. "Titanic") and its class label (e.g. "film"). A specification page for an object is a web page which gives concise attribute-value information about the object (e.g. "director"-"James Cameron" for "Titanic"). A simple unsupervised method using layout and symbolic decoration cues was applied to a large number of web pages to acquire the class attributes. We used these acquired attributes to select a representative specification page for a given object from the web pages retrieved by a normal search engine. Experimental results revealed that our method greatly outperformed the normal search engine in terms of specification retrieval.
    Keywords: attribute acquisition, specification finding, web search
    Selective hypertext induced topic search BIBAKFull-Text 1023-1024
      Amit C. Awekar; Pabitra Mitra; Jaewoo Kang
    We address the problem of answering broad-topic queries on the World Wide Web. We present a link based analysis algorithm SelHITS, which is an improvement over Kleinberg's HITS [2] algorithm. We introduce the concept of virtual links to exploit the latent information in the hyperlinked environment. We propose a novel approach to calculate hub and authority values. We also present a selective expansion method which avoids topic drift and provides results consistent with only one interpretation of the query, even if the query is ambiguous. Initial experimental evaluation and user feedback show that our algorithm indeed distills the most important and relevant pages for broad-topic queries. We also infer that there exists a uniform notion of quality of search results within users.
    Keywords: link analysis, searching, topic distillation
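    For context, the sketch below shows the baseline hub and authority iteration of Kleinberg's HITS, which SelHITS is said to improve upon. It does not include virtual links or selective expansion, whose details the abstract does not give; the `hits` name, the adjacency-dict input, and the fixed iteration count are assumptions for this example.

```python
def hits(links, iterations=50):
    """Baseline HITS on a dict mapping each page to the pages it links to."""
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking to each page.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub update: sum the authority scores of the pages each page links to.
        hub = {p: sum(auth[dst] for dst in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


if __name__ == "__main__":
    graph = {"a": ["c"], "b": ["c"], "c": []}
    hub_scores, auth_scores = hits(graph)
    print(max(auth_scores, key=auth_scores.get))
```

    Scores are normalized to unit Euclidean length after each update so that they stay bounded across iterations.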
    An audio/video analysis mechanism for web indexing BIBAKFull-Text 1025-1026
      Marco Furini; Marco Aragone
    The high availability of video streams is making mechanisms for indexing such content in the Web world necessary. In this paper we focus on news programs and we propose a mechanism that integrates low and high level video features to provide a high level semantic description. A color/luminance analysis is coupled with audio analysis to provide a better identification of all the video segments that compose the video stream. Each video segment is subject to speech detection and is described through MPEG7 so that the resulting metadata description can be used to index the video stream. An experimental evaluation shows the benefits of integrating audio and video analysis.
    Keywords: MPEG7-DDL, automatic speech recognition, contents indexing, shot boundary detection, video indexing
    The SOWES approach to P2P web search using semantic overlays BIBAKFull-Text 1027-1028
      Christos Doulkeridis; Kjetil Nørvåg; Michalis Vazirgiannis
    Peer-to-peer (P2P) Web search has gained a lot of interest lately, due to the salient characteristics of P2P systems, namely scalability, fault-tolerance and load-balancing. However, the lack of global knowledge in a vast and dynamically evolving environment like the Web presents a grand challenge for organizing content and providing efficient searching. Semantic overlay networks (SONs) have been proposed as an approach to reduce cost and increase quality of results, and in this paper we present an unsupervised approach for distributed and decentralized SON construction, aiming to support efficient search mechanisms in unstructured P2P systems.
    Keywords: distributed and peer-to-peer search, semantic overlay networks
    Topic-oriented query expansion for web search BIBAKFull-Text 1029-1030
      Shao-Chi Wang; Yuzuru Tanaka
    The contribution of this paper is threefold: (1) to introduce a topic-oriented query expansion model based on the Information Bottleneck theory that classifies terms into distinct topical clusters in order to find candidate terms for query expansion; (2) to define a term-term similarity matrix that helps alleviate the term ambiguity problem; (3) to propose two measures, intracluster and intercluster similarities, that are based on proximity between the topics represented by two clusters in order to evaluate the retrieval effectiveness. Results of several evaluation experiments in Web search show that the average intracluster similarity improved by 79.1% while the average intercluster similarity decreased by 36.0%.
    Keywords: information bottleneck, intercluster similarity, intracluster similarity, query expansion, term-term similarity matrix, topic-oriented
    Predictive modeling of first-click behavior in web-search BIBAKFull-Text 1031-1032
      Maeve O'Brien; Mark T. Keane; Barry Smyth
    Search engine results are usually presented in some form of text summary (e.g., document title, some snippets of the page's content, a URL). Based on the information contained within these summaries, users make relevance judgments about which links best suit their information needs. Current research suggests that these relevance judgments are in the service of some search strategy. In this paper, we model two different search strategies (the comparison and threshold strategies) and determine how well they fit data gathered from an experiment on user search within a simulated Google environment.
    Keywords: empirical tests, information navigation, information scent, link analysis, predictive user modeling, search behavior, web evolution
    Proximity within paragraph: a measure to enhance document retrieval performance BIBAKFull-Text 1033-1034
      Srisupa Palakvangsa-Na-Ayudhya; John A. Keane
    We created a proximity measure, called Proximity Within Paragraph (PWP), which is based on the concept of value assignment to queried words, grouped by associated ideas within paragraphs. Based on the WT10G dataset, a test system comprising three test sets and fifty queries was constructed to evaluate the effectiveness of PWP by comparing it with the existing method, Minimum Distance Between Queried Pairs. A further experiment combines the scores obtained from both methods, and the results suggest that the combination can significantly improve effectiveness.
    Keywords: proximity measure, ranking algorithm
    Finding experts and their details in e-mail corpora BIBAKFull-Text 1035-1036
      Krisztian Balog; Maarten de Rijke
    We present methods for finding experts (and their contact details) using e-mail messages. We locate messages on a topic, and then find the associated experts. Our approach is unsupervised: both the list of potential experts and their personal details are obtained automatically from e-mail message headers and signatures, respectively. Evaluation is done using the e-mail lists in the W3C corpus.
    Keywords: e-mail processing, expert finding, expert search
    Efficient query subscription processing for prospective search engines BIBAKFull-Text 1037-1038
      Utku Irmak; Svilen Mihaylov; Torsten Suel; Samrat Ganguly; Rauf Izmailov
    Current web search engines are retrospective in that they limit users to searches against already existing pages. Prospective search engines, on the other hand, allow users to upload queries that will be applied to newly discovered pages in the future. We study and compare algorithms for efficiently matching large numbers of simple keyword queries against a stream of newly discovered pages.
    Keywords: inverted index, prospective search, query processing
    Mining search engine query logs for query recommendation BIBAKFull-Text 1039-1040
      Zhiyong Zhang; Olfa Nasraoui
    This paper presents a simple and intuitive method for mining search engine query logs to get fast query recommendations on a large-scale, industrial-strength search engine. In order to get a more comprehensive solution, we combine two methods. On the one hand, we study and model search engine users' sequential search behavior, and interpret this consecutive search behavior as client-side query refinement that should form the basis for the search engine's own query refinement process. On the other hand, we combine this method with a traditional content-based similarity method to compensate for the high sparsity of real query log data and, more specifically, the shortness of most query sessions. To evaluate our method, we use one hundred days' worth of query logs from SINA's search engine to do off-line mining. Then we analyze three independent editors' evaluations on a query test set. Based on their judgement, our method was found to be effective for finding related queries, despite its simplicity. In addition to the subjective editors' ratings, we also perform tests based on actual anonymous user search sessions.
    Keywords: mining, query logs, recommendation, session
    Effective web-scale crawling through website analysis BIBAKFull-Text 1041-1042
      Iván González; Adam Marcus; Daniel N. Meredith; Linda A. Nguyen
    The web crawler space is often delimited into two general areas: full-web crawling and focused crawling. We present netSifter, a crawler system which integrates features from these two areas to provide an effective mechanism for web-scale crawling. netSifter utilizes a combination of page-level analytics and heuristics which are applied to a sample of web pages from a given website. These algorithms score individual web pages to determine the general utility of the overall website. In doing so, netSifter can formulate an in-depth opinion of a website (and the entirety of its web pages) with a relative minimum of work. netSifter is then able to bias the future efforts of its crawl towards higher quality websites, and away from the myriad of low quality websites and crawler traps that litter the World Wide Web.
    Keywords: UIMA, crawling, netsifter, sampling, webfountain
    Focused crawling: experiences in a real world project BIBKFull-Text 1043-1044
      Antonio Badia; Tulay Muezzinoglu; Olfa Nasraoui
    Keywords: crawling, information retrieval, thesaurus, topic
    Image annotation using search and mining technologies BIBAKFull-Text 1045-1046
      Xin-Jing Wang; Lei Zhang; Feng Jing; Wei-Ying Ma
    In this paper, we present a novel solution to the image annotation problem which annotates images using search and data mining technologies. An accurate keyword is required to initialize this process; then, leveraging a large-scale image database, it 1) searches for semantically and visually similar images and 2) mines annotations from them. A notable advantage of this approach is that it enables an unlimited vocabulary, which is not possible for all existing approaches. Experimental results on real web images show the effectiveness and efficiency of the proposed algorithm.
    Keywords: hash indexing, image annotation, search result clustering
    Semantic web integration of cultural heritage sources BIBAKFull-Text 1047-1048
      P. Sinclair; P. Lewis; K. Martinez; M. Addis; D. Prideaux
    In this paper, we describe research into the use of ontologies to integrate access to cultural heritage and photographic archives. The use of the CIDOC CRM and CRM Core ontologies is described together with the metadata mapping methodology. A system integrating data from four content providers will be demonstrated.
    Keywords: interoperability, multimedia, ontologies, semantic web
    The ODESeW 2.0 semantic web application framework BIBAKFull-Text 1049-1050
      Oscar Corcho; Angel López-Cima; Asunción Gómez-Pérez
    We describe the architecture of the ODESeW 2.0 Semantic Web application development platform, which has been used to generate the internal and external Web sites of several R&D projects.
    Keywords: framework, semantic web, web application
    Visualizing an historical semantic web with Heml BIBAKFull-Text 1051-1052
      Bruce G. Robertson
    This poster presents ongoing efforts to enrich the RDF-based semantic Web with the tools of the Historical Event Markup and Linking Project (Heml). An experimental RDF vocabulary for Heml data is illustrated, as well as its use in storing and querying encoded historical events. Finally, the practical use of Heml-RDF is illustrated with a toolkit for the Piggy Bank semantic browser plugin.
    Keywords: ACM proceedings, Heml, RDF, chronology, history
    Beyond XML and RDF: the versatile web query language xcerpt BIBAKFull-Text 1053-1054
      Benedikt Linse; Andreas Schroeder
    Applications and services that access Web data are becoming increasingly useful and widespread. Current mainstream Web query languages such as XQuery, XSLT, or SPARQL, however, focus only on one of the different data formats available on the Web. In contrast, Xcerpt is a versatile semi-structured query language, i.e., a query language able to access all kinds of Web data such as XML and RDF in the same language, reusing common concepts and language constructs. To integrate heterogeneous data and as a foundation for Semantic Web reasoning, Xcerpt also provides rules. Xcerpt has a visual companion language, visXcerpt, that is conceived as a mere rendering of the (textual) query language Xcerpt using a slightly extended CSS. Both languages are demonstrated along a realistic use case integrating XML and RDF data, highlighting interesting and unique features. Novel language constructs and optimization techniques are currently under investigation in the Xcerpt project (cf. http://xcerpt.org/).
    Keywords: RDF, XML, query languages, versatility, web, xcerpt
    An ontology for internal and external business processes BIBAKFull-Text 1055-1056
      Armin Haller; Eyal Oren; Paavo Kotinurmi
    In this paper we introduce our multi metamodel process ontology (m3po), which is based on various existing reference models and languages from the workflow and choreography domain. This ontology allows the extraction of arbitrary choreography interface descriptions from arbitrary internal workflow models. We also report on an initial validation: we translate an IBM Websphere MQ Workflow model into the m3po ontology and then extract an Abstract BPEL model from the ontology.
    Keywords: choreography, meta model integration, ontology, workflow modelling
    Automatic matchmaking of web services BIBKFull-Text 1057-1058
      Sudhir Agarwal; Anupriya Ankolekar
    Keywords: matchmaking, semantic web services
    Adding semantics to rosettaNet specifications BIBAKFull-Text 1059-1060
      Paavo Kotinurmi; Tomas Vitvar
    The use of Semantic Web Service (SWS) technologies has been suggested to enable more dynamic B2B integration of heterogeneous systems and partners. We present how we add semantics to RosettaNet specifications to enable the WSMX SWS environment to automate mediation of messages. The benefits of applying SWS technologies include flexibility in accepting heterogeneity in B2B integrations.
    Keywords: B2B integration, XML, ontologysing, rosettaNet
    HTML2RSS: automatic generation of RSS feed based on structure analysis of HTML document BIBAKFull-Text 1061-1062
      Tomoyuki Nanno; Manabu Okumura
    We present a system to automatically generate RSS feeds from HTML documents that consist of time-series items with date expressions, e.g., archives of weblogs, BBSs, chats, mailing lists, site update descriptions, and event announcements. Our system extracts date expressions, performs structure analysis of an HTML document, and detects or generates titles from the document.
    Keywords: RSS, atom, document analysis, feed, syndication
    Logical structure based semantic relationship extraction from semi-structured documents BIBAKFull-Text 1063-1064
      Zhang Kuo; Wu Gang; Li JuanZi
    This paper addresses the issue of semantic relationship extraction from semi-structured documents. Many research efforts have been made so far on semantic information extraction. However, much of the previous work focuses on detecting 'isolated' semantic information by making use of linguistic analysis or linkage information in web pages, and limited research has been done on extracting semantic relationships from semi-structured documents. In this paper, we propose a method for semantic relationship extraction that uses the logical information in the semi-structured document (a semi-structured document usually has various types of structure information, e.g., it may be hierarchically laid out). To the best of our knowledge, extracting semantic relationships by using logical information has not been investigated previously. A probabilistic approach is proposed in the paper, and the features used in the probabilistic model are defined.
    Keywords: logical structure, ontology, relationship extraction, semi-structured document
    OWL FA: a metamodeling extension of OWL DL BIBAKFull-Text 1065-1066
      Jeff Z. Pan; Ian Horrocks
    This paper proposes OWL FA, a decidable extension of OWL DL with the metamodeling architecture of RDFS(FA). It shows that the knowledge base satisfiability problem of OWL FA can be reduced to that of OWL DL, and compares the FA semantics with the recently proposed contextual semantics and Hilog semantics for OWL.
    Keywords: metamodeling, ontology, reasoning
    Learning and inferencing in user ontology for personalized semantic web services BIBAKFull-Text 1067-1068
      Xing Jiang; Ah-Hwee Tan
    Domain ontology has been used in many Semantic Web applications. However, few applications explore the use of ontology for personalized services. This paper proposes an ontology based user model consisting of both concepts and semantic relations to represent users' interests. Specifically, we adopt a statistical approach to learning a semantic-based user ontology model from domain ontology and a spreading activation procedure for inferencing in the user ontology model. We apply the methods of learning and exploiting user ontology to a semantic search engine for finding academic publications. Our experimental results support the efficacy of user ontology and spreading activation theory (SAT) for providing personalized semantic services.
    Keywords: spreading-activation theory, user ontology
    Upgrading relational legacy data to the semantic web BIBAKFull-Text 1069-1070
      Jesús Barrasa Rodriguez; Asunción Gómez-Pérez
    In this poster, we describe a framework composed of the R2O mapping language and the ODEMapster processor to upgrade relational legacy data to the Semantic Web. The framework is based on the declarative description of mappings between relational and ontology elements and the exploitation of such mapping descriptions by a generic processor capable of performing both massive and query driven data upgrade.
    Keywords: database-to-ontology mappings, relational databases, semantic web, upgrade
    How semantics make better wikis BIBAKFull-Text 1071-1072
      Eyal Oren; John G. Breslin; Stefan Decker
    Wikis are popular collaborative hypertext authoring environments, but they support neither structured access nor information reuse. Adding semantic annotations helps to address these limitations. We present an architecture for Semantic Wikis and discuss design decisions including structured access, views, and annotation language. We present our prototype SemperWiki that implements this architecture.
    Keywords: information access, semantic annotation, semantic web, semantic wikis, wikis
    Integrating ecoinformatics resources on the semantic web BIBAKFull-Text 1073-1074
      Cynthia Sims Parr; Andriy Parafiynyk; Joel Sachs; Li Ding; Sandor Dornbush; Tim Finin; David Wang; Allan Hollander
    We describe ELVIS (the Ecosystem Location Visualization and Information System), a suite of tools for constructing food webs for a given location. We express both ELVIS input and output data in OWL, thereby enabling its integration with other semantic web resources. In particular, we describe using a Triple Shop application to answer SPARQL queries from a collection of semantic web documents. This is an end-to-end case study of the semantic web's utility for ecological and environmental research.
    Keywords: biodiversity, ecological forecasting, food webs, invasive species, ontologies, semantic web, service oriented design
    Path summaries and path partitioning in modern XML databases BIBKFull-Text 1077-1078
      Andrei Arion; Angela Bonifati; Ioana Manolescu; Andrea Pugliese
    Keywords: XML, XQuery processing, path partition, path summaries
    Evaluating structural summaries as access methods for XML BIBAKFull-Text 1079-1080
      Mirella M. Moro; Zografoula Vagena; Vassilis J. Tsotras
    Structural summaries are data structures that preserve all structural features of XML documents in a compact form. We investigate the applicability of the most popular summaries as access methods within XML query processing. In this context, issues like space and false positives introduced by the summaries need to be examined. Our evaluation reveals that the additional space required by the more precise structures is usually small and justified by the considerable performance gains that they achieve.
    Keywords: precision, query processing, structural summaries
    FLUX: fuzzy content and structure matching of XML range queries BIBAKFull-Text 1081-1082
      Hua-Gang Li; S. Alireza Aghili; Divyakant Agrawal; Amr El Abbadi
    An XML range query may impose predicates on the numerical or textual contents of the elements and/or their respective path structures. In order to handle content and structure range queries efficiently, an XML query processing engine needs to incorporate effective indexing and summarization techniques to partition the XML document and locate the results. In this paper, we propose a dynamic summarization and indexing method, FLUX, based on Bloom filters and B+-trees to tackle these problems. The results of our extensive experimental evaluations indicate the efficiency of the proposed system.
    Keywords: XML database, range query, xpath