| The new economy: an engineer's perspective | | BIBA | Full-Text | 1 | |
| David Brown | |||
| From his twin perspectives as a career-long telecommunications engineer and Chairman of one of the UK's largest electronics companies, Sir David Brown will reflect on whether and when the New Economy, seemingly so long coming, will finally arrive. He will begin by exploring how the prospect of everything being digital; everyone having broadband; and intelligence being everywhere is changing our understanding of mobility. Then he will comment on the economic effects of that changed understanding under three headings -- the macroeconomy, microeconomy and socioeconomy -- before suggesting the criteria we might use to decide when the New Economy has arrived. | |||
| Position paper: a comparison of two modelling paradigms in the Semantic Web | | BIBAK | Full-Text | 3-12 | |
| Peter F. Patel-Schneider; Ian Horrocks | |||
| Classical logics and Datalog-related logics have both been proposed as
underlying formalisms for the Semantic Web. Although these two different
formalism groups have some commonalities, and look similar in the context of
expressively-impoverished languages like RDF, their differences become apparent
at more expressive language levels. After considering some of these
differences, we argue that, although some of the characteristics of Datalog
have their utility, the open environment of the Semantic Web is better served
by standard logics. Keywords: Semantic Web, modelling, philosophical foundations, representation | |||
| Web ontology segmentation: analysis, classification and use | | BIBAK | Full-Text | 13-22 | |
| Julian Seidenberg; Alan Rector | |||
| Ontologies are at the heart of the semantic web. They define the concepts
and relationships that make global interoperability possible. However, as these
ontologies grow in size they become more and more difficult to create, use,
understand, maintain, transform and classify. We present and evaluate several
algorithms for extracting relevant segments out of large description logic
ontologies for the purposes of increasing tractability for both humans and
computers. The segments are not mere fragments, but stand alone as ontologies
in their own right. This technique takes advantage of the detailed semantics
captured within an OWL ontology to produce highly relevant segments. The
research was evaluated using the GALEN ontology of medical terms and
procedures. Keywords: OWL, Semantic Web, ontology, scalability, segmentation | |||
| Constructing virtual documents for ontology matching | | BIBAK | Full-Text | 23-31 | |
| Yuzhong Qu; Wei Hu; Gong Cheng | |||
| Building on an investigation of the linguistic techniques used in ontology
matching, this paper proposes the idea of virtual documents as a cost-effective
approach to linguistic matching. Basically, as a collection of weighted
words, the virtual document of a URIref declared in an ontology contains not
only the local descriptions but also the neighboring information to reflect the
intended meaning of the URIref. Document similarity can be computed by
traditional vector space techniques, and then be used in the similarity-based
approaches to ontology matching. In particular, the RDF graph structure is
exploited to define the description formulations and the neighboring
operations. Experimental results show that linguistic matching based on the
virtual documents outperforms three other approaches in average F-Measure. Our
experiments also demonstrate that the virtual-document approach is
cost-effective compared with other linguistic matching approaches. Keywords: description formulation, linguistic matching, neighboring operation,
ontology matching, vector space model | |||
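As an illustration of the idea summarized above, here is a minimal Python sketch of building a "virtual document" as a weighted bag of words (local description plus down-weighted neighbor text) and comparing two of them with vector-space cosine similarity. The weighting value, the toy data, and the function names are assumptions for illustration; the paper defines the description formulations and neighboring operations over RDF graphs in much more detail.

```python
import math
from collections import Counter

def virtual_document(local_text, neighbors_text, neighbor_weight=0.5):
    """Weighted bag of words for one URIref: words from the local description
    get full weight, words from neighboring nodes get a reduced weight.
    (Illustrative only; the paper's formulations are defined over RDF graphs.)"""
    doc = Counter()
    for word in local_text.split():
        doc[word.lower()] += 1.0
    for text in neighbors_text:
        for word in text.split():
            doc[word.lower()] += neighbor_weight
    return doc

def cosine(doc_a, doc_b):
    """Standard vector-space cosine similarity over weighted word bags."""
    common = set(doc_a) & set(doc_b)
    dot = sum(doc_a[w] * doc_b[w] for w in common)
    norm_a = math.sqrt(sum(v * v for v in doc_a.values()))
    norm_b = math.sqrt(sum(v * v for v in doc_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy example: two URIrefs from different ontologies.
d1 = virtual_document("car automobile vehicle", ["wheel", "engine motor"])
d2 = virtual_document("auto automobile", ["motor", "wheel rim"])
print(cosine(d1, d2))
```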
| Browsing on small screens: recasting web-page segmentation into an efficient machine learning framework | | BIBAK | Full-Text | 33-42 | |
| Shumeet Baluja | |||
| Fitting enough information from webpages to make browsing on small screens
compelling is a challenging task. One approach is to present the user with a
thumbnail image of the full web page and allow the user to simply press a
single key to zoom into a region (which may then be transcoded into wml/xhtml,
summarized, etc). However, if regions for zooming are presented naively, this
yields a frustrating experience because of the number of coherent regions,
sentences, images, and words that may be inadvertently separated. Here, we cast
the web page segmentation problem into a machine learning framework, where we
re-examine this task through the lens of entropy reduction and decision tree
learning. This yields an efficient and effective page segmentation algorithm.
We demonstrate how simple techniques from computer vision can be used to
fine-tune the results. The resulting segmentation keeps coherent regions
together when tested on a broad set of complex webpages. Keywords: browser, machine learning, mobile browsing, mobile devices, small screen,
thumbnail browsing, web page segmentation | |||
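To make the entropy-reduction framing concrete, the following hedged sketch scores candidate horizontal cuts of a page's ordered regions by information gain over a coarse content-type label per region. The labels, the toy page, and the greedy single-cut choice are assumptions; the paper's features and learned decision trees are considerably richer.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label multiset."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values()) if total else 0.0

def split_gain(labels, cut):
    """Information gain of cutting the ordered region list at index `cut`."""
    left, right = labels[:cut], labels[cut:]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

def best_cut(labels):
    """Greedy, decision-tree-style choice of the single best cut position."""
    return max(range(1, len(labels)), key=lambda c: split_gain(labels, c))

# Toy page: regions labelled by coarse content type, top to bottom.
regions = ["nav", "nav", "text", "text", "text", "image", "text", "footer"]
cut = best_cut(regions)
print(cut, split_gain(regions, cut))
```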
| Image classification for mobile web browsing | | BIBAK | Full-Text | 43-52 | |
| Takuya Maekawa; Takahiro Hara; Shojiro Nishio | |||
| It is difficult for users of mobile devices such as cellular phones equipped
with a small screen and a poor input interface to browse Web pages designed for
desktop PCs with large displays. Many studies and commercial products have
tried to solve this problem. Web pages include images that have various roles
such as site menus, line headers for itemization, and page titles. However,
most studies of mobile Web browsing haven't paid much attention to the roles of
Web images. In this paper, we define eleven Web image categories according to
their roles and use these categories for proper Web image handling. We manually
categorized 3,901 Web images collected from forty Web sites and extracted image
features of each category according to the classification. By making use of the
extracted features, we devised an automatic Web image classification method.
Furthermore, we evaluated the automatic classification of real Web pages and
achieved up to 83.1% classification accuracy. We also implemented an automatic
Web page scrolling system as an application of our automatic image
classification method. Keywords: mobile computing, web browsing, web images | |||
| Fine grained content-based adaptation mechanism for providing high end-user quality of experience with adaptive hypermedia systems | | BIBAK | Full-Text | 53-62 | |
| Cristina Hava Muntean; Jennifer McManis | |||
| New communication technologies can enable Web users to access personalised
information "anytime, anywhere". However, the network environments allowing
this "anytime, anywhere" access may have widely varying performance
characteristics such as bandwidth, level of congestion, mobility support, and
cost of transmission. It is unrealistic to expect that the quality of delivery
of the same content can be maintained in this variable environment, but rather
an effort must be made to fit the content served to the current delivery
conditions, thus ensuring high Quality of Experience (QoE) to the users. This
paper introduces an end-user QoE-aware adaptive hypermedia framework that
extends the adaptation functionality of adaptive hypermedia systems with a
fine-grained content-based adaptation mechanism. The proposed mechanism
attempts to take into account multiple factors affecting QoE in relation to the
delivery of Web content. Various simulation tests investigate the performance
improvements provided by this mechanism, in a home-like, low bit rate
operational environment, in terms of access time per page, aggregate access
time per browsing session and quantity of transmitted information. Keywords: adaptive hypermedia, content-based adaptation mechanism, distance education,
end-user quality of experience | |||
| Topical TrustRank: using topicality to combat web spam | | BIBAK | Full-Text | 63-72 | |
| Baoning Wu; Vinay Goel; Brian D. Davison | |||
| Web spam is behavior that attempts to deceive search engine ranking
algorithms. TrustRank is a recent algorithm that can combat web spam. However,
TrustRank is vulnerable in the sense that the seed set used by TrustRank may
not be sufficiently representative to cover well the different topics on the
Web. Also, for a given seed set, TrustRank has a bias towards larger
communities. We propose the use of topical information to partition the seed
set and calculate trust scores for each topic separately to address the above
issues. A combination of these trust scores for a page is used to determine its
ranking. Experimental results on two large datasets show that our Topical
TrustRank has a better performance than TrustRank in demoting spam sites or
pages. Compared to TrustRank, our best technique can decrease spam from the top
ranked sites by as much as 43.1%. Keywords: PageRank, TrustRank, spam, web search engine | |||
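A minimal sketch of the combination idea described above: partition the trusted seeds by topic, propagate trust per topic with a personalized-PageRank-style iteration, and sum the per-topic scores. The graph, damping value, and summation rule are assumptions; the paper evaluates several combination schemes and seed-partitioning strategies.

```python
def propagate_trust(graph, seeds, alpha=0.85, iters=50):
    """Personalized-PageRank-style trust propagation from a seed set.
    graph: {page: [outlinked pages]}; returns {page: trust score}."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    restart = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    score = dict(restart)
    for _ in range(iters):
        nxt = {p: (1 - alpha) * restart[p] for p in pages}
        for p, targets in graph.items():
            if targets:
                share = alpha * score[p] / len(targets)
                for q in targets:
                    nxt[q] += share
        score = nxt
    return score

def topical_trustrank(graph, seeds_by_topic):
    """Sum the per-topic trust scores (one simple combination strategy)."""
    total = {}
    for topic, seeds in seeds_by_topic.items():
        for page, s in propagate_trust(graph, seeds).items():
            total[page] = total.get(page, 0.0) + s
    return total

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "spam": ["spam2"], "spam2": ["spam"]}
print(topical_trustrank(graph, {"news": ["a"], "sports": ["b"]}))
```

Pages reachable only from the isolated spam cluster receive no trust from either topic, which is the intuition behind demoting spam in the ranking.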
| Site level noise removal for search engines | | BIBAK | Full-Text | 73-82 | |
| André Luiz da Costa Carvalho; Paul-Alexandru Chirita; Edleno Silva de Moura; Pável Calado; Wolfgang Nejdl | |||
| The currently booming search engine industry has led many online
organizations to attempt to artificially increase their ranking in order to
attract more visitors to their web sites. At the same time, the growth of the
web has also inherently generated several navigational hyperlink structures
that have a negative impact on the importance measures employed by current
search engines. In this paper we propose and evaluate algorithms for
identifying all these noisy links on the web graph, be they spam, simple
relationships between real world entities represented by sites, replication of
content, etc. Unlike prior work, we target a different type of noisy link
structures, residing at the site level, instead of the page level. We thus
investigate and annihilate site level mutual reinforcement relationships,
abnormal support coming from one site towards another, as well as complex link
alliances between web sites. Our experiments with the link database of the
TodoBR search engine show a very strong increase in the quality of the output
rankings after having applied our techniques. Keywords: PageRank, link analysis, noise reduction, spam | |||
| Detecting spam web pages through content analysis | | BIBAK | Full-Text | 83-92 | |
| Alexandros Ntoulas; Marc Najork; Mark Manasse; Dennis Fetterly | |||
| In this paper, we continue our investigations of "web spam": the injection
of artificially-created pages into the web in order to influence the results
from search engines, to drive traffic to certain pages for fun or profit. This
paper considers some previously-undescribed techniques for automatically
detecting spam pages, examines the effectiveness of these techniques in
isolation and when aggregated using classification algorithms. When combined,
our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%)
in our judged collection of 17,168 pages, while misidentifying 526 spam and
non-spam pages (3.1%). Keywords: data mining, web characterization, web pages, web spam | |||
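As a hedged sketch of the feature-plus-classifier pipeline described above, the code below computes two content heuristics per page (average word length and the fraction of words drawn from a small list of very common words) and trains an off-the-shelf decision tree. The feature set, the word list, and the tiny training data are illustrative assumptions, not the paper's actual heuristics or corpus.

```python
from sklearn.tree import DecisionTreeClassifier

POPULAR = {"the", "of", "and", "to", "a", "in", "is", "for"}

def content_features(text):
    """Two simple per-page content heuristics: average word length and
    the fraction of words drawn from a small list of very common words."""
    words = text.lower().split()
    if not words:
        return [0.0, 0.0]
    avg_len = sum(len(w) for w in words) / len(words)
    popular_frac = sum(w in POPULAR for w in words) / len(words)
    return [avg_len, popular_frac]

# Tiny hand-labelled training set (1 = spam, 0 = non-spam), purely illustrative.
pages = [
    ("cheap cheap viagra viagra casino casino online online", 1),
    ("buy now buy now free free free prize prize winner", 1),
    ("the committee met in april to discuss the budget for the year", 0),
    ("a short introduction to the history of the city and its museums", 0),
]
X = [content_features(t) for t, _ in pages]
y = [label for _, label in pages]

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(clf.predict([content_features("free free casino casino online prize")]))
```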
| XML screamer: an integrated approach to high performance XML parsing, validation and deserialization | | BIBAK | Full-Text | 93-102 | |
| Margaret G. Kostoulas; Morris Matsa; Noah Mendelsohn; Eric Perkins; Abraham Heifets; Martha Mercaldi | |||
| This paper describes an experimental system in which customized high
performance XML parsers are prepared using parser generation and compilation
techniques. Parsing is integrated with Schema-based validation and
deserialization, and the resulting validating processors are shown to be as
fast as or in many cases significantly faster than traditional nonvalidating
parsers. High performance is achieved by integration across layers of software
that are traditionally separate, by avoiding unnecessary data copying and
transformation, and by careful attention to detail in the generated code. The
effect of API design on XML performance is also briefly discussed. Keywords: JAX-RPC, SAX, XML, XML schema, parsing, performance, schema compilation,
validation | |||
| Symmetrically exploiting XML | | BIBAK | Full-Text | 103-111 | |
| Shuohao Zhang; Curtis Dyreson | |||
| Path expressions are the principal means of locating data in a hierarchical
model. But path expressions are brittle because they often depend on the
structure of data and break if the data is structured differently. The
structure of data could be unfamiliar to a user, may differ within a data
collection, or may change over time as the schema evolves. This paper proposes
a novel construct that locates related nodes in an instance of an XML data
model, independent of a specific structure. It can augment many XPath
expressions and can be seamlessly incorporated in XQuery or XSLT. Keywords: XML, XPath, XQuery, path expressions | |||
| FeedEx: collaborative exchange of news feeds | | BIBAK | Full-Text | 113-122 | |
| Seung Jun; Mustaque Ahamad | |||
| As most blogs and traditional media support RSS or Atom feeds, news feed
technology is becoming increasingly prevalent. Taking advantage of ubiquitous news
feeds, we design FeedEx, a news feed exchange system. Forming a distribution
overlay network, nodes in FeedEx not only fetch feed documents from the servers
but also exchange them with neighbors. Among many benefits of collaborative
feed exchange, we focus on the low-overhead, scalable delivery mechanism that
increases the availability of news feeds. Our design of FeedEx is
incentive-compatible, so that nodes are encouraged to cooperate rather than
free ride. In addition, to inform the design of FeedEx, we analyze the data
collected from 245 feeds for 10 days and present relevant statistics about news
feed publishing, including the distributions of feed size, entry lifetime, and
publishing rate.
Our experimental evaluation using 189 PlanetLab machines, which fetch from real-world feed servers, shows that FeedEx is an efficient system in many respects. Even when a node fetches feed documents as infrequently as every 16 hours, it captures more than 90% of the total entries published, and those captured entries are available within 22 minutes on average after being published at the servers. By contrast, stand-alone applications under the same conditions show 36% entry coverage and a 5.7-hour time lag. The efficient delivery of FeedEx is achieved with low communication overhead, as each node receives only 0.9 document exchange calls and 6.3 document checking calls per minute on average. Keywords: FeedEx, RSS, atom, collaborative exchange, news feeds | |||
| Examining the content and privacy of web browsing incidental information | | BIBAK | Full-Text | 123-132 | |
| Kirstie Hawkey; Kori M. Inkpen | |||
| This research examines the privacy comfort levels of participants if others
can view traces of their web browsing activity. During a week-long field study,
participants used an electronic diary daily to annotate each web page visited
with a privacy level. Content categories were used by participants to
theoretically specify their privacy comfort for each category and by
researchers to partition participants' actual browsing. The content categories
were clustered into groups based on the dominant privacy levels applied to the
pages. Inconsistencies between participants in their privacy ratings of
categories suggest that a general privacy management scheme is inappropriate.
Participants' consistency within categories suggests that a personalized scheme
may be feasible; however, a more fine-grained approach to classification is
required to improve results for sites that tend to be general, of multiple task
purposes, or dynamic in content. Keywords: ad hoc collaboration, client-side logging, field study, personalization,
privacy, web browsing behaviour, web page content | |||
| Off the beaten tracks: exploring three aspects of web navigation | | BIBAK | Full-Text | 133-142 | |
| Harald Weinreich; Hartmut Obendorf; Eelco Herder; Matthias Mayer | |||
| This paper presents results of a long-term client-side Web usage study,
updating previous studies that range in age from five to ten years. We focus on
three aspects of Web navigation: changes in the distribution of navigation
actions, speed of navigation, and within-page navigation. "Navigation actions"
corresponding to users' individual page requests are discussed by type. We
reconfirm links to be the most important navigation element, while backtracking
has lost more than half of its previously reported share and form submission
has become far more common. Changes of the Web and the browser interfaces are
candidates for causing these changes.
Analyzing the time users stayed on pages, we confirm Web navigation to be a rapidly interactive activity. A breakdown of page characteristics shows that users often do not take the time to read the available text or consider all links. The performance of the Web is analyzed and reassessed against the resulting requirements. Finally, habits of within-page navigation are presented. Although most selected hyperlinks are located in the top left corner of the screen, in nearly a quarter of all cases people choose links that require scrolling. We analyzed the available browser real estate to gain insights for the design of non-scrolling Web pages. Keywords: browser interfaces, clickstream study, hypertext, navigation, user modeling | |||
| pTHINC: a thin-client architecture for mobile wireless web | | BIBAK | Full-Text | 143-152 | |
| Joeng Kim; Ricardo A. Baratto; Jason Nieh | |||
| Although web applications are gaining popularity on mobile wireless PDAs,
web browsers on these systems can be quite slow and often lack adequate
functionality to access many web sites. We have developed pTHINC, a PDA
thin-client solution that leverages more powerful servers to run full-function
web browsers and other application logic, then sends simple screen updates to
the PDA for display. pTHINC uses server-side screen scaling to provide
high-fidelity display and seamless mobility across a broad range of different
clients and screen sizes, including both portrait and landscape viewing modes.
pTHINC also leverages existing PDA control buttons to improve system usability
and maximize available screen resolution for application display. We have
implemented pTHINC on Windows Mobile and evaluated its performance on mobile
wireless devices. Our results compared to local PDA web browsers and other
thin-client approaches demonstrate that pTHINC provides superior web browsing
performance and is the only PDA thin client that effectively supports crucial
browser helper applications such as video playback. Keywords: mobility, pervasive web, remote display, thin-client computing | |||
| Bringing communities to the semantic web and the semantic web to communities | | BIBAK | Full-Text | 153-162 | |
| K. Faith Lawrence; m. c. schraefel | |||
| In this paper we consider the types of community networks that are most
often codified within the Semantic Web. We propose the recognition of a new
structure which fulfils the definition of community used outside the Semantic
Web. We argue that the properties inherent in a community allow additional
processing to be done with the described relationships existing between
entities within the community network. Taking an existing online community as a
case study we describe the ontologies and applications that we developed to
support this community in the Semantic Web environment and discuss what lessons
can be learnt from this exercise and applied in more general settings. Keywords: case study, communities, e-applications, semantic web | |||
| Invisible participants: how cultural capital relates to lurking behavior | | BIBAK | Full-Text | 163-172 | |
| Vladimir Soroka; Sheizaf Rafaeli | |||
| The asymmetry of activity in virtual communities is of great interest. While
participation in the activities of virtual communities is crucial for a
community's survival and development, many people prefer lurking, that is,
passive attention over active participation. Lurking can be measured and
perhaps affected by both dispositional and situational variables. This work
investigates the concept of cultural capital as a situational antecedent of
lurking and de-lurking (the decision to start posting after a certain amount of
lurking time). Cultural capital is defined as the knowledge that enables an
individual to interpret various cultural codes. The main hypothesis states that
a user's cultural capital affects her level of activity in a community and her
decision to de-lurk and cease to exist in very active communities because of
information overload. This hypothesis is analyzed by mathematically defining a
social communication network (SCN) of activities in authenticated discussion
forums. We validate this model by examining the SCN using data collected in a
sample of 636 online forums at the Open University of Israel and two work-based
communities from IBM. The hypotheses verified here make it clear that fostering
receptive participation may be as important and constructive as encouraging
active contributions in online communities. Keywords: Web forums, cultural capital, e-learning, lurking | |||
| Probabilistic models for discovering e-communities | | BIBAK | Full-Text | 173-182 | |
| Ding Zhou; Eren Manavoglu; Jia Li; C. Lee Giles; Hongyuan Zha | |||
| The increasing amount of communication between individuals in e-formats
(e.g., email, instant messaging, and the Web) has motivated computational
research in social network analysis (SNA). Previous work in SNA has emphasized
the social network (SN) topology measured by communication frequencies while
ignoring the semantic information in SNs. In this paper, we propose two
generative Bayesian models for semantic community discovery in SNs, combining
probabilistic modeling with community detection in SNs. To simulate the
generative models, an EnF-Gibbs sampling algorithm is proposed to address the
efficiency and performance problems of traditional methods. Experimental
studies on the Enron email corpus show that our approach successfully detects the
communities of individuals and in addition provides semantic topic descriptions
of these communities. Keywords: Gibbs sampling, clustering, data mining, email, social network, statistical
modeling | |||
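To illustrate the Gibbs-sampling machinery mentioned above, here is a minimal collapsed Gibbs sampler for a plain topic model over tokenized messages. This is only a generic sketch under simplifying assumptions: the paper's generative models additionally condition on the communication structure of the social network, and its EnF-Gibbs algorithm adds entropy-based filtering that this toy version omits.

```python
import random
from collections import defaultdict

def gibbs_topic_model(docs, n_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for a plain topic model over tokenized messages.
    Returns per-topic word counts, from which topic descriptions can be read."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    z = []                                     # topic assignment per token
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for di, doc in enumerate(docs):            # random initialization
        zd = []
        for w in doc:
            t = rng.randrange(n_topics)
            zd.append(t)
            doc_topic[di][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        z.append(zd)
    for _ in range(iters):                     # resample every token's topic
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                t = z[di][wi]
                doc_topic[di][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                weights = [(doc_topic[di][k] + alpha) *
                           (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                           for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[di][wi] = t
                doc_topic[di][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return topic_word

docs = [["budget", "meeting", "report"], ["gas", "pipeline", "energy"],
        ["meeting", "report", "budget"], ["energy", "gas", "trading"]]
for k, counts in enumerate(gibbs_topic_model(docs)):
    print(k, sorted(counts, key=counts.get, reverse=True)[:3])
```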
| The web beyond popularity: a really simple system for web scale RSS | | BIBAK | Full-Text | 183-192 | |
| Daniel Gruhl; Daniel N. Meredith; Jan H. Pieper; Alex Cozzi; Stephen Dill | |||
| Popularity-based search engines have served to stagnate information
retrieval from the web. Developed to deal with the very real problem of
degrading quality within keyword-based search, they have had the unintended side
effect of creating "icebergs" around topics, where only a small minority of the
information is above the popularity water-line. This problem is especially
pronounced with emerging information -- new sites are often hidden until they
become popular enough to be considered above the water-line. In domains new to
a user this is often helpful -- they can focus on popular sites first.
Unfortunately it is not the best tool for a professional seeking to keep
up-to-date with a topic as it emerges and evolves.
We present a tool focused on this audience -- a system that addresses the very large scale information gathering, filtering and routing, and presentation problems associated with creating a useful incremental stream of information from the web as a whole. Utilizing the WebFountain platform as the primary data engine and Really Simple Syndication (RSS) as the delivery mechanism, our "Daily Deltas" (Delta) application is able to provide an informative feed of relevant content directly to a user. Individuals receive a personalized, incremental feed of pages related to their topic allowing them to track their interests independent of the overall popularity of the topic. Keywords: Daily Delta, RSS, WebFountain, crawler, document routing, internet | |||
| Visualizing tags over time | | BIBAK | Full-Text | 193-202 | |
| Micah Dubinko; Ravi Kumar; Joseph Magnani; Jasmine Novak; Prabhakar Raghavan; Andrew Tomkins | |||
| We consider the problem of visualizing the evolution of tags within the
Flickr (flickr.com) online image sharing community. Any user of the Flickr
service may append a tag to any photo in the system. Over the past year, users
have on average added over a million tags each week. Understanding the
evolution of these tags over time is therefore a challenging task. We present a
new approach based on a characterization of the most interesting tags
associated with a sliding interval of time. An animation provided via Flash in
a web browser allows the user to observe and interact with the interesting tags
as they evolve over time.
New algorithms and data structures are required to support the efficient generation of this visualization. We combine a novel solution to an interval covering problem with extensions to previous work on score aggregation in order to create an efficient backend system capable of producing visualizations at arbitrary scales on this large dataset in real time. Keywords: Flickr, interval covering, social media, tags, temporal evolution,
visualization | |||
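A hedged sketch of one way to characterize "interesting" tags in a sliding interval: score each tag by how much its frequency inside the interval exceeds its overall baseline frequency, so globally popular tags do not dominate every window. The scoring formula and data are assumptions; the paper's interestingness measure and its interval-covering backend are more involved.

```python
from collections import Counter

def interesting_tags(tagged_photos, start, end, top_k=5):
    """Score tags by in-window frequency relative to overall frequency.
    tagged_photos: iterable of (timestamp, tag) pairs."""
    overall = Counter(tag for _, tag in tagged_photos)
    window = Counter(tag for ts, tag in tagged_photos if start <= ts < end)
    total = sum(overall.values())
    scores = {tag: (count / sum(window.values())) / (overall[tag] / total)
              for tag, count in window.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

data = [(1, "cat"), (1, "cat"), (2, "halloween"), (2, "pumpkin"),
        (2, "cat"), (3, "halloween"), (3, "cat"), (9, "snow")]
print(interesting_tags(data, 2, 4))
```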
| Knowing the user's every move: user activity tracking for website usability evaluation and implicit interaction | | BIBAK | Full-Text | 203-212 | |
| Richard Atterer; Monika Wnuk; Albrecht Schmidt | |||
| In this paper, we investigate how user interaction can be tracked in detail
using standard web technologies. Our motivation is to enable
implicit interaction and to ease usability evaluation of web applications
outside the lab. To obtain meaningful statements on how users interact with a
web application, the collected information needs to be more detailed and
fine-grained than that provided by classical log files. We focus on tasks such
as classifying the user with regard to computer usage proficiency or making a
detailed assessment of how long it took users to fill in fields of a form.
Additionally, it is important in the context of our work that usage tracking
should not alter the user's experience and that it should work with existing
server and browser setups. We present an implementation for detailed tracking
of user actions on web pages. An HTTP proxy modifies HTML pages by adding
JavaScript code before delivering them to the client. This JavaScript tracking
code collects data about mouse movements, keyboard input and more. We
demonstrate the usefulness of our approach in a case study. Keywords: HTTP proxy, implicit interaction, mouse tracking, user activity tracking,
website usability evaluation | |||
| Finding advertising keywords on web pages | | BIBAK | Full-Text | 213-222 | |
| Wen-tau Yih; Joshua Goodman; Vitor R. Carvalho | |||
| A large and growing number of web pages display contextual advertising based
on keywords automatically extracted from the text of the page, and this is a
substantial source of revenue supporting the web today. Despite the importance
of this area, little formal, published research exists. We describe a system
that learns how to extract keywords from web pages for advertisement targeting.
The system uses a number of features, such as term frequency of each potential
keyword, inverse document frequency, presence in meta-data, and how often the
term occurs in search query logs. The system is trained with a set of example
pages that have been hand-labeled with "relevant" keywords. Based on this
training, it can then extract new keywords from previously unseen pages.
Accuracy is substantially better than several baseline systems. Keywords: advertising, information extraction, keyword extraction | |||
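The following hedged sketch mirrors the feature-and-classifier pipeline described above: for each candidate term, compute term frequency, an inverse-document-frequency value, a flag for appearing in metadata, and a query-log frequency, then train a logistic regression on hand-labelled examples. The specific feature formulas, corpus statistics, and model choice here are illustrative assumptions rather than the paper's system.

```python
from sklearn.linear_model import LogisticRegression

def candidate_features(term, page_text, meta_text, df, query_log_freq):
    """Per-candidate features in the spirit of the paper: term frequency on
    the page, an inverse document-frequency value, presence in meta tags,
    and how often the term occurs in search query logs."""
    words = page_text.lower().split()
    tf = words.count(term) / len(words) if words else 0.0
    return [tf,
            1.0 / (1 + df.get(term, 0)),
            float(term in meta_text.lower().split()),
            query_log_freq.get(term, 0.0)]

# Illustrative corpus statistics and hand-labelled training data.
df = {"camera": 120, "digital": 300, "the": 10000}
qlog = {"camera": 0.8, "digital": 0.5, "the": 0.0}
page = "the digital camera reviews compare the best digital camera deals"
meta = "digital camera reviews"

train = [("camera", 1), ("digital", 1), ("the", 0)]
X = [candidate_features(t, page, meta, df, qlog) for t, _ in train]
y = [label for _, label in train]
model = LogisticRegression().fit(X, y)

for term in ["camera", "the"]:
    proba = model.predict_proba([candidate_features(term, page, meta, df, qlog)])
    print(term, proba[0][1])
```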
| Communities from seed sets | | BIBAK | Full-Text | 223-232 | |
| Reid Andersen; Kevin J. Lang | |||
| Expanding a seed set into a larger community is a common procedure in
link-based analysis. We show how to adapt recent results from theoretical
computer science to expand a seed set into a community with small conductance
and a strong relationship to the seed, while examining only a small
neighborhood of the entire graph. We extend existing results to give
theoretical guarantees that apply to a variety of seed sets from specified
communities. We also describe simple and flexible heuristics for applying these
methods in practice, and present early experiments showing that these methods
compare favorably with existing approaches. Keywords: community finding, graph conductance, link analysis, random walks, seed sets | |||
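A minimal sketch of the general recipe behind such methods: run a short lazy random walk from the seed set, order vertices by walk probability, and sweep over prefixes of that ordering to keep the set with the smallest conductance. The graph, walk length, and sweep rule are assumptions; the paper's algorithms carry theoretical guarantees that this toy version does not.

```python
def walk_probs(adj, seeds, steps=8):
    """Distribution of a lazy random walk started uniformly on the seed set.
    adj: {node: set of neighbours} for an undirected graph."""
    p = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in adj}
    for _ in range(steps):
        nxt = {v: 0.5 * p[v] for v in adj}          # lazy: stay with prob 1/2
        for v in adj:
            for u in adj[v]:
                nxt[u] += 0.5 * p[v] / len(adj[v])
        p = nxt
    return p

def conductance(adj, S):
    """cut(S, complement) divided by the smaller of the two volumes."""
    cut = sum(1 for v in S for u in adj[v] if u not in S)
    vol_S = sum(len(adj[v]) for v in S)
    vol_rest = sum(len(adj[v]) for v in adj) - vol_S
    denom = min(vol_S, vol_rest)
    return cut / denom if denom else 1.0

def expand_seed(adj, seeds):
    """Sweep over prefixes of the walk-probability ordering, keep the best cut."""
    probs = walk_probs(adj, seeds)
    order = sorted(adj, key=probs.get, reverse=True)
    best, best_phi = None, float("inf")
    for i in range(1, len(order)):
        S = set(order[:i])
        phi = conductance(adj, S)
        if phi < best_phi:
            best, best_phi = S, phi
    return best, best_phi

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5}}
print(expand_seed(adj, {1}))
```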
| What's really new on the web?: identifying new pages from a series of unstable web snapshots | | BIBAK | Full-Text | 233-241 | |
| Masashi Toyoda; Masaru Kitsuregawa | |||
| Identifying and tracking new information on the Web is important in
sociology, marketing, and survey research, since new trends might be apparent
in the new information. Such changes can be observed by crawling the Web
periodically. In practice, however, it is impossible to crawl the entire
expanding Web repeatedly. This means that the novelty of a page remains
unknown, even if that page did not exist in previous snapshots. In this paper,
we propose a novelty measure for estimating the certainty that a newly crawled
page appeared between the previous and current crawls. Using this novelty
measure, new pages can be extracted from a series of unstable snapshots for
further analysis and mining to identify new trends on the Web. We evaluated the
precision, recall, and miss rate of the novelty measure using our Japanese web
archive, and applied it to a Web archive search engine. Keywords: information retrieval, link analysis, novelty, web evolution | |||
| A case for software assurance | | BIBA | Full-Text | 243 | |
| Mary Ann Davidson | |||
| Information technology has become "infrastructure technology," as most sectors of critical infrastructure rest on an IT backbone. Yet IT systems are not yet designed to be as safe, secure and reliable as physical infrastructure. Improving the security worthiness of commercial software requires a significant change in the development and product delivery process across the board. The security worthiness of all commercial software -- from all vendors -- demands that assurance become a critical focus for both providers and customers of IT. Throughout Oracle's long history of building and delivering secure software, we have continued to invest heavily in building security into each component of the product lifecycle. This is also an "organic" process which is regularly being enhanced to improve overall security practices. Our efforts have evolved from a formal development process to now additionally include secure coding standards, intensive developer training, innovative "bug finding" tools and working with leading vendors to "raise the bar" for all of industry as it pertains to security. | |||
| 'e-science and cyberinfrastructure: a middleware perspective' | | BIBA | Full-Text | 245 | |
| Tony Hey | |||
| The Internet was the inspiration of J. C. R. Licklider when he was at the
Advanced Research Projects Agency in the 1960s. In those pre-Moore's Law days,
Licklider imagined a future in which researchers could access and use computers
and data from anywhere in the world. Today, as everyone knows, the killer
applications for the Internet were email in the 1970s and the World Wide Web
in the 1990s, the latter developed initially as a collaboration tool for the
particle physics academic community. In the future, frontier research in many
fields will increasingly require the collaboration of globally distributed
groups of researchers needing access to distributed computing, data resources
and support for remote access to expensive, multi-national specialized
facilities such as telescopes and accelerators or specialist data archives. In
the context of science and engineering, this is the 'e-Science' agenda. Robust
middleware services deployed on top of research networks will constitute a
powerful 'Cyberinfrastructure' for collaborative science and engineering.
This talk will review the elements of this vision and describe the present status of efforts to build such an internet-scale distributed infrastructure based on Web Services. The goal is to provide robust middleware components that will allow scientists and engineers to routinely construct inter-organizational 'Virtual Organizations'. Given the present state of Web Services, we argue for the need to define such Virtual Organization 'Grid' services on well-established Web Service specifications that are widely supported by the IT industry. Only industry can provide the necessary tooling and development environments to enable widespread adoption of such Grid services. Extensions to these basic Grid services can be added as more Web Services mature and the research community has had the opportunity to experiment with new services providing potentially useful new functionalities. The new Cyberinfrastructure will be of relevance to more than just the research community: it will impact both the e-learning and digital library communities, allowing the creation of scientific 'mash-ups' of services that give significant added value. | |||
| SecuBat: a web vulnerability scanner | | BIBAK | Full-Text | 247-256 | |
| Stefan Kals; Engin Kirda; Christopher Kruegel; Nenad Jovanovic | |||
| As the popularity of the web increases and web applications become tools of
everyday use, the role of web security has been gaining importance as well. The
last years have shown a significant increase in the number of web-based
attacks. For example, there has been extensive press coverage of recent
security incidents involving the loss of sensitive credit card information
belonging to millions of customers.
Many web application security vulnerabilities result from generic input validation problems. Examples of such vulnerabilities are SQL injection and Cross-Site Scripting (XSS). Although the majority of web vulnerabilities are easy to understand and to avoid, many web developers are, unfortunately, not security-aware. As a result, there exist many web sites on the Internet that are vulnerable. This paper demonstrates how easy it is for attackers to automatically discover and exploit application-level vulnerabilities in a large number of web applications. To this end, we developed SecuBat, a generic and modular web vulnerability scanner that, similar to a port scanner, automatically analyzes web sites with the aim of finding exploitable SQL injection and XSS vulnerabilities. Using SecuBat, we were able to find many potentially vulnerable web sites. To verify the accuracy of SecuBat, we picked one hundred interesting web sites from the potential victim list for further analysis and confirmed exploitable flaws in the identified web pages. Among our victims were well-known global companies and a finance ministry. Of course, we notified the administrators of vulnerable sites about potential security problems. More than fifty responded to request additional information or to report that the security hole was closed. Keywords: SQL injection, XSS, automated vulnerability detection, crawling, cross-site
scripting, scanner, security | |||
| Access control enforcement for conversation-based web services | | BIBAK | Full-Text | 257-266 | |
| Massimo Mecella; Mourad Ouzzani; Federica Paci; Elisa Bertino | |||
| Service Oriented Computing is emerging as the main approach to build
distributed enterprise applications on the Web. The widespread use of Web
services is hindered by the lack of adequate security and privacy support. In
this paper, we present a novel framework for enforcing access control in
conversation-based Web services. Our approach takes into account the
conversational nature of Web services. This is in contrast with existing
approaches to access control enforcement that assume a Web service as a set of
independent operations. Furthermore, our approach achieves a tradeoff between
the need to protect Web service's access control policies and the need to
disclose to clients the portion of access control policies related to the
conversations they are interested in. This is important to avoid situations
where the client cannot progress in the conversation because it cannot satisfy
the relevant security requirements. We introduce the concept of k-trustworthiness
that defines the conversations for which a client can provide credentials
maximizing the likelihood that it will eventually hit a final state. Keywords: access control, conversations, transition systems, web services | |||
| Analysis of communication models in web service compositions | | BIBAK | Full-Text | 267-276 | |
| Raman Kazhamiakin; Marco Pistore; Luca Santuari | |||
| In this paper we describe an approach for the verification of Web service
compositions defined by sets of BPEL processes. The key aspect of such a
verification is the model adopted for representing the communications among the
services participating in the composition. Indeed, these communications are
asynchronous and buffered in the existing execution frameworks, while most
verification approaches assume a synchronous communication model for efficiency
reasons. In our approach, we develop a parametric model for describing Web
service compositions, which allows us to capture a hierarchy of communication
models, ranging from synchronous communications to asynchronous communications
with complex buffer structures. Moreover, we develop a technique to associate
with a Web service composition the most adequate communication model, i.e., the
simplest model that is sufficient to capture all the behaviors of the
composition. This way, we can provide an accurate model of a wider class of
service composition scenarios, while preserving as much as possible an
efficient performance in verification. Keywords: BPEL, asynchronous communications, formal verification, web service
composition | |||
| Toward tighter integration of web search with a geographic information system | | BIBAK | Full-Text | 277-286 | |
| Taro Tezuka; Takeshi Kurashima; Katsumi Tanaka | |||
| Integration of Web search with geographic information has recently attracted
much attention. There are a number of local Web search systems enabling users
to find location-specific Web content. In this paper, however, we point out
that this integration is still at a superficial level. Most local Web search
systems today only link local Web content to a map interface. They are
extensions of a conventional stand-alone geographic information system (GIS),
applied to a Web-based client-server architecture. In this paper, we discuss
the directions available for tighter integration of Web search with a GIS, in
terms of extraction, knowledge discovery, and presentation. We also describe
implementations to support our argument that the integration must go beyond the
simple map-and-hyperlink architecture. Keywords: local web search, web mining, web-GIS integration | |||
| Geographically focused collaborative crawling | | BIBAK | Full-Text | 287-296 | |
| Weizheng Gao; Hyun Chul Lee; Yingbo Miao | |||
| A collaborative crawler is a group of crawling nodes, in which each crawling
node is responsible for a specific portion of the web. We study the problem of
collecting geographically-aware pages using collaborative crawling strategies.
We first propose several collaborative crawling strategies for the
geographically focused crawling, whose goal is to collect web pages about
specified geographic locations, by considering features like URL address of
page, content of page, extended anchor text of link, and others. Later, we
propose various evaluation criteria to qualify the performance of such crawling
strategies. Finally, we experimentally study our crawling strategies by
crawling the real web data showing that some of our crawling strategies greatly
outperform the simple URL-hash based partition collaborative crawling, in which
the crawling assignments are determined according to the hash-value computation
over URLs. More precisely, features like URL address of page and extended
anchor text of link are shown to yield the best overall performance for the
geographically focused crawling. Keywords: collaborative crawling, geographic entities, geographically focused crawling | |||
| To randomize or not to randomize: space optimal summaries for hyperlink analysis | | BIBAK | Full-Text | 297-306 | |
| Tamás Sarlós; András A. Benczúr; Károly Csalogány; Dániel Fogaras; Balázs Rácz | |||
| Personalized PageRank expresses link-based page quality around user selected
pages. The only previous personalized PageRank algorithm that can serve on-line
queries for an unrestricted choice of pages on large graphs is our Monte Carlo
algorithm [WAW 2004]. In this paper we achieve unrestricted personalization by
combining rounding and randomized sketching techniques in the dynamic
programming algorithm of Jeh and Widom [WWW 2003]. We evaluate the precision of
approximation experimentally on large scale real-world data and find
significant improvement over previous results. As a key theoretical
contribution we show that our algorithms use an optimal amount of space by also
improving earlier asymptotic worst-case lower bounds. Our lower bounds and
algorithms apply to the SimRank as well; of independent interest is the
reduction of the SimRank computation to personalized PageRank. Keywords: data streams, link-analysis, scalability, similarity search | |||
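For intuition about the Monte Carlo approach the abstract refers to, here is a hedged sketch: simulate random walks that restart at the personalization page with a fixed probability and record where each walk ends; the endpoint frequencies estimate the personalized PageRank vector. The graph, restart probability, and dangling-node handling are assumptions, and the paper's actual contribution (rounding and sketching in the Jeh-Widom dynamic program) is not shown here.

```python
import random
from collections import Counter

def monte_carlo_ppr(graph, source, eps=0.15, n_walks=10000, seed=0):
    """Estimate personalized PageRank of `source` by simulating walks that
    stop (i.e., teleport back to `source`) with probability eps at each step
    and recording where each walk ends. graph: {node: [out-neighbours]}."""
    rng = random.Random(seed)
    ends = Counter()
    for _ in range(n_walks):
        node = source
        while rng.random() > eps:             # continue the walk w.p. 1 - eps
            out = graph.get(node)
            if not out:                        # dangling node: restart
                node = source
                continue
            node = rng.choice(out)
        ends[node] += 1
    return {v: c / n_walks for v, c in ends.items()}

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(monte_carlo_ppr(graph, "a"))
```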
| Addressing the testing challenge with a web-based e-assessment system that tutors as it assesses | | BIBAK | Full-Text | 307-316 | |
| Mingyu Feng; Neil T. Heffernan; Kenneth R. Koedinger | |||
| Secondary teachers across the country are being asked to use formative
assessment data to inform their classroom instruction. At the same time,
critics of No Child Left Behind are calling the bill "No Child Left Untested",
emphasizing the negative side of assessment, in that every hour spent assessing
students is an hour lost from instruction. Or does it have to be? What if we
better integrated assessment into the classroom, and we allowed students to
learn during the test? Maybe we could even provide tutoring on the steps of
solving problems. Our hypothesis is that we can achieve more accurate
assessment by not only using data on whether students get test items right or
wrong, but by also using data on the effort required for students to learn how
to solve a test item. We provide evidence for this hypothesis using data
collected with our E-ASSISTment system by more than 600 students over the
course of the 2004-2005 school year. We also show that we can track student
knowledge over time using modern longitudinal data analysis techniques. In a
separate paper [9], we report on the ASSISTment system's architecture and
scalability, while this paper is focused on how we can reliably assess student
learning. Keywords: ASSISTment, MCAS, intelligent tutoring system, learning, predict | |||
| Knowledge modeling and its application in life sciences: a tale of two ontologies | | BIBAK | Full-Text | 317-326 | |
| Satya S. Sahoo; Christopher Thomas; Amit Sheth; William S. York; Samir Tartir | |||
| High throughput glycoproteomics, similar to genomics and proteomics,
involves extremely large volumes of distributed, heterogeneous data as a basis
for identification and quantification of a structurally diverse collection of
biomolecules. The ability to share, compare, query for and most critically
correlate datasets using the native biological relationships are some of the
challenges being faced by glycobiology researchers. As a solution for these
challenges, we are building a semantic structure, using a suite of ontologies,
which supports management of data and information at each step of the
experimental lifecycle. This framework will enable researchers to leverage the
large scale of glycoproteomics data to their benefit.
In this paper, we focus on the design of these biological ontology schemas with an emphasis on relationships between biological concepts, on the use of novel approaches to populate these complex ontologies including integrating extremely large datasets ( 500MB) as part of the instance base and on the evaluation of ontologies using OntoQA [38] metrics. The application of these ontologies in providing informatics solutions, for high throughput glycoproteomics experimental domain, is also discussed. We present our experience as a use case of developing two ontologies in one domain, to be part of a set of use cases, which are used in the development of an emergent framework for building and deploying biological ontologies. Keywords: ProPreO, bioinformatics ontology, biological ontology development, glycO,
glycoproteomics, ontology population, ontology structural metrics, semantic
bioinformatics | |||
| Reappraising cognitive styles in adaptive web applications | | BIBAK | Full-Text | 327-335 | |
| Elizabeth Brown; Tim Brailsford; Tony Fisher; Adam Moore; Helen Ashman | |||
| The mechanisms for personalisation used in web applications are currently
the subject of much debate amongst researchers from many diverse subject areas.
One of the most contemporary ideas for user modelling in web applications is
that of cognitive styles, where a user's psychological preferences are assessed,
stored in a database, and then used to provide personalised content and/or
links. We describe user trials of a case study that utilises visual-verbal
preferences in an adaptive web-based educational system (AWBES). Students in
this trial were assessed by the Felder-Solomon Inventory of Learning Styles
(ILS) instrument, and their preferences were used as a means of content
personalisation.
Contrary to previous findings by other researchers, we found no significant differences in performance between matched and mismatched students. Conclusions are drawn about the value and validity of using cognitive styles as a way of modelling user preferences in educational web applications. Keywords: adaptive hypermedia, cognitive styles, user modelling, user trials, web
applications | |||
| Cat and mouse: content delivery tradeoffs in web access | | BIBAK | Full-Text | 337-346 | |
| Balachander Krishnamurthy; Craig E. Wills | |||
| Web pages include extraneous material that may be viewed as undesirable by a
user. Increasingly many Web sites also require users to register to access
either all or portions of the site. Such tension between content owners and
users has resulted in a "cat and mouse" game over how content is provided and how
users access it.
We carried out a measurement-based study to understand the nature of extraneous content and its impact on performance as perceived by users. We characterize how this content is distributed and the effectiveness of blocking mechanisms to stop it as well as countermeasures taken by content owners to negate such mechanisms. We also examine sites that require some form of registration to control access and the attempts made to circumvent it. Results from our study show that extraneous content exists on a majority of popular pages and that a 25-30% reduction in downloaded objects and bytes with corresponding latency reduction can be attained by blocking such content. The top ten advertisement delivering companies delivered 40% of all URLs matched as ads in our study. Both the server name and the remainder of the URL are important in matching a URL as an ad. A majority of popular sites require some form of registration and for such sites users can obtain an account from a shared public database. We discuss future measures and countermeasures on the part of each side. Keywords: anonymity, content blocking, privacy, web registration | |||
| WAP5: black-box performance debugging for wide-area systems | | BIBAK | Full-Text | 347-356 | |
| Patrick Reynolds; Janet L. Wiener; Jeffrey C. Mogul; Marcos K. Aguilera; Amin Vahdat | |||
| Wide-area distributed applications are challenging to debug, optimize, and
maintain. We present Wide-Area Project 5 (WAP5), which aims to make these tasks
easier by exposing the causal structure of communication within an application
and by exposing delays that imply bottlenecks. These bottlenecks might not
otherwise be obvious, with or without the application's source code. Previous
research projects have presented algorithms to reconstruct application
structure and the corresponding timing information from black-box message
traces of local-area systems. In this paper we present (1) a new algorithm for
reconstructing application structure in both local- and wide-area distributed
systems, (2) an infrastructure for gathering application traces in PlanetLab,
and (3) our experiences tracing and analyzing three systems: CoDeeN and Coral,
two content-distribution networks in PlanetLab; and Slurpee, an
enterprise-scale incident-monitoring system. Keywords: black box systems, distributed systems, performance analysis, performance
debugging | |||
| WS-replication: a framework for highly available web services | | BIBAK | Full-Text | 357-366 | |
| Jorge Salas; Francisco Perez-Sorrosal; Marta Patiño-Martínez; Ricardo Jiménez-Peris | |||
| Due to the rapid acceptance and fast spread of web services, a number of
mission-critical systems will be deployed as web services in the coming years. The
availability of those systems must be guaranteed in case of failures and
network disconnections. Examples of web services for which availability will
be a crucial issue are those belonging to the coordination web service
infrastructure, such as web services for transactional coordination (e.g.,
WS-CAF and WS-Transaction). These services should remain available despite site
and connectivity failures to enable business interactions on a 24x7 basis. A
common technique for attaining availability is the use of a
clustering approach. However, in an Internet setting a domain can get
partitioned from the network due to a link overload or some other connectivity
problems. The unavailability of a coordination service impacts the availability
of all the partners in the business process. That is, coordination services are
an example of critical components that need higher provisions for availability.
In this paper, we address this problem by providing an infrastructure,
WS-Replication, for WAN replication of web services. The infrastructure is
based on a group communication web service, WS-Multicast, that respects the web
service autonomy. The transport of WS-Multicast is based on SOAP and relies
exclusively on web service technology for interaction across organizations. We
have replicated WS-CAF using our WS-Replication framework and evaluated its
performance. Keywords: WS-CAF, availability, group communication, transactions, web services | |||
| Random sampling from a search engine's index | | BIBAK | Full-Text | 367-376 | |
| Ziv Bar-Yossef; Maxim Gurevich | |||
| We revisit a problem introduced by Bharat and Broder almost a decade ago:
how to sample random pages from a search engine's index using only the search
engine's public interface? Such a primitive is particularly useful in creating
objective benchmarks for search engines.
The technique of Bharat and Broder suffers from two well recorded biases: it favors long documents and highly ranked documents. In this paper we introduce two novel sampling techniques: a lexicon-based technique and a random walk technique. Our methods produce biased sample documents, but each sample is accompanied by a corresponding "weight", which represents the probability of this document to be selected in the sample. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to three well known Monte Carlo simulation methods: rejection sampling, importance sampling and the Metropolis-Hastings algorithm. We analyze our methods rigorously and prove that under plausible assumptions, our techniques are guaranteed to produce near-uniform samples from the search engine's index. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long or highly ranked documents. We use our algorithms to collect fresh data about the relative sizes of Google, MSN Search, and Yahoo!. Keywords: benchmarks, sampling, search engines, size estimation | |||
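A minimal sketch of the weight-correction step described above, using rejection sampling: a biased sampler produces documents with known selection weights, and each draw is accepted with probability proportional to the inverse of its weight, which cancels the bias and yields approximately uniform samples. The pool, weights, and the stand-in biased sampler are assumptions; the real samplers interact with a search engine's public interface and are far more involved.

```python
import random
from collections import Counter

def biased_draw(pool, rng):
    """Stand-in for the biased sampler: documents are drawn with probability
    proportional to their (known) selection weight."""
    docs_with_weights = list(pool)
    weights = [w for _, w in pool]
    return rng.choices(docs_with_weights, weights=weights)[0]

def near_uniform_sample(pool, n, seed=0):
    """Rejection sampling: accept a biased draw with probability
    min_weight / weight, which cancels the bias and yields ~uniform samples."""
    rng = random.Random(seed)
    min_w = min(w for _, w in pool)
    accepted = []
    while len(accepted) < n:
        doc, w = biased_draw(pool, rng)
        if rng.random() < min_w / w:
            accepted.append(doc)
    return accepted

# Toy pool in which long documents are four times as likely to be sampled.
pool = [("short-1", 0.1), ("short-2", 0.1), ("long-1", 0.4), ("long-2", 0.4)]
print(Counter(near_uniform_sample(pool, 2000)))
```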
| A web-based kernel function for measuring the similarity of short text snippets | | BIBAK | Full-Text | 377-386 | |
| Mehran Sahami; Timothy D. Heilman | |||
| Determining the similarity of short text snippets, such as search queries,
works poorly with traditional document similarity measures (e.g., cosine),
since there are often few, if any, terms in common between two short text
snippets. We address this problem by introducing a novel method for measuring
the similarity between short text snippets (even those without any overlapping
terms) by leveraging web search results to provide greater context for the
short texts. In this paper, we define such a similarity kernel function,
mathematically analyze some of its properties, and provide examples of its
efficacy. We also show the use of this kernel function in a large-scale system
for suggesting related queries to search engine users. Keywords: information retrieval, kernel functions, query suggestion, text similarity
measures, web search | |||
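To make the idea concrete, the sketch below represents each short snippet by the TF-IDF centroid of the documents returned when the snippet is issued as a query, and takes the cosine of the two centroids as the kernel value. The retrieval step is mocked with a tiny in-memory word-overlap "search engine", and the corpus and function names are assumptions; the paper's kernel is built from real web search results.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mock_search(snippet, corpus, k=3):
    """Stand-in for issuing the snippet as a web query: rank corpus documents
    by word overlap with the snippet and return the top k texts."""
    words = set(snippet.lower().split())
    scored = sorted(corpus, key=lambda d: len(words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def web_kernel(s1, s2, corpus):
    """Similarity of two short snippets via the cosine of the TF-IDF centroids
    of their (mock) search results, rather than of the snippets themselves."""
    vec = TfidfVectorizer().fit(corpus)
    c1 = np.asarray(vec.transform(mock_search(s1, corpus)).mean(axis=0))
    c2 = np.asarray(vec.transform(mock_search(s2, corpus)).mean(axis=0))
    return float(cosine_similarity(c1, c2)[0, 0])

corpus = ["support vector machines are a machine learning method",
          "kernel methods and support vector machines in learning",
          "the soccer world cup final was played on sunday",
          "machine learning methods such as svm and kernels"]
print(web_kernel("svm", "support vector machine", corpus))
```

Even though "svm" and "support vector machine" share no terms, their result sets overlap, so the kernel value is high, which is the effect the abstract describes.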
| Generating query substitutions | | BIBAK | Full-Text | 387-396 | |
| Rosie Jones; Benjamin Rey; Omid Madani; Wiley Greiner | |||
| We introduce the notion of query substitution, that is, generating a new
query to replace a user's original search query. Our technique uses
modifications based on typical substitutions web searchers make to their
queries. In this way the new query is strongly related to the original query,
containing terms closely related to all of the original terms. This contrasts
with query expansion through pseudo-relevance feedback, which is costly and can
lead to query drift. This also contrasts with query relaxation through boolean
or TFIDF retrieval, which reduces the specificity of the query. We define a
scale for evaluating query substitution, and show that our method performs well
at generating new queries related to the original queries. We build a model for
selecting between candidates, by using a number of features relating the
query-candidate pair, and by fitting the model to human judgments of relevance
of query suggestions. This further improves the quality of the candidates
generated. Experiments show that our techniques significantly increase coverage
and effectiveness in the setting of sponsored search. Keywords: paraphrasing, query rewriting, query substitution, sponsored search | |||
| POLYPHONET: an advanced social network extraction system from the web | | BIBAK | Full-Text | 397-406 | |
| Yutaka Matsuo; Junichiro Mori; Masahiro Hamasaki; Keisuke Ishida; Takuichi Nishimura; Hideaki Takeda; Koiti Hasida; Mitsuru Ishizuka | |||
| Social networks play important roles in the Semantic Web: knowledge
management, information retrieval, ubiquitous computing, and so on. We propose
a social network extraction system called POLYPHONET, which employs several
advanced techniques to extract relations of persons, detect groups of persons,
and obtain keywords for a person. Search engines, especially Google, are used
to measure co-occurrence of information and obtain Web documents.
Several studies have used search engines to extract social networks from the Web, but our research advances the following points: First, we reduce the related methods into simple pseudocodes using Google so that we can build up integrated systems. Second, we develop several new algorithms for social networking mining such as those to classify relations into categories, to make extraction scalable, and to obtain and utilize person-to-word relations. Third, every module is implemented in POLYPHONET, which has been used at four academic conferences, each with more than 500 participants. We overview that system. Finally, a novel architecture called Super Social Network Mining is proposed; it utilizes simple modules using Google and is characterized by scalability and Relate-Identify processes: Identification of each entity and extraction of relations are repeated to obtain a more precise social network. Keywords: search engine, social network, web mining | |||
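A hedged sketch of the co-occurrence idea underlying this line of work: query a search engine for each name and for each pair of names, then weight a tie by comparing the pair's hit count with the individual counts (here, an overlap coefficient). The hit counts are mocked with a dictionary, and the threshold and names are illustrative assumptions rather than POLYPHONET's actual measures.

```python
from itertools import combinations

# Mock hit counts standing in for search-engine queries such as
# '"alice"' and '"alice" "bob"'; in practice these come from a search API.
HITS = {("alice",): 9000, ("bob",): 4000, ("carol",): 12000,
        ("alice", "bob"): 850, ("alice", "carol"): 30, ("bob", "carol"): 25}

def hit_count(*names):
    return HITS.get(tuple(sorted(names)), 0)

def tie_strength(a, b):
    """Overlap coefficient estimated from hit counts: co-occurrences divided
    by the rarer name's individual count."""
    both = hit_count(a, b)
    smaller = min(hit_count(a), hit_count(b))
    return both / smaller if smaller else 0.0

def extract_network(people, threshold=0.05):
    """Keep an edge wherever the co-occurrence-based strength clears a threshold."""
    return [(a, b, round(tie_strength(a, b), 3))
            for a, b in combinations(people, 2) if tie_strength(a, b) >= threshold]

print(extract_network(["alice", "bob", "carol"]))
```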
| Semantic analytics on social networks: experiences in addressing the problem of conflict of interest detection | | BIBAK | Full-Text | 407-416 | |
| Boanerges Aleman-Meza; Meenakshi Nagarajan; Cartic Ramakrishnan; Li Ding; Pranam Kolari; Amit P. Sheth; I. Budak Arpinar; Anupam Joshi; Tim Finin | |||
| In this paper, we describe a Semantic Web application that detects Conflict
of Interest (COI) relationships among potential reviewers and authors of
scientific papers. This application discovers various 'semantic associations'
between the reviewers and authors in a populated ontology to determine a degree
of Conflict of Interest. This ontology was created by integrating entities and
relationships from two social networks, namely "knows," from a FOAF
(Friend-of-a-Friend) social network and "co-author," from the underlying
co-authorship network of the DBLP bibliography. We describe our experiences
developing this application in the context of a class of Semantic Web
applications, which have important research and engineering challenges in
common. In addition, we present an evaluation of our approach for real-life COI
detection. Keywords: RDF, conflict of interest, data fusion, entity disambiguation, ontologies,
peer review process, semantic analytics, semantic associations, semantic web,
social networks | |||
| Exploring social annotations for the semantic web | | BIBAK | Full-Text | 417-426 | |
| Xian Wu; Lei Zhang; Yong Yu | |||
| In order to obtain a machine understandable semantics for web resources,
research on the Semantic Web tries to annotate web resources with concepts and
relations from explicitly defined formal ontologies. This kind of formal
annotation is usually done manually or semi-automatically. In this paper, we
explore a complementary approach that focuses on the "social annotations of the
web", which are annotations manually made by normal web users without a
pre-defined formal ontology. Compared to the formal annotations, although
social annotations are coarse-grained, informal and vague, they are also more
accessible to more people and better reflect the web resources' meaning from
the users' points of view during their actual usage of the web resources. Using
a social bookmark service as an example, we show how emergent semantics [2] can
be statistically derived from the social annotations. Furthermore, we apply the
derived emergent semantics to discover and search shared web bookmarks. The
initial evaluation of our implementation shows that our method can effectively
discover semantically related web bookmarks that current social bookmark
services cannot easily discover. Keywords: emergent semantics, semantic web, social annotation, social bookmarks | |||
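One way to picture how semantics might emerge statistically from social annotations, as described above, is to compare tags by the URLs they are applied to; the bookmark triples and the cosine measure below are illustrative assumptions rather than the paper's exact construction.

```python
import math
from collections import Counter, defaultdict

# Hypothetical (user, url, tag) triples from a social bookmark service.
bookmarks = [
    ("u1", "example.org/python-tutorial", "programming"),
    ("u2", "example.org/python-tutorial", "python"),
    ("u1", "example.org/flask-docs", "python"),
    ("u3", "example.org/flask-docs", "programming"),
    ("u2", "example.org/louvre", "travel"),
]

tag_vectors = defaultdict(Counter)          # tag -> counts of urls it annotates
for _user, url, tag in bookmarks:
    tag_vectors[tag][url] += 1

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Tags applied to the same pages come out as semantically close.
print(cosine(tag_vectors["programming"], tag_vectors["python"]))
```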
| Relaxed: on the way towards true validation of compound documents | | BIBAK | Full-Text | 427-436 | |
| Jirka Kosek; Petr Nálevka | |||
| To maintain interoperability in the Web environment it is necessary to
comply with Web standards. Current specifications of HTML and XHTML languages
define conformance conditions both in specification prose and in a formalized
way utilizing DTD. Unfortunately, DTD is a very limited schema language and
cannot express many constraints that are specified in the free-text parts of the
specification. This means that a page which validates against DTD is not
necessarily conforming to the specification. In this article we analyze
features of modern schema languages that can improve validation of Web pages by
covering more (X)HTML language constraints than DTD. Our schemas use a
combination of RELAX NG and Schematron to check not only the structure of the
Web pages, but also datatypes of attributes and elements, more complex
relations between elements and some WCAG checkpoints. A modular approach for
schema composition is presented together with usage examples, including sample
schemas for various compound documents (e.g. XHTML combined with MathML and
SVG). The second part of this article describes the Relaxed validator
application we have developed. Relaxed is an extensible and powerful validation
engine offering a convenient Web interface, a Web-service API, a Java API and a
command-line interface. Combined with our RELAX NG + Schematron schemas,
Relaxed offers very valuable validation results that surpass the W3C validator in
many aspects. Keywords: RELAX NG, Schematron, XHTML, XML, compound documents, validation | |||
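For readers who want to experiment with the layered validation idea outside the Relaxed engine itself, a minimal sketch using lxml's RELAX NG support is shown below; the file names are placeholders, and the Schematron layer and the authors' own schemas are not included.

```python
from lxml import etree

# Placeholder file names: any RELAX NG schema and XML/XHTML document will do.
schema = etree.RelaxNG(etree.parse("xhtml.rng"))
document = etree.parse("page.xhtml")

if schema.validate(document):
    print("document satisfies the RELAX NG schema")
else:
    # The error log lists violated constraints, including ones a DTD could not express.
    for error in schema.error_log:
        print(error.line, error.message)
```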
| Model-based version and configuration management for a web engineering lifecycle | | BIBAK | Full-Text | 437-446 | |
| Tien N. Nguyen | |||
| During a lifecycle of a large-scale Web application, Web developers produce
a wide variety of inter-related Web objects. Following good Web engineering
practice, developers often create them based on a Web application development
method, which requires certain logical models for the development and
maintenance process. Web development is dynamic; thus, those logical models as
well as Web artifacts evolve over time. However, the task of managing their
evolution is still very inefficient because design decisions in models are not
directly accessible in existing file-based software configuration management
repositories. Key limitations of existing Web version control tools include
their inadequacy in representing semantics of design models and inability to
manage the evolution of model-based objects and their logical connections to
Web documents. This paper presents a framework that allows developers to manage
versions and configurations of models and to capture changes to model-to-model
relations among Web objects. Model-based objects, Web documents, and relations
are directly represented and versioned in a structure-oriented manner. Keywords: model-based configuration management, versioned hypermedia, web engineering | |||
| Model-directed web transactions under constrained modalities | | BIBAK | Full-Text | 447-456 | |
| Zan Sun; Jalal Mahmud; Saikat Mukherjee; I. V. Ramakrishnan | |||
| Online transactions (e.g., buying a book on the Web) typically involve a
number of steps spanning several pages. Conducting such transactions under
constrained interaction modalities as exemplified by small screen handhelds or
interactive speech interfaces -- the primary mode of communication for visually
impaired individuals -- is a strenuous, fatigue-inducing activity. But usually
one needs to browse only a small fragment of a Web page to perform a
transactional step such as a form fillout, selecting an item from a search
results list, etc. We exploit this observation to develop an automata-based
process model that delivers only the "relevant" page fragments at each
transactional step, thereby reducing information overload on such narrow
interaction bandwidths. We realize this model by coupling techniques from
content analysis of Web documents, automata learning and statistical
classification. The process model and associated techniques have been
incorporated into Guide-O, a prototype system that facilitates online
transactions using speech/keyboard interface (Guide-O-Speech), or with
limited-display size handhelds (Guide-O-Mobile). Performance of Guide-O and its
user experience are reported. Keywords: assistive device, content adaption, web transaction | |||
| Retroactive answering of search queries | | BIBAK | Full-Text | 457-466 | |
| Beverly Yang; Glen Jeh | |||
| Major search engines currently use the history of a user's actions (e.g.,
queries, clicks) to personalize search results. In this paper, we present a new
personalized service, query-specific web recommendations (QSRs), that
retroactively answers queries from a user's history as new results arise. The
QSR system addresses two important subproblems with applications beyond the
system itself: (1) Automatic identification of queries in a user's history that
represent standing interests and unfulfilled needs. (2) Effective detection of
interesting new results to these queries. We develop a variety of heuristics
and algorithms to address these problems, and evaluate them through a study of
Google history users. Our results strongly motivate the need for automatic
detection of standing interests from a user's history, and identify the
algorithms that are most useful in doing so. Our results also identify the
algorithms, some of which are counter-intuitive, that are most useful in
identifying interesting new results for past queries, allowing us to achieve
very high precision over our data set. Keywords: automatic identification of user intent, personalized search,
recommendations | |||
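A toy heuristic in the spirit of subproblem (1) above: score each past query for standing interest by how often and how recently it recurs in the history. The decay constant and the signals used are invented for illustration and are not the paper's learned heuristics.

```python
import time
from collections import defaultdict

def standing_interest_scores(history, now=None, half_life_days=30.0):
    """Score past queries by repetition and recency.

    `history` is a list of (query, unix_timestamp) pairs; repetition and
    recency stand in for the richer signals (clicks, refinements) studied
    in the paper.
    """
    now = now or time.time()
    scores = defaultdict(float)
    for query, ts in history:
        age_days = (now - ts) / 86400.0
        scores[query] += 0.5 ** (age_days / half_life_days)   # exponential decay
    return sorted(scores.items(), key=lambda kv: -kv[1])

history = [("marathon training plan", time.time() - 86400 * d) for d in (1, 7, 20)]
history.append(("weather tomorrow", time.time() - 86400 * 45))
print(standing_interest_scores(history))
```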
| CWS: a comparative web search system | | BIBAK | Full-Text | 467-476 | |
| Jian-Tao Sun; Xuanhui Wang; Dou Shen; Hua-Jun Zeng; Zheng Chen | |||
| In this paper, we define and study a novel search problem: Comparative Web
Search (CWS). The task of CWS is to seek relevant and comparative information
from the Web to help users conduct comparisons among a set of topics. A system
called CWS is developed to effectively support Web users' comparison needs.
Given a set of queries, which represent the topics that a user wants to
compare, the system is characterized by: (1) automatic retrieval and ranking of
Web pages by incorporating both their relevance to the queries and the
comparative contents they contain; (2) automatic clustering of the comparative
contents into semantically meaningful themes; (3) extraction of representative
keyphrases to summarize the commonness and differences of the comparative
contents in each theme. We developed a novel interface which supports two types
of view modes: a pair-view which displays the result in the page level, and a
cluster-view which organizes the comparative pages into the themes and displays
the extracted phrases to facilitate users' comparisons. Experimental results show
that the CWS system is effective and efficient. Keywords: clustering, comparative web search, keyphrase extraction, search engine | |||
| Searching with context | | BIBAK | Full-Text | 477-486 | |
| Reiner Kraft; Chi Chao Chang; Farzin Maghoul; Ravi Kumar | |||
| Contextual search refers to proactively capturing the information need of a
user by automatically augmenting the user query with information extracted from
the search context; for example, by using terms from the web page the user is
currently browsing or a file the user is currently editing.
We present three different algorithms to implement contextual search for the Web. The first, query rewriting (QR), augments each query with appropriate terms from the search context and uses an off-the-shelf web search engine to answer this augmented query. The second, rank-biasing (RB), generates a representation of the context and answers queries using a custom-built search engine that exploits this representation. The third, iterative filtering meta-search (IFM), generates multiple subqueries based on the user query and appropriate terms from the search context, uses an off-the-shelf search engine to answer these subqueries, and re-ranks the results of the subqueries using rank aggregation methods. We extensively evaluate the three methods using 200 contexts and over 24,000 human relevance judgments of search results. We show that while QR works surprisingly well, the relevance and recall can be improved using RB and substantially more using IFM. Thus, QR, RB, and IFM represent a cost-effective design spectrum for contextual search. Keywords: contextual search, meta-search, rank aggregation, specialized search
engines, web search | |||
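A bare-bones sketch of the query-rewriting (QR) flavour of contextual search: take the highest tf.idf terms from the page the user is reading and append them to the query before handing it to an ordinary search engine. The document frequencies are invented, and the paper's term selection is considerably more careful.

```python
import math
import re
from collections import Counter

N = 1000                                         # hypothetical corpus size
doc_freq = {"jaguar": 40, "car": 300, "engine": 250, "speed": 280, "the": 990}

def rewrite_query(query, context_text, k=2):
    """Augment `query` with the top-k tf.idf terms from the current context page."""
    terms = re.findall(r"[a-z]+", context_text.lower())
    tf = Counter(terms)
    scored = {
        t: tf[t] * math.log(N / doc_freq.get(t, N))   # unseen terms get zero idf here
        for t in tf if t not in query.lower()
    }
    extra = sorted(scored, key=scored.get, reverse=True)[:k]
    return query + " " + " ".join(extra)

print(rewrite_query("top speed", "The Jaguar car has a powerful engine ..."))
```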
| Keynote talk | | BIBA | Full-Text | 487 | |
| Richard Granger | |||
| Richard Granger will provide an update on the deployment of information technology at a national scale in the NHS in England. Particular topics to be covered include variability in the performance of user organizations and suppliers; access/channel strategies for NHS users and members of the public; and take-up rates for new technologies, including internet adoption. Data on the number of users and transactions to date will also be provided. | |||
| Broken links on the web: local laws and the global free flow of information | | BIBA | Full-Text | 489 | |
| Daniel Weitzner | |||
| Across the World Wide Web there is government censorship and monitoring of political messages and "morally-corrupting" material. Google have been in the news recently for capitulating to the Chinese government's demands to ban certain kinds of content, and also for refusing to pass logs of browsing habits to the US government (while Microsoft and Yahoo complied with the request). How can the Web survive as a unified, global information environment in the face of government censorship? Can governments and the private sector come to an agreement on international legal standards for the free flow of information and privacy? | |||
| Position paper: ontology construction from online ontologies | | BIBAK | Full-Text | 491-495 | |
| Harith Alani | |||
| One of the main hurdles towards a wide endorsement of ontologies is the high
cost of constructing them. Reuse of existing ontologies offers a much cheaper
alternative than building new ones from scratch, yet tools to support such
reuse are still in their infancy. However, more ontologies are becoming
available on the web, and online libraries for storing and indexing ontologies
are increasing in number and demand. Search engines have also started to
appear, to facilitate search and retrieval of online ontologies. This paper
presents a fresh view on constructing ontologies automatically, by identifying,
ranking, and merging fragments of online ontologies. Keywords: automatic ontology construction, ontology reuse | |||
| Position paper: towards the notion of gloss, and the adoption of linguistic resources in formal ontology engineering | | BIBAK | Full-Text | 497-503 | |
| Mustafa Jarrar | |||
| In this paper, we first introduce the notion of gloss for ontology
engineering purposes. We propose that each vocabulary in an ontology should
have a gloss. A gloss is basically an informal description of the meaning of a
vocabulary that is supposed to render factual knowledge critical to
understanding a concept, but that is unreasonable or very difficult to
formalize and/or articulate formally. We present a set of guidelines on what
should and should not be provided in a gloss. Second, we propose to incorporate
linguistic resources in the ontology engineering process. We clarify the
importance of using lexical resources as a "consensus reference" in ontology
engineering, and so enabling the adoption of the glosses found in these
resources. A linguistic resource (i.e. its list of terms and their definitions)
shall be seen as a shared vocabulary space for ontologies. We present an
ontology engineering software tool (called DogmaModeler), and illustrate its
support for reusing WordNet's terms and glosses in ontology modeling. Keywords: dogmamodeler, formal ontology engineering, gloss, lexical semantics,
ontologies and wordnet, ontology, wordnet | |||
| Bootstrapping semantics on the web: meaning elicitation from schemas | | BIBAK | Full-Text | 505-512 | |
| Paolo Bouquet; Luciano Serafini; Stefano Zanobini; Simone Sceffer | |||
| In most web sites, web-based applications (such as web portals,
e-marketplaces, search engines), and in the file systems of personal computers,
a wide variety of schemas (such as taxonomies, directory trees, thesauri,
Entity-Relationship schemas, RDF Schemas) are published which (i) convey a
clear meaning to humans (e.g. help in the navigation of large collections of
documents), but (ii) convey only a small fraction (if any) of their meaning to
machines, as their intended meaning is not formally/explicitly represented. In
this paper we present a general methodology for automatically eliciting and
representing the intended meaning of these structures, and for making this
meaning available in domains like information integration and interoperability,
web service discovery and composition, peer-to-peer knowledge management, and
semantic browsers. We also present an implementation (called CtxMatch2) of how
such a method can be used for semantic interoperability. Keywords: meaning elicitation, schema matching, semantic web | |||
| Designing ethical phishing experiments: a study of (ROT13) rOnl query features | | BIBAK | Full-Text | 513-522 | |
| Markus Jakobsson; Jacob Ratkiewicz | |||
| We study how to design experiments to measure the success rates of phishing
attacks that are both ethical and accurate, two requirements that pull in
contradictory directions. Namely, an ethical experiment must not expose the
participants to any risk; it should be possible for the participants, or
representatives thereof, to verify locally that this was the case. At the same
time, an experiment is accurate if it is possible to argue why its success rate
is not an upper or lower bound of that of a real attack -- this may be
difficult if the ethics considerations make the user perception of the
experiment different from the user perception of the attack. We introduce
several experimental techniques allowing us to achieve a balance between these
two requirements, and demonstrate how to apply these, using a context aware
phishing experiment on a popular online auction site which we call "rOnl". Our
experiments exhibit a measured average yield of 11% per collection of unique
users. This study was authorized by the Human Subjects Committee at Indiana
University (Study #05-10306). Keywords: accurate, ethical, experiment, phishing, security | |||
| Invasive browser sniffing and countermeasures | | BIBAK | Full-Text | 523-532 | |
| Markus Jakobsson; Sid Stamm | |||
| We describe the detrimental effects of browser cache/history sniffing in the
context of phishing attacks, and detail an approach that neutralizes the threat
by means of URL personalization; we report on an implementation performing such
personalization on the fly, and analyze the costs and security properties of
our proposed solution. Keywords: browser cache, cascading style sheets, personalization, phishing, sniffing | |||
| A probabilistic approach to spatiotemporal theme pattern mining on weblogs | | BIBAK | Full-Text | 533-542 | |
| Qiaozhu Mei; Chao Liu; Hang Su; ChengXiang Zhai | |||
| Mining subtopics from weblogs and analyzing their spatiotemporal patterns
have applications in multiple domains. In this paper, we define the novel
problem of mining spatiotemporal theme patterns from weblogs and propose a
novel probabilistic approach to model the subtopic themes and spatiotemporal
theme patterns simultaneously. The proposed model discovers spatiotemporal
theme patterns by (1) extracting common themes from weblogs; (2) generating
theme life cycles for each given location; and (3) generating theme snapshots
for each given time period. Evolution of patterns can be discovered by
comparative analysis of theme life cycles and theme snapshots. Experiments on
three different data sets show that the proposed approach can discover
interesting spatiotemporal theme patterns effectively. The proposed
probabilistic model is general and can be used for spatiotemporal text mining
on any domain with time and location information. Keywords: mixture model, spatiotemporal text mining, theme pattern, weblog | |||
| Time-dependent semantic similarity measure of queries using historical click-through data | | BIBAK | Full-Text | 543-552 | |
| Qiankun Zhao; Steven C. H. Hoi; Tie-Yan Liu; Sourav S. Bhowmick; Michael R. Lyu; Wei-Ying Ma | |||
| It has become a promising direction to measure similarity of Web search
queries by mining the increasing amount of click-through data logged by Web
search engines, which record the interactions between users and the search
engines. Most existing approaches employ the click-through data for similarity
measure of queries with little consideration of the temporal factor, while the
click-through data is often dynamic and contains rich temporal information. In
this paper we present a new framework for a time-dependent query semantic
similarity model that exploits the temporal characteristics of historical
click-through data. The intuition is that more accurate semantic similarity
values between queries can be obtained by taking into account the timestamps of
the log data. With a set of user-defined calendar schema and calendar patterns,
our time-dependent query similarity model is constructed using the marginalized
kernel technique, which can exploit both explicit similarity and implicit
semantics from the click-through data effectively. Experimental results on a
large set of click-through data acquired from a commercial search engine show
that our time-dependent query similarity model is more accurate than the
existing approaches. Moreover, we observe that our time-dependent query
similarity model can, to some extent, reflect real-world semantics such as
real-world events that are happening over time. Keywords: click-through data, event detection, evolution pattern, marginalized kernel,
semantic similarity measure | |||
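A simplified illustration of why timestamps matter: represent each query by a per-time-bucket vector of clicked URLs and compare queries bucket by bucket instead of pooling all clicks. The click records and monthly bucketing are invented, and the paper's marginalized-kernel construction is not reproduced.

```python
import math
from collections import Counter, defaultdict

# Hypothetical click-through records: (query, clicked_url, month bucket).
clicks = [
    ("world cup", "fifa.example", "2006-06"),
    ("world cup", "fifa.example", "2006-06"),
    ("football",  "fifa.example", "2006-06"),
    ("world cup", "history.example/1998", "2005-11"),
    ("football",  "nfl.example", "2005-11"),
]

vectors = defaultdict(lambda: defaultdict(Counter))   # bucket -> query -> url counts
for query, url, bucket in clicks:
    vectors[bucket][query][url] += 1

def cosine(a, b):
    dot = sum(a[u] * b[u] for u in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def time_dependent_similarity(q1, q2):
    """Average per-bucket cosine similarity of the two queries' click vectors."""
    buckets = [b for b in vectors if q1 in vectors[b] and q2 in vectors[b]]
    if not buckets:
        return 0.0
    return sum(cosine(vectors[b][q1], vectors[b][q2]) for b in buckets) / len(buckets)

# High similarity during the 2006 tournament, low outside it; pooling would blur this.
print(time_dependent_similarity("world cup", "football"))
```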
| Interactive wrapper generation with minimal user effort | | BIBAK | Full-Text | 553-563 | |
| Utku Irmak; Torsten Suel | |||
| While much of the data on the web is unstructured in nature, there is also a
significant amount of embedded structured data, such as product information on
e-commerce sites or stock data on financial sites. A large amount of research
has focused on the problem of generating wrappers, i.e., software tools that
allow easy and robust extraction of structured data from text and HTML sources.
In many applications, such as comparison shopping, data has to be extracted
from many different sources, making manual coding of a wrapper for each source
impractical. On the other hand, fully automatic approaches are often not
reliable enough, resulting in low quality of the extracted data.
We describe a complete system for semi-automatic wrapper generation that can be trained on different data sources in a simple interactive manner. Our goal is to minimize the amount of user effort for training reliable wrappers through design of a suitable training interface that is implemented based on a powerful underlying extraction language and a set of training and ranking algorithms. Our experiments show that our system achieves reliable extraction with a very small amount of user effort. Keywords: active learning, data extraction, wrapper generation | |||
| Towards content trust of web resources | | BIBAK | Full-Text | 565-574 | |
| Yolanda Gil; Donovan Artz | |||
| Trust is an integral part of the Semantic Web architecture. While most prior
work focuses on entity-centered issues such as authentication and reputation,
it does not model the content, i.e. the nature and use of the information being
exchanged. This paper discusses content trust as an aggregate of other trust
measures that have been previously studied. The paper introduces several
factors that users consider in deciding whether to trust the content provided
by a Web resource. Many of these factors are hard to capture in practice, since
they would require a large amount of user input. Our goal is to discern which
of these factors could be captured in practice with minimal user interaction in
order to maximize the system's trust estimates. The paper also describes a
simulation environment that we have designed to study alternative models of
content trust. Keywords: semantic web, trust, web of trust | |||
| Supporting online problem-solving communities with the semantic web | | BIBAK | Full-Text | 575-584 | |
| Anupriya Ankolekar; Katia Sycara; James Herbsleb; Robert Kraut; Chris Welty | |||
| The Web plays a critical role in hosting Web communities, their content and
interactions. A prime example is the open source software (OSS) community,
whose members, including software developers and users, interact almost
exclusively over the Web, constantly generating, sharing and refining content
in the form of software code through active interaction over the Web on code
design and bug resolution processes. The Semantic Web is an envisaged extension
of the current Web, in which content is given a well-defined meaning, through
the specification of metadata and ontologies, increasing the utility of the
content and enabling information from heterogeneous sources to be integrated.
We developed a prototype Semantic Web system for OSS communities, Dhruv. Dhruv
provides an enhanced semantic interface to bug resolution messages and
recommends related software objects and artifacts. Dhruv uses an integrated
model of the OpenACS community, the software, and the Web interactions, which
is semi-automatically populated from the existing artifacts of the community. Keywords: computer-supported cooperative work, human-computer interaction, open source
software communities, semantic web applications | |||
| Semantic Wikipedia | | BIBAK | Full-Text | 585-594 | |
| Max Völkel; Markus Krötzsch; Denny Vrandecic; Heiko Haller; Rudi Studer | |||
| Wikipedia is the world's largest collaboratively edited source of
encyclopaedic knowledge. But in spite of its utility, its contents are barely
machine-interpretable. Structural knowledge, e.g., about how concepts are
interrelated, can neither be formally stated nor automatically processed. Also
the wealth of numerical data is only available as plain text and thus cannot
be processed according to its actual meaning.
We provide an extension to be integrated into Wikipedia that allows the typing of links between articles and the specification of typed data inside the articles in an easy-to-use manner. By enabling even casual users to participate in the creation of an open semantic knowledge base, Wikipedia has the chance to become a resource of semantic statements of hitherto unknown size, scope, openness, and internationalisation. These semantic enhancements bring to Wikipedia the benefits of today's semantic technologies: more specific ways of searching and browsing. Also, the RDF export, which gives direct access to the formalised knowledge, opens Wikipedia up to a wide range of external applications that will be able to use it as a background knowledge base. In this paper, we present the design, implementation, and possible uses of this extension. Keywords: RDF, Semantic Web, Wiki, Wikipedia | |||
| Dynamic placement for clustered web applications | | BIBAK | Full-Text | 595-604 | |
| A. Karve; T. Kimbrel; G. Pacifici; M. Spreitzer; M. Steinder; M. Sviridenko; A. Tantawi | |||
| We introduce and evaluate a middleware clustering technology capable of
allocating resources to web applications through dynamic application instance
placement. We define application instance placement as the problem of placing
application instances on a given set of server machines to adjust the amount of
resources available to applications in response to varying resource demands of
application clusters. The objective is to maximize the amount of demand that
may be satisfied using a configured placement. To limit the disturbance to the
system caused by starting and stopping application instances, the placement
algorithm attempts to minimize the number of placement changes. It also strives
to keep resource utilization balanced across all server machines. Two types of
resources are managed, one load-dependent and one load-independent. When
putting the chosen placement into effect, our controller schedules placement
changes in a manner that limits the disruption to the system. Keywords: dynamic application placement, performance management | |||
| Selective early request termination for busy internet services | | BIBAK | Full-Text | 605-614 | |
| Jingyu Zhou; Tao Yang | |||
| Internet traffic is bursty and network servers are often overloaded with
surprising events or abnormal client request patterns. This paper studies a
load shedding mechanism called selective early request termination (SERT) for
network services that use threads to handle multiple incoming requests
continuously and concurrently. Our investigation with applications from Ask.com
shows that during overloaded situations, a relatively small percentage of long
requests that require excessive computing resources can dramatically affect
other short requests and reduce the overall system throughput. By actively
detecting and aborting overdue long requests, services can perform
significantly better to achieve QoS objectives compared to a purely admission
based approach. We have proposed a termination scheme that monitors the running
time of requests, accounts for their resource usage, adaptively adjusts the
selection threshold, and performs a safe termination for a class of requests.
This paper presents the design and implementation of this scheme and describes
experimental results to validate the proposed approach. Keywords: internet services, load shedding, request termination | |||
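A schematic of the selective-termination idea: track each request's elapsed time, tighten the cutoff as load grows, and only cancel requests from a class that is safe to terminate. The request tuples and load factor are invented; this is not Ask.com's implementation.

```python
# Hypothetical in-flight requests: (request_id, elapsed_seconds, safe_to_terminate).
in_flight = [
    ("r1", 0.2, True),
    ("r2", 4.8, True),
    ("r3", 6.1, False),   # long, but not in a class that may be aborted
    ("r4", 0.4, True),
]

def select_overdue(requests, load_factor, base_threshold_s=2.0):
    """Pick terminable requests whose running time exceeds a load-adapted cutoff."""
    threshold = base_threshold_s / max(load_factor, 1.0)   # busier system, stricter cutoff
    return [rid for rid, elapsed, safe in requests if safe and elapsed > threshold]

# Under 2x overload the cutoff drops to 1.0s, so only r2 is selected for termination.
print(select_overdue(in_flight, load_factor=2.0))
```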
| SCTP: an innovative transport layer protocol for the web | | BIBAK | Full-Text | 615-624 | |
| Preethi Natarajan; Janardhan R. Iyengar; Paul D. Amer; Randall Stewart | |||
| We propose using the Stream Control Transmission Protocol (SCTP), a recent
IETF transport layer protocol, for reliable web transport. Although TCP has
traditionally been used, we argue that SCTP better matches the needs of
HTTP-based network applications. This position paper discusses SCTP features
that address: (i) head-of-line blocking within a single TCP connection, (ii)
vulnerability to network failures, and (iii) vulnerability to denial-of-service
SYN attacks. We discuss our experience in modifying the Apache server and the
Firefox browser to benefit from SCTP, and demonstrate our HTTP over SCTP design
via simple experiments. We also discuss the benefits of using SCTP in other web
domains through two example scenarios -- multiplexing user requests, and
multiplexing resource access. Finally, we highlight several SCTP features that
will be valuable to the design and implementation of current HTTP-based
client-server applications. Keywords: SCTP, fault-tolerance, head-of-line blocking, stream control transmission
protocol, transport layer service, web applications, web transport | |||
| Improved annotation of the blogosphere via autotagging and hierarchical clustering | | BIBAK | Full-Text | 625-632 | |
| Christopher H. Brooks; Nancy Montanez | |||
| Tags have recently become popular as a means of annotating and organizing
Web pages and blog entries. Advocates of tagging argue that the use of tags
produces a 'folksonomy', a system in which the meaning of a tag is determined
by its use among the community as a whole. We analyze the effectiveness of tags
for classifying blog entries by gathering the top 350 tags from Technorati and
measuring the similarity of all articles that share a tag. We find that tags
are useful for grouping articles into broad categories, but less effective in
indicating the particular content of an article. We then show that
automatically extracting words deemed to be highly relevant can produce a more
focused categorization of articles. We also show that clustering algorithms can
be used to reconstruct a topical hierarchy among tags, and suggest that these
approaches may be used to address some of the weaknesses in current tagging
systems. Keywords: automated annotation, blogs, hierarchical clustering, tagging | |||
| Large-scale text categorization by batch mode active learning | | BIBAK | Full-Text | 633-642 | |
| Steven C. H. Hoi; Rong Jin; Michael R. Lyu | |||
| Large-scale text categorization is an important research topic for Web data
mining. One of the challenges in large-scale text categorization is how to
reduce the human efforts in labeling text documents for building reliable
classification models. In the past, there have been many studies on applying
active learning methods to automatic text categorization, which try to select
the most informative documents for labeling manually. Most of these studies
focused on selecting a single unlabeled document in each iteration. As a
result, the text categorization model has to be retrained after each labeled
document is solicited. In this paper, we present a novel active learning
algorithm that selects a batch of text documents for labeling manually in each
iteration. The key to batch mode active learning is how to reduce the
redundancy among the selected examples such that each example provides unique
information for model updating. To this end, we use the Fisher information
matrix as the measurement of model uncertainty and choose the set of documents
to effectively maximize the Fisher information of a classification model.
Extensive experiments with three different datasets have shown that our
algorithm is more effective than the state-of-the-art active learning
techniques for text categorization and can be a promising tool toward
large-scale text categorization for World Wide Web documents. Keywords: Fisher information, active learning, convex optimization, logistic
regression, text categorization | |||
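A simplified greedy batch selector in the spirit of the idea above: favour documents the current classifier is least certain about while penalizing similarity to documents already chosen, so the batch is non-redundant. This uncertainty-plus-diversity surrogate stands in for, and is not equivalent to, the paper's Fisher-information optimization.

```python
import numpy as np

def select_batch(X_unlabeled, probs, batch_size=3, diversity_weight=0.5):
    """Greedily pick a batch of documents to label.

    X_unlabeled: (n, d) row-normalized tf-idf matrix of unlabeled documents.
    probs: the current model's P(positive) for each document, e.g. from
    logistic regression.
    """
    uncertainty = 1.0 - 2.0 * np.abs(probs - 0.5)    # 1 at p=0.5, 0 at p in {0, 1}
    chosen = []
    for _ in range(batch_size):
        penalty = np.zeros(len(probs))
        if chosen:
            sims = X_unlabeled @ X_unlabeled[chosen].T    # cosine, rows are unit-norm
            penalty = sims.max(axis=1)
        scores = uncertainty - diversity_weight * penalty
        scores[chosen] = -np.inf                          # never pick a document twice
        chosen.append(int(np.argmax(scores)))
    return chosen

rng = np.random.default_rng(0)
X = rng.random((10, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)
print(select_batch(X, probs=rng.random(10)))
```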
| A comparison of implicit and explicit links for web page classification | | BIBAK | Full-Text | 643-650 | |
| Dou Shen; Jian-Tao Sun; Qiang Yang; Zheng Chen | |||
| It is well known that Web-page classification can be enhanced by using
hyperlinks that provide linkages between Web pages. However, in the Web space,
hyperlinks are usually sparse, noisy and thus in many situations can only
provide limited help in classification. In this paper, we extend the concept of
linkages from explicit hyperlinks to implicit links built between Web pages. By
observing that people who search the Web with the same queries often click on
different, but related documents together, we draw implicit links between Web
pages that are clicked after the same queries. Those pages are implicitly
linked. We provide an approach for automatically building the implicit links
between Web pages using Web query logs, together with a thorough comparison
between the uses of implicit and explicit links in Web page classification. Our
experimental results on a large dataset confirm that the use of the implicit
links is better than using explicit links in classification performance, with
an increase of more than 10.5% in terms of the Macro-F1 measurement. Keywords: explicit link, implicit link, query log, virtual document, web page
classification | |||
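A minimal sketch of building implicit links from a query log as described above: two pages become implicitly linked when they are clicked for the same query, with the co-click count as the link weight. The log is invented, and the paper's noise handling and classification setup are not shown.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical query log: query -> pages clicked for that query.
query_clicks = {
    "jaguar speed": ["wiki/Jaguar", "cars.example/jaguar-xk"],
    "fast cats":    ["wiki/Jaguar", "wiki/Cheetah"],
    "xk review":    ["cars.example/jaguar-xk", "reviews.example/xk"],
}

implicit_links = defaultdict(int)       # (page_a, page_b) -> co-click weight
for pages in query_clicks.values():
    for a, b in combinations(sorted(set(pages)), 2):
        implicit_links[(a, b)] += 1

for (a, b), weight in implicit_links.items():
    print(f"{a} <-> {b}  weight={weight}")
```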
| An e-market framework for informed trading | | BIBAK | Full-Text | 651-658 | |
| John Debenham; Simeon Simoff | |||
| Fully automated trading, such as e-procurement, using the Internet is
virtually unheard of today. Three core technologies are needed to fully
automate the trading process: data mining, intelligent trading agents and
virtual institutions in which informed trading agents can trade securely both
with each other and with human agents in a natural way. This paper describes a
demonstrable prototype e-trading system that integrates these three
technologies and is available on the World Wide Web. This is part of a larger
project that aims to make informed automated trading a reality. Keywords: data mining, electronic markets, market reliability, trading agents, virtual
institutions | |||
| The impact of online music services on the demand for stars in the music industry | | BIBAK | Full-Text | 659-667 | |
| Ian Pascal Volz | |||
| The music industry's business model is to produce stars. In order to do so,
musicians producing music that fits into well-defined clusters of factors
explaining the demand of the majority of music consumers are disproportionately
promoted. This limits the available diversity and therefore the end user's
benefit from listening to music. This paper analyses online music consumers'
needs and preferences. These factors are used
in order to explain the demand for stars and the impact of different online
music services on promoting a more diverse music market. Keywords: JEL-classification, MP3, iTunes, music, peer-to-peer, stardom, virtual
community | |||
| The web structure of e-government -- developing a methodology for quantitative evaluation | | BIBAK | Full-Text | 669-678 | |
| Vaclav Petricek; Tobias Escher; Ingemar J. Cox; Helen Margetts | |||
| In this paper we describe preliminary work that examines whether statistical
properties of the structure of websites can be an informative measure of their
quality. We aim to develop a new method for evaluating e-government.
E-government websites are evaluated regularly by consulting companies,
international organizations and academic researchers using a variety of
subjective measures. We aim to improve on these evaluations using a range of
techniques from webmetric and social network analysis. To pilot our
methodology, we examine the structure of government audit office sites in
Canada, the USA, the UK, New Zealand and the Czech Republic.
We report experimental values for a variety of characteristics, including the connected components, the average distance between nodes, the distribution of path lengths, and the indegree and outdegree. These measures are expected to correlate with (i) the navigability of a website and (ii) its "nodality", which is a combination of hubness and authority. Comparison of websites based on these characteristics raised a number of issues related to the proportion of non-hyperlinked content (e.g. pdf and doc files) within a site, and to the very significant differences in both the size of the websites and their respective national populations. Methods to account for these issues are proposed and discussed. There appears to be some correlation between the values measured and the league tables reported in the literature. However, this multi-dimensional analysis provides a richer source of evaluative techniques than previous work. Our analysis indicates that the US and Canada provide better navigability, much better than the UK; however, the UK site is shown to have the strongest "nodality" on the Web. Keywords: e-government, national audit offices, network, ranking, webmetric | |||
| One document to bind them: combining XML, web services, and the semantic web | | BIBAK | Full-Text | 679-686 | |
| Harry Halpin; Henry S. Thompson | |||
| We present a paradigm for uniting the diverse strands of XML-based Web
technologies by allowing them to be incorporated within a single document. This
overcomes the distinction between programs and data to make XML truly
"self-describing." A proposal for a lightweight yet powerful functional XML
vocabulary called "Semantic fXML" is detailed, based on the well-understood
functional programming paradigm and resembling the embedding of Lisp directly
in XML. Infosets are made "dynamic," since documents can now directly embed
local processes or Web Services into their Infoset. An optional typing regime
for info-sets is provided by Semantic Web ontologies. By regarding Web Services
as functions and the Semantic Web as providing types, and tying it all together
within a single XML vocabulary, the Web can compute. In this light, the real
Web 2.0 can be considered the transformation of the Web from a universal
information space to a universal computation space. Keywords: XML, functional programming, pipelining, semantic web, web services | |||
| ASDL: a wide spectrum language for designing web services | | BIBAK | Full-Text | 687-696 | |
| Monika Solanki; Antonio Cau; Hussein Zedan | |||
| A Service oriented system emerges from composition of services. Dynamically
composed reactive Web services form a special class of service oriented system,
where the delays associated with communication, unreliability and
unavailability of services, and competition for resources from multiple service
requesters are dominant concerns. As the complexity of services increases, an
abstract design language for the specification of services and interaction
between them is desired. In this paper, we present ASDL (Abstract Service
Design Language), a wide spectrum language for modelling Web services. We
initially provide an informal description of our computational model for
service oriented systems. We then present ASDL along with its specification
oriented semantics defined in Interval Temporal Logic (ITL): a sound formalism
for specifying and reasoning about temporal properties of systems. The
objective of ASDL is to provide a notation for the design of service
composition and interaction protocols at an abstract level. Keywords: ASDL, computational model, web services, wide spectrum | |||
| Semantic WS-agreement partner selection | | BIBAK | Full-Text | 697-706 | |
| Nicole Oldham; Kunal Verma; Amit Sheth; Farshad Hakimpour | |||
| In a dynamic service oriented environment it is desirable for service
consumers and providers to offer and obtain guarantees regarding their
capabilities and requirements. WS-Agreement defines a language and protocol for
establishing agreements between two parties. The agreements are complex and
expressive to the extent that the manual matching of these agreements would be
expensive both in time and resources. It is essential to develop a method for
matching agreements automatically. This work presents the framework and
implementation of an innovative tool for matching providers and consumers
based on WS-Agreements. The approach utilizes Semantic Web technologies to
achieve rich and accurate matches. A key feature is the novel and flexible
approach for achieving user personalized matches. Keywords: ARL, OWL, WS-agreement, WSDL-S, agreement matching, dynamic service
selection, multi-ontology service annotation, ontologies, semantic policy
matching, semantic web service, snobase | |||
| Beyond PageRank: machine learning for static ranking | | BIBAK | Full-Text | 707-715 | |
| Matthew Richardson; Amit Prakash; Eric Brill | |||
| Since the publication of Brin and Page's paper on PageRank, many in the Web
community have depended on PageRank for the static (query-independent) ordering
of Web pages. We show that we can significantly outperform PageRank using
features that are independent of the link structure of the Web. We gain a
further boost in accuracy by using data on the frequency at which users visit
Web pages. We use RankNet, a ranking machine learning algorithm, to combine
these and other static features based on anchor text and domain
characteristics. The resulting model achieves a static ranking pairwise
accuracy of 67.3% (vs. 56.7% for PageRank or 50% for random). Keywords: PageRank, RankNet, relevance, search engines, static ranking | |||
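The 67.3% figure is pairwise accuracy, which we take to mean the fraction of page pairs with different human relevance labels that the static ranking orders the same way the judges do. A small sketch of that metric with invented scores and labels:

```python
from itertools import combinations

def pairwise_accuracy(scores, labels):
    """Fraction of differently-labeled pairs that the scores order correctly."""
    correct = total = 0
    for i, j in combinations(range(len(scores)), 2):
        if labels[i] == labels[j]:
            continue
        total += 1
        if (scores[i] - scores[j]) * (labels[i] - labels[j]) > 0:
            correct += 1
    return correct / total if total else 0.0

# Invented static-rank scores and human relevance labels for five pages.
print(pairwise_accuracy(scores=[0.9, 0.2, 0.7, 0.4, 0.1], labels=[3, 1, 2, 2, 0]))
```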
| Optimizing scoring functions and indexes for proximity search in type-annotated corpora | | BIBAK | Full-Text | 717-726 | |
| Soumen Chakrabarti; Kriti Puniyani; Sujatha Das | |||
| We introduce a new, powerful class of text proximity queries: find an
instance of a given "answer type" (person, place, distance) near "selector"
tokens matching given literals or satisfying given ground predicates. An
example query is type=distance NEAR Hamburg Munich. Nearness is defined as a
flexible, trainable parameterized aggregation function of the selectors, their
frequency in the corpus, and their distance from the candidate answer. Such
queries provide a key data reduction step for information extraction, data
integration, question answering, and other text-processing applications. We
describe the architecture of a next-generation information retrieval engine for
such applications, and investigate two key technical problems faced in building
it. First, we propose a new algorithm that estimates a scoring function from
past logs of queries and answer spans. Plugging the scoring function into the
query processor gives high accuracy: typically, an answer is found at rank 2-4.
Second, we exploit the skew in the distribution over types seen in query logs
to optimize the space required by the new index structures required by our
system. Extensive performance studies with a 10GB, 2-million document TREC
corpus and several hundred TREC queries show both the accuracy and the
efficiency of our system. From an initial 4.3GB index using 18,000 types from
WordNet, we can discard 88% of the space, while inflating query times by a
factor of only 1.9. Our final index overhead is only 20% of the total index
space needed. Keywords: indexing annotated text | |||
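A toy version of the proximity scoring sketched above: each occurrence of the answer type is scored by a decaying, rarity-weighted sum of contributions from nearby selector tokens. The decay function and idf weights are placeholders, whereas the paper learns the aggregation function from query logs.

```python
import math

def proximity_score(candidate_pos, selector_positions, selector_idf, decay=0.5):
    """Score one candidate answer occurrence by the selector tokens around it."""
    score = 0.0
    for selector, positions in selector_positions.items():
        if not positions:
            continue
        distance = min(abs(p - candidate_pos) for p in positions)
        score += selector_idf[selector] * math.exp(-decay * distance)
    return score

# Token positions in one document for the query: type=distance NEAR Hamburg Munich.
selectors = {"hamburg": [12, 40], "munich": [15]}
idf = {"hamburg": 3.1, "munich": 2.8}
print(proximity_score(candidate_pos=14, selector_positions=selectors, selector_idf=idf))
```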
| Automatic identification of user interest for personalized search | | BIBAK | Full-Text | 727-736 | |
| Feng Qiu; Junghoo Cho | |||
| One hundred users, one hundred needs. As more and more topics are being
discussed on the web and our vocabulary remains relatively stable, it is
increasingly difficult to let the search engine know what we want. Coping with
ambiguous queries has long been an important part of the research on
Information Retrieval, but still remains a challenging task. Personalized
search has recently got significant attention in addressing this challenge in
the web search community, based on the premise that a user's general preference
may help the search engine disambiguate the true intention of a query. However,
studies have shown that users are reluctant to provide any explicit input on
their personal preference. In this paper, we study how a search engine can
learn a user's preference automatically based on her past click history and how
it can use the user preference to personalize search results. Our experiments
show that users' preferences can be learned accurately even from little
click-history data and personalized search based on user preference yields
significant improvements over the best existing ranking mechanism in the
literature. Keywords: personalized search, user profile, user search behavior, web search | |||
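A rough sketch of the idea: derive a topical preference vector from the categories of pages the user has clicked, then blend it with the engine's relevance score to re-rank results for an ambiguous query. The categories, weights and blending rule are invented, and the paper's learning method is more sophisticated.

```python
from collections import Counter

# Hypothetical click history: each clicked page labeled with a topic category.
click_history = ["programming", "programming", "sports", "programming", "travel"]
counts = Counter(click_history)
total = sum(counts.values())
preference = {category: n / total for category, n in counts.items()}

def personalize(results, alpha=0.3):
    """Blend the engine's relevance score with the user's topical preference."""
    rescored = [
        (url, (1 - alpha) * relevance + alpha * preference.get(category, 0.0))
        for url, relevance, category in results
    ]
    return sorted(rescored, key=lambda item: -item[1])

# (url, original relevance, category) for the ambiguous query "python".
results = [("python.example/docs", 0.80, "programming"),
           ("zoo.example/python",  0.82, "animals")]
print(personalize(results))
```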
| Protecting browser state from web privacy attacks | | BIBAK | Full-Text | 737-744 | |
| Collin Jackson; Andrew Bortz; Dan Boneh; John C. Mitchell | |||
| Through a variety of means, including a range of browser cache methods and
inspecting the color of a visited hyperlink, client-side browser state can be
exploited to track users against their wishes. This tracking is possible
because persistent, client-side browser state is not properly partitioned on
per-site basis in current browsers. We address this problem by refining the
general notion of a "same-origin" policy and implementing two browser
extensions that enforce this policy on the browser cache and visited links.
We also analyze various degrees of cooperation between sites to track users, and show that even if long-term browser state is properly partitioned, it is still possible for sites to use modern web features to bounce users between sites and invisibly engage in cross-domain tracking of their visitors. Cooperative privacy attacks are an unavoidable consequence of all persistent browser state that affects the behavior of the browser, and disabling or frequently expiring this state is the only way to achieve true privacy against colluding parties. Keywords: phishing, privacy, web browser design, web spoofing | |||
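The core of the proposed defense is to key client-side state by the visiting site as well as the resource, so one site cannot probe state created while the user was on another. A schematic, invented cache structure illustrating that partitioning (not the authors' browser extensions):

```python
class PartitionedCache:
    """Cache entries keyed by (first-party site, resource URL).

    A resource cached while browsing site A is invisible when site B embeds
    the same URL, so B cannot use cache probing to learn A's history.
    """

    def __init__(self):
        self._store = {}

    def get(self, first_party, url):
        return self._store.get((first_party, url))

    def put(self, first_party, url, body):
        self._store[(first_party, url)] = body

cache = PartitionedCache()
cache.put("bank.example", "https://bank.example/logo.png", b"<png bytes>")
print(cache.get("attacker.example", "https://bank.example/logo.png"))   # None
```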
| Meaning on the web: evolution vs intelligent design? | | BIBA | Full-Text | 745 | |
| Ron Brachman; Dan Connolly; Rohit Khare; Frank Smadja; Frank van Harmelen | |||
| It is a truism that as the Web grows in size and scope, it becomes harder to
find what we want, to identify like-minded people and communities, to find the
best ads to offer, and to have applications work together smoothly. Services
don't interoperate; queries yield long lists of results, most of which seem to
miss the point. If the Web were a person, we would expect richer and more
successful interactions with it -- interactions that were, quite literally,
more meaningful. That's because in human discourse, it is shared meaning that
gives us real communication. Yet with the current Web, meaning cannot be found.
Much recent work has aspired to change this, both for human-machine interchange and machine-machine synchronization. Certainly the "semantic web" looks to add meaning to our current simplistic matching of mere strings of characters against mere "bags" of words. But can we legislate meaning from on high? Isn't meaning organic and determined by use, a moving and context-dependent target? But if meaning is an evolving organic soup, how are humans able to get anything done with one another? Don't we love to "define our terms"? But then again, is real definition even possible? These questions have daunted philosophers for years, and we probably won't solve them here. But we'll try to understand what's at the root of our own current religious debate: should meaning on the Web be evolutionary, driven organically through the bottom-up human assignment of tags? Or does it need to be carefully crafted and managed by a higher authority, using structured representations with defined semantics? Without picket signs or violence (we hope), our panelists will explore the two extreme ends of the spectrum -- and several points in between. | |||
| Identity management on converged networks: a reality check | | BIBA | Full-Text | 747 | |
| Arnaud Sahuguet; Stefan Brands; Kim Cameron; Cahill Conor; Aude Pichelin; Fulup Ar Foll; Mike Neuenschwander | |||
| Since the early days of the Web, identity management has been a big issue.
As the famous cartoon from the New Yorker reminds us, "on the internet, nobody
knows you are a dog". This was true back in July 1993. This is true today. For
the last few years, numerous initiatives have emerged to tackle this issue:
Microsoft Passport, Liberty Alliance, 3GPP GUP, Shibboleth, to name a few.
Major investments are being made in this area and this is foreseen as a
multi-billion dollar market. Yet, as of this writing, there is still no
widespread identity management infrastructure in place ready to be used by the
general public on converged networks.
The goal of this panel is to do a reality check and try to answer the following five questions: * What is identity management? * Who needs identity management and why? * What will the identity management ecosystem look like? * What's agreed upon? * What's next? | |||
| Phoiling phishing | | BIBA | Full-Text | 749 | |
| Rachna Dhamija; Peter Cassidy; Phillip Hallam-Baker; Markus Jacobsson | |||
| In the last few years, Internet users have seen the rapid expansion of "phishing", the use of spoofed e-mails and fraudulent websites designed to trick users into divulging sensitive data. More recently, we have seen the growth of "pharming", the use of malware or DNS-based attacks to misdirect users to rogue websites. In this panel, we will examine the state of the art in anti-phishing solutions and explore promising directions for future research. | |||
| The next wave of the web | | BIBA | Full-Text | 750 | |
| Nigel Shadbolt; Tim Berners-Lee; Jim Hendler; Claire Hart; Richard Benjamins | |||
| The World Wide Web has been revolutionary in terms of impact, scale and
outreach. At every level society has been changed in some way by the Web. This
Panel will consider likely developments in this extraordinary human construct
as we attempt to realise the Next Wave of the Web -- a Semantic Web.
Nigel Shadbolt will Chair a discussion that will focus on the prospects for the Semantic Web, its likely form and the challenges it faces. Can we achieve the necessary agreements on shared meaning for the Semantic Web? Can we achieve a critical mass of semantically annotated data and content? How are we to trust such content? Do the scientific and commercial drivers really demand a Semantic Web? How will the move to a mobile and ubiquitous Web affect the Semantic Web? How does Web 2.0 relate to the Semantic Web? | |||
| Compressing and searching XML data via two zips | | BIBAK | Full-Text | 751-760 | |
| P. Ferragina; F. Luccio; G. Manzini; S. Muthukrishnan | |||
| XML is fast becoming the standard format to store, exchange and publish over
the web, and is getting embedded in applications. Two challenges in handling
XML are its size (the XML representation of a document is significantly larger
than its native state) and the complexity of its search (XML search involves
path and content searches on labeled tree structures). We address the basic
problems of compression, navigation and searching of XML documents. In
particular, we adopt recently proposed theoretical algorithms [11] for succinct
tree representations to design and implement a compressed index for XML, called
XBzipIndex, in which the XML document is maintained in a highly compressed
format, and both navigation and searching can be done uncompressing only a tiny
fraction of the data. This solution relies on compressing and indexing two
arrays derived from the XML data. With detailed experiments we compare this
with other compressed XML indexing and searching engines to show that
XBzipIndex has a compression ratio up to 35% better than the ones achievable by
those other tools, and its time performance on some path and content search
operations is orders of magnitude faster: a few milliseconds over hundreds of MBs
of XML files versus tens of seconds, on standard XML data sources. Keywords: XML compression and indexing, labeled trees | |||
| Wake-on-WLAN | | BIBAK | Full-Text | 761-769 | |
| Nilesh Mishra; Kameswari Chebrolu; Bhaskaran Raman; Abhinav Pathak | |||
| In bridging the digital divide, two important criteria are
cost-effectiveness, and power optimization. While 802.11 is cost-effective and
is being used in several installations in the developing world, typical system
configurations are not really power efficient. In this paper, we propose a
novel "Wake-on-WLAN" mechanism for coarse-grained, on-demand power on/off of
the networking equipment at a remote site. The novelty also lies in our
implementation of a prototype system using low-power 802.15.4-based sensor
motes. We describe the prototype, as well as its evaluation in the field in a WiFi
testbed. Preliminary estimates indicate that the proposed mechanism can save
significant power in typical rural networking settings. Keywords: 802.11 mesh network, 802.15.4, power management, rural networking,
wake-on-WLAN | |||
| Analysis of WWW traffic in Cambodia and Ghana | | BIBAK | Full-Text | 771-780 | |
| Bowei Du; Michael Demmer; Eric Brewer | |||
| In this paper we present an analysis of HTTP traffic captured from Internet
cafés and kiosks from two different developing countries -- Cambodia and
Ghana. This paper has two main contributions. The first contribution is an
analysis of the characteristics of the web trace, including the distribution
and classification of the web objects requested by the users. We outline
notable features of the data set which affect the performance of the web for
users in developing regions. Using the trace data, we also perform several
simulation analyses of cache performance, including both traditional caching
and more novel off-line caching proposals. The second contribution is a set of
suggestions on mechanisms to improve the user experience of the web in these
regions. These mechanisms include both applications of well-known research
techniques as well as offering some less well-studied suggestions based on
intermittent connectivity. Keywords: Cambodia, Ghana, HTTP, WWW, caching, classification, delay tolerant
networking, developing regions, dynamic content, hypertext transfer protocol,
measurement, performance analysis, proxy, redundant transfers, trace, world
wide web | |||
| The case for multi-user design for computer aided learning in developing regions | | BIBAK | Full-Text | 781-789 | |
| Joyojeet Pal; Udai Singh Pawar; Eric A. Brewer; Kentaro Toyama | |||
| Computer-aided learning is fast gaining traction in developing regions as a
means to augment classroom instruction. Reasons for using computer-aided
learning range from supplementing teacher shortages to starting underprivileged
children off in technology, and funding for such initiatives range from state
education funds to international agencies and private groups interested in
child development. The interaction of children with computers is seen at
various levels, from unsupervised self-guided learning at public booths without
specific curriculum to highly regulated in-class computer applications with
modules designed to go with school curriculum. Such learning is used at various
levels, from children as young as 5 years old to high-schoolers. This paper uses
field observations of primary school children in India using computer-aided
learning modules, and finds patterns by which children who perform better in
classroom activities seat themselves in front of computer monitors, and control
the mouse, in cases where children are required to share computer resources. We
find that in such circumstances, there emerges a pattern of learning, unique to
multi-user environments -- wherein certain children tend to learn better
because of their control of the mouse. This research also shows that while
computer aided learning software for children is primarily designed for
single-users, the implementation realities of resource-strapped learning
environments in developing regions presents a strong case for multi-user
design. Keywords: developing regions | |||
| Designing an architecture for delivering mobile information services to the rural developing world | | BIBAK | Full-Text | 791-800 | |
| Tapan S. Parikh; Edward D. Lazowska | |||
| Implementing successful rural computing applications requires addressing a
number of significant challenges. Recent advances in mobile phone computing
capabilities make this device a likely candidate to address the client hardware
constraints. Long battery life, wireless connectivity, solid-state memory, low
price and immediate utility all make it better suited to rural conditions than
a PC. However, current mobile software platforms are not as appropriate.
Web-based mobile applications are hard to use, do not take advantage of the
mobile phone's media capabilities and require an online connection. Custom
mobile applications are difficult to develop and distribute. To address these
limitations we present CAM -- a new framework for developing and deploying
mobile computing applications in the rural developing world. CAM applications
are accessed by capturing barcodes using the mobile phone camera, or entering
numeric strings with the keypad. Supporting minimal navigation, direct linkage
to paper practices and offline multi-media interaction, CAM is uniquely adapted
to rural device, user and infrastructure constraints. To illustrate the breadth
of the framework, we list a number of CAM-based applications that we have
implemented or are planning. These include processing microfinance loans,
facilitating rural supply chains, documenting grassroots innovation and
accessing electronic medical histories. Keywords: ICT, client-server distributed systems, mobile computing, mobile phones,
paper user interface, rural development | |||
| WebKhoj: Indian language IR from multiple character encodings | | BIBAK | Full-Text | 801-809 | |
| Prasad Pingali; Jagadeesh Jagarlamudi; Vasudeva Varma | |||
| Today web search engines provide the easiest way to reach information on the
web. In this scenario, more than 95% of Indian language content on the web is
not searchable due to multiple encodings of web pages.
Most of these encodings are proprietary and hence need some kind of standardization for making the content accessible via a search engine. In this paper we present a search engine called WebKhoj which is capable of searching multi-script and multi-encoded Indian language content on the web. We describe a language-focused crawler and the transcoding processes involved to achieve accessibility of Indian language content. In the end we report some of the experiments that were conducted along with results on Indian language web content. Keywords: Indian languages, non-standard encodings, web search | |||
| Using annotations in enterprise search | | BIBAK | Full-Text | 811-817 | |
| Pavel A. Dmitriev; Nadav Eiron; Marcus Fontoura; Eugene Shekita | |||
| A major difference between corporate intranets and the Internet is that in
intranets the barrier for users to create web pages is much higher. This limits
the amount and quality of anchor text, one of the major factors used by
Internet search engines, making intranet search more difficult. The social
phenomenon at play also means that spam is relatively rare. Both on the
Internet and in intranets, users are often willing to cooperate with the search
engine in improving the search experience. These characteristics naturally lead
to considering using user feedback to improve search quality in intranets. In
this paper we show how a particular form of feedback, namely user annotations,
can be used to improve the quality of intranet search. An annotation is a short
description of the contents of a web page, which can be considered a substitute
for anchor text. We propose two ways to obtain user annotations, using explicit
and implicit feedback, and show how they can be integrated into a search
engine. Preliminary experiments on the IBM intranet demonstrate that using
annotations improves the search quality. Keywords: anchortext, community ranking, enterprise search | |||
| Detecting semantic cloaking on the web | | BIBAK | Full-Text | 819-828 | |
| Baoning Wu; Brian D. Davison | |||
| By supplying different versions of a web page to search engines and to
browsers, a content provider attempts to cloak the real content from the view
of the search engine. Semantic cloaking refers to differences in meaning
between pages which have the effect of deceiving search engine ranking
algorithms. In this paper, we propose an automated two-step method to detect
semantic cloaking pages based on different copies of the same page downloaded
by a web crawler and a web browser. The first step is a filtering step, which
generates a candidate list of semantic cloaking pages. In the second step, a
classifier is used to detect semantic cloaking pages from the candidates
generated by the filtering step. Experiments on manually labeled data sets show
that we can generate a classifier with a precision of 93% and a recall of 85%.
We apply our approach to links from the dmoz Open Directory Project and
estimate that more than 50,000 of these pages employ semantic cloaking. Keywords: spam, web search engine | |||
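The two-step pipeline sketched in this abstract (a cheap filter followed by a classifier over crawler-fetched and browser-fetched copies) can be illustrated with a minimal sketch; the tokenization, feature lexicon and thresholds below are hypothetical stand-ins, not the features or classifier reported in the paper.

```python
# Illustrative sketch of a two-step semantic-cloaking detector; hypothetical
# thresholds and features, not the classifier trained in the paper.
import re
from collections import Counter

def terms(html: str) -> Counter:
    """Tokenize visible-ish text into lowercase word counts."""
    text = re.sub(r"<[^>]+>", " ", html)            # strip tags crudely
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def filter_step(crawler_copy: str, browser_copy: str, min_diff: int = 10) -> bool:
    """Cheap filter: flag pages whose two copies differ in many term counts."""
    a, b = terms(crawler_copy), terms(browser_copy)
    diff = sum(((a - b) + (b - a)).values())        # symmetric term-count difference
    return diff >= min_diff

def classify_step(crawler_copy: str, browser_copy: str) -> bool:
    """Toy second step: score candidates on a few hand-picked features."""
    a, b = terms(crawler_copy), terms(browser_copy)
    only_in_crawler = set(a) - set(b)               # words shown only to the crawler
    spammy = {"casino", "viagra", "loan", "pills"}  # hypothetical feature lexicon
    score = len(only_in_crawler) + 5 * len(only_in_crawler & spammy)
    return score > 20                               # hypothetical decision threshold

def is_semantic_cloaking(crawler_copy: str, browser_copy: str) -> bool:
    return filter_step(crawler_copy, browser_copy) and \
           classify_step(crawler_copy, browser_copy)
```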
| Detecting online commercial intention (OCI) | | BIBAK | Full-Text | 829-837 | |
| Honghua (Kathy) Dai; Lingzhi Zhao; Zaiqing Nie; Ji-Rong Wen; Lee Wang; Ying Li | |||
| Understanding goals and preferences behind a user's online activities can
greatly help information providers, such as search engines and E-Commerce web
sites, to personalize contents and thus improve user satisfaction.
Understanding a user's intention could also provide other business advantages
to information providers. For example, information providers can decide whether
to display commercial content based on user's intent to purchase. Previous work
on Web search defines three major types of user search goals for search
queries: navigational, informational and transactional or resource [1][7]. In
this paper, we focus our attention on capturing commercial intention from
search queries and Web pages, i.e., when a user submits a query or browses a
Web page, whether he/she is about to commit or in the middle of a commercial
activity, such as a purchase, auction, sale, or paid service. We call the
commercial intention behind a user's online activities OCI (Online
Commercial Intention). We also propose the notion of "Commercial Activity
Phase" (CAP), which identifies which phase of his/her commercial activities a
user is in: Research or Commit. We present a framework for building machine
learning models to learn OCI based on any Web page content. Based on that
framework, we build models to detect OCI from search queries and Web pages. We
train machine learning models from two types of data sources for a given search
query: content of algorithmic search result page(s) and contents of top sites
returned by a search engine. Our experiments show that the model based on the
first data source achieved better performance. We also discover that frequent
queries are more likely to have commercial intention. Finally we propose our
future work in learning richer commercial intention behind users' online
activities. Keywords: OCI, SVM, intention, online commercial intention, search intention | |||
| Temporal rules for mobile web personalization | | BIBAK | Full-Text | 839-840 | |
| Martin Halvey; Mark T. Keane; Barry Smyth | |||
| Many systems use past behavior, preferences and environmental factors to
attempt to predict user navigation on the Internet. However we believe that
many of these models have shortcomings, in that they do not take into account
that users may have many different sets of preferences. Here we investigate an
environmental factor, namely time, in making predictions about user navigation.
We present methods for creating temporal rules that describe user navigation
patterns. We also show the benefit of using these rules to predict user
navigation, as well as the advantages of these models over traditional methods.
An analysis is carried out on a sample of usage logs for Wireless Application
Protocol (WAP) browsing, and the results of this analysis verify our
hypothesis. Keywords: WAP, WWW, mobile, temporal models, user modeling | |||
| Behavior-based web page evaluation | | BIBAK | Full-Text | 841-842 | |
| Ganesan Velayathan; Seiji Yamada | |||
| This paper describes our efforts to factor in a user's browsing behavior to
automatically evaluate web pages that the user shows interest in. To evaluate a
webpage automatically, we have
developed a client-side logging tool: the GINIS Framework. We do not focus just
on clicking, scrolling, navigation, or duration of visit alone, but we propose
integrating these patterns of interaction to recognize and evaluate a user's
response to a given web page. Keywords: automatic profiling, information extraction, web browsing behavior, web
usage mining, web-human interaction | |||
| Using web browser interactions to predict task | | BIBAK | Full-Text | 843-844 | |
| Melanie Kellar; Carolyn Watters | |||
| The automatic identification of a user's task has the potential to improve
information filtering systems that rely on implicit measures of interest and
whose effectiveness may be dependent upon the task at hand. Knowledge of a
user's current task type would allow information filtering systems to apply the
most useful measures of user interest. We recently conducted a field study in
which we logged all participants' interactions with their web browsers and
asked participants to categorize their web usage according to a high-level task
schema. Using the data collected during this study, we have conducted a
preliminary exploration of the usefulness of logged web browser interactions to
predict users' tasks. The results of this initial analysis suggest that
individual models of users' web browser interactions may be useful in
predicting task type. Keywords: decision tree, field study, information filtering, task, task prediction,
web | |||
| An integrated method for social network extraction | | BIBAK | Full-Text | 845-846 | |
| Tom Hope; Takuichi Nishimura; Hideaki Takeda | |||
| A social network can become a basis for the information infrastructure of the
future. It is important to extract social networks that are not biased.
Providing a simple means for users to register their social relations is also
important. We propose a method that combines various approaches to extract
social networks. In particular, three kinds of networks are extracted: a
user-registered Know link network, a Web-mined Web link network, and a face-to-face
Touch link network. In this paper, the combination of social network extraction
for communities is described, and an analysis of the extracted social networks
is presented. Keywords: social network, user interaction, web mining | |||
| Integrating semantic web and language technologies to improve the online public administrations services | | BIBAK | Full-Text | 847-848 | |
| Marta Gatius; Meritxell González; Sheyla Militello; Pablo Hernández | |||
| In this paper, we describe how domain ontologies are used in a dialogue
system guiding the user to access web public administration contents. The
current implementation of the system supports speech (through the telephone)
and text mode in different languages (English, Spanish, Catalan and Italian). Keywords: dialogue systems, e-government, ontologies, web usability | |||
| DemIL: an online interaction language between citizen and government | | BIBAK | Full-Text | 849-850 | |
| Cristiano Maciel; Ana Cristina Bicharra Garcia | |||
| Electronic democracy should provide information and service for the citizens
on the Internet, allowing room for debate, participation and electronic voting.
The languages being adopted by mass communication means, especially Reality
Shows, are efficient and encourage public participation in decision-making.
This paper discusses a citizen-government interaction language intended to
facilitate citizen participation in the government's decisions. An e-Democracy
Model for people participation through web-based technologies is conceived.
This model specifies the syntax of an Democracy Interaction Language, a DemIL.
Such language incorporates characteristics of Reality Show Formats, and it is
the back-end of a web-interface project in the domain researched. The study of
case Participative Budget of Brazil represents the language proposed. Keywords: e-democracy, e-government, interaction, interface | |||
| Web annotation sharing using P2P | | BIBAK | Full-Text | 851-852 | |
| Osamu Segawa | |||
| We have developed a system that allows users to add annotations immediately
onto a Web page they are viewing, and share the information via a network. A
novel feature of our method is that P2P nodes in the system determine their
roles autonomously, and share the annotation data. Our method is based on P2P;
however, P2P nodes in the system change their roles and data transfer
procedures, depending on their network topology or the status of other nodes.
Our method is robust to node or network problems, and has flexible scalability. Keywords: P2P, annotation | |||
| Generating summaries for large collections of geo-referenced photographs | | BIBAK | Full-Text | 853-854 | |
| Alexander Jaffe; Mor Naaman; Tamir Tassa; Marc Davis | |||
| We describe a framework for automatically selecting a summary set of
photographs from a large collection of geo-referenced photos. The summary
algorithm is based on spatial patterns in photo sets, but can be expanded to
support social, temporal, as well as textual-topical factors of the photo set.
The summary set can be biased by the user, the content of the user's query, and
the context in which the query is made. An initial evaluation on a set of
geo-referenced photos shows that our algorithm performs well, producing results
that are highly rated by users. Keywords: collection summary, geo-referenced information, geo-referenced photographs,
photo browsing, photo collections, semantic zoom | |||
| Determining user interests about museum collections | | BIBAK | Full-Text | 855-856 | |
| Lloyd Rutledge; Lora Aroyo; Natalia Stash | |||
| Currently, there is an increasing effort to provide various personalized
services on museum web sites. This paper presents an approach for determining
user interests in a museum collection with the help of an interactive dialog.
It uses a semantically annotated collection of the Rijksmuseum Amsterdam to
elicit a specific user's interests in artists, periods, genres and themes and
uses these values to recommend relevant artefacts and related concepts from the
museum collection. In the presented prototype, we show how constructing a user
profile and applying recommender strategies in this way enables the dynamic
generation of personalized museum tours for different users. Keywords: museum collections, personalization, recommender systems, semantic browsing,
user profiling | |||
| GIO: a semantic web application using the information grid framework | | BIBAK | Full-Text | 857-858 | |
| Omar Alonso; Sandeepan Banerjee; Mark Drake | |||
| It is well understood that the success of Semantic Web applications
depends on the availability of machine-understandable meta-data. We describe
the Information Grid, a practical approach to the Semantic Web, and show a
prototype implementation. Information grid resources span all the data in the
organization and all the metadata required to make it meaningful. The final
goal is to let organizations view their assets in a smooth continuum from the
Internet to the Intranet, with uniform semantically rich access. Keywords: RDF, browsing, clustering, databases, information visualization, meta-data,
search, semantic web, tools, user interface | |||
| Graphical representation of RDF queries | | BIBAK | Full-Text | 859-860 | |
| Andreas Harth; Sebastian Ryszard Kruk; Stefan Decker | |||
| In this poster we discuss a graphical notation for representing queries for
semistructured data. We try to strike a balance between expressiveness of the
query language and simplicity and understandability of the graphical notation.
We present the primitives of the notation by means of examples. Keywords: RDF, metadata, query, semistructured data | |||
| Question answering on top of the BT digital library | | BIBAK | Full-Text | 861-862 | |
| Philipp Cimiano; Peter Haase; York Sure; Johanna Völker; Yimin Wang | |||
| In this poster we present an approach to query answering over knowledge
sources that makes use of different ontology management components within an
application scenario of the BT Digital Library. The novelty of the approach
lies in the combination of different semantic technologies providing a clear
benefit for the application scenario considered. Keywords: natural language processing, ontology learning, question answering, web
ontologies | |||
| XPath filename expansion in a Unix shell | | BIBA | Full-Text | 863-864 | |
| Kaspar Giger; Erik Wilde | |||
| Locating files based on file system structure, file properties, and maybe even file contents is a core task of the user interface of operating systems. By adapting XPath's power to the environment of a Unix shell, it is possible to greatly increase the expressive power of the command line language. We present a concept for integrating an XPath view of the file system into a shell, the XPath Shell (XPsh), which can be used to find files based on file attributes and contents in a very flexible way. The syntax of the command line language is backwards compatible with traditional shells, and the new XPath-based expressions can be easily mastered with a little bit of XPath knowledge. | |||
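The core idea of an XPath view over the file system can be sketched as follows, independent of XPsh's actual command-line syntax (which is not reproduced here): mirror a directory tree as XML and query it with the limited XPath supported by Python's standard library.

```python
# A minimal sketch of the idea behind an "XPath view" of the file system:
# build an XML tree mirroring a directory and query it with (limited) XPath.
# This does not reproduce XPsh's command-line syntax.
import os
import xml.etree.ElementTree as ET

def fs_to_xml(root: str) -> ET.Element:
    """Mirror a directory tree as nested <dir>/<file> elements with attributes."""
    def build(path: str) -> ET.Element:
        if os.path.isdir(path):
            elem = ET.Element("dir", name=os.path.basename(path) or path)
            for entry in sorted(os.listdir(path)):
                elem.append(build(os.path.join(path, entry)))
        else:
            elem = ET.Element("file",
                              name=os.path.basename(path),
                              ext=os.path.splitext(path)[1].lstrip("."),
                              size=str(os.path.getsize(path)))
        return elem
    return build(root)

if __name__ == "__main__":
    tree = fs_to_xml(".")
    # Find all Python files anywhere below the current directory.
    for f in tree.iterfind(".//file[@ext='py']"):
        print(f.get("name"), f.get("size"))
```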
| Microformats: a pragmatic path to the semantic web | | BIBAK | Full-Text | 865-866 | |
| Rohit Khare; Tantek Çelik | |||
| Microformats are a clever adaptation of semantic XHTML that makes it easier
to publish, index, and extract semi-structured information such as tags,
calendar entries, contact information, and reviews on the Web. This makes it a
pragmatic path towards achieving the vision set forth for the Semantic Web.
Even though it sidesteps the existing "technology stack" of RDF, ontologies, and Artificial Intelligence-inspired processing tools, various microformats have emerged that parallel the goals of several well-known Semantic Web projects. This poster compares their prospects to the Semantic Web according to Rogers' Diffusion of Innovation model. Keywords: CSS, HTML, decentralization, microformats, semantic web | |||
| SGSDesigner: a graphical interface for annotating and designing semantic grid services | | BIBAK | Full-Text | 867-868 | |
| Asunción Gómez-Pérez; Rafael González-Cabero | |||
| In this paper, we describe SGSDesigner, the ODESGS Environment user
interface. ODESGS Environment (the realization of the ODESGS Framework [1]) is
an environment for supporting both a) the annotation of pre-existing Grid
Services (GSs) and b) the design of new complex Semantic Grid Services (SGSs)
in a (semi) automatic way. Keywords: problem-solving methods, semantic grid services | |||
| Status of the African Web | | BIBAK | Full-Text | 869-870 | |
| Rizza Camus Caminero; Pavol Zavarsky; Yoshiki Mikami | |||
| As part of the Language Observatory Project [4], we have been crawling the
web space since 2004. We have collected terabytes of data, mostly from Asian
and African ccTLDs. In this paper, we present the current status of
the African web and compare it with its status in 2004 and 2002. This paper
focuses on the accessibility of the web pages, the web tree growth, web
technology, privacy protection, and web interconnection. Keywords: Africa, ccTLD, interconnection, internet statistics, privacy protection, web
accessibility, web graph, web tree | |||
| Personalization and accessibility: integration of library and web approaches | | BIBAK | Full-Text | 871-872 | |
| Ann Chapman; Brian Kelly; Liddy Nevile; Andy Heath | |||
| This paper describes personalization metadata standards that can be used to
enable individuals to access and use resources based on their particular
requirements. The paper describes two approaches which are being developed in
the library and Web worlds and highlights some of the potential challenges
which will need to be addressed in order to maximise interoperability. The
paper concludes by arguing the need for greater dialogue across these two
communities. Keywords: IMS, MARC, accessibility, metadata | |||
| Testing google interfaces modified for the blind | | BIBAK | Full-Text | 873-874 | |
| Patrizia Andronico; Marina Buzzi; Barbara Leporini; Carlos Castillo | |||
| We present the results of a research project focused on improving the
usability of web search tools for blind users who interact via screen reader
and voice synthesizer. In the first stage of our study, we proposed eight
specific guidelines for simplifying this interaction with search engines. Next,
we evaluated these criteria by applying them to Google UIs, re-implementing the
simple search and the result page. Finally, we prepared the environment for a
remote test with 12 totally blind users. The results highlight how Google
interfaces could be improved in order to simplify interaction for the blind. Keywords: accessibility, blind, search engine, usability, user interface design | |||
| Verifying genre-based clustering approach to content extraction | | BIBAK | Full-Text | 875-876 | |
| Suhit Gupta; Hila Becker; Gail Kaiser; Salvatore Stolfo | |||
| The content of a webpage is usually contained within a small body of text
and images, or perhaps several articles on the same page; however, the content
may be lost in the clutter, particularly hurting users browsing on small cell
phone and PDA screens and visually impaired users relying on speech rendering of
web pages. Using the genre of a web page, we have created a solution, Crunch,
that automatically identifies clutter and removes it, thus leaving a clean,
content-full page. In order to evaluate the improvement this technology brings to
such applications, we identified a number of experiments. In this paper, we present
those experiments, the associated results and their evaluation. Keywords: HTML, accessibility, clustering, content extraction, context, reformatting,
speech rendering, website classification | |||
| A browser for browsing the past web | | BIBAK | Full-Text | 877-878 | |
| Adam Jatowt; Yukiko Kawai; Satoshi Nakamura; Yutaka Kidawara; Katsumi Tanaka | |||
| We describe a browser for the past web. It can retrieve data from multiple
past web resources and features a passive browsing style based on change
detection and presentation. The browser shows past pages one by one along a
time line. The parts that were changed between consecutive page versions are
animated to reflect their deletion or insertion, thereby drawing the user's
attention to them. The browser enables automatic skipping of changeless periods
and filtered browsing based on a user-specified query. Keywords: past web, web archive browsing, web archives | |||
| Live URLs: breathing life into URLs | | BIBAK | Full-Text | 879-880 | |
| Natarajan Kannan; Toufeeq Hussain | |||
| This paper provides a novel approach to use URI fragment identifiers to
enable HTTP clients to address and process content, independent of its original
representation. Keywords: ACM proceedings, HTML, HTTP, URL, browsers, fragment identifier, web
addressing, web content | |||
| Structuring namespace descriptions | | BIBA | Full-Text | 881-882 | |
| Erik Wilde | |||
| Namespaces are a central building block of XML technologies today; they provide the identification mechanism for many XML-related vocabularies. Despite their ubiquity, there is no established mechanism for describing namespaces, and in particular for describing the dependencies of namespaces. We propose a simple model for describing namespaces and their dependencies. Using these descriptions, it is possible to compile directories of namespaces providing searchable and browsable namespace descriptions. | |||
| CiteSeerx: an architecture and web service design for an academic document search engine | | BIBAK | Full-Text | 883-884 | |
| Huajing Li; Isaac Councill; Wang-Chien Lee; C. Lee Giles | |||
| CiteSeer is a scientific literature digital library and search engine which
automatically crawls and indexes scientific documents in the field of computer
and information science. After serving as a public search engine for nearly ten
years, CiteSeer is starting to have scaling problems in handling more
documents, adding new features and serving more users. Its monolithic architecture
prevents it from effectively making use of new web technologies and
providing new services. After analyzing the current system's problems, we propose
a new architecture and data model, CiteSeerx, that will overcome the
existing problems and provide scalability, better performance, and
new services and system features. Keywords: data model, scalability, system architecture | |||
| Tables and trees don't mix (very well) | | BIBA | Full-Text | 885-886 | |
| Erik Wilde | |||
| There are principal differences between the relational model and XML's tree model. This causes problems in all cases where information from these two worlds has to be brought together. Using a few rules for mapping the incompatible aspects of the two models, it becomes easier to process data in systems which need to work with relational and tree data. The most important requirement for a good mapping is that the conceptual model is available and can thus be used for making mapping decisions. | |||
| Robust web content extraction | | BIBAK | Full-Text | 887-888 | |
| Marek Kowalkiewicz; Maria E. Orlowska; Tomasz Kaczmarek; Witold Abramowicz | |||
| We present an empirical evaluation and comparison of two content extraction
methods in HTML: absolute XPath expressions and relative XPath expressions. We
argue that the relative XPath expressions, although not widely used, should be
used in preference to absolute XPath expressions in extracting content from
human-created Web documents. Evaluation of robustness covers four thousand
queries executed on several hundred webpages. We show that in referencing parts
of real world dynamic HTML documents, relative XPath expressions are on average
significantly more robust than absolute XPath ones. Keywords: content extraction, evaluation, robustness, wrappers | |||
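The contrast being evaluated can be made concrete with a small example (assuming the third-party lxml package; the page and expressions are illustrative, not the paper's test data): an absolute expression hard-codes positions from the document root, while a relative expression anchors on a stable attribute and survives a layout change.

```python
# Illustration of absolute vs. relative XPath robustness on a toy page;
# not the paper's evaluation harness. Requires the third-party lxml package.
from lxml import html

PAGE_V1 = """<html><body>
  <div class="ad">buy now</div>
  <div id="story"><p>Main article text.</p></div>
</body></html>"""

# Same page after a redesign: an extra banner shifts sibling positions.
PAGE_V2 = """<html><body>
  <div class="banner">breaking news</div>
  <div class="ad">buy now</div>
  <div id="story"><p>Main article text.</p></div>
</body></html>"""

ABSOLUTE = "/html/body/div[2]/p/text()"      # position-based, brittle
RELATIVE = "//div[@id='story']/p/text()"     # attribute-anchored, robust

for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
    doc = html.fromstring(page)
    print(name, "absolute:", doc.xpath(ABSOLUTE))
    print(name, "relative:", doc.xpath(RELATIVE))
# On v2 the absolute expression no longer selects the article text,
# while the relative expression still does.
```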
| Rapid prototyping of web applications combining domain specific languages and model driven design | | BIBAK | Full-Text | 889-890 | |
| Demetrius Arraes Nunes; Daniel Schwabe | |||
| There have been several authoring methods proposed in the literature that
are model based, essentially following the Model Driven Design philosophy.
While useful, such methods need an effective way to allow the application
designer to somehow synthesize the actual running application from the
specification. In this paper, we describe HyperDE, an environment that combines
Model Driven Design and Domain Specific Languages to enable rapid prototyping
of Web applications. Keywords: hypermedia authoring, model-based designs | |||
| A pruning-based approach for supporting Top-K join queries | | BIBAK | Full-Text | 891-892 | |
| Jie Liu; Liang Feng; Yunpeng Xing | |||
| An important issue arising from large scale data integration is how to
efficiently select the top-K ranking answers from multiple sources while
minimizing the transmission cost. This paper resolves this issue by proposing
an efficient pruning-based approach to answer top-K join queries. The total
amount of transmitted data can be greatly reduced by pruning tuples that cannot
produce the desired join results with a rank value greater than or equal to
the rank value generated so far. Keywords: join query, prune, top-K | |||
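The pruning principle described here can be illustrated with a minimal sketch (not the paper's algorithm or cost model): a remote tuple is only transmitted if an upper bound on its combined rank can still beat the k-th best join result found so far.

```python
# Minimal sketch of score-bound pruning for a distributed top-K join;
# hypothetical scoring, not the paper's algorithm.
import heapq

def topk_join(local, remote, k):
    """local/remote: lists of (join_key, score); higher combined score is better."""
    local_by_key = {}
    for key, score in local:
        local_by_key.setdefault(key, []).append(score)

    topk = []          # min-heap of the k best combined scores seen so far
    transmitted = 0
    for key, r_score in sorted(remote, key=lambda t: -t[1]):  # best remote first
        threshold = topk[0] if len(topk) == k else float("-inf")
        # Upper bound on what this remote tuple could still contribute.
        best_possible = r_score + max(local_by_key.get(key, []),
                                      default=float("-inf"))
        if best_possible <= threshold:
            continue                      # pruned: never transmitted
        transmitted += 1
        for l_score in local_by_key.get(key, []):
            combined = l_score + r_score
            if len(topk) < k:
                heapq.heappush(topk, combined)
            elif combined > topk[0]:
                heapq.heapreplace(topk, combined)
    return sorted(topk, reverse=True), transmitted

results, sent = topk_join([("a", 5), ("b", 3)], [("a", 4), ("b", 1), ("c", 9)], k=2)
print(results, sent)   # only tuples that could affect the top-2 are transmitted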
| Towards DSL-based web engineering | | BIBAK | Full-Text | 893-894 | |
| Martin Nussbaumer; Patrick Freudenstein; Martin Gaedke | |||
| Strong user involvement and clear business objectives, both relying on
efficient communication between the developers and the business, are key
factors for a project's success. Domain-Specific Languages (DSLs) being simple,
highly-focused and tailored to a clear problem domain are a promising
alternative to heavy-weight modeling approaches in the field of Web
Engineering. Thus, they enable stakeholders to validate, modify and even
develop parts of a distributed Web-based solution. Keywords: DSL, conceptual modeling, web engineering, web services | |||
| Capturing the essentials of federated systems | | BIBAK | Full-Text | 895-896 | |
| Johannes Meinecke; Martin Gaedke; Frederic Majer; Alexander Brändle | |||
| Today, the Web is increasingly used as a platform for distributed services,
which transcend organizational boundaries to form federated applications.
Consequently, there is a growing interest in the architectural aspect of
Web-based systems, i.e. the composition of the overall solution into individual
Web applications and Web services from different parties. The design and
evolution of federated systems calls for models that give an overview of the
structural as well as trust-specific composition and reflect the technical
details of the various accesses. We introduce the WebComposition Architecture
Model (WAM) as an overall modeling approach tailored to aspects of highly
distributed systems with federation as an integral factor. Keywords: architecture, federation, modeling, security, web services | |||
| From adaptation engineering to aspect-oriented context-dependency | | BIBAK | Full-Text | 897-898 | |
| Sven Casteleyn; Zoltán Fiala; Geert-Jan Houben; Kees van der Sluijs | |||
| The evolution of the Web requires considering an increasing number of
context-dependency issues. Therefore, in our research we focus on how to extend
a Web application with additional adaptation concerns without having to
redesign the entire application. Based on a generic transcoding tool we
illustrate here how we can add adaptation functionality to an existing Web
application. Furthermore, we consider how an aspect-oriented approach can
support the high-level specification of such additional concerns in the design
of the Web application. Keywords: adaptation, aspect-oriented programming, component-based web engineering,
web engineering | |||
| Living the TV revolution: unite MHP to the web or face IDTV irrelevance! | | BIBAK | Full-Text | 899-900 | |
| Stefano Ferretti; Marco Roccetti; Johannes Andrich | |||
| The union of Interactive Digital TV (IDTV) and Web promotes the development
of new interactive multimedia services, enjoyable while watching TV even on the
new handheld digital TV receivers. Yet, several design constraints complicate
the deployment of this new pattern of services. Indeed, for a suitable
presentation on a TV set, Web contents must be structured in such a way that
they can be effectively displayed on TV screens via low-end Set Top Boxes
(STBs). Moreover, usable interfaces for IDTV platforms are needed which ensure
a smooth access to contents. Our claim is that the distribution of Web contents
over the IDTV broadcast channels may bring IDTV to new life. A failure of
this attempt may put IDTV on a progressive track towards irrelevance. We
propose a system for the distribution of Web contents towards IDTV under the
Digital Video Broadcasting -- Multimedia Home Platform (DVB-MHP) standard. Our
system is able to automatically transcode Web contents and ensure a proper
visualization on IDTV. The system is endowed with a client application which
allows users to easily browse contents on the TV via a remote control. Real-world
assessments have confirmed the effectiveness of such an automatic online
service, able to reconfigure Web contents for appropriate distribution and
presentation on IDTV. Keywords: DVB, IDTV, MHP, web contents transcoding | |||
| Using graph matching techniques to wrap data from PDF documents | | BIBAK | Full-Text | 901-902 | |
| Tamir Hassan; Robert Baumgartner | |||
| Wrapping is the process of navigating a data source, semi-automatically
extracting data and transforming it into a form suitable for data processing
applications. There are currently a number of established products on the
market for wrapping data from web pages. One such approach is Lixto [1], a
product of research performed at our institute.
Our work is concerned with extending the wrapping functionality of Lixto to PDF documents. As the PDF format is relatively unstructured, this is a challenging task. We have developed a method to segment the page into blocks, which are represented as nodes in a relational graph. This paper describes our current research in the use of relational matching techniques on this graph to locate wrapping instances. Keywords: PDF, document understanding, graph matching, logical structure, wrapping | |||
| Requirements for multimedia document enrichment | | BIBAK | Full-Text | 903-904 | |
| Ajay Chakravarthy; Vitaveska Lanfranchi; Fabio Ciravegna | |||
| Nowadays a large and growing percentage of information is stored in various
multimedia formats. In order for multimedia information to be efficiently
utilised by users, it is very important to add suitable metadata. In this paper
we will present AKTiveMedia, a tool for enriching multimedia documents with
semantic information. Keywords: multimedia enrichment, semantic annotation interfaces | |||
| DiTaBBu: automating the production of time-based hypermedia content | | BIBAK | Full-Text | 905-906 | |
| Rui Lopes; Luís Carriço; Carlos Duarte | |||
| We present DiTaBBu, Digital Talking Books Builder, a framework for automatic
production of time-based hypermedia for the Web, focusing on the Digital
Talking Books domain. Delivering Digital Talking Books collections to a wide
range of users is an expensive task, as it must take into account each user
profile's different needs; therefore, manual authoring should be dismissed in favour of
automation. With DiTaBBu, we enable automated content delivery on several
playback platforms, targeted to specific user needs, featuring powerful
navigation capabilities over the content. DiTaBBu can also be used as testbed
for prototyping novel capabilities, through its flexible extension mechanisms. Keywords: accessibility, automatic presentation generation, digital talking books,
ditabbu, hypermedia, multimodality | |||
| Capturing RIA concepts in a web modeling language | | BIBAK | Full-Text | 907-908 | |
| Alessandro Bozzon; Sara Comai; Piero Fraternali; Giovanni Toffetti Carughi | |||
| This work addresses conceptual modeling and automatic code generation for
Rich Internet Applications, a variant of Web-based systems bridging the gap
between desktop and Web interfaces. The approach we propose is a first step
towards a full integration of RIA paradigms into the Web development process,
enabling the specification of complex Web solutions mixing HTTP+HTML and Rich
Internet Applications, using a single modeling language and tool. Keywords: rich internet applications, web engineering, web site design | |||
| Generation of multimedia TV news contents for WWW | | BIBAK | Full-Text | 909-910 | |
| Hsin Chia Fu; Yeong Y. Xu; C. L. Tseng | |||
| In this paper, we present a system we have developed for automatic TV News
video indexing that successfully combines results from the fields of speaker
verification, acoustic analysis, very large vocabulary video OCR, content based
sampling of video, information retrieval, dialogue systems, and ASF media
delivery over IP. The prototype of TV news content processing Web was completed
in July 2003. Since then, the system has been up and running continuously. As of
the date of writing (March 27, 2006), the system has recorded and
analyzed the prime-time evening news program in Taiwan every day over these
years, except for a few power-failure shutdowns. The TV news web is at
http://140.113.216.64/NewsQuery/main.as Keywords: TV news, content analysis, information retrieval, video OCR | |||
| Proposal of integrated search engine of web and TV contents | | BIBAK | Full-Text | 911-912 | |
| Hisashi Miyamori; Mitsuru Minakuchi; Zoran Stejic; Qiang Ma; Tadashi Araki; Katsumi Tanaka | |||
| A search engine that can handle TV programs and Web content in an integrated
way is proposed. Conventional search engines have been able to handle Web
content and/or data stored in a PC desktop as target information. In the
future, however, the target information is expected to be stored in various
places such as in hard-disk (HD)/DVD recorders, digital cameras, mobile
devices, and even in real space as ubiquitous content, and a search engine that
can search across such heterogeneous resources will become essential.
Therefore, as a first step towards developing such a next-generation search
engine, a prototype search system for Web and TV programs is developed that
performs integrated search of that content, and that allows chain search where
related content can be accessed from each search result. The integrated search
is achieved by generating integrated indices for Web and TV content based on
vector space model and by computing similarity between the query and all the
content described by the indices. The chain search of related content is done
by computing similarity between the selected result and all other content based
on the integrated indices. Also, the zoom-based display of the search results
enables users to control the media transition and the level of detail of the contents to
acquire information efficiently. In this paper, testing of a prototype of the
integrated search engine validated the approach taken by the proposed method. Keywords: TV programs, chain search, information integration, information retrieval,
integrated search, search engine, web content | |||
| Using semantic rules to determine access control for web services | | BIBAK | Full-Text | 913-914 | |
| Brian Shields; Owen Molloy; Gerard Lyons; Jim Duggan | |||
| Semantic Web technologies are being increasingly employed to solve knowledge
management issues in traditional Web technologies. This paper follows that
trend and proposes using Semantic Web rule languages to
define access control rules for Web Services. Using these rules, a system
will be able to manage access to Web Services and also the information accessed
via these services. Keywords: OWL, SWRL, authorisation, web service security | |||
| Strong authentication in web proxies | | BIBAK | Full-Text | 915-916 | |
| Domenico Rotiroti | |||
| In this paper we present a way to integrate web proxies with smart card
based authentication systems. Keywords: HTTP, proxy, smart card | |||
| Safeguard against unicode attacks: generation and applications of UC-simlist | | BIBAK | Full-Text | 917-918 | |
| Anthony Y. Fu; Wan Zhang; Xiaotie Deng; Liu Wenyin | |||
| A severe potential security problem in the utilization of Unicode on the Web is
identified, which results from the fact that there are many similar
characters in the Universal Character Set (UCS). The foundation of our solution
relies on evaluating the similarity of characters in UCS. We develop a solution
based on the renowned Kernel Density Estimation (KDE) method to establish such
a Unicode Similarity List (UC-SimList). Keywords: phishing, secure web identity, unicode | |||
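How such a similarity list might be applied can be shown with a toy example; the confusable pairs below are hand-picked illustrations, not the KDE-derived UC-SimList.

```python
# Toy illustration of applying a character-similarity list to flag spoofed
# identifiers. The pairs below are hand-picked examples, not the KDE-derived
# UC-SimList from the paper.
UC_SIMLIST = {
    "а": "a",   # CYRILLIC SMALL LETTER A vs LATIN a
    "е": "e",   # CYRILLIC SMALL LETTER IE vs LATIN e
    "о": "o",   # CYRILLIC SMALL LETTER O vs LATIN o
    "р": "p",   # CYRILLIC SMALL LETTER ER vs LATIN p
    "１": "1",  # FULLWIDTH DIGIT ONE vs ASCII 1
}

def skeleton(s: str) -> str:
    """Map each character to its visually similar canonical form."""
    return "".join(UC_SIMLIST.get(ch, ch) for ch in s)

def looks_like(candidate: str, trusted: str) -> bool:
    """Flag a candidate identifier that renders like a trusted one."""
    return candidate != trusted and skeleton(candidate) == skeleton(trusted)

print(looks_like("pаypal.com", "paypal.com"))   # True: contains a Cyrillic 'а'
print(looks_like("example.com", "paypal.com"))  # False
```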
| Efficient edge-services for colorblind users | | BIBK | Full-Text | 919-920 | |
| Gennaro Iaccarino; Delfina Malandrino; Marco Del Percio; Vittorio Scarano | |||
Keywords: colorblindness, edge services, vision, web accessibility | |||
| A user profile-based approach for personal information access: shaping your information portfolio | | BIBAK | Full-Text | 921-922 | |
| Lo Ka Kan; Xiang Peng; Irwin King | |||
| With the spread of the Internet, Internet-based information service businesses have
started to become profitable. One of the key technologies is personalization.
Successful Internet information services must realize personalized information
delivery, by which the users can automatically receive highly tuned information
according to their personal needs and preferences. In order to realize such
personalized information services, we have developed automatic user
preference capture and automatic information clipping functions based on a
Personalized Information Access technique. In this paper, those techniques will
be demonstrated by showing a deployed personalized webpage service application. Keywords: information retrieval, internet behavior, personal information access,
system, user profile | |||
| Finding visual concepts by web image mining | | BIBAK | Full-Text | 923-924 | |
| Keiji Yanai; Kobus Barnard | |||
| We propose measuring "visualness" of concepts with images on the Web, that
is, to what extent concepts have visual characteristics. This is a new application
of "Web image mining". Knowing which concepts have visually discriminative power
is important for image recognition, since not all concepts are related to
visual contents. Mining image data on the Web with our method makes this possible. Our
method performs probabilistic region selection for images and computes an
entropy measure which represents "visualness" of concepts. In the experiments,
we collected about forty thousand images from the Web for 150 concepts. We
examined which concepts are suitable for annotation of image contents. Keywords: image recognition, probabilistic method, web image mining | |||
| Deriving wishlists from blogs: show us your blog, and we'll tell you what books to buy | | BIBAK | Full-Text | 925-926 | |
| Gilad Mishne; Maarten de Rijke | |||
| We use a combination of text analysis and external knowledge sources to
estimate the commercial taste of bloggers from their text; our methods are
evaluated using product wishlists found in the blogs. Initial results are
promising, showing that valuable insights can be mined from blogs, not just at
the aggregate but also at the individual blog level. Keywords: amazon, blogs, wishlists | |||
| Relationship between web links and trade | | BIBAK | Full-Text | 927-928 | |
| Ricardo Baeza-Yates; Carlos Castillo | |||
| We report on observations from Web characterization studies that suggest that
the amount of Web links among sites under different country-code top-level
domains is related to the amount of trade between the corresponding countries. Keywords: national web domains, world trade graph | |||
| System for spatio-temporal analysis of online news and blogs | | BIBAK | Full-Text | 929-930 | |
| Angelo Dalli | |||
| Previous work on spatio-temporal analysis of news items and other documents
has largely focused on broad categorization of small text collections by region
or country. A system for large-scale spatio-temporal analysis of online news
media and blogs is presented, together with an analysis of global news media
coverage over a nine year period. We demonstrate the benefits of using a
hierarchical geospatial database to disambiguate between geographical named
entities, and provide results for an extremely fine-grained analysis of news
items. Aggregate maps of media attention for particular places around the world
are compared with geographical and socio-economic data. Our analysis suggests
that GDP per capita is the best indicator for media attention. Keywords: blogs, disambiguation of geographical named entities, geolocation, media
attention, news, social behavior, spatio-temporal | |||
| Extracting news-related queries from web query log | | BIBAK | Full-Text | 931-932 | |
| Michael Maslov; Alexander Golovko; Ilya Segalovich; Pavel Braslavski | |||
| In this poster, we present a method for extracting queries related to
real-life events, or news-related queries, from large web query logs. The
method employs query frequencies and search over a collection of recent news.
News-related queries can be helpful for disambiguating user information needs,
as well as for effective online news processing. The evaluation
shows that the method yields good precision. Keywords: query log analysis, web search | |||
| Visually guided bottom-up table detection and segmentation in web documents | | BIBAK | Full-Text | 933-934 | |
| Bernhard Krüpl; Marcus Herzog | |||
| In the AllRight project, we are developing an algorithm for unsupervised
table detection and segmentation that uses the visual rendition of a Web page
rather than the HTML code. Our algorithm works bottom-up by grouping word
bounding boxes into larger groups and uses a set of heuristics. It has already
been implemented and a preliminary evaluation on about 6000 Web documents has
been carried out. Keywords: table detection, web information extraction | |||
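A simplified sketch of the bottom-up grouping idea follows; the tolerances and the table heuristic are hypothetical, not the AllRight implementation.

```python
# Simplified sketch of bottom-up grouping of word bounding boxes into table
# rows and columns; tolerances and heuristics are hypothetical, not AllRight's.
from collections import defaultdict

# A word box is a tuple (x0, y0, x1, y1, text) in rendered-page coordinates.

def group_rows(boxes, y_tol=5):
    """Group boxes whose vertical centres are within y_tol pixels."""
    rows = []
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        yc = (box[1] + box[3]) / 2
        if rows and abs(yc - rows[-1]["yc"]) <= y_tol:
            rows[-1]["boxes"].append(box)
        else:
            rows.append({"yc": yc, "boxes": [box]})
    return [sorted(r["boxes"], key=lambda b: b[0]) for r in rows]

def column_signature(row, x_tol=10):
    """Quantize left edges so aligned columns across rows share the same key."""
    return tuple(round(b[0] / x_tol) for b in row)

def looks_like_table(rows, min_rows=2, min_cols=2):
    """Heuristic: several rows sharing the same multi-column signature."""
    counts = defaultdict(int)
    for row in rows:
        if len(row) >= min_cols:
            counts[column_signature(row)] += 1
    return any(c >= min_rows for c in counts.values())

boxes = [(10, 10, 60, 20, "Name"), (110, 10, 160, 20, "Price"),
         (10, 30, 60, 40, "Tea"),  (110, 30, 160, 40, "2.50")]
print(looks_like_table(group_rows(boxes)))   # True for this aligned 2x2 grid
```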
| Generating maps of web pages using cellular automata | | BIBAK | Full-Text | 935-936 | |
| Hanene Azzag; David Ratsimba; David Da Costa; Gilles Venturini; Christiane Guinot | |||
| The aim of web pages visualization is to present in a very informative and
interactive way a set of web documents to the user in order to let him or her
navigate through these documents. In the web context, this may correspond to
several user's tasks: displaying the results of a search engine, or visualizing
a graph of pages such as a hypertext or a surf map. In addition to web pages
visualization, web pages clustering also greatly improves the amount of
information presented to the user by highlighting the similarities between the
documents [6]. In this paper we explore the use of a cellular automata (CA) to
generate such maps of web pages. Keywords: cellular automata, unsupervised clustering, visualization, web pages | |||
| BuzzRank ... and the trend is your friend | | BIBAK | Full-Text | 937-938 | |
| Klaus Berberich; Srikanta Bedathur; Michalis Vazirgiannis; Gerhard Weikum | |||
| Ranking methods like PageRank assess the importance of Web pages based on
the current state of the rapidly evolving Web graph. The dynamics of the
resulting importance scores, however, have not been considered yet, although
they provide the key to an understanding of the Zeitgeist on the Web. This
paper proposes the BuzzRank method that quantifies trends in time series of
importance scores and is based on a relevant growth model of importance scores.
We experimentally demonstrate the usefulness of BuzzRank on a bibliographic
dataset. Keywords: pagerank, web dynamics, web graph | |||
| Detecting nepotistic links by language model disagreement | | BIBAK | Full-Text | 939-940 | |
| András A. Benczúr; István Bíró; Károly Csalogány; Máté Uher | |||
| In this short note we demonstrate the applicability of hyperlink
downweighting by means of language model disagreement. The method filters out
hyperlinks with no relevance to the target page without the need of white and
blacklists or human interaction. We fight various forms of nepotism such as
common maintainers, ads, link exchanges or misused affiliate programs. Our
method is tested on a 31 M page crawl of the .de domain with a manually
classified 1000-page random sample. Keywords: anchor text, language modeling, link Spam | |||
| The distribution of pageRank follows a power-law only for particular values of the damping factor | | BIBAK | Full-Text | 941-942 | |
| Luca Becchetti; Carlos Castillo | |||
| We show that the empirical distribution of the PageRank values in a large
set of Web pages does not follow a power-law except for some particular choices
of the damping factor. We argue that for a graph with an in-degree distribution
following a power-law with exponent between 2.1 and 2.2, choosing a damping
factor around 0.85 for PageRank yields a power-law distribution of its values.
We suggest that power-law distributions of PageRank in Web graphs have been
observed because the typical damping factor used in practice is between 0.85
and 0.90. Keywords: pagerank distribution, web graph | |||
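To make the role of the damping factor concrete, here is textbook PageRank power iteration on a toy graph with several damping values; this is standard PageRank, not the paper's analysis of the resulting value distribution.

```python
# Standard PageRank power iteration on a toy graph, varying the damping
# factor d; textbook PageRank, not the paper's distributional analysis.
def pagerank(out_links, d=0.85, iters=50):
    nodes = list(out_links)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Mass from dangling nodes (no out-links) is spread uniformly.
        dangling = sum(pr[v] for v in nodes if not out_links[v])
        new = {v: (1 - d) / n + d * dangling / n for v in nodes}
        for u in nodes:
            for v in out_links[u]:
                new[v] += d * pr[u] / len(out_links[u])
        pr = new
    return pr

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for d in (0.5, 0.85, 0.95):
    scores = pagerank(graph, d=d)
    print(d, {k: round(v, 3) for k, v in sorted(scores.items())})
```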
| Mining related queries from search engine query logs | | BIBAK | Full-Text | 943-944 | |
| Xiaodong Shi; Christopher C. Yang | |||
| In this work we propose a method that retrieves a list of related queries
given an initial input query. The related queries are based on the query log of
previously issued queries by human users, which can be discovered using our
improved association rule mining model. Users can use the suggested related
queries to tune or redirect the search process. Our method not only discovers
the related queries, but also ranks them according to the degree of their
relatedness. Unlike many other rival techniques, it exploits only limited query
log information and performs relatively better on queries in all frequency
divisions. Keywords: association rule, edit distance, query log, related query, web searching | |||
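A minimal sketch of session-based related-query mining with a confidence-style ranking follows; it uses plain co-occurrence counting and is not the improved association rule model proposed in the paper.

```python
# Minimal sketch of related-query mining from session co-occurrence with a
# confidence-like ranking; not the paper's improved association rule model.
from collections import Counter, defaultdict
from itertools import combinations

def mine_related(sessions, min_support=2):
    """sessions: list of lists of queries issued within one user session."""
    support = Counter()
    pair_support = defaultdict(Counter)
    for session in sessions:
        unique = set(session)
        support.update(unique)
        for q1, q2 in combinations(sorted(unique), 2):
            pair_support[q1][q2] += 1
            pair_support[q2][q1] += 1

    def related(query, top_n=5):
        ranked = [
            (other, count / support[query])      # confidence of query -> other
            for other, count in pair_support[query].items()
            if count >= min_support
        ]
        return sorted(ranked, key=lambda t: -t[1])[:top_n]

    return related

sessions = [["java", "jvm"], ["java", "jdk download"], ["java", "jvm"],
            ["python", "pip"], ["java", "jdk download"]]
related = mine_related(sessions)
print(related("java"))   # jvm and "jdk download" ranked by confidence
```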
| Discovering event evolution graphs from newswires | | BIBAK | Full-Text | 945-946 | |
| Christopher C. Yang; Xiaodong Shi | |||
| In this paper, we propose an approach to automatically mine event evolution
graphs from newswires on the Web. Event evolution graph is a directed graph in
which the vertices and edges denote news events and the evolutions between
events respectively, in a news affair. Our model utilizes the content
similarity between events and incorporates temporal proximity and document
distributional proximity as decaying functions. Our approach is effective in
presenting the inside developments of news affairs along the timeline, which
can facilitate users' information browsing tasks. Keywords: event evolution, event evolution graph, knowledge management, web content
mining | |||
| Mining clickthrough data for collaborative web search | | BIBAK | Full-Text | 947-948 | |
| Jian-Tao Sun; Xuanhui Wang; Dou Shen; Hua-Jun Zeng; Zheng Chen | |||
| This paper investigates the group behavior patterns of search
activities based on Web search history data, i.e., clickthrough data, to boost
search performance. We propose a Collaborative Web Search (CWS) framework based
on the probabilistic modeling of the co-occurrence relationship among the
heterogeneous web objects: users, queries, and Web pages. The CWS framework
consists of two steps: (1) a cube-clustering approach is put forward to
estimate the semantic cluster structures of the Web objects; (2) Web search
activities are conducted by leveraging the probabilistic relations among the
estimated cluster structures. Experiments on a real-world clickthrough data set
validate the effectiveness of our CWS approach. Keywords: clickthrough data, collaborative web search, cube-clustering | |||
| Background knowledge for ontology construction | | BIBAK | Full-Text | 949-950 | |
| Blaž Fortuna; Marko Grobelnik; Dunja Mladenič | |||
| In this paper we describe a solution for incorporating background knowledge
into the OntoGen system for semi-automatic ontology construction. This makes it
easier for different users to construct different and more personalized
ontologies for the same domain. To achieve this we introduce a word weighting
schema to be used in the document representation. The weighting schema is
learned based on the background knowledge provided by the user. It is then used by
OntoGen's machine learning and text mining algorithms. Keywords: background knowledge, semi-automatic ontology construction | |||
| Mining RDF metadata for generalized association rules: knowledge discovery in the semantic web era | | BIBAK | Full-Text | 951-952 | |
| Tao Jiang; Ah-Hwee Tan | |||
| In this paper, we present a novel frequent generalized pattern mining
algorithm, called GP-Close, for mining generalized associations from RDF
metadata. To solve the over-generalization problem encountered by existing
methods, GP-Close employs the notion of generalization closure for
systematic over-generalization reduction. Keywords: RDF mining, association rule mining | |||
| AutoTag: a collaborative approach to automated tag assignment for weblog posts | | BIBAK | Full-Text | 953-954 | |
| Gilad Mishne | |||
| This paper describes AutoTag, a tool which suggests tags for weblog posts
using collaborative filtering methods. An evaluation of AutoTag on a large
collection of posts shows good accuracy; coupled with the blogger's final
quality control, AutoTag assists both in simplifying the tagging process and in
improving its quality. Keywords: blogs, tags | |||
| Merging trees: file system and content integration | | BIBA | Full-Text | 955-956 | |
| Erik Wilde | |||
| XML is the predominant format for representing structured information inside documents, but it stops at the level of files. This makes it hard to use XML-oriented tools to process information which is scattered over multiple documents within a file system. File System XML (FSX) and its content integration provides a unified view of file system structure and content. FSX's adaptors map file contents to XML, which means that any file format can be integrated with an XML view in the integrated view of the file system. | |||
| A content and structure website mining model | | BIBAK | Full-Text | 957-958 | |
| Barbara Poblete; Ricardo Baeza-Yates | |||
| We present a novel model for validating and improving the content and
structure organization of a website. This model studies the website as a graph
and evaluates its interconnectivity in relation to the similarity of its
documents. The aim of this model is to provide a simple way for improving the
overall structure, contents and interconnectivity of a website. This model has
been implemented as a prototype and applied to several websites, showing very
interesting results. Our model is complementary to other methods of website
personalization and improvement. Keywords: web mining, website improvement | |||
| Online mining of frequent query trees over XML data streams | | BIBAK | Full-Text | 959-960 | |
| Hua-Fu Li; Man-Kwan Shan; Suh-Yin Lee | |||
| In this paper, we propose an online algorithm, called FQT-Stream (Frequent
Query Trees of Streams), to mine the set of all frequent tree patterns over a
continuous XML data stream. A new numbering method is proposed to represent the
tree structure of an XML query tree. An effective sub-tree enumeration approach
is developed to extract the essential information from the XML data stream. The
extracted information is stored in an effective summary data structure.
Frequent query trees are mined from the current summary data structure in a
depth-first-search manner. Keywords: XML, data streams, frequent query trees, online mining, web mining | |||
| Using proportional transportation similarity with learned element semantics for XML document clustering | | BIBAK | Full-Text | 961-962 | |
| Xiaojun Wan; Jianwu Yang | |||
| This paper proposes a novel approach to measuring XML document similarity by
taking into account the semantics between XML elements. The motivation of the
proposed approach is to overcome the problems of "under-contribution" and
"over-contribution" existing in previous work. The element semantics are
learned in an unsupervised way and the Proportional Transportation Similarity
is proposed to evaluate XML document similarity by modeling the similarity
calculation as a transportation problem. Experiments of clustering are
performed on three ACM SIGMOD data sets and results show the favorable
performance of the proposed approach. Keywords: XML document clustering, proportional transportation similarity | |||
| Template guided association rule mining from XML documents | | BIBAK | Full-Text | 963-964 | |
| Rahman AliMohammadzadeh; Sadegh Soltan; Masoud Rahgozar | |||
| Compared with traditional association rule mining in the structured world
(e.g. Relational Databases), mining from XML data is confronted with more
challenges due to the inherent flexibilities of XML in both structure and
semantics. The major challenges include 1) a more complicated hierarchical data
structure; 2) an ordered data context; and 3) a much bigger size for each data
element. In order to make XML-enabled association rule mining truly practical
and computationally tractable, we propose a practical model for mining
association rules from XML documents and demonstrate the usability and
effectiveness of the model through a set of experiments on real-life data. Keywords: XML, association rule mining, data mining | |||
| Automatic geotagging of Russian web sites | | BIBAK | Full-Text | 965-966 | |
| Alexei Pyalling; Michael Maslov; Pavel Braslavski | |||
| The poster describes a fast, simple, yet accurate method to associate large
amounts of web resources stored in a search engine database with geographic
locations. The method uses location-by-IP data, domain names, and
content-related features: ZIP and area codes. The novelty of the approach lies
in building location-by-IP database by using continuous IP blocks method.
Another contribution is domain name analysis. The method uses search engine
infrastructure and makes it possible to effectively associate large amounts of
search engine data with geography on a regular basis. Experiments were run on the Yandex
search engine index; the evaluation has demonstrated the efficacy of the approach. Keywords: geographic information retrieval, geotagging | |||
| Using symbolic objects to cluster web documents | | BIBAK | Full-Text | 967-968 | |
| Esteban Meneses; Oldemar Rodríguez-Rojas | |||
| Web clustering is useful for several activities on the WWW, from
automatically building web directories to improving retrieval performance.
Nevertheless, due to the huge size of the web, a linear mechanism must be
employed to cluster web documents. k-means is one classic algorithm used for
this problem. We present a variant of the vector model to be used with the
k-means algorithm. Our representation uses symbolic objects for clustering web
documents. Experiments produced positive results, and the outlook for future
work is promising. Keywords: symbolic data analysis, web clustering | |||
| Estimating required recall for successful knowledge acquisition from the web | | BIBAK | Full-Text | 969-970 | |
| Wolfgang Gatterbauer | |||
| Information on the Web is not only abundant but also redundant. This
redundancy of information has an important consequence on the relation between
the recall of an information gathering system and its capacity to harvest the
core information of a certain domain of knowledge. This paper provides a new
idea for estimating the necessary Web coverage of a knowledge acquisition
system in order to achieve a certain desired coverage of the contained core
information. Keywords: information extraction, quantitative performance measures, recall,
redundancy, web metrics | |||
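As a hedged illustration of the idea reported above (not the paper's actual estimator), suppose every core fact appears independently on k relevant pages; under that simplification, harvesting a fraction r of the pages yields expected core-information coverage c = 1 - (1 - r)^k, so the recall required for a target coverage follows directly. The sketch below assumes this toy independence model.

```python
# Illustrative only: a simplified independence model of redundancy, NOT the
# estimation method of the paper. If each core fact appears on k pages
# independently, crawling a fraction r of the pages gives expected coverage
# c = 1 - (1 - r)**k; solving for r gives the recall needed for a target c.

def required_recall(target_coverage: float, redundancy: float) -> float:
    """Recall r needed so that 1 - (1 - r)**k >= target_coverage."""
    return 1.0 - (1.0 - target_coverage) ** (1.0 / redundancy)

if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(f"redundancy k={k:2d}: recall needed for 90% coverage = "
              f"{required_recall(0.90, k):.2f}")
```

Under this toy model, a redundancy of k = 5 already brings the required recall for 90% coverage down to roughly 37%, which illustrates why redundancy lowers the Web coverage a knowledge acquisition system needs.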
| Text-based video blogging | | BIBAK | Full-Text | 971-972 | |
| Narichika Hamaguchi; Mamoru Doke; Masaki Hayashi; Nobuyuki Yagi | |||
| A video blogging system has been developed for easily producing your own
video programs that can be made available to the public in much the same way
that blogs are created. The user merely types a program script on a webpage,
the same as creating a blog, selects a direction style, and pastes in some
additional material content to create a CG-based video program that can be
openly distributed to the general public. The script, direction style, and
material content are automatically combined to create a movie file on the
server side. The movie file can then be accessed by referring to an RSS feed
and viewed on the screens of various devices. Keywords: APE, TVML, blog, vlog, web-casting | |||
| A decentralized CF approach based on cooperative agents | | BIBAK | Full-Text | 973-974 | |
| Byeong Man Kim; Qing Li; Adele E. Howe | |||
| In this paper, we propose a decentralized collaborative filtering (CF)
approach based on P2P overlay network for the autonomous agents' environment.
Experiments show that our approach is more scalable than traditional
centralized CF systems and alleviates the sparsity problem in
distributed CF. Keywords: P2P system, distributed collaborative filtering, friend network | |||
| Adaptive web sites: user studies and simulation | | BIBAK | Full-Text | 975-976 | |
| Doug Warner; Stephen D. Durbin; J. Neal Richter; Zuzana Gedeon | |||
| Adaptive web sites have been proposed to enhance ease of navigation and
information retrieval. A variety of approaches are described in the literature,
but consideration of interface presentation issues and realistic user studies
are generally lacking. We report here a large-scale study of sites with dynamic
information collections and user interests, where adaptation is based on an Ant
Colony Optimization technique. We find that most users were able to locate
information effectively without needing to perform explicit searches. The
behavior of users who did search was similar to that on Internet search
engines. Simulations based on site and user models give insight into the
adaptive behavior and correspond to observations. Keywords: adaptive web site, ant colony optimization | |||
| On a service-oriented approach for an engineering knowledge desktop | | BIBAK | Full-Text | 977-978 | |
| Sylvia C. Wong; Richard M. Crowder; Gary B. Wills | |||
| Increasingly, manufacturing companies are shifting their focus from selling
products to providing services. As a result, when designing new products,
engineers must increasingly consider the life cycle costs in addition to any
design requirements. To identify possible areas of concern, designers are
required to consult existing maintenance information from identical products.
However, in a large engineering company, the amount of information available is
significant and in wide range of formats. This paper presents a prototype
knowledge desktop suitable for the design engineer. The Engineering Knowledge
Desktop analyses and suggests relevant information from ontologically marked-up
heterogeneous web resources. It is designed using a Service-Oriented
Architecture, with an ontology to mediate between Web Services. It has been
delivered to the user community for evaluation. Keywords: semantic web, service-oriented architecture, web services | |||
| Design and development of learning management system at Universiti Putra Malaysia: a case study of e-SPRINT | | BIBAK | Full-Text | 979-980 | |
| Sidek H. A. Aziz; Aida Suraya M. Yunus; Kamariah A. Bakar; Hamidah B. Meseran | |||
| This paper reports the design and development of e-SPRINT, a Learning
Management System whose name is derived from Sistem Pengurusan Rangkaian
Integrasi Notakuliah dalam Talian -- mod Elektronik, which is currently being
implemented at Universiti Putra Malaysia (UPM). e-SPRINT was developed
using PERL (Practical Extraction and Report Language) and is supported by a
standard database in a Linux/UNIX operating system environment. The system is
currently being used to supplement and complement part of the classroom-based
teaching. This paper covers the architecture and features of the e-SPRINT
system, which consists of five main modules. Some general issues and challenges
of implementing such e-learning initiatives are also discussed. Keywords: internet, learning management system | |||
| Providing SCORM with adaptivity | | BIBAK | Full-Text | 981-982 | |
| M. Rey-López; A. Fernández-Vilas; R. Díaz-Redondo; J. Pazos-Arias | |||
| Content personalization is a very important aspect in the field of
e-learning, although current standards do not fully support it. In this paper,
we outline an extension to the ADL SCORM (Sharable Content Object Reference
Model) standard in an effort to permit suitable adaptivity based on the user's
characteristics. Applying this extension, we can create adaptable courses,
which should be personalized before being shown to the student. Keywords: AH, SCORM, adaptivity, e-learning | |||
| A framework for XML data streams history checking and monitoring | | BIBAK | Full-Text | 983-984 | |
| Alessandro Campi; Paola Spoletini | |||
| The need for formal verification arises in all fields in which sensitive
data are managed. In this context, the verification of data
streams becomes a fundamental task. The purpose of this paper is to present a
framework, based on the model checker SPIN, for the verification of data
streams.
The proposed method uses a linear temporal logic, called TRIO, to describe data constraints and properties. In order to verify them, the constraints are automatically translated into Promela, the input language of the model checker SPIN. Keywords: XML, semi-structured data, verification | |||
| The credibility of the posted information in a recommendation system based on a map | | BIBAK | Full-Text | 985-986 | |
| Koji Yamamoto; Daisuke Katagami; Katsumi Nitta; Akira Aiba; Hitoshi Kuwata | |||
| We propose a method for estimating the credibility of information posted
by users. The system displays this information on a map. Since posted
information can include subjective information from various perspectives, we
cannot trust all of the postings at face value. We propose and integrate two
factors: the user's geographic posting tendency and votes by other users. Keywords: GIS, credibility, navigation, posting, recommendation | |||
| Archiving web site resources: a records management view | | BIBAK | Full-Text | 987-988 | |
| Maureen Pennock; Brian Kelly | |||
| In this paper, we propose the use of records management principles to
identify and manage Web site resources with enduring value as records. Current
Web archiving activities, collaborative or organisational, whilst extremely
valuable in their own right, often do not and cannot incorporate requirements
for proper records management. Material collected under such initiatives
therefore may not be reliable or authentic from a legal or archival
perspective, with insufficient metadata collected about the object during its
active life, and valuable materials destroyed whilst ephemeral items are
maintained. Education, training, and collaboration between stakeholders are
integral to avoiding these risks and successfully preserving valuable Web-based
materials. Keywords: archiving web sites, best practices, records management | |||
| Geographic locations of web servers | | BIBAK | Full-Text | 989-990 | |
| Katsuko T. Nakahira; Tetsuya Hoshino; Yoshiki Mikami | |||
| The ccTLD (country code Top Level Domain) in a URL does not necessarily
point to the geographic location of the server concerned. The authors have
surveyed sample servers belonging to 60 ccTLDs in Africa, with regard to the
number of hops required to reach the target site from Japan, the response time,
and the NIC registration information of each domain. The survey has revealed
the geographical distribution of server sites as well as their connection
environments. It has been found that the percentage of offshore (out of home
country) servers is as high as 80% and more than half of these are located in
Europe. Offshore servers not only provide little benefit to the people of the
country to which each ccTLD rightly belongs but their existence also heightens
the risk of a country being unable to control them with its own policies and
regulations. Offshore servers constitute a significant aspect of the digital
divide problem. Keywords: Africa, NIC registration information, ccTLD, digital-divide, geographic
location of servers, number of hops, offshore server, response time, traceroute | |||
| Why is connectivity in developing regions expensive: policy challenges more than technical limitations? | | BIBAK | Full-Text | 991-992 | |
| Rahul Tongia | |||
| I present an analysis examining some of the causes of poor connectivity in
developing countries. Based on a techno-economic analysis and design, I show
that technical limitations per se are not the bottleneck for widespread
connectivity; rather, design, policy, and regulatory challenges dominate. Keywords: Africa, broadband, digital divide, internet and telecom access, open access,
optical fibers, techno-economics, wireless | |||
| Bilingual web page and site readability assessment | | BIBAK | Full-Text | 993-994 | |
| Tak Pang Lau; Irwin King | |||
| Readability assessment is a method to measure the difficulty of a piece of
text material, and it is widely used in the educational field to assist instructors
in preparing appropriate materials for students. In this paper, we investigate
the applications of readability assessment in Web development, such that users
can retrieve information which is appropriate to their levels. We propose a
bilingual (English and Chinese) assessment scheme for Web page and Web site
readability based on textual features, and conduct a series of experiments with
real Web data to evaluate our scheme. Experimental results show that, apart
from just indicating the readability level, the estimated score acts as a good
heuristic to figure out pages with low textual content. Furthermore, we can
obtain the overall content distribution in a Web site by studying the variation
of its readability. Keywords: Chinese, English, readability, web pages, web sites | |||
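The bilingual English/Chinese scheme of the entry above is not spelled out in the abstract. As a purely illustrative stand-in, the sketch below computes the classic Flesch Reading Ease score for English from simple textual features (sentence length and an approximate syllable count); the syllable heuristic is a rough assumption, and this is not the authors' scheme.

```python
# Illustration of a textual-feature readability score: the classic Flesch
# Reading Ease formula for English, NOT the bilingual scheme of the paper.
# The syllable counter below is a rough heuristic (groups of vowels).
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

print(flesch_reading_ease("The cat sat on the mat. It was happy."))
```

Higher scores indicate easier text; pages with very little textual content produce degenerate statistics, which matches the observation above that such scores can flag low-content pages.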
| Mobile web publishing and surfing based on environmental sensing data | | BIBK | Full-Text | 995-996 | |
| Daisuke Morikawa; Masaru Honjo; Satoshi Nishiyama; Masayoshi Ohashi | |||
Keywords: GPS, RFID, location, personalization, sensor, web browsing, web publishing | |||
| DoNet: a semantic domotic framework | | BIBAK | Full-Text | 997-998 | |
| Malcolm Attard; Matthew Montebello | |||
| In the very near future complete households will be entirely networked as a
de facto standard. In this poster we briefly describe our work in the area of
domotics, where personalization, semantics and agent technology come together.
We illustrate a home system oriented ontology and an intelligent agent based
framework for the rapid development of home control and automation. The
ever-changing nature of the home places the user in a position where he needs to be
involved and to become, through DoNet, a part of an ongoing home system
optimization process. Keywords: agents, domotics, semantic web | |||
| Web based device independent mobile map applications: the m-CHARTIS system | | BIBAK | Full-Text | 999-1000 | |
| John Garofalakis; Theofanis-Aristofanis Michail; Athanasios Plessas | |||
| A map is one of the most useful media in disseminating spatial information.
As mobile devices are becoming increasingly powerful and ubiquitous, new
possibilities to access map information are created. However, mobile devices
still face severe constraints that limit the possibilities that a mobile map
application may offer. We present the m-CHARTIS system, a device independent
mobile map application that enables mobile users to access map information from
their device. Keywords: handheld devices, mobile cartography, mobile devices, mobile map application | |||
| Context-orientated news filtering for Web 2.0 and beyond | | BIBAK | Full-Text | 1001-1002 | |
| David Webster; Weihong Huang; Darren Mundy; Paul Warren | |||
| How can we solve the problem of information overload in news syndication?
This poster outlines the path from keyword-based body text matching to
distance-measurable taxonomic tag matching, on to context scale and practical
uses. Keywords: RSS, aggregation, context, tags, Web 2.0, word senses | |||
| Efficient search for peer-to-peer information retrieval using semantic small world | | BIBAK | Full-Text | 1003-1004 | |
| Hai Jin; Xiaomin Ning; Hanhua Chen | |||
| This paper proposes a semantic overlay based on the small world phenomenon
that facilitates efficient search for information retrieval in unstructured P2P
systems. In the semantic overlay, each node maintains a number of short-range
links to semantically similar nodes, together with a small
collection of long-range links that help increase the recall rate of information
retrieval and reduce network traffic as well. Experimental results show that
our model can improve performance by 150% compared to Gnutella and by up to 60%
compared to the Interest-based model -- a similar shortcut-based search
technique. Keywords: information retrieval, peer-to-peer, semantic, small world | |||
| Semantic link based top-K join queries in P2P networks | | BIBAK | Full-Text | 1005-1006 | |
| Jie Liu; Liang Feng; Chao He | |||
| An important issue arising from Peer-to-Peer applications is how to
accurately and efficiently retrieve a set of K best matching data objects from
different sources while minimizing the number of objects that have to be
accessed. This paper resolves this issue by organizing peers in a Semantic Link
Network Overlay, where semantic links are established to denote the semantic
relationship between peers' data schemas. A query request will be routed to
appropriate peers according to the semantic link type and a lower bound of rank
function. Optimization strategies are proposed to reduce the total amount of
data transmitted. Keywords: join query, peer-to-peer, semantic link, top-K | |||
| Ontology-based legal information retrieval to improve the information access in e-government | | BIBAK | Full-Text | 1007-1008 | |
| Asunción Gómez-Pérez; Fernando Ortiz-Rodriguez; Boris Villazón-Terrazas | |||
| In this paper, we present EgoIR, an approach for retrieving legal
information based on ontologies; this approach has been developed with Legal
Ontologies to be deployed within the e-government context. Keywords: information retrieval, ontology | |||
| Oyster: sharing and re-using ontologies in a peer-to-peer community | | BIBAK | Full-Text | 1009-1010 | |
| Raul Palma; Peter Haase; Asunción Gómez-Pérez | |||
| In this paper, we present Oyster, a Peer-to-Peer system for exchanging
ontology metadata among communities in the Semantic Web. Oyster exploits
semantic web techniques in data representation, query formulation and query
result presentation to provide an online solution for sharing ontologies, thus
assisting researchers in re-using existing ontologies. Keywords: metadata, ontology, peer-to-peer, repository | |||
| GoGetIt!: a tool for generating structure-driven web crawlers | | BIBAK | Full-Text | 1011-1012 | |
| Márcio L. A. Vidal; Altigran S. da Silva; Edleno S. de Moura; João M. B. Cavalcanti | |||
| We present GoGetIt!, a tool for generating structure-driven crawlers that
requires a minimum effort from the users. The tool takes as input a sample page
and an entry point to a Web site and generates a structure-driven crawler based
on navigation patterns, i.e., sequences of link patterns the crawler has to
follow to reach pages structurally similar to the sample page. In the
experiments we have performed, structure-driven crawlers generated by GoGetIt!
were able to collect all pages that match the samples given, including those
pages added after their generation. Keywords: tree edit distance, web crawlers, web data extraction | |||
| Towards practical genre classification of web documents | | BIBAK | Full-Text | 1013-1014 | |
| George Ferizis; Peter Bailey | |||
| Classification of documents by genre is typically done using either
linguistic analysis or term frequency based techniques. The former provides
better classification accuracy than the latter but at the cost of two orders of
magnitude more computation time. While term frequency analysis requires far
fewer computational resources than linguistic analysis, it returns poor
classification accuracy when the genres are not sufficiently distinct. A method
that removes or approximates the expensive portions of linguistic analysis is
presented.
The accuracy and computation time of this method are then compared with both linguistic analysis and term frequency analysis. The results in this paper show that this method can significantly reduce the computation time of linguistic analysis while retaining an accuracy that is higher than that of term frequency analysis. Keywords: genre classification, linguistic, term frequency | |||
| Do not crawl in the DUST: different URLs with similar text | | BIBAK | Full-Text | 1015-1016 | |
| Uri Schonfeld; Ziv Bar-Yossef; Idit Keidar | |||
| We consider the problem of dust: Different URLs with Similar Text. Such
duplicate URLs are prevalent in web sites, as web server software often uses
aliases and redirections, translates URLs to some canonical form, and
dynamically generates the same page from various different URL requests. We
present a novel algorithm, DustBuster, for uncovering dust; that is, for
discovering rules for transforming a given URL to others that are likely to
have similar content. DustBuster is able to detect dust effectively from
previous crawl logs or web server logs, without examining page contents.
Verifying these rules via sampling requires fetching only a few actual web pages.
Search engines can benefit from this information to increase the effectiveness
of crawling, reduce indexing overhead as well as improve the quality of
popularity statistics such as PageRank. Keywords: duplicates, mining, rules, similarity | |||
| Community discovery and analysis in blogspace | | BIBAK | Full-Text | 1017-1018 | |
| Ying Zhou; Joseph Davis | |||
| The weblog has quickly evolved into a new information and knowledge
dissemination channel. Yet it is not easy to discover weblog communities
through keyword search. The main contribution of this paper is the study of
weblog communities from the perspective of social network analysis. We propose
a new way of collecting and preparing data for weblog community discovery. The
data collection stage focuses on gaining knowledge of the strength of social
ties between weblogs. The strength of social ties and the clustering feature of
the social network guide the discovery of weblog communities. Keywords: community, social network, social tie, weblog | |||
| PageSim: a novel link-based measure of web page similarity | | BIBAK | Full-Text | 1019-1020 | |
| Zhenjiang Lin; Michael R. Lyu; Irwin King | |||
| To find similar web pages to a query page on the Web, this paper introduces
a novel link-based similarity measure, called PageSim. In contrast to SimRank, a
recursive refinement of cocitation, PageSim can measure similarity between any
two web pages, whereas SimRank cannot in some cases. We give some intuitions behind
the PageSim model, and outline the model with mathematical definitions.
Finally, we give an example to illustrate its effectiveness. Keywords: link analysis, pagerank, search engine, similarity measure, simrank | |||
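The PageSim definitions themselves are deferred to the paper; for reference, the sketch below implements plain cocitation similarity (Jaccard overlap of in-link sets), the classical baseline that both SimRank and PageSim refine. It is not PageSim: note that pages sharing no in-links score zero here, which is exactly the kind of case the poster claims PageSim handles better. The example graph is invented.

```python
# Baseline only: plain cocitation similarity (Jaccard over in-link sets).
# This is the classical measure that SimRank and PageSim refine; it is NOT
# the PageSim definition, which the poster defers to its full model.
from typing import Dict, Set

def cocitation_similarity(in_links: Dict[str, Set[str]], a: str, b: str) -> float:
    ia, ib = in_links.get(a, set()), in_links.get(b, set())
    union = ia | ib
    return len(ia & ib) / len(union) if union else 0.0

graph_in_links = {
    "p1": {"h1", "h2", "h3"},
    "p2": {"h2", "h3"},
    "p3": {"h4"},
}
print(cocitation_similarity(graph_in_links, "p1", "p2"))  # 2/3
print(cocitation_similarity(graph_in_links, "p1", "p3"))  # 0.0: no shared in-links
```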
| Finding specification pages according to attributes | | BIBAK | Full-Text | 1021-1022 | |
| Naoki Yoshinaga; Kentaro Torisawa | |||
| This paper presents a method for finding a specification page on the web for
a given object (e.g."Titanic")and its class label (e.g."film"). A specification
page for an object is a web page which gives concise attribute-value
information about the object (e.g."director"-"James Cameron" for "Titanic"). A
simple unsupervised method using layout and symbolic decoration cues was
applied to a large number of web pages to acquire the class attributes. We used
these acquired attributes to select a representative specification page for a
given object from the web pages retrieved by a normal search engine.
Experimental results revealed that our method greatly outperformed the normal
search engine in terms of specification retrieval. Keywords: attribute acquisition, specification finding, web search | |||
| Selective hypertext induced topic search | | BIBAK | Full-Text | 1023-1024 | |
| Amit C. Awekar; Pabitra Mitra; Jaewoo Kang | |||
| We address the problem of answering broad-topic queries on the World Wide
Web. We present a link-based analysis algorithm, SelHITS, which is an
improvement over Kleinberg's HITS [2] algorithm. We introduce the concept of
virtual links to exploit the latent information in the hyperlinked environment.
We propose a novel approach to calculate hub and authority values. We also
present a selective expansion method which avoids topic drift and provides
results consistent with only one interpretation of the query, even if the query
is ambiguous. Initial experimental evaluation and user feedback show that our
algorithm indeed distills the most important and relevant pages for broad-topic
queries. We also infer that there exists a uniform notion of quality of search
results among users. Keywords: link analysis, searching, topic distillation | |||
| An audio/video analysis mechanism for web indexing | | BIBAK | Full-Text | 1025-1026 | |
| Marco Furini; Marco Aragone | |||
| The wide availability of video streams makes mechanisms for
indexing such content on the Web necessary. In this paper we focus on news
programs and we propose a mechanism that integrates low and high level video
features to provide a high level semantic description. A color/luminance
analysis is coupled with audio analysis to provide a better identification of
all the video segments that compose the video stream. Each video segment is
subjected to speech detection and is described through MPEG7 so that the
resulting metadata description can be used to index the video stream. An
experimental evaluation shows the benefits of integrating audio and video
analysis. Keywords: MPEG7-DDL, automatic speech recognition, contents indexing, shot boundary
detection, video indexing | |||
| The SOWES approach to P2P web search using semantic overlays | | BIBAK | Full-Text | 1027-1028 | |
| Christos Doulkeridis; Kjetil Nørvåg; Michalis Vazirgiannis | |||
| Peer-to-peer (P2P) Web search has gained a lot of interest lately, due to
the salient characteristics of P2P systems, namely scalability, fault-tolerance
and load-balancing. However, the lack of global knowledge in a vast and
dynamically evolving environment like the Web presents a grand challenge for
organizing content and providing efficient searching. Semantic overlay networks
(SONs) have been proposed as an approach to reduce cost and increase quality of
results, and in this paper we present an unsupervised approach for distributed
and decentralized SON construction, aiming to support efficient search
mechanisms in unstructured P2P systems. Keywords: distributed and peer-to-peer search, semantic overlay networks | |||
| Topic-oriented query expansion for web search | | BIBAK | Full-Text | 1029-1030 | |
| Shao-Chi Wang; Yuzuru Tanaka | |||
| The contribution of this paper is threefold: (1) to introduce a
topic-oriented query expansion model based on the Information Bottleneck theory
that classifies terms into distinct topical clusters in order to find
candidate terms for query expansion; (2) to define a term-term similarity
matrix that helps alleviate the term ambiguity problem; and (3) to propose
two measures, intracluster and intercluster similarity, based on the
proximity between the topics represented by two clusters, in order to evaluate
retrieval effectiveness. Results of several evaluation experiments in Web
search show that the average intracluster similarity improved by a gain of
79.1% while the average intercluster similarity decreased by a loss of
36.0%. Keywords: information bottleneck, intercluster similarity, intracluster similarity,
query expansion, term-term similarity matrix, topic-oriented | |||
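The abstract above does not give the exact definitions of its two measures. One plausible reading, sketched below under that assumption, computes intracluster and intercluster similarity as average pairwise similarities drawn from a term-term similarity matrix; the matrix values and cluster contents are made-up examples.

```python
# Hedged sketch: one plausible way to compute average intracluster and
# intercluster similarity from a term-term similarity matrix; the paper's
# exact definitions are not reproduced here.
from itertools import combinations, product
from typing import Dict, List, Tuple

Sim = Dict[Tuple[str, str], float]

def sim(matrix: Sim, a: str, b: str) -> float:
    return matrix.get((a, b), matrix.get((b, a), 0.0))

def intracluster(matrix: Sim, cluster: List[str]) -> float:
    pairs = list(combinations(cluster, 2))
    return sum(sim(matrix, a, b) for a, b in pairs) / len(pairs) if pairs else 1.0

def intercluster(matrix: Sim, c1: List[str], c2: List[str]) -> float:
    pairs = list(product(c1, c2))
    return sum(sim(matrix, a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

m = {("jaguar", "car"): 0.8, ("jaguar", "cat"): 0.7, ("car", "cat"): 0.1}
print(intracluster(m, ["jaguar", "car"]))           # 0.8
print(intercluster(m, ["jaguar", "car"], ["cat"]))  # (0.7 + 0.1) / 2 = 0.4
```

Good topical clusters should push the first number up and the second down, which is how the reported 79.1% gain and 36.0% loss are to be read.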
| Predictive modeling of first-click behavior in web-search | | BIBAK | Full-Text | 1031-1032 | |
| Maeve O'Brien; Mark T. Keane; Barry Smyth | |||
| Search engine results are usually presented in some form of text summary
(e.g., document title, some snippets of the page's content, a URL, etc). Based
on the information contained within these summaries users make relevance
judgments about what links best suit their information needs. Current research
suggests that these relevance judgments are in the service of some search
strategy. In this paper, we model two different search strategies (the
comparison and threshold strategies) and determine how well they fit data
gathered from an experiment on user search within a simulated Google
environment. Keywords: empirical tests, information navigation, information scent, link analysis,
predictive user modeling, search behavior, web evolution | |||
| Proximity within paragraph: a measure to enhance document retrieval performance | | BIBAK | Full-Text | 1033-1034 | |
| Srisupa Palakvangsa-Na-Ayudhya; John A. Keane | |||
| We created a proximity measure, called Proximity Within Paragraph (PWP),
which is based on the concept of value assignment to queried words, grouped by
associated ideas within paragraphs. Based on the WT10G dataset, a test system
comprising three test sets and fifty queries was constructed to evaluate the
effectiveness of PWP by comparing it with the existing method: Minimum Distance
Between Queried Pairs. A further experiment combines the scores obtained from
both methods and the results suggest that the combination can significantly
improve the effectiveness. Keywords: proximity measure, ranking algorithm | |||
| Finding experts and their details in e-mail corpora | | BIBAK | Full-Text | 1035-1036 | |
| Krisztian Balog; Maarten de Rijke | |||
| We present methods for finding experts (and their contact details) using
e-mail messages. We locate messages on a topic, and then find the associated
experts. Our approach is unsupervised: both the list of potential experts and
their personal details are obtained automatically from e-mail message headers
and signatures, respectively. Evaluation is done using the e-mail lists in the
W3C corpus. Keywords: e-mail processing, expert finding, expert search | |||
| Efficient query subscription processing for prospective search engines | | BIBAK | Full-Text | 1037-1038 | |
| Utku Irmak; Svilen Mihaylov; Torsten Suel; Samrat Ganguly; Rauf Izmailov | |||
| Current web search engines are retrospective in that they limit users to
searches against already existing pages. Prospective search engines, on the
other hand, allow users to upload queries that will be applied to newly
discovered pages in the future. We study and compare algorithms for efficiently
matching large numbers of simple keyword queries against a stream of newly
discovered pages. Keywords: inverted index, prospective search, query processing | |||
| Mining search engine query logs for query recommendation | | BIBAK | Full-Text | 1039-1040 | |
| Zhiyong Zhang; Olfa Nasraoui | |||
| This paper presents a simple and intuitive method for mining search engine
query logs to get fast query recommendations on a large-scale, industrial-strength
search engine. In order to get a more comprehensive solution, we
combine two methods together. On the one hand, we study and model search engine
users' sequential search behavior, and interpret this consecutive search
behavior as client-side query refinement, which should form the basis for the
search engine's own query refinement process. On the other hand, we combine
this method with a traditional content based similarity method to compensate
for the high sparsity of real query log data, and more specifically, the
shortness of most query sessions. To evaluate our method, we use one hundred
days' worth of query logs from the SINA search engine for off-line mining. Then we
analyze the evaluations of three independent editors on a query test set. Based on
their judgement, our method was found to be effective for finding related
queries, despite its simplicity. In addition to the subjective editors' rating,
we also perform tests based on actual anonymous user search sessions. Keywords: mining, query logs, recommendation, session | |||
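A minimal sketch of the session-based half of such an approach might count how often one query immediately follows another within a session and recommend the most frequent follow-ups; the content-similarity component and all of the paper's engineering are omitted, and the log data below is invented.

```python
# Simplified sketch of session-based query recommendation: count how often one
# query is immediately followed by another within a user session and recommend
# the most frequent follow-ups. Not the full method of the paper.
from collections import Counter, defaultdict
from typing import Dict, List

def build_followups(sessions: List[List[str]]) -> Dict[str, Counter]:
    followups: Dict[str, Counter] = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            if current != nxt:
                followups[current][nxt] += 1
    return followups

def recommend(followups: Dict[str, Counter], query: str, k: int = 3) -> List[str]:
    return [q for q, _ in followups.get(query, Counter()).most_common(k)]

logs = [["java", "java tutorial", "jdk download"],
        ["java", "jdk download"],
        ["python", "python tutorial"]]
table = build_followups(logs)
print(recommend(table, "java"))  # ['java tutorial', 'jdk download']
```

Blending such follow-up counts with a content-based similarity score is one way to cope with the session sparsity the abstract mentions.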
| Effective web-scale crawling through website analysis | | BIBAK | Full-Text | 1041-1042 | |
| Iván González; Adam Marcus; Daniel N. Meredith; Linda A. Nguyen | |||
| The web crawler space is often divided into two general areas: full-web
crawling and focused crawling. We present netSifter, a crawler system which
integrates features from these two areas to provide an effective mechanism for
web-scale crawling. netSifter utilizes a combination of page-level analytics
and heuristics which are applied to a sample of web pages from a given website.
These algorithms score individual web pages to determine the general utility of
the overall website. In doing so, netSifter can formulate an in-depth opinion
of a website (and the entirety of its web pages) with a relative minimum of
work. netSifter is then able to bias the future efforts of its crawl towards
higher quality websites, and away from the myriad of low quality websites and
crawler traps that litter the World Wide Web. Keywords: UIMA, crawling, netsifter, sampling, webfountain | |||
| Focused crawling: experiences in a real world project | | BIBK | Full-Text | 1043-1044 | |
| Antonio Badia; Tulay Muezzinoglu; Olfa Nasraoui | |||
Keywords: crawling, information retrieval, thesaurus, topic | |||
| Image annotation using search and mining technologies | | BIBAK | Full-Text | 1045-1046 | |
| Xin-Jing Wang; Lei Zhang; Feng Jing; Wei-Ying Ma | |||
| In this paper, we present a novel solution to the image annotation problem
which annotates images using search and data mining technologies. An accurate
keyword is required to initialize this process, and then leveraging a
large-scale image database, it 1) searches for semantically and visually
similar images, and 2) mines annotations from them. A notable advantage of this
approach is that it enables an unlimited vocabulary, which is not possible with
existing approaches. Keywords: hash indexing, image annotation, search result clustering | |||
effectiveness and efficiency of the proposed algorithm. Keywords: hash indexing, image annotation, search result clustering | |||
| Semantic web integration of cultural heritage sources | | BIBAK | Full-Text | 1047-1048 | |
| P. Sinclair; P. Lewis; K. Martinez; M. Addis; D. Prideaux | |||
| In this paper, we describe research into the use of ontologies to integrate
access to cultural heritage and photographic archives. The use of the CIDOC CRM
and CRM Core ontologies is described together with the metadata mapping
methodology. A system integrating data from four content providers will be
demonstrated. Keywords: interoperability, multimedia, ontologies, semantic web | |||
| The ODESeW 2.0 semantic web application framework | | BIBAK | Full-Text | 1049-1050 | |
| Oscar Corcho; Angel López-Cima; Asunción Gómez-Pérez | |||
| We describe the architecture of the ODESeW 2.0 Semantic Web application
development platform, which has been used to generate the internal and external
Web sites of several R&D projects. Keywords: framework, semantic web, web application | |||
| Visualizing an historical semantic web with Heml | | BIBAK | Full-Text | 1051-1052 | |
| Bruce G. Robertson | |||
| This poster presents ongoing efforts to enrich the RDF-based semantic Web
with the tools of the Historical Event Markup and Linking Project (Heml). An
experimental RDF vocabulary for Heml data is illustrated, as well as its use in
storing and querying encoded historical events. Finally, the practical use of
Heml-RDF is illustrated with a toolkit for the Piggy Bank semantic browser
plugin. Keywords: ACM proceedings, Heml, RDF, chronology, history | |||
| Beyond XML and RDF: the versatile web query language xcerpt | | BIBAK | Full-Text | 1053-1054 | |
| Benedikt Linse; Andreas Schroeder | |||
| Applications and services that access Web data are becoming increasingly
useful and widespread. Current main-stream Web query languages such as
XQuery, XSLT, or SPARQL, however, focus only on one of the different data
formats available on the Web. In contrast, Xcerpt is a versatile
semi-structured query language, i.e., a query language able to access all kinds
of Web data such as XML and RDF in the same language reusing common concepts
and language constructs. To integrate heterogeneous data and as a foundation
for Semantic Web reasoning, Xcerpt also provides rules. Xcerpt has a visual
companion language, visXcerpt, that is conceived as a mere rendering of the
(textual) query language Xcerpt using a slightly extended CSS. Both languages
are demonstrated along a realistic use case integrating XML and RDF data
highlighting interesting and unique features. Novel language constructs and
optimization techniques are currently under investigation in the Xcerpt project
(cf. http://xcerpt.org/). Keywords: RDF, XML, query languages, versatility, web, xcerpt | |||
| An ontology for internal and external business processes | | BIBAK | Full-Text | 1055-1056 | |
| Armin Haller; Eyal Oren; Paavo Kotinurmi | |||
| In this paper we introduce our multi metamodel process ontology (m3po),
which is based on various existing reference models and languages from the
workflow and choreography domain. This ontology allows the extraction of
arbitrary choreography interface descriptions from arbitrary internal workflow
models. We also report on an initial validation: we translate an IBM Websphere
MQ Workflow model into the m3po ontology and then extract an Abstract BPEL
model from the ontology. Keywords: choreography, meta model integration, ontology, workflow modelling | |||
| Automatic matchmaking of web services | | BIBK | Full-Text | 1057-1058 | |
| Sudhir Agarwal; Anupriya Ankolekar | |||
Keywords: matchmaking, semantic web services | |||
| Adding semantics to RosettaNet specifications | | BIBAK | Full-Text | 1059-1060 | |
| Paavo Kotinurmi; Tomas Vitvar | |||
| The use of Semantic Web Service (SWS) technologies have been suggested to
enable more dynamic B2B integration of heterogeneous systems and partners. We
present how we add semantics to RosettaNet specifications to enable the WSMX
SWS environment to automate mediation of messages. The benefits of applying SWS
technologies include flexibility in accepting heterogeneity in B2B
integrations. Keywords: B2B integration, XML, ontologising, RosettaNet | |||
| HTML2RSS: automatic generation of RSS feed based on structure analysis of HTML document | | BIBAK | Full-Text | 1061-1062 | |
| Tomoyuki Nanno; Manabu Okumura | |||
| We present a system to automatically generate RSS feeds from HTML documents
that consist of time-series items with date expressions, e.g., archives of
weblogs, BBSs, chats, mailing lists, site update descriptions, and event
announcements. Our system extracts date expressions, performs structure
analysis of an HTML document, and detects or generates titles from the document. Keywords: RSS, atom, document analysis, feed, syndication | |||
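The structure analysis and date-expression extraction are the substance of the entry above and are not reproduced here; the sketch below only illustrates the output side, turning already-extracted (title, link, date) items into an RSS 2.0 document with the Python standard library. The feed contents are placeholders.

```python
# Output side only: serialize already-extracted (title, link, date) items as
# an RSS 2.0 feed. The paper's HTML structure analysis and date extraction
# are not shown here.
import xml.etree.ElementTree as ET

def build_rss(channel_title: str, channel_link: str, items) -> str:
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    for title, link, pub_date in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
        ET.SubElement(item, "pubDate").text = pub_date
    return ET.tostring(rss, encoding="unicode")

print(build_rss("Example weblog", "http://example.org/",
                [("First post", "http://example.org/1",
                  "Mon, 22 May 2006 10:00:00 GMT")]))
```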
| Logical structure based semantic relationship extraction from semi-structured documents | | BIBAK | Full-Text | 1063-1064 | |
| Zhang Kuo; Wu Gang; Li JuanZi | |||
| Addressed in this paper is the issue of semantic relationship extraction
from semi-structured documents. Many research efforts have been made so far on
semantic information extraction. However, much of the previous work focuses
on detecting 'isolated' semantic information by making use of linguistic
analysis or linkage information in web pages, and limited research has been done
on extracting semantic relationships from semi-structured documents. In this
paper, we propose a method for semantic relationship extraction that uses the
logical information in the semi-structured document (a semi-structured document
usually has various types of structure information, e.g. it
may be hierarchically laid out). To the best of our knowledge,
extracting semantic relationships by using logical information has not been
investigated previously. A probabilistic approach is proposed in the
paper, and the features used in the probabilistic model are defined. Keywords: logical structure, ontology, relationship extraction, semi-structured
document | |||
| OWL FA: a metamodeling extension of OWL DL | | BIBAK | Full-Text | 1065-1066 | |
| Jeff Z. Pan; Ian Horrocks | |||
| This paper proposes OWL FA, a decidable extension of OWL DL with the
metamodeling architecture of RDFS(FA). It shows that the knowledge base
satisfiability problem of OWL FA can be reduced to that of OWL DL, and compares
the FA semantics with the recently proposed contextual semantics and HiLog
semantics for OWL. Keywords: metamodeling, ontology, reasoning | |||
| Learning and inferencing in user ontology for personalized semantic web services | | BIBAK | Full-Text | 1067-1068 | |
| Xing Jiang; Ah-Hwee Tan | |||
| Domain ontologies have been used in many Semantic Web applications. However,
few applications explore the use of ontologies for personalized services. This
paper proposes an ontology based user model consisting of both concepts and
semantic relations to represent users' interests. Specifically, we adopt a
statistical approach to learning a semantic-based user ontology model from
domain ontology and a spreading activation procedure for inferencing in the
user ontology model. We apply the methods of learning and exploiting user
ontology to a semantic search engine for finding academic publications. Our
experimental results support the efficacy of user ontology and spreading
activation theory (SAT) for providing personalized semantic services. Keywords: spreading-activation theory, user ontology | |||
| Upgrading relational legacy data to the semantic web | | BIBAK | Full-Text | 1069-1070 | |
| Jesús Barrasa Rodriguez; Asunción Gómez-Pérez | |||
| In this poster, we describe a framework composed of the R2O mapping language
and the ODEMapster processor to upgrade relational legacy data to the Semantic
Web. The framework is based on the declarative description of mappings between
relational and ontology elements and the exploitation of such mapping
descriptions by a generic processor capable of performing both massive and
query driven data upgrade. Keywords: database-to-ontology mappings, relational databases, semantic web, upgrade | |||
| How semantics make better wikis | | BIBAK | Full-Text | 1071-1072 | |
| Eyal Oren; John G. Breslin; Stefan Decker | |||
| Wikis are popular collaborative hypertext authoring environments, but they
support neither structured access nor information reuse. Adding semantic
annotations helps to address these limitations. We present an architecture for
Semantic Wikis and discuss design decisions including structured access, views,
and annotation language. We present our prototype SemperWiki that implements
this architecture. Keywords: information access, semantic annotation, semantic web, semantic wikis, wikis | |||
| Integrating ecoinformatics resources on the semantic web | | BIBAK | Full-Text | 1073-1074 | |
| Cynthia Sims Parr; Andriy Parafiynyk; Joel Sachs; Li Ding; Sandor Dornbush; Tim Finin; David Wang; Allan Hollander | |||
| We describe ELVIS (the Ecosystem Location Visualization and Information
System), a suite of tools for constructing food webs for a given location. We
express both ELVIS input and output data in OWL, thereby enabling its
integration with other semantic web resources. In particular, we describe using
a Triple Shop application to answer SPARQL queries from a collection of
semantic web documents. This is an end-to-end case study of the semantic web's
utility for ecological and environmental research. Keywords: biodiversity, ecological forecasting, food webs, invasive species,
ontologies, semantic web, service oriented design | |||
| Path summaries and path partitioning in modern XML databases | | BIBK | Full-Text | 1077-1078 | |
| Andrei Arion; Angela Bonifati; Ioana Manolescu; Andrea Pugliese | |||
Keywords: XML, XQuery processing, path partition, path summaries | |||
| Evaluating structural summaries as access methods for XML | | BIBAK | Full-Text | 1079-1080 | |
| Mirella M. Moro; Zografoula Vagena; Vassilis J. Tsotras | |||
| Structural summaries are data structures that preserve all structural
features of XML documents in a compact form. We investigate the applicability
of the most popular summaries as access methods within XML query
processing. In this context, issues like space and false positives introduced
by the summaries need to be examined. Our evaluation reveals that the
additional space required by the more precise structures is usually small and
justified by the considerable performance gains that they achieve. Keywords: precision, query processing, structural summaries | |||
| FLUX: fuzzy content and structure matching of XML range queries | | BIBAK | Full-Text | 1081-1082 | |
| Hua-Gang Li; S. Alireza Aghili; Divyakant Agrawal; Amr El Abbadi | |||
| An XML range query may impose predicates on the numerical or textual
contents of the elements and/or their respective path structures. In order to
handle content and structure range queries efficiently, an XML query processing
engine needs to incorporate effective indexing and summarization techniques to
efficiently partition the XML document and locate the results. In this paper,
we propose a dynamic summarization and indexing method, FLUX, based on Bloom
filters and B+-trees to tackle these problems. The results of our
extensive experimental evaluations indicate the efficiency of the proposed
system. Keywords: XML database, range query, xpath | |||
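FLUX's actual combination of Bloom filters with B+-trees is not reproduced here; the sketch below shows only the Bloom-filter building block, an approximate membership test that can report false positives but never false negatives, which is what makes it useful for cheaply pruning partitions that cannot contain a match. The path strings and parameters are illustrative assumptions.

```python
# Building block only: a minimal Bloom filter for approximate membership tests
# (e.g. "does this partition possibly contain the queried path?"). FLUX's
# combination of Bloom filters with B+-trees is not reproduced here.
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False positives are possible; false negatives are not.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("/book/author")
print(bf.might_contain("/book/author"))  # True
print(bf.might_contain("/book/price"))   # almost certainly False
```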