
Proceedings of the 2005 International Conference on the World Wide Web

Fullname:Proceedings of the 14th International Conference on World Wide Web: Special Interest Tracks and Posters
Editors:Allan Ellis; Tatsuya Hagino; Fred Douglis; Prabhakar Raghavan
Location:Chiba, Japan
Dates:2005-May-10 to 2005-May-14
Standard No:ISBN: 1-59593-051-5; ACM DL: Table of Contents hcibib: WWW05-2
Links:Conference Home Page
  1. WWW 2005-05-10 Volume 2
    1. Embedded web papers
    2. Panels
    3. Industrial and practical experience track paper session 1
    4. Industrial and practical experience track paper session 2
    5. Industrial and practical experience track invited talks
    6. Industrial and practical experience track panel
    7. Posters

WWW 2005-05-10 Volume 2

Embedded web papers

Need for non-visual feedback with long response times in mobile HCI BIBAKFull-Text 775-781
  Virpi Roto; Antti Oulasvirta
When browsing Web pages with a mobile device, system response times are variable and much longer than on a PC. Users must repeatedly glance at the display to see when the page finally arrives, although mobility demands a Minimal Attention User Interface. We conducted a user study with 27 participants to discover the point at which visual feedback stops reaching the user in a mobile context. In the study, we examined the deployment of attention during page loading to the phone vs. the environment in several different everyday mobility contexts, and compared these to the laboratory context. The first part of the page typically appeared on the screen in 11 seconds, but we found that the user's visual attention usually shifted away from the mobile browser between 4 and 8 seconds in the mobile context. In contrast, the continuous span of attention to the browser was more than 14 seconds in the laboratory condition. Based on our results, we recommend that mobile applications provide multimodal feedback for delays of more than four seconds.
Keywords: attention, mobile web, mobility, multimodal feedback, usability
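The recommendation above amounts to a threshold rule. As a minimal sketch (the function name and structure are our own illustration; only the 4-second mobile and 14-second laboratory thresholds come from the study's findings):

```python
def feedback_modality(elapsed_s, context="mobile"):
    """Suggest a feedback modality while a page is still loading.

    Illustrative only: in the study, visual attention in a mobile
    context drifted away after roughly 4-8 seconds, while laboratory
    users attended continuously for over 14 seconds.
    """
    threshold = 4.0 if context == "mobile" else 14.0
    # beyond the threshold, purely visual feedback no longer reaches
    # the user, so add a non-visual (audio/tactile) channel
    return "multimodal" if elapsed_s > threshold else "visual"
```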
An environment for collaborative content acquisition and editing by coordinated ubiquitous devices BIBAKFull-Text 782-791
  Yutaka Kidawara; Tomoyuki Uchiyama; Katsumi Tanaka
Digital content is stored not only by servers on the Internet, but also on various embedded devices belonging to ubiquitous networks. In this paper, we propose a content processing mechanism for use in an environment enabling collaborative acquisition of embedded digital content in real-world situations. We have developed a network management device that makes it possible to acquire embedded content using coordinated ubiquitous devices. The management device actively configures a network that includes content-providing devices and browsing devices to permit sharing of various items of digital content. We also developed a Functional web mechanism for processing embedded web content in the real world without a keyboard. This mechanism adds various functions to conventional web content; these functions are activated by messages from a Field in a content processing device. We constructed a practical prototype system, simple enough for children to use, called the "Virtual Insect Catcher". Through a test with 48 children, we demonstrated that this system can be used to acquire embedded web content, retrieve related content from the Internet, and then create new web content. We also describe the proposed mechanism and the system testing.
Keywords: RFID, embedded content, functional web, multiple device operating, ubiquitous computing, ubiquitous network


Can semantic web be made to flourish? BIBAFull-Text 792
  David Wood; Zavisa Bjelogrlic; Bernadette Hyland; Jim Hendler; Kanzaki Masahide
This panel's objective will be to discuss whether the Semantic Web can be made to grow in a "viral" manner, like the World Wide Web did in the early 1990s. The scope of the discussion will include efforts by the World Wide Web Consortium's Semantic Web Best Practices & Deployment Working Group to identify and publish best practices of Semantic Web practitioners, and the barriers to adoption of those practices by a wider community. The concept of "best practices" as it applies to a distributed, diverse and partially-defined Semantic Web will be discussed and its relevance debated. Specifically, panelists will discuss the capability of standards bodies, commercial companies and early adopters to create a viral technology.
Current trends in the integration of searching and browsing BIBAFull-Text 793
  Andrei Z. Broder; Yoelle S. Maarek; Krishna Bharat; Susan Dumais; Steve Papa; Jan Pedersen; Prabhakar Raghavan
Searching and browsing have been the two basic information discovery paradigms since the early days of the Web. More than ten years on, three schools seem to have emerged: (1) the search-centric school argues that guided navigation is superfluous, since free-form search has become so good, and the search UI so common, that users can satisfy all their needs via simple queries; (2) the taxonomy navigation school claims that users have difficulty expressing informational needs; and (3) the meta-data centric school advocates the use of meta-data for narrowing large sets of results, an approach successful in e-commerce, where it is known as "multi-faceted search". This panel brings together experts and advocates for all three schools, who will discuss these approaches and share their experiences in the field. We will ask the audience to challenge our experts with real information architecture problems.
Do we need more web performance research? BIBAFull-Text 794
  Michael Rabinovich; Giovanni Pacifici; Michele Colajanni; Krithi Ramamritham; Bruce Maggs
This panel will discuss the future and purpose of Web performance research, concentrating on the reasons for modest success in the adoption of research results in practice. The panel will in particular examine factors that hinder technology transfer in the Web performance area, consider examples of past successes and failures in this arena, and stimulate the discussion on how to make Web performance research more relevant.
Mobile multimedia services BIBAFull-Text 795
  Behzad Shahraray; Wei-Ying Ma; Avideh Zakhor; Noboru Babaguchi
This panel will mainly focus on the role that media processing can play in creating mobile communications, information, and entertainment services. A major premise of our discussion is that media processing techniques go beyond compression and can be employed to monitor, filter, convert, and repurpose information. Such automated techniques can serve to create personalized information and entertainment services in a cost-effective way, adapt existing content for consumption on mobile devices, and circumvent the inherent limitations of mobile devices. Some examples of the applications of media processing techniques for mobile service generation will be given.
On culture in a world-wide information society: toward the knowledge society -- the challenge BIBAFull-Text 796
  Alfredo M. Ronchi; Lynn Thiesmeyer; Antonella Quacchia; Georges Mihajes; Katsuhiro Onoda; Ranjit Makkuni
Starting from more than ten years of experience and achievements in online cultural content, the panel aims to provide a comprehensive view of controversial issues and unsolved problems in both the WWW and cultural communities, to stimulate lively, thoughtful, and sometimes provocative discussion. Panelists will outline the relevance of digital collections of intangible heritage and endangered archives and discuss the following topics: the "global" Web vs. the preservation of "local" cultural identities; cultural diversities and their relevance in delivering web-based services; the preservation and future of digital memories; and Web-based development and sustainability models. We expect the panelists to actively engage the audience and help them broaden their understanding of the issues. URL: http://www.medicif.org/Events/MEDICI_events/WWW2005/default.htm.
Exploiting the dynamic networking effects of the web BIBAFull-Text 797
  Ramesh Sarukkai; Soumen Chakrabarthi; Gary William Flake; Narayanan Shivakumar; Asim M. Ansari
This panel aims to explore the dynamic networking effects of the Web. Today, linkages on the Web are augmented with dynamic connectivities based on various monetization strategies: e.g. ads and sponsored links. Such linkages change the dynamics of user click/flow on the Web. The key focus of this panel is to debate whether/how such dynamic effects on the Web can be modeled and best exploited. How can we derive cooperative placement strategies that are optimal from a customer perspective? As the World Wide Web becomes more dynamic with fluid link placements guided by different factors, optimizing link placement in a cooperative fashion across the Web will be an integral and crucial component. URL: http://research.yahoo.com/workshops/www2005/NetworkingEffectsWeb/.
Querying the past, present and future: where we are and where we will be BIBAFull-Text 798
  Ling Liu; Andrei Z. Broder; Dieter Fensel; Carole Goble; Calton Pu
This panel will focus on exploring future enhancements of Web technology for active Internet-scale information delivery and dissemination. It will ask the questions of whether the current Web technology is sufficient, what can be leveraged in this endeavor, and how a combination of ideas from a variety of existing disciplines can help in meeting the new challenges of large scale information dissemination. Relevant existing technologies and research areas include: active databases, agent systems, continual queries, event Web, publish/subscribe technology, sensor and stream data management. We expect that some suggestions may be in conflict with current, well-accepted approaches.
Web engineering: technical discipline or social process? BIBAFull-Text 799
  Bebo White; David Lowe; Martin Gaedke; Daniel Schwabe; Yogesh Deshpande
This panel aims to explore the nature of the emerging Web engineering discipline. It will attempt to strongly engage with the issue of whether Web Engineering is currently, and (more saliently) should be in the future, viewed primarily as a technical design discipline with its attention firmly on the way in which Web technologies can be leveraged in the design process, or whether it should be viewed primarily as a socio-positioned discipline which focuses on the nature of the way in which projects are managed, needs are understood and users interact.
Web services considered harmful? BIBAFull-Text 800
  Rohit Khare; Jeff Barr; Mark Baker; Adam Bosworth; Tim Bray; Jeffery McManus
It has been estimated that all of the Web Services specifications and proposals ("WS-*") weigh in at several thousand pages by now. At the same time, their predecessor technologies such as XML-RPC have developed alongside other "grassroots" technologies like RSS. This debate has arguably even risen to the architectural level, contrasting "service-oriented architectures" with REST-based architectural styles. Unfortunately, the multiple overlapping specifications, standards bodies, and vendor strategies tend to obscure the very real successes of providing machine-automatable services over the Web today. This panel asks: are current community processes for developing, debating, and adopting Web Services helping or hindering the adoption of Web Services technology? URL: http://labs.commerce.net/wiki/images/1/19/CN-TR-04-05.pdf.

Industrial and practical experience track paper session 1

A personalized search engine based on web-snippet hierarchical clustering BIBAKFull-Text 801-810
  Paolo Ferragina; Antonio Gulli
In this paper we propose a hierarchical clustering engine, called SnakeT, that organizes, on the fly, the search results drawn from 16 commodity search engines into a hierarchy of labeled folders. The hierarchy offers a complementary view to the flat ranked list of results returned by current search engines. Users can navigate the hierarchy driven by their search needs. This is especially useful for informative, polysemous, and poorly specified queries.
   SnakeT is the first complete and open-source system in the literature that offers both hierarchical clustering and folder labeling with variable-length sentences. We extensively test SnakeT against all available web-snippet clustering engines and show that it achieves efficiency and efficacy close to those of the best-known engine, Vivisimo.com.
   Recently, personalized search engines have been introduced with the aim of improving search results by focusing on the users rather than on their submitted queries. We show how to plug SnakeT on top of any (unpersonalized) search engine to obtain a form of personalization that is fully adaptive, privacy preserving, scalable, and non-intrusive for the underlying search engines.
Keywords: information extraction, new search applications and interfaces, personalized web ranking, search engines, web snippets clustering
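The folder-labeling idea (grouping snippets under shared word sequences) can be conveyed with a much-simplified sketch; the real SnakeT builds a full hierarchy from variable-length gapped sentences and a knowledge base, so this bigram-based grouping is only our illustration:

```python
from collections import defaultdict

def label_folders(snippets, min_support=2):
    """Group search-result snippets into labeled folders keyed by the
    word bigrams they share; a folder survives only if enough
    snippets (min_support) support its label."""
    folders = defaultdict(list)
    for i, text in enumerate(snippets):
        words = text.lower().split()
        for a, b in zip(words, words[1:]):
            folders[a + " " + b].append(i)
    return {lbl: ids for lbl, ids in folders.items() if len(ids) >= min_support}
```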
Ranking definitions with supervised learning methods BIBAKFull-Text 811-819
  Jun Xu; Yunbo Cao; Hang Li; Min Zhao
This paper is concerned with the problem of definition search. Specifically, given a term, we retrieve definitional excerpts of the term and rank the extracted excerpts according to their likelihood of being good definitions. This is in contrast to the traditional approaches of either generating a single combined definition or simply outputting all retrieved definitions. Definition ranking is essential for the task. Methods for performing definition ranking are proposed in this paper, which formalize the problem as either classification or ordinal regression. A specification for judging the goodness of a definition is given. We employ SVM as the classification model and Ranking SVM as the ordinal regression model, so that they rank definition candidates according to their likelihood of being good definitions. Features for constructing the SVM and Ranking SVM models are defined. An enterprise search system based on this method has been developed and has been put into practical use. Experimental results indicate that the use of SVM and Ranking SVM can significantly outperform the baseline methods of using heuristic rules or employing the conventional information retrieval method of Okapi. This is true both when the answers are paragraphs and when they are sentences. Experimental results also show that SVM or Ranking SVM models trained in one domain can be adapted to another domain, indicating that generic models for definition ranking can be constructed.
Keywords: classification, ordinal regression, search of definitions, text mining, web mining, web search
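The pairwise intuition behind Ranking SVM can be illustrated with a bare-bones perceptron that learns a linear scoring function from (better, worse) candidate pairs. This stands in for the actual SVM/Ranking SVM training the paper uses, and the feature vectors below are invented for illustration:

```python
def train_pairwise(pairs, n_features, epochs=50, lr=0.1):
    """Learn weights w so that score(better) > score(worse) for each
    training pair; a pairwise-perceptron stand-in for Ranking SVM."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for good, bad in pairs:
            margin = sum(wi * (g - b) for wi, g, b in zip(w, good, bad))
            if margin <= 0:  # misranked pair: nudge w toward good - bad
                w = [wi + lr * (g - b) for wi, g, b in zip(w, good, bad)]
    return w

def score(w, x):
    """Linear score used to rank definition candidates."""
    return sum(wi * xi for wi, xi in zip(w, x))
```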
Identifying link farm spam pages BIBAKFull-Text 820-829
  Baoning Wu; Brian D. Davison
With the increasing importance of search in guiding today's web traffic, more and more effort has been spent to create search engine spam. Since link analysis is one of the most important factors in current commercial search engines' ranking systems, new kinds of spam aiming at links have appeared. Building link farms is one technique that can deteriorate link-based ranking algorithms. In this paper, we present algorithms for detecting these link farms automatically by first generating a seed set based on the common link set between incoming and outgoing links of Web pages and then expanding it. Links between identified pages are re-weighted, providing a modified web graph to use in ranking page importance. Experimental results show that we can identify most link farm spam pages and the final ranking results are improved for almost all tested queries.
Keywords: HITS, PageRank, link analysis, spam, web search engine
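The seed-generation step described above, flagging pages whose incoming and outgoing link sets overlap heavily, might be sketched as follows; the threshold value and the dictionary layout are our assumptions, not the paper's:

```python
def link_farm_seeds(in_links, out_links, threshold=2):
    """Return pages whose incoming and outgoing link sets share at
    least `threshold` common pages -- the heavy in/out overlap that
    is characteristic of link farms."""
    seeds = set()
    for page in in_links:
        common = in_links[page] & out_links.get(page, set())
        if len(common) >= threshold:
            seeds.add(page)
    return seeds
```

The full algorithm then expands this seed set and re-weights links between identified pages before re-running the ranking computation.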
The volume and evolution of web page templates BIBAKFull-Text 830-839
  David Gibson; Kunal Punera; Andrew Tomkins
Web pages contain a combination of unique content and template material, which is present across multiple pages and used primarily for formatting, navigation, and branding. We study the nature, evolution, and prevalence of these templates on the web. As part of this work, we develop new randomized algorithms for template extraction that perform approximately twenty times faster than existing approaches with similar quality. Our results show that 40-50% of the content on the web is template content. Over the last eight years, the fraction of template content has doubled, and the growth shows no sign of abating. Text, links, and total HTML bytes within templates are all growing as a fraction of total content at a rate of between 6 and 8% per year. We discuss the deleterious implications of this growth for information retrieval and ranking, classification, and link analysis.
Keywords: algorithms, boilerplate, data cleaning, data mining, templates, web mining
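A crude frequency heuristic conveys the intuition of template extraction: a fragment is template material when it recurs across most pages of a site. The paper's randomized algorithms are far more efficient than this, so the sketch below is only an illustration of the idea:

```python
from collections import Counter

def template_lines(pages, min_frac=0.8):
    """Mark a fragment as template material when it appears on at
    least `min_frac` of a site's pages; each page is a list of text
    fragments."""
    counts = Counter()
    for page in pages:
        counts.update(set(page))  # count each fragment once per page
    cutoff = min_frac * len(pages)
    return {frag for frag, c in counts.items() if c >= cutoff}
```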

Industrial and practical experience track paper session 2

The infocious web search engine: improving web searching through linguistic analysis BIBAKFull-Text 840-849
  Alexandros Ntoulas; Gerald Chao; Junghoo Cho
In this paper we present the Infocious Web search engine [23]. Our goal in creating Infocious is to improve the way people find information on the Web by resolving ambiguities present in natural language text. This is achieved by performing linguistic analysis on the content of the Web pages we index, which is a departure from existing Web search engines that return results mainly based on keyword matching. This additional step of linguistic processing gives Infocious two main advantages. First, Infocious gains a deeper understanding of the content of Web pages so it can better match users' queries with indexed documents and therefore can improve relevancy of the returned results. Second, based on its linguistic processing, Infocious can organize and present the results to the user in more intuitive ways. In this paper we present the linguistic processing technologies that we incorporated in Infocious and how they are applied in helping users find information on the Web more efficiently. We discuss the various components in the architecture of Infocious and how each of them benefits from the added linguistic processing. Finally, we experimentally evaluate the performance of a component which leverages linguistic information in order to categorize Web pages.
Keywords: concept extraction, crawling, indexing, information retrieval, language analysis, linguistic analysis of web text, natural language processing, part-of-speech tagging, phrase identification, web search engine, web searching, word sense disambiguation
How to make web sites talk together: web service solution BIBAKFull-Text 850-855
  Hoang Pham Huy; Takahiro Kawamura; Tetsuo Hasegawa
Integrating web sites to provide more efficient services is a very promising way in the Internet. For example searching house for rent based on train system or preparing a holiday with several constrains such as hotel, air ticket, etc... From resource view point, current web sites in the Internet already provide quite enough information. However, the challenge is these web sites just provide information but do not support any mechanism to exchange them. As a consequence, it is very often that a human user has to take the role to "link" several web sites by browsing each one and get the concrete information. The reason comes from a historical objective. Web sites were developed for human users browsing and so, they do not support any machine-understandable mechanism.
   Current researches in WWW environment already propose several solutions to make newly web sites become understandable to other web sites so that they can be integrated. However, the question is how to integrate existing web sites to these new one. Evidently, redeveloping all of them is an unacceptable solution. In this paper, we propose a solution of Web Service Gateway to "wrap" existing web sites in Web services. Thus, without any efforts to duplicate the Web sites code, these services inherit all features from the sites while can be enriched with other Web service features like UDDI publishing, semantic describing, etc. This proposal was developed in Toshiba with Web Service Gateway and Wrapper Generator System. By using these systems, several integrated-applications were built and they are also presented and evaluated in this paper.
Keywords: WSDL, service development, web service, web site, wrapper
Diversified SCM standard for the Japanese retail industry BIBAKFull-Text 856-863
  Koichi Hayashi; Naoki Koguro; Reki Murakami
In this paper, we present the concept of a diversified SCM (supply chain management) standard and distributed hub architecture which were used in B2B experiments for the Japanese retail industry. The conventional concept of B2B standards develops a single ideal set of business transactions to be supported. In contrast, our concept allows a wide range of diverse business transaction patterns necessary for industry supply chains. An industry develops a standard SCM model that partitions the whole supply chain into several transaction segments, each of which provides alternative business transaction patterns. For B2B collaboration, companies must agree on a collaboration configuration, which chooses the transaction alternatives from each segment. To support the development of a B2B system that executes an agreed collaboration, we introduce an SOA (service oriented architecture) based pattern called a distributed hub architecture. As a hub of B2B collaboration, it includes a complete set of services that can process every possible business transaction included in a standard SCM model. However, it does not function as a centralized service that coordinates participants. Instead, it is deployed on every participant and executes the assigned part of the supply chain collaboratively with other distributed hubs. Based on this concept, we analyzed actual business transactions in the Japanese retail industry and developed a standard SCM model, which represents more than a thousand possible transaction patterns. Based on the model, we developed an experimental system for the Japanese retail industry. The demonstration experiment involved major players in the industry including one of the largest general merchandise stores, one of the largest wholesalers, and major manufacturers in Japan.
Keywords: B2B collaboration, SOA (service oriented architecture), business process management, ebXML, retail industry, standardization, supply chain management, web services
Crawling a country: better strategies than breadth-first for web page ordering BIBAKFull-Text 864-872
  Ricardo Baeza-Yates; Carlos Castillo; Mauricio Marin; Andrea Rodriguez
This article compares several page ordering strategies for Web crawling under several metrics. The objective of these strategies is to download the most "important" pages "early" during the crawl. As the coverage of modern search engines is small compared to the size of the Web, and it is impossible to index all of the Web for both theoretical and practical reasons, it is relevant to index at least the most important pages.
   We use data from actual Web pages to build Web graphs and execute a crawler simulator on those graphs. As the Web is very dynamic, crawling simulation is the only way to ensure that all the strategies considered are compared under the same conditions. We propose several page ordering strategies that are more efficient than breadth-first search and than strategies based on partial PageRank calculations.
Keywords: scheduling policy, web crawler, web page importance
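A toy crawler simulator in the spirit of the paper's methodology might compare breadth-first order against an importance-driven order; here a greedy in-degree priority stands in for the importance signal (our simplification, not one of the paper's strategies):

```python
import heapq
from collections import deque

def crawl_order(graph, start, strategy="bfs"):
    """Simulate download order on a Web graph (dict: page -> list of
    out-links). 'bfs' is breadth-first; 'indegree' pops the frontier
    page with the most discovered in-links (priority fixed at
    discovery time -- a simplification)."""
    seen, order, indeg = {start}, [], {}
    frontier = deque([start]) if strategy == "bfs" else [(0, start)]
    while frontier:
        if strategy == "bfs":
            page = frontier.popleft()
        else:
            _, page = heapq.heappop(frontier)
        order.append(page)
        for nxt in graph.get(page, []):
            indeg[nxt] = indeg.get(nxt, 0) + 1
            if nxt not in seen:
                seen.add(nxt)
                if strategy == "bfs":
                    frontier.append(nxt)
                else:
                    heapq.heappush(frontier, (-indeg[nxt], nxt))
    return order
```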

Industrial and practical experience track invited talks

Internet search engines: past and future BIBAFull-Text 873
  Jan O. Pedersen
I will review the short history of Internet Search Engines from early first generation systems to the current crop of stock market darlings. Many of the underlying technology problems remain the same, but the business has become significantly more sophisticated and high-powered. I will touch on some of the economics driving the remarkable success of these services and make some predictions about future trends.
News in the age of the web BIBAFull-Text 874
  Krishna Bharat
One of the most exciting and successful examples of the Web impacting society is online news. The history of the news industry from print to the online medium is an interesting journey. Broadcast news transformed society by making news available instantly rather than once a day. While more channels became available, barriers to entry remained high and mainstream opinions continued to dominate. News on the net has brought in a number of valuable transformations, allowing news to be made (potentially) more accessible, diverse, democratic, personalized and interactive than before. Blogging has now made "citizen reporting" possible. As with any disruptive technology online news has both positive and negative implications, such as the threat of disinformation. Computer assisted news is a fun area of research that draws upon prior work in information retrieval, data mining and user interfaces. Given the volume of online news being generated today, the ability to find news and related facts quickly and with high relevance affects both readers and journalists. The talk will address the social implications as well as the technical challenges in the dissemination of online news, with a focus on Google News. Google News is an automated service that makes over 4,500 online, English sources searchable and browseable in real time, with an emphasis on breadth of coverage.
Technical challenges in exploiting the web as a business resource BIBAFull-Text 875
  Andrew Tomkins
In this talk, I'll describe some recent indicators suggesting that businesses are on the cusp of operational exploitation of the web as a decision support resource. From consumer research and purchasing behavior to enterprise brand tracking, intelligence gathering, and advertising, the web is suddenly on everybody's mind -- not as an exciting future possibility, but as an exploitable resource. I'll describe some technological approaches to employing this resource, talk about what's possible today, and describe some challenges for the future. As a running example, I'll cover IBM's WebFountain system: its architecture, analytical model, and applications.
DoCoMo's challenge towards new mobile services BIBAFull-Text 876
  Kiyoyuki Tsujimura
NTT DoCoMo, the provider of "i-mode" mobile Internet service, which accommodates over 40 million subscribers in Japan, is now working to create new types of mobile communications services featuring visual content and contactless IC technology.
Automatic text processing to enhance product search for on-line shopping BIBAFull-Text 877
  Gilles Vandelle
The growing eCommerce business requires an advanced way of searching for products. Buyers today use the web not only to accomplish transactions but also to search for and select products that fit their needs. Products are now global, but users want a site that uses their language when shopping. This talk will describe how Kelkoo built a solution used across Europe: the multiple European languages have been addressed with a simple linguistic approach combined with machine learning technologies. In this talk we will put the emphasis on the use of machine learning to address local diversity.

Industrial and practical experience track panel

How search engines shape the web BIBAFull-Text 879
  Byron Dom; Krishna Bharat; Andrei Broder; Marc Najork; Jan Pedersen; Yoshinobu Tonomura
The state of the web today has been and continues to be greatly influenced by the existence of web-search engines. This panel will discuss the ways in which search engines have affected the web in the past and ways in which they may affect it in the future. Both positive and negative effects will be discussed, as will potential measures to combat the latter. Besides the obvious ways in which search engines help people find content, other effects to be discussed include: the phenomenon of web-page spam, based on both text and links (e.g. link farms); the business of "Search Engine Optimization" (optimizing pages to rank highly in web-search results); and the bidded-terms business and the associated problem of click fraud, to name a few.


The anatomy of a news search engine BIBAKFull-Text 880-881
  A. Gulli
Today, news browsing and searching are among the most important Internet activities. This paper introduces a general framework for building a news search engine, describing Velthune, an academic news search engine available online.
Keywords: extraction, information, news search engines, syndication
Preferential walk: towards efficient and scalable search in unstructured peer-to-peer networks BIBAKFull-Text 882-883
  Hai Zhuge; Xue Chen; Xiaoping Sun
To improve search efficiency and reduce unnecessary traffic in Peer-to-Peer (P2P) networks, this paper proposes a trust-based probabilistic search algorithm, called preferential walk (P-Walk). Every peer ranks its neighbors according to its search experience; highly ranked neighbors are queried with higher probability. Simulation results show that P-Walk is not only efficient, but also robust against malicious behaviors. Furthermore, we measure peers' rank distribution and draw implications.
Keywords: P2P, power-law, probability, search, trust
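One step of the preferential walk can be sketched as roulette-wheel selection over the querying peer's trust scores for its neighbors; the data layout is our assumption, and the full algorithm also updates scores from search outcomes:

```python
import random

def p_walk_step(neighbors, trust, rng=random):
    """Pick the next peer to query with probability proportional to
    its trust score (roulette-wheel selection)."""
    total = sum(trust[n] for n in neighbors)
    r = rng.uniform(0, total)
    acc = 0.0
    for n in neighbors:
        acc += trust[n]
        if r <= acc:
            return n
    return neighbors[-1]  # guard against floating-point round-off
```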
Car racing through the streets of the web: a high-speed 3D game over a fast synchronization service BIBAKFull-Text 884-885
  Stefano Cacciaguerra; Stefano Ferretti; Marco Roccetti; Matteo Roffilli
The growth of the Internet brought a new age for game developers. New exciting, highly interactive Massively Multiplayer Online Games (MMOGs) may now be deployed on the Web, thanks to new scalable distributed solutions and amazing 3D graphics systems plugged directly into standard browsers. Along this line, taking advantage of a mirrored game server architecture, we developed a 3D car racing multiplayer game for use over the Web, freely inspired by Armagetron. Game servers are kept synchronized through a fast synchronization scheme that drops obsolete game events to uphold playability while preserving game state consistency. Preliminary results confirm that smart 3D spaces may be created over the Web where the magic of gaming is reproduced for the pleasure of a huge number of players. This result may be obtained only by combining highly accurate event synchronization technologies with 3D scene-graph-based rendering software.
Keywords: MMOG, scene graph, synchronization
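The event-dropping idea at the heart of the synchronization scheme can be sketched as keeping only the newest event per game entity; the event-tuple layout is our assumption, and the real scheme additionally reasons about game-state consistency:

```python
def prune_obsolete(events):
    """Keep only the newest (entity, timestamp, payload) event per
    entity, discarding obsolete ones before forwarding to mirrors."""
    latest = {}
    for ent, ts, payload in events:
        if ent not in latest or ts > latest[ent][0]:
            latest[ent] = (ts, payload)
    return [(ent, ts, p) for ent, (ts, p) in latest.items()]
```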
A fast XPATH evaluation technique with the facility of updates BIBAKFull-Text 886-887
  Ashish Virmani; Suchit Agarwal; Rahul Thathoo; Shekhar Suman; Sudip Sanyal
This paper addresses the problem of fast retrieval of data from XML documents by providing a labeling schema that can easily handle simple as well as complex XPATH queries and also provide for updates without the need for the entire document being re-indexed in the RDBMS. We introduce a new labeling schema called the "Z-Label" for efficiently processing XPATH queries involving child and descendant axes.
   The use of "Z-Label" coupled with the indexing schema provides for smooth updates in the XML document.
Keywords: Dewey indexing, XML, XPath query optimization, biaxes path expression, updates
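The keywords mention Dewey indexing; the property such labeling schemes exploit (and, presumably, the paper's Z-Label as well) is that the child and descendant XPath axes reduce to string tests on the labels, with no document traversal. A minimal sketch with plain Dewey labels like "1.3.2":

```python
def is_descendant(ancestor, label):
    """Descendant axis: the ancestor's label is a dotted prefix."""
    return label.startswith(ancestor + ".")

def is_child(parent, label):
    """Child axis: a descendant with exactly one extra component."""
    return is_descendant(parent, label) and "." not in label[len(parent) + 1:]
```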
Mapping XML instances BIBAFull-Text 888-889
  Sai Anand; Erik Wilde
For XML-based applications in general and B2B applications in particular, mapping between differently structured XML documents, to enable exchange of data, is a basic problem. A generic solution to the problem is of interest and desirable both in an academic and practical sense. We present a case study of the problem that arises in an XML based project, which involves mapping of different XML schemas to each other. We describe our approach to solving the problem, its advantages and limitations. We also compare and contrast our approach with previously known approaches and commercially available software solutions.
How much is a keyword worth? BIBAKFull-Text 890-891
  Ramesh R. Sarukkai
How much is a keyword worth? At the crux of every search is a query composed of search keywords. Sponsors bid for placement on such keywords using a variety of factors, the key ones being the relative demand for the keyword and its ability to drive customers to their site. In this paper, we explore the notion of the "worth of a keyword". We determine a keyword's worth by tying it to the end criterion that needs to be maximized. As an illustrative example, keyword searches that drive e-commerce transactions are modeled, and methods for estimating the Return On Investment/value of a keyword from the association data are discussed.
Keywords: ROI, e-commerce, optimization, search keyword valuation, sponsored keyword recommendation, sponsored listing
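The valuation idea the abstract describes can be sketched as follows; this is our own minimal illustration (all function names and figures are invented, not the paper's model), tying a keyword's worth to the transactions it drives:

```python
# Hypothetical sketch: valuing a keyword from click/conversion association
# data by tying it to the end criterion (revenue) to be maximized.

def keyword_value(clicks, conversions, revenue):
    """Expected revenue per click on the keyword."""
    if clicks == 0 or conversions == 0:
        return 0.0
    conversion_rate = conversions / clicks
    avg_order_value = revenue / conversions
    return conversion_rate * avg_order_value

def keyword_roi(clicks, conversions, revenue, cost_per_click):
    """Worth of the keyword relative to what each click costs."""
    value = keyword_value(clicks, conversions, revenue)
    return value / cost_per_click if cost_per_click else float("inf")

# An invented keyword: 1000 clicks, 50 conversions, $2500 revenue, $0.10 CPC.
print(keyword_roi(1000, 50, 2500.0, 0.10))  # → 25.0
```

Under these assumed numbers each click is worth $2.50 against a $0.10 cost, an ROI of 25.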
Predicting outcomes of web navigation BIBAKFull-Text 892-893
  Jacek Gwizdka; Ian Spence
Two exploratory studies examined the relationships among web navigation metrics, measures of lostness, and success on web navigation tasks. The web metrics were based on counts of visits to web pages, properties of the web usage graph, and similarity to an optimal path. Metrics based on similarity to an optimal path were good predictors of lostness and task success.
Keywords: compactness, lostness, path similarity, stratum, web navigation
XAR-miner: efficient association rules mining for XML data BIBAKFull-Text 894-895
  Sheng Zhang; Ji Zhang; Han Liu; Wei Wang
In this paper, we propose a framework, called XAR-Miner, for mining ARs from XML documents efficiently. In XAR-Miner, raw data in the XML document are first preprocessed and transformed into either an Indexed Content Tree (IX-tree) or multi-relational databases (Multi-DB), depending on the size of the XML document and the memory constraints of the system, for efficient data selection and AR mining. Task-relevant concepts are generalized to produce generalized meta-patterns, based on which the large ARs that meet the support and confidence levels are generated.
Keywords: XML data, association rule mining, meta-patterns
X-warehouse: building query pattern-driven data warehouses BIBAKFull-Text 896-897
  Ji Zhang; Wei Wang; Han Liu; Sheng Zhang
In this paper, we propose an approach to materializing XML data warehouses based on the frequent query patterns discovered from historical queries issued by users. The schemas of the integrated XML documents in the warehouse are built using these frequent query patterns, represented as Frequent Query Pattern Trees (FreqQPTs). Using a hierarchical clustering technique, FreqQPTs are clustered and merged to produce a specified number of integrated XML documents for actual data feeding. Maintenance issues of the data warehouse are also treated in this paper.
Keywords: XML data, data integration, data warehouse, query patterns
TotalRank: ranking without damping BIBAKFull-Text 898-899
  Paolo Boldi
PageRank is defined as the stationary state of a Markov chain obtained by perturbing the transition matrix of a web graph with a damping factor α that spreads part of the rank. The choice of α is eminently empirical, but most applications use α = 0.85; nonetheless, the selection of α is critical, and some believe that link farms may use this choice adversarially. Recent results [1] prove that the PageRank of a page is a rational function of α, and that this function can be approximated quite efficiently: this fact can be used to define a new form of ranking, TotalRank, that averages PageRanks over all possible α's. We show how this rank can be computed efficiently, and provide some preliminary experimental results on its quality and comparisons with PageRank.
Keywords: Kendall's τ, link farms, PageRank, ranking
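The averaging idea can be illustrated numerically; this is our own brute-force sketch on a toy graph (the paper derives an efficient closed-form method, which this does not reproduce): compute PageRank at many sampled damping factors and average the resulting vectors.

```python
# Sketch: TotalRank approximated by averaging PageRank over sampled
# damping factors alpha in (0, 1), on a toy three-node graph.

def pagerank(links, alpha, iters=100):
    """Power iteration on {node: [outlinks]}; dangling mass spread uniformly."""
    nodes = list(links)
    n = len(nodes)
    r = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1.0 - alpha) / n for u in nodes}
        for u in nodes:
            out = links[u]
            if out:
                share = alpha * r[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:  # dangling page: distribute its rank uniformly
                for v in nodes:
                    nxt[v] += alpha * r[u] / n
        r = nxt
    return r

def total_rank(links, samples=50):
    """Midpoint-rule average of PageRank over alpha in (0, 1)."""
    nodes = list(links)
    acc = {u: 0.0 for u in nodes}
    for i in range(samples):
        alpha = (i + 0.5) / samples
        pr = pagerank(links, alpha)
        for u in nodes:
            acc[u] += pr[u] / samples
    return acc

toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(total_rank(toy))
```

The ranks still sum to one, and the ordering no longer depends on any single choice of α, which is the point of TotalRank.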
MemoSpace: a visualization tool for web navigation BIBAKFull-Text 900-901
  Jacqueline Waniek; Holger Langner; Falk Schmidsberger
A central aspect of reducing orientation problems in web navigation is the design of adequate navigation aids. Visualizing a user's navigation path in the form of a temporal-spatial template can act as an external memory of the user's search history, supporting the user in finding previously visited sites, getting an overview of the search process and, moreover, providing structure for the complex World Wide Web (WWW) environment. This paper presents MemoSpace, an application for dynamic two- and three-dimensional visualization of users' navigation paths. In an explorative study, users' behavior with and subjective evaluation of a MemoSpace application were examined.
Keywords: MemoSpace, navigation
The indexable web is more than 11.5 billion pages BIBAKFull-Text 902-903
  A. Gulli; A. Signorini
In this short paper we estimate the size of the public indexable web at 11.5 billion pages. We also estimate the overlap and the index size of Google, MSN, Ask/Teoma and Yahoo!
Keywords: index sizes, search engines, size of the web
A language for expressing user-context preferences in the web BIBAKFull-Text 904-905
  Juan Ignacio Vázquez; Diego López de Ipiña
In this paper, we introduce WPML (WebProfiles Markup Language) for expressing user-context preference information on the Web. Using WPML, a service provider can negotiate and obtain user-related information to personalise the service experience without explicit manual configuration by the user, while preserving the user's privacy using P3P.
Keywords: HTTP, ambient intelligence, context-aware, cookies, profiles, state management, web
Retrieving multimedia web objects based on PageRank algorithm BIBAKFull-Text 906-907
  Christopher C. Yang; K. Y. Chan
Hyperlink analysis has been widely investigated to support the retrieval of Web documents in Internet search engines. It has been shown that hyperlink analysis significantly improves the relevance of search results, and these techniques have been adopted in many commercial search engines, e.g. Google. However, hyperlink analysis is mostly utilized in the ranking mechanism for Web pages only, not for other multimedia objects such as images and video. In this project, we propose a modified multimedia PageRank algorithm to support searching for multimedia objects on the Web.
Keywords: HITS, PageRank, content based retrieval, hyperlink analysis, multimedia retrieval, web search engines
Automatic generation of web portals using artificial ants BIBAKFull-Text 908-909
  Hanene Azzag; Gilles Venturini; Christiane Guinot
We present in this work a new model (named AntTree) based on artificial ants for hierarchical document clustering. This model is inspired by the self-assembly behavior of real ants. We have simulated this behavior to build a hierarchical tree-structured partitioning of a set of documents, according to the similarities between these documents. We have successfully compared our results to those obtained by ascending hierarchical clustering.
Keywords: artificial ants, hierarchical clustering, portals sites, web
Persistence in web based collaborations BIBAKFull-Text 910-911
  N. Bryan-Kinns; P. G. T. Healey; J. Lee
We outline work on web based support for group creativity. We focus on a study of the effect persistence of participants' musical contributions has on their mutual engagement.
Keywords: HCI, collaboration, creativity, music, user interfaces
Popular web hot spots identification and visualization BIBAKFull-Text 912-913
  D. Avramouli; J. Garofalakis; D. J. Kavvadias; C. Makris; Y. Panagis; E. Sakkopoulos
This work makes a two-fold contribution: it presents a software tool that analyses logfiles and visualizes popular web hot spots and, additionally, an algorithm that uses this information to identify subsets of the website that display large access patterns. Such information is extremely valuable to the site maintainer, since it indicates points that may need content intervention and/or site-graph restructuring. Experimental validation verified that the visualization tool, when coupled with algorithms that infer frequent traversal patterns, is both effective in indicating popular hot spots and efficient in doing so by using graph-based representations of popular traversals.
Keywords: access visualization, maximal forward path, usage mining
Information flow using edge stress factor BIBAKFull-Text 914-915
  Franco Salvetti; Savitha Srinivasan
This paper shows how a corpus of instant messages can be employed to detect de facto communities of practice automatically. A novel algorithm based on the concept of Edge Stress Factor is proposed and validated. Results show that this approach is fast and effective in studying collaborative behavior.
Keywords: graph clustering, social network analysis
Adaptive filtering of advertisements on web pages BIBAKFull-Text 916-917
  Babak Esfandiari; Richard Nock
We present a browser extension that dynamically learns to filter unwanted images (such as advertisements or flashy graphics) based on minimal user feedback. To do so, we apply the weighted majority algorithm, using pieces of the Uniform Resource Locators of such images as predictors. Experimental results tend to confirm that the accuracy of the predictions converges quickly to very high levels.
Keywords: advertisement filtering, interface agents, weighted majority
WEBCAP: a capacity planning tool for web resource management BIBAKFull-Text 918-919
  Sami Habib; Maytham Safar
A staggering number of multimedia applications are being introduced every day. Yet the inordinate delays encountered in retrieving multimedia documents make it difficult to use the Web for real-time applications such as educational broadcasting, video conferencing, and multimedia streaming. The problem of delivering multimedia documents in time while placing the least demand on client, network and server resources is a challenging optimization problem. WEBCAP is an ongoing project that explores applying capacity-planning techniques to manage or tune Web resources (client, network, server) for optimal or near-optimal performance, minimizing the retrieval cost while satisfying real-time constraints and available resources. The WEBCAP project consists of four software modules: object extractor, object representer, object scheduler, and system tuner. The four modules are connected serially with three feedback loops. In this paper, we focus on how to extract objects from a multimedia document and how to represent them as object and operation flow graphs while maintaining precedence relations among the objects.
Keywords: capacity-planning, multimedia, scheduling
Finding the search engine that works for you BIBAKFull-Text 920-921
  Kin F. Li; Wei Yu; Shojiro Nishio; Yali Wang
A search engine evaluation model that considers over seventy performance and feature parameters is presented. The design of a web-based system that allows the user to tailor the model to his/her own preference, and to evaluate search engines of interest, is introduced. The results presented to the user identify the most suitable search engine that suits his/her needs.
Keywords: performance evaluation, personalization, search engines
Information retrieval in P2P networks using genetic algorithm BIBAKFull-Text 922-923
  Wan Yeung Wong; Tak Pang Lau; Irwin King
Hybrid Peer-to-Peer (P2P) networks based on the direct connection model have two shortcomings: high bandwidth consumption and poor semi-parallel search. Both can be alleviated by the query propagation model. In this paper, we propose a novel query routing strategy, called GAroute, based on the query propagation model. Given the current P2P network topology and the relevance level of each peer, GAroute returns a list of query routing paths that cover as many relevant peers as possible. We model this as the Longest Path Problem in a directed graph, which is NP-complete, and obtain high-quality (0.95 in 100 peers) approximate solutions in polynomial time using a Genetic Algorithm (GA). We describe the problem modeling and the proposed GA for finding long paths. Finally, we summarize experimental results that measure the scalability and quality of different search algorithms. According to these results, GAroute works well in large-scale P2P networks.
Keywords: P2P, genetic algorithm, longest path problem, query routing
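The evolutionary search for long paths can be sketched as follows; this is a heavily simplified illustration under our own assumptions (truncation selection and a cut-and-regrow mutation, not the paper's operators), evolving simple paths whose fitness is the number of relevant peers covered:

```python
# Simplified GA sketch: evolve simple paths in a directed graph to cover
# as many relevant peers as possible (in the spirit of GAroute).

import random

def random_path(graph, start):
    """Random simple path from start (no node revisited)."""
    path, seen = [start], {start}
    while True:
        nxt = [v for v in graph.get(path[-1], []) if v not in seen]
        if not nxt:
            return path
        v = random.choice(nxt)
        path.append(v)
        seen.add(v)

def mutate(graph, path):
    """Cut the path at a random point and regrow it randomly."""
    head = path[:random.randrange(1, len(path) + 1)]
    seen = set(head)
    while True:
        nxt = [v for v in graph.get(head[-1], []) if v not in seen]
        if not nxt:
            return head
        v = random.choice(nxt)
        head.append(v)
        seen.add(v)

def ga_route(graph, start, relevant, pop_size=30, generations=50):
    fitness = lambda p: sum(1 for v in p if v in relevant)
    pop = [random_path(graph, start) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]           # truncation selection
        pop = survivors + [mutate(graph, random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return max(pop, key=fitness)

g = {"s": ["a", "b"], "a": ["c"], "b": ["d"], "c": ["d"], "d": []}
print(ga_route(g, "s", relevant={"a", "c", "d"}))
```

On this toy topology the GA settles on the route s, a, c, d, which covers all three relevant peers, whereas the alternative branch through b covers only one.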
An investigation of cloning in web applications BIBAKFull-Text 924-925
  Damith C. Rajapakse; Stan Jarzabek
Cloning (ad hoc reuse by duplication of design or code) speeds up development, but also hinders future maintenance. Cloning also hints at reuse opportunities that, if exploited systematically, might have positive impact on development and maintenance productivity. Unstable requirements and tight schedules pose unique challenges for Web Application engineering that encourage cloning. We are conducting a systematic study of cloning in Web Applications of different sizes, developed using a range of Web technologies, and serving diverse purposes. Our initial results show cloning rates up to 63% in both newly developed and already maintained Web Applications. Expected contribution of this work is two-fold: (1) to confirm potential benefits of reuse-based methods in addressing clone related problems of Web engineering, and (2) to create a framework of metrics and presentation views to be used in other similar studies.
Keywords: clone analysis, clone metrics, clones, software maintenance, software reuse, web applications, web engineering
A more precise model for web retrieval BIBAKFull-Text 926-927
  Junli Yuan; Hung Chi; Qibin Sun
Most research on web retrieval latency is object-level based, which we believe is insufficient and sometimes inaccurate. In this paper, we propose a fine-grained, operation-level Web Retrieval Dependency Model (WRDM) that captures the web retrieval process more precisely. Our model reveals new factors in web retrieval that cannot be seen at the object level but are very important to studies in the web retrieval area.
Keywords: dependency, latency, model, performance, web retrieval
Extracting semantic structure of web documents using content and visual information BIBAKFull-Text 928-929
  Rupesh R. Mehta; Pabitra Mitra; Harish Karnick
This work provides a page segmentation algorithm that uses both visual and content information to extract the semantic structure of a web page. The visual information is exploited using the VIPS algorithm and the content information using a pre-trained naive Bayes classifier. The output of the algorithm is a semantic structure tree whose leaves represent segments with unique topics; the contents of the leaf segments may, however, be physically distributed across the web page. This structure can be useful in many web applications such as information retrieval, information extraction and automatic web page adaptation. The algorithm is expected to outperform existing page segmentation algorithms since it utilizes both content and visual information.
Keywords: DOM, VIPS, naive Bayes classifier, page segmentation, topic hierarchy
A quality framework for web site quality: user satisfaction and quality assurance BIBAKFull-Text 930-931
  Brian Kelly; Richard Vidgen
Web site developers need to use standards and best practices to ensure that Web sites are functional, accessible and interoperable. However, many Web sites fail to achieve these goals. This short paper describes how a Web site quality assessment method (E-Qual) might be used in conjunction with a quality assurance framework (QA Focus) to provide a rounded view of Web site quality that takes account of both end-user and developer perspectives.
Keywords: best practices, quality assurance, standards, web site quality
WebRogue: virtual presence in web sites BIBAKFull-Text 932-933
  Alessandro Soro; Ivan Marcialis; Davide Carboni; Gavino Paddeu
WebRogue is an application for virtual presence over the Web. It provides the Web browser with a chat subwindow that allows users connected to the same Web site to meet, share opinions and cooperate in a totally free, non-moderated and uncensored environment. Each time the user loads a Web page in the browser, WebRogue opens a discussion channel on a centralized server application, completely decoupled from the Web server, using the URL of the Web site as a key. Thus, whenever a new page is loaded, the user can see who is connected, as if entering a physical site. Interactivity is supported by two types of commands: communication commands allow synchronous interaction, as with chat or instant messaging software; social commands allow cooperation: group surfing, exchange of visit cards and waiting in line.
Keywords: chat, virtual presence, web, web communities
An economic model of the worldwide web BIBAKFull-Text 934-935
  George Kouroupas; Elias Koutsoupias; Christos H. Papadimitriou; Martha Sideri
We believe that much novel insight into the worldwide web can be obtained from taking into account the important fact that it is created, used, and run by selfish optimizing agents: users, document authors, and search engines. On-going theoretical and experimental analysis of a simple abstract model of www creation and search based on user utilities illustrates this point: We find that efficiency is higher when the utilities are more clustered, and that power-law statistics of document degrees emerge very naturally in this context. More importantly, our work sets up many more elaborate questions, related, e.g., to www search algorithms seen as author incentives, to search engine spam, and to search engine quality and competition.
Keywords: economic model, game theory, market, power laws, price of anarchy, utility function, web search
Adaptive page ranking with neural networks BIBAKFull-Text 936-937
  Franco Scarselli; Sweah Liang Yong; Markus Hagenbuchner; Ah Chung Tsoi
Recent developments in the area of neural networks provided new models which are capable of processing general types of graph structures. Neural networks are well-known for their generalization capabilities. This paper explores the idea of applying a novel neural network model to a web graph to compute an adaptive ranking of pages. Some early experimental results indicate that the new neural network models generalize exceptionally well when trained on a relatively small number of pages.
Keywords: adaptive page rank, graph processing, neural networks
The WT10G dataset and the evolution of the web BIBAKFull-Text 938-939
  Wei-Tsen Milly Chiang; Markus Hagenbuchner; Ah Chung Tsoi
The purpose of this paper is threefold. First, we study the evolution of the web based on data available from an earlier snapshot of the web and compare the results with those predicted in [2]. Secondly, we establish whether the WT10G dataset, a popular benchmark for the development and evaluation of Internet-based applications, is appropriate for such tasks. Finally, we ask whether a new dataset needs to be collected for these purposes. We find that the appropriateness of using the popular WT10G dataset in recent Internet-based experiments is questionable and that a new dataset is needed for the development and evaluation of algorithms related to Internet search engines.
Keywords: rate of change, standard datasets, web evolution
A semantic-link-based infrastructure for web service discovery in P2P networks BIBAKFull-Text 940-941
  Jie Liu; Hai Zhuge
An important issue arising from P2P applications is how to accurately and efficiently retrieve required Web services from large-scale repositories. This paper addresses the issue by organizing services in an overlay combining a Semantic Service Link Network with the Chord P2P network. A service request is first routed in Chord according to the given service operation names and keywords. Then the same request is routed in the Semantic Link Network according to the service link type and semantic matching. Compared with previous P2P service discovery approaches, the proposed approach has two advantages: (1) it produces more accurate and meaningful results when searching for particular services in a P2P network; and (2) it enables users and peers to discover services in a more flexible way.
Keywords: peer-to-peer, semantic link, web service
Automatic generation of link collections and their visualization BIBAKFull-Text 942-943
  Osamu Segawa; Jun Kawai; Kazuyuki Sakauchi
In this paper, we describe a method of generating link collections in a user-specified category by comprehensively collecting existing link collections and analyzing their hyperlink references. Moreover, we propose a visualization method for a bird's-eye view of the generated link collections. Our methods are effective in grasping intuitively the trend of significant sites and keywords in a category.
Keywords: hyperlink analysis, link collection, visualization
Predictive ranking: a novel page ranking approach by estimating the web structure BIBAKFull-Text 944-945
  Haixuan Yang; Irwin King; Michael R. Lyu
PageRank (PR) is one of the most popular ways to rank web pages. However, as the Web continues to grow in volume, it is becoming more and more difficult to crawl all the available pages. As a result, the page ranks computed by PR are based on only a subset of the whole Web. This produces inaccurate results because of the inherently incomplete information (dangling pages) that exists in the calculation. To overcome this incompleteness, we propose a new variant of the PageRank algorithm, called Predictive Ranking (PreR), in which different classes of dangling pages are analyzed individually so that the link structure can be predicted more accurately. We detail the proposed steps. Furthermore, experimental results show that this algorithm achieves encouraging results when compared with previous methods.
Keywords: PageRank, link analysis, predictive ranking
Webified video: media conversion from TV program to web content and their integrated viewing method BIBAKFull-Text 946-947
  Hisashi Miyamori; Katsumi Tanaka
A method is proposed for viewing broadcast content that converts TV programs into Web content and integrates the results with related information retrieved using local and/or Internet content.
Keywords: fusion of broadcast and web content, media conversion, metadata generation, next-generation storage TV, scene search, topic segmentation
Personal TV viewing by using live chat as metadata BIBAKFull-Text 948-949
  Hisashi Miyamori; Satoshi Nakamura; Katsumi Tanaka
We propose a new TV viewing method by personalizing TV programs with live chat information on the Web. It enables a new way of viewing TV content from different perspectives reflecting viewers' viewpoints.
Keywords: digest, fusion of broadcast and web content, live chat, metadata generation, semantic analysis, viewer, viewpoint
Accuracy enhancement of function-oriented web image classification BIBAKFull-Text 950-951
  Koji Nakahira; Toshihiko Yamasaki; Kiyoharu Aizawa
We propose a function-oriented classification of web images and show new applications using this categorization. We define nine categories of images, taking into account their functions in web pages, and classify web images using a Support Vector Machine (SVM) in a tree-structured way. To achieve high classification accuracy, we employ two kinds of features, image-based and text-based, which can be used together or separately at different stages of the classification. We also utilize DCT coefficients to distinguish photographic images from illustrations. As a result, accurate classification has been achieved. Finally, we show page summarization as a new application made feasible for the first time by our new categories of WWW images.
Keywords: classification, support vector machine, web images
Hera presentation generator BIBAKFull-Text 952-953
  Flavius Frasincar; Geert-Jan Houben; Peter Barna
Semantic Web Information Systems (SWIS) are Web Information Systems that use Semantic Web technologies. Hera is a model-driven design methodology for SWIS. In Hera, models are represented in RDFS and model instances in RDF. The Hera Presentation Generator (HPG) is an integrated development environment that supports the presentation generation layer of the Hera methodology. The HPG is based on a pipeline of data transformations driven by different Hera models.
Keywords: RDF(S), SWIS, WIS, design environment, semantic web
Can link analysis tell us about web traffic? BIBAKFull-Text 954-955
  Marcin Sydow
In this paper we measure the correlation between link analysis characteristics of Web pages, such as in- and out-degree, PageRank and RBS, and those obtained from real Web traffic analysis. Measurements on real data from the Polish Web show that PageRank is observably, but not strongly, correlated with actual visits made by Web users to Web pages, and that our RBS algorithm [2] is more strongly correlated with the traffic data than PageRank in some cases.
Keywords: PageRank, RBS, link analysis, web traffic analysis
Analyzing web page headings considering various presentation BIBAKFull-Text 956-957
  Yushin Tatsumi; Toshiyuki Asahi
Exploiting document structure can mitigate the usability problems that arise when browsing web pages designed for PCs on non-PC terminals. For example, by extracting headings from the document structure and showing them selectively on a display, users can easily grasp a page's overview. In this paper, as a basic part of document structure analysis, we propose a heading analysis method for web pages that takes various presentation styles into account. Evaluation experiments confirmed that the proposed method can extract many headings that cannot be extracted using HTML element names alone.
Keywords: content adaptation, heading analysis, web document analysis
Predicting navigation patterns on the mobile-internet using time of the week BIBAKFull-Text 958-959
  Martin Halvey; Mark T. Keane; Barry Smyth
A predictive analysis of user navigation on the Internet is presented that exploits time-of-the-week data. Specifically, we investigate time as an environmental factor in making predictions about user navigation. We analyze a large sample of user navigation data (over 3.7 million sessions from 0.5 million users) in a mobile-Internet context to determine whether user surfing patterns vary depending on the time of the week at which they occur. We find that the use of time improves the predictive accuracy of navigation models.
Keywords: WAP, WWW, browsing, log file analysis, mobile, mobile-web, navigation, prediction, user modeling
Finding group shilling in recommendation system BIBAKFull-Text 960-961
  Xue-Feng Su; Hua-Jun Zeng; Zheng Chen
In the age of information explosion, recommendation systems have proved effective at coping with information overload in e-commerce. However, unscrupulous producers shill these systems in many ways to make a profit, making them imprecise and unreliable in the long term. Among the many shilling behaviors, a new form of attack, called group shilling, has appeared and does great harm to such systems. Because group shilling users are well organized and hide among normal users, they are hard to find by traditional methods. However, these group shilling users are similar to some extent, since they all shill the target items. We propose a similarity spreading algorithm to find group shilling users and protect recommendation systems from unfair ratings. The algorithm finds these users by propagating similarities from items to users iteratively. Experiments show that our similarity spreading algorithm improves the precision of the system and provides reliable protection.
Keywords: collaborative filtering, group shilling, recommendation system
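One way to realize the item-to-user propagation, under our own assumptions rather than the paper's exact formulation, is a HITS-style iteration: a user's shilling score is the mean suspiciousness of the items they rated highly, and an item's suspiciousness is the mean shilling score of its high raters.

```python
# Illustrative sketch (invented data): iterative similarity spreading
# between items and users to surface coordinated group shilling.

def spread_similarity(ratings, target_items, iters=20):
    """ratings: dict user -> set of highly rated items."""
    users = list(ratings)
    items = {i for rated in ratings.values() for i in rated}
    item_score = {i: (1.0 if i in target_items else 0.0) for i in items}
    user_score = {u: 0.0 for u in users}
    for _ in range(iters):
        # users inherit suspicion from the items they pushed
        for u in users:
            rated = ratings[u]
            user_score[u] = sum(item_score[i] for i in rated) / len(rated)
        # items inherit suspicion from the users who pushed them
        for i in items:
            raters = [u for u in users if i in ratings[u]]
            item_score[i] = sum(user_score[u] for u in raters) / len(raters)
    return user_score

ratings = {
    "shill1": {"target", "decoy"},
    "shill2": {"target", "decoy"},
    "honest": {"popular", "classic"},
}
print(spread_similarity(ratings, target_items={"target"}))
```

Users who co-rate the seeded target item end up with a strictly higher score than users whose ratings never touch it, even after only a few rounds of propagation.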
SLL: running my web services on your WS platforms BIBAKFull-Text 962-963
  Donald Kossmann; Christian Reichel
Today, the choice of a particular programming language limits the alternative products that can be used to deploy the program. For instance, a Java program must be executed using a Java VM. This limitation is particularly harmful for the emergence of new programming paradigms like SOA and Web Services, because platforms for innovative new programming languages are typically not as stable and mature as the established platforms for traditional paradigms. The purpose of this work is to break the strong ties between programming languages and runtime environments and thus make it possible to innovate at both ends independently. The specific focus is on Web Services and Service-Oriented Architectures; focusing on this domain makes the goal achievable with affordable effort. The key idea is to introduce a Service Language Layer (SLL) that gives a high-level abstraction of a service-oriented program and can easily and efficiently be executed on alternative Web Services platforms.
Keywords: XML, XML-based service language, decoupling, service language layer, transformation, web services
An agent system for ontology sharing on WWW BIBAKFull-Text 964-965
  Kotaro Nakayama; Takahiro Hara; Shojiro Nishio
Semantic Web Services (SWS), a new-generation WWW technology, will facilitate the automation of Web service tasks, including automated Web service discovery, execution, composition and mediation, by using XML-based metadata and ontologies. There have been several efforts to build knowledge representation languages for Web Services. However, only a few attempts have so far been made to develop applications based on SWS. In particular, front-end agent systems for users are an urgent research area. This paper introduces our new integrated front-end agent system for ontology management and SWS management.
Keywords: agent technologies, ontology, semantic web, web services
Introducing multimodal character agents into existing web applications BIBAKFull-Text 966-967
  Kimihito Ito
This paper proposes a framework in which end-users can instantaneously modify existing Web applications by introducing multimodal user interfaces. The authors use the IntelligentPad architecture and MPML as the basis of the framework. Example applications include character agents that read the latest news on a news Web site. The framework does not require users to write any program code or scripts to add multimodal user interfaces to existing Web applications.
Keywords: IntelligentPad, MPML, multimodal user interface, web application
Interactive web-wrapper construction for extracting relational information from web documents BIBAKFull-Text 968-969
  Tsuyoshi Sugibuchi; Yuzuru Tanaka
In this paper, we propose a new user interface for interactively specifying Web wrappers that extract relational information from Web documents. We focus on streamlining the user's trial-and-error iterations in constructing a wrapper. Our approach combines a lightweight wrapper construction method with a dynamic previewing interface that quickly shows how a generated wrapper works. We adopted a simple algorithm that can construct a Web wrapper from given extraction examples in less than 100 milliseconds. Using this algorithm, our system dynamically generates a new wrapper from the stream of the user's mouse events specifying extraction examples, and immediately updates a preview showing how the generated wrapper extracts HTML nodes from the source Web document. Through this animated display, a user can try many different combinations of extraction examples simply by moving the mouse over the Web document, and quickly arrive at a set of examples that yields the intended wrapper.
Keywords: information extraction, user interfaces, web wrappers
Multispace information visualization framework for the intercomparison of data sets retrieved from web services BIBAKFull-Text 970-971
  Masahiko Itoh; Yuzuru Tanaka
We introduce a new visualization framework for the intercomparison of multiple data sets retrieved from Web services. In our framework, we use multiple visualization spaces simultaneously, each of which visualizes a single data set retrieved from a Web service. For this purpose, we provide a new 3D component for accessing Web services, and a 3D space component in which the data set retrieved from the Web service is visualized. Moreover, our framework provides users with various operations applicable to these space components, i.e., union, intersection, set-difference, cross-product, selection, projection, and joins.
Keywords: IntelligentBox, WorldBottle, visualization, web service
On the feasibility of low-rank approximation for personalized PageRank BIBAKFull-Text 972-973
  András A. Benczúr; Károly Csalogány; Tamás Sarlós
Personalized PageRank expresses backlink-based page quality around user-selected pages in a similar way to PageRank over the entire Web. Algorithms for computing personalized PageRank on the fly are either limited to a restricted choice of page selection or believed to behave well only on sparser regions of the Web. In this paper we show the feasibility of computing personalized PageRank by a k < 1000 low-rank approximation of the PageRank transition matrix; our algorithm computes an approximate personalized PageRank by multiplying an n x k matrix, a k x n matrix and the n-dimensional personalization vector. Since low-rank approximations are accurate on dense regions, we hope that our technique will combine well with known algorithms.
Keywords: link analysis, low-rank approximation, personalized PageRank, singular value decomposition, web information retrieval
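The abstract's core computation (multiplying an n x k factor, a k x n factor, and the personalization vector) can be sketched with a truncated SVD. This is an illustrative reconstruction, not the authors' code; the toy matrix, the rank k, and the damping factor are assumptions, and orientation conventions for the transition matrix vary between papers.

```python
import numpy as np

# Build a small random row-stochastic transition matrix P (n x n).
rng = np.random.default_rng(0)
n, k = 50, 5
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)

# Truncated SVD gives P ~= A @ B with an n x k and a k x n factor.
U, s, Vt = np.linalg.svd(P)
A = U[:, :k] * s[:k]          # n x k (columns scaled by singular values)
B = Vt[:k, :]                 # k x n

# Personalization vector concentrated on one user-selected page.
v = np.zeros(n)
v[0] = 1.0

# One PageRank-style step using only the low-rank factors:
# cost is O(nk) per multiplication instead of O(n^2).
alpha = 0.85
r = alpha * (A @ (B @ v)) + (1 - alpha) * v
print(r.shape)
```

The saving comes from never materializing the dense n x n product: each step touches only the two thin factors.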
An architecture for personal semantic web information retrieval system BIBAKFull-Text 974-975
  Haibo Yu; Tsunenori Mine; Makoto Amamiya
The semantic Web and Web service technologies have provided both new possibilities and challenges for automatic information processing. There is much research on applying these new technologies to current personal Web information retrieval systems, but no research addresses the semantic issues from the point of view of the whole life cycle and architecture. Web services provide a new way of accessing Web resources, but until now they have been managed separately from traditional Web content resources. In this poster, we propose a conceptual architecture for a personal semantic Web information retrieval system. It incorporates semantic Web, Web service and multi-agent technologies to enable not only the precise location of Web resources but also the automatic or semi-automatic integration of hybrid Web contents and Web services.
Keywords: information retrieval system, semantic web, web portal, web services
TruRank: taking PageRank to the limit BIBAKFull-Text 976-977
  Sebastiano Vigna
PageRank is defined as the stationary state of a Markov chain depending on a damping factor α that spreads uniformly part of the rank. The choice of α is eminently empirical, and in most cases the original suggestion α=0.85 by Brin and Page is still used. It is common belief that values of α closer to 1 give a "truer to the web" PageRank, but a small α accelerates convergence. Recently, however, it has been shown that when α=1 all pages in the core component are very likely to have rank 0 [1]. This behaviour makes it difficult to understand PageRank when α≈1, as it converges to a meaningless value for most pages. We propose a simple and natural modification to the standard preprocessing performed on the adjacency matrix of the graph, resulting in a ranking scheme we call TruRank. TruRank ranks the web with principles almost identical to PageRank, but it gives meaningful values also when α≈1.
Keywords: PageRank, web graph
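For context, here is a minimal power-iteration PageRank with an explicit damping factor α. This sketches only the standard algorithm the abstract builds on, not the TruRank preprocessing itself; the example graph is invented.

```python
import numpy as np

def pagerank(P, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for the stationary vector of
    alpha * P^T + (1 - alpha) * uniform teleport,
    where P is a row-stochastic transition matrix."""
    n = P.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = alpha * (P.T @ r) + (1 - alpha) / n
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r

# A 3-node cycle: by symmetry every page gets equal rank for any alpha.
P = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], float)
print(pagerank(P))
```

As the abstract notes, convergence slows as α approaches 1; with α = 1 the teleport term vanishes and the iteration degenerates to the bare chain.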
An information extraction engine for web discussion forums BIBAKFull-Text 978-979
  Hanny Yulius Limanto; Nguyen Ngoc Giang; Vo Tan Trung; Jun Zhang; Qi He; Nguyen Quang Huy
In this poster, we present an information extraction engine for web-based forums. The engine analyzes the HTML files crawled from web forums, deduces the wrapper (template) of the pages and extracts information about posts (e.g., author, title, content, number of replies and views, etc.). Extraction is an important module for a forum search engine, since it helps to understand the content of a forum HTML page and facilitates ranking during retrieval. We discuss the system architecture of the extraction engine in the context of a forum search engine and present the various components of the extraction engine. We also briefly introduce the extraction process and discuss some implementation issues.
Keywords: discussion board, forums, information extraction, information retrieval, search engine
Mining web site's topic hierarchy BIBAKFull-Text 980-981
  Nan Liu; C. Yang
Searching and navigating a Web site can be tedious, and hierarchical models such as site maps are frequently used to organize a Web site's content. In this work, we propose to model a Web site's content structure using a topic hierarchy: a directed tree rooted at the Web site's homepage in which the vertices and edges correspond to Web pages and hyperlinks. Our algorithm for mining a Web site's topic hierarchy utilizes three types of information associated with a Web site: link structure, directory structure and Web page content.
Keywords: content structure, topic hierarchy, web mining
Consistency checking of UML model diagrams using the XML semantics approach BIBAKFull-Text 982-983
  Yasser Kotb; Takuya Katayama
A software design is often modeled as a collection of Unified Modeling Language (UML) diagrams. Different aspects of the software system are covered by many different UML diagrams. This creates a significant risk that the overall specification of the system becomes inconsistent or incomplete, which makes it necessary to check the consistency between related UML diagrams. In addition, as the software system evolves, the diagrams are modified, which again may introduce inconsistency and incompleteness between different versions of these diagrams. In this paper, we employ our previous XML semantics approach, which was proposed for checking the semantic consistency of XML documents using attribute grammar techniques, to check the consistency of UML diagrams. The key idea is to translate the UML diagrams into equivalent XMI documents and then to check the consistency of these XMI documents, which are a special form of XML, using our XML semantics approach.
Keywords: UML, XMI, XML, attribute grammars, model checking
Delivering new web content reusing remote and heterogeneous sites. A DOM-based approach BIBAKFull-Text 984-985
  Luis Álvarez Sabucedo; Luis Anido Rifón
This contribution addresses the development of new web sites that reuse already existing contents from external sources. Unlike common links to other resources, which retrieve the whole resource, we propose an approach in which partial retrieval is possible: the unit of data reuse is a node in a DOM tree. This solution permits the partial reuse of external and heterogeneous web contents with no need for client (browser) modifications and only minor changes to web servers.
Keywords: DOM, HTTP, URL, content reuse, hypertext, interoperability, reusability, web server
Multi-step media adaptation: implementation of a knowledge-based engine BIBAKFull-Text 986-987
  Peter Soetens; Matthias De Geyter
Continuing changes in the domains of consumer devices and multimedia formats demand a new approach to media adaptation. The publication of customized content on a device requires an automatic adaptation engine that takes into account the specifications of both the device and the material to be published. These specifications can be expressed using a single domain ontology that describes the concepts of the media adaptation domain. In this document, we provide insight into the implementation of an adaptation engine that exploits this domain knowledge. We explain how this engine, through the use of description matching and Semantic Web Services, composes a chain of adaptation services that adapt the original content to the needs of the target device.
Keywords: OWL, content adaptation, device independence, multimedia, semantic web, services, standards
A clustering method for news articles retrieval system BIBAKFull-Text 988-989
  Hiroyuki Toda; Ryoji Kataoka
Organizing the results of a search helps the user get an overview of the information returned. We regard clustering as the task of generating labels for a list of items; we focus on news articles and propose a clustering method that uses named entity extraction.
Keywords: document clustering, named entity, search result organization
The language observatory project (LOP) BIBAKFull-Text 990-991
  Yoshiki Mikami; Pavol Zavarsky; Mohd Zaidi Abd Rozan; Izumi Suzuki; Masayuki Takahashi; Tomohide Maki; Irwan Nizan Ayob; Paolo Boldi; Massimo Santini; Sebastiano Vigna
The first part of the paper provides a brief description of the Language Observatory Project (LOP) and highlights the major technical difficulties to be addressed. The latter part describes how we responded to these difficulties by adopting UbiCrawler as the data collecting engine for the project. The close collaboration between the two groups is producing quite satisfactory results.
Keywords: character sets, language, language digital divide, language identification, scripts, web crawler
Association search in semantic web: search + inference BIBAKFull-Text 992-993
  Liang Bangyong; Tang Jie; Li Juanzi
Association search searches for certain instances in the semantic web and then makes inferences from and about the instances found. In this paper, we propose the problem of association search and our preliminary solution for it using a Bayesian network. We first precisely define association search and its categorization, and then define the tasks in association search. For the Bayesian network, we take the ontology taxonomy as the network structure and use the query log of instances to estimate the network parameters. After the Bayesian network is constructed, we give the solution for association search in the network.
Keywords: Bayesian network, inference, knowledge management, ontology
XHTML meta data profiles BIBAKFull-Text 994-995
  Tantek Çelik; Eric A. Meyer; Matthew Mullenweg
In this paper, we describe XHTML Meta Data Profiles (XMDP) which use XHTML to define a simple profile format which is both human and machine readable. XMDP can be used to extend XHTML by defining new link relationships, meta data properties/values, and class name semantics. XMDP has already been used to extend semantic XHTML to represent social networks, document licensing, voting, and tagging.
Keywords: HTML, WWW, XFN, XHTML, XMDP, class names, link relationships, lowercase semantic web, meta data, microformats, profiles, reuse, schema, world wide web
An adaptive middleware infrastructure for mobile computing BIBAKFull-Text 996-997
  Ronnie Cheung
In a mobile environment, where mobile applications suffer from limited and varying availability of system resources, it is desirable for applications to adapt their behavior to resource limitations and variations. It is also necessary to achieve optimal application performance. However, adaptation by mobile applications usually suffers from unfairness to other applications; in contrast, adaptation by the operating system focuses on overall system performance while neglecting the needs of individual applications. Hence, the adaptation task is best coordinated by a middleware that can cater to each application's needs fairly while maintaining optimal system performance. This is achieved by a context-aware mobile middleware that sits between the mobile application and the operating environment.
Keywords: adaptation, middleware infrastructure, mobile environments
Data versioning techniques for internet transaction management BIBAKFull-Text 998-999
  Ramkrishna Chatterjee; Gopalan Arun
An Internet transaction is a transaction that involves communication over the Internet using standard Internet protocols such as HTTPS. Such transactions are widely used in Internet-based applications such as e-commerce. With the growth of the Internet, the volume and complexity of Internet transactions are rapidly increasing. We present data versioning techniques that can reduce the complexity of managing Internet transactions and improve their scalability and reliability. These techniques have been implemented using standard database technology, without any change to the database kernel. Our initial empirical results argue for the effectiveness of these techniques in practice.
Keywords: internet transaction, scalability, versioning
Using visual cues for extraction of tabular data from arbitrary HTML documents BIBAKFull-Text 1000-1001
  Bernhard Krüpl; Marcus Herzog; Wolfgang Gatterbauer
We describe a method to extract tabular data from web pages. Rather than just analyzing the DOM tree, we also exploit visual cues in the rendered version of the document to extract data from tables which are not explicitly marked with an HTML table element. To detect tables, we rely on a variant of the well-known X-Y cut algorithm as used in the OCR community. We implemented the system by directly accessing Mozilla's box model that contains the positional data for all HTML elements of a given web page.
Keywords: table detection, visual analysis, web information extraction
Describing namespaces with GRDDL BIBAKFull-Text 1002-1003
  Erik Wilde
Describing XML Namespaces is an open issue for many users of XML technologies, and even though namespaces are one of the foundations of XML, there is no generally accepted and widely used format for namespace descriptions. We present a framework for describing namespaces based on GRDDL using a controlled vocabulary. Using this framework, namespace descriptions can be easily generated, harvested and published in human- or machine-readable form.
Keywords: languages, management
Building an open source meta-search engine BIBAKFull-Text 1004-1005
  A. Gulli; A. Signorini
In this short paper we introduce Helios, a flexible and efficient open source meta-search engine. Helios currently runs on top of 18 search engines (in the Web, Books, News, and Academic publication domains), but additional search engines can easily be plugged in. We also report some performance figures measured during its development.
Keywords: meta search engines, open source
Design and implementation of a feedback controller for slowdown differentiation on internet servers BIBAKFull-Text 1006-1007
  Jianbin Wei; Cheng-Zhong Xu
Proportional slowdown differentiation (PSD) aims to maintain slowdown ratios between different classes of clients according to their pre-specified differentiation parameters. In this paper, we design a feedback controller to allocate processing rate on Internet servers for PSD. In this approach, the processing rate of a class is adjusted by an integral feedback controller according to the difference between the target slowdown ratio and the achieved one. The initial rate of a class is estimated from the predicted workload using queueing theory. We implement the feedback controller in an Apache Web server. The experimental results under various environments demonstrate the controller's effectiveness and robustness.
Keywords: feedback control, quality of service, slowdown
MiSpider: a continuous agent on web pages BIBAKFull-Text 1008-1009
  Yujiro Fukagaya; Tadachika Ozono; Takayuki Ito; Toramatsu Shintani
In this paper, we propose a Web-based agent system called MiSpider, which provides intelligent web services on web browsers. MiSpider enables users to use agents on existing browsers anywhere, requiring only Internet access. MiSpider agents are persistent: an agent's state does not change when the user moves to a different page. Moreover, agents can pass messages to communicate with one another.
Keywords: browsing support, information system, multiagent system
Automatically learning document taxonomies for hierarchical classification BIBAKFull-Text 1010-1011
  Kunal Punera; Suju Rajan; Joydeep Ghosh
While several hierarchical classification methods have been applied to web content, such techniques invariably rely on a pre-defined taxonomy of documents. We propose a new technique that extracts a suitable hierarchical structure automatically from a corpus of labeled documents. We show that our technique groups similar classes closer together in the tree and discovers relationships among documents that are not encoded in the class labels. The learned taxonomy is then used along with binary SVMs for multi-class classification. We demonstrate the efficacy of our approach by testing it on the 20-Newsgroup dataset.
Keywords: automatic taxonomy learning, hierarchical classification
Web page marker: a web browsing support system based on marking and anchoring BIBAKFull-Text 1012-1013
  Takahiro Koga; Noriharu Tashiro; Tadachika Ozono; Takayuki Ito; Toramatsu Shintani
In this paper, we propose a web browsing support system, called WPM, which provides marking and anchoring functions on ordinary web browsers. WPM users can mark words and phrases on web pages with their browsers, without the extra plug-ins that similar systems require, and can anchor words in order to refer to them later. WPM makes it possible to mark up an existing Web page just as one would mark up paper. By partially changing character decoration, marked text is emphasized, improving readability. WPM is implemented using a proxy agent, so it can be used in everyday browsing without the user being conscious of the system.
Keywords: browsing support, marking, proxy agent
An approach for realizing privacy-preserving web-based services BIBKFull-Text 1014-1015
  Wei Xu; R. Sekar; I. V. Ramakrishnan; V. N. Venkatakrishnan
Keywords: information flow, privacy, web service
Exploiting the web for point-in-time file sharing BIBAKFull-Text 1016-1017
  Roberto J. Bayardo; Sebastian Thomschke
We describe a simple approach to "point-in-time" file sharing based on time-expiring web links and personal webservers. This approach to file sharing is useful in environments where instant messaging clients are varied and don't necessarily support (compatible) file transfer protocols. We discuss the features of such an approach along with a successfully deployed implementation now in wide use throughout the IBM Corporation.
Keywords: file sharing, instant messaging, personal web server
Using OWL for querying an XML/RDF syntax BIBAKFull-Text 1018-1019
  Rubén Tous; Jaime Delgado
Some recent initiatives try to leverage RDF to make XML documents interoperate at the semantic level. Ontologies are used to establish semantic connections among XML languages, and some mechanisms have been defined to query them with standard XML query languages like XPath and XML Query. Structure-mapping approaches generally define a simple translation between trivial XPath expressions and an RDF query language like RDQL; however, some XPath constructs cannot be covered by a structure-mapping strategy. In contrast, our work takes the model-mapping approach, which respects node order and allows mapping all XPath axes. The resulting XPath implementation is schema-aware and IDREF-aware, so it can be used to exploit inheritance hierarchies defined in one or more XML schemas.
Keywords: RDF, XML, XPath, idref-awareness, interoperability, ontologies, schema-awareness, semantic integration
Signing individual fragments of an RDF graph BIBAKFull-Text 1020-1021
  Giovanni Tummarello; Christian Morbidoni; Paolo Puliti; Francesco Piazza
Being able to determine the provenance of statements is a fundamental step in any SW trust modeling. We propose a methodology that allows signing small groups of RDF statements. Groups of statements signed with this methodology can be safely inserted into any existing triple store without loss of provenance information, since only standard RDF semantics and constructs are used. The methodology has been implemented and is both available as an open source library and deployed in a SW P2P project.
Keywords: RDF, digital signature, semantic web, trust
Hybrid semantic tagging for information extraction BIBAKFull-Text 1022-1023
  Ronen Feldman; Benjamin Rosenfeld; Moshe Fresko; Brian D. Davison
The semantic web is expected to have an impact at least as big as that of the existing HTML-based web, if not greater. However, the challenge lies in creating this semantic web and in converting existing web information into the semantic paradigm. One of the core technologies that can help in the migration process is automatic markup: the semantic markup of content, providing semantic tags that describe the raw content. This paper describes a hybrid statistical and knowledge-based information extraction model, able to extract entities and relations at the sentence level. The model attempts to retain and improve the high accuracy levels of knowledge-based systems while drastically reducing the amount of manual labor by relying on statistics drawn from a training corpus. The implementation of the model, called TEG (Trainable Extraction Grammar), can be adapted to any IE domain by writing a suitable set of rules in an SCFG (Stochastic Context Free Grammar) based extraction language and training them using an annotated corpus. The experiments show that our hybrid approach outperforms both purely statistical and purely knowledge-based systems, while requiring orders of magnitude less manual rule writing and a smaller amount of training data. We also demonstrate the robustness of our system under conditions of poor training data quality. This makes the system very suitable for converting legacy web pages to semantic web pages.
Keywords: HMM, information extraction, rules based systems, semantic web, text mining
GalaTex: a conformant implementation of the XQuery full-text language BIBAKFull-Text 1024-1025
  Emiran Curtmola; Sihem Amer-Yahia; Philip Brown; Mary Fernández
We describe GalaTex, the first complete implementation of XQuery Full-Text, a W3C specification that extends XPath 2.0 and XQuery 1.0 with full-text search. XQuery Full-Text provides composable full-text search primitives such as keyword search, Boolean queries, and keyword-distance predicates. GalaTex is intended to serve as a reference implementation for XQuery Full-Text and as a platform for addressing new research problems such as scoring full-text query results, optimizing XML queries over both structure and text, and evaluating top-k queries on scored results. GalaTex is an all-XQuery implementation initially focused on completeness and conformance rather than on efficiency. We describe its implementation on top of Galax, a complete XQuery implementation.
Keywords: XQuery, conformant prototype, full-text
Guidelines for developing trust in health websites BIBAKFull-Text 1026-1027
  E. Sillence; P. Briggs; L. Fishwick; P. Harris
How do people decide which health websites to trust and which to reject? Thirteen participants, all diagnosed with hypertension, were invited to search for information and advice relating to hypertension. Participants took part in a four-week study engaging in both free and directed web searches. A content analysis of the group discussions revealed support for a staged model of trust in which mistrust or rejection of websites is based on design factors, while trust or selection of websites is based on content factors such as source credibility and personalization. A number of guidelines for developing trust in health websites are proposed.
Keywords: computer mediated communication, credibility, health, internet, social identity, trust
Efficient structural joins with on-the-fly indexing BIBAKFull-Text 1028-1029
  Kun-Lung Wu; Shyh-Kwei Chen; Philip S. Yu
Previous work on structural joins mostly focuses on maintaining offline indexes on disks. Most of them also require the elements in both sets to be sorted. In this paper, we study an on-the-fly, in-memory indexing approach to structural joins. There is no need to sort the elements or maintain indexes on disks. We identify the similarity between the structural join problem and the stabbing query problem, and extend a main memory-based indexing technique for stabbing queries to structural joins.
Keywords: XML, containment queries, structural joins
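The interval-containment formulation behind the stabbing-query connection can be illustrated with a brute-force join over (start, end) element encodings. The paper's in-memory index replaces this quadratic scan; the encoding shown is a common convention for structural joins, not code taken from the paper.

```python
# Each XML element is encoded by its (start, end) document positions;
# ancestor A contains descendant D iff A.start < D.start and D.end < A.end.
def structural_join(ancestors, descendants):
    """Brute-force containment join over interval encodings.
    An index answering stabbing queries replaces this O(|A|*|D|) scan."""
    return [(a, d) for a in ancestors for d in descendants
            if a[0] < d[0] and d[1] < a[1]]

# <a 1..10> contains <b 2..3> and <b 5..6>; (13, 20) is not contained.
ancestors = [(1, 10), (12, 14)]
descendants = [(2, 3), (5, 6), (13, 20)]
print(structural_join(ancestors, descendants))
# -> [((1, 10), (2, 3)), ((1, 10), (5, 6))]
```

A descendant's start position "stabs" the interval of every ancestor containing it, which is exactly the stabbing-query problem the abstract identifies.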
Processing link structures and linkbases on the web BIBAKFull-Text 1030-1031
  François Bry; Michael Eckert
Hyperlinks are an essential feature of the World Wide Web, largely responsible for its success. XLink improves on HTML's linking capabilities in several ways. In particular, XLink links can be "out-of-line" (i.e., not defined at a link source) and collected in (possibly several) linkbases, which considerably eases the building of complex link structures.
   Modeling of link structures and processing of linkbases under the Web's "open world linking" are aspects neglected by XLink. Adding a notion of "interface" to XLink, as suggested in this work, considerably improves modeling of link structures. When a link structure is traversed, the relevant linkbase(s) might become ambiguous. We suggest three linkbase management modes governing the binding of a linkbase to a document to resolve this ambiguity.
Keywords: XLink, hyperlink, link modeling and processing, linkbase
A comprehensive comparative study on term weighting schemes for text categorization with support vector machines BIBAKFull-Text 1032-1033
  Man Lan; Chew-Lim Tan; Hwee-Boon Low; Sam-Yuan Sung
Term weighting, which converts documents into vectors in the term space, is a vital step in automatic text categorization. In this paper, we conducted comprehensive experiments comparing various term weighting schemes with SVM on two widely-used benchmark data sets. We also present a new term weighting scheme, tf-rf, to improve a term's discriminating power. The controlled experimental results show that the newly proposed tf-rf scheme is significantly better than other widely-used term weighting schemes. Compared with schemes based on the tf factor alone, the idf factor does not improve, and may even decrease, a term's discriminating power for text categorization.
Keywords: SVM, term weighting schemes, text categorization
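A hedged sketch of a tf.rf-style weight: the relevance frequency (rf) factor grows with the ratio of positive- to negative-category documents containing the term. The exact constants and counts here follow our reading of the scheme and should be checked against the paper.

```python
import math

def tf_rf(tf, pos_docs_with_term, neg_docs_with_term):
    """tf.rf weighting sketch: rf = log2(2 + a / max(1, c)), where a and c
    count positive- and negative-category documents containing the term.
    Constants are our assumption; verify against the published scheme."""
    rf = math.log2(2 + pos_docs_with_term / max(1, neg_docs_with_term))
    return tf * rf

# A term concentrated in the positive class gets a boosted weight...
print(tf_rf(3, pos_docs_with_term=90, neg_docs_with_term=10))
# ...while an evenly distributed term is barely boosted.
print(tf_rf(3, pos_docs_with_term=50, neg_docs_with_term=50))
```

Unlike idf, the rf factor uses category labels, which is what gives it discriminating power for supervised categorization.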
A model for short-term content adaptation BIBAKFull-Text 1034-1035
  Marco Benini; Alberto Trombetta; Michela Acquaviva
This paper proposes a model for short-term content adaptation whose aim is to satisfy the contingent needs of users by adjusting the information a web-application provides on the basis of a short-term user profile. The mathematical model results in the design of an adaptive filter that profiles users by observing their queries to the application and that adjusts the answers of the application according to the inferred user needs. Also, the mathematical model ensures the correctness of the filter, that is, the filter is guaranteed to exhibit a coherent short-term adaptive behaviour.
Keywords: information filtering, user modelling
Semantic virtual environments BIBAKFull-Text 1036-1037
  Karsten A. Otto
Today's Virtual Environment (VE) systems share a number of issues with the HTML-based World Wide Web. Their content is usually designed for presentation to humans, and thus is not suitable for machine access. This is complicated by the large number of different data models and network protocols in use. Accordingly, it is difficult to develop VE software, such as agents, services, and tools.
   In this paper we apply the Semantic Web idea to the field of virtual environments. Using the Resource Description Framework (RDF) we establish a machine-understandable abstraction of existing VE systems -- the Semantic Virtual Environments (SVE). On this basis it is possible to develop system-independent software, which can even operate across VE system boundaries.
Keywords: components, framework, integration, semantic web, virtual environments
Verifying feature models using Protege-OWL BIBAKFull-Text 1038-1039
  Hai Wang; Yuan Fang Li; Jing Sun; Hongyu Zhang
Feature models are widely used in domain engineering to capture common and variant features among systems in a particular domain. However, the lack of a widely-adopted means of precisely representing and formally verifying feature models has hindered the development of this area. This paper presents an approach to modeling and verifying feature diagrams using Semantic Web ontologies.
Keywords: OWL, feature modeling, ontologies, semantic web
Multiple strategies detection in ontology mapping BIBAKFull-Text 1040-1041
  Jie Tang; Yong Liang; Zi Li
Ontology mapping is the task of finding semantic relationships between the entities (i.e., concepts, attributes and relations) of two ontologies. In the existing literature, many (semi-)automatic approaches have attracted considerable interest by combining several mapping strategies (multi-strategy mapping). However, experiments show that multi-strategy mapping does not always outperform its single-strategy counterpart. We mainly consider the following questions: for a new, unseen mapping task, should one use a multi-strategy or a single-strategy approach? And if the task is suitable for multi-strategy mapping, which strategies should be selected for the combined scenario? This paper proposes an approach of multiple-strategy detection for ontology mapping. The results obtained so far show that multi-strategy detection improves both precision and recall significantly.
Keywords: multi-strategy detection, ontology mapping, semantic web
A study on combination of block importance and relevance to estimate page relevance BIBAKFull-Text 1042-1043
  Shen Huang; Yong Yu; Shengping Li; Gui-Rong Xue; Lei Zhang
Previous work has shown that segmenting web pages into "semantically independent" blocks can help improve whole-page retrieval. One key, unexplored issue is how to combine block importance and relevance to a given query. In this poster, we first propose an automatic way to measure block importance to improve retrieval. We then take the user's information need into account to refine block importance for different users.
Keywords: block importance, block relevance, information need, iterative combination
Towards autonomic web-sites based on learning automata BIBAKFull-Text 1044-1045
  Pradeep S; Chitra Ramachandran; Srinath Srinivasa
Autonomics or self-reorganization becomes pertinent for web-sites serving a large number of users with highly varying workloads. An important component of self-adaptation is to model the behaviour of users and adapt accordingly. This paper proposes a learning-automata based technique for model discovery. User access patterns are used to construct an FSM model of user behaviour that in turn is used for prediction and prefetching. The proposed technique uses a generalization algorithm to classify behaviour patterns into a small number of generalized classes. It has been tested on both synthetic and live data-sets and has shown a prediction hit-rate of up to 89% on a real web-site.
Keywords: autonomic website, generalization, learning automata
On business activity modeling using grammars BIBAKFull-Text 1046-1047
  Savitha Srinivasan; Arnon Amir; Prasad Deshpande; Vladimir Zbarsky
Web based applications offer a mainstream channel for businesses to manage their activities. We model such business activity in a grammar-based framework. The Backus Naur form notation is used to represent the syntax of a regular grammar corresponding to Web log patterns of interest. Then, a deterministic finite state machine is used to parse Web logs against the grammar. Detected tasks are associated with metadata such as time taken to perform the activity, and aggregated along relevant corporate dimensions.
Keywords: data mining, web log analysis
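The abstract above describes parsing web logs with a deterministic finite state machine derived from a regular grammar. A minimal sketch of that idea follows; the states, event names, and the "checkout" task are illustrative assumptions, not taken from the paper.

```python
# Sketch: a DFA scans a stream of web-log events and reports when a task
# pattern (a regular-grammar production) has been completed.

def detect_tasks(events, transitions, start, accept):
    """Run a DFA over a web-log event stream; yield (task, length) on accept."""
    state, length = start, 0
    for event in events:
        state = transitions.get((state, event))
        if state is None:          # pattern broken: restart the machine
            state, length = start, 0
            continue
        length += 1
        if state in accept:        # a complete task was recognized
            yield (accept[state], length)
            state, length = start, 0

# DFA for the (hypothetical) pattern: view_item -> add_to_cart -> pay
transitions = {
    ("S", "view_item"): "V",
    ("V", "add_to_cart"): "C",
    ("C", "pay"): "DONE",
}
accept = {"DONE": "checkout"}

log = ["view_item", "add_to_cart", "pay", "view_item", "logout"]
tasks = list(detect_tasks(log, transitions, "S", accept))
print(tasks)   # [('checkout', 3)]
```

In a real system the grammar, and hence the transition table, would be generated from the BNF specification rather than written by hand.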
Soundness proof of Z semantics of OWL using institutions BIBAKFull-Text 1048-1049
  Dorel Lucanu; Yuan Fang Li; Jin Song Dong
The correctness of the Z semantics of OWL is the theoretical foundation of using software engineering techniques to verify Web ontologies. As OWL and Z are based on different logical systems, we use institutions to represent their underlying logical systems and use institution morphisms to prove the correctness of the Z semantics for OWL DL.
Keywords: OWL, Z, comorphism of institutions, institution
An analysis of search engine switching behavior using click streams BIBAKFull-Text 1050-1051
  Yun-Fang Juan; Chi-Chao Chang
In this paper, we propose a simple framework to characterize the switching behavior between search engines based on click streams. We segment users into a number of categories based on their search engine usage during two adjacent time periods and construct the transition probability matrix across these usage categories. The principal eigenvector of the transposed transition probability matrix represents the limiting probabilities, which are the proportions of users in each usage category at steady state. We experiment with this framework using click streams for two search engines: one with a large market share and the other with a small market share. The results offer interesting insights into search engine switching. The limiting probabilities provide empirical evidence that small engines can still retain their fair share of users over time.
Keywords: Markov chain, clustering, limiting probabilities, principal eigenvectors, probability matrix, search engines, sequence, session, switching behavior, transition
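The limiting probabilities described above are the stationary distribution of a Markov chain over usage categories, i.e. the principal eigenvector of the transposed transition matrix. A small pure-Python sketch using power iteration; the two-category matrix is a toy example, not the paper's data.

```python
# Stationary distribution of a Markov chain by power iteration.

def stationary_distribution(P, iters=1000):
    """P[i][j] = probability of moving from usage category i to j."""
    n = len(P)
    pi = [1.0 / n] * n                       # start from the uniform vector
    for _ in range(iters):
        # multiply by P^T: new_pi[j] = sum_i pi[i] * P[i][j]
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        s = sum(pi)
        pi = [x / s for x in pi]             # renormalize to a distribution
    return pi

# Toy example: category 0 = "large-engine users", 1 = "small-engine users"
P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = stationary_distribution(P)
print(pi)   # ~[0.667, 0.333]: the small engine keeps a stable user share
```

The steady state depends only on the transition structure, not on the initial split of users, which is what lets the authors reason about long-run market shares.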
Comparing relevance feedback algorithms for web search BIBAKFull-Text 1052-1053
  Vishwa Vinay; Ken Wood; Natasa Milic-Frayling; Ingemar J. Cox
We evaluate three different relevance feedback (RF) algorithms, Rocchio, Robertson/Sparck-Jones (RSJ) and Bayesian, in the context of Web search. We use a target-testing experimental procedure whereby a user must locate a specific document. For user relevance feedback, we consider all possible user choices of indicating zero or more relevant documents from a set of 10 displayed documents. Examination of the effects of each user choice permits us to compute an upper bound on the performance of each RF algorithm.
   We find that there is significant variation in the upper-bound performance of the three RF algorithms and that the Bayesian algorithm approaches the best possible.
Keywords: evaluation, relevance feedback, web search
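Of the three algorithms compared above, Rocchio is the simplest to state: the query vector is moved toward the centroid of relevant documents and away from the centroid of non-relevant ones. A textbook sketch follows; the weights are common defaults, not values from the paper.

```python
# Rocchio relevance feedback:
#   q' = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant)

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    def centroid(docs):
        if not docs:
            return [0.0] * len(query)
        return [sum(d[k] for d in docs) / len(docs) for k in range(len(query))]
    r, nr = centroid(relevant), centroid(nonrelevant)
    return [alpha * q + beta * r[k] - gamma * nr[k]
            for k, q in enumerate(query)]

# Three-term toy vocabulary: the query gains weight on the term that
# appears in the relevant document and loses weight on the non-relevant one.
q = [1.0, 0.0, 0.0]
new_q = rocchio(q, relevant=[[0.0, 1.0, 0.0]], nonrelevant=[[0.0, 0.0, 1.0]])
print(new_q)   # [1.0, 0.75, -0.15]
```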
SAT-MOD: moderate itemset fittest for text classification BIBAKFull-Text 1054-1055
  Jianlin Feng; Huijun Liu; Jing Zou
In this paper, we present a novel association-based method called SAT-MOD for text classification. SAT-MOD views a sentence rather than a document as a transaction, and uses a novel heuristic called MODFIT to select the most significant itemsets for constructing a category classifier. On the Reuters corpus, the effectiveness of SAT-MOD has been demonstrated to be comparable to well-known alternatives such as LinearSVM, and much better than current document-level word-association-based methods.
Keywords: MODFIT (moderate itemset fittest) heuristic, text classification
Applying NavOptim to minimise navigational effort BIBAKFull-Text 1056-1057
  David Lowe; Xiaoying Kong
A major factor in the effectiveness of users' interaction with Web applications is the ease with which they can locate the information and functionality they seek. Effective design is, however, complicated by the multiple design purposes and diverse users that Web applications typically support. In this paper we describe a navigational design method aimed at optimising designs by minimising navigational entropy. The approach uses a theoretical navigational depth for the various information and service components to moderate a nested hierarchical clustering of the content.
Keywords: design, efforts metrics, navigation architecture
Building reactive web applications BIBAKFull-Text 1058-1059
  Federico M. Facca; Stefano Ceri; Jacopo Armani; Vera Demaldé
The Adaptive Web is a new research area addressing the personalization of the Web experience for each user. In this paper we propose a new high-level model for the specification of Web applications that takes into account the manner in which users interact with the application, in order to supply appropriate contents or gather profile data. We therefore consider entire processes (rather than single properties) as the smallest information units, allowing for automatic restructuring of application components. For this purpose, a high-level Event-Condition-Action (ECA) paradigm is proposed, which enables capturing arbitrary (and timed) clicking behaviors.
Keywords: adaptive web, design method, eca rule, user modeling
Detection of phishing webpages based on visual similarity BIBAKFull-Text 1060-1061
  Liu Wenyin; Guanglin Huang; Liu Xiaoyue; Zhang Min; Xiaotie Deng
An approach to detection of phishing webpages based on visual similarity is proposed, which can be utilized as a part of an enterprise solution for anti-phishing. A legitimate webpage owner can use this approach to search the Web for suspicious webpages which are visually similar to the true webpage. A webpage is reported as a phishing suspect if the visual similarity is higher than its corresponding preset threshold. Preliminary experiments show that the approach can successfully detect those phishing webpages for online use.
Keywords: anti-phishing, information filtering, visual similarity, web document analysis
Modeling the author bias between two on-line computer science citation databases BIBAKFull-Text 1062-1063
  Vaclav Petricek; Ingemar J. Cox; Hui Han; Isaac G. Councill; C. Lee Giles
We examine the differences and similarities between two on-line computer science citation databases, DBLP and CiteSeer. The database entries in DBLP are inserted manually, while the CiteSeer entries are obtained autonomously. We show that the CiteSeer database contains considerably fewer single-author papers. This bias can be modeled by an exponential process with an intuitive explanation. The model permits us to predict that the DBLP database covers approximately 30% of the entire literature of Computer Science.
Keywords: DBLP, acquisition bias, bibliometrics, CiteSeer
Hubble: an advanced dynamic folder system for XML BIBAKFull-Text 1064-1065
  Ning Li; Joshua Hui; Hui-I Hsiao; Kevin Beyer
Organizing large document collections for finding information easily and quickly has always been an important user requirement. This paper describes a flexible and powerful dynamic folder technology, called Hubble, which exploits XML semantics to precisely categorize XML documents into categories or folders.
Keywords: XML, categorization, content navigation, dynamic folder
Support for arbitrary regions in XSL-FO: a proposal for extending XSL-FO semantics and processing model BIBAKFull-Text 1066-1067
  Ana Cristina B. da Silva; Joao B. S. de Oliveira; Fernando T. M. Mano; Thiago B. Silva; Leonardo L. Meirelles; Felipe R. Meneguzzi; Fabio Giannetti
This paper proposes an extension of the XSL-FO standard which allows the specification of an unlimited number of arbitrarily shaped page regions. These extensions are built on top of XSL-FO 1.1 to enable flow content to be laid out into arbitrary shapes and allowing for page layouts currently available only to desktop publishing software. Such a proposal is expected to leverage XSL-FO towards usage as an enabling technology in the generation of content intended for personalized printing.
Keywords: LaTeX, SVG, XML, XSL-FO, digital printing
Improved timing control for web server systems using internal state information BIBAKFull-Text 1068-1069
  Xue Liu; Rong Zheng; Jin Heo; Lui Sha
How to effectively allocate system resource to meet the Service Level Agreement (SLA) of Web servers is a challenging problem. In this paper, we propose an improved scheme for autonomous timing performance control in Web servers under highly dynamic traffic loads. We devise a novel delay regulation technique called Queue Length Model Based Feedback Control utilizing server internal state information to reduce response time variance in presence of bursty traffic. Both simulation and experimental studies using synthesized workloads and real-world Web traces demonstrate the effectiveness of the proposed approach.
Keywords: SLA, control theory, feedback, queueing model, web server
Service discovery and measurement based on DAML-QoS ontology BIBAKFull-Text 1070-1071
  Chen Zhou; Liang-Tien Chia; Bu-Sung Lee
As more and more Web services are deployed, discovery mechanisms for Web services become essential. Similar services can have quite different QoS behaviors. For service selection and management purposes, it is necessary to clearly specify QoS constraints and metric definitions for Web services. We investigate semantic QoS specification and introduce our design principles for it. Based on specification refinement and conformance, we introduce a QoS matchmaking algorithm with multiple matching degrees. A matchmaking prototype was designed to demonstrate feasibility. Well-defined metrics can be further utilized by measurement organizations to monitor and evaluate the promised service level objectives.
Keywords: QoS, matchmaking, semantic web, web service discovery
Boosting SVM classifiers by ensemble BIBAKFull-Text 1072-1073
  Yan-Shi Dong; Ke-Song Han
Support vector machines (SVM) currently achieve state-of-the-art performance on text classification (TC) tasks. Due to the complexity of TC problems, it is a challenge to systematically develop classifiers with better performance. We attack this problem with ensemble methods, which are often used to boost weak classifiers such as decision trees and neural networks; whether they are effective for strong classifiers is not clear.
Keywords: classifier design and evaluation, information filtering, machine learning, neural nets, text processing
Adaptive query routing in peer web search BIBAKFull-Text 1074-1075
  Le-Shin Wu; Ruj Akavipat; Filippo Menczer
An unstructured peer network application was proposed to address the query forwarding problem of distributed search engines and scalability limitations of centralized search engines. Here we present novel techniques to improve local adaptive routing, showing they perform significantly better than a simple learning scheme driven by query response interactions among neighbors. We validate prototypes of our peer network application via simulations with 500 model users based on actual Web crawls. We finally compare the quality of the results with those obtained by centralized search engines, suggesting that our application can draw advantages from the context and coverage of the peer collective.
Keywords: adaptive query routing, peer collaborative search, topical crawlers
Transforming web contents into a storybook with dialogues and animations BIBAKFull-Text 1076-1077
  Kaoru Sumi; Katsumi Tanaka
This paper describes a medium, called Interactive e-Hon, for helping children to understand contents from the Web. It works by transforming electronic contents into an easily understandable "storybook world." In this world, easy-to-understand contents are generated by creating 3D animations that include contents and metaphors, and by using a child-parent model with dialogue expression and a question-answering style comprehensible to children.
Keywords: agent, animation, dialogue, information presentation, media conversion
AVATAR: an approach based on semantic reasoning to recommend personalized TV programs BIBAKFull-Text 1078-1079
  Yolanda Blanco; José J. Pazos; Alberto Gil; Manuel Ramos; Ana Fernández; Rebeca P. Díaz; Martín López; Belén Barragáns
In this paper a TV recommender system called AVATAR (AdVAnce Telematic search of Audiovisual contents by semantic Reasoning) is presented. This tool uses the experience gained in the field of the Semantic Web to personalize the TV programs shown to the end users. The main contribution of our system is a process of semantic reasoning carried out on the descriptions of the TV contents -- provided by means of metainformation -- and on the viewer preferences -- contained in personal profiles. Such a process diversifies the offered suggestions while maintaining personalization, given that the aim is to find contents appealing to users that are semantically related to their programs of interest.
   Here we introduce the framework proposed for this reasoning, including (i) the OWL ontology chosen to represent the knowledge of our application domain, (ii) the organization of the user profiles, (iii) the query language LIKO, which is intended to browse the ontology, and (iv) the semantic relations inferred from the system knowledge base.
Keywords: TV recommender system, inference of semantic relations, ontologies, semantic web
WAND: a meta-data maintenance system over the internet BIBAKFull-Text 1080-1081
  Anubhav Bhatia; Saikat Mukherjee; Saugat Mitra; Srinath Srinivasa
WAND is a meta-data management system that provides a file-system tree for users of an internet based P2P network. The tree is robust and retains its structure even when nodes (peers) enter and leave the network. The robustness is based on a concept of virtual folders that are automatically created to retain paths to lower level folders whenever a node hosting a higher-level folder moves away. Other contributions of the WAND system include its novel approach towards managing root directory information and handling network partitions.
Keywords: maintenance, meta-data, peer-to-peer, wide-area distributed file system
Composite event queries for reactivity on the web BIBAKFull-Text 1082-1083
  James Bailey; François Bry; Paula-Lavinia Pätrânjan
Reactivity on the Web is an emerging issue. The capability to automatically react to events (such as updates to Web resources) is essential for both Web services and Semantic Web systems. Such systems need to have the capability to detect and react to complex, real life situations. This presentation gives flavours of the high-level language XChange, for programming reactive behaviour on the Web.
Keywords: composite events, event-condition-action rules, reactive languages, web
Learning how to learn with web contents BIBAKFull-Text 1084-1085
  Akihiro Kashihara; Shinobu Hasegawa
Learning Web contents requires learners not only to navigate the Web pages to construct their own knowledge from the contents learned at and between the pages, but also to control their own navigation and knowledge construction processes. However, it is not so easy to control the learning processes. The main issue addressed is how to help learners learn how to learn with Web contents. This paper discusses how to design a meta-learning tool.
Keywords: hyperspace, learning affordance, meta-learning, navigational learning, web contents
From user-centric web traffic data to usage data BIBAKFull-Text 1086-1087
  Thomas Beauvisage; Houssem Assadi
In this paper, we describe a user-centric Internet usage data processing platform. Raw usage data is collected using a software probe installed on the workstations of a panel of Internet users. It is then processed by our platform. Transforming raw usage data into qualified, usable information for Internet usage sociology researchers requires a series of relatively complex processes drawing on a wide variety of resources. We use a combination of ad hoc rule-based systems and external resources to qualify the visited Web pages. We also implemented topological and temporal indicators in order to describe the dynamics of Web sessions.
Keywords: internet uses, traffic analysis, usage data, user-centric traffic data, web usage mining
Multichannel publication of interactive media documents in a news environment BIBAKFull-Text 1088-1089
  Tom Beckers; Nico Oorts; Filip Hendrickx; Rik Van De Walle
Multichannel publication of multimedia presentations poses a significant challenge on the generic description of the presentation content and the system necessary to convert these descriptions into final-form presentations. We present a solution based on the XiMPF document model and a component based system architecture.
Keywords: XML, device independence, framework, interactivity, multichannel publication, multimedia, standards
Advanced fault analysis in web service composition BIBAKFull-Text 1090-1091
  L. Ardissono; L. Console; A. Goy; G. Petrone; C. Picardi; M. Segnan; D. Theseider Dupré
Currently, fault management in Web Services orchestrating multiple suppliers relies on local analysis that does not span individual services, thus limiting the effectiveness of recovery strategies. We propose to address this limitation by employing Model-Based Diagnosis to enhance fault analysis. In our approach, a Diagnostic Web Service is added to the set of Web Services providing the overall service, and acts as a supervisor of their execution, identifying anomalies and explaining them in terms of faults to be repaired.
Keywords: diagnosis, fault management, web service composition
Mining directed social network from message board BIBAKFull-Text 1092-1093
  Naohiro Matsumura; David E. Goldberg; Xavier Llorà
In this paper, we present an approach to mining a directed social network from a message board on the Internet, where vertices denote individuals and directed links denote the flow of influence. Influence is measured by propagating terms among individuals via messages. A distance reflecting the contextual similarity between individuals is also obtained, since influence indicates the degree of their shared interest as represented by terms.
Keywords: directed social network, internet message board
Incremental page rank computation on evolving graphs BIBFull-Text 1094-1095
  Prasanna Desikan; Nishith Pathak; Jaideep Srivastava; Vipin Kumar
Enhancing the privacy of web-based communication BIBAKFull-Text 1096-1097
  Aleksandra Korolova; Ayman Farahat; Philippe Golle
A profiling adversary is an adversary whose goal is to classify a population of users into categories according to the messages they exchange. This adversary models the most common privacy threat against web-based communication.
   We propose a new encryption scheme, called stealth encryption, that protects users from profiling attacks by concealing the semantic content of plaintext while preserving its grammatical structure and other non-semantic linguistic features, such as word frequency distribution. Given English plaintext, stealth encryption produces ciphertext that cannot efficiently be distinguished from normal English text (our techniques apply to other languages as well).
Keywords: privacy, profiling, protection
Generating XSLT scripts for the fast transformation of XML documents BIBAKFull-Text 1098-1099
  Dong-Hoon Shin; Kyong-Ho Lee
This paper proposes a method of generating XSLT scripts that support the fast transformation of XML documents, given one-to-one matching relationships between leaf nodes of XML schemas. The proposed method improves the transformation speed of the generated XSLT scripts by reducing template calls. Experimental results show that the proposed method generates XSLT scripts that transform XML documents faster than those of previous work.
Keywords: XML, XSLT, document transformation
ALVIN: a system for visualizing large networks BIBKFull-Text 1100-1101
  Davood Rafiei; Stephen Curial
Keywords: network visualization, sampling, visualizing the web
Analysis of topic dynamics in web search BIBAKFull-Text 1102-1103
  Xuehua Shen; Susan Dumais; Eric Horvitz
We report on a study of topic dynamics for pages visited by a sample of people using MSN Search. We examine the predictive accuracies of probabilistic models of topic transitions for individuals and groups of users. We explore temporal dynamics by comparing the accuracy of the models for predicting topic transitions at increasingly distant times in the future. Finally, we discuss directions for applying models of search topic dynamics.
Keywords: topic analysis, topic transition, user modeling, web search
Clustering for probabilistic model estimation for CF BIBAKFull-Text 1104-1105
  Qing Li; Byeong Man Kim; Sung Hyon Myaeng
Based on the type of collaborative objects, a collaborative filtering (CF) system falls into one of two categories: item-based CF and user-based CF. Clustering is the basic idea in both cases: users or items are classified into user groups, whose members share similar preferences, or item groups, whose items have similar attributes or characteristics. Observing that in user-based CF each user community is characterized by a Gaussian distribution over the ratings for each item, and that in item-based CF the ratings of each user within an item community follow a Gaussian distribution, we propose a method of probabilistic model estimation for CF, in which objects (users or items) are classified into groups based on content information and ratings simultaneously, and predictions are made considering the Gaussian distribution of ratings. Experiments on a real-world data set show that our approach performs favorably.
Keywords: collaborative filtering, information filtering, probabilistic model
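The Gaussian assumption above can be made concrete with a minimal sketch: within a user group, ratings for an item are modeled as Gaussian, and the maximum-likelihood mean serves as the prediction for an unseen group member. The grouping and the ratings below are invented for illustration, and this is not the paper's full estimation procedure.

```python
# ML fit of a per-item Gaussian within one user group, used for prediction.

from math import sqrt

def fit_gaussian(ratings):
    """Maximum-likelihood (mean, std) for one item within one user group."""
    mu = sum(ratings) / len(ratings)
    var = sum((r - mu) ** 2 for r in ratings) / len(ratings)
    return mu, sqrt(var)

# Ratings for one item by members of the same (hypothetical) user group
group_ratings = [4, 5, 4, 3, 4]
mu, sigma = fit_gaussian(group_ratings)
predicted = mu   # predict the group mean for an unseen group member
print(predicted, round(sigma, 3))   # 4.0 0.632
```

The standard deviation also gives a natural confidence estimate for the prediction, which a probabilistic CF model can exploit when combining groups.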
An experimental study on large-scale web categorization BIBAKFull-Text 1106-1107
  Tie-Yan Liu; Yiming Yang; Hao Wan; Qian Zhou; Bin Gao; Hua-Jun Zeng; Zheng Chen; Wei-Ying Ma
Taxonomies of the Web typically have hundreds of thousands of categories and skewed category distribution over documents. It is not clear whether existing text classification technologies can perform well on and scale up to such large-scale applications. To understand this, we conducted the evaluation of several representative methods (Support Vector Machines, k-Nearest Neighbor and Naive Bayes) with Yahoo! taxonomies. In particular, we evaluated the effectiveness/efficiency tradeoff in classifiers with hierarchical setting compared to conventional (flat) setting, and tested popular threshold tuning strategies for their scalability and accuracy in large-scale classification problems.
Keywords: algorithm complexity, parameter tuning strategies, text categorization, very large web taxonomies
Site abstraction for rare category classification in large-scale web directory BIBAKFull-Text 1108-1109
  Tie-Yan Liu; Hao Wan; Tao Qin; Zheng Chen; Yong Ren; Wei-Ying Ma
Automatically classifying Web directories is an effective way to manage Web information. However, our experiments showed that state-of-the-art text classification technologies could not achieve acceptable performance on this task. Our analysis indicates that the main problem is the lack of effective training data in the rare categories of Web directories. To tackle this problem, we propose a novel technique named Site Abstraction, which synthesizes new training examples from the websites of existing training documents. The main idea is to propagate features through parent-child relationships in the sitemap tree. Experiments showed that our method significantly improved classification performance.
Keywords: hierarchical classification, site abstraction, support vector machines (SVM), text classification, web directory
Designing learning services: from content-based to activity-based learning systems BIBAKFull-Text 1110-1111
  Pythagoras Karampiperis; Demetrios Sampson
The need for e-learning systems that support a diverse set of pedagogical requirements has been identified as an important issue in web-based education. Until now, significant R&D effort has been devoted to web-based educational systems tailored to specific pedagogical approaches. The most advanced of them are based on the IEEE Learning Technology Systems Architecture and use standardized content structuring based on the ADL Sharable Content Object Reference Model in order to enable sharing and reusability of the learning content. However, sharing of learning activities among different web-based educational systems still remains an open issue. The open question is how web-based educational systems should be designed in order to enable reusing and repurposing of learning activities. In this paper we propose an authoring system, referred to as ASK-LDT, that utilizes Learning Design principles to provide the means for designing activity-based learning services and systems.
Keywords: architectures, authoring tools, learning activities, learning design, reusability
Topological spaces of the web BIBKFull-Text 1112-1113
  Gabriel Ciobanu; Danut Rusu
Keywords: separation, topology density, web metrics
Extracting context to improve accuracy for HTML content extraction BIBAKFull-Text 1114-1115
  Suhit Gupta; Gail Kaiser; Salvatore Stolfo
Previous work on content extraction utilized various heuristics such as the link-to-text ratio, the prominence of tables, and the identification of advertising. Many of these heuristics were associated with "settings", whereby some heuristics could be turned on or off and others parameterized by minimum or maximum threshold values. A given collection of settings -- such as removing table cells with high linked-to-non-linked text ratios and removing all apparent advertising -- might work very well for a news website, but leave little or no content for the reader of a shopping site or a web portal. We present a new technique, based on incrementally clustering websites using search engine snippets, that associates a newly requested website with a particular "genre" and then employs settings previously determined to be appropriate for that genre, with dramatically improved content extraction results overall.
Keywords: DOM trees, HTML, accessibility, content extraction, context, reformatting, speech rendering
Constructing extensible XQuery mappings BIBAKFull-Text 1116-1117
  Gang Qian; Yisheng Dong
Constructing and maintaining semantic mappings are necessary but troublesome in data sharing systems. While most current work focuses on seeking automated techniques to solve this problem, this paper proposes a combination model for constructing extensible mappings between XML schemas. In our model, complex global mappings are constructed by first defining simple atomic mappings for each target schema element, and then combining them using a few basic operators. At the same time, we provide automated support for constructing such combined mappings.
Keywords: XQuery, automated support, extensibility, mapping
TJFast: effective processing of XML twig pattern matching BIBAKFull-Text 1118-1119
  Jiaheng Lu; Ting Chen; Tok Wang Ling
Finding all the occurrences of a twig pattern in an XML database is a core operation for efficient evaluation of XML queries. A number of algorithms have been proposed to process a twig query based on region encoding. In this paper, based on a novel labeling scheme: extended Dewey, we propose a novel and efficient holistic twig join algorithm, namely TJFast. Compared to previous work, our algorithm only needs to access the labels of leaf query nodes. We report our experimental results to show that our algorithms are superior to previous approaches in terms of the number of elements scanned and query performance.
Keywords: holistic twig join, labeling scheme
Web services security configuration in a service-oriented architecture BIBAKFull-Text 1120-1121
  Takeshi Imamura; Michiaki Tatsubori; Yuichi Nakamura; Christopher Giblin
Security is one of the major concerns when developing mission-critical business applications, and this concern motivated the Web Services Security specifications. However, the existing tools for configuring the security properties of Web Services give a technology-oriented view, only assisting in choosing the data to encrypt and the encryption algorithms to use. A user must manually bridge the gap between the security requirements and the configuration, which can cause extra configuration cost and lead to potential misconfiguration hazards. To ease this situation, we propose refining security requirements from the business level down to the technology level, leveraging the concepts of Service-Oriented Architecture (SOA) and Model-Driven Architecture (MDA). Security requirements are gradually transformed into more detailed requirements or countermeasures, with best-practice patterns bridging the gap between them.
Keywords: best practice pattern, model-driven architecture, security configuration, service-oriented architecture, web services security
BackRank: an alternative for PageRank? BIBAKFull-Text 1122-1123
  Mohamed Bouklit; Fabien Mathieu
This paper extends a previous work, "The Effect of the Back Button in a Random Walk: Application for PageRank" [5]. We introduce an enhanced version of the PageRank algorithm using a realistic model of the Back button, thus improving the random surfer model. We show that in the special case where the history is bound to a single page (you cannot use the Back button twice in a row), we can produce an algorithm that does not need much more resources than a standard PageRank. This algorithm, BackRank, can converge up to 30% faster than standard PageRank and eliminates most of the drawbacks induced by the existence of pages without links.
Keywords: PageRank, back button, flow, random walk, web analysis
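The random-surfer model with a one-page history described above can be illustrated by direct simulation: with some probability the surfer presses Back, but never twice in a row since the history is then empty. This is a Monte Carlo sketch of the model, not the paper's BackRank algorithm; the graph and parameters are invented, and note how the dead-end page `c` is handled by teleportation rather than causing the walk to stall.

```python
# Simulated random walk with a single-step Back button.

import random

def back_walk(links, steps=200000, damping=0.85, back=0.3, seed=0):
    rng = random.Random(seed)
    pages = list(links)
    visits = {p: 0 for p in pages}
    current, previous = rng.choice(pages), None
    for _ in range(steps):
        if previous is not None and rng.random() < back:
            current, previous = previous, None    # press Back; history cleared
        elif rng.random() < damping and links[current]:
            current, previous = rng.choice(links[current]), current
        else:                                     # teleport (handles dead ends)
            current, previous = rng.choice(pages), None
        visits[current] += 1
    return {p: v / steps for p, v in visits.items()}

links = {"a": ["b"], "b": ["a", "c"], "c": []}    # c has no out-links
rank = back_walk(links)
print(rank)   # visit frequencies, summing to 1
```

BackRank itself replaces this simulation with a deterministic fixed-point computation over an augmented state, which is what makes it nearly as cheap as standard PageRank.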
Finding the boundaries of information resources on the web BIBAKFull-Text 1124-1125
  Pavel Dmitriev; Carl Lagoze; Boris Suchkov
In recent years, many algorithms for the Web have been developed that work with information units distinct from individual web pages. These include segments of web pages or aggregation of web pages into web communities. Using these logical information units has been shown to improve the performance of many web algorithms. In this paper, we focus on a type of logical information units called compound documents. We argue that the ability to identify compound documents can improve information retrieval, automatic metadata generation, and navigation on the Web. We propose a unified framework for identifying the boundaries of compound documents, which combines both structural and content features of constituent web pages. The framework is based on a combination of machine learning and clustering algorithms, with the former algorithm supervising the latter one. Experiments on a collection of educational web sites show that our approach can reliably identify most of the compound documents on these sites.
Keywords: WWW, clustering, compound documents
Semantic search of schema repositories BIBFull-Text 1126-1127
  Tanveer Syeda-Mahmood; Gauri Shah; Lingling Yan; Willi Urban
Improving text collection selection with coverage and overlap statistics BIBAKFull-Text 1128-1129
  Thomas Hernandez; Subbarao Kambhampati
In an environment of distributed text collections, the first step in the information retrieval process is to identify which of all available collections are more relevant to a given query and which should thus be accessed to answer the query. We address the challenge of collection selection when there is full or partial overlap between the available text collections, a scenario which has not been examined previously despite its real-world applications. To that end, we present COSCO, a collection selection approach which uses collection-specific coverage and overlap statistics. We describe our experimental results which show that the presented approach displays the desired behavior of retrieving more new results early on in the collection order, and performs consistently and significantly better than CORI, previously considered to be one of the best collection selection systems.
Keywords: collection overlap, collection selection, statistics gathering
A framework for handling dependencies among web services transactions BIBAKFull-Text 1130-1131
  Seunglak Choi; Jungsook Kim; Hyukjae Jang; Su Myeon Kim; Junehwa Song; Hangkyu Kim; Yunjoon Lee
This paper proposes an effective Web services (WS) transaction management framework that automatically manages the inconsistencies caused by relaxing the isolation of WS transactions.
Keywords: isolation relaxation, transaction management protocol, transaction model, web services
Middleware services for web service compositions BIBAKFull-Text 1132-1133
  Anis Charfi; Mira Mezini
WS-* specifications cover a variety of issues ranging from security and reliability to transaction support in web services. However, these specifications do not address web service compositions. On the other hand, BPEL, the emerging standard web service composition language, allows the functional part of a composition to be specified as a business process, but it falls short of expressing non-functional properties such as security, reliability and persistence. In this paper, we propose an approach for the transparent integration of technical concerns into web service compositions. Our approach is driven by the analogy between web services and software components and is inspired by server-side component models such as Enterprise JavaBeans. The main components of our framework are the process container, the middleware services and the deployment descriptor.
Keywords: BPEL, middleware, web service composition
Application networking on peer-to-peer networks BIBAKFull-Text 1134-1135
  Mu Su; Chi-Hung Chi
This paper proposes the AN.P2P architecture to facilitate efficient peer-to-peer content delivery with heterogeneous presentation requirements. In general, the AN.P2P enables a peer to deliver the original content objects and an associated workflow to other peers. The workflow is composed of content adaptation tasks. Hence, the recipient can reuse the original object to generate appropriate presentations for other peers.
Keywords: application networking, peer-to-peer content distribution
Web data cleansing for information retrieval using key resource page selection BIBAKFull-Text 1136-1137
  Yiqun Liu; Canhui Wang; Min Zhang; Shaoping Ma
With the explosive growth of the Web, covering more useful information with limited storage and computation resources is an increasingly important problem in web IR research. Using an analysis of non-content features of web pages, we propose a clustering-based method to select high-quality pages from the whole page set. Although the resulting page set contains only 44.3% of the whole collection, it is associated with more than 98% of the links and covers about 90% of the key information. Link properties and retrieval effects are also examined, and experimental results show that the key resource selection method is well suited to data cleansing: the resulting page set outperforms the whole collection, with a smaller size and better retrieval performance.
Keywords: non-content feature, web IR, web data cleansing
Web resource geographic location classification and detection BIBAKFull-Text 1138-1139
  Chuang Wang; Xing Xie; Lee Wang; Yansheng Lu; Wei-Ying Ma
The rapid pervasion of the web into users' daily lives has made capturing location-specific information on the web increasingly important, because most human activities occur locally, around where a user is located. This is especially true in the increasingly popular mobile and local search environments. Thus, how to correctly and effectively detect locations from web resources has become a key challenge for location-based web applications. In this paper, we first distinguish three types of web resource locations to cater to different application needs: 1) provider location; 2) content location; and 3) serving location. We then describe a unified system that computes each of the three locations, employing a set of algorithms and different geographic sources.
Keywords: content location, geographic location, location-based web application, provider location, serving location, web location
Ontology-based learning content repurposing BIBAKFull-Text 1140-1141
  Katrien Verbert; Dragan Gasevic; Jelena Jovanovic; Erik Duval
This paper investigates basic research issues that need to be addressed for developing an architecture that enables repurposing of learning objects in a flexible way. Currently, there are a number of Learning Object Content Models (e.g. the SCORM Content Aggregation Model) that define learning objects and their components in a more or less precise way. However, these models do not allow repurposing of fine-grained components (sentences, images). We developed an ontology-based solution for content repurposing. The ontology is a solid basis for an architecture that will enable on-the-fly access to learning object components and that will facilitate repurposing these components.
Keywords: content models, learning objects, metadata, ontologies, repurposing
Representing personal web information using a topic-oriented interface BIBAKFull-Text 1142-1143
  Zhigang Hua; Hao Liu; Xing Xie; Hanqing Lu; Wei-Ying Ma
Web activities have become part of people's daily practice, so it is essential to organize and present this continuously increasing Web information in a more usable manner. In this paper, we developed a novel approach that reorganizes personal Web information into a topic-oriented interface. In our approach, we use anchor, title and URL information, rather than the content body, to represent the content of browsed Web pages. Furthermore, we explored three methods to organize personal Web information: 1) top-down statistical clustering; 2) salient-phrase-based clustering; and 3) support vector machine (SVM) based classification. Finally, we conducted a usability study to verify the effectiveness of our proposed solution. The experimental results demonstrated that, with our approach, users could revisit previously browsed pages more easily than with existing solutions.
Keywords: clustering, personal web information, topic classification, user information mining, user interface
Web2Talkshow: transforming web content into TV-program-like content based on the creation of dialogue BIBAKFull-Text 1144-1145
  Akiyo Nadamoto; Masaki Hayashi; Katsumi Tanaka
We propose a new browsing system called "Web2Talkshow". It transforms declarative web content into humorous, dialog-based, TV-program-like content presented through cartoon animation and synthesized speech, based on keywords in the original web content. Web2Talkshow enables users to get desired web content easily, pleasantly, and in a user-friendly way while continuing to work on other tasks. Using it will thus be much like watching TV.
Keywords: TV-program-like content, dialogue, humor, web content
WCAG formalization with W3C standards BIBAKFull-Text 1146-1147
  Vicente Luque Centeno; Carlos Delgado Kloos; Martin Gaedke; Martin Nussbaumer
Web accessibility consists of a set of checkpoints that are rather expensive to evaluate or to spot. Using W3C technologies, however, this cost can be significantly reduced. This article presents a rule set, formalized with W3C standards, for the automatable checkpoints of WCAG 1.0.
Keywords: WAI, WCAG, XPath, XPointer, XQuery
Bootstrapping ontology alignment methods with APFEL BIBKFull-Text 1148-1149
  Marc Ehrig; Steffen Staab; York Sure
Keywords: alignment, machine learning, mapping, matching, ontology
Understanding the function of web elements for mobile content delivery using random walk models BIBAKFull-Text 1150-1151
  Xinyi Yin; Wee Sun Lee
In this paper, we describe a method for understanding the function of web elements. It classifies web elements into five functional categories: Content (C), Related Links (R), Navigation and Support (N), Advertisement (A) and Form (F). We construct five graphs for a web page, each designed so that most of the probability mass of the stationary distribution is concentrated in nodes belonging to its corresponding category. We perform random walks on these graphs until convergence and classify each element based on its rank values in the different graphs. Our experiments show that the new method performs very well compared to basic machine learning methods.
Keywords: HTML, WWW (world wide web), classification
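The paper's graph construction is not described in the abstract, but the classification step — power iteration to a stationary distribution on each category graph, then an argmax over an element's ranks — can be sketched. The three-node graph and the two-category example below are illustrative inventions, not the paper's data:

```python
def stationary_distribution(adj, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for the stationary distribution of a random walk
    with uniform teleportation; adj[u] lists the out-neighbors of node u."""
    n = len(adj)
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new = [(1.0 - damping) / n] * n
        for u, nbrs in enumerate(adj):
            if nbrs:
                share = damping * rank[u] / len(nbrs)
                for v in nbrs:
                    new[v] += share
            else:                       # dangling node: spread mass uniformly
                for v in range(n):
                    new[v] += damping * rank[u] / n
        if sum(abs(a - b) for a, b in zip(new, rank)) < tol:
            return new
        rank = new
    return rank

def classify(rank_by_category, node):
    """Pick the category whose graph gives this node the most stationary mass."""
    return max(rank_by_category, key=lambda cat: rank_by_category[cat][node])
```

Running the walk on each of the five graphs yields one rank vector per category; `classify` then assigns each element to the category in which it accumulates the most probability mass.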
Does learning how to read Japanese have to be so difficult: and can the web help? BIBKFull-Text 1152-1153
  Julien Quint; Ulrich Apel
Keywords: kanji, Japanese, SVG, graphetic dictionary, reading help
The semantic webscape: a view of the semantic web BIBAKFull-Text 1154-1155
  Juhnyoung Lee; Richard Goodwin
It has been a few years since the semantic Web was initiated by W3C, but its status has not been quantitatively measured. Understanding the status at this early stage is crucial for researchers, developers and administrators to gain insight into what will come in this field. The objective of our work is to quantitatively measure and present the status of the semantic Web. We conduct a longitudinal study of semantic Web pages to track trends in the use of semantic markup languages. This paper presents early results of this study with two historical data sets, from October 2003 and October 2004. Our results show that while semantic Web adoption is still at a very early stage, its growth outpaced that of the entire Web over the period. RDF (Resource Description Framework) dominates among semantic markup languages, accounting for about 98% of all semantic pages on the Web, and has been used in a variety of metadata annotation applications. The study shows that the most popular application is RSS (RDF Site Summary) for syndicating news and blogs, which accounts for more than 60% of all semantic Web pages. It also shows that the use of OWL (Web Ontology Language), recommended by W3C in early 2004, increased by 900% over the period.
Keywords: RSS, markup languages, ontology, semantic web
A modeling approach to federated identity and access management BIBAKFull-Text 1156-1157
  Martin Gaedke; Johannes Meinecke; Martin Nussbaumer
As the Web is increasingly used as a platform for heterogeneous applications, we are faced with new requirements to authentication, authorization and identity management. Modern architectures have to control access not only to single, isolated systems, but to whole business-spanning federations of applications and services. This task is complicated by the diversity of today's specifications concerning e.g. privacy, system integrity and distribution in the web. As an approach to such problems, in this paper, we introduce a solution catalogue of reusable building blocks for Identity and Access Management (IAM). The concepts of these blocks have been realized in a configurable system that supports IAM solutions for Web-based applications.
Keywords: federation, identity and access management, reuse, security
XSLT by example BIBAKFull-Text 1158-1159
  Daniele Braga; Alessandro Campi; Roberto Cappa; Damiano Salvi
XQBE (XQuery By Example, [1]), a visual dialect of XQuery, uses hierarchical structures to express transformations between XML documents. XSLT, the standard transformation language for XML, is increasingly popular among programmers and Web developers for separating the application and presentation layers of Web applications. However, its syntax and its rule-based execution paradigm are rather intricate, and the number of XSLT experts is limited; the availability of easier "dialects" could be extremely valuable and could contribute to the adoption of XML for developing data-centered Web applications and services. With this motivation in mind, we adapted XQBE to serve as a visual interface for expressing XML-to-XML transformations and to generate the XSLT code that performs them.
Keywords: XML, XQuery, semi-structured data, visual query languages
Automated semantic web services orchestration via concept covering BIBAKFull-Text 1160-1161
  T. Di Noia; E. Di Sciascio; F. M. Donini; A. Ragone; S. Colucci
We exploit the recently proposed Concept Abduction inference service in Description Logics to solve Concept Covering problems. We propose a framework and a polynomial greedy algorithm for semantics-based automated Web service orchestration, fully compliant with Semantic Web technologies. We show that the proposed approach can deal with inexact solutions, computing an approximate orchestration with respect to an agent request modeled as a subset of OWL-DL.
Keywords: description logics, orchestration, semantic web, semantic web services
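The greedy Concept Covering step is structurally similar to greedy set cover. As an illustrative sketch only — treating each service's capabilities as a plain set of features rather than DL concepts, with invented service names — the approximate-cover behavior described above looks like:

```python
def greedy_cover(request, services):
    """Greedy approximation of Concept Covering: repeatedly pick the service
    covering the most still-uncovered features of the request, and report
    any uncovered remainder (the 'inexact solution' case)."""
    uncovered = set(request)
    chosen = []
    while uncovered:
        best = max(services, key=lambda s: len(uncovered & services[s]))
        if not uncovered & services[best]:
            break  # no service adds coverage: stop with a partial cover
        chosen.append(best)
        uncovered -= services[best]
    return chosen, uncovered
```

When the request cannot be covered exactly, the returned remainder is what the abstract calls the approximate part of the orchestration.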
Answering order-based queries over XML data BIBAKFull-Text 1162-1163
  Zografoula Vagena; Nick Koudas; Divesh Srivastava; Vassilis J. Tsotras
Order-based queries over XML data include XPath navigation axes such as following-sibling and following. In this paper, we present holistic algorithms that evaluate such order-based queries. An experimental comparison with previous approaches shows the performance benefits of our algorithms.
Keywords: XML, holistic algorithms, order-based queries
A publish and subscribe collaboration architecture for web-based information BIBAKFull-Text 1164-1165
  M. Brian Blake; David H. Fado; Gregory A. Mack
Markup languages, representations, schemas, and tools have significantly increased organizations' ability to share their information. Languages such as the Extensible Markup Language (XML) provide a vehicle for organizations to represent information in a common, machine-interpretable format. Although these approaches facilitate the collaboration and integration of inter-organizational information, the reality is that the schema representations behind these languages are fairly difficult to learn, and automated schema integration (without semantics or ontology mappings) remains an open problem. In this paper, we introduce an architecture and service-oriented infrastructure to facilitate organizational collaboration that combines the push features of the publish/subscribe protocol with the storage capabilities of distributed registries.
Keywords: distributed and heterogeneous information management, management of semi-structured data
Migrating web application sessions in mobile computing BIBAKFull-Text 1166-1167
  G. Canfora; G. Di Santo; G. Venturi; E. Zimeo; M. V. Zito
The capability to change user agent while working is starting to appear in state-of-the-art mobile computing, owing to the proliferation of different kinds of devices, ranging from personal wireless devices to desktop computers, and the consequent need to migrate a working session from one device to a more apt one. Research results on low-level hand-off are not sufficient to solve the problem at the application level. This paper presents a scheme for session hand-off in Web applications which, by exploiting a proxy-based architecture, works without modifying existing code.
Keywords: mobile computing, session hand-off, web applications
Video quality estimation for internet streaming BIBKFull-Text 1168-1169
  Amy Reibman; Subhabrata Sen; Jacobus Van der Merwe
Keywords: network measurement, performance, streaming, video quality
An approach for ontology-based elicitation of user models to enable personalization on the semantic web BIBAKFull-Text 1170-1171
  Ronald Denaux; Lora Aroyo; Vania Dimitrova
We present a novel framework for eliciting a user's conceptualization through an ontology-driven dialog. The framework has been integrated into an RDF/OWL-based architecture for an adaptive learning content management system, and is illustrated with an application scenario that addresses the cold-start problem and enables tailoring the system's behavior to the needs of each individual user.
Keywords: adaptive content management, application of semantic web technologies, personalization on the semantic web, user modeling
Analyzing online discussion for marketing intelligence BIBAKFull-Text 1172-1173
  Natalie Glance; Matthew Hurst; Kamal Nigam; Matthew Siegler; Robert Stockton; Takashi Tomokiyo
We present a system that gathers and analyzes online discussion as it relates to consumer products. Weblogs and online message boards provide forums that record the voice of the public, and woven into this discussion is a wide range of opinion and commentary about consumer products. Given its volume, format and content, the appropriate approach to understanding this data is large-scale web and text data mining. Using a wide variety of state-of-the-art techniques, including crawling, wrapping, text classification and computational linguistics, the system gathers and annotates online discussion within a framework that supports interactive analysis, yielding marketing intelligence for our customers.
Keywords: computational linguistics, content systems, information retrieval, machine learning, text mining
Exploiting the deep web with DynaBot: matching, probing, and ranking BIBAKFull-Text 1174-1175
  Daniel Rocco; James Caverlee; Ling Liu; Terence Critchlow
We present the design of Dynabot, a guided Deep Web discovery system. Dynabot's modular architecture supports focused crawling of the Deep Web with an emphasis on matching, probing, and ranking discovered sources using two key components: service class descriptions and source-biased analysis. We describe the overall architecture of Dynabot and discuss how these components support effective exploitation of the massive Deep Web data available.
Keywords: crawling, deep web, probing, service class
A framework for determining necessary query set sizes to evaluate web search effectiveness BIBAFull-Text 1176-1177
  Eric C. Jensen; Steven M. Beitzel; Ophir Frieder; Abdur Chowdhury
We describe a framework of bootstrapped hypothesis testing for estimating the confidence in one web search engine outperforming another over any randomly sampled query set of a given size. To validate this framework, we have constructed and made available a precision-oriented test collection consisting of manual binary relevance judgments for each of the top ten results of ten web search engines across 896 queries and the single best result for each of those queries. Results from this bootstrapping approach over typical query set sizes indicate that examining repeated statistical tests is imperative, as a single test is quite likely to find significant differences that do not necessarily generalize. We also find that the number of queries needed for a repeatable evaluation in a dynamic environment such as the web is much higher than previously studied.
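A minimal sketch of the bootstrapping idea, assuming paired per-query effectiveness scores for the two engines (the scores and parameters below are synthetic, not drawn from the paper's 896-query collection):

```python
import random

def bootstrap_confidence(scores_a, scores_b, sample_size, trials=10000, seed=0):
    """Fraction of bootstrap query sets (sampled with replacement from the
    judged queries) on which engine A's mean score strictly beats engine B's."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(sample_size)]
        mean_a = sum(scores_a[i] for i in idx) / sample_size
        mean_b = sum(scores_b[i] for i in idx) / sample_size
        wins += mean_a > mean_b
    return wins / trials

# Synthetic example: engine A judged better on most query types.
a = [0.8, 0.7, 0.9, 0.6, 0.8] * 20
b = [0.5, 0.6, 0.4, 0.7, 0.5] * 20
conf = bootstrap_confidence(a, b, sample_size=50)
```

Repeating the comparison over many resampled query sets of a given size is what reveals how often a single significant-looking difference fails to generalize — the abstract's central caution.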
Wireless SOAP: optimizations for mobile wireless web services BIBAKFull-Text 1178-1179
  Naresh Apte; Keith Deutsch; Ravi Jain
We propose a set of optimization techniques, collectively called Wireless SOAP (WSOAP), to compress SOAP messages transmitted across a wireless link. The Name Space Equivalency technique rests on the observation that exact recovery of compressed messages is not required at the receiver; an equivalent form suffices. The WSDL Aware Encoding technique obtains further savings by utilizing knowledge of the underlying WSDL by means of an offline protocol we define. We summarize the design, implementation and performance of our Wireless SOAP prototype, and show that Wireless SOAP can reduce message sizes by 3x-12x compared to SOAP.
Keywords: SOAP, WSDL, applications, compression, networks, services, web services, wireless
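As a rough illustration of the WSDL Aware Encoding idea — both endpoints derive the same codebook from the shared WSDL, so short codes can replace element names on the wire. The control-character code format and the names below are invented for the sketch; the actual WSOAP encoding is defined by the paper's offline protocol:

```python
def wsdl_codebook(names):
    """Short codes for element names both endpoints know from the shared WSDL.
    Longest names first, so no name is replaced inside a longer one."""
    ordered = sorted(names, key=len, reverse=True)
    return {n: "\x02%d\x03" % i for i, n in enumerate(ordered)}

def compress(message, codebook):
    for name, code in codebook.items():
        message = message.replace(name, code)
    return message

def decompress(message, codebook):
    for name, code in codebook.items():
        message = message.replace(code, name)
    return message

cb = wsdl_codebook(["soap:Envelope", "soap:Body", "getQuote"])
msg = "<soap:Envelope><soap:Body><getQuote/></soap:Body></soap:Envelope>"
wire = compress(msg, cb)
assert decompress(wire, cb) == msg and len(wire) < len(msg)
```

Because the codebook is derived offline from the WSDL, the codes never travel with the message — which is where savings on the order reported above come from, together with the Name Space Equivalency relaxation.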
METEOR: metadata and instance extraction from object referral lists on the web BIBAKFull-Text 1180-1181
  Hasan Davulcu; Srinivas Vadrevu; Saravanakumar Nagarajan; Fatih Gelgi
The Web has established itself as the largest public data repository ever available. Even though the vast majority of information on the Web is formatted to be easily readable by the human eye, "meaningful information" remains largely inaccessible to computer applications. In this paper we present the METEOR system, which exploits presentation and linkage regularities in referral lists of various sorts to automatically separate and extract metadata and instance information. Experimental results for the university domain, with 12 computer science department Web sites comprising 361 individual faculty and course home pages, indicate that metadata and instance extraction average 85% and 88% F-measure, respectively. METEOR achieves this performance without any domain-specific engineering.
Keywords: extraction, instance, metadata, object, semantic, web
Merkle tree authentication of HTTP responses BIBAKFull-Text 1182-1183
  Roberto J. Bayardo; Jeffrey Sorensen
We propose extensions to existing web protocols that allow proofs of authenticity of HTTP server responses, whether or not the HTTP server is under the control of the publisher. These extensions protect users from content that may be substituted by malicious servers, and therefore have immediate applications in improving the security of web caching, mirroring, and relaying systems that rely on untrusted machines [2,4]. Our proposal relies on Merkle trees to support 200 and 404 response authentication while requiring only a single cryptographic hash of trusted data per repository. While existing web protocols such as HTTPS can provide authenticity guarantees (in addition to confidentiality), HTTPS consumes significantly more computational resources, and requires that the hosting server act without malice in generating responses and in protecting the publisher's private key.
Keywords: authenticity, merkle hash tree, web content distribution
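A minimal sketch of Merkle-tree response authentication: the client trusts only the root hash of the repository, and the (possibly untrusted) server ships each response with a logarithmic-size proof. The duplicate-last-node padding rule for odd levels is one common convention, not necessarily the paper's:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Root hash over the repository's responses (the single trusted value)."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # pad odd levels by duplication
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    """Sibling hashes from leaf `index` up to the root, with position flags."""
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append((level[index ^ 1], index % 2 == 0))
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(leaf, proof, root):
    """Client-side check: recompute the root from the response and its proof."""
    node = h(leaf)
    for sibling, node_is_left in proof:
        node = h(node + sibling) if node_is_left else h(sibling + node)
    return node == root
```

A malicious cache or mirror that substitutes a response cannot produce a proof that hashes to the trusted root, which is the substitution attack the extensions above defend against.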
Cyclone: an encyclopedic web search site BIBAKFull-Text 1184-1185
  Atsushi Fujii; Katunobu Itou; Tetsuya Ishikawa
We propose a Web search site called "Cyclone", in which a user can retrieve encyclopedic term descriptions on the Web. Cyclone searches the Web for headwords and page fragments describing the headwords. High-quality page fragments are selected as term descriptions and are classified into domains. The number of current headwords is over 700,000.
Keywords: encyclopedias, extraction, organization, web search
Automated synthesis of executable web service compositions from BPEL4WS processes BIBAKFull-Text 1186-1187
  M. Pistore; P. Traverso; P. Bertoli; A. Marconi
We propose a technique for the automated synthesis of new composite web services. Given a set of abstract BPEL4WS descriptions of component services, and a composition requirement, we automatically generate a concrete BPEL4WS process that, when executed, interacts with the components and satisfies the requirement.
   We implement the proposed approach exploiting efficient representation techniques, and we show its scalability over case studies taken from a real-world application and over a parameterized domain.
Keywords: automated synthesis, business processes, web service composition
Web log mining with adaptive support thresholds BIBAKFull-Text 1188-1189
  Jian-Chih Ou; Chang-Hung Lee; Ming-Syan Chen
With the rapid increase in Web activity, Web data mining has become an important research topic. However, most previous studies of mining path traversal patterns are based on a uniform support threshold, without taking into account factors such as the length of a pattern, the positions of Web pages, and the importance of a particular pattern. In view of this, we apply a Markov chain model to determine adaptive support thresholds for Web documents. Furthermore, by employing techniques devised for joining reference sequences, we propose a new procedure for mining Web traversal patterns.
Keywords: Markov model, path traversal pattern, web mining
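The abstract does not give the threshold formula. One plausible reading — per-page support thresholds scaled by the stationary visit probability of a first-order Markov chain estimated from traversal sessions — can be sketched as follows; the uniform-restart rule for exit pages and the linear scaling are assumptions of this sketch, not the paper's definitions:

```python
from collections import defaultdict

def page_support_thresholds(sessions, base_threshold=0.01, iters=100):
    """Per-page support thresholds scaled by the stationary probability of a
    first-order Markov chain over observed traversal sessions."""
    counts = defaultdict(lambda: defaultdict(int))
    pages = set()
    for session in sessions:
        pages.update(session)
        for a, b in zip(session, session[1:]):
            counts[a][b] += 1
    pages = sorted(pages)
    pi = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):                      # power iteration
        nxt = {p: 0.0 for p in pages}
        for a in pages:
            out = sum(counts[a].values())
            if out == 0:                        # exit page: restart uniformly
                for p in pages:
                    nxt[p] += pi[a] / len(pages)
            else:
                for b, c in counts[a].items():
                    nxt[b] += pi[a] * c / out
        pi = nxt
    # scale so an averagely-visited page keeps the base threshold
    return {p: base_threshold * pi[p] * len(pages) for p in pages}
```

The effect is the adaptivity the abstract argues for: heavily visited hub pages must clear a higher support bar than rarely visited leaf pages, so patterns through rare pages are not drowned out by a single uniform threshold.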
Focused crawling by exploiting anchor text using decision tree BIBAKFull-Text 1190-1191
  Jun Li; Kazutaka Furuse; Kazunori Yamaguchi
Focused crawlers are considered a promising way to tackle the scalability problem of topic-oriented and personalized search engines. In designing a focused crawler, the choice of strategy for prioritizing unvisited URLs is crucial. In this paper, we propose a method that applies a decision tree to the anchor texts of hyperlinks. We conducted experiments on real data sets from four Japanese universities and verified our approach.
Keywords: anchor text, decision tree learning, focused crawling, shortest path
One project, four schema languages: medley or melee? BIBAFull-Text 1192
  Makoto Murata
This talk first gives an overview of an XML project for e-Local Governments, which is under the auspices of MIAC (Ministry of Internal Affairs and Communications) of Japan. This talk then focuses on schema authoring and user interfaces. In particular, the use of four schema languages, namely RELAX NG, W3C XML Schema, DTD, and Schematron, is highlighted.