Proceedings of the 2012 International Conference on the World Wide Web

Fullname:Companion Proceedings of the 21st international conference on World Wide Web
Editors:Alain Mille; Fabien Gandon; Jacques Misselis; Michael Rabinovich; Steffen Staab
Location:Lyon, France
Dates:2012-Apr-16 to 2012-Apr-20
Standard No:ISBN: 978-1-4503-1230-1; hcibib: WWW12-2
Papers:166; 273
Pages:524; 1246
  1. WWW 2012-04-16 Volume 2
    1. Industry track presentations
    2. PhD Symposium
    3. European track presentations
    4. Demonstrations
    5. Poster presentations
    6. SWDM'12 workshop 1
    7. XperienceWeb'12 Workshop 2
    8. CQA'12 workshop 3
    9. EMAIL'12 workshop 4
    10. AdMIRe'12 workshop 6
    11. MultiAPro'12 workshop 7
    12. LSNA'12 workshop 8
    13. SWCS'12 workshop 9
    14. MSND'12 workshop 10

WWW 2012-04-16 Volume 2

Industry track presentations

Web-scale user modeling for targeting 3-12
  Mohamed Aly; Andrew Hatch; Vanja Josifovski; Vijay K. Narayanan
We present the experiences from building a web-scale user modeling platform for optimizing display advertising targeting at Yahoo!. The platform described in this paper allows for per-campaign maximization of conversions representing purchase activities or transactions. Conversions directly translate to advertiser's revenue, and thus provide the most relevant metrics of return on advertising investment. We focus on two major challenges: how to efficiently process histories of billions of users on a daily basis, and how to build per-campaign conversion models given the extremely low conversion rates (compared to click rates in a traditional setting). We first present mechanisms for building web-scale user profiles in a daily incremental fashion. Second, we show how to reduce the latency through in-memory processing of billions of user records. Finally, we discuss a technique for scaling the number of handled campaigns/models by introducing an efficient labeling technique that allows for sharing negative training examples across multiple campaigns.
Outage detection via real-time social stream analysis: leveraging the power of online complaints 13-22
  Eriq Augustine; Cailin Cushing; Alex Dekhtyar; Kevin McEntee; Kimberly Paterson; Matt Tognetti
Over the past couple of years, Netflix has significantly expanded its online streaming offerings, which now encompass multiple delivery platforms and thousands of titles available for instant view. This paper documents the design and development of an outage detection system for the online services provided by Netflix. Unlike other internal quality-control measures used at Netflix, this system uses only publicly available information: the tweets, or Twitter posts, that mention the word "Netflix," and has been developed and deployed externally, on servers independent of the Netflix infrastructure. This paper discusses the system and provides an assessment of the accuracy of its real-time detection and alert mechanisms.
Optimizing user exploring experience in emerging e-commerce products 23-32
  Xiubo Geng; Xin Fan; Jiang Bian; Xin Li; Zhaohui Zheng
E-commerce has emerged as a popular channel for Web users to conduct transactions over the Internet. In e-commerce services, users usually prefer to discover information via querying over category browsing, since the hierarchical structure supported by category browsing can provide them a more effective and efficient way to find their interested properties. However, in many emerging e-commerce services, well-defined hierarchical structures are not always available; moreover, in some other e-commerce services, the pre-defined hierarchical structures are too coarse and less intuitive to distinguish properties according to users' interests. This will lead to a very bad user experience. In this paper, to address these problems, we propose a hierarchical clustering method to build the query taxonomy based on users' exploration behavior automatically, and further propose an intuitive and light-weight approach to construct a browsing list for each cluster to help users discover interested items. The advantage of our approach is fourfold. First, we build a hierarchical taxonomy automatically, which saves tedious human effort. Second, we provide a fine-grained structure, which can help users reach their interested items efficiently. Third, our hierarchical structure is derived from users' interaction logs, and thus is intuitive to users. Fourth, given the hierarchical structures, for each cluster, we present both frequently clicked items and retrieved results of queries in the category, which provides more intuitive items to users. We evaluate our work by applying it to the exploration task of a real-world e-commerce service, i.e. an online shop for smart mobile phone apps. Experimental results show that our clustering algorithm is efficient and effective to assist users to discover their interested properties, and further comparisons illustrate that the hierarchical topic browsing performs much better than the existing category browsing approach (i.e. the Android Market mobile apps category) in terms of information exploration.
FoCUS: learning to crawl web forums 33-42
  Jingtian Jiang; Nenghai Yu; Chin-Yew Lin
In this paper, we present FoCUS (Forum Crawler Under Supervision), a supervised web-scale forum crawler. The goal of FoCUS is to only trawl relevant forum content from the web with minimal overhead. Forum threads contain information content that is the target of forum crawlers. Although forums have different layouts or styles and are powered by different forum software packages, they always have similar implicit navigation paths connected by specific URL types to lead users from entry pages to thread pages. Based on this observation, we reduce the web forum crawling problem to a URL type recognition problem and show how to learn accurate and effective regular expression patterns of implicit navigation paths from an automatically created training set using aggregated results from weak page type classifiers. Robust page type classifiers can be trained from as few as 5 annotated forums and applied to a large set of unseen forums. Our test results show that FoCUS achieved over 98% effectiveness and 97% coverage on a large set of test forums powered by over 150 different forum software packages.
Answering math queries with search engines 43-52
  Shahab Kamali; Johnson Apacible; Yasaman Hosseinkashi
Conventional search engines such as Bing and Google provide a user with a short answer to some queries as well as a ranked list of documents, in order to better meet her information needs. In this paper we study a class of such queries that we call math. Calculations (e.g. "12% of 24$", "square root of 120"), unit conversions (e.g. "convert 10 meter to feet"), and symbolic computations (e.g. "plot x^2+x+1") are examples of math queries. Among the queries that should be answered, math queries are special because of the infinite combinations of numbers and symbols, and rather few keywords that form them. Answering math queries must be done through real time computations rather than keyword searches or database look ups. The lack of a formal definition for the entire range of math queries makes it hard to automatically identify them all. We propose a novel approach for recognizing and classifying math queries using large scale search logs, and investigate its accuracy through empirical experiments and statistical analysis. It allows us to discover classes of math queries even if we do not know their structures in advance. It also helps to identify queries that are not math even though they might look like math queries.
   We also evaluate the usefulness of math answers based on the implicit feedback from users. Traditional approaches for evaluating the quality of search results mostly rely on the click information and interpret a click on a link as a sign of satisfaction. Answers to math queries do not contain links, therefore such metrics are not applicable to them. In this paper we describe two evaluation metrics that can be applied for math queries, and present the results on a large collection of math queries taken from Bing's search logs.
Hierarchical composable optimization of web pages 53-62
  Ronny Lempel; Ronen Barenboim; Edward Bortnikov; Nadav Golbandi; Amit Kagian; Liran Katzir; Hayim Makabee; Scott Roy; Oren Somekh
The process of creating modern Web media experiences is challenged by the need to adapt the content and presentation choices to dynamic real-time fluctuations of user interest across multiple audiences. We introduce FAME -- a Framework for Agile Media Experiences -- which addresses this scalability problem. FAME allows media creators to define abstract page models that are subsequently transformed into real experiences through algorithmic experimentation. FAME's page models are hierarchically composed of simple building blocks, mirroring the structure of most Web pages. They are resolved into concrete page instances by pluggable algorithms which optimize the pages for specific business goals. Our framework allows retrieving dynamic content from multiple sources, defining the experimentation's degrees of freedom, and constraining the algorithmic choices. It offers an effective separation of concerns in the media creation process, enabling multiple stakeholders with profoundly different skills to apply their crafts and perform their duties independently, composing and reusing each other's work in modular ways.
Delta-reasoner: a semantic web reasoner for an intelligent mobile platform 63-72
  Boris Motik; Ian Horrocks; Su Myeon Kim
To make mobile device applications more intelligent, one can combine the information obtained via device sensors with background knowledge in order to deduce the user's current context, and then use this context to adapt the application's behaviour to the user's needs. In this paper we describe Delta-Reasoner, a key component of the Intelligent Mobile Platform (IMP), which was designed to support context-aware applications running on mobile devices. Context-aware applications and the mobile platform impose unusual requirements on the reasoner, which we have met by incorporating advanced features such as incremental reasoning and continuous query evaluation into our reasoner. Although we have so far been able to conduct only a very preliminary performance evaluation, our results are very encouraging: our reasoner exhibits sub-second response time on ontologies whose size significantly exceeds the size of the ontologies used in the IMP.
Rewriting null e-commerce queries to recommend products 73-82
  Gyanit Singh; Nish Parikh; Neel Sundaresan
In e-commerce applications product descriptions are often concise. E-Commerce search engines often have to deal with queries that cannot be easily matched to product inventory resulting in zero recall or null query situations. Null queries arise from differences in buyer and seller vocabulary or from the transient nature of products. In this paper, we describe a system that rewrites null e-commerce queries to find matching products as close to the original query as possible. The system uses query relaxation to rewrite null queries in order to match products. Using eBay as an example of a dynamic marketplace, we show how using temporal feedback that respects product category structure using the repository of expired products, we improve the quality of recommended results. The system is scalable and can be run in a high volume setting. We show through our experiments that high quality product recommendations for more than 25% of null queries are achievable.
Towards expressive exploratory search over entity-relationship data 83-92
  Sivan Yogev; Haggai Roitman; David Carmel; Naama Zwerdling
In this paper we describe a novel approach for exploratory search over rich entity-relationship data that utilizes a unique combination of an expressive, yet intuitive, query language, faceted search, and graph navigation. We describe an extended faceted search solution which allows indexing, searching, and browsing rich entity-relationship data. We report experimental results of an evaluation study, using a benchmark of several entity-relationship datasets, demonstrating that our exploratory approach is both effective and efficient compared to other existing approaches.
Data extraction from web pages based on structural-semantic entropy 93-102
  Xiaoqing Zheng; Yiling Gu; Yinsheng Li
Most of today's web content is designed for human consumption, which makes it difficult for software tools to access it readily. Even web content that is automatically generated from back-end databases is usually presented without the original structural information. In this paper, we present an automated information extraction algorithm that can extract the relevant attribute-value pairs from product descriptions across different sites. A notion, called structural-semantic entropy, is used to locate the data of interest on web pages, which measures the density of occurrence of relevant information on the DOM tree representation of web pages. Our approach is less labor-intensive and insensitive to changes in web-page format. Experimental results on a large number of real-life web page collections are encouraging and confirm the feasibility of the approach, which has been successfully applied to detect false drug advertisements on the web due to its capacity in associating the attributes of records with their respective values.
Clustering and load balancing optimization for redundant content removal 103-112
  Shanzhong Zhu; Alexandra Potapova; Maha Alabduljalil; Xin Liu; Tao Yang
Removing redundant content is an important data processing operation in search engines and other web applications. An offline approach can be important for reducing the engine's cost, but it is challenging to scale such an approach for a large data set which is updated continuously. This paper discusses our experience in developing a scalable approach with parallel clustering that detects and removes near duplicates incrementally when processing billions of web pages. It presents a multidimensional mapping to balance the load among multiple machines. It further describes several approximation techniques to efficiently manage distributed duplicate groups with transitive relationship. The experimental results evaluate the efficiency and accuracy of the incremental clustering, assess the effectiveness of the multidimensional mapping, and demonstrate the impact on online cost reduction and search quality.

PhD Symposium

From linked data to linked entities: a migration path 115-120
  Giovanni Bartolomeo; Stefano Salsano
Entities have received special attention in recent years; however, their identification is still troublesome. Existing approaches exploit ad hoc services or centralized architectures. In this paper we present a novel approach to recognize naturally emerging entity identifiers built on top of Linked Data concepts and protocols.
Cyberbullying detection: a step toward a safer internet yard 121-126
  Maral Dadvar; Franciska de Jong
As a result of the invention of social networks, friendships, relationships and social communications have all gone to a new level with new definitions. One may have hundreds of friends without even seeing their faces. Meanwhile, alongside this transition there is increasing evidence that online social applications have been used by children and adolescents for bullying. State-of-the-art studies in cyberbullying detection have mainly focused on the content of the conversations while largely ignoring the users involved in cyberbullying. We propose that incorporation of the users' information, their characteristics, and post-harassing behaviour, for instance, posting a new status in another social network as a reaction to their bullying experience, will improve the accuracy of cyberbullying detection. Cross-system analyses of the users' behaviour -- monitoring their reactions in different online environments -- can facilitate this process and provide information that could lead to more accurate detection of cyberbullying.
Intelligent crawling of web applications for web archiving 127-132
  Muhammad Faheem
The steady growth of the World Wide Web raises challenges regarding the preservation of meaningful Web data. Tools used currently by Web archivists blindly crawl and store Web pages found while crawling, disregarding the kind of Web site currently accessed (which leads to suboptimal crawling strategies) and whatever structured content is contained in Web pages (which results in page-level archives whose content is hard to exploit). We focus in this PhD work on the crawling and archiving of publicly accessible Web applications, especially those of the social Web. A Web application is any application that uses Web standards such as HTML and HTTP to publish information on the Web, accessible by Web browsers. Examples include Web forums, social networks, geolocation services, etc. We claim that the best strategy to crawl these applications is to make the Web crawler aware of the kind of application currently processed, allowing it to refine the list of URLs to process, and to annotate the archive with information about the structure of crawled content. We add adaptive characteristics to an archival Web crawler: being able to identify when a Web page belongs to a given Web application and applying the appropriate crawling and content extraction methodology.
Binary RDF for scalable publishing, exchanging and consumption in the web of data 133-138
  Javier D. Fernández
The Web of Data is increasingly producing large RDF datasets from diverse fields of knowledge, pushing the Web to a data-to-data cloud. However, traditional RDF representations were inspired by a document-centric view, which results in verbose/redundant data, costly to exchange and post-process. This article discusses an ongoing doctoral thesis addressing efficient formats for publication, exchange and consumption of RDF on a large scale. First, a binary serialization format for RDF, called HDT, is proposed. Then, we focus on compressed rich-functional structures which take part of efficient HDT representation as well as most applications performing on huge RDF datasets.
User-generated metadata in audio-visual collections 139-144
  Riste Gligorov
In recent years, crowdsourcing has gained attention as an alternative method for collecting video annotations. An example is the internet video labeling game Waisda? launched by the Netherlands Institute for Sound and Vision. The goal of this PhD research is to investigate the value of the user tags collected with this video labeling game. To this end, we address the following four issues. First, we perform a comparative analysis between user-generated tags and professional annotations in terms of what aspects of videos they describe. Second, we measure how well user tags are suited for fragment retrieval and compare it with fragment search based on other sources like transcripts and professional annotations. Third, as previous research suggested that user tags predominately refer to objects and rarely describe scenes, we will study whether user tags can be successfully exploited to generate scene-level descriptions. Finally, we investigate how tag quality can be characterized and potential methods to improve it.
Scalable search platform: improving pipelined query processing for distributed full-text retrieval 145-150
  Simon Jonassen
In theory, term-wise partitioned indexes may provide higher throughput than document-wise partitioned ones. In practice, term-wise partitioning suffers from poor scalability with increasing collection size and a lack of intra-query parallelism, which leads to long query latency and poor performance at low query loads. In our work, we have developed several techniques to deal with these problems. Our current results show a significant improvement over the state-of-the-art approach on a small distributed IR system, and our next objective is to evaluate the scalability of the improved approach on a large system. In this paper, we describe the relation between our work and the problem of scalability, summarize the results, limitations and challenges of our current work, and outline directions for further research.
Building reputation and trust using federated search and opinion mining 151-154
  Somayeh Khatiban
The term online reputation addresses trust relationships amongst agents in dynamic open systems. These can appear as ratings, recommendations, referrals and feedback. Several reputation models and rating aggregation algorithms have been proposed. However, finding a trusted entity on the web is still an issue as all reputation systems work individually. The aim of this project is to introduce a global reputation system that aggregates people's opinions from different resources (e.g. e-commerce websites and reviews) with the help of federated search techniques. A sentiment analysis approach is subsequently used to extract high quality opinions and inform how to increase trust in the search result.
A semantic policy sharing and adaptation infrastructure for pervasive communities 155-160
  Vikash Kumar
Rule based information processing has traditionally been vital in many aspects of business, process manufacturing and information science. The need for rules gets even more magnified when limitations of ontology development in OWL are taken into account. In conjunction, the potent combination of ontology and rule based applications could be the future of information processing and knowledge representation on the web. However, semantic rules tend to be very dependent on multitudes of parameters and context data, making them less flexible for use in applications where users could benefit from each other by socially sharing intelligence in the form of policies. This work aims to address this issue arising in rule based semantic applications in the use cases of smart home communities and a privacy-aware m-commerce setting for mobile users. In this paper, we propose a semantic policy sharing and adaptation infrastructure that enables a semantic rule created in one set of environmental, physical and contextual settings to be adapted for use in a situation when those settings/parameters/context variables change. The focus will mainly be on behavioural policies in the smart home use case and privacy enforcing and data filtering policies in the m-commerce scenario. Finally, we look into the possibility of making this solution application independent so that the benefits of such a policy adaptation infrastructure could be exploited in other application settings as well.
A generic graph-based multidimensional recommendation framework and its implementations 161-166
  Sangkeun Lee
As the volume of information on the Web is explosively growing, recommender systems have become essential tools for helping users to find what they need or prefer. Most existing systems are two-dimensional in that they only exploit User and Item dimensions and perform a typical form of recommendation 'Recommending Item to User'. Yet, in many applications, the capabilities of dealing with multidimensional information and of adapting to various forms of recommendation requests are very important. In this paper, we take a graph-based approach to accomplishing such requirements in recommender systems and present a generic graph-based multidimensional recommendation framework. Based on the framework, we propose two homogeneous graph-based and one heterogeneous graph-based multidimensional recommendation methods. We expect our approach will be useful for increasing recommendation performance and enabling flexibility of recommender systems so that they can incorporate various user intentions into their recommendation process. We present the research results we have reached so far and discuss remaining challenges and future work.
Semi-automatic semantic moderation of web annotations 167-172
  Elaheh Momeni
Many social media portals are featuring annotation functionality in order to integrate the end users' knowledge with existing digital curation processes. This facilitates extending existing metadata about digital resources. However, due to various levels of annotators' expertise, the quality of annotations can vary from excellent to vague. The evaluation and moderation of annotations (be they troll, vague, or helpful) have not been sufficiently analyzed automatically. Available approaches mostly attempt to solve the problem by using distributed moderation systems, which are influenced by factors affecting accuracy (such as imbalanced voting). Despite this, we hypothesize that analyzing and exploiting both content and context dimensions of annotations may assist the automatic moderation process. In this research, we focus on leveraging the context and content features of social web annotations for semi-automatic semantic moderation. This paper describes the vision of our research, proposes an approach for semi-automatic semantic moderation, introduces an ongoing effort from which we collect data that can serve as a basis for evaluating our assumption, and reports on lessons learned so far.
Modeling the flow and change of information on the web 173-178
  Nataliia Pobiedina
The proposed PhD work approaches the problem of information flow and change on the Web. To model temporal dynamics both of the Web structure and its content, the author proposes to apply the framework of stochastic graph transformation systems. This framework is currently widely used in software engineering and model checking. A quantitative and qualitative evaluation of the framework will be performed during a case study of the short-term temporal behavior of economics news on selected English news websites and blogs over a selected time period.
Context-aware image semantic extraction in the social web 179-184
  Massimiliano Ruocco
Media sharing applications such as Panoramio and Flickr contain a huge amount of pictures that need to be organized to facilitate browsing and retrieval. Such pictures are often surrounded by a set of metadata or image tags, constituting the image context. With the advent of the paradigm of Web 2.0, especially in the past five years, the concept of image context has further evolved, allowing users to tag their own and other people's pictures. Focusing on tagging, we distinguish between static and dynamic features. The set of static features includes textual and visual features, as well as the contextual information. Further, we may identify other features belonging to the social context as a result of the usage within the media sharing applications. Due to their dynamic nature, we call these the dynamic set of features. In this work, we assume that every uploaded media item contains both static and dynamic features. In addition, a user may be linked with other users with whom he/she shares common interests. This has resulted in a new series of challenges within the research field of semantic understanding. One of the main goals of this work is to address these challenges.
Augmenting the web with accountability 185-190
  Oshani Wasana Seneviratne
Given the ubiquity of data on the web, and the lack of usage restriction enforcement mechanisms, stories of personal, creative and other kinds of data misuses are on the rise. There should be both sociological and technological mechanisms that facilitate accountability on the web that would prevent such data misuses. Sociological mechanisms appeal to the data consumer's self-interest in adhering to the data provider's desires. This involves a system of rewards such as recognition and financial incentives, and deterrents such as prohibitions by laws for any violations and social pressure. But there is no well-defined technological mechanism for the discovery of accountability or the lack of it on the web. As part of my PhD thesis I propose a solution to this problem by designing a web protocol called HTTPA (Accountable HTTP). This protocol will enable data consumers and data producers to agree to specific usage restrictions, preserve the provenance of data transferred from a web server to a client and back to another web server, and more importantly provide a mechanism to derive an 'audit trail' for the data reuse with the help of a trusted intermediary called a 'Provenance Tracker Network'.
AMBER: turning annotations into knowledge 191-196
  Cheng Wang
Web extraction is the task of turning unstructured HTML into knowledge. Computers are able to generate annotations of unstructured HTML, but it is more important to turn those annotations into structured knowledge. Unfortunately, the current systems extracting knowledge from result pages lack accuracy.
   In this proposal, we present AMBER, a fully automated system for turning annotations into structured knowledge from any result page of a given domain. AMBER observes basic domain attributes on a page and leverages repeated occurrences of similar attributes to group related attributes into records. This contrasts with previous approaches that analyze the repeated structure only of the HTML, as no domain knowledge is available. Our multi-domain experimental evaluation on hundreds of sites demonstrates that AMBER achieves accuracy (>98%) comparable to that of a skilled human annotator.
Chinese news event 5W1H semantic elements extraction for event ontology population 197-202
  Wei Wang
To relieve "News Information Overload", in this paper, we propose a novel approach to 5W1H (who, what, whom, when, where, how) event semantic elements extraction for Chinese news event knowledge base construction. The approach comprises a key event identification step, an event semantic elements extraction step and an event ontology population step. We first use a machine learning method to identify the key events from Chinese news stories. Then we extract event 5W1H elements by combining SRL, NER techniques, and a rule-based method. Finally, we populate the extracted facts of news events into NOEM, an event ontology designed specifically for modeling semantic elements and relations of events. Our experiments on real online news data sets show the reasonableness and feasibility of our approach.
Semi-structured semantic overlay for information retrieval in self-organizing networks 203-208
  Yulian Yang
As scalability and flexibility have become the critical concerns in information management systems, self-organizing networks attract attention from both research and industrial communities. This work proposes a semi-structured semantic overlay for information retrieval in large-scale self-organizing networks. With autonomy over their own resources, the nodes are organized into a semantic overlay hosting topically discriminative communities. For information retrieval within a community, an unstructured routing approach is employed for the sake of flexibility; while for joining new nodes and routing queries to a distant community, a structured mechanism is designed to reduce traffic and time costs. Different from the semantic overlay in the literature, our proposal has three contributions: 1. we design topic-based indexing to form and maintain the semantic overlay, to guarantee both scalability and efficiency; 2. we introduce an unstructured routing approach within the community, to allow flexible node joining and leaving; 3. we take advantage of the interaction among nodes to capture the overlay changes and make corresponding adaptations in topic-based indexing.

European track presentations

The ERC webdam on foundations of web data management BIBAFull-Text 211-214
  Serge Abiteboul; Pierre Senellart; Victor Vianu
The Webdam ERC grant is a five-year project that started in December 2008. The goal is to develop a formal model for Web data management that would open new horizons for the development of the Web in a well-principled way, enhancing its functionality, performance, and reliability. Specifically, the goal is to develop a universally accepted formal framework for describing complex and flexible interacting Web applications featuring notably data exchange, sharing, integration, querying, and updating. We also propose to develop formal foundations that will enable peers to concurrently reason about global data management activities, cooperate in solving specific tasks, and support services with desired quality of service. Although the proposal addresses fundamental issues, its goal is to serve as the basis for future software development for Web data management.
WAI-ACT: web accessibility now BIBAFull-Text 215-218
  Shadi Abou-Zahra
The W3C web accessibility standards have now existed for over a decade yet implementation of accessible websites, software, and web technologies is lagging behind this development. This fact is largely due to lack of knowledge and expertise among developers and due to fragmentation of web accessibility approaches. It is an opportune time to develop authoritative practical guidance and harmonized approaches, and to research potential challenges and opportunities in future technologies in a collaborative setting. The EC-funded WAI-ACT project addresses these needs through use of an open cooperation framework that builds on and extends the existing mechanisms of the W3C Web Accessibility Initiative (WAI). This paper presents the WAI-ACT project and how it will drive accessibility implementation in advanced web technologies.
GLOCAL: event-based retrieval of networked media BIBAFull-Text 219-222
  Pierre Andrews; Francesco De Natale; Sven Buschbeck; Anthony Jameson; Kerstin Bischoff; Claudiu S. Firan; Claudia Niederée; Vasileios Mezaris; Spiros Nikolopoulos; Vanessa Murdock; Adam Rae
The idea of the European project GLOCAL is to use events as the central concept for search, organization and combination of multimedia content from various sources. For this purpose methods for event detection and event matching as well as media analysis are developed. Considered events range from private, over local, to global events.
Combining social web and BPM for improving enterprise performances: the BPM4People approach to social BPM BIBAFull-Text 223-226
  Marco Brambilla; Piero Fraternali; Carmen Karina Vaca Ruiz
Social BPM fuses business process management practices with social networking applications, with the aim of enhancing enterprise performance by means of a controlled participation of external stakeholders in process design and enactment. This project-centered demonstration paper proposes a model-driven approach to participatory and social enactment of business processes. The approach consists of defining a specific notation for describing Social BPM behaviors (defined as a BPMN 2.0 extension), a methodology, and a technical framework that allows enterprises to implement social processes as Web applications integrated with public or private Web social networks. The presented work is performed within the BPM4People SME Capacities project.
Entity oriented search and exploration for cultural heritage collections: the EU cultura project BIBAFull-Text 227-230
  David Carmel; Naama Zwerdling; Sivan Yogev
In this paper we describe an entity oriented search and exploration system that we are developing for the EU Cultura project.
The patents retrieval prototype in the MOLTO project BIBAFull-Text 231-234
  Milen Chechev; Meritxell Gonzàlez; Lluís Màrquez; Cristina España-Bonet
This paper describes the patents retrieval prototype developed within the MOLTO project. The prototype aims to provide a multilingual natural language interface for querying the content of patent documents. The developed system is focused on the biomedical and pharmaceutical domain and includes the translation of the patent claims and abstracts into English, French and German. Aiming at the best retrieval results of the patent information and text content, patent documents are preprocessed and semantically annotated. Then, the annotations are stored and indexed in an OWLIM semantic repository, which contains a patent specific ontology and others from different domains. The prototype, accessible online at http://molto-patents.ontotext.com, presents a multilingual natural language interface to query the retrieval system. In MOLTO, the multilingualism of the queries is addressed by means of the GF Tool, which provides an easy way to build and maintain controlled language grammars for interlingual translation in limited domains. The abstract representation obtained from the GF is used to retrieve both the matched RDF instances and the list of patents semantically related to the user's search criteria. The online interface allows users to browse the retrieved patents and shows in the text the semantic annotations that explain why any particular patent has matched the user's criteria.
End-user-oriented telco mashups: the OMELETTE approach BIBAFull-Text 235-238
  Olexiy Chudnovskyy; Tobias Nestler; Martin Gaedke; Florian Daniel; José Ignacio Fernández-Villamor; Vadim Chepegin; José Angel Fornas; Scott Wilson; Christoph Kögler; Heng Chang
With the success of Web 2.0 we are witnessing a growing number of services and APIs exposed by Telecom, IT and content providers. Targeting the Web community and, in particular, Web application developers, service providers expose capabilities of their infrastructures and applications in order to open new markets and to reach new customer groups. However, due to the complexity of the underlying technologies, the last step, i.e., the consumption and integration of the offered services, is a non-trivial and time-consuming task that is still a prerogative of expert developers. Although many approaches to lower the entry barriers for end users exist, little success has been achieved so far. In this paper, we introduce the OMELETTE project and show how it addresses end-user-oriented telco mashup development. We present the goals of the project, describe its contributions, summarize current results, and describe current and future work.
Multilingual online generation from semantic web ontologies BIBAFull-Text 239-242
  Dana Dannélls; Mariana Damova; Ramona Enache; Milen Chechev
In this paper we report on our ongoing work in the EU project Multilingual Online Translation (MOLTO), supported by the European Union Seventh Framework Programme under grant agreement FP7-ICT-247914. More specifically, we present work from workpackage 8 (WP8): Case Study: Cultural Heritage. The objective of the work is to build an ontology-based multilingual application for museum information on the Web. Our approach relies on the innovative idea of Reason-able View of the Web of linked data applied to the domain of cultural heritage. We have been developing a Web application that uses Semantic Web ontologies for generating coherent multilingual natural language descriptions about museum objects. We have been experimenting with museum data to test our approach and find that it performs well for the examined languages.
Making use of social media data in public health BIBAFull-Text 243-246
  Kerstin Denecke; Peter Dolog; Pavel Smrz
Disease surveillance systems exist to offer an easily accessible "epidemiological snapshot" on up-to-date summary statistics for numerous infectious diseases. However, these indicator-based systems represent only part of the solution. Experience shows that they fail when confronted with newly emerging agents, such as those that caused the lung disease SARS in 2002. Further, due to slow reporting mechanisms, the time until health threats become visible to public health officials can be long. The M-Eco project provides an event-based approach to the early detection of emerging health threats. The developed technologies exploit content from social media and multimedia data as input and analyze it with sophisticated event-detection techniques to identify potential threats. Alerts for public health threats are provided to the user in a personalized way.
SocialSensor: sensing user generated input for improved media discovery and experience BIBAFull-Text 247-250
  Sotiris Diplaris; Symeon Papadopoulos; Ioannis Kompatsiaris; Ayse Goker; Andrew Macfarlane; Jochen Spangenberg; Hakim Hacid; Linas Maknavicius; Matthias Klusch
SocialSensor will develop a new framework for enabling real-time multimedia indexing and search in the Social Web. The project moves beyond conventional text-based indexing and retrieval models by mining and aggregating user inputs and content over multiple social networking sites. Social Indexing will incorporate information about the structure and activity of the user's social network directly into the multimedia analysis and search process. Furthermore, it will enhance the multimedia consumption experience by developing novel user-centric media visualization and browsing paradigms. For example, SocialSensor will analyse the dynamic and massive user contributions in order to extract unbiased trending topics and events and will use social connections for improved recommendations. To achieve its objectives, SocialSensor introduces the concept of Dynamic Social COntainers (DySCOs), a new layer of online multimedia content organisation with particular emphasis on the real-time, social and contextual nature of content and information consumption. Through the proposed DySCOs-centered media search, SocialSensor will integrate social content mining, search and intelligent presentation in a personalized, context and network-aware way, based on aggregation and indexing of both UGC and multimedia Web content.
The multilingual web: report on multilingualweb initiative BIBAFull-Text 251-254
  David Filip; Dave Lewis; Felix Sasaki
We report on the MultilingualWeb initiative, a collaboration between the W3C Internationalization Activity and the European Commission, realized as a series of EC-funded projects. We review the outcomes of "MultilingualWeb", which conducted 4 workshops analyzing "gaps" within Web standardization that currently hinder multilinguality. Gap analysis led to the formation of "MultilingualWeb-LT" -- a project and W3C Working Group with cross-industry representation that will address priority issues via standardization of interoperability metadata.
Mobile web applications: bringing mobile apps and web together BIBAFull-Text 255-258
  Marie-Claire Forgue; Dominique Hazaël-Massieux
The popularity of mobile applications is very high and still growing rapidly. These applications allow their users to stay connected with a large number of service providers in seamless fashion, both for leisure and productivity. But service providers suffer from the high fragmentation of mobile development platforms that forces them to develop, maintain and deploy their applications in a large number of versions and formats. The Mobile Web Applications (MobiWebApp [1]) EU project aims to build on Europe's strength in mobile technologies to enable European research and industry to strengthen its position in Web technologies to be active and visible on the mobile applications market.
The CUBRIK project: human-enhanced time-aware multimedia search BIBAFull-Text 259-262
  Piero Fraternali; Marco Tagliasacchi; Davide Martinenghi; Alessandro Bozzon; Ilio Catallo; Eleonora Ciceri; Francesco Nucci; Vincenzo Croce; Ismail Sengor Altingovde; Wolf Siberski; Fausto Giunchiglia; Wolfgang Nejdl; Martha Larson; Ebroul Izquierdo; Petros Daras; Otto Chrons; Ralph Traphoener; Bjoern Decker; John Lomas; Patrick Aichroth; Jasminko Novak; Ghislain Sillaume; F. Sanchez Figueroa; Carolina Salas-Parra
The Cubrik Project is an Integrated Project of the 7th Framework Programme that aims at contributing to the multimedia search domain by opening the architecture of multimedia search engines to the integration of open source and third party content annotation and query processing components, and by exploiting the contribution of humans and communities in all the phases of multimedia search, from content processing to query processing and relevance feedback processing. The CUBRIK presentation will showcase the architectural concept and scientific background of the project and demonstrate an initial scenario of human-enhanced content and query processing pipeline.
The webinos project BIBAFull-Text 263-266
  Christian Fuhrhop; John Lyle; Shamal Faily
This poster paper describes the webinos project and presents the architecture and security features developed in webinos. It highlights the main objectives and concepts of the project and describes the architecture derived to achieve the objectives.
DIADEM: domain-centric, intelligent, automated data extraction methodology BIBAFull-Text 267-270
  Tim Furche; Georg Gottlob; Giovanni Grasso; Omer Gunes; Xiaoanan Guo; Andrey Kravchenko; Giorgio Orsi; Christian Schallhart; Andrew Sellers; Cheng Wang
Search engines are the sinews of the web. These sinews have become strained, however: Where the web's function once was a mix of library and yellow pages, it has become the central marketplace for information of almost any kind. We search more and more for objects with specific characteristics, a car with a certain mileage, an affordable apartment close to a good school, or the latest accessory for our phones. Search engines all too often fail to provide reasonable answers, making us sift through dozens of websites with thousands of offers -- never to be sure a better offer isn't just around the corner. What search engines are missing is understanding of the objects and their attributes published on websites.
   Automatically identifying and extracting these objects is akin to alchemy: transforming unstructured web information into highly structured data with near perfect accuracy. With DIADEM we present a formula for this transformation, but at a price: DIADEM identifies and extracts data from a website with high accuracy. The price is that for this task we need to provide DIADEM with extensive knowledge about the ontology and phenomenology of the domain, i.e., about entities (and relations) and about the representation of these entities in the textual, structural, and visual language of a website of this domain. In this demonstration, we demonstrate with a first prototype of DIADEM that, in contrast to alchemists, DIADEM has developed a viable formula.
Social media meta-API: leveraging the content of social networks BIBAFull-Text 271-274
  George Papadakis; Konstantinos Tserpes; Emmanuel Sardis; Magdalini Kardara; Athanasios Papaoikonomou; Fotis Aisopos
Social Network (SN) environments are the ideal future service marketplaces. It is well known and documented that SN users are increasing at a tremendous pace. Taking advantage of these social dynamics, as well as the vast volumes of amateur content generated every second, is a major step towards creating a potentially huge market of services. In this paper, we describe the external web services that the SocIoS project is researching and developing, and will support with the Social Media community. Aiming to support the end users of SNs, to enhance their transactions with more automated ways, and to improve production and performance in their workflows over SN inputs and content, this work presents the main architecture, functionality, and benefits per external service. Finally, it introduces the end user to the new era of SNs with business applicability and better social transactions over SN content.
ARCOMEM: from collect-all ARchives to COmmunity MEMories BIBAFull-Text 275-278
  Thomas Risse; Wim Peters
The ARCOMEM project is about memory institutions like archives, museums and libraries in the age of the Social Web. Social media are becoming more and more pervasive in all areas of life. ARCOMEM's aim is to help to transform archives into collective memories that are more tightly integrated with their community of users and to exploit Web 2.0 and the wisdom of crowds to make Web archiving a more selective and meaning-based process. ARCOMEM (FP7-IST-270239) is an Integrating Project in the FP7 program of the European Commission, which involves twelve partners from academia, industry and public sector. The project will run from January 1, 2011 to December 31, 2013.
Plan4All GeoPortal: web of spatial data BIBAFull-Text 279-282
  Evangelos Sakkopoulos; Tomas Mildorf; Karel Charvat; Inga Berzina; Kai-Uwe Krause
The Plan4All project contributes to the harmonization of spatial data and related metadata in order to make them available through the Web across a linked data platform. A prototype of a Web search European spatial data portal is already available at http://www.plan4all.eu. The key aim is to provide a methodology and present best practices towards the standardization of spatial data according to the INSPIRE principles and provide results that would be a reference material for linking data and data specification from the spatial planning point of view. The results include methodology and implementation of multilingual search for data and common portrayal rules for content providers. These are critical services for sharing and understanding spatial data across Europe. The Plan4All paradigm shows that a clear applicable methodology for harmonization of spatial data on all different topics of interest can be achieved efficiently. Plan4All shows that it is possible to build Pan European Web access, to link spatial data and to utilize multilingual metadata providing a roadmap for linked spatial data across and hopefully beyond Europe. The proposed demonstration based on Plan4All experience aims to show experience, best practices and methods to achieve data harmonization and provision of linked spatial data on the Web.
Multimedia search over integrated social and sensor networks BIBAFull-Text 283-286
  John Soldatos; Moez Draief; Craig Macdonald; Iadh Ounis
This paper presents work in progress within the FP7 EU-funded project SMART to develop a multimedia search engine over content and information stemming from the physical world, as derived through visual, acoustic and other sensors. Among the unique features of the search engine is its ability to respond to social queries, through integrating social networks with sensor networks. Motivated by this innovation, the paper presents and discusses the state-of-the-art in participatory sensing and other technologies blending social and sensor networks.
Tracking entities in web archives: the LAWA project BIBAFull-Text 287-290
  Marc Spaniol; Gerhard Weikum
Web-preservation organizations like the Internet Archive not only capture the history of born-digital content but also reflect the zeitgeist of different time periods over more than a decade. This longitudinal data is a potential gold mine for researchers like sociologists, politologists, media and market analysts, or experts on intellectual property. The LAWA project (Longitudinal Analytics of Web Archive data) is developing an Internet-based experimental testbed for large-scale data analytics on Web archive collections. Its emphasis is on scalable methods for this specific kind of big-data analytics, and software tools for aggregating, querying, mining, and analyzing Web contents over long epochs. In this paper, we highlight our research on entity-level analytics in Web archive data, which lifts Web analytics from plain text to the entity-level by detecting named entities, resolving ambiguous names, extracting temporal facts and visualizing entities over extended time periods. Our results provide key assets for tracking named entities in the evolving Web, news, and social media.
I-SEARCH: a multimodal search engine based on rich unified content description (RUCoD) BIBAFull-Text 291-294
  Thomas Steiner; Lorenzo Sutton; Sabine Spiller; Marilena Lazzaro; Francesco Nucci; Vincenzo Croce; Alberto Massari; Antonio Camurri; Anne Verroust-Blondet; Laurent Joyeux; Jonas Etzold; Paul Grimm; Athanasios Mademlis; Sotiris Malassiotis; Petros Daras; Apostolos Axenopoulos; Dimitrios Tzovaras
In this paper, we report on work around the I-SEARCH EU (FP7 ICT STREP) project whose objective is the development of a multimodal search engine. We present the project's objectives, and detail the achieved results, amongst which a Rich Unified Content Description format.
Enabling users to create their own web-based machine translation engine BIBAFull-Text 295-298
  Andrejs Vasiljevs; Raivis Skadins; Indra Samite
This paper presents European Union co-funded projects to advance the development and use of machine translation (MT) that will benefit from the possibilities provided by the Web. Current mass-market and online MT systems are of a general nature and perform poorly for smaller languages and domain specific texts. The ICT-PSP Programme project LetsMT! develops a user-driven machine translation "factory in the cloud" enabling web users to get customized MT that better fits their needs. Harnessing the huge potential of the web together with open statistical machine translation (SMT) technologies LetsMT! has created an innovative online collaborative platform for data sharing and building MT. Users can upload their parallel corpora to an online repository and generate user-tailored SMT systems based on user selected data. FP7 Programme project ACCURAT researches new methods for accumulating more data from the Web to improve the quality of data-driven machine translation systems. ACCURAT has created techniques and tools to use comparable corpora such as news feeds and multinational web pages. Although the majority of these texts are not direct translations, they share a lot of common paragraphs, sentences, phrases, terms and named entities in different languages which are useful for machine translation.
Semantic evaluation at large scale (SEALS) BIBAFull-Text 299-302
  Stuart N. Wrigley; Raúl García-Castro; Lyndon Nixon
This paper describes the main goals and outcomes of the EU-funded Framework 7 project entitled Semantic Evaluation at Large Scale (SEALS). The growth and success of the Semantic Web is built upon a wide range of Semantic technologies from ontology engineering tools through to semantic web service discovery and semantic search. The evaluation of such technologies -- and, indeed, assessments of their mutual compatibility -- is critical for their sustained improvement and adoption. The SEALS project is creating an open and sustainable platform on which all aspects of an evaluation can be hosted and executed and has been designed to accommodate most technology types. It is envisaged that the platform will become the de facto repository of test datasets and will allow anyone to organise, execute and store the results of technology evaluations free of charge and without corporate bias. The demonstration will show how individual tools can be prepared for evaluation, uploaded to the platform, evaluated according to some criteria and the subsequent results viewed. In addition, the demonstration will show the flexibility and power of the SEALS Platform for evaluation organisers by highlighting some of the key technologies used.

Demonstrations

Twitcident: fighting fire with information from social web streams BIBAFull-Text 305-308
  Fabian Abel; Claudia Hauff; Geert-Jan Houben; Richard Stronkman; Ke Tao
In this paper, we present Twitcident, a framework and Web-based system for filtering, searching and analyzing information about real-world incidents or crises. Twitcident connects to emergency broadcasting services and automatically starts tracking and filtering information from Social Web streams (Twitter) when a new incident occurs. It enriches the semantics of streamed Twitter messages to profile incidents and to continuously improve and adapt the information filtering to the current temporal context. Faceted search and analytical tools allow users to retrieve particular information fragments and overview and analyze the current situation as reported on the Social Web. Demo: http://wis.ewi.tudelft.nl/twitcident/
SWiPE: searching wikipedia by example BIBAFull-Text 309-312
  Maurizio Atzori; Carlo Zaniolo
A novel method is demonstrated that allows semantic and well-structured knowledge bases (such as DBpedia) to be easily queried directly from Wikipedia's pages. Using Swipe, naive users with no knowledge of RDF triples and SPARQL can easily query DBpedia with powerful questions such as: "Who are the U.S. presidents who took office when they were 55-year old or younger, during the last 60 years", or "Find the town in California with less than 10 thousand people". This is accomplished by a novel Search by Example (SBE) approach where a user can enter the query conditions directly on the Infobox of a Wikipedia page. In fact, Swipe activates various fields of Wikipedia to allow users to enter query conditions, and then uses these conditions to generate equivalent SPARQL queries and execute them on DBpedia. Finally, Swipe returns the query results in a form that is conducive to query refinements and further explorations. Swipe's SBE approach makes semi-structured documents queryable in an intuitive and user-friendly way and, through Wikipedia, delivers the benefits of querying and exploring large knowledge bases to all Web users.
ProFoUnd: program-analysis-based form understanding BIBAFull-Text 313-316
  Michael Benedikt; Tim Furche; Andreas Savvides; Pierre Senellart
An important feature of web search interfaces is the restrictions enforced on input values -- those reflecting either the semantics of the data or requirements specific to the interface. Both integrity constraints and "access restrictions" can be of great use to web exploration tools. We demonstrate here a novel technique for discovering constraints that requires no form submissions whatsoever. We work via statically analyzing the JavaScript client-side code used to enforce the constraints, when such code is available. We combine custom recognizers for JavaScript functions relevant to constraint checking with a generic program analysis layer. Integrated with a web browser, our system shows the constraints detected on accessed web forms, and allows a user to see the corresponding JavaScript code fragment.
A social network for video annotation and discovery based on semantic profiling BIBAFull-Text 317-320
  Marco Bertini; Alberto Del Bimbo; Andrea Ferracani; Daniele Pezzatini
This paper presents a system for the social annotation and discovery of videos based on social networks and social knowledge. The system, developed as a web application, allows users to comment and annotate, manually and automatically, video frames and scenes enriching their content with tags, references to Facebook users and pages and Wikipedia resources. These annotations are used to semantically model the interests and the folksonomy of each user and resource in the network, and to suggest to users new resources, Facebook friends and videos whose content is related to their interests. A screencast showing an example of these functionalities is publicly available at: http://vimeo.com/miccunifi/facetube
GovWILD: integrating open government data for transparency BIBAFull-Text 321-324
  Christoph Böhm; Markus Freitag; Arvid Heise; Claudia Lehmann; Andrina Mascher; Felix Naumann; Vuk Ercegovac; Mauricio Hernandez; Peter Haase; Michael Schmidt
Many government organizations publish a variety of data on the web to enable transparency, foster applications, and to satisfy legal obligations. Data content, format, structure, and quality vary widely, even in cases where the data is published using the wide-spread linked data principles. Yet within this data and their integration lies much value: We demonstrate GovWILD, a web-based prototype that integrates and cleanses Open Government Data at a large scale. Apart from the web-based interface that presents a use case of the created dataset at govwild.org, we provide all integrated data as a download. This data can be used to answer questions about politicians, companies, and government funding.
FreeQ: an interactive query interface for freebase BIBAFull-Text 325-328
  Elena Demidova; Xuan Zhou; Wolfgang Nejdl
Freebase is a large-scale open-world database where users collaboratively create and structure content over an open platform. Keyword queries over Freebase are notoriously ambiguous due to the size and the complexity of the dataset. To this end, novel techniques are required to enable naive users to express their informational needs and retrieve the desired data. FreeQ offers users an interactive interface for incremental query construction over a large-scale dataset, so that the users can find desired information quickly and accurately.
Querying socio-spatial networks on the world-wide web BIBAFull-Text 329-332
  Yerach Doytsher; Ben Galon; Yaron Kanza
navigation systems, allow users to record their location history. The location history data can be analyzed to generate life patterns: patterns that associate people with places they frequently visit. Accordingly, an SSN is a graph that consists of (1) a social network, (2) a spatial network, and (3) life patterns that connect the users of the social network to locations, i.e., to geographical entities in the spatial network. In this paper we present a system that stores an SSN in a graph-based database management system and provides a novel query language, namely SSNQL, for querying the integrated data. The system includes a Web-based graphical user interface that allows presenting the social network, presenting the spatial network and posing SSNQL queries over the integrated data. The user interface also depicts the structure of queries for the purpose of debugging and optimization. Our demonstration presents the management of the integrated data as an SSN and it illustrates the query evaluation process in SSNQL.
Scalable, flexible and generic instant overview search BIBAFull-Text 333-336
  Pavlos Fafalios; Ioannis Kitsos; Yannis Tzitzikas
In recent years there has been increasing interest in providing the top search results while the user types a query letter by letter. In this paper we present and demonstrate a family of instant search applications which, apart from instantly showing the top search results, can also show various other kinds of precomputed aggregated information. This paradigm is more helpful for the end user (in comparison to the classic search-as-you-type), since it can combine autocompletion, search-as-you-type, results clustering, faceted search, entity mining, etc. Furthermore, apart from being helpful for the end user, it is also beneficial for the server's side. However, the instant provision of such services for a large number of queries, large amounts of precomputed information, and a large number of concurrent users is challenging. We demonstrate how this can be achieved using very modest hardware. Our approach relies on (a) a partitioned trie-based index that exploits the available main memory and disk, and (b) dedicated caching techniques. We report performance results over a server running on a modest personal computer (with 3 GB main memory) that provides instant services for millions of distinct queries and terabytes of precomputed information. Furthermore, these services are tolerant to user typos and the word order.
WISER: a web-based interactive route search system for smartphones BIBAFull-Text 337-340
  Roi Friedman; Itsik Hefez; Yaron Kanza; Roy Levin; Eliyahu Safra; Yehoshua Sagiv
Many smartphones nowadays use GPS to detect the location of the user, and can use the Internet to interact with remote location-based services. These two capabilities support online navigation that incorporates search. In this demo we present WISER -- a system for Web-based Interactive Search en Route. In the system, users perform route search by providing (1) a target location, and (2) search terms that specify types of geographic entities to be visited.
   The task is to find a route that minimizes the travel distance from the initial location of the user to the target, via entities of the specified types. However, planning a route under conditions of uncertainty requires the system to take into account the possibility that some visited entities will not satisfy the search requirements, so that the route may need to go via several entities of the same type. In an interactive search, the user provides feedback regarding her satisfaction with entities she visits during the travel, and the system changes the route, in real time, accordingly. The goal is to use the interaction for computing a route that is more effective than a route that is computed in a non-interactive fashion.
Automatically learning gazetteers from the deep web BIBAFull-Text 341-344
  Tim Furche; Giovanni Grasso; Giorgio Orsi; Christian Schallhart; Cheng Wang
Wrapper induction faces a dilemma: To reach web scale, it requires automatically generated examples, but to produce accurate results, these examples must have the quality of human annotations. We resolve this conflict with AMBER, a system for fully automated data extraction from result pages. In contrast to previous approaches, AMBER employs domain specific gazetteers to discern basic domain attributes on a page, and leverages repeated occurrences of similar attributes to group related attributes into records rather than relying on the noisy structure of the DOM. With this approach AMBER is able to identify records and their attributes with almost perfect accuracy (>98%) on a large sample of websites. To make such an approach feasible at scale, AMBER automatically learns domain gazetteers from a small seed set. In this demonstration, we show how AMBER uses the repeated structure of records on deep web result pages to learn such gazetteers. This is only possible with a highly accurate extraction system. Depending on its parametrization, this learning process runs either fully automatically or with human interaction. We show how AMBER bootstraps a gazetteer for UK locations in 4 iterations: From a small seed sample we achieve 94.4% accuracy in recognizing UK locations in the 4th iteration.
FindiLike: preference driven entity search BIBAFull-Text 345-348
  Kavita Ganesan; ChengXiang Zhai
Traditional web search engines enable users to find documents based on topics. However, in finding entities such as restaurants, hotels and products, traditional search engines do not suffice, as users are often interested in finding entities based on structured attributes such as price and brand and unstructured information such as opinions of other web users. In this paper, we showcase a preference-driven search system that enables users to find entities of interest based on a set of structured preferences as well as unstructured opinion preferences. We demonstrate our system in the context of hotel search.
Partisan scale BIBAFull-Text 349-352
  Sedat Gokalp; Hasan Davulcu
The US Senate is the venue of political debates where federal bills are formed and voted on. Senators show their support for or opposition to the bills with their votes. This information makes it possible to extract the polarity of the senators. We use signed bipartite graphs for modeling debates, and we propose an algorithm for partitioning both the senators and the bills comprising the debate into binary opposing camps. Simultaneously, our algorithm scales both the senators and the bills on a univariate scale. Using this scale, a researcher can identify moderate and partisan senators within each camp, and polarizing vs. unifying bills. We applied our algorithm to all the terms of the US Senate to date for longitudinal analysis and developed a web-based interactive user interface, www.PartisanScale.com, to visualize the analysis.
OPAL: a passe-partout for web forms BIBAFull-Text 353-356
  Xiaonan Guo; Jochen Kranzdorf; Tim Furche; Giovanni Grasso; Giorgio Orsi; Christian Schallhart
Web forms are the interfaces of the deep web. Though modern web browsers provide facilities to assist in form filling, this assistance is limited to prior form fillings or keyword matching. Automatic form understanding enables a broad range of applications, including crawlers, meta-search engines, and usability and accessibility support for enhanced web browsing. In this demonstration, we use a novel form understanding approach, OPAL, to assist in form filling even for complex, previously unknown forms. OPAL associates form labels to fields by analyzing structural properties in the HTML encoding and visual features of the page rendering. OPAL interprets this labeling and classifies the fields according to a given domain ontology. The combination of these two properties allows OPAL to deal effectively with many forms outside of the grasp of existing form filling techniques. In the UK real estate domain, OPAL achieves >99% accuracy in form understanding.
Round-trip semantics with Sztakipedia and DBpedia spotlight BIBAFull-Text 357-360
  Mihály Héder; Pablo N. Mendes
We describe a tool kit to support a knowledge-enhancement cycle on the Web. In the first step, structured data which is extracted from Wikipedia is used to construct automatic content enhancement engines. Those engines can be used to interconnect knowledge in structured and unstructured information sources on the Web, including Wikipedia itself. Sztakipedia-toolbar is a MediaWiki user script which brings DBpedia Spotlight and other kinds of machine intelligence into the Wiki editor interface to provide enhancement suggestions to the user. The suggestions offered by the tool focus on complementing knowledge and increasing the availability of structured data on Wikipedia. This will, in turn, increase the available information for the content enhancement engines themselves, completing a virtuous cycle of knowledge enhancement.
ResEval mash: a mashup tool for advanced research evaluation BIBAFull-Text 361-364
  Muhammad Imran; Felix Kling; Stefano Soi; Florian Daniel; Fabio Casati; Maurizio Marchese
In this demonstration, we present ResEval Mash, a mashup platform for research evaluation, i.e., for the assessment of the productivity or quality of researchers, teams, institutions, journals, and the like -- a topic most of us are acquainted with. The platform is specifically tailored to the need of sourcing data about scientific publications and researchers from the Web, aggregating them, computing metrics (also complex and ad-hoc ones), and visualizing them. ResEval Mash is a hosted mashup platform with a client-side editor and runtime engine, both running inside a common web browser. It also supports the processing of large amounts of data, a feature that is achieved via the sensible distribution of the respective computation steps over client and server. Our preliminary user study shows that ResEval Mash indeed has the power to enable domain experts to develop their own mashups (research evaluation metrics); other mashup platforms rather support skilled developers. The reason for this success is ResEval Mash's domain-specificity.
T@gz: intuitive and effortless categorization and sharing of email conversations BIBAFull-Text 365-368
  Parag Mulendra Joshi; Claudio Bartolini; Sven Graupner
In this paper, we describe T@gz, a system designed for effortless and instantaneous sharing of enterprise knowledge through routine email communications and powerful harvesting of such enterprise information using text analytics techniques. T@gz is a system that enables dynamic, non-intrusive and effortless sharing of information within an enterprise and automatically harvests knowledge from such daily interactions. It also allows enterprise knowledge workers to easily subscribe to new information. It enables self organization of information in conversations while it carefully avoids requiring users to substantially change their usual work-flow of exchanging emails.
   Incorporating this system in an enterprise improves productivity by: (1) discovering connections between employees with converging interests and expertise, similar to social networks, naturally leading to the formation of interest groups; (2) avoiding the problem of information lost in mountains of emails; and (3) creating expert profiles by mapping areas of expertise or interests to the organizational map. Harvested information includes a folksonomy appropriate to the organization, tagged and organized conversations, and an expertise map.
Visual oXPath: robust wrapping by example BIBAFull-Text 369-372
  Jochen Kranzdorf; Andrew Sellers; Giovanni Grasso; Christian Schallhart; Tim Furche
Good examples are hard to find, particularly in wrapper induction: Picking even one wrong example can spell disaster by yielding overgeneralized or overspecialized wrappers. Such wrappers extract data with low precision or recall, unless adjusted by human experts at significant cost.
   Visual OXPath is an open-source, visual wrapper induction system that requires minimal examples and eases wrapper refinement: Often it derives the intended wrapper from a single example through sophisticated heuristics that determine the best set of similar examples. To ease wrapper refinement, it offers a list of wrappers ranked by example similarity and robustness. Visual OXPath offers extensive visual feedback for this refinement which can be performed without any knowledge of the underlying wrapper language. Where further refinement by a human wrapper is needed, Visual OXPath profits from being based on OXPath, a declarative wrapper language that extends XPath with a thin layer of features necessary for extraction and page navigation.
Adding wings to red bull media: search and display semantically enhanced video fragments BIBAFull-Text 373-376
  Thomas Kurz; Sebastian Schaffert; Georg Güntner; Manuel Fernandez
The Linked Data movement, which aims at publishing and interconnecting machine-readable data, originated in the last decade. Although the set of (open) data sources is rapidly growing, the integration of multimedia in this Web of Data is still at a very early stage. This paper describes how arbitrary video content and metadata can be processed to identify meaningful linking partners for video fragments -- and thus create a web of linked media. The video test-set for our demonstrator is part of the Red Bull Content Pool and confined to the Cliff Diving domain. The candidate set of possible link targets is a combination of a Red Bull thesaurus, information about divers from www.redbull.com and concepts from DBpedia. The demo includes both a semantic search on videos and video fragments and a player for videos with semantic enhancements.
Kjing: (mix the knowledge) BIBAFull-Text 377-380
  Daniel Lacroix; Yves-Armel Martin
Kjing is a web app that allows users to rapidly set up a multiscreen, multi-device environment and to interact with and distribute content in real time. It can be used for museographic, educational or conferencing purposes.
Interactive hypervideo visualization for browsing behavior analysis BIBAFull-Text 381-384
  Luis A. Leiva; Roberto Vivó
Processing web interaction data is known to be cumbersome and time-consuming. State-of-the-art web tracking systems usually allow replaying user interactions in the form of mouse tracks, a video-like visualization scheme, to engage practitioners in the analysis process. However, traditional online video inspection has not explored the full capabilities of hypermedia and interactive techniques. In this paper, we introduce a web-based tracking tool that generates interactive visualizations from users' activity. The system unobtrusively collects browser events derived from normal usage, offering a unified framework to inspect interaction data in several ways. We compare our approach to related work in the research community as well as in commercial systems, and describe how ours fits in a real-world scenario. This research shows that there is a wide range of applications where the proposed tool can assist the WWW community.
Simplifying friendlist management BIBAFull-Text 385-388
  Yabing Liu; Bimal Viswanath; Mainack Mondal; Krishna P. Gummadi; Alan Mislove
Online social networks like Facebook allow users to connect, communicate, and share content. The popularity of these services has led to an information overload for their users; the task of simply keeping track of different interactions has become daunting. To reduce this burden, sites like Facebook allow the user to group friends into specific lists, known as friendlists, aggregating the interactions and content from all friends in each friendlist. While this approach greatly reduces the burden on the user, it still forces the user to create and populate the friendlists themselves and, worse, makes the user responsible for maintaining the membership of their friendlists over time. We show that friendlists often have a strong correspondence to the structure of the social network, implying that friendlists may be automatically inferred by leveraging the social network structure. We present a demonstration of Friendlist Manager, a Facebook application that proposes friendlists to the user based on the structure of their local social network, allows the user to tweak the proposed friendlists, and then automatically creates the friendlists for the user.
The RaiNewsbook: browsing worldwide multimodal news stories by facts, entities and dates BIBAFull-Text 389-392
  Maurizio Montagnuolo; Alberto Messina
This paper presents a novel framework for multimodal news data aggregation, retrieval and browsing. News aggregations are contextualised within automatically extracted information such as entities (i.e. persons, places and organisations), temporal span, categorical topics, social networks popularity and audience scores. Further resources coming from professional repositories, and related to the aggregation topics, can be accessed as well. The system is accessible through a Web interface supporting interactive navigation and exploration of large-scale collections of news stories at the topic and context levels. Users can select news topics and sub-topics interactively, building their personal paths towards worldwide events, main characters, dates and contents.
BabelNetXplorer: a platform for multilingual lexical knowledge base access and exploration BIBAFull-Text 393-396
  Roberto Navigli; Simone Paolo Ponzetto
Knowledge on word meanings and their relations across languages is vital for enabling semantic information technologies: in fact, the ever increasingly multilingual nature of the Web now calls for the development of methods that are both robust and widely applicable for processing textual information in a multitude of languages. In our research, we approach this ambitious task by means of BabelNet, a wide-coverage multilingual lexical knowledge base. In this paper we present an Application Programming Interface and a Graphical User Interface which, respectively, allow programmatic access and visual exploration of BabelNet. Our contribution is to provide the research community with easy-to-use tools for performing multilingual lexical semantic analysis, thereby fostering further research in this direction.
H2RDF: adaptive query processing on RDF data in the cloud BIBAFull-Text 397-400
  Nikolaos Papailiou; Ioannis Konstantinou; Dimitrios Tsoumakos; Nectarios Koziris
In this work we present H2RDF, a fully distributed RDF store that combines the MapReduce processing framework with a NoSQL distributed data store. Our system features two unique characteristics that enable efficient processing of both simple and multi-join SPARQL queries on a virtually unlimited number of triples: join algorithms that execute joins according to query selectivity to reduce processing; and adaptive choice among centralized and distributed (MapReduce-based) join execution for fast query responses. Our system efficiently answers both simple joins and complex multivariate queries and easily scales to 3 billion triples using a small cluster of 9 worker nodes. H2RDF outperforms state-of-the-art distributed solutions in multi-join and nonselective queries while achieving comparable performance to centralized solutions in selective queries. In this demonstration we showcase the system's functionality through an interactive GUI. Users will be able to execute predefined or custom-made SPARQL queries on datasets of different sizes, using different join algorithms. Moreover, they can repeat all queries utilizing a different number of cluster resources. Using real-time cluster monitoring and detailed statistics, participants will be able to understand the advantages of different execution schemes versus the input data as well as the scalability properties of H2RDF over both the data size and the available worker resources.
Paraimpu: a platform for a social web of things BIBAFull-Text 401-404
  Antonio Pintus; Davide Carboni; Andrea Piras
The Web of Things is a scenario where potentially billions of connected smart objects communicate using the Web protocols, HTTP in primis. Envisioning and designing a Web of Things has raised several research issues, from protocol adoption and communication models to architectural styles and social aspects. In this demo we present the prototype of a scalable architecture for a large-scale social Web of Things for smart objects and services, named Paraimpu. It is a Web-based platform which allows users to add, use, share and inter-connect real HTTP-enabled smart objects and "virtual" things like services on the Web and social networks. Paraimpu defines and uses a few strong abstractions in order to allow mash-ups of heterogeneous things, introducing powerful rules for data adaptation. Adding and inter-connecting objects is supported through user-friendly models and features.
Automated semantic tagging of speech audio BIBAFull-Text 405-408
  Yves Raimond; Chris Lowis; Roderick Hodgson; Jonathan Tweed
The BBC is currently tagging programmes manually, using DBpedia as a source of tag identifiers, and a list of suggested tags extracted from the programme synopsis. These tags are then used to help navigation and topic-based search of programmes on the BBC website. However, given the very large number of programmes available in the archive, most of them having very little metadata attached to them, we need a way to automatically assign tags to programmes. We describe a framework to do so, using speech recognition, text processing and concept tagging techniques. We describe how this framework was successfully applied to a very large BBC radio archive. We demonstrate an application using automatically extracted tags to aid discovery of archive content.
Baya: assisted mashup development as a service BIBAFull-Text 409-412
  Soudip Roy Chowdhury; Carlos Rodríguez; Florian Daniel; Fabio Casati
In this demonstration, we describe Baya, an extension of Yahoo! Pipes that guides and speeds up development by interactively recommending composition knowledge harvested from a repository of existing pipes. Composition knowledge is delivered in the form of reusable mashup patterns, which are retrieved and ranked on the fly while the developer models his own pipe (the mashup) and which are automatically woven into his pipe model upon selection. Baya mines candidate patterns from pipe models available online and thereby leverages the knowledge of the crowd, i.e., of other developers. Baya is an extension for the Firefox browser that seamlessly integrates with Pipes. It enhances Pipes with a powerful new feature for both expert developers and beginners, speeding up the former and enabling the latter. The discovery of composition knowledge is provided as a service and can easily be extended toward other modeling environments.
S2S architecture and faceted browsing applications BIBAFull-Text 413-416
  Eric Rozell; Peter Fox; Jin Zheng; Jim Hendler
This demo paper will discuss a search interface framework designed as part of the Semantic eScience Framework project at the Tetherless World Constellation. The search interface framework, S2S, was designed to facilitate the construction of interactive user interfaces for data catalogs. We use Semantic Web technologies, including an OWL ontology for describing the semantics of data services, as well as the semantics of user interface components. We have applied S2S in three different scenarios: (1) the development of a faceted browse interface integrated with an interactive mapping and visualization tool for biological and chemical oceanographic data, (2) the development of a faceted browser for more than 700,000 open government datasets in over 100 catalogs worldwide, and (3) the development of a user interface for a virtual observatory in the field of solar-terrestrial physics. Throughout this paper, we discuss the architecture of the S2S framework, focusing on its extensibility and reusability, and also review the application scenarios.
Turning a Web 2.0 social network into a Web 3.0, distributed, and secured social web application BIBAFull-Text 417-420
  Henry Story; Romain Blin; Julien Subercaze; Christophe Gravier; Pierre Maret
This demonstration presents the process of transforming a Web 2.0 centralized social network into a Web 3.0, distributed, and secured Social application, and what was learnt in this process. The initial Web 2.0 Social Network application was written by a group of students over a period of 4 months in the spring of 2011. It had all the bells and whistles of the well known Social Networks: walls to post on, circles of friends, etc. The students were very enthusiastic in building their social network, but the chances of it growing into a large community were close to non-existent unless a way could be found to tie it into a bigger social network. This is where linked data protected by the Web Access Control Ontology and WebID authentication could come to the rescue. The paper describes this transformation process, and we will demonstrate the full software version at the conference.
Adding fake facts to ontologies BIBAFull-Text 421-424
  Fabian M. Suchanek; David Gross-Amblard
In this paper, we study how artificial facts can be added to an RDFS ontology. Artificial facts are an easy way of proving the ownership of an ontology: If another ontology contains the artificial fact, it has probably been taken from the original ontology. We show how the ownership of an ontology can be established with provably tight probability bounds, even if only parts of the ontology are being re-used. We explain how artificial facts can be generated in an inconspicuous and minimally disruptive way. Our demo allows users to generate artificial facts and to guess which facts were generated.
CASIS: a system for concept-aware social image search BIBAFull-Text 425-428
  Ba Quan Truong; Aixin Sun; Sourav S. Bhowmick
Tag-based social image search enables users to formulate queries using keywords. However, as queries are usually very short and users have very different interpretations of a particular tag in annotating and searching images, the images returned for a tag query usually relate to multiple concepts. We demonstrate Casis, a system for concept-aware social image search. Casis detects tag concepts based on the collective knowledge embedded in social tagging from the initial results to a query. A tag concept is a set of tags that are highly associated with each other and collectively convey a semantic meaning. Images returned for a query are then organized by tag concepts. Casis provides intuitive and interactive browsing of search results through a tag concept graph, which visualizes the tags defining each tag concept and their relationships within and across concepts. Supporting multiple retrieval methods and multiple concept detection algorithms, Casis offers superior social image search experiences by choosing the most suitable retrieval methods and concept-aware image organizations.
In the mood for affective search with web stereotypes BIBAFull-Text 429-432
  Tony Veale; Yanfen Hao
Models of sentiment analysis in text require an understanding of what kinds of sentiment-bearing language are generally used to describe specific topics. Thus, fine-grained sentiment analysis requires both a topic lexicon and a sentiment lexicon, and an affective mapping between both. For instance, when one speaks disparagingly about a city (like London, say), what aspects of city does one generally focus on, and what words are used to disparage those aspects? As when we talk about the weather, our language obeys certain familiar patterns -- what we might call clichés and stereotypes -- when we talk about familiar topics. In this paper we describe the construction of an affective stereotype lexicon, that is, a lexicon of stereotypes and their most salient affective qualities. We show, via a demonstration system called MOODfinger, how this lexicon can be used to underpin the processes of affective query expansion and summarization in a system for retrieving and organizing news content from the Web. Though we adopt a simple bipolar +/- view of sentiment, we show how this stereotype lexicon allows users to coin their own nuanced moods on demand.
Personalized newscasts and social networks: a prototype built over a flexible integration model BIBAFull-Text 433-436
  Luca Vignaroli; Roberto Del Pero; Fulvio Negro
The way we watch television is changing with the introduction of attractive Web activities that move users away from TV to other media. The integration of the cultures of TV and Web is still an open issue. How can we make TV more open? How can we enable a possible collaboration of these two different worlds? TV-Web convergence is much more than placing a Web browser into a TV set or putting TV content into a Web media player. The NoTube project, funded by the European Community, is demonstrating how an open and general set of tools adaptable to a number of possible scenarios and allowing a designer to implement the targeted final service with ease can be introduced. A prototype based on the NoTube model in which the Smartphone is used as secondary screen is presented. The video demonstration [11] is available at http://youtu.be/dMM7MH9CZY8.
An early warning system for unrecognized drug side effects discovery BIBAFull-Text 437-440
  Hao Wu; Hui Fang; Steven J. Stanhope
Drugs can treat human diseases through chemical interactions between the ingredients and intended targets in the human body. However, the ingredients could unexpectedly interact with off-targets, which may cause adverse drug side effects. Notifying patients and physicians of potential drug effects is an important step in improving healthcare quality and delivery. With the increasing popularity of Web 2.0 applications, more and more patients are starting to discuss drug side effects in many online sources. In this paper, we describe our efforts on building UDWarning, a novel early warning system for unrecognized drug side effects discovery based on the text information gathered from the Internet. The system can automatically build a knowledge base for drug side effects by integrating the information related to drug side effects from different sources. It can also monitor the online information about drugs and discover possible unrecognized drug side effects. Our demonstration will show that the system has the potential to expedite the discovery process of unrecognized drug side effects and to improve the quality of healthcare.
Titan: a system for effective web service discovery BIBAFull-Text 441-444
  Jian Wu; Liang Chen; Yanan Xie; Zibin Zheng
With the increase in web services and the diversity of user demands, effective web service discovery is becoming a big challenge. Clustering web services would greatly boost the ability of a web service search engine to retrieve relevant ones. In this paper, we propose a web service search engine, Titan, which contains 15,969 web services crawled from the Internet. In Titan, two main technologies, i.e., web service clustering and tag recommendation, are employed to improve the effectiveness of web service discovery. Specifically, both WSDL (Web Service Description Language) documents and tags of web services are utilized for clustering, while tag recommendation is adopted to handle some inherent problems of tagging data, e.g., uneven tag distribution and noisy tags.
Deep answers for naturally asked questions on the web of data BIBAFull-Text 445-449
  Mohamed Yahya; Klaus Berberich; Shady Elbassuoni; Maya Ramanath; Volker Tresp; Gerhard Weikum
We present DEANNA, a framework for natural language question answering over structured knowledge bases. Given a natural language question, DEANNA translates it into a structured SPARQL query that can be evaluated over knowledge bases such as Yago, DBpedia, Freebase, or other Linked Data sources. DEANNA analyzes questions and maps verbal phrases to relations and noun phrases to either individual entities or semantic classes. Importantly, it judiciously generates variables for target entities or classes to express joins between multiple triple patterns. We leverage the semantic type system for entities and use constraints in jointly mapping the constituents of the question to relations, classes, and entities. We demonstrate the capabilities and interface of DEANNA, which allows advanced users to influence the translation process and to see how the different components interact to produce the final result.
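To make the notion of joining triple patterns through a shared variable concrete, the sketch below renders a list of (subject, predicate, object) patterns as a SPARQL SELECT string. The example question, the `ex:` identifiers, and the helper `to_sparql` are all hypothetical illustrations, not DEANNA's actual mapping or output; the resulting query would also need matching PREFIX declarations before it could be evaluated.

```python
def to_sparql(target_var, triple_patterns):
    """Render (subject, predicate, object) patterns as a SELECT query string."""
    body = " . ".join(f"{s} {p} {o}" for s, p, o in triple_patterns)
    return f"SELECT {target_var} WHERE {{ {body} }}"


# Hypothetical mapping for the question "Which writers were born in Lyon?"
patterns = [
    ("?x", "rdf:type", "ex:Writer"),  # noun phrase mapped to a semantic class
    ("?x", "ex:bornIn", "ex:Lyon"),   # verbal phrase mapped to a relation, noun phrase to an entity
]
query = to_sparql("?x", patterns)
```

The shared variable `?x` is what expresses the join: both patterns must hold for the same entity.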

Poster presentations

Associating structured records to text documents BIBAFull-Text 451-452
  Rakesh Agrawal; Ariel Fuxman; Anitha Kannan; John Shafer; Partha Pratim Talukdar
Postulate two independently created data sources. The first contains text documents, each discussing one or a small number of objects. The second is a collection of structured records, each containing information about the characteristics of some objects. We present techniques for associating structured records to corresponding text documents and empirical results supporting the proposed techniques.
Textual and contextual patterns for sentiment analysis over microblogs BIBAFull-Text 453-454
  Fotis Aisopos; George Papadakis; Konstantinos Tserpes; Theodora Varvarigou
Microblog content poses serious challenges to the applicability of sentiment analysis, due to its inherent characteristics. We introduce a novel method relying on content-based and context-based features, guaranteeing high effectiveness and robustness in the settings we are considering. The evaluation of our methods over a large Twitter data set indicates significant improvements over the traditional techniques.
PAC'nPost: a framework for a micro-blogging social network in an unstructured P2P network BIBAFull-Text 455-456
  H. Asthana; Ingemar J. Cox
We describe a framework for a micro-blogging social network implemented in an unstructured peer-to-peer network. A micro-blogging social network must provide capabilities for users to (i) publish, (ii) follow and (iii) search. Our retrieval mechanism is based on a probably approximately correct (PAC) search architecture in which a query is sent to a fixed number of nodes in the network. In PAC, the probability of attaining a particular accuracy is a function of the number of nodes queried (fixed) and the replication rate of documents (micro-blog). Publishing a micro-blog then becomes a matter of replicating the micro-blog to the required number of random nodes without any central coordination. To solve this, we use techniques from the field of rumour spreading (gossip protocols) to propagate new documents. Our document spreading algorithm is designed such that a document has a very high probability of being copied to only the required number of nodes. Results from simulations performed on networks of 10,000, 100,000 and 500,000 nodes verify our mathematical models. The framework is also applicable for indexing dynamic web pages in a distributed search engine or for a system which indexes newly created BitTorrents in a decentralized environment.
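The dependence of retrieval probability on the number of nodes queried and the replication rate can be illustrated with a simple combinatorial model. The sketch below assumes, as a simplification that is not taken from the paper, that a document is held by `replicas` distinct nodes chosen uniformly at random out of `n_nodes`, and that a query reaches `queried` distinct nodes chosen uniformly at random; the function name and parameters are hypothetical.

```python
from math import comb


def hit_probability(n_nodes, replicas, queried):
    """P(at least one queried node holds the document) under uniform random placement."""
    if replicas + queried > n_nodes:
        # By the pigeonhole principle, the two node sets must overlap.
        return 1.0
    # Complement of the event that all queried nodes fall among the non-holders.
    return 1.0 - comb(n_nodes - replicas, queried) / comb(n_nodes, queried)
```

Under this model, raising either the replication count or the number of queried nodes increases the chance that a query finds the document, which is the trade-off a publisher tunes when deciding how many copies to spread.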
The impact of visual appearance on user response in online display advertising BIBAFull-Text 457-458
  Javad Azimi; Ruofei Zhang; Yang Zhou; Vidhya Navalpakkam; Jianchang Mao; Xiaoli Fern
Display advertising has been a significant source of revenue for publishers and ad networks in the online advertising ecosystem. One of the main goals in display advertising is to maximize user response rate for advertising campaigns, such as click through rates (CTR) or conversion rates. Although the visual appearance of ads (creatives) matters for propensity of user response, there is no published work so far to address this topic via a systematic data-driven approach. In this paper we quantitatively study the relationship between the visual appearance and performance of creatives using large scale data in the world's largest display ads exchange system, RightMedia. We designed a set of 43 visual features, some of which are novel and some are inspired by related work. We extracted these features from real creatives served on RightMedia. Then, we present recommendations of visual features that have the most important impact on CTR to the professional designers in order to optimize their creative design. We believe that the findings presented in this paper will be very useful for the online advertising industry in designing high-performance creatives. We have also designed and conducted an experiment to evaluate the effectiveness of visual features by themselves for CTR prediction.
Impact of ad impressions on dynamic commercial actions: value attribution in marketing campaigns BIBAFull-Text 459-460
  Joel Barajas; Ram Akella; Marius Holtan; Jaimie Kwon; Aaron Flores; Victor Andrei
We develop a descriptive method to estimate the impact of ad impressions on commercial actions dynamically without tracking cookies. We analyze 2,885 campaigns for 1,251 products from the Advertising.com ad network. We compare our method with A/B testing for 2 campaigns, and with a public synthetic dataset.
Audience dynamics of online catch up TV BIBAFull-Text 461-462
  Thomas Beauvisage; Jean-Samuel Beuscart
This paper studies the demand for TV contents on online catch up platforms, in order to assess how catch up TV offers transform TV consumption. We build upon empirical data on French TV consumption in June 2011: a daily monitoring of online audience on web catch up platforms, and live audience ratings of traditional broadcast TV. We provide three main results: 1) online consumption is more concentrated than off-line audience, contradicting the hypothesis of a long tail effect of catch up TV; 2) the temporality of replay TV consumption on the web is very close to the live broadcasting of the programs, thus softening rather than breaking the synchrony of traditional TV; 3) detailed data on online consumption of news reveals two patterns of consumption ("alternative TV ritual" vs. "à la carte").
Group recommendations via multi-armed bandits BIBAFull-Text 463-464
  José Bento; Stratis Ioannidis; S. Muthukrishnan; Jinyun Yan
We study recommendations for persistent groups that repeatedly engage in a joint activity. We approach this as a multi-armed bandit problem. We design a recommendation policy and show it has logarithmic regret. Our analysis also shows that regret depends linearly on d, the size of the underlying persistent group. We evaluate our policy on movie recommendations over the MovieLens and MoviePilot datasets.
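For readers unfamiliar with the bandit setting mentioned above, here is a minimal sketch of UCB1, a standard single-agent policy known to achieve logarithmic regret for rewards in [0, 1]. It is not the group recommendation policy of the paper, which the abstract does not specify. The `pull` callback and all names are assumptions made for illustration.

```python
import math
from typing import Callable, List


def ucb1(pull: Callable[[int], float], n_arms: int, horizon: int) -> List[int]:
    """Run UCB1 for `horizon` rounds and return how often each arm was played.

    `pull(arm)` is a hypothetical callback returning a reward in [0, 1].
    """
    counts = [0] * n_arms    # number of times each arm was played
    totals = [0.0] * n_arms  # sum of rewards observed per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialisation: play every arm once
        else:
            # Pick the arm with the highest mean plus exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        totals[arm] += reward
    return counts


if __name__ == "__main__":
    # Toy example: arm 1 always pays more than arm 0.
    print(ucb1(lambda arm: [0.2, 0.8][arm], n_arms=2, horizon=500))
```

The exploration bonus shrinks as an arm is played more often, so poor arms are tried only occasionally, which is the intuition behind the logarithmic regret bound.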
A revenue sharing mechanism for federated search and advertising BIBAFull-Text 465-466
  Marco Brambilla; Sofia Ceppi; Nicola Gatti; Enrico H. Gerding
Federated search engines combine search results from two or more (general-purpose or domain-specific) content providers. They enable complex searches (e.g., complete vacation planning) or more reliable results by allowing users to receive high quality results from a variety of sources. We propose a new revenue sharing mechanism for federated search engines, considering different actors involved in the search results generation (i.e., content providers, advertising providers, hybrid content+advertising providers, and content integrators). We extend the existing sponsored search auctions by supporting heterogeneous participants and redistribution of monetary values to the different actors, while maintaining flexibility in the payment scheme.
Efficient multi-view maintenance in the social semantic web BIBAFull-Text 467-468
  Matthias Broecheler; Andrea Pugliese; V. S. Subrahmanian
The Social Semantic Web (SSW) refers to the mix of RDF data in web content, and social network data associated with those who posted that content. Applications to monitor the SSW are becoming increasingly popular. For instance, marketers want to look for semantic patterns relating to the content of tweets and Facebook posts relating to their products. Such applications allow multiple users to specify patterns of interest, and monitor them in real-time as new data gets added to the web or to a social network. In this paper, we develop the concept of SSW view servers in which all of these types of applications can be simultaneously monitored from such servers. The patterns of interest are views. We show that a given set of views can be compiled in multiple possible ways to take advantage of common substructures, and define the concept of an optimal merge. We develop a very fast MultiView algorithm that scalably and efficiently maintains multiple subgraph views. We show that our algorithm is correct, study its complexity, and experimentally demonstrate that our algorithm can scalably handle updates to hundreds of views on real-world SSW databases with up to 540M edges.
BlueFinder: estimate where a beach photo was taken BIBAFull-Text 469-470
  Liangliang Cao; John R. Smith; Zhen Wen; Zhijun Yin; Xin Jin; Jiawei Han
This paper describes a system to estimate geographical locations for beach photos. We develop an iterative method that not only trains visual classifiers but also discovers geographical clusters for beach regions. The results show that it is possible to recognize different beaches using visual information with reasonable accuracy, and our system works 27 times better than random guess for the geographical localization task.
News comments generation via mining microblogs BIBAFull-Text 471-472
  Xuezhi Cao; Kailong Chen; Rui Long; Guoqing Zheng; Yong Yu
Microblogging websites such as Twitter and Chinese Sina Weibo contain large amounts of microblogs posted by users. Many of these microblogs are highly sensitive to important real-world events and correlated with news events. Thus, microblogs from these websites can be collected as comments on the news to reveal the opinions and attitudes of a large number of users towards a news event. In this paper, we present a framework to automatically collect relevant microblogs from microblogging websites to generate comments for popular news on news websites.
MobiMash: end user development for mobile mashups BIBAFull-Text 473-474
  Cinzia Cappiello; Maristella Matera; Matteo Picozzi; Alessandro Caio; Mariano Tomas Guevara
The adoption of adequate tools, oriented towards the End User Development (EUD), can promote mobile mashups as "democratic" tools, able to accommodate the long tail of users' specific needs. We introduce MobiMash, a novel approach and a platform for the construction of mobile mashups, characterized by a lightweight composition paradigm, mainly guided by the notion of visual templates. The composition paradigm generates an application schema that is based on a domain specific language addressing dimensions for data integration and service orchestration, and that guides at run-time the dynamic instantiation of the final mobile app.
Privacy management for online social networks BIBAFull-Text 475-476
  Gorrell P. Cheek; Mohamed Shehab
We introduce a privacy management approach that leverages users' memory and opinion of their friends to set policies for other similar friends. We refer to this new approach as Same-As Privacy Management. To demonstrate the effectiveness of our privacy management improvements, we implemented a prototype Facebook application and conducted an extensive user study. We demonstrated considerable reductions in policy authoring time using Same-As Privacy Management over traditional group based privacy management approaches. Finally, we presented user perceptions, which were very encouraging.
Fast and cost-efficient bid estimation for contextual ads BIBAFull-Text 477-478
  Ye Chen; Pavel Berkhin; Jie Li; Sharon Wan; Tak W. Yan
We study the problem of estimating the value of a contextual ad impression, based upon which an ad network bids on an exchange. The ad impression opportunity materializes into revenue only if the ad network wins the impression and a user clicks on the ad, both of which are rare events, especially in an open exchange for contextual ads. Given a low revenue expectation and the elusive nature of predicting weak-signal click-through rates, the computational cost incurred by bid estimation must be carefully justified. We developed and deployed a novel impression valuation model, which is expected to reduce the computational cost by 95% and hence more than double the profit. Our approach is highly economized through a fast implementation of kNN regression that primarily leverages low-dimensional sell-side data (user and publisher). We also address the cold-start problem, or the exploration vs. exploitation requirement, by Bayesian smoothing using a beta prior, and adapt to the temporal dynamics using an autoregressive model.
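Bayesian smoothing with a beta prior, as mentioned in the abstract above, can be illustrated with the textbook posterior-mean estimate of a click-through rate. This is a minimal sketch and not the authors' deployed model. The prior parameters `alpha` and `beta` are hypothetical values chosen only for the example.

```python
def smoothed_ctr(clicks: int, impressions: int,
                 alpha: float = 1.0, beta: float = 99.0) -> float:
    """Posterior-mean CTR under a Beta(alpha, beta) prior.

    With no data the estimate equals the prior mean alpha / (alpha + beta);
    as impressions accumulate it approaches clicks / impressions.
    The default prior (mean 1%) is an assumption for illustration.
    """
    if clicks < 0 or impressions < clicks:
        raise ValueError("need 0 <= clicks <= impressions")
    return (clicks + alpha) / (impressions + alpha + beta)


if __name__ == "__main__":
    print(smoothed_ctr(0, 0))      # cold start: falls back to the prior mean
    print(smoothed_ctr(5, 100))    # few observations: pulled towards the prior
    print(smoothed_ctr(500, 10_000))  # many observations: close to the raw rate
```

The prior acts like `alpha` pseudo-clicks and `beta` pseudo-non-clicks, so a new ad with little traffic gets a stable estimate instead of an extreme raw rate.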
Fast query evaluation for ad retrieval BIBAFull-Text 479-480
  Ye Chen; Mitali Gupta; Tak W. Yan
We describe a fast query evaluation method for ad document retrieval in online advertising, based upon the classic WAND algorithm. The key idea is to localize per-topic term upper bounds into homogeneous ad groups. Our approach is not only theoretically motivated by a topical mixture model but also empirically justified by the characteristics of the ad domain, that is, short and semantically focused documents with a natural hierarchy. We report experimental results using artificial and real-world query-ad retrieval data, and show that the tighter-bound WAND outperforms the traditional approach, with a 35.4% reduction in the number of full evaluations.
CONSENTO: a consensus search engine for answering subjective queries BIBAFull-Text 481-482
  Jaehoon Choi; Donghyeon Kim; Seongsoon Kim; Junkyu Lee; Sangrak Lim; Sunwon Lee; Jaewoo Kang
Search engines have become an important decision making tool today. Decision making queries are often subjective, such as 'best sedan for family use,' 'best action movies in 2010,' to name a few. Unfortunately, such queries cannot be answered properly by conventional search systems. In order to address this problem, we introduce Consento, a consensus search engine designed to answer subjective queries. Consento performs subdocument-level indexing to more precisely capture semantics from user opinions. We also introduce a new ranking method, ConsensusRank, that counts online comments referring to an entity as weighted votes for the entity. We validated the framework with an empirical study using data on movie reviews.
Good abandonments in factoid queries BIBAFull-Text 483-484
  Aleksandr Chuklin; Pavel Serdyukov
It is often considered that a high abandonment rate corresponds to poor IR system performance. However, several studies have suggested that there are so-called good abandonments, i.e., situations when the search engine result page (SERP) contains enough details to satisfy the user's information need without the need to click on search results. In those papers only editorial metrics of the SERP were used, and one cannot be sure that situations marked as good abandonments by assessors actually imply user satisfaction. In the present work we provide real-world evidence for good abandonments by calculating the correlation between editorial and click metrics.
Potential good abandonment prediction BIBAFull-Text 485-486
  Aleksandr Chuklin; Pavel Serdyukov
Abandonment rate is one of the most broadly used online user satisfaction metrics. In this paper we discuss the notion of potential good abandonment, i.e., queries that may potentially result in user satisfaction without the need to click on search results (if the search engine result page contains enough details to satisfy the user's information need). We show that we can train a classifier which is able to distinguish between potential good and bad abandonments with rather good results compared to our baseline. As a case study we show how to apply these ideas to IR evaluation and introduce a new metric for A/B testing: Bad Abandonment Rate.
Ubiquitous access control for SPARQL endpoints: lessons learned and future challenges BIBAFull-Text 487-488
  Luca Costabello; Serena Villata; Nicolas Delaforge; Fabien Gandon
We present and evaluate a context-aware access control framework for SPARQL endpoints queried from mobile devices.
Mining for insights in the search engine query stream BIBAFull-Text 489-490
  Ovidiu Dan; Pavel Dmitriev; Ryen W. White
Search engines record a large amount of metadata each time a user issues a query. While efficiently mining this data can be challenging, the results can be useful in multiple ways, including monitoring search engine performance, improving search relevance, prioritizing research, and optimizing day-to-day operations. In this poster, we describe an approach for mining query log data for actionable insights -- specific query segments (sets of queries) that require attention, and actions that need to be taken to improve the segments. Starting with a set of important metrics, we identify query segments that are "interesting" with respect to these metrics using a distributed frequent itemset mining algorithm.
Developing domain-specific mashup tools for end users BIBAFull-Text 491-492
  Florian Daniel; Muhammad Imran; Felix Kling; Stefano Soi; Fabio Casati; Maurizio Marchese
The recent emergence of mashup tools has refueled research on end user development, i.e., on enabling end users without programming skills to compose their own applications. Yet, similar to what happened with analogous promises in web service composition and business process management, research has mostly focused on technology and, as a consequence, has failed its objective. Plain technology (e.g., SOAP/WSDL web services) or simple modeling languages (e.g., Yahoo! Pipes) don't convey enough meaning to non-programmers. We propose a domain-specific approach to mashups that "speaks the language of the user", i.e., that is aware of the terminology, concepts, rules, and conventions (the domain) the user is comfortable with. We show what developing a domain-specific mashup tool means, which role the mashup meta-model and the domain model play, and how these can be merged into a domain-specific mashup meta-model. We apply the approach by implementing a mashup tool for the research evaluation domain. Our user study confirms that domain-specific mashup tools indeed lower the entry barrier to mashup development.
Discovery and reuse of composition knowledge for assisted mashup development BIBAFull-Text 493-494
  Florian Daniel; Carlos Rodriguez; Soudip Roy Chowdhury; Hamid R. Motahari Nezhad; Fabio Casati
Despite the emergence of mashup tools like Yahoo! Pipes or JackBe Presto Wires, developing mashups is still non-trivial and requires intimate knowledge about the functionality of web APIs and services, their interfaces, parameter settings, data mappings, and so on. We aim to assist the mashup process and to turn it into an interactive co-creation process, in which one part of the solution comes from the developer and the other part from reusable composition knowledge that has proven successful in the past. We harvest composition knowledge from a repository of existing mashup models by mining a set of reusable composition patterns, which we then use to interactively provide composition recommendations to developers while they model their own mashup. Upon acceptance of a recommendation, the purposeful design of the respective pattern types allows us to automatically weave the chosen pattern into a partial mashup model, in practice performing a set of modeling actions on behalf of the developer. The experimental evaluation of our prototype implementation demonstrates that it is indeed possible to harvest meaningful, reusable knowledge from existing mashups, and that even complex recommendations can be efficiently queried and woven inside the client browser as well.
Towards personalized learning to rank for epidemic intelligence based on social media streams BIBAFull-Text 495-496
  Ernesto Diaz-Aviles; Avaré Stewart; Edward Velasco; Kerstin Denecke; Wolfgang Nejdl
In the presence of sudden outbreaks, how can social media streams be used to strengthen surveillance capabilities? In May 2011, Germany reported one of the largest described outbreaks of Enterohemorrhagic Escherichia coli (EHEC). By the end of June, 47 persons had died. After the detection of the outbreak, authorities investigating the cause and the impact in the population were interested in the analysis of micro-blog data related to the event. Since thousands of tweets related to this outbreak were produced every day, this task was overwhelming for experts participating in the investigation. In this work, we propose a Personalized Tweet Ranking algorithm for Epidemic Intelligence (PTR4EI) that provides users with a personalized, short list of tweets based on the user's context. PTR4EI is based on a learning to rank framework and exploits, as features, complementary context information extracted from the social hash-tagging behavior in Twitter. Our experimental evaluation on a dataset, collected in real-time during the EHEC outbreak, shows the superior ranking performance of PTR4EI. We believe our work can serve as a building block for an open early warning system based on Twitter, helping to realize the vision of Epidemic Intelligence for the Crowd, by the Crowd.
D2RQ/update: updating relational data via virtual RDF BIBAFull-Text 497-498
  Vadim Eisenberg; Yaron Kanza
D2RQ is a popular RDB-to-RDF mapping platform that supports mapping relational databases to RDF and posing SPARQL queries to these relational databases. However, D2RQ merely provides a read-only RDF view on relational databases. Thus, we introduce D2RQ/Update -- an extension of D2RQ to enable executing SPARQL/Update statements on the mapped data, and to facilitate the creation of a read-write Semantic Web.
HeterRank: addressing information heterogeneity for personalized recommendation in social tagging systems BIBAFull-Text 499-500
  Wei Feng; Jianyong Wang
A social tagging system provides users with an effective way to collaboratively annotate and organize items with their own tags. A social tagging system contains heterogeneous information like users' tagging behaviors, social networks, tag semantics and item profiles. All the heterogeneous information helps alleviate the cold start problem due to data sparsity. In this paper, we model a social tagging system as a multi-type graph and propose a graph-based ranking algorithm called HeterRank for tag recommendation. Experimental results on three publicly available datasets, i.e., CiteULike, Last.fm and Delicious, prove the effectiveness of HeterRank for tag recommendation with heterogeneous information.
Domain adaptive answer extraction for discussion boards BIBAFull-Text 501-502
  Ankur Gandhe; Dinesh Raghu; Rose Catherine
Answer extraction from discussion boards is an extensively studied problem. Most of the existing work is focused on supervised methods for extracting answers using similarity features and forum-specific features. Although this works well for the domain or forum data that it has been trained on, it is difficult to use the same models for a domain where the vocabulary is different and some forum specific features may not be available. In this poster, we report initial results of a domain adaptive answer extractor that performs the extraction in two steps: a) an answer recognizer identifies the sentences in a post which are likely to be answers, and b) a domain relevance module determines the domain significance of the identified answer. We use domain independent methodology that can be easily adapted to any given domain with minimum effort.
Towards multiple identity detection in social networks BIBAFull-Text 503-504
  Kahina Gani; Hakim Hacid; Ryan Skraba
In this paper we discuss work that intends to provide some insights into the hard problem of multiple identity detection. Based on the hypothesis that each person is unique and identifiable, whether by writing style or social behavior, we propose a framework for such detection that relies on machine learning models and a deep analysis of social interactions.
How shall we catch people's concerns in micro-blogging? BIBAFull-Text 505-506
  Heng Gao; Qiudan Li; Hongyun Bao; Shuangyong Song
In micro-blogging, people talk about their daily life and change minds freely; thus, by mining people's interests in micro-blogging, we can easily perceive the pulse of society. In this paper, we capture what people care about in their daily life by discovering meaningful communities based on a probabilistic factor model (PFM). The proposed solution identifies people's interests from their friendship and content information. Therefore, it reveals the behaviors of people in micro-blogging naturally. Experimental results verify the effectiveness of the proposed model and show people's social life vividly.
Link prediction via latent factor BlockModel BIBAFull-Text 507-508
  Sheng Gao; Ludovic Denoyer; Patrick Gallinari
In this paper we address the problem of link prediction in networked data, which appears in many applications such as social network analysis or recommender systems. Previous studies either consider latent-feature-based models but disregard local structure in the network, or focus exclusively on capturing the local structure of objects based on latent blockmodels without coupling it with the latent characteristics of objects. To combine the benefits of previous work, we propose a novel model that can incorporate the effects of latent features of objects and local structure in the network simultaneously. To achieve this, we model the relation graph as a function of both latent feature factors and latent cluster memberships of objects, to collectively discover globally predictive intrinsic properties of objects and capture latent block structure in the network to improve prediction performance. Extensive experiments on several real world datasets suggest that our proposed model outperforms the other state of the art approaches for link prediction.
Using toolbar data to understand Yahoo! Answers usage BIBAFull-Text 509-510
  Giovanni Gardelli; Ingmar Weber
We use Yahoo! Toolbar data to gain insights into why people use Q&A sites. We look at questions asked on Yahoo! Answers and analyze both the pre-question behavior of users as well as their general online behavior. Our results indicate that there is a one-dimensional spectrum of users ranging from "social users" to "informational users" and that web search and Q&A sites complement each other, rather than compete. Concerning the pre-question behavior, users who first issue a question-related query are more likely to issue informational questions, rather than conversational ones, and such questions are less likely to attract an answer. Finally, we only find weak evidence for topical congruence between a user's questions and his web queries.
SnoopyTagging: recommending contextualized tags to increase the quality and quantity of meta-information BIBAFull-Text 511-512
  Wolfgang Gassler; Eva Zangerle; Martin Bürgler; Günther Specht
Current mass-collaboration platforms use tags to annotate and categorize resources enabling effective search capabilities. However, as tags are freely chosen keywords, the resulting tag vocabulary is very heterogeneous. Another shortcoming of simple tags is that they do not allow for a specification of context to create meaningful metadata. In this paper we present the SnoopyTagging approach which supports the user in the process of creating contextualized tags while at the same time decreasing the heterogeneity of the tag vocabulary by facilitating intelligent self-learning recommendation algorithms.
Comparative evaluation of JavaScript frameworks BIBAFull-Text 513-514
  Andreas Gizas; Sotiris Christodoulou; Theodore Papatheodorou
For web programmers, it is important to choose the proper JavaScript framework that not only serves their current web project needs, but also provides code of high quality and good performance. The scope of this work is to provide a thorough quality and performance evaluation of the most popular JavaScript frameworks, taking into account well established software quality factors and performance tests. The major outcome is that we highlight the pros and cons of JavaScript frameworks in various areas of interest and indicate where the problematic points of their code lie, which probably need to be improved in the next versions.
Getting more RDF support from relational databases BIBAFull-Text 515-516
  François Goasdoué; Ioana Manolescu; Alexandra Roatis
We introduce the database fragment of RDF, which extends the popular Description Logic fragment, in particular with support for incomplete information. We then provide novel sound and complete saturation- and reformulation-based techniques for answering the Basic Graph Pattern queries of SPARQL in this fragment. Notably, we extend the state of the art on pushing RDF query processing within robust / efficient relational database management systems. Finally, we experimentally compare our query answering techniques using well-established datasets.
S2ORM: exploiting syntactic and semantic information for opinion retrieval BIBAFull-Text 517-518
  Liqiang Guo; Xiaojun Wan
Opinion retrieval is the task of finding documents that express an opinion about a given query. A key challenge in opinion retrieval is to capture the query-related opinion score of a document. Existing methods rely mainly on the proximity information between the opinion terms and the query terms to address the key challenge. In this study, we propose to incorporate the syntactic and semantic information of terms into a probabilistic language model in order to capture the query-related opinion score more accurately.
All our messages are belong to us: usable confidentiality in social networks BIBAFull-Text 519-520
  Marian Harbach; Sascha Fahl; Thomas Muders; Matthew Smith
Current online social networking (OSN) sites pose severe risks to their users' privacy. Facebook in particular is capturing more and more of a user's past activities, sometimes starting from the day of birth. Instead of transiently passing on information between friends, a user's data is stored persistently and therefore subject to the risk of undesired disclosure. Traditionally, a regular user of a social network has little awareness of her privacy needs in the Web or is not ready to invest a considerable effort in securing her online activities. Furthermore, the centralised nature of proprietary social networking platforms simply does not cater for end-to-end privacy protection mechanisms. In this paper, we present a non-disruptive and lightweight integration of a confidentiality mechanism into OSNs. Additionally, direct integration of visual security indicators into the OSN UI raises users' awareness of (un)protected content and thus of their own privacy. We present a fully-working prototype for Facebook and an initial usability study, showing that, on average, untrained users can be ready to use the service in three minutes.
Populating personal linked data caches using context models BIBAFull-Text 521-522
  Olaf Hartig; Tom Heath
The emergence of a Web of Data enables new forms of application that require expressive query access, for which mature, Web-scale information retrieval techniques may not be suited. Rather than attempting to deliver expressive query capabilities at Web-scale, this paper proposes the use of smaller, pre-populated data caches whose contents are personalized to the needs of an individual user. We present an approach to a priori population of such caches with Linked Data harvested from the Web, seeded by a simple context model for each user, which is progressively enriched by executing a series of enrichment rules over Linked Data from the Web. Such caches can act as personal data stores supporting a range of different applications. A comprehensive user evaluation demonstrates that our approach can accurately predict the relevance of attributes added to the context model and the execution probability of queries based on these attributes, thereby optimizing the cache population process.
Probabilistic critical path identification for cost-effective monitoring of service-based web applications BIBAFull-Text 523-524
  Qiang He; Jun Han; Yun Yang; Jean-Guy Schneider; Hai Jin; Steve Versteeg
The critical path of a composite Web application operating in volatile environments, i.e., the execution path in the service composition with the maximum execution time, should be prioritised in cost-effective monitoring as it determines the response time of the Web application. In volatile operating environments, the critical path of a Web application is probabilistic. As such, it is important to estimate the criticalities of the execution paths, i.e., the probabilities that they are critical, to decide which parts of the system to monitor. We propose a novel approach to the identification of Probabilistic Critical Path for Service-based Web Applications (PCP-SWA), which calculates the criticalities of different execution paths in the context of service composition. We evaluate PCP-SWA experimentally using an example Web application. Compared to random monitoring, PCP-SWA based monitoring is 55.67% more cost-effective on average.
A statistical approach to URL-based web page clustering BIBAFull-Text 525-526
  Inma Hernández; Carlos R. Rivero; David Ruiz; Rafael Corchuelo
Most web page classifiers use features from the page content, which means that a page has to be downloaded to be classified. We propose a technique to cluster web pages by means of their URLs exclusively. In contrast to other proposals, we analyze features that are outside the page; hence, we do not need to download a page to classify it. Also, it is non-supervised, requiring little intervention from the user. Furthermore, we do not need to crawl a site extensively to build a classifier for that site, but only a small subset of pages. We have performed an experiment over 21 highly visited websites to evaluate the performance of our classifier, obtaining good precision and recall results.
Frequent temporal social behavior search in information networks BIBAFull-Text 527-528
  Hsun-Ping Hsieh; Cheng-Te Li; Shou-De Lin
In current social networking services (SNS) such as Facebook, there are diverse kinds of interactions between entity types. One common activity of SNS users is to track and observe the representative social and temporal behaviors of other individuals. This inspires us to propose a new problem of Temporal Social Behavior Search (TSBS) from social interactions in an information network: given a structural query with associated temporal labels, how to find the subgraph instances satisfying the query structure and temporal requirements? In TSBS, a query can be (a) a topological structure, (b) the partially-assigned individuals on nodes, and/or (c) the temporal sequential labels on edges. The TSBS method consists of two parts: offline mining and online matching. The former mines the temporal subgraph patterns for retrieving representative structures that match the query. Then, based on the given query, we perform the online structural matching on the mined patterns and return the top-k resulting subgraphs. Experiments on academic datasets demonstrate the effectiveness of TSBS.
TripRec: recommending trip routes from large scale check-in data BIBAFull-Text 529-530
  Hsun-Ping Hsieh; Cheng-Te Li; Shou-De Lin
With location-based services, such as Foursquare and Gowalla, users can easily perform check-in actions anywhere and anytime. Such check-in data not only enables personal geospatial journeys but also serves as a fine-grained source for trip planning. In this work, we aim to collectively recommend trip routes by leveraging large-scale check-in data through mining the moving behaviors of users. A novel recommendation system, TripRec, is proposed to allow users to specify starting/end and must-go locations. It further provides the flexibility to satisfy a certain time constraint (i.e., the expected duration of the trip). By considering a sequence of check-in points as a route, we mine the frequent sequences with a ranking mechanism to achieve the goal. TripRec targets travelers who are unfamiliar with the objective area/city and have time constraints on the trip.
Social status and role analysis of Palin's email network BIBAFull-Text 531-532
  Xia Hu; Huan Liu
Email usage is pervasive among people from different backgrounds, and an email corpus can be an important data source to study intricate social structures. Social status and role analysis on a personal email network can help reveal hidden information. The availability of Sarah Palin's email corpus presents a great opportunity to study social statuses and social roles in an email network. However, the email corpus does not readily lend itself to social network analysis due to problems such as noisy email data, scale in size, and temporal constraints. In this paper, we report an initial investigation of social status and role analysis on Sarah Palin's email corpus. In particular, we conduct a preliminary study on Palin's social statuses and roles. To the best of our knowledge, this work is the first exploration of Sarah Palin's email corpus recently released by the state of Alaska.
The affects of task difficulty on medical searches BIBAFull-Text 533-534
  Anushia Inthiran; Saadat M. Alhashmi; Pervaiz K. Ahmed
In this paper, we analyze medical searching behavior performed by a typical medical searcher. We broadly classify a typical medical searcher as: non-medical professionals or medical professionals. We use behavioral signals to study how task difficulty affects medical searching behavior. Using simulated scenarios, we gathered data from an exploratory survey of 180 search sessions performed by 60 participants. Our research study provides a deep understanding of how task difficulty affects medical search behavior. Non-medical professionals and medical professionals demonstrate similar search behavior when searching on an easy task. Longer queries, more time and more incomplete search sessions are observed for an easy task. However, they demonstrate different results evaluation behavior based on task difficulty.
Leveraging interlingual classification to improve web search BIBAFull-Text 535-536
  Jagadeesh Jagarlamudi; Paul N. Bennett; Krysta M. Svore
In this paper we address the problem of improving accuracy of web search in a smaller, data-limited search market (search language) using behavioral data from a larger, data-rich market (assist language). Specifically, we use interlingual classification to infer the search language query's intent using the assist language click-through data. We use these improved estimates of query intent, along with the query intent based on the search language data, to compute features that encode the similarity between a search result (URL) and the query. These features are subsequently fed into the ranking model to improve the relevance ranking of the documents. Our experimental results on German and French show the effectiveness of using assist language behavioral data, especially when the search language queries have little click-through data.
Modeling click-through based word-pairs for web search BIBAFull-Text 537-538
  Jagadeesh Jagarlamudi; Jianfeng Gao
Statistical translation models and latent semantic analysis (LSA) are two effective approaches to exploit click-through data for web search ranking. This paper presents two document ranking models that combine both approaches by explicitly modeling word-pairs. The first model, called PairModel, is a monolingual ranking model based on word pairs that are derived from click-through data. It maps queries and documents into a concept space spanned by these word pairs. The second model, called Bilingual Paired Topic Model (BPTM), uses bilingual word pairs and jointly models a bilingual query-document collection. This model maps queries and documents in multiple languages into a lower dimensional semantic subspace. Experimental results on a web search task show that both models significantly outperform the state-of-the-art baseline models, and the best result is obtained by interpolating PairModel and BPTM.
Google image swirl: a large-scale content-based image visualization system BIBAFull-Text 539-540
  Yushi Jing; Henry Rowley; Jingbin Wang; David Tsai; Chuck Rosenberg; Michele Covell
Web image retrieval systems, such as Google or Bing image search, present search results as a relevance-ordered list. Although alternative browsing models (e.g. results as clusters or hierarchies) have been proposed in the past, it remains to be seen whether such models can be applied to large-scale image search. This work presents Google Image Swirl, a large-scale, publicly available, hierarchical image browsing system that automatically groups the search results based on visual and semantic similarity. This paper describes the methods used to build such a system and shares the findings from two years' worth of user feedback and usage statistics.
Identifying sentiments over N-gram BIBAFull-Text 541-542
  Noriaki Kawamae
Our proposal, identifying sentiment over N-gram (ISN), focuses on both word order and phrases, and on the interdependency between a specific rating and the corresponding sentiment in a text, to detect subjective information.
StormRider: harnessing "storm" for social networks BIBAFull-Text 543-544
  Vaibhav V. Khadilkar; Murat Kantarcioglu; Bhavani Thuraisingham
The focus of online social media providers today has shifted from "content generation" towards finding effective methodologies for "content storage, retrieval and analysis" in the presence of evolving networks. Towards this end, in this paper we present StormRider, a framework that uses existing cloud computing and semantic web technologies to provide application programmers with automated support for these tasks, thereby allowing a richer assortment of use cases to be implemented on the underlying evolving social networks.
Learning from positive and unlabeled amazon reviews: towards identifying trustworthy reviewers BIBAFull-Text 545-546
  Marios Kokkodis
On-line marketplaces have been growing in importance over the last few years. In such environments, reviews constitute the main reputation mechanism for the available products. Hence, presenting high quality reviews is crucial in achieving a high level of customer satisfaction. Towards this direction, in this work, we introduce a new dimension of review quality, the reviewer's "trustfulness". We assume that voluntary information provided by Amazon reviewers, regarding whether they are the actual buyers of the product, signals the reliability of a review. Based on this information, we characterize a reviewer as trustworthy (positive instance) or of unknown "trustfulness" (unlabeled instance). Then, we build models that exploit reviewers' profile information and on-line behavior to rank them according to the probability of being trustworthy. Our results are very promising, since they provide evidence that our predictive models separate positive from unlabeled instances with very high accuracies.
Treehugger or petrolhead?: identifying bias by comparing online news articles with political speeches BIBAFull-Text 547-548
  Ralf Krestel; Alex Wall; Wolfgang Nejdl
The Web is a very democratic medium of communication allowing everyone to express his or her opinion about any type of topic. This multitude of voices makes it more and more important to detect bias and help Internet users understand the background of information sources. Political bias of Web sites, articles, or blog posts is hard to identify straightaway. Manual content analysis conducted by experts is the standard way in political and social science to detect this bias. In this paper we present an automated approach relying on methods from information retrieval and corpus statistics to identify biased vocabulary use. As an example, we analyzed 15 years of parliamentary speeches of the German Bundestag and we investigated whether there is bias towards a political party in major national online newspapers and magazines. The results show that bias exists with respect to vocabulary use and it coincides with human judgement.
Towards optimizing the non-functional service matchmaking time BIBAFull-Text 549-550
  Kyriakos Kritikos; Dimitris Plexousakis
The Internet is moving fast to a new era where millions of services and things will be available. As there will be many functionally-equivalent services for a specific user task, the service non-functional aspect should be considered for filtering and choosing the appropriate services. The related approaches in service discovery mainly concentrate on exploiting constraint solving techniques for inferring if the user non-functional requirements are satisfied by the service non-functional capabilities. However, as the matchmaking time is proportional to the number of non-functional service descriptions, these approaches fail to fulfill the user request in a timely manner. To this end, two alternative techniques for improving the non-functional service matchmaking time have been developed. The first one is generic as it can handle non-functional service specifications containing n-ary constraints, while the second is only applicable to unary-constrained specifications. Both techniques were experimentally evaluated. The preliminary evaluation results show that the service matchmaking time is significantly improved without compromising matchmaking accuracy.
Measuring usefulness of context for context-aware ranking BIBAFull-Text 551-552
  Andrey Kustarev; Yury Ustinovsky; Pavel Serdyukov
Most major search engines develop different types of personalisation of search results. Personalisation includes deriving the user's long-term preferences, query disambiguation, etc. User sessions provide a very powerful tool commonly used for these problems. In this paper we focus on personalisation based on context-aware reranking. We implement a machine learning framework to approach this problem and study the importance of different types of features. We stress that features concerning temporal and context relatedness of queries, along with features relying on the user's actions, are the most important and play a crucial role for this type of personalisation.
TEM: a novel perspective to modeling content on microblogs BIBAFull-Text 553-554
  Himabindu Lakkaraju; Hyung-Il Ahn
In recent times, microblogging sites like Facebook and Twitter have gained a lot of popularity. Millions of users worldwide have been using these sites to post content that interests them and also to voice their opinions on several current events. In this paper, we present a novel non-parametric probabilistic model -- Temporally driven Theme Event Model (TEM) -- for analyzing the content on microblogs. We also describe an online inference procedure for this model that enables its usage on large scale data. Experimentation carried out on real world data extracted from Facebook and Twitter demonstrates the efficacy of the proposed approach.
Using proximity to predict activity in social networks BIBAFull-Text 555-556
  Kristina Lerman; Suradej Intagorn; Jeon-Hyung Kang; Rumi Ghosh
The structure of a social network contains information useful for predicting its evolution. We show that structural information also helps predict activity. People who are "close" in some sense in a social network are more likely to perform similar actions than more distant people. We use network proximity to capture the degree to which people are "close" to each other. In addition to standard proximity metrics used in the link prediction task, such as neighborhood overlap, we introduce new metrics that model different types of interactions that take place between people. We study this claim empirically using data about URL forwarding activity on the social media sites Digg and Twitter. We show that structural proximity of two users in the follower graph is related to similarity of their activity, i.e., how many URLs they both forward. We also show that given friends' activity, knowing their proximity to the user can help better predict which URLs the user will forward. We compare the performance of different proximity metrics on the activity prediction task and find that metrics that take into account the attention-limited nature of interactions in social media lead to substantially better predictions.
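The neighborhood overlap metric mentioned in this abstract can be illustrated with a minimal sketch. It assumes the Jaccard form of the metric (shared friends divided by all friends of either user) and a follower graph stored as a dictionary of sets; the function name and the toy graph are hypothetical, and the attention-limited metrics introduced in the paper are not reproduced here.

```python
def neighborhood_overlap(friends, u, v):
    """Jaccard overlap of the friend sets of users u and v.

    `friends` maps each user to the set of users they follow
    (an assumed representation, not taken from the paper).
    """
    friends_u = friends.get(u, set())
    friends_v = friends.get(v, set())
    union = friends_u | friends_v
    if not union:
        return 0.0  # neither user follows anyone
    return len(friends_u & friends_v) / len(union)


# Toy follower graph (hypothetical data).
toy_graph = {
    "alice": {"carol", "dave"},
    "bob": {"dave", "erin"},
}
overlap = neighborhood_overlap(toy_graph, "alice", "bob")  # one shared friend out of three
```

A higher score means the two users share more of their friends, which is the sense of being "close" that standard link-prediction metrics capture.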
Finding influential seed successors in social networks BIBAFull-Text 557-558
  Cheng-Te Li; Hsun-Ping Hsieh; Shou-De Lin; Man-Kwan Shan
In a dynamic social network, nodes can be removed from the network for various reasons, which consequently affects the behavior of the network. In this paper, we tackle the challenge of finding a successor node for each removed seed node to maintain the influence spread in the network. Given a social network and a set of seed nodes for influence maximization, which nodes are the best successors to take over the job of initial influence propagation if some seeds are removed from the network? To tackle this problem, we present and discuss five neighborhood-based selection heuristics: degree, degree discount, overlapping, community bridge, and community degree. Experiments on a DBLP co-authorship network show the effectiveness of the devised heuristics.
Influence propagation and maximization for heterogeneous social networks BIBAFull-Text 559-560
  Cheng-Te Li; Shou-De Lin; Man-Kwan Shan
Influence propagation and maximization is a well-studied problem in social network mining. However, most of the previous works focus only on homogeneous social networks where nodes and links are of a single type. This work aims at defining information propagation for heterogeneous social networks (containing multiple types of nodes and links). We propose to consider the individual behaviors of persons to model the influence propagation. Person nodes possess different influence probabilities to activate their friends according to their interaction behaviors. The proposed model consists of two stages. First, based on the heterogeneous social network, we create a human-based influence graph where nodes are of human-type and links carry weights that represent how special the target node is to the source node. Second, we propose two entropy-based heuristics to identify the disseminators in the influence graph to maximize the influence spread. Experiments show promising results for the proposed method.
Dynamic selection of activation targets to boost the influence spread in social networks BIBAFull-Text 561-562
  Cheng-Te Li; Man-Kwan Shan; Shou-De Lin
This paper aims to combine viral marketing with the idea of direct selling for influence maximization in a social network. In direct selling, producers can sell the products directly to the consumers without having to go through a cascade of wholesalers. Through direct selling, it is possible to sell the products in a more efficient and economic manner. Motivated by this idea, we propose a target-selecting independent cascade (TIC) model, in which, during influence propagation, each active node can give up attempting to influence some neighboring nodes, named victims, who could be hard to affect, and instead try to activate some of its friends of friends, termed destinations, who could have higher potential to increase the influence spread. The next question is thus: given a social network and a set of seeds for influence propagation under the TIC model, how should targets (i.e., victims and destinations) be selected for the activation attempts during propagation to boost the influence spread? We propose and evaluate three heuristics for the target selection. Experiments show that selecting targets based on the influence probability between nodes yields the highest boost of influence spread.
Regional subgraph discovery in social networks BIBAFull-Text 563-564
  Cheng-Te Li; Man-Kwan Shan; Shou-De Lin
This paper solves a region-based subgraph discovery problem. We are given a social network and some sample nodes which are supposed to belong to a specific region, and the goal is to obtain a subgraph that contains the sampled nodes together with other nodes in the same region. Such regional subgraph discovery can benefit region-based applications, including scholar search, friend suggestion, and viral marketing. To deal with this problem, we assume there is a hidden backbone connecting the query nodes directly or indirectly in their region. The idea is that individuals belonging to the same region tend to share similar interests and cultures. By modeling this fact in the edge weights, we search the graph to extract the regional backbone with respect to the query nodes. Then we can expand the backbone to derive the regional network. Experiments on a DBLP co-authorship network show the proposed method can effectively discover the regional subgraph with high precision scores.
GPU-based minwise hashing: GPU-based minwise hashing BIBAFull-Text 565-566
  Ping Li; Anshumali Shrivastava; Christian A. Konig
Minwise hashing is a standard technique for efficient set similarity estimation in the context of search. The recent work of b-bit minwise hashing provided a substantial improvement by storing only the lowest b bits of each hashed value. Both minwise hashing and b-bit minwise hashing require an expensive preprocessing step for applying k (e.g., k=500) permutations on the entire data in order to compute k minimal values as the hashed data. In this paper, we developed a parallelization scheme using GPUs, which reduced the processing time by a factor of 20-80. Reducing the preprocessing time is highly beneficial in practice, for example, for duplicate web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers (when the test data are not preprocessed).
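The preprocessing step described in this abstract can be sketched on the CPU as follows. This is a minimal illustration, not the paper's GPU scheme: it assumes items are non-negative integer ids and substitutes k random linear hash functions for explicit permutations, a common practical stand-in. All function names are hypothetical.

```python
import random

PRIME = (1 << 61) - 1  # a Mersenne prime larger than any item id used here


def make_hash_params(k, seed=0):
    """Draw k (a, b) pairs, each defining the hash h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]


def minhash_signature(items, params):
    """Keep the minimum hashed value of the set under each of the k hashes."""
    return [min((a * x + b) % PRIME for x in items) for a, b in params]


def b_bit_signature(signature, b):
    """Store only the lowest b bits of each minimum, as in b-bit minwise hashing."""
    mask = (1 << b) - 1
    return [value & mask for value in signature]


def estimate_resemblance(sig_a, sig_b):
    """Fraction of positions where two full signatures agree."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)
```

The inner `min` over every item for each of the k hashes is the expensive part, and each (set, hash) pair is independent of the others, which is what makes the step a natural candidate for parallelization. Note that `estimate_resemblance` applies to full signatures only; comparing b-bit signatures directly overcounts matches and needs a correction that is not shown here.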
CloudSpeller: query spelling correction by using a unified hidden Markov model with web-scale resources BIBAFull-Text 567-568
  Yanen Li; Huizhong Duan; ChengXiang Zhai
Query spelling correction is an important component of modern search engines that can help users to express an information need more accurately and thus improve search quality. In this work we proposed and implemented an end-to-end spelling correction system, namely CloudSpeller. The CloudSpeller system uses a Hidden Markov Model to effectively model major types of spelling errors in a unified framework, in which we integrate a large-scale lexicon constructed using Wikipedia, an error model trained from high confidence correction pairs, and the Microsoft Web N-gram service. Our system achieves excellent performance on two search query spelling correction datasets, reaching 0.960 and 0.937 F1 scores on the TREC dataset and the MSN dataset respectively.
Sentiment classification via integrating multiple feature presentations BIBAFull-Text 569-570
  Yuming Lin; Jingwei Zhang; Xiaoling Wang; Aoying Zhou
In the bag of words framework, documents are often converted into vectors according to predefined features together with weighting mechanisms. Since each feature presentation has its own characteristics, it is difficult to determine which one should be chosen for a specific domain, especially for users who are not familiar with the domain. This paper explores the integration of various feature presentations to improve the classification accuracy. A general two-phase framework is proposed. In the first phase, we train multiple base classifiers with various vector spaces and use each of these classifiers to predict the class of the test samples. In the second phase, the previously predicted results are integrated into the ultimate class via stacking with SVM. The experimental results demonstrate the effectiveness of our method.
Tuning parameters of the expected reciprocal rank BIBAFull-Text 571-572
  Yury Logachev; Lidia Grauer; Pavel Serdyukov
There are several popular IR metrics based on an underlying user model. Most of them are parameterized. Usually, the parameters of these metrics are chosen on the basis of general considerations and are not validated by experiments with real users. In particular, the parameters of the Expected Reciprocal Rank measure are the normalized parameters of the DCG metric, and the latter are chosen in an ad-hoc manner. We suggest two approaches for adjusting the parameters of the ERR model by analyzing real users' behaviour: one based on a controlled experiment and another relying on search log analysis. We show that our approaches generate parameters that are largely different from the commonly used parameters of the ERR model.
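To make the role of these parameters concrete, the sketch below computes ERR under the cascade user model, with the commonly used default that maps a relevance grade g to a satisfaction probability (2^g - 1) / 2^g_max, i.e. the normalized DCG gain. The function names and the five-level grade scale are assumptions for illustration; the tuned parameter values from the paper are not given in the abstract and are not reproduced here.

```python
def default_satisfaction(max_grade=4):
    """Commonly used mapping from grade g to the probability that the user is satisfied."""
    return [(2 ** g - 1) / 2 ** max_grade for g in range(max_grade + 1)]


def expected_reciprocal_rank(grades, satisfaction=None):
    """ERR of a ranked list of relevance grades.

    `satisfaction[g]` is the probability that a document with grade g
    satisfies the user; pass a custom list to plug in tuned parameters.
    """
    if satisfaction is None:
        satisfaction = default_satisfaction()
    score = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for rank, grade in enumerate(grades, start=1):
        p_stop = satisfaction[grade]
        score += p_reach * p_stop / rank
        p_reach *= 1.0 - p_stop
    return score
```

Replacing the default list with probabilities estimated from a controlled experiment or from search logs is the kind of adjustment the abstract describes; the metric's formula itself stays the same.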
Conversations reconstruction in the social web BIBAFull-Text 573-574
  Juan Antonio Lossio Ventura; Hakim Hacid; Arnaud Ansiaux; Maria Laura Maag
We propose a socio-semantic approach for building conversations from social interactions following three steps: (i) content linkage, (ii) participants (users) linkage, and (iii) temporal linkage. Preliminary evaluations on a Twitter dataset show promising and interesting results.
Secure querying of recursive XML views: a standard xpath-based technique BIBAFull-Text 575-576
  Houari Mahfoud; Abdessamad Imine
Most state-of-the-art approaches for securing XML documents allow users to access data only through authorized views defined by annotating an XML grammar (e.g. DTD) with a collection of XPath expressions. To prevent improper disclosure of confidential information, user queries posed on these views need to be rewritten into equivalent queries on the underlying documents, which enables us to avoid the overhead of view materialization and maintenance. A major concern here is that XPath query rewriting for recursive XML views is still an open problem. To overcome this problem, some authors have proposed rewriting approaches based on the non-standard language "Regular XPath", which is more expressive than XPath and makes rewriting possible under recursion. However, query rewriting under Regular XPath can be of exponential size as it relies on an automaton model. Most importantly, Regular XPath remains a theoretical achievement. Indeed, it is not commonly used in practice as translation and evaluation tools are not available. In this work, we show that query rewriting is always possible for recursive XML views using only the expressive power of the standard XPath. We propose a general approach for securely querying XML data under arbitrary security views (recursive or not) and for a significant fragment of XPath. We provide a linear rewriting algorithm that is efficient and scales well.
GoThere: travel suggestions using geotagged photos BIBAFull-Text 577-578
  Abdul Majid; Ling Chen; Gencai Chen; Hamid Turab Mirza; Ibrar Hussain
We propose a context and preference aware travel guide that suggests significant tourist destinations to users based on their preferences and current surrounding context using contextualized user-generated contents from the social media repository, i.e., Flickr.
Ad-hoc ride sharing application using continuous SPARQL queries BIBAFull-Text 579-580
  Debnath Mukherjee; Snehasis Banerjee; Prateep Misra
In the existing ride sharing scenario, the ride taker has to cope with uncertainties since the ride giver may be delayed or may not show up due to some exigencies. A solution to this problem is discussed in this paper. The solution framework is based on gathering information from multiple streams such as traffic status on the ride giver's routes and the ride giver's GPS coordinates. Also, it maintains a list of alternative ride givers so as to almost guarantee a ride for the ride taker. This solution uses a SPARQL-based continuous query framework that is capable of sensing fast-changing real-time situations. It also has reasoning capabilities for handling the ride taker's preferences. The paper introduces the concept of user-managed windows that is shown to be required for this solution. Finally, we show that the performance of the application is enhanced by designing the application with short incremental queries.
Sparse linear methods with side information for Top-N recommendations BIBAFull-Text 581-582
  Xia Ning; George Karypis
This paper focuses on developing effective algorithms that utilize side information for top-N recommender systems. A set of Sparse Linear Methods with Side information (SSLIM) is proposed that utilizes a regularized optimization process to learn a sparse item-to-item coefficient matrix based on historical user-item purchase profiles and side information associated with the items. This coefficient matrix is used within an item-based recommendation framework to generate a size-N ranked list of items for a user. Our experimental results demonstrate that SSLIM outperforms other methods in effectively utilizing side information and achieving performance improvement.
Sentiment analysis amidst ambiguities in YouTube comments on Yoruba language (Nollywood) movies BIBAFull-Text 583-584
  Sylvester Olubolu Orimaye; Saadat M. Alhashmi; Siew Eu-gene
Nollywood is the second largest movie industry in the world in terms of annual movie production. A dominant number of the movies are in the Yoruba language, spoken by over 20 million people across the globe. The number of Yoruba language movies uploaded to YouTube and their corresponding comments is growing exponentially. However, YouTube comments made by native speakers on Yoruba movies combine English language, Yoruba language, and other commonly used "pidgin" Yoruba language words. Since Yoruba is still a resource-constrained language, existing sentiment or subjectivity analysis algorithms perform poorly on YouTube comments made on Yoruba language movies. This is because of the constrained language ambiguities. In this work, we present an automatic sentiment analysis algorithm for YouTube comments on Yoruba language movies. The algorithm uses the SentiWordNet thesaurus and a lexicon of commonly used Yoruba language sentiment words and phrases. In terms of precision-recall, the algorithm outperforms a state-of-the-art sentiment analysis technique by up to 20%.
C4PS: colors for privacy settings BIBAFull-Text 585-586
  Thomas Paul; Martin Stopczynski; Daniel Puscher; Melanie Volkamer; Thorsten Strufe
The ever increasing popularity of Facebook and other Online Social Networks has left a wealth of personal and private data on the web, aggregated and readily accessible for broad and automatic retrieval. Protection from both undesired recipients and harvesting by crawlers is implemented by access control, manually configured by the user and owner of the data. Several studies demonstrate that default settings cause an unnoticed over-sharing and that users have trouble understanding and configuring adequate privacy settings. We developed an improved interface for privacy settings in Facebook by mainly applying color coding for different groups, providing easy access to the privacy settings, and applying the principle of common practices. Using a lab study, we show that the new approach increases the usability significantly.
Extracting advertising keywords from URL strings BIBAFull-Text 587-588
  Santosh Raju; Raghavendra Udupa
Extracting advertising keywords from web-pages is important in keyword-based online advertising. Previous works have attempted to extract advertising keywords from the whole content of a web-page. However, in some scenarios, it is necessary to extract keywords from just the URL string itself. In this work, we propose an algorithm for extracting advertising keywords from the URL string alone. Our algorithm has applications in contextual and paid search advertising. We evaluate the effectiveness of our algorithm on publisher URLs and show that it produces very good quality keywords that are comparable with keywords produced by page based extractors.
Instrumenting a logic programming language to gather provenance from an information extraction application BIBAFull-Text 589-590
  Christine F. Reilly; Yueh-Hsuan Chiang; Jeffrey F. Naughton
Information extraction (IE) programs for the web consume and produce a lot of data. In order to better understand the program output, the developer and user often desire to know the details of how the output was created. Provenance can be used to learn about the creation of the output. We collect fine-grained provenance by leveraging ongoing work in the IE community to write IE programs in a logic programming language. The logic programming language exposes the semantics of the program, allowing us to gather fine-grained provenance during program execution. We discuss a case study using a web-based community information management system, then present results regarding the performance of queries over the provenance data gathered by our logic program interpreter. Our findings show that it is possible to gather useful fine-grained provenance during the execution of a logic based web information extraction program. Additionally, queries over this provenance information can be performed in a reasonable amount of time.
Lexical quality as a proxy for web text understandability BIBAFull-Text 591-592
  Luz Rello; Ricardo Baeza-Yates
We show that a recently introduced lexical quality measure is also valid to measure textual Web accessibility. Our measure estimates the lexical quality of a site based on the occurrence in English Web pages of a large set of words with errors. We first compute the correlation of our measure with Web popularity measures to show that it gives independent information. Second, we carry out a user study using eye tracking to prove that the degree of lexical quality of a text is related to the degree of understandability of a text, one of the factors behind Web accessibility.
Latent contextual indexing of annotated documents BIBAFull-Text 593-594
  Christian Sengstock; Michael Gertz
In this paper we propose a simple and flexible framework to index context-annotated documents, e.g., documents with timestamps or georeferences, by contextual topics. A contextual topic is a distribution over document features with a particular meaning in the context domain, such as a repetitive event or a geographic phenomenon. Such a framework supports document clustering, labeling, and search, with respect to contextual knowledge contained in the document collection. To realize the framework, we introduce an approach to project documents into a context-feature space. Then, dimensionality reduction is used to extract contextual topics in this context-feature space. The topics can then be projected back onto the documents. We demonstrate the utility of our approach with a case study on georeferenced Wikipedia articles.
APOLLO: a general framework for populating ontology with named entities via random walks on graphs BIBAFull-Text 595-596
  Wei Shen; Jianyong Wang; Ping Luo; Min Wang
Automatically populating ontology with named entities extracted from unstructured text has become a key issue for the Semantic Web. This issue naturally consists of two subtasks: (1) for the entity mention whose mapping entity does not exist in the ontology, attach it to the right category in the ontology (i.e., fine-grained named entity classification), and (2) for the entity mention whose mapping entity is contained in the ontology, link it with its mapping real world entity in the ontology (i.e., entity linking). Previous studies only focus on one of the two subtasks. This paper proposes APOLLO, a general weakly supervised frAmework for POpuLating ontoLOgy with named entities. APOLLO leverages the rich semantic knowledge embedded in Wikipedia to resolve this task via random walks on graphs. An experimental study has been conducted to show the effectiveness of APOLLO.
Multiple spreaders affect the indirect influence on Twitter BIBAFull-Text 597-598
  Xin Shuai; Ying Ding; Jerome Busemeyer
Most studies on social influence have focused on direct influence, while another interesting question can be raised as whether indirect influence exists between two users who're not directly connected in the network and what affects such influence. In addition, the theory of complex contagion tells us that more spreaders will enhance the indirect influence between two users. Our observation of intensity of indirect influence, propagated by n parallel spreaders and quantified by retweeting probability on Twitter, shows that complex contagion is validated globally but is violated locally. In other words, the retweeting probability increases non-monotonically with some local drops.
Entity based translation language model BIBAFull-Text 599-600
  Amit Singh
Bridging the lexical gap between the user's question and the question-answer pairs in Q&A archives has been a major challenge for Q&A retrieval. State-of-the-art approaches address this issue by implicitly expanding the queries with additional words using statistical translation models. In this work we extend the lexical word based translation model to incorporate semantic concepts. We explore strategies to learn the translation probabilities between words and the concepts using the Q&A archives and Wikipedia. Experiments conducted on large-scale real data from Yahoo! Answers show that the proposed techniques are promising and need further investigation.
Enabling accent resilient speech based information retrieval BIBAFull-Text 601-602
  Koushik Sinha; Geetha Manjunath; Raveesh R. Sharma; Viswanath Gangavaram; Pooja A; Deepak R. Murugaian
Voice interfaces to browsers and mobile applications are becoming popular as typing with touch screens is cumbersome. The main issue of practical speech based interfaces is how to overcome speech recognition errors. This problem is more severe when the users are non-native speakers of English due to differences in pronunciations. In this paper, we describe a novel, intelligent speech interface design approach for IR tasks that is significantly robust to accent variations. Our solution uses phonemic similarity based word spreading and semantic information based filtering to boost the accuracy of any ASR. We evaluated our solution with Google Voice as the ASR for a web question-answering system developed in-house and the results are very encouraging.
Dynamical information retrieval modelling: a portfolio-armed bandit machine approach BIBAFull-Text 603-604
  Marc Sloan; Jun Wang
The dynamic nature of document relevance is largely ignored by traditional Information Retrieval (IR) models, which assume that scores (relevance) for documents given an information need are static. In this paper, we formulate a general Dynamical Information Retrieval problem, where we consider retrieval as a stochastic, controllable process. The ranking action continuously controls the retrieval system's dynamics and an optimal ranking policy is found that maximizes the overall users' satisfaction during each period. Through deriving the posterior probability of the documents' evolving relevancy from user clicks, we can provide a plug-in framework for incorporating a number of click models, which can be combined with Multi-Armed Bandit theory and Portfolio Theory of IR to create a dynamic ranking rule that takes rank bias and click dependency into account. We verify the versatility of our algorithms in a number of experiments and demonstrate significant performance gains over strong baselines.
Detecting dynamic association among Twitter topics BIBAFull-Text 605-606
  Shuangyong Song; Qiudan Li; Hongyun Bao
Over the last few years, Twitter has increasingly become an important source of up-to-date topics about what is happening in the world. In this paper, we propose a dynamic topic association detection model to discover relations between Twitter topics, by which users can gain insights into richer information about topics of interest. The proposed model utilizes a time constrained method to extract event-based spatio-temporal topic association, and constructs a dynamic temporal map to represent the obtained result. Experimental results show the improvement of the proposed model compared to the static spatio-temporal method and the co-occurrence method.
Using community information to improve the precision of link prediction methods BIBAFull-Text 607-608
  Sucheta Soundarajan; John Hopcroft
Because network data is often incomplete, researchers consider the link prediction problem, which asks which non-existent edges in an incomplete network are most likely to exist in the complete network. Classical approaches compute the 'similarity' of two nodes, and conclude that highly similar nodes are most likely to be connected in the complete network. Here, we consider several such similarity-based measures, but supplement the similarity calculations with community information. We show that for many networks, the inclusion of community information improves the accuracy of similarity-based link prediction methods.
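The idea of supplementing a similarity score with community information can be illustrated with a minimal sketch. The common-neighbors score is a standard similarity measure; the community bonus shown here (one extra point for each shared neighbor that belongs to the same community as both endpoints) is a hypothetical weighting chosen for illustration, not necessarily the scheme used in the paper. The names `adj` and `community` are assumptions.

```python
def common_neighbors(adj, u, v):
    """Classical similarity: number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])


def community_common_neighbors(adj, community, u, v):
    """Common-neighbors score plus a bonus for same-community shared neighbors.

    The bonus rule is an illustrative assumption: a shared neighbor counts
    twice when u, v, and that neighbor all carry the same community label.
    """
    shared = adj[u] & adj[v]
    bonus = sum(1 for w in shared if community[w] == community[u] == community[v])
    return len(shared) + bonus


# Toy undirected graph (made-up data): adjacency sets and community labels.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b"},
}
community = {"a": 0, "b": 0, "c": 0, "d": 1}

plain_score = common_neighbors(adj, "a", "b")
community_score = community_common_neighbors(adj, community, "a", "b")
```

Candidate non-edges would then be ranked by score, with the highest-scoring pairs predicted as the most likely missing links.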
Open and decentralized platform for visualizing web mash-ups in augmented and mirror worlds BIBAFull-Text 609-610
  Vlad Stirbu; David Murphy; Yu You
Augmented reality applications are gaining popularity due to increased capabilities of modern mobile devices. However, existing applications are tightly integrated with backend services that expose content using proprietary interfaces. We demonstrate an architecture that allows visualization of web content in augmented and mirror world applications, based on open web protocols and formats. We describe two clients, one for creating virtual artifacts, web resources that bind together web content with location and a 3D model, and one that visualizes the virtual artifacts in the mirror world.
Actualization of query suggestions using query logs BIBAFull-Text 611-612
  Alisa Strizhevskaya; Alexey Baytin; Irina Galinskaya; Pavel Serdyukov
In this work we are studying actualization techniques for building an up-to-date query suggestions model using query logs. The performance of the proposed actualization algorithms was estimated by real query flow of the Yandex search engine.
Query spelling correction using multi-task learning BIBAFull-Text 613-614
  Xu Sun; Anshumali Shrivastava; Ping Li
This paper explores the use of online multi-task learning for search query spelling correction, by effectively transferring information from different and biased training datasets for improving spelling correction across datasets. Experiments were conducted on three query spelling correction datasets, including the well-known TREC benchmark data. Our experimental results demonstrate that the proposed method considerably outperforms existing baseline systems in terms of accuracy. Importantly, the proposed method is about one order of magnitude faster than baseline systems in terms of training speed. In contrast to existing methods, which typically require many training passes (e.g., more than 50), our algorithm can very closely approach the empirical optimum in around five passes.
Incorporating seasonal time series analysis with search behavior information in sales forecasting BIBAFull-Text 615-616
  Yuchen Tian; Yiqun Liu; Danqing Xu; Ting Yao; Min Zhang; Shaoping Ma
We consider the problem of predicting monthly auto sales in mainland China. First, we design an algorithm using click-through and query reformulation information to cluster related queries and count their frequencies on a monthly basis. By introducing an Exponentially Weighted Moving Averages (EWMA) model, we measure the seasonal impact on the sales trend. The two features are combined using linear regression. The experiment shows that our model is effective with high accuracy and outperforms conventional forecasting models.
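As a rough illustration of the EWMA component only, the sketch below applies the standard recurrence s_t = alpha * x_t + (1 - alpha) * s_{t-1} to a series. The smoothing factor `alpha`, the function name, and the sample figures are assumptions made for this example; the abstract does not state how the authors parameterize the model.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average of a numeric sequence.

    The first smoothed value is the first observation; each later value
    mixes the new observation with the previous smoothed value.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Made-up monthly figures, for illustration only.
monthly_sales = [100, 120, 90, 110]
trend = ewma(monthly_sales, alpha=0.5)
```

A larger `alpha` makes the smoothed series follow recent months more closely, while a smaller one emphasizes the longer-run trend.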
Photo-TaPE: user privacy preferences in photo tagging BIBAFull-Text 617-618
  Vincent Toubiana; Vincent Verdot; Benoit Christophe; Mathieu Boussard
Although they are used to exposing pictures on the Web, users may not want to have a link between their identity and pictures without being able to modify them or control who accesses them. Photo tagging -- and more broadly face-recognition algorithms -- often escapes the users' control and creates links between private situations and their public profile. To address this issue, we designed a geo-location aided system to let users declare their tagging preferences directly when the picture is taken. We present Photo-Tagging Preference Enforcement (Photo-TaPE), a system enforcing users' tagging preferences without revealing their identity. By improving face-recognition efficiency, Photo-TaPE can guarantee the users' tagging preferences in 67% of the cases and significantly reduces the processing time of face-recognition algorithms.
Understanding human movement semantics: a point of interest based approach BIBAFull-Text 619-620
  Ionut Trestian; Kévin Huguenin; Ling Su; Aleksandar Kuzmanovic
The recent availability of human mobility traces has driven a new wave of research on human movement with straightforward applications in wireless/cellular network. In this paper we revisit the human mobility problem with new assumptions. We believe that human movement is not independent of the surrounding locations, i.e. the points of interest that they visit; most of the time people travel with specific goals in mind, visit specific points of interest, and frequently revisit favorite places. Using GPS mobility traces of a large number of users located across two distinct geographical locations we study the correlation between people's trajectories and the differently spread points of interest nearby.
Scalable multi stage clustering of tagged micro-messages BIBAFull-Text 621-622
  Oren Tsur; Adi Littman; Ari Rappoport
The growing popularity of microblogging, backed by services like Twitter, Facebook, Google+ and LinkedIn, raises the challenge of clustering short and extremely sparse documents. In this work we propose SMSC -- a scalable, accurate and efficient multi stage clustering algorithm. Our algorithm leverages users' practice of adding tags to some messages by bootstrapping over virtual non sparse documents. We experiment on a large corpus of tweets from Twitter, and evaluate results against a gold-standard classification validated by seven clustering evaluation measures (information theoretic, paired and greedy). Results show that the algorithm presented is both accurate and efficient, significantly outperforming other algorithms. Under reasonable practical assumptions, our algorithm scales up sublinearly in time.
Seeing the best and worst of everything on the web with a two-level, feature-rich affect lexicon BIBAFull-Text 623-624
  Tony Veale
Affect lexica are useful for sentiment analysis because they map words (or senses) onto sentiment ratings. However, few lexica explain their ratings, or provide sufficient feature richness to allow a selective "spin" to be placed on a word in context. Since an affect lexicon aims to capture the affect of a word or sense in its most stereotypical usage, it should be grounded in explicit stereotype representations of each word's most salient properties and behaviors. We show here how to acquire a large stereotype lexicon from Web content, and further show how to determine sentiment ratings for each entry in the lexicon, both at the level of properties and behaviors and at the level of stereotypes. Finally, we show how the properties of a stereotype can be segregated on demand, to place a positive or negative spin on a word in context.
Unified classification model for geotagging websites BIBAFull-Text 625-626
  Alexey Volkov; Pavel Serdyukov
The paper presents a novel approach to finding regional scopes (geotagging) of websites. It relies on a single binary classification model per region type to perform the multi-class classification and uses a variety of features of different nature that have not been yet used together for machine-learning based regional classification of websites. The evaluation demonstrates the advantage of our "one model per region type" method versus the traditional "one model per region" approach.
Selling futures online advertising slots via option contracts BIBAFull-Text 627-628
  Jun Wang; Bowei Chen
Many online advertising slots are sold through bidding mechanisms by publishers and search engines. Highly affected by the dual force of supply and demand, the prices of advertising slots vary significantly over time. This then influences the businesses whose major revenues are driven by online advertising, particularly for publishers and search engines. To address the problem, we propose to sell the future advertising slots via option contracts (also called ad options). The ad option can give its buyer the right to buy the future advertising slots at a prefixed price. The pricing model of ad options is developed in order to reduce the volatility of the income of publishers or search engines. Our experimental results confirm the validity of ad options and the embedded risk management mechanisms.
Model news relatedness through user comments BIBAFull-Text 629-630
  Xuanhui Wang; Jian Bian; Yi Chang; Belle Tseng
Most previous work on news relatedness focuses on news article texts. In this paper, we study the benefit of user-generated comments on modeling news relatedness. Comments contain rich text information which is provided by commenters and rated by readers with thumb-up or thumb-down, but the quality of individual comments varies widely. We compare different ways of capturing relatedness by leveraging both text and user interaction information in comments. Our evaluation based on an editorial data set demonstrates that the text information in comments is very effective for modeling relatedness, while community rating is quite predictive of the comment quality.
A data-driven sketch of Wikipedia editors BIBAFull-Text 631-632
  Robert West; Ingmar Weber; Carlos Castillo
Who edits Wikipedia? We attempt to shed light on this question by using aggregated log data from Yahoo!'s browser toolbar in order to analyze Wikipedians' editing behavior in the context of their online lives beyond Wikipedia. We broadly characterize editors by investigating how their online behavior differs from that of other users; e.g., we find that Wikipedia editors search more, read more news, play more games, and, perhaps surprisingly, are more immersed in pop culture. Then we inspect how editors' general interests relate to the articles to which they contribute; e.g., we confirm the intuition that editors show more expertise in their active domains than average users. Our results are relevant as they illuminate novel aspects of what has become many Web users' prevalent source of information and can help in recruiting new editors.
A framework to represent and mine knowledge evolution from Wikipedia revisions BIBAFull-Text 633-634
  Xian Wu; Wei Fan; Meilun Sheng; Li Zhang; Xiaoxiao Shi; Zhong Su; Yong Yu
State-of-the-art knowledge representation in the semantic web employs a triple format (subject-relation-object). The limitation is that it can only represent static information, but cannot easily encode revisions of the semantic web and knowledge evolution. In reality, knowledge does not stay still but evolves over time. In this paper, we first introduce the concept of "quintuple representation" by adding two new fields, state and time, where state has two values, either in or out, to denote that the referred knowledge takes effect or expires at the given time. We then discuss a two-step statistical framework to mine knowledge evolution into the proposed quintuple representation. Used properly, the extracted quintuples can not only reveal the history of knowledge changes but also detect expired information. We evaluate the proposed framework on Wikipedia revisions, as well as on common web pages that are currently not in semantic web format.
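One way to picture the quintuple representation is as a triple extended with a state flag and a timestamp. The sketch below is only an illustration under assumptions: the class name, field names, integer year timestamps, and the `valid_at` helper are hypothetical, and the example entities are made up. It shows how "in" and "out" records could be replayed to find which triples are in effect at a given time.

```python
from typing import NamedTuple


class Quintuple(NamedTuple):
    subject: str
    relation: str
    obj: str
    state: str  # "in" = takes effect, "out" = expires
    time: int   # assumed here to be a year; any ordered timestamp would do


def valid_at(quintuples, when):
    """Return the set of (subject, relation, object) triples in effect at `when`."""
    active = set()
    for q in sorted(quintuples, key=lambda q: q.time):
        if q.time > when:
            break
        triple = (q.subject, q.relation, q.obj)
        if q.state == "in":
            active.add(triple)
        elif q.state == "out":
            active.discard(triple)
    return active


# Made-up history for a hypothetical organization.
history = [
    Quintuple("CompanyX", "ceo", "Alice", "in", 2005),
    Quintuple("CompanyX", "ceo", "Alice", "out", 2010),
    Quintuple("CompanyX", "ceo", "Bob", "in", 2010),
]

facts_2007 = valid_at(history, 2007)
facts_2011 = valid_at(history, 2011)
```

Any triple with an "out" record at or before the query time is treated as expired, which mirrors the abstract's point about detecting expired information.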
Review spam detection via time series pattern discovery BIBAFull-Text 635-636
  Sihong Xie; Guan Wang; Shuyang Lin; Philip S. Yu
Online reviews play a crucial role in today's electronic commerce. Due to the pervasive spam reviews, customers can be misled to buy low-quality products, while decent stores can be defamed by malicious reviews. We observe that, in reality, a great portion (> 90% in the data we study) of the reviewers write only one review (singleton review). These reviews are so enormous in number that they can almost determine a store's rating and impression. However, existing methods ignore these reviewers. To address this problem, we observe that the normal reviewers' arrival pattern is stable and uncorrelated to their rating pattern temporally. In contrast, spam attacks are usually bursty and either positively or negatively correlated to the rating. Thus, we propose to detect such attacks via unusually correlated temporal patterns. We identify and construct multidimensional time series based on aggregate statistics, in order to depict and mine such correlation. Experimental results show that the proposed method is effective in detecting singleton review attacks. We discover that singleton review is a significant source of spam reviews and largely affects the ratings of online stores.
Combining classification with clustering for web person disambiguation BIBAFull-Text 637-638
  Jian Xu; Qin Lu; Zhengzhong Liu
Web Person Disambiguation is often conducted through clustering web documents to identify different namesakes for a given name. This paper presents a new key-phrased clustering method combined with a second step re-classification to identify outliers to improve cluster performance. For document clustering, the hierarchical agglomerative approach is conducted based on the vector space model which uses key phrases as the main feature. Outliers of cluster results are then identified through a centroids-based method. The outliers are then reclassified by the SVM classifier into the more appropriate clusters using a key phrase-based string kernel model as its feature space. The re-classification uses the clustering result in the first step as its training data so as to avoid the use of separate training data required by most classification algorithms. Experiments conducted on the WePS-2 dataset show that the algorithm based on key phrases is effective in improving the WPD performance.
Exploiting various implicit feedback for collaborative filtering BIBAFull-Text 639-640
  Byoungju Yang; Sangkeun Lee; Sungchan Park; Sang-goo Lee
So far, many researchers have worked on recommender systems using users' implicit feedback, since it is difficult to collect explicit item preferences in most applications. Existing research generally uses a pseudo-rating matrix built by adding up item consumption counts; however, this naive approach may not capture user preferences correctly, because many other important user activities are ignored. In this paper, we show that users' diverse implicit feedback can be used to significantly improve recommendation accuracy. We classify various user behaviors (e.g., search item, skip, add to playlist, etc.) into positive or negative feedback groups and construct a more accurate pseudo-rating matrix. Our preliminary experimental results show significant potential for our approach. We also question previous approaches that aggregate item usage counts into ratings.
The effect of links on networked user engagement BIBAFull-Text 641-642
  Elad Yom-Tov; Mounia Lalmas; Georges Dupret; Ricardo Baeza-Yates; Pinard Donmez; Janette Lehmann
In the online world, user engagement refers to the phenomena associated with being captivated by a web application and wanting to use it longer and frequently. Nowadays, many providers operate multiple content sites, very different from each other. Due to their extremely varied content, these are usually studied and optimized separately. However, user engagement should be examined not only within individual sites, but also across sites, that is the entire content provider network. In previous work, we investigated networked user engagement, by defining a global measure of engagement that captures the effect that sites have on the engagement on other sites within the same browsing session. Here, we look at the effect of links on networked user engagement, as these are commonly used by online content providers to increase user engagement.
Investigating bias in traditional media through social media BIBAFull-Text 643-644
  Arjumand Younus; Muhammad Atif Qureshi; Suneel Kumar Kingrani; Muhammad Saeed; Nasir Touheed; Colm O'Riordan; Pasi Gabriella
It is often the case that traditional media provide coverage of a news event on the basis of journalists' viewpoints -- a problem termed in the literature as media bias. On the other hand social media have given birth to an alternative paradigm of journalism known as "citizen journalism". We take advantage of citizen journalism to detect the bias in traditional media and propose a simple model for empirical measurement of media bias.
Enhancing naive Bayes with various smoothing methods for short text classification BIBAFull-Text 645-646
  Quan Yuan; Gao Cong; Nadia Magnenat Thalmann
Partly due to the proliferation of microblogs, short texts are becoming prominent. A huge number of short texts are generated every day, which calls for a method that can efficiently accommodate new data to incrementally adjust classification models. Naive Bayes meets such a need. We apply several smoothing models to Naive Bayes for question topic classification, as an example of short text classification, and study their performance. The experimental results on a large real question data set show that the smoothing methods are able to significantly improve the question classification performance of Naive Bayes. We also study the effect of training data size and question length on performance.
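To make the role of smoothing concrete, here is a minimal sketch of multinomial Naive Bayes with additive (Laplace) smoothing. Additive smoothing is used here only because it is the simplest common choice; the abstract does not list which smoothing methods the authors compare, and the function names, the `alpha` parameter, and the toy data are assumptions.

```python
import math
from collections import Counter, defaultdict


def train(examples, alpha=1.0):
    """Collect per-class document and word counts from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, alpha


def classify(model, tokens):
    """Return the label with the highest smoothed log-posterior."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for token in tokens:
            # Additive smoothing keeps unseen words from zeroing the probability.
            numerator = word_counts[label][token] + alpha
            denominator = total_words + alpha * len(vocab)
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training questions.
examples = [
    (["python", "loop", "error"], "tech"),
    (["pasta", "sauce", "recipe"], "food"),
]
model = train(examples)
predicted = classify(model, ["python", "recipe", "loop"])
```

Because the model is just a set of counts, new labeled texts can be folded in by updating the counters, which is the incremental property the abstract highlights.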
Filtering and ranking schemes for finding inclusion dependencies on the web BIBAFull-Text 647-648
  Erika Yumiya; Atsuyuki Morishima; Masami Takahashi; Shigeo Sugimoto; Hiroyuki Kitagawa
This paper addresses the problem of finding inclusion dependencies on the Web. In our approach, we enumerate pairs of HTML/XML elements that possibly represent inclusion dependencies and then rank the results for verification. This paper focuses on the challenges in the finding and ranking processes.
Exploiting shopping and reviewing behavior to re-score online evaluations BIBAFull-Text 649-650
  Rong Zhang; ChaoFeng Sha; Minqi Zhou; Aoying Zhou
Analysis of product reviews has attracted great attention from both academia and industry. Generally the evaluation scores of reviews are used to generate the average scores of products and shops for future potential users. However, in the real world, evaluation scores are often inconsistent with review content, and some customers do not give fair reviews. In this work, we focus on detecting the credibility of customers by analyzing online shopping and review behaviors, and then we re-score the reviews for products and shops. In the end, we evaluate our algorithm based on a real data set from Taobao, the biggest E-commerce site in China.

SWDM'12 workshop 1

Information cascades in social media in response to a crisis: a preliminary model and a case study BIBAFull-Text 653-656
  Cindy Hui; Yulia Tyshchuk; William A. Wallace; Malik Magdon-Ismail; Mark Goldberg
The focus of this paper is on demonstrating how a model of the diffusion of actionable information can be used to study information cascades on Twitter that are in response to an actual crisis event, and its concomitant alerts and warning messages from emergency managers. We will: identify the types of information requested or shared during a crisis situation; show how messages spread among the users on Twitter including what kinds of information cascades or patterns are observed; and note what these patterns tell us about information flow and the users. We conclude by noting that emergency managers can use this information to either facilitate the spreading of accurate information or impede the flow of inaccurate or improper messages.
User community reconstruction using sampled microblogging data BIBAFull-Text 657-660
  Miki Enoki; Yohei Ikawa; Raymond Rudy
User community recognition in social media services is important to identify hot topics or users' interests and concerns in a timely way when a disaster has occurred. In microblogging services, many short messages are posted every day and some of them represent replies or forwarded messages between users. We extract such conversational messages to link the users as a user network and regard the strongly-connected components in the network as indicators of user communities. However, using all of the microblog data for user community extraction is too costly and requires too much storage space when decomposing strongly-connected components. In contrast, using sampled data may miss some user connections and thus divide one user community into pieces. In this paper, we propose a method for user community reconstruction using the lexical similarity of the messages and the user's link information between separate communities.
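The step of treating strongly connected components of a reply network as community indicators can be sketched as follows. This uses Kosaraju's two-pass algorithm, a standard technique chosen here for illustration; the abstract does not say which decomposition algorithm the authors use, and the edge list and user names are made up.

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Return the strongly connected components of a directed graph.

    `edges` is an iterable of (source, target) pairs, e.g. (replier, author).
    """
    graph = defaultdict(set)
    reverse = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        graph[src].add(dst)
        reverse[dst].add(src)
        nodes.update((src, dst))

    # Pass 1: iterative depth-first search, recording nodes by finish time.
    visited = set()
    order = []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbors = stack[-1]
            for nxt in neighbors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: explore the reversed graph in decreasing finish order.
    assigned = set()
    components = []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component = {start}
        todo = [start]
        while todo:
            node = todo.pop()
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    todo.append(prev)
        components.append(component)
    return components


# Made-up reply edges between hypothetical users.
replies = [
    ("ann", "bob"), ("bob", "ann"),
    ("bob", "cat"),
    ("cat", "dan"), ("dan", "cat"),
    ("eve", "ann"),
]
communities = strongly_connected_components(replies)
```

With sampled data, a missing edge such as `("bob", "ann")` would split the first component, which is the fragmentation problem the abstract sets out to repair.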
Towards situational pattern mining from microblogging activity BIBAFull-Text 661-666
  Nathan Gnanasambandam; Keith Thompson; Ion Florie Ho; Sarah Lam; Sang Won Yoon
Many useful patterns can be derived from analyzing microblogging behavior at different scales (individual and social group). In this paper, we derive patterns relating to spatio-temporal traffic flow, visit regularity, content and social ties as they relate to an individual's activities in an urban environment (e.g., New York City). We also demonstrate, through an example, methods for reasoning about the activities, locations and group structures that may underlie the microblogging messages in the aforementioned context of mining situation patterns. These individual and group situational patterns may be very crucial when planning for disruptions and organized response.
Mining conversations of geographically changing users BIBAFull-Text 667-670
  Liam McNamara; Christian Rohner
In recent disaster events, social media has proven to be an effective communication tool for affected people. The corpus of generated messages contains valuable information about the situation, needs, and locations of victims. We propose an approach to extract significant aspects of user discussions to better inform responders and enable an appropriate response. The methodology combines location based division of users together with standard text mining (term frequency inverse document frequency) to identify important topics of conversation in a dynamic geographic network. We further suggest that both topics and movement patterns change during a disaster, which requires identification of new trends. When applied to an area that has suffered a disaster, this approach can provide 'sensemaking' through insights into where people are located, where they are going and what they communicate when moving.
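The combination of location-based grouping with term frequency inverse document frequency can be sketched by treating the messages from each area as one document. The weighting below is the plain textbook form, tf multiplied by log(N / df); the grouping into areas, the function name, and the sample tokens are assumptions for illustration rather than the authors' exact pipeline.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of token lists.

    Returns one {term: weight} dictionary per input document.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up tokenized messages, one list per hypothetical geographic area.
area_messages = [
    ["flood", "road", "closed"],
    ["flood", "shelter", "open"],
]
area_weights = tf_idf(area_messages)
```

Terms that appear in every area, such as "flood" in this toy input, receive zero weight, so the highest-weighted terms are the ones that distinguish one area's conversation from the others.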
Characterization of social media response to natural disasters BIBAFull-Text 671-674
  Seema Nagar; Aaditeshwar Seth; Anupam Joshi
Online social networking websites such as Twitter and Facebook often serve a breaking-news role for natural disasters: these websites are among the first ones to mention the news, and because they are visited by millions of users regularly the websites also help communicate the news to a large mass of people. In this paper, we examine how news about these disasters spreads on the social network. We also examine the countries of the tweeting users. We examine Twitter logs from the 2010 Philippines typhoon, the 2011 Brazil flood and the 2011 Japan earthquake. We find that although news about the disaster may be initiated in multiple places in the social network, it quickly finds a core community that is interested in the disaster, and has little chance to escape the community via social network links alone. We also find evidence that the world at large expresses concern about such large-scale disasters, and not just countries geographically proximate to the epicenter of the disaster. Our analysis has implications for the design of fund raising campaigns through social networking websites.
Rumor spreading and inoculation of nodes in complex networks BIBAFull-Text 675-678
  Anurag Singh; Yatindra Nath Singh
Over the Internet or on social networks, rumors can spread and can affect society during a disaster. The question one asks about this phenomenon is whether these rumors can be suppressed using suitable mechanisms. One of the possible solutions is to inoculate a certain fraction of nodes against rumors. The inoculation can be done randomly or in a targeted fashion. In this paper, a small world network model has been used to investigate the efficiency of inoculation. It has been found that if the average degree of the small world network is small, then both inoculation methods are successful. When the average degree is large, neither of these methods is able to stop rumor spreading. But if the acceptability of the rumor is reduced along with inoculation, the rumor spreading can be stopped even in this case.
   The proposed hypothesis has been verified using simulation experiments.
Bursty event detection from text streams for disaster management BIBAFull-Text 679-682
  Sungjun Lee; Sangjin Lee; Kwanho Kim; Jonghun Park
In this paper, an approach to automatically identifying bursty events from multiple text streams is presented. We investigate the characteristics of bursty terms that appear in the documents generated from text streams, and incorporate those characteristics into a term weighting scheme that distinguishes bursty terms from other non-bursty terms. Experimental results based on the news corpus show that our approach outperforms the existing alternatives in extracting bursty terms from multiple text streams. The proposed research is expected to contribute to increasing the situational awareness of ongoing events particularly when a natural or economic disaster occurs.
Automatic sub-event detection in emergency management using social media BIBAFull-Text 683-686
  Daniela Pohl; Abdelhamid Bouchachia; Hermann Hellwagner
Emergency management is about assessing critical situations, followed by decision making as a key step. Clearly, information is crucial in this two-step process. The technology of social (multi)media turns out to be an interesting source for collecting information about an emergency situation. In particular, situational information can be captured in form of pictures, videos, or text messages. The present paper investigates the application of multimedia metadata to identify the set of sub-events related to an emergency situation. The used metadata is compiled from Flickr and YouTube during an emergency situation, where the identification of the events relies on clustering. Initial results presented in this paper show how social media data can be used to detect different sub-events in a critical situation.
Location inference using microblog messages BIBAFull-Text 687-690
  Yohei Ikawa; Miki Enoki; Michiaki Tatsubori
In order to sense and analyze disaster information from social media, microblogs as sources of social data have recently attracted attention. In this paper, we attempt to discover geolocation information from microblog messages to assess disasters. Since microblog services are more timely compared to other social media, understanding the geolocation information of each microblog message is useful for quickly responding to sudden disasters. Some microblog services provide a function for adding geolocation information to messages from mobile devices equipped with GPS detectors. However, few users use this function, so most messages do not have geolocation information. Therefore, we attempt to discover the location where a message was generated by using its textual content. The proposed method learns associations between a location and its relevant keywords from past messages, and guesses where a new message came from.
SocialEMIS: improving emergency preparedness through collaboration BIBAFull-Text 691-694
  Ouejdane Mejri; Pierluigi Plebani
The definition of the contingency plan during the preparedness phase holds a crucial role in emergency management. A proper emergency response, indeed, requires the implementation of a contingency plan that can be accurate only if different people with different skills are involved. The goal of this paper is to introduce SocialEMIS, a first prototype of a tool that supports the collaborative definition of contingency plans. Although the current implementation is now focused on the role of the emergency operators, the accuracy of the plan will also take advantage of information coming from the citizens in future releases. Moreover, the contingency plans defined with SocialEMIS represent a knowledge base for defining other contingency plans.
Emergency situation awareness from Twitter for crisis management BIBAFull-Text 695-698
  Mark A. Cameron; Robert Power; Bella Robinson; Jie Yin
This paper describes ongoing work with the Australian Government to detect, assess, summarise, and report messages of interest for crisis coordination published by Twitter. The developed platform and client tools, collectively termed the Emergency Situation Awareness -- Automated Web Text Mining (ESA-AWTM) system, demonstrate how relevant Twitter messages can be identified and utilised to inform the situation awareness of an emergency incident as it unfolds.
   A description of the ESA-AWTM platform is presented detailing how it may be used for real life emergency management scenarios. These scenarios are focused on general use cases to provide: evidence of pre-incident activity; near-real-time notification of an incident occurring; first-hand reports of incident impacts; and gauging the community response to an emergency warning. Our tools have recently been deployed in a trial for use by crisis coordinators.
MECA: mobile edge capture and analysis middleware for social sensing applications BIBAFull-Text 699-702
  Fan Ye; Raghu Ganti; Raheleh Dimaghani; Keith Grueneberg; Seraphin Calo
In this paper, we propose and develop MECA, a common middleware infrastructure for data collection from mobile devices in an efficient, flexible, and scalable manner. It provides a high-level abstraction of phenomena such that applications can express diverse data needs in a declarative fashion. MECA coordinates the data collection and primitive processing activities, so that data can be shared among applications. It addresses the inefficiency issues in the current vertical integration approach. We showcase the benefits of MECA by means of a disaster management application.
The use of social media within the global disaster alert and coordination system (GDACS) BIBAFull-Text 703-706
  Beate Stollberg; Tom de Groeve
The Global Disaster Alert and Coordination System (GDACS) collects near real-time hazard information to provide global multi-hazard disaster alerting for earthquakes, tsunamis, tropical cyclones, floods and volcanoes. GDACS alerts are based on calculations from physical disaster parameters and used by emergency responders. In 2011, the Joint Research Centre (JRC) of the European Commission started exploring if and how social media could be an additional valuable data source for international disaster response. The question is whether awareness of the situation after a disaster could be improved by the use of social media tools and data. In order to explore this, JRC developed a Twitter account and Facebook page for the dissemination of GDACS alerts, a Twitter parser for the monitoring of information and a mobile application for information exchange. This paper presents the Twitter parser and the intermediate results of the data analysis, which show that the parsing of Twitter feeds (so-called tweets) can provide important information about side effects of disasters, on the perceived impact of a hazard and on the reaction of the affected population. The most important result is that impact information on collapsed buildings was detected through tweets within the first half hour after an earthquake occurred and before any mass media reported the collapse.
Evaluating the impact of incorporating information from social media streams in disaster relief routing BIBAFull-Text 707-708
  Ashlea Bennett Milburn; Clarence L. Wardell
In this paper, we describe a model that can be used to evaluate the impact of using imperfect information when routing supplies for disaster relief. Using two objectives, maximizing the population supported, and minimizing response time, we explore the potential tradeoffs (e.g. more information, but possibly less accurate) of using information from social media streams to inform routing and resource allocation decisions immediately after a disaster.
Tweeting about the tsunami?: mining Twitter for information on the Tohoku earthquake and tsunami BIBAFull-Text 709-710
  Akiko Murakami; Tetsuya Nasukawa
On 11th March 2011, a 9.0-magnitude megathrust earthquake occurred in the ocean near Japan. This was the first large-scale natural disaster in Japan since the broad adoption of social media tools (such as Facebook and Twitter). In particular, Twitter is suitable for broadcasting information, naturally making it one of the most frequently used social media when disasters strike. This paper presents a topical analysis using text mining tools and shows the tools' effectiveness for the analysis of social media data after a disaster. Though an ad hoc system without prepared resources was useful, an improved system with some syntactic pattern dictionaries showed better results.
Mass and social media corpus analysis after the 2011 great east Japan earthquake BIBAFull-Text 711-712
  Shosuke Sato; Michiaki Tatsubori; Fumihiko Imamura
In this paper, we outline our analysis of mass media and social media as used for disaster management. We looked at the differences among multiple sub-corpuses to find relatively unique keywords based on chronologies, geographic locations, or media types. We are currently analyzing a massive corpus collected from Internet news sources and Twitter after the Great East Japan Earthquake.
Social media and SMS in the Haiti earthquake BIBAFull-Text 713-714
  Julie Dugdale; Bartel Van de Walle; Corinna Koeppinghoff
We describe some first results of an empirical study describing how social media and SMS were used in coordinating humanitarian relief after the Haiti Earthquake in January 2010. Current information systems for crisis management are increasingly incorporating information obtained from citizens transmitted via social media and SMS. This information proves particularly useful at the aggregate level. However it has led to some problems: information overload and processing difficulties, variable speed of information delivery, managing volunteer communities, and the high risk of receiving inaccurate or incorrect information.
Social web in disaster archives BIBAFull-Text 715-716
  Michiaki Tatsubori; Hideo Watanabe; Akihiro Shibayama; Shosuke Sato; Fumihiko Imamura
Preserving social Web datasets is a crucial part of research work for disaster management based on information from social media. This paper describes the Michinoku Shinrokuden disaster archive project, mainly dedicated to archiving data from the 2011 Great East Japan Earthquake and its aftermath. Social websites should of course be part of this archive. We discuss issues in archiving social websites for the disaster management research communities and introduce our vision for Michinoku Shinrokuden.

XperienceWeb'12 Workshop 2

Extraction of onomatopoeia used for foods from food reviews and its application to restaurant search BIBAFull-Text 719-728
  Ayumi Kato; Yusuke Fukazawa; Tomomasa Sato; Taketoshi Mori
Onomatopoeia is widely used in reviews of food and restaurants. In this paper, we propose and evaluate a method to automatically extract onomatopoeia, including unknown ones, from food review sites. From the evaluation result, we found that we can extract onomatopoeia for specific foods with more than 46% precision; we find 18 unknown onomatopoeia, i.e. not registered in an existing onomatopoeia dictionary, in 62 extracted onomatopoeia. In addition, we propose a system that can present the user with a list of onomatopoeia specific to a restaurant she is interested in. The evaluation results indicate that an intuitive restaurant search can be done via a list of onomatopoeia, and that they are helpful for selecting food or restaurants.
Solution mining for specific contextualised problems: towards an approach for experience mining BIBAFull-Text 729-738
  Christian Severin Sauer; Thomas Roth-Berghofer
In this paper we describe the task of automated mining for solutions to highly specific problems. We do so under the premise of mapping the split view on context, introduced by Brézillon and Pomerol, onto three different levels of abstraction of a problem domain. This is done to integrate the notion of activity or focus and its influence on the context into the mining for a solution. We assume that a problem's context describes key characteristics that serve as decisive criteria in the mining process to mine successful solutions for it. We further detail the process of a chain of sub-problems and their foci adding up to a meta-problem solution and how this can be used to mine for such solutions. Through a guiding example we introduce basic steps of the solution mining process and common aspects we deem interesting to be analysed more closely in upcoming research on solution mining. We further examine the possible integration of these newly established outlines for automatic solution mining for highly specific problems into SEASALTexp, a currently developed architecture for explanation-aware extraction and case-based processing of experiences from Internet communities. We thereby gained first insights into issues occurring while trying to integrate automatic solution mining.
Extraction of procedural knowledge from the web: a comparison of two workflow extraction approaches BIBAFull-Text 739-745
  Pol Schumacher; Mirjam Minor; Kirstin Walter; Ralph Bergmann
User-generated Web content includes large amounts of procedural knowledge (also called how-to knowledge). This paper compares two extraction methods for procedural knowledge from the Web. Both methods create workflow representations automatically from text with the aim of reusing the Web experience by reasoning methods. Two variants of the workflow extraction process are introduced and evaluated by experiments with cooking recipes as a sample domain. The first variant is a term-based approach that integrates standard information extraction methods from the GATE system. The second variant is a frame-based approach that is implemented by means of the SUNDANCE system. The expert assessment of the extraction results clearly shows that the more sophisticated frame-based approach outperforms the term-based approach of automated workflow extraction.
Collecting, reusing and executing private workflows on social network platforms BIBAFull-Text 747-750
  Sebastian Görg; Ralph Bergmann; Mirjam Minor; Sarah Gessinger; Siblee Islam
We propose a personal workflow management service as part of a social network that enables private users to construct personal workflows according to their specific needs and to keep track of the workflow execution. Unlike traditional workflows, such personal workflows aim at supporting processes that contain personal tasks and data. Our proposal includes a process-oriented case-based reasoning approach to support private users to obtain an appropriate personal workflow through sharing and reuse of respective experience.
Contextual trace-based video recommendations BIBAFull-Text 751-754
  Raafat Zarka; Amélie Cordier; Elöd Egyed-Zsigmond; Alain Mille
People like creating their own videos by mixing various contents. Many applications allow us to generate video clips by merging different media like video clips, photos, text and sounds. Some of these applications enable us to combine online content with our own resources. Given the large amount of content available, the problem is to quickly find content that truly meets our needs. This is where recommender systems come in. In this paper, we propose an approach for contextual video recommendations based on a Trace-Based Reasoning approach.
Learning from users' querying experience on intranets BIBAFull-Text 755-764
  Ibrahim Adepoju Adeyanju; Dawei Song; M-Dyaa Albakour; Udo Kruschwitz; Anne De Roeck; Maria Fasli
Query recommendation is becoming a common feature of web search engines especially those for Intranets where the context is more restrictive. This is because of its utility for supporting users to find relevant information in less time by using the most suitable query terms. Selection of queries for recommendation is typically done by mining web documents or search logs of previous users. We propose the integration of these approaches by combining two models namely the concept hierarchy, typically built from an Intranet's documents, and the query flow graph, typically built from search logs. However, we build our concept hierarchy model from terms extracted from a subset (training set) of search logs since these are more representative of the user view of the domain than any concepts extracted from the collection. We then continually adapt the model by incorporating query refinements from another subset (test set) of the user search logs. This process implies learning from or reusing previous users' querying experience to recommend queries for a new but similar user query. The adaptation weights are extracted from a query flow graph built with the same logs. We evaluated our hybrid model using documents crawled from the Intranet of an academic institution and its search logs. The hybrid model was then compared to a concept hierarchy model and query flow graph built from the same collection and search logs respectively. We also tested various strategies for combining information in the search logs with respect to the frequency of clicked documents after query refinement. Our hybrid model significantly outperformed the concept hierarchy model and query flow graph when tested over two different periods of the academic year. We intend to further validate our experiments with documents and search logs from another institution and devise better strategies for selecting queries for recommendation from the hybrid model.

CQA'12 workshop 3

Exploiting user profile information for answer ranking in cQA BIBAFull-Text 767-774
  Zhi-Min Zhou; Man Lan; Zheng-Yu Niu; Yue Lu
Answer ranking is very important for cQA services due to the high variance in the quality of answers. Most existing works in this area focus on using various features or employing machine learning techniques to address this problem. Only a few of them noticed and involved user profile information in this particular task. In this work, we assume the close relationship between user profile information and the quality of their answers under the ground truth that user information records the user behaviors and histories as a summary. Thus, we exploited the effectiveness of three categories of user profile information, i.e. engagement-related, authority-related and level-related, on answer ranking in cQA. Different from previous work, we only employed the information which is easy to extract without any limitations, such as user privacy. Experimental results on Yahoo! Answers manner questions showed that our system by using the user profile information achieved comparable or even better results over the state-of-the-art baseline system. Moreover, we found that the picture existence of a user in cQA community contributed more than other information in the answer ranking task.
Analyzing and predicting question quality in community question answering services BIBAFull-Text 775-782
  Baichuan Li; Tan Jin; Michael R. Lyu; Irwin King; Barley Mak
Users tend to ask and answer questions in community question answering (CQA) services to seek information and share knowledge. A corollary is that a myriad of questions and answers appear in CQA services. Accordingly, many studies have been undertaken to explore answer quality so as to provide a preliminary screening for better answers. However, to our knowledge, less attention has so far been paid to question quality in CQA. Knowing question quality helps us find and recommend good questions and identify bad ones which hinder the CQA service. In this paper, we conduct two studies to investigate the question quality issue. The first study analyzes the factors of question quality and finds that the interaction between askers and topics results in the differences of question quality. Based on this finding, in the second study we propose a Mutual Reinforcement-based Label Propagation (MRLP) algorithm to predict question quality. We experiment with Yahoo! Answers data and the results demonstrate the effectiveness of our algorithm in distinguishing high-quality questions from low-quality ones.
A classification-based approach to question routing in community question answering BIBAFull-Text 783-790
  Tom Chao Zhou; Michael R. Lyu; Irwin King
Community-based Question and Answering (CQA) services have brought users to a new era of knowledge dissemination by allowing users to ask questions and to answer other users' questions. However, due to the rapid increase in posted questions and the lack of an effective way to find interesting questions, there is a serious gap between posted questions and potential answerers. This gap may degrade a CQA service's performance as well as reduce users' loyalty to the system. To bridge the gap, we present a new approach to Question Routing, which aims at routing questions to participants who are likely to provide answers. We consider the problem of question routing as a classification task, and develop a variety of local and global features which capture different aspects of questions, users, and their relations. Our experimental results obtained from an evaluation over the Yahoo! Answers dataset demonstrate high feasibility of question routing. We also perform a systematic comparison of how different types of features contribute to the final results and show that question-user relationship features play a key role in improving the overall performance.
Finding expert users in community question answering BIBAFull-Text 791-798
  Fatemeh Riahi; Zainab Zolaktaf; Mahdi Shafiei; Evangelos Milios
Community Question Answering (CQA) websites provide a rapidly growing source of information in many areas. This rapid growth, while offering new opportunities, puts forward new challenges. In most CQA implementations there is little effort in directing new questions to the right group of experts. This means that experts are not provided with questions matching their expertise, and therefore new matching questions may be missed and not receive a proper answer. We focus on finding experts for a newly posted question. We investigate the suitability of two statistical topic models for solving this issue and compare these methods against more traditional Information Retrieval approaches. We show that for a dataset constructed from the Stackoverflow website, these topic models outperform other methods in retrieving a candidate set of best experts for a question. We also show that the Segmented Topic Model gives consistently better performance compared to the Latent Dirichlet Allocation Model.
QAque: faceted query expansion techniques for exploratory search using community QA resources BIBAFull-Text 799-806
  Atsushi Otsuka; Yohei Seki; Noriko Kando; Tetsuji Satoh
Recently, query suggestions have become quite useful in web searches. Most provide additional and correct terms based on the initial query entered by users. However, query suggestions often recommend queries that differ from the user's search intentions due to different contexts. In such cases, faceted query expansions and their usages are quite efficient. In this paper, we propose faceted query expansion methods using the resources of Community Question Answering (CQA), which is a social network service (SNS) that shares user knowledge. In a CQA site, users can post questions in a suitable category. Others answer them based on the category framework. Thus, the CQA "category" makes a "facet" of the query expansion. In addition, the time of year when the question was posted plays an important role in understanding its context. Thus, such seasonality creates another "facet" of the query expansion. We implement two-dimensional faceted query expansion methods based on the results of the Latent Dirichlet Allocation (LDA) analysis of CQA resources. The question articles from which query expansions are derived are provided to users for choosing appropriate terms. Our sophisticated evaluations using actual and long-term CQA resources, such as "Yahoo! CHIEBUKURO," demonstrate that most parts of the CQA questions are posted periodically and in bursts.
Socio-semantic conversational information access BIBAFull-Text 807-814
  Saurav Sahay; Ashwin Ram
We develop an innovative approach to delivering relevant information using a combination of socio-semantic search and filtering approaches. The goal is to facilitate timely and relevant information access through the medium of conversations by mixing past community specific conversational knowledge and web information access to recommend and connect users and information together. Conversational Information Access is a socio-semantic search and recommendation activity with the goal to interactively engage people in conversations by receiving agent supported recommendations. It is useful because people engage in online social discussions unlike solitary search; the agent brings in relevant information as well as identifies relevant users; participants provide feedback during the conversation that the agent uses to improve its recommendations.
Why do you ask this? BIBAFull-Text 815-822
  Giovanni Gardelli; Ingmar Weber
We use Yahoo! Toolbar data to gain insights into why people use Q&A sites. For this purpose we look at tens of thousands of questions asked on both Yahoo! Answers and on Wiki Answers. We analyze both the pre-question behavior of users as well as their general online behavior. Using an existing approach (Harper et al.), we classify questions into "informational" vs. "conversational". Finally, for a subset of users on Yahoo! Answers we also integrate age and gender into our analysis. Our results indicate that there is a one-dimensional spectrum of users ranging from "social users" to "informational users". In terms of demographics, we found that both younger and female users are more "social" on this scale, with older and male users being more "informational".
   Concerning the pre-question behavior, users who first issue a question-related query, and especially those who do not click any web results, are more likely to issue informational questions than users who do not search before. Questions asked shortly after the registration of a new user on Yahoo! Answers tend to be social and have a lower probability of being preceded by a web search than other questions.
   Finally, we observed evidence both for and against topical congruence between a user's questions and his web queries.
Understanding user intent in community question answering BIBAFull-Text 823-828
  Long Chen; Dell Zhang; Levene Mark
Community Question Answering (CQA) services, such as Yahoo! Answers, are specifically designed to address the innate limitation of Web search engines by helping users obtain information from a community. Understanding the user intent of questions would enable a CQA system to identify similar questions, find relevant answers, and recommend potential answerers more effectively and efficiently. In this paper, we propose to classify questions into three categories according to their underlying user intent: subjective, objective, and social. In order to identify the user intent of a new question, we build a predictive model through machine learning based on both text and metadata features. Our investigation reveals that these two types of features are conditionally independent and each of them is sufficient for prediction. Therefore they can be exploited as two views in co-training -- a semi-supervised learning framework -- to make use of a large amount of unlabelled questions, in addition to the small set of manually labelled questions, for enhanced question classification. The preliminary experimental results show that co-training works significantly better than simply pooling these two types of features together.
Churn prediction in new users of Yahoo! answers BIBAFull-Text 829-834
  Gideon Dror; Dan Pelleg; Oleg Rokhlenko; Idan Szpektor
One of the important targets of community-based question answering (CQA) services, such as Yahoo! Answers, Quora and Baidu Zhidao, is to maintain and even increase the number of active answerers, that is the users who provide answers to open questions. The reasoning is that they are the engine behind satisfied askers, which is the overall goal behind CQA. Yet, this task is not an easy one. Indeed, our empirical observation shows that many users provide just one or two answers and then leave. In this work we try to detect answerers that are about to quit, a task known as churn prediction, but unlike prior work, we focus on new users. To address the task of churn prediction in new users, we extract a variety of features to model the behavior of Yahoo! Answers users over the first week of their activity, including personal information, rate of activity, and social interaction with other users. Several classifiers trained on the data show that there is a statistically significant signal for discriminating between users who are likely to churn and those who are not. A detailed feature analysis shows that the two most important signals are the total number of answers given by the user, closely related to the motivation of the user, and attributes related to the amount of recognition given to the user, measured in counts of best answers, thumbs up and positive responses by the asker.

EMAIL'12 workshop 4

Email between private use and organizational purpose BIBAFull-Text 837-840
  Uwe V. Riss
Emails have become an eminent source of personal and organizational information. They are not only used for personal communication but also for the management of information and the coordination of activities within organizations. Email traffic also exhibits the social networks existing in organizations. However, the central problem, which we still face, is how to tap this rich source appropriately. Main problems in this respect are the personal character of emails (their privacy) and the mainly unstructured character of their contents. Since these two features are essential success factors for the use of email they cannot be simply ignored. Meanwhile there are various approaches to recover the hidden treasure and make the contained information available to information and process management. For example, semantic or mining technologies play a prominent role in this attempt. The paper gives an overview of different strategies to make organizational use of emails, also touching the role of privacy.
Emails as graph: relation discovery in email archive BIBAFull-Text 841-846
  Michal Laclavík; Stefan Dlugolinský; Martin Seleng; Marek Ciglan; Ladislav Hluchý
In this paper, we present an approach for representing an email archive in the form of a network, capturing the communication among users and relations among the entities extracted from the textual part of the email messages. We showcase the method on the Enron email corpus, from which we extract various entities and a social network. The extracted named entities (NE), such as people, email addresses and telephone numbers, are organized in a graph along with the emails in which they were found. The edges in the graph indicate relations between NEs and represent a co-occurrence in the same email part, paragraph, sentence or a composite NE. We study mathematical properties of the graphs so created and describe our hands-on experience with the processing of such structures. Enron Graph corpus contains a few million nodes and is large enough for experimenting with various graph-querying techniques, e.g. graph traversal or spread of activation. Due to its size, the exploitation of traditional graph processing libraries might be problematic as they keep the whole structure in the memory. We describe our experience with the management of such data and with the relation discovery among the extracted entities. The described experience might be valuable for practitioners and highlights several research challenges.
Interpreting contact details out of e-mail signature blocks BIBAFull-Text 847-850
  Gaëlle Recourcé
This paper describes a fully automated process of address book enrichment by means of information extraction in e-mail signature blocks. The main issues we tackle are signature block detection, named entity tagging, mapping with a specific person, standardizing the details and auto-updating of the address book. We adopted a symbolic approach for NLP modules. We describe how the process was designed to handle multiple types of errors (human or computer-driven) while aiming at a 100% precision rate. Last, we tackle the question of automatic updating when confronted with users' rights over their own data.
Context-sensitive business process support based on emails BIBAFull-Text 851-856
  Thomas Burkhart; Dirk Werth; Peter Loos
In many companies, a majority of business processes take place via email communication. Large enterprises have the possibility to operate enterprise systems for a successful business process management. However, these systems are not appropriate for SMEs, which are the most common enterprise type in Europe. Thus, the European research project Commius addresses the special needs of SMEs and characteristics of email communication, namely high flexibility and unstructuredness. Commius turns the existing email system into a structured process management framework. Each incoming email is autonomously matched to the corresponding business process and enhanced by proactive annotations. These context-sensitive annotations include recommendations for the most suitable following process steps. An underlying, self-adjusting recommendation model ensures the most appropriate recommendations by observing the actual user behavior. This implies that the proposed process course is in no way obligatory. To provide a high degree of flexibility, any deviation from the given process structure is allowed.
Full-text search in email archives using social evaluation, attached and linked resources BIBAFull-Text 857-860
  Vojtech Juhász
Emails are important tools for communication and cooperation; they contain a large amount of information and connections to knowledge and data sources. Because of this, it is very important to improve the efficiency of their processing. This paper describes an email search system which integrates full-text search with social search while also processing the attached and linked resources. The project described in this paper is still in progress. Due to this fact, some proposed parts of the system are not yet implemented or proven. The proposed equation for determining the social importance of an email also has to be tuned during the last phases of the development and the evaluation phase. The already implemented part of the system includes content extraction from the email messages and from attached and linked resources, as well as textual search and social relation extraction. The next phase of the development includes tuning of the social evaluation and its integration with textual search.

AdMIRe'12 workshop 6

Context-aware music recommender systems: workshop keynote abstract BIBKFull-Text 865-866
  Francesco Ricci
Keywords: Music recommender systems, context awareness, mobile services, tags
Data gathering for a culture specific approach in MIR BIBAFull-Text 867-868
  Xavier Serra
In this paper we describe the data gathering work done within a large research project, CompMusic, which emphasizes a culture specific approach in the automatic description of several world music repertoires. Currently we are focusing on the Hindustani (North India), Carnatic (South India) and Turkish-makam (Turkey) music traditions. The selection and organization of the data to be processed for the characterization of each of these traditions is of the utmost importance.
Music retagging using label propagation and robust principal component analysis BIBAFull-Text 869-876
  Yi-Hsuan Yang; Dmitry Bogdanov; Perfecto Herrera; Mohamed Sordo
The emergence of social tagging websites such as Last.fm has provided new opportunities for learning computational models that automatically tag music. Researchers typically obtain music tags from the Internet and use them to construct machine learning models. Nevertheless, such tags are usually noisy and sparse. In this paper, we present a preliminary study that aims at refining (retagging) social tags by exploiting the content similarity between tracks and the semantic redundancy of the track-tag matrix. The evaluated algorithms include a graph-based label propagation method that is often used in semi-supervised learning and a robust principal component analysis (PCA) algorithm that has led to state-of-the-art results in matrix completion. The results indicate that robust PCA with content similarity constraint is particularly effective; it improves the robustness of tagging against three types of synthetic errors and boosts the recall rate of music auto-tagging by 7% in a real-world setting.
Mining microblogs to infer music artist similarity and cultural listening patterns BIBAFull-Text 877-886
  Markus Schedl; David Hauger
This paper aims at leveraging microblogs to address two challenges in music information retrieval (MIR): similarity estimation between music artists and inference of typical listening patterns at different granularity levels (city, country, global). From two collections of several million microblogs, which we gathered over ten months, music-related information is extracted and statistically analyzed. We propose and evaluate four co-occurrence-based methods to compute artist similarity scores. Moreover, we derive and analyze culture-specific music listening patterns to investigate the diversity of listening behavior around the world.
Melody, bass line, and harmony representations for music version identification BIBAFull-Text 887-894
  Justin Salamon; Joan Serrà; Emilia Gómez
In this paper we compare the use of different musical representations for the task of version identification (i.e. retrieving alternative performances of the same musical piece). We automatically compute descriptors representing the melody and bass line using a state-of-the-art melody extraction algorithm, and compare them to a harmony-based descriptor. The similarity of descriptor sequences is computed using a dynamic programming algorithm based on nonlinear time series analysis which has been successfully used for version identification with harmony descriptors. After evaluating the accuracy of individual descriptors, we assess whether performance can be improved by descriptor fusion, for which we apply a classification approach, comparing different classification algorithms. We show that both melody and bass line descriptors carry useful information for version identification, and that combining them increases version detection accuracy. Whilst harmony remains the most reliable musical representation for version identification, we demonstrate how in some cases performance can be improved by combining it with melody and bass line descriptions. Finally, we identify some of the limitations of the proposed descriptor fusion approach, and discuss directions for future research.
Power-law distribution in encoded MFCC frames of speech, music, and environmental sound signals BIBAFull-Text 895-902
  Martín Haro; Joan Serrà; Álvaro Corral; Perfecto Herrera
Many sound-related applications use Mel-Frequency Cepstral Coefficients (MFCC) to describe audio timbral content. Most of the research efforts dealing with MFCCs have been focused on the study of different classification and clustering algorithms, the use of complementary audio descriptors, or the effect of different distance measures. The goal of this paper is to focus on the statistical properties of the MFCC descriptor itself. For that purpose, we use a simple encoding process that maps a short-time MFCC vector to a dictionary of binary code-words. We study and characterize the rank-frequency distribution of such MFCC code-words, considering speech, music, and environmental sound sources. We show that, regardless of the sound source, MFCC code-words follow a shifted power-law distribution. This implies that there are a few code-words that occur very frequently and many that happen rarely. We also observe that the inner structure of the most frequent code-words has characteristic patterns. For instance, close MFCC coefficients tend to have similar quantization values in the case of music signals. Finally, we study the rank-frequency distributions of individual music recordings and show that they present the same type of heavy-tailed distribution as found in the large-scale databases. This fact is exploited in two supervised semantic inference tasks: genre and instrument classification. In particular, we obtain similar classification results as the ones obtained by considering all frames in the recordings by just using 50 (properly selected) frames. Beyond this particular example, we believe that the fact that MFCC frames follow a power-law distribution could potentially have important implications for future audio-based applications.
Creating a large-scale searchable digital collection from printed music materials BIBAFull-Text 903-908
  Andrew Hankinson; John Ashley Burgoyne; Gabriel Vigliensoni; Ichiro Fujinaga
In this paper we present our work towards developing a large-scale web application for digitizing, recognizing (via optical music recognition), correcting, displaying, and searching printed music texts. We present the results of a recently completed prototype implementation of our workflow process, from document capture to presentation on the web. We discuss a number of lessons learned from this prototype. Finally, we present some open-source Web 2.0 tools developed to provide essential infrastructure components for making searchable printed music collections available online. Our hope is that these experiences and tools will help in creating next-generation globally accessible digital music libraries.
The million song dataset challenge BIBAFull-Text 909-916
  Brian McFee; Thierry Bertin-Mahieux; Daniel P. W. Ellis; Gert R. G. Lanckriet
We introduce the Million Song Dataset Challenge: a large-scale, personalized music recommendation challenge, where the goal is to predict the songs that a user will listen to, given both the user's listening history and full information (including meta-data and content analysis) for all songs. We explain the taste profile data, our goals and design choices in creating the challenge, and present baseline results using simple, off-the-shelf recommendation algorithms.
Towards minimal test collections for evaluation of audio music similarity and retrieval BIBAFull-Text 917-924
  Julián Urbano; Markus Schedl
Reliable evaluation of Information Retrieval systems requires large amounts of relevance judgments. Making these annotations is quite complex and tedious for many Music Information Retrieval tasks, so performing such evaluations requires too much effort. A low-cost alternative is the application of Minimal Test Collection algorithms, which offer quite reliable results while significantly reducing the annotation effort. The idea is to incrementally select what documents to judge so that we can compute estimates of the effectiveness differences between systems with a certain degree of confidence. In this paper we show a first approach towards its application to the evaluation of the Audio Music Similarity and Retrieval task, run by the annual MIREX evaluation campaign. An analysis with the MIREX 2011 data shows that the judging effort can be reduced to about 35% to obtain results with 95% confidence.
Combining usage and content in an online music recommendation system for music in the long-tail BIBAFull-Text 925-930
  Marcos Aurélio Domingues; Fabien Gouyon; Alípio Mário Jorge; José Paulo Leal; João Vinagre; Luís Lemos; Mohamed Sordo
In this paper we propose a hybrid music recommender system, which combines usage and content data. We describe an online evaluation experiment performed in real time on a commercial music web site, specialised in content from the very long tail of music content. We compare it against two stand-alone recommenders, the first system based on usage and the second one based on content data. The results show that the proposed hybrid recommender shows advantages with respect to usage- and content-based systems, namely, higher user absolute acceptance rate, higher user activity rate and higher user loyalty.
Adapting similarity on the MagnaTagATune database: effects of model and feature choices BIBAFull-Text 931-936
  Daniel Wolff; Tillman Weyde
Predicting users' tastes in music has become crucial for competitive music recommendation systems, and perceived similarity plays an influential role in this. MIR is currently turning towards making recommendation systems adaptive to user preferences and context. Here, we consider the particular task of adapting music similarity measures to user voting data. This work builds on and responds to previous publications based on the MagnaTagATune dataset. We have reproduced the similarity dataset presented by Stober and Nürnberger at AMR 2011 to enable a comparison of approaches. On this dataset, we compare their two-level approach, defining similarity measures on individual facets and combining them in a linear model, to the Metric Learning to Rank (MLR) algorithm. MLR adapts a similarity measure that operates directly on low-level features to the user data. We compare the different algorithms, features and parameter spaces with regard to minimising constraint violations. Furthermore, the effectiveness of the MLR algorithm in generalising to unknown data is evaluated on this dataset. We also explore the effects of feature choice. Here, we find that the binary genre data shows little correlation with the similarity data, but combined with audio features it clearly improves generalisation.

MultiAPro'12 workshop 7

User profile integration made easy: model-driven extraction and transformation of social network schemas BIBAFull-Text 939-948
  Martin Wischenbart; Stefan Mitsch; Elisabeth Kapsammer; Angelika Kusel; Birgit Pröll; Werner Retschitzegger; Wieland Schwinger; Johannes Schönböck; Manuel Wimmer; Stephan Lechner
User profile integration from multiple social networks is indispensable for gaining a comprehensive view on users. Although current social networks provide access to user profile data via dedicated APIs, they fail to provide accurate schema information, which aggravates the integration of user profiles, and not least the adaptation of applications in the face of schema evolution. To alleviate these problems, this paper presents, firstly, a semi-automatic approach to extract schema information from instance data. Secondly, transformations of the derived schemas to different technical spaces are utilized, thereby allowing, amongst other benefits, the application of established integration tools and methods. Finally, as a case study, schemas are derived for Facebook, Google+, and LinkedIn. The resulting schemas are analyzed (i) for completeness and correctness according to the documentation, and (ii) for semantic overlaps and heterogeneities amongst each other, building the basis for future user profile integration.
Multi-application profile updates propagation: a semantic layer to improve mapping between applications BIBAFull-Text 949-958
  Nadia Bennani; Max Chevalier; Elöd Egyed-Zsigmond; Gilles Hubert; Marco Viviani
In the field of multi-application personalization, several techniques have been proposed to support user modeling for user data management across different applications. Many of them are based on data reconciliation techniques, often implying the concepts of static ontologies and generic user data models. None of them has sufficiently investigated two main issues related to user modeling: (1) profile definition, in order to allow every application to build its own view of users while promoting the sharing of these profiles, and (2) profile evolution over time, in order to avoid data inconsistency and the subsequent loss of income for web-site users and companies. In this paper, we address both issues and propose a separate solution for each. We propose a flexible user modeling system that does not impose any fixed user model to which different applications should conform, but is based on the concept of mapping among applications (and mapping functions among their user attributes). We focus in particular on the management of user profile data propagation, as a way to reduce the amount of inconsistent user profile information over several applications.
   A second goal of this paper is to illustrate, in this context, the benefit obtained by the integration of a Semantic Layer that can help application designers to automatically identify potential user attribute mappings between applications.
   This paper thus illustrates work in progress in which two complementary approaches are integrated to serve one main goal: managing multi-application user profiles in a semi-automatic manner.
Personalised placement in networked video BIBAFull-Text 959-968
  Jeremy D. Foss; Benedita Malheiro; Juan-Carlos Burguillo
Personalised video can be achieved by inserting objects into a video play-out according to the viewer's profile. Content which has been authored and produced for general broadcast can take on additional commercial service features when personalised either for individual viewers or for groups of viewers participating in entertainment, training, gaming or informational activities. Although several scenarios and use-cases can be envisaged, we are focussed on the application of personalised product placement. Targeted advertising and product placement are currently garnering intense interest in the commercial networked media industries. Personalisation of product placement is a relevant and timely service for next generation online marketing and advertising and for many other revenue generating interactive services. This paper discusses the acquisition and insertion of media objects into a TV video play-out stream where the objects are determined by the profile of the viewer. The technology is based on MPEG-4 standards using object based video and MPEG-7 for metadata. No proprietary technology or protocol is proposed. To trade the objects into the video play-out, a Software-as-a-Service brokerage platform based on intelligent agent technology is adopted. Agencies, libraries and service providers are represented in a commercial negotiation to facilitate the contractual selection and usage of objects to be inserted into the video play-out.
A user profile modelling using social annotations: a survey BIBAFull-Text 969-976
  Manel Mezghani; Corinne Amel Zayani; Ikram Amous; Faiez Gargouri
As social networks grow in terms of the number of users, resources and interactions, users may get lost or be unable to find useful information. Social elements such as social annotations (tags), which are becoming more and more popular, can help avoid this disorientation. Representing a user based on these social annotations has proven useful in producing an accurate user profile that can be used for recommendation purposes. In this paper, we survey the state of the art on the characteristics of social users and on techniques that model and update a tag-based profile. We show how to treat social annotations and the utility of modelling tag-based profiles for recommendation purposes.
Towards an interoperable device profile containing rich user constraints BIBAFull-Text 977-986
  Cédric Dromzée; Sébastien Laborie; Philippe Roose
Currently, multimedia documents can be accessed anytime and anywhere with a wide variety of mobile devices, e.g., laptops, smartphones, tablets. Obviously, platform heterogeneity, user preferences and context variations require document adaptation according to execution constraints, e.g., audio content may not be played while a user is participating in a meeting. Current context modeling languages do not handle such real-life user constraints. They generally list multiple information values that are interpreted by adaptation processes in order to deduce such high-level constraints implicitly. This paper overcomes this limitation by proposing a novel context modeling approach based on services, where context information is linked according to explicit high-level constraints. In order to validate our proposal, we have used Semantic Web technologies by specifying RDF profiles and experimenting with their usage on several platforms.

LSNA'12 workshop 8

From network mining to large scale business networks BIBAFull-Text 989-996
  Daniel Ritter
The vision of Large Scale Network Analysis (LSNA) centers on the large amounts of network data produced by social media applications like Facebook and Twitter and by the competitive domain of biological networks, as well as on their needs for network data extraction and analysis. This raises data management challenges which are addressed by the biological, data mining and linked (web) data management communities. So far, mainly these domains have been considered when identifying research topics and measuring approaches and progress. We argue that an important domain, Business Network Management (BNM), representing business and (technical) integration data, implicitly linked and available in enterprises, has been neglected. Not only do enterprises need visibility into their business networks, they also need ad-hoc analysis capabilities on them. In this paper, we introduce BNM as a domain which comes with large scale network data. We discuss how linked business data can be made explicit by what we call Network Mining (NM) from dynamic, heterogeneous enterprise environments, and combined into a (cross-) enterprise linked business data network. We also comment on its different facets with respect to large network analysis and highlight challenges and opportunities.
Role-dynamics: fast mining of large dynamic networks BIBAFull-Text 997-1006
  Ryan Rossi; Brian Gallagher; Jennifer Neville; Keith Henderson
To understand the structural dynamics of a large-scale social, biological or technological network, it may be useful to discover behavioral roles representing the main connectivity patterns present over time. In this paper, we propose a scalable non-parametric approach to automatically learn the structural dynamics of the network and individual nodes. Roles may represent structural or behavioral patterns such as the center of a star, peripheral nodes, or bridge nodes that connect different communities. Our novel approach learns the appropriate structural role dynamics for any arbitrary network and tracks the changes over time. In particular, we uncover the specific global network dynamics and the local node dynamics of a technological, communication, and social network. We identify interesting node and network patterns such as stationary and non-stationary roles, spikes/steps in role-memberships (perhaps indicating anomalies), increasing/decreasing role trends, among many others. Our results indicate that the nodes in each of these networks have distinct connectivity patterns that are non-stationary and evolve considerably over time. Overall, the experiments demonstrate the effectiveness of our approach for fast mining and tracking of the dynamics in large networks. Furthermore, the dynamic structural representation provides a basis for building more sophisticated models and tools that are fast for exploring large dynamic networks.
A fast algorithm to find all high degree vertices in power law graphs BIBAFull-Text 1007-1016
  Colin Cooper; Tomasz Radzik; Yiannis Siantos
Sampling from large graphs is an area which is of great interest, particularly with the recent emergence of huge structures such as Online Social Networks. These often contain hundreds of millions of vertices and billions of edges. The large size of these networks makes it computationally expensive to obtain structural properties of the underlying graph by exhaustive search. If we can estimate these properties by taking small but representative samples from the network, then size is no longer a problem.
   In this paper we develop an analysis of random walks, a commonly used method of sampling from networks. We present a method of biasing the random walk to acquire a complete sample of high degree vertices of social networks, or similar graphs. The preferential attachment model is a common method to generate graphs with a power law degree sequence. For this model, we prove that this sampling method is successful with high probability. We also make experimental studies of the method on various real world networks.
   For t-vertex graphs $G(t)$ generated by a preferential attachment process, we analyze a biased random walk which makes transitions along undirected edges $\{x,y\}$ proportional to $[d(x)d(y)]^{b}$, where $d(x)$ is the degree of vertex $x$ and $b > 0$ is a constant parameter. Let $S(a)$ be the set of all vertices of degree at least $t^{a}$ in $G(t)$. We show that for some $b \approx 2/3$, if the biased random walk starts at an arbitrary vertex of $S(a)$, then with high probability the set $S(a)$ can be discovered completely in $\tilde{O}(t^{1-(4/3)a+d})$ steps, where $d$ is a very small positive constant. The notation $\tilde{O}$ ignores poly-log $t$ factors.
   The preferential attachment process generates graphs with power law 3, so the above example is a special case of this result. For graphs with degree sequence power law $c > 2$ generated by a generalized preferential attachment process, a random walk with transitions along undirected edges $\{x,y\}$ proportional to $(d(x)d(y))^{(c-2)/2}$ discovers the set $S(a)$ completely in $\tilde{O}(t^{1-a(c-2)+d})$ steps with high probability. The cover time of the graph is $\tilde{O}(t)$.
   Our results say that if we search preferential attachment graphs with a bias b=(c-2)/2 proportional to the power law c then, (i) we can find all high degree vertices quickly, and (ii) the time to discover all vertices is not much higher than in the case of a simple random walk. We conduct experimental tests on generated networks and real-world networks, which confirm these two properties.
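The biased walk described in this abstract can be illustrated with a minimal sketch. It assumes an undirected graph stored as a dictionary of neighbour lists; the function names `biased_walk_step` and `discover_high_degree`, the fixed step budget, and the degree threshold argument are hypothetical choices made for this illustration and are not taken from the paper.

```python
import random


def biased_walk_step(adj, x, b, rng):
    """Move from x to a neighbour y chosen with probability
    proportional to [d(x) * d(y)]^b, where d is the vertex degree."""
    neighbours = adj[x]
    d_x = len(neighbours)
    # d(x)^b is the same for every edge at x, so it cancels after
    # normalisation; it is kept to mirror the stated edge weight.
    weights = [(d_x * len(adj[y])) ** b for y in neighbours]
    return rng.choices(neighbours, weights=weights, k=1)[0]


def discover_high_degree(adj, start, b, threshold, steps, seed=0):
    """Run the biased walk for `steps` transitions (an assumed budget)
    and collect every visited vertex whose degree is >= threshold."""
    rng = random.Random(seed)
    found = set()
    x = start
    for _ in range(steps):
        if len(adj[x]) >= threshold:
            found.add(x)
        x = biased_walk_step(adj, x, b, rng)
    if len(adj[x]) >= threshold:
        found.add(x)
    return found


# Hypothetical toy graph: a star with centre 0 and five leaves.
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
hubs = discover_high_degree(star, start=1, b=2 / 3, threshold=5, steps=4)
```

On the toy star graph every leaf has the centre as its only neighbour, so the walk reaches the single high-degree vertex after one transition. The sketch does not reproduce the running-time guarantees stated in the abstract.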
Harnessing user library statistics for research evaluation and knowledge domain visualization BIBAFull-Text 1017-1024
  Peter Kraker; Christian Körner; Kris Jack; Michael Granitzer
Social reference management systems provide a wealth of information that can be used for the analysis of science. In this paper, we examine whether user library statistics can produce meaningful results with regards to science evaluation and knowledge domain visualization. We are conducting two empirical studies, using a sample of library data from Mendeley, the world's largest social reference management system. Based on the occurrence of references in users' libraries, we perform a large-scale impact factor analysis and an exploratory co-readership analysis. Our preliminary findings indicate that the analysis of user library statistics can produce accurate, timely, and content-rich results. We find that there is a significant relationship between the impact factor and the occurrence of references in libraries. Using a knowledge domain visualization based on co-occurrence measures, we are able to identify two areas of topics within the emerging field of technology-enhanced learning.
MenuMiner: revealing the information architecture of large web sites by analyzing maximal cliques BIBAFull-Text 1025-1034
  Matthias Keller; Martin Nussbaumer
The foundation of almost all web sites' information architecture is a hierarchical content organization. Thus information architects put much effort in designing taxonomies that structure the content in a comprehensible and sound way. The taxonomies are obvious to human users from the site's system of main and sub menus. But current methods of web structure mining are not able to extract these central aspects of the information architecture. This is because they cannot interpret the visual encoding to recognize menus and their rank as humans do. In this paper we show that a web site's main navigation system can not only be distinguished by visual features but also by certain structural characteristics of the HTML tree and the web graph. We have developed a reliable and scalable solution that solves the problem of extracting menus for mining the information architecture. The novel MenuMiner-algorithm allows retrieving the original content organization of large-scale web sites. These data are very valuable for many applications, e.g. the presentation of search results. In an experiment we applied the method for finding site boundaries within a large domain. The evaluation showed that the method reliably delivers menus and site boundaries where other current approaches fail.
Large scale microblog mining using distributed MB-LDA BIBAFull-Text 1035-1042
  Chenyi Zhang; Jianling Sun
In the information explosion era, large scale data processing and mining is a hot issue. As microblogs grow more popular, microblog services have become information providers on a web scale, so research on microblogs has begun to focus more on content mining than solely on the analysis of user relationships, as before. Although traditional text mining methods have been studied well, no algorithm is designed specially for microblog data, which contain structured information on social networks besides plain text. In this paper, we introduce a novel probabilistic generative model, MicroBlog-Latent Dirichlet Allocation (MB-LDA), which takes both contactor relevance relation and document relevance relation into consideration to improve topic mining in microblogs. Through Gibbs sampling for approximate inference of our model, MB-LDA can discover not only the topics of microblogs, but also the topics that contactors focus on. When faced with large datasets, traditional techniques on a single node become less practical within limited resources. So we present distributed MB-LDA in the MapReduce framework in order to process large scale microblogs with high scalability. Furthermore, we apply a performance model to optimize the execution time by tuning the number of mappers and reducers. Experimental results on an actual dataset show that MB-LDA outperforms the LDA baseline and that distributed MB-LDA offers an effective solution to topic mining for large scale microblogs.
k-Centralities: local approximations of global measures based on shortest paths BIBAFull-Text 1043-1050
  Jürgen Pfeffer; Kathleen M. Carley
Many centrality measures have been developed to analyze different aspects of importance. Some of the most popular centrality measures (e.g. betweenness centrality, closeness centrality) are based on the calculation of shortest paths. This characteristic limits the applicability of these measures for larger networks. In this article we elaborate on the idea of bounded-distance shortest path calculations. We state criteria for k-centrality measures and introduce one algorithm for calculating both betweenness- and closeness-based centralities. We also present normalizations for these measures. We show that k-centrality measures are good approximations of the corresponding centrality measures while achieving a tremendous gain in calculation time and having linear calculation complexity O(n) for networks with constant average degree. This allows researchers to approximate centrality measures based on shortest paths for networks with millions of nodes or with high frequency in dynamically changing networks.
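The idea of bounded-distance shortest paths can be sketched as a breadth-first search that stops expanding at depth k. The closeness-style score below (vertices reached within k hops divided by the sum of their distances) is an assumption made purely for illustration; the abstract does not give the paper's definitions or normalizations, and the names `bounded_bfs` and `k_closeness` are hypothetical.

```python
from collections import deque


def bounded_bfs(adj, source, k):
    """Return shortest-path distances from source to every vertex
    that lies at most k hops away (unweighted, undirected graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond the distance bound
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def k_closeness(adj, k):
    """Illustrative local closeness: reached vertices / total distance.
    This normalisation is an assumption, not the paper's."""
    scores = {}
    for s in adj:
        dist = bounded_bfs(adj, s, k)
        total = sum(dist.values())
        reached = len(dist) - 1  # exclude the source itself
        scores[s] = reached / total if total > 0 else 0.0
    return scores


# Hypothetical toy graph: a path 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
local_scores = k_closeness(path, k=2)
```

Because each search touches only the k-hop neighbourhood of its source, the work per vertex stays bounded when the average degree is constant, which is the intuition behind the linear complexity mentioned in the abstract.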
Building a role search engine for social media BIBAFull-Text 1051-1060
  Vanesa Junquero-Trabado; David Dominguez-Sal
A social role is a set of characteristics that describe the behavior of individuals and the interactions between them within a social context. In this paper, we describe the architecture of a search engine for detecting roles in a social network. Our approach, based on indexed clusters, gives the user the possibility to define the roles interactively during a search session and retrieve the users for that role in milliseconds. We found that role selection strategies based on selecting people deviating from the average standards provide flexible query expressions and high quality results.

SWCS'12 workshop 9

Wikidata: a new platform for collaborative data collection BIBAFull-Text 1063-1064
  Denny Vrandecic
This year, Wikimedia starts to build a new platform for the collaborative acquisition and maintenance of structured data: Wikidata. Wikidata's prime purpose is to be used within the other Wikimedia projects, like Wikipedia, to provide well-maintained, high-quality data. The nature and requirements of the Wikimedia projects call for the development of a few novel, or at least unusual, features for Wikidata: Wikidata will be a secondary database, i.e. instead of containing facts it will contain references for facts. It will be fully internationalized. It will contain inconsistent and contradictory facts, in order to represent the diversity of knowledge about a given entity.
User assistance for collaborative knowledge construction BIBAFull-Text 1065-1074
  Pierre-Antoine Champin; Amélie Cordier; Elise Lavoué; Marie Lefevre; Hala Skaf-Molli
In this paper, we study tools for providing assistance to users in distributed spaces. More precisely, we focus on the activity of collaborative construction of knowledge, supported by a network of distributed semantic wikis. Assisting the users in such an activity is made necessary mainly by two factors: the inherent complexity of the tools supporting that activity, and the collaborative nature of the activity, involving many interactions between users. In this paper we focus on the second aspect. For this, we propose to build an assistance tool based on users interaction traces. This tool will provide a contextualized assistance by leveraging the valuable knowledge contained in traces. We discuss the issue of assistance in our context and we show the different types of assistance that we intend to provide through three scenarios. We highlight research questions raised by this preliminary study.
Knowledge continuous integration process (K-CIP) BIBAFull-Text 1075-1082
  Hala Skaf-Molli; Emmanuel Desmontils; Emmanuel Nauer; Gérôme Canals; Amélie Cordier; Marie Lefevre; Pascal Molli; Yannick Toussaint
The social semantic web creates read/write spaces where users and smart agents collaborate to produce knowledge readable by humans and machines. An important issue concerns ontology evolution and evaluation in man-machine collaboration. How can a change be performed on ontologies in a social semantic space that currently uses these ontologies through requests? In this paper, we propose to implement a continuous knowledge integration process named K-CIP. We take advantage of man-machine collaboration to transform feedback from people into tests. This paper presents how K-CIP can be deployed to allow fruitful man-machine collaboration in the context of the Wikitaaable system.
Linking justifications in the collaborative semantic web applications BIBAFull-Text 1083-1090
  Rakebul Hasan; Fabien Gandon
Collaborative Semantic Web applications produce ever changing interlinked Semantic Web data. Applications that utilize these data to obtain their results should provide explanations about how the results are obtained in order to ensure the effectiveness and increase the user acceptance of these applications. Justifications providing meta information about why a conclusion has been reached enable generation of such explanations. We present an encoding approach for justifications in a distributed environment focusing on the collaborative platforms. We discuss the usefulness of linking justifications across the Web. We introduce a vocabulary for encoding justifications in a distributed environment and provide examples of our encoding approach.
Synchronizing semantic stores with commutative replicated data types BIBAFull-Text 1091-1096
  Luis Daniel Ibáñez; Hala Skaf-Molli; Pascal Molli; Olivier Corby
Social semantic web technologies have led to huge amounts of data and information being available. The production of knowledge from this information is challenging, and major efforts, like DBpedia, have been made to make it a reality. Linked data provides interconnection between this information, extending the scope of the knowledge production. The knowledge construction between decentralized sources in the web follows a co-evolution scheme, where knowledge is generated collaboratively and continuously. Sources are also autonomous, meaning that they can use and publish only the information they want. Updating sources under these criteria is intimately related to the problem of synchronization and of consistency between all the managed replicas. Recently, a new family of algorithms called Commutative Replicated Data Types has emerged for ensuring eventual consistency in highly dynamic environments. In this paper, we define SU-Set, a CRDT for RDF-Graph that supports SPARQL Update 1.1 operations.
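To illustrate the general idea of a commutative replicated set (not the paper's SU-Set definition, which is not reproduced in this abstract), here is a minimal sketch of a generic observed-remove set over RDF triples. The class name `ORSet`, its methods, and the example triple are assumptions made for illustration. The sketch also assumes that a delete is delivered to a replica only after the inserts it observed.

```python
import uuid


class ORSet:
    """Generic observed-remove set of RDF triples (illustrative, not SU-Set)."""

    def __init__(self):
        # Visible (triple, tag) pairs; every insert gets a fresh unique tag.
        self.pairs = set()

    def contains(self, triple):
        return any(t == triple for t, _ in self.pairs)

    def insert(self, triple):
        # Build the operation locally, apply it, and return it for broadcast.
        op = ("insert", {(triple, uuid.uuid4().hex)})
        self.apply(op)
        return op

    def delete(self, triple):
        # A delete removes only the tags this replica has observed so far.
        observed = {(t, tag) for t, tag in self.pairs if t == triple}
        op = ("delete", observed)
        self.apply(op)
        return op

    def apply(self, op):
        kind, pairs = op
        if kind == "insert":
            self.pairs |= pairs
        else:
            self.pairs -= pairs
```

Because a delete names the exact tags it removes, a concurrent insert of the same triple (with a new tag) survives on every replica regardless of the order in which the two operations arrive.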
Building consensus via a semantic web collaborative space BIBAFull-Text 1097-1106
  George Anadiotis; Konstantinos Kafentzis; Iannis Pavlopoulos; Adam Westerski
In this paper we outline the design and implementation of the eDialogos Consensus process and platform to support wide-scale collaborative decision making. We present the design space and choices made and perform a conceptual alignment of the domains this space entails, based on the use of the eDialogos Consensus ontology as a crystallization point for platform design and implementation as well as interoperability with existing solutions. We also present a metric for calculating agreement on the issues under debate in the platform, incorporating argumentation structure and user feedback.
Improving Wikipedia with DBpedia BIBAFull-Text 1107-1112
  Diego Torres; Pascal Molli; Hala Skaf-Molli; Alicia Diaz
DBpedia is the semantic mirror of Wikipedia. DBpedia extracts information from Wikipedia and stores it in a semantic knowledge base. This semantic feature allows complex semantic queries, which could infer new relations that are missing in Wikipedia. This is an interesting source of knowledge to increase Wikipedia content. But, what is the best way to add these new relations following the Wikipedia conventions? In this paper, we propose a path indexing algorithm (PIA) which takes the resulting set of a DBpedia query and discovers the best representative path in Wikipedia. We evaluate the algorithm with real data sets from DBpedia.
Man-machine collaboration to acquire cooking adaptation knowledge for the TAAABLE case-based reasoning system BIBAFull-Text 1113-1120
  Amélie Cordier; Emmanuelle Gaillard; Emmanuel Nauer
This paper shows how humans and machines can better collaborate to acquire adaptation knowledge (AK) in the framework of a case-based reasoning (CBR) system whose knowledge is encoded in a semantic wiki. Automatic processes like the CBR reasoning process itself, or specific tools for acquiring AK are integrated as wiki extensions. These tools and processes are combined on purpose to collect AK. Users are at the center of our approach, as they are in a classical wiki, but they will now benefit from automatic tools for helping them to feed the wiki. In particular, the CBR system, which is currently only a consumer for the knowledge encoded in the semantic wiki, will also be used for producing knowledge for the wiki. A use case in the domain of cooking is given to exemplify the man-machine collaboration.
Community: issues, definitions, and operationalization on the web BIBAFull-Text 1121-1130
  Guo Zhang; Elin K. Jacob
This paper addresses the concepts of community and online community and discusses the physical, functional, and symbolic characteristics of a community that have formed the basis for traditional definitions. It applies a four-dimensional perspective of space and place (i.e., shape, structure, context, and experience) as a framework for refining the definition of traditional offline communities and for developing a definition of online communities that can be effectively operationalized. The methods and quantitative measures of social network analysis are proposed as appropriate tools for investigating the nature and function of communities because they can be used to quantify the typically subjective social phenomena generally associated with communities.

MSND'12 workshop 10

Business session "social media and news" BIBAFull-Text 1135-1136
  Jochen Spangenberg
The workshop also includes a business section that will focus on aspects of Social Media in the News domain. Panelists with expertise in innovation management, news provision, journalism and market developments will discuss some of the challenges of and opportunities for the news sector with regards to Social Media. This part of the workshop is organized and brought to you by the SocialSensor project.
Graph embedding on spheres and its application to visualization of information diffusion data BIBAFull-Text 1137-1144
  Kazumi Saito; Masahiro Kimura; Kouzou Ohara; Hiroshi Motoda
We address the problem of visualizing structure of undirected graphs that have a value associated with each node into a K-dimensional Euclidean space in such a way that 1) the length of the point vector in this space is equal to the value assigned to the node and 2) nodes that are connected are placed as close as possible to each other in the space and nodes not connected are placed as far apart as possible from each other. The problem is reduced to K-dimensional spherical embedding with a proper objective function. The existing spherical embedding method can handle only a bipartite graph and cannot be used for this purpose. The other graph embedding methods, e.g., multi-dimensional scaling, spring force embedding methods, etc., cannot handle the value constraint and thus are not applicable, either. We propose a very efficient algorithm based on a power iteration that employs the double-centering operations. We apply the method to visualize the information diffusion process over a social network by assigning the node activation time to the node value, and compare the results with the other visualization methods. The results applied to four real world networks indicate that the proposed method can visualize the diffusion dynamics which the other methods cannot and the role of important nodes, e.g. mediator, more naturally than the other methods.
A predictive model for the temporal dynamics of information diffusion in online social networks BIBAFull-Text 1145-1152
  Adrien Guille; Hakim Hacid
Today, online social networks have become powerful tools for the spread of information. They facilitate the rapid and large-scale propagation of content, and the consequences of a piece of information -- whether it is favorable or not to someone, false or true -- can then take on considerable proportions. Therefore it is essential to provide means to analyze the phenomenon of information dissemination in such networks. Many recent studies have addressed the modeling of the process of information diffusion, from a topological point of view and in a theoretical perspective, but we still know little about the factors involved in it. With the assumption that the dynamics of the spreading process at the macroscopic level is explained by interactions at the microscopic level between pairs of users and the topology of their interconnections, we propose a practical solution which aims to predict the temporal dynamics of diffusion in social networks. Our approach is based on machine learning techniques and the inference of time-dependent diffusion probabilities from a multidimensional analysis of individual behaviors. Experimental results on a real dataset extracted from Twitter show the interest and effectiveness of the proposed approach as well as interesting recommendations for future investigation.
Targeting online communities to maximise information diffusion BIBAFull-Text 1153-1160
  Václav Belák; Samantha Lam; Conor Hayes
In recent years, many companies have started to utilise online social communities as a means of communicating with and targeting their employees and customers. Such online communities include discussion fora which are driven by the conversational activity of users. For example, users may respond to certain ideas as a result of the influence of their neighbours in the underlying social network. We analyse such influence to target communities rather than individual actors because information is usually shared with the community and not just with individual users. In this paper, we study information diffusion across communities and argue that some communities are more suitable for maximising spread than others. In order to achieve this, we develop a set of novel measures for cross-community influence, and show that targeting based on these measures outperforms other targeting strategies on 51 weeks of data of the largest Irish online discussion system, Boards.ie.
Identifying communicator roles in Twitter BIBAFull-Text 1161-1168
  Ramine Tinati; Leslie Carr; Wendy Hall; Jonny Bentwood
Twitter has redefined the way social activities can be coordinated; used for mobilizing people during natural disasters, studying health epidemics, and recently, as a communication platform during social and political change. As a large scale system, the volume of data transmitted per day presents Twitter users with a problem: how can valuable content be distilled from the back chatter, how can the providers of valuable information be promoted, and ultimately how can influential individuals be identified? To tackle this, we have developed a model based upon the Twitter message exchange which enables us to analyze conversations around specific topics and identify key players in a conversation. A working implementation of the model helps categorize Twitter users by specific roles based on their dynamic communication behavior rather than an analysis of their static friendship network. This provides a method of identifying users who are potentially producers or distributors of valuable knowledge.
File diffusion in a dynamic peer-to-peer network BIBAFull-Text 1169-1172
  Alice Albano; Jean-Loup Guillaume; Bénédicte Le Grand
Many studies have been made on diffusion in the field of epidemiology, and in the last few years, the development of social networking has induced new types of diffusion. In this paper, we focus on file diffusion on a peer-to-peer dynamic network using the eDonkey protocol. On this network, we observe a linear behavior of the actual file diffusion. This result is interesting, because most diffusion models exhibit exponential behaviors. In this paper, we propose a new model of diffusion, based on the SI (Susceptible / Infected) model, which produces results close to the linear behavior of the observed diffusion. We then justify the linearity of this model, and we study its behavior in more detail.
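For context on why linear growth is surprising, the following minimal sketch simulates the standard well-mixed SI model in discrete time, where the number of infected nodes grows roughly exponentially while almost everyone is still susceptible. This is the textbook baseline, not the modified model proposed in the paper. The function name `simulate_si` and the parameter values are assumptions chosen for illustration.

```python
def simulate_si(n, i0, beta, steps):
    """Discrete-time SI model on a well-mixed population of size n.

    Each step adds beta * S * I / n new infections, where S = n - I.
    Returns the infected count at every step, starting with i0.
    """
    infected = [float(i0)]
    for _ in range(steps):
        i = infected[-1]
        s = n - i
        infected.append(i + beta * s * i / n)
    return infected


curve = simulate_si(n=1_000_000, i0=1, beta=0.5, steps=10)
# Early on, S is close to n, so each step multiplies I by about (1 + beta).
growth_factors = [curve[k + 1] / curve[k] for k in range(len(curve) - 1)]
```

With these assumed parameters the per-step growth factor stays near 1.5 during the early phase, which is the exponential behavior that the observed eDonkey diffusion does not show.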
Community cores in evolving networks BIBAFull-Text 1173-1180
  Massoud Seifi; Jean-Loup Guillaume
Community structure is a key property of complex networks. Many algorithms have been proposed to automatically detect communities in static networks but few studies have considered the detection and tracking of communities in an evolving network. Tracking the evolution of a given community over time requires a clustering algorithm that produces stable clusters. However, most community detection algorithms are very unstable and therefore unusable for evolving networks. In this paper, we apply the methodology proposed in [seifi2012] to detect what we call community cores in evolving networks. We show that cores are much more stable than "classical" communities and that we can overcome the disadvantages of the stabilized methods.
Watch me playing, i am a professional: a first study on video game live streaming BIBAFull-Text 1181-1188
  Mehdi Kaytoue; Arlei Silva; Loïc Cerf; Wagner Meira, Jr.; Chedy Raïssi
"Electronic-sport" (E-Sport) is now established as a new entertainment genre. More and more players enjoy streaming their games, which attract even more viewers. In fact, in a recent social study, casual players were found to prefer watching professional gamers rather than playing the game themselves. Within this context, advertising provides a significant source of revenue to the professional players, the casters (displaying other people's games) and the game streaming platforms. For this paper, we crawled, during more than 100 days, the most popular among such specialized platforms: Twitch.tv. Thanks to these gigabytes of data, we propose a first characterization of a new Web community, and we show, among other results, that the number of viewers of a streaming session evolves in a predictable way, that audience peaks of a game are explainable and that a Condorcet method can be used to sensibly rank the streamers by popularity. Last but not least, we hope that this paper will bring to light the study of E-Sport and its growing community. They indeed deserve the attention of industrial partners (for the large amount of money involved) and researchers (for interesting problems in social network dynamics, personalized recommendation, sentiment analysis, etc.).
Supervised rank aggregation approach for link prediction in complex networks BIBAFull-Text 1189-1196
  Manisha Pujari; Rushed Kanawati
In this paper we propose a new topological approach for link prediction in dynamic complex networks. The proposed approach applies a supervised rank aggregation method. This functions as follows: first we rank the list of unlinked nodes in a network at instant t according to different topological measures (node characteristics aggregation, node neighborhood based measures, distance based measures, etc.). Each measure provides its own rank. Observing the network at instant t+1 where some new links appear, we weight each topological measure according to its performance in predicting these observed new links. These learned weights are then used in a modified version of classical computational social choice algorithms (such as Borda, Kemeny, etc.) in order to have a model for predicting new links. We show the effectiveness of this approach through several experiments applied to co-authorship networks extracted from the DBLP bibliographical database. The results we obtain are also compared with the outcome of classical supervised machine learning based link prediction approaches applied to the same datasets.
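The aggregation step described above can be sketched as a weighted Borda count. This is a simplified illustration, not the paper's exact modification: the function name `weighted_borda`, the measure names, the candidate node pairs, and the weights are all hypothetical, and the weights are taken as given rather than learned from the network at instant t+1.

```python
def weighted_borda(rankings, weights):
    """Aggregate several rankings of candidate links with a weighted Borda count.

    rankings: dict mapping a measure name to a list of candidates, best first.
    weights:  dict mapping a measure name to its (assumed, pre-learned) weight.
    Returns the candidates sorted by decreasing aggregated score.
    """
    scores = {}
    for name, ranking in rankings.items():
        m = len(ranking)
        for pos, item in enumerate(ranking):
            # Borda points: m - 1 for the top item, 0 for the last one.
            scores[item] = scores.get(item, 0.0) + weights[name] * (m - 1 - pos)
    # Ties are broken by the candidate itself to keep the output deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


rankings = {
    "common_neighbors": [("a", "b"), ("a", "c"), ("b", "c")],
    "shortest_path": [("b", "c"), ("a", "b"), ("a", "c")],
}
weights = {"common_neighbors": 0.8, "shortest_path": 0.2}
predicted = weighted_borda(rankings, weights)
```

Giving a higher weight to a measure that predicted past links well lets its ordering dominate the aggregate, which is the intuition behind learning the weights from the observed new links.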
Predicting information diffusion on social networks with partial knowledge BIBAFull-Text 1197-1204
  Anis Najar; Ludovic Denoyer; Patrick Gallinari
Models of information diffusion and propagation over large social media usually rely on a Closed World Assumption: information can only propagate along the network's relational structure, it cannot come from external sources, and the network structure is assumed to be fully known by the model. These assumptions are unrealistic for many propagation processes extracted from social websites. We address the problem of predicting information propagation when the network diffusion structure is unknown and without making any closed world assumption. Instead of modeling a diffusion process, we propose to directly predict the final propagation state of the information over a whole user set. We describe a general model that is able to learn to predict which users are the most likely to be contaminated by the information, given an initial state of the network. Different instances are proposed and evaluated on artificial datasets.
Collective attention and the dynamics of group deals BIBAFull-Text 1205-1212
  Mao Ye; Thomas Sandholm; Chunyan Wang; Christina Aperjis; Bernardo A. Huberman
We present a study of the group purchasing behavior of daily deals in Groupon and LivingSocial and formulate a predictive dynamic model of collective attention for group buying behavior. Using large data sets from both Groupon and LivingSocial we show how the model is able to predict the success of group deals as a function of time.
   We find that Groupon deals are easier to predict accurately earlier in the deal lifecycle than LivingSocial deals due to the total number of deal purchases saturating quicker. One possible explanation for this is that the incentive to socially propagate a deal is based on an individual threshold in LivingSocial, whereas in Groupon it is based on a collective threshold which is reached very early. Furthermore, the personal benefit of propagating a deal is greater in LivingSocial.
Social networking trends and dynamics detection via a cloud-based framework design BIBAFull-Text 1213-1220
  Athena Vakali; Maria Giatsoglou; Stefanos Antaris
Social networking media generate huge content streams, which both academia and developers leverage in their efforts to provide unbiased, powerful indications of users' opinions and interests. Here, we present Cloud4Trends, a framework for collecting and analyzing user generated content through microblogging and blogging applications, both separately and jointly, focused on certain geographical areas, towards the identification of the most significant topics using trend analysis techniques. The cloud computing paradigm appears to offer a significant benefit in making such applications viable, considering that the massive data sizes produced daily impose the need for a scalable and powerful infrastructure. Cloud4Trends constitutes an efficient Cloud-based approach to solving the online trend tracking problem based on Web 2.0 sources. A detailed system architecture model is also proposed, which is largely based on a set of service modules developed within the VENUS-C research project to facilitate the deployment of research applications on Cloud infrastructures.
Effects of the recession on public mood in the UK BIBAFull-Text 1221-1226
  Thomas Lansdall-Welfare; Vasileios Lampos; Nello Cristianini
Large scale analysis of social media content allows for real time discovery of macro-scale patterns in public opinion and sentiment. In this paper we analyse a collection of 484 million tweets generated by more than 9.8 million users from the United Kingdom over the past 31 months, a period marked by economic downturn and some social tensions. Our findings, besides corroborating our choice of method for the detection of public mood, also present intriguing patterns that can be explained in terms of events and social changes. On the one hand, the time series we obtain show that periodic events such as Christmas and Halloween evoke similar mood patterns every year. On the other hand, we see that a significant increase in negative mood indicators coincides with the announcement of the cuts to public spending by the government, and that this effect is still lasting. We also detect events such as the riots of summer 2011, as well as a possible calming effect coinciding with the run up to the royal wedding.
Improving news ranking by community tweets BIBAFull-Text 1227-1232
  Xin Shuai; Xiaozhong Liu; Johan Bollen
Users frequently express their information needs by means of short and general queries that are difficult for ranking algorithms to interpret correctly. However, users' social contexts can offer important additional information about their information needs which can be leveraged by ranking algorithms to provide augmented, personalized results. Existing methods mostly rely on users' individual behavioral data such as clickstream and log data, but as a result suffer from data sparsity and privacy issues. Here, we propose a Community Tweets Voting Model (CTVM) to re-rank Google and Yahoo news search results on the basis of open, large-scale Twitter community data. Experimental results show that CTVM outperforms baseline rankings from Google and Yahoo for certain online communities. We propose an application scenario of CTVM and provide an agenda for further research.
TwitterEcho: a distributed focused crawler to support open research with Twitter data BIBAFull-Text 1233-1240
  Matko Bošnjak; Eduardo Oliveira; José Martins; Eduarda Mendes Rodrigues; Luís Sarmento
Modern social network analysis relies on vast quantities of data to infer new knowledge about human relations and communication. In this paper we describe TwitterEcho, an open source Twitter crawler for supporting this kind of research, which is characterized by a modular distributed architecture. Our crawler enables researchers to continuously collect data from particular user communities, while respecting Twitter's imposed limits. We present the core modules of the crawling server, some of which were specifically designed to focus the crawl on the Portuguese Twittosphere. Additional modules can be easily implemented, thus changing the focus to a different community. Our evaluation of the system shows high crawling performance and coverage.
"Making sense of it all": an attempt to aid journalists in analysing and filtering user generated content BIBAFull-Text 1241-1246
  Sotiris Diplaris; Symeon Papadopoulos; Ioannis Kompatsiaris; Nicolaus Heise; Jochen Spangenberg; Nic Newman; Hakim Hacid
This position paper explores how journalists can embrace new ways of content provision and authoring, by aggregating and analyzing content gathered from Social Media. Current challenges in the news media industry are reviewed and a new system for capturing emerging knowledge from Social Media is described. Novel features that assist professional journalists in processing large amounts of Social Media information are presented with reference to the technical requirements of the system. First implementation steps are also discussed, particularly focusing on event detection and user influence identification.