| Real-time recommendation of diverse related articles | | BIBA | Full-Text | 1-12 | |
| Sofiane Abbar; Sihem Amer-Yahia; Piotr Indyk; Sepideh Mahabadi | |||
| News articles typically drive a lot of traffic in the form of comments posted by users on a news site. Such user-generated content tends to carry additional information such as entities and sentiment. In general, when articles are recommended to users, only popularity (e.g., most shared and most commented), recency, and sometimes (manual) editors' picks (based on daily hot topics), are considered. We formalize a novel recommendation problem where the goal is to find the closest most diverse articles to the one the user is currently browsing. Our diversity measure incorporates entities and sentiment extracted from comments. Given the real-time nature of our recommendations, we explore the applicability of nearest neighbor algorithms to solve the problem. Our user study on real opinion articles from aljazeera.net and reuters.com validates the use of entities and sentiment extracted from articles and their comments to achieve news diversity when compared to content-based diversity. Finally, our performance experiments show the real-time feasibility of our solution. | |||
| Multi-label learning with millions of labels: recommending advertiser bid phrases for web pages | | BIBA | Full-Text | 13-24 | |
| Rahul Agrawal; Archit Gupta; Yashoteja Prabhu; Manik Varma | |||
| Recommending phrases from web pages for advertisers to bid on against search engine queries is an important research problem with direct commercial impact. Most approaches have found it infeasible to determine the relevance of all possible queries to a given ad landing page and have focussed on making recommendations from a small set of phrases extracted (and expanded) from the page using NLP and ranking based techniques. In this paper, we eschew this paradigm, and demonstrate that it is possible to efficiently predict the relevant subset of queries from a large set of monetizable ones by posing the problem as a multi-label learning task with each query being represented by a separate label.
We develop Multi-label Random Forests to tackle problems with millions of labels. Our proposed classifier has prediction costs that are logarithmic in the number of labels and can make predictions in a few milliseconds using 10 Gb of RAM. We demonstrate that it is possible to generate training data for our classifier automatically from click logs without any human annotation or intervention. We train our classifier on tens of millions of labels, features and training points in less than two days on a thousand node cluster. We develop a sparse semi-supervised multi-label learning formulation to deal with training set biases and noisy labels harvested automatically from the click logs. This formulation is used to infer a belief in the state of each label for each training ad and the random forest classifier is extended to train on these beliefs rather than the given labels. Experiments reveal significant gains over ranking and NLP based techniques on a large test set of 5 million ads using multiple metrics. | |||
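A rough illustration of the logarithmic prediction cost mentioned above (not the authors' Multi-label Random Forest implementation): a single decision tree whose leaves store sparse label distributions, so predicting the top labels only walks one root-to-leaf path. All names and the toy data below are hypothetical.

```python
from collections import Counter

class Node:
    """Internal nodes split on a sparse feature; leaves hold a label histogram."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, labels=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        self.labels = labels  # Counter over label ids, only set at leaves

def predict_top_labels(node, x, k=5):
    # Walk the tree: cost is proportional to tree depth, not to the number of labels.
    while node.labels is None:
        node = node.left if x.get(node.feature, 0.0) <= node.threshold else node.right
    return [label for label, _ in node.labels.most_common(k)]

# Toy usage: a depth-1 tree over a sparse feature vector.
leaf_travel = Node(labels=Counter({"cheap flights": 9, "hotel deals": 4}))
leaf_sport = Node(labels=Counter({"running shoes": 7, "marathon training": 2}))
root = Node(feature="travel_score", threshold=0.5, left=leaf_sport, right=leaf_travel)
print(predict_top_labels(root, {"travel_score": 0.9}, k=2))  # ['cheap flights', 'hotel deals']
```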
| Hierarchical geographical modeling of user locations from social media posts | | BIBA | Full-Text | 25-36 | |
| Amr Ahmed; Liangjie Hong; Alexander J. Smola | |||
| With the availability of cheap location sensors, geotagging of messages in online social networks is proliferating. For instance, Twitter, Facebook, Foursquare, and Google+ provide these services both explicitly by letting users choose their location or implicitly via a sensor. This paper presents an integrated generative model of location and message content. That is, we provide a model for combining distributions over locations, topics, and over user characteristics, both in terms of location and in terms of their content preferences. Unlike previous work which modeled data in a flat pre-defined representation, our model automatically infers both the hierarchical structure over content and over the size and position of geographical locations. This affords significantly higher accuracy -- location uncertainty is reduced by 40% relative to the best previous results [21] achieved on location estimation from Tweets.
We achieve this goal by proposing a new statistical model, the nested Chinese Restaurant Franchise (nCRF), a hierarchical model of tree distributions. Much statistical structure is shared between users. That said, each user has his own distribution over interests and places. The use of the nCRF allows us to capture the following effects: (1) We provide a topic model for Tweets; (2) We obtain location specific topics; (3) We infer a latent distribution of locations; (4) We provide a joint hierarchical model of topics and locations; (5) We infer personalized preferences over topics and locations within the above model. In doing so, we are both able to obtain accurate estimates of the location of a user based on his tweets and to obtain a detailed estimate of a geographical language model. | |||
| Distributed large-scale natural graph factorization | | BIBA | Full-Text | 37-48 | |
| Amr Ahmed; Nino Shervashidze; Shravan Narayanamurthy; Vanja Josifovski; Alexander J. Smola | |||
| Natural graphs, such as social networks, email graphs, or instant messaging patterns, have become pervasive through the internet. These graphs are massive, often containing hundreds of millions of nodes and billions of edges. While some theoretical models have been proposed to study such graphs, their analysis is still difficult due to the scale and nature of the data.
We propose a framework for large-scale graph decomposition and inference. To resolve the scale, our framework is distributed so that the data are partitioned over a shared-nothing set of machines. We propose a novel factorization technique that relies on partitioning a graph so as to minimize the number of neighboring vertices rather than edges across partitions. Our decomposition is based on a streaming algorithm. It is network-aware as it adapts to the network topology of the underlying computational hardware. We use local copies of the variables and an efficient asynchronous communication protocol to synchronize the replicated values in order to perform most of the computation without having to incur the cost of network communication. On a graph of 200 million vertices and 10 billion edges, derived from an email communication network, our algorithm retains convergence properties while allowing for almost linear scalability in the number of computers. | |||
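As a hedged sketch of the partitioning idea described above (streaming placement that reduces neighbor replication across partitions), and not the paper's actual algorithm or its network-aware refinements, the following greedy heuristic places each vertex on the machine that already holds most of its neighbors, discounted by a load penalty. All parameter names are illustrative.

```python
from collections import defaultdict

def stream_partition(edges, num_machines, capacity):
    """Greedy streaming placement: favor the machine holding most already-placed
    neighbors of v, penalized by that machine's current load."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    placement, load = {}, defaultdict(int)
    for v in adj:  # vertices arrive in stream order
        def score(m):
            neighbors_here = sum(1 for u in adj[v] if placement.get(u) == m)
            return neighbors_here - load[m] / capacity
        best = max(range(num_machines), key=score)
        placement[v] = best
        load[best] += 1
    return placement

edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
print(stream_partition(edges, num_machines=2, capacity=4))
```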
| A CRM system for social media: challenges and experiences | | BIBA | Full-Text | 49-58 | |
| Jitendra Ajmera; Hyung-Il Ahn; Meena Nagarajan; Ashish Verma; Danish Contractor; Stephen Dill; Matthew Denesuk | |||
| The social Customer Relationship Management (CRM) landscape is attracting significant attention from customers and enterprises alike as a sustainable channel for tracking, managing and improving customer relations. Enterprises are taking a hard look at this open, unmediated platform because the community effect generated on this channel can have a telling effect on their brand image, potential market opportunity and customer loyalty. In this work we present our experiences in building a system that mines conversations on social platforms to identify and prioritize those posts and messages that are relevant to enterprises. The system presented in this work aims to empower an agent or a representative in an enterprise to monitor, track and respond to customer communication while also encouraging community participation. | |||
| Here's my cert, so trust me, maybe?: understanding TLS errors on the web | | BIBA | Full-Text | 59-70 | |
| Devdatta Akhawe; Bernhard Amann; Matthias Vallentin; Robin Sommer | |||
| When browsers report TLS errors, they cannot distinguish between attacks and harmless server misconfigurations; hence they leave it to the user to decide whether continuing is safe. However, actual attacks remain rare. As a result, users quickly become used to "false positives" that deplete their attention span, making it unlikely that they will pay sufficient scrutiny when a real attack comes along. Consequently, browser vendors should aim to minimize the number of low-risk warnings they report. To guide that process, we perform a large-scale measurement study of common TLS warnings. Using a set of passive network monitors located at different sites, we identify the prevalence of warnings for a total population of about 300,000 users over a nine-month period. We identify low-risk scenarios that consume a large chunk of the user attention budget and make concrete recommendations to browser vendors that will help maintain user attention in high-risk situations. We study the impact on end users with a data set much larger in scale than the data sets used in previous TLS measurement studies. A key novelty of our approach involves the use of internal browser code instead of generic TLS libraries for analysis, providing more accurate and representative results. | |||
| Towards a robust modeling of temporal interest change patterns for behavioral targeting | | BIBA | Full-Text | 71-82 | |
| Mohamed Aly; Sandeep Pandey; Vanja Josifovski; Kunal Punera | |||
| Modern web-scale behavioral targeting platforms leverage historical activity of billions of users to predict user interests and inclinations, and consequently future activities. Future activities of particular interest involve purchases or transactions, and are referred to as conversions. Unlike ad-clicks, conversions directly translate to advertiser revenue, and thus provide a very concrete metric for return on advertising investment. A typical behavioral targeting system faces two main challenges: the web-scale amounts of user histories to process on a daily basis, and the relative sparsity of conversions (compared to clicks in a traditional setting). These challenges call for the generation of effective and efficient user profiles. Most existing works use the historical intensity of a user's interest in various topics to model future interest. In this paper we explore how the change in user behavior can be used to predict future actions and show how it complements the traditional models of decaying interest and action recency to build a complete picture of the user's interests and better predict conversions. Our evaluation over a real-world set of campaigns indicates that the combination of change of interest, decaying intensity, and action recency helps in: 1) achieving significant improvements in conversion optimization over traditional baselines, 2) substantially improving the targeting efficiency for campaigns with highly sparse conversions, and 3) greatly reducing the overall history sizes used in targeting. Furthermore, our techniques have been deployed to production and delivered a substantial improvement in targeting performance while imposing a negligible overhead in terms of overall platform running time. | |||
| The anatomy of LDNS clusters: findings and implications for web content delivery | | BIBA | Full-Text | 83-94 | |
| Hussein A. Alzoubi; Michael Rabinovich; Oliver Spatscheck | |||
| We present a large-scale measurement of clusters of hosts sharing the same local DNS servers. We analyze properties of these "LDNS clusters" from the perspective of content delivery networks, which commonly use DNS for load distribution. We found not only that LDNS clusters differ widely in terms of their size and geographical compactness, but also that the largest clusters are actually extremely compact. This suggests potential benefits of a load distribution strategy with nuanced treatment of different LDNS clusters based on the combination of their size and compactness. We further observed interesting variations in LDNS setups, including wide use of "LDNS pools" (which, as we explain in the paper, are different from setups where end-hosts simply utilize multiple resolvers). | |||
| Steering user behavior with badges | | BIBA | Full-Text | 95-106 | |
| Ashton Anderson; Daniel Huttenlocher; Jon Kleinberg; Jure Leskovec | |||
| An increasingly common feature of online communities and social media sites is a mechanism for rewarding user achievements based on a system of badges. Badges are given to users for particular contributions to a site, such as performing a certain number of actions of a given type. They have been employed in many domains, including news sites like the Huffington Post, educational sites like Khan Academy, and knowledge-creation sites like Wikipedia and Stack Overflow. At the most basic level, badges serve as a summary of a user's key accomplishments; however, experience with these sites also shows that users will put in non-trivial amounts of work to achieve particular badges, and as such, badges can act as powerful incentives. Thus far, however, the incentive structures created by badges have not been well understood, making it difficult to deploy badges with an eye toward the incentives they are likely to create.
In this paper, we study how badges can influence and steer user behavior on a site -- leading both to increased participation and to changes in the mix of activities a user pursues on the site. We introduce a formal model for reasoning about user behavior in the presence of badges, and in particular for analyzing the ways in which badges can steer users to change their behavior. To evaluate the main predictions of our model, we study the use of badges and their effects on the widely used Stack Overflow question-answering site, and find evidence that their badges steer behavior in ways closely consistent with the predictions of our model. Finally, we investigate the problem of how to optimally place badges in order to induce particular user behaviors. Several robust design principles emerge from our framework that could potentially aid in the design of incentives for a broad range of sites. | |||
| Cascading tree sheets and recombinant HTML: better encapsulation and retargeting of web content | | BIBA | Full-Text | 107-118 | |
| Edward O. Benson; David R. Karger | |||
| Cascading Style Sheets (CSS) took a valuable step towards separating web content from presentation. But HTML pages still contain large amounts of "design scaffolding" needed to hierarchically layer content for proper presentation. This paper presents Cascading Tree Sheets (CTS), a CSS-like language for separating this presentational HTML from real content. With CTS, authors can use standard CSS selectors to describe how to graft presentational scaffolding onto their pure-content HTML. This improved separation of content from presentation enables even naive authors to incorporate rich layouts (including interactive Javascript) into their own pages simply by linking to a tree sheet and adding some class names to their HTML. | |||
| CopyCatch: stopping group attacks by spotting lockstep behavior in social networks | | BIBA | Full-Text | 119-130 | |
| Alex Beutel; Wanhong Xu; Venkatesan Guruswami; Christopher Palow; Christos Faloutsos | |||
| How can web services that depend on user generated content discern fraudulent input by spammers from legitimate input? In this paper we focus on the social network Facebook and the problem of discerning ill-gotten Page Likes, made by spammers hoping to turn a profit, from legitimate Page Likes. Our method, which we refer to as CopyCatch, detects lockstep Page Like patterns on Facebook by analyzing only the social graph between users and Pages and the times at which the edges in the graph (the Likes) were created. We offer the following contributions: (1) We give a novel problem formulation, with a simple concrete definition of suspicious behavior in terms of graph structure and edge constraints. (2) We offer two algorithms to find such suspicious lockstep behavior -- one provably-convergent iterative algorithm and one approximate, scalable MapReduce implementation. (3) We show that our method severely limits "greedy attacks" and analyze the bounds from the application of the Zarankiewicz problem to our setting. Finally, we demonstrate and discuss the effectiveness of CopyCatch at Facebook and on synthetic data, as well as potential extensions to anomaly detection problems in other domains. CopyCatch is actively in use at Facebook, searching for attacks on Facebook's social graph of over a billion users, many millions of Pages, and billions of Page Likes. | |||
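The following is a minimal, brute-force check of the lockstep condition the abstract describes (a group of users Liking the same Pages within a narrow time window); it is not the provably-convergent iterative algorithm or the MapReduce implementation, and the timestamps and window are toy values.

```python
def is_lockstep(like_times, users, pages, window_hours=2.0):
    """like_times: dict (user, page) -> Like timestamp in hours.
    True iff every user Liked every page and, for each page, all those Likes
    fall within `window_hours` of each other."""
    for page in pages:
        times = []
        for user in users:
            t = like_times.get((user, page))
            if t is None:            # someone never Liked this page
                return False
            times.append(t)
        if max(times) - min(times) > window_hours:
            return False
    return True

likes = {("u1", "p1"): 10.0, ("u2", "p1"): 10.5, ("u3", "p1"): 11.0,
         ("u1", "p2"): 40.2, ("u2", "p2"): 40.9, ("u3", "p2"): 41.1}
print(is_lockstep(likes, ["u1", "u2", "u3"], ["p1", "p2"]))  # True
```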
| Inferring the demographics of search users: social data meets search queries | | BIBA | Full-Text | 131-140 | |
| Bin Bi; Milad Shokouhi; Michal Kosinski; Thore Graepel | |||
| Knowing users' views and demographic traits offers a great potential for personalizing web search results or related services such as query suggestion and query completion. Such signals, however, are often only available for a small fraction of search users, namely those who log in with their social network account and allow its use for personalization of search results. In this paper, we offer a solution to this problem by showing how user demographic traits such as age and gender, and even political and religious views, can be efficiently and accurately inferred based on their search query histories. This is accomplished in two steps; we first train predictive models based on the publicly available myPersonality dataset containing users' Facebook Likes and their demographic information. We then match Facebook Likes with search queries using Open Directory Project categories. Finally, we apply the model trained on Facebook Likes to large-scale query logs of a commercial search engine while explicitly taking into account the difference between the traits distribution in both datasets. We find that the accuracies of classifying age and gender, expressed by the area under the ROC curve (AUC), are 77% and 84% respectively for predictions based on Facebook Likes, and only degrade to 74% and 80% when based on search queries. On a US state-by-state basis we find a Pearson correlation of 0.72 for political views between the predicted scores and Gallup data, and 0.54 for affiliation with Judaism between predicted scores and data from the US Religious Landscape Survey. We conclude that it is indeed feasible to infer important demographic data of users from their query history based on labelled Likes data and believe that this approach could provide valuable information for personalization and monetization even in the absence of demographic data. | |||
| Strategyproof mechanisms for competitive influence in networks | | BIBA | Full-Text | 141-150 | |
| Allan Borodin; Mark Braverman; Brendan Lucier; Joel Oren | |||
| Motivated by applications to word-of-mouth advertising, we consider a game-theoretic scenario in which competing advertisers want to target initial adopters in a social network. Each advertiser wishes to maximize the resulting cascade of influence, modeled by a general network diffusion process. However, competition between products may adversely impact the rate of adoption for any given firm. The resulting framework gives rise to complex preferences that depend on the specifics of the stochastic diffusion model and the network topology.
We study this model from the perspective of a central mechanism, such as a social networking platform, that can optimize seed placement as a service for the advertisers. We ask: given the reported demands of the competing firms, how should a mechanism choose seeds to maximize overall efficiency? Beyond the algorithmic problem, competition raises issues of strategic behaviour: rational agents should not be incentivized to underreport their budget demands. We show that when there are two players, the social welfare can be 2-approximated by a polynomial-time strategyproof mechanism. Our mechanism is defined recursively, randomizing the order in which advertisers are allocated seeds according to a particular greedy method. For three or more players, we demonstrate that under additional assumptions (satisfied by many existing models of influence spread) there exists a simpler strategyproof e/(e-1)-approximation mechanism; notably, this second mechanism is not necessarily strategyproof when there are only two players. | |||
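As a loose illustration only: the sketch below randomizes the order in which advertisers are served and then allocates seeds greedily via a black-box marginal-influence oracle. It is a simplification for intuition, not the paper's recursive mechanism, and it says nothing about strategyproofness; the oracle and all names are hypothetical.

```python
import random

def allocate_seeds(demands, candidates, marginal_influence, seed=None):
    """demands: advertiser -> number of seeds requested; candidates: seed nodes.
    marginal_influence(advertiser, node, owned) is a hypothetical scoring oracle."""
    rng = random.Random(seed)
    order = list(demands)
    rng.shuffle(order)                       # randomized allocation order
    remaining = set(candidates)
    allocation = {a: [] for a in demands}
    while remaining and any(len(allocation[a]) < demands[a] for a in order):
        for adv in order:
            if len(allocation[adv]) >= demands[adv] or not remaining:
                continue
            best = max(remaining, key=lambda v: marginal_influence(adv, v, allocation[adv]))
            allocation[adv].append(best)
            remaining.remove(best)
    return allocation

toy_oracle = lambda adv, node, owned: int(node[1:]) - len(owned)  # toy score, not a diffusion model
print(allocate_seeds({"A": 2, "B": 1}, {"v1", "v2", "v3", "v4"}, toy_oracle, seed=0))
```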
| Reactive crowdsourcing | | BIBA | Full-Text | 153-164 | |
| Alessandro Bozzon; Marco Brambilla; Stefano Ceri; Andrea Mauri | |||
| An essential aspect for building effective crowdsourcing computations is the ability of "controlling the crowd", i.e. of dynamically adapting the behaviour of the crowdsourcing systems in response to the quantity and quality of completed tasks or to the availability and reliability of performers. Most crowdsourcing systems only provide limited and predefined controls; in contrast, we present an approach to crowdsourcing which provides fine-level, powerful and flexible controls. We model each crowdsourcing application as a composition of elementary task types and we progressively transform these high-level specifications into the features of a reactive execution environment that supports task planning, assignment and completion as well as performer monitoring and exclusion. Controls are specified as active rules on top of data structures which are derived from the model of the application; rules can be added, dropped or modified, thus guaranteeing maximal flexibility with limited effort.
We also report on our prototype platform that implements the proposed framework and we show the results of our experimentations with different rule sets, demonstrating how simple changes to the rules can substantially affect time, effort and quality involved in crowdsourcing activities. | |||
| On participation in group chats on Twitter | | BIBA | Full-Text | 165-176 | |
| Ceren Budak; Rakesh Agrawal | |||
| The success of a group depends on continued participation of its members through time. We study the factors that affect continued user participation in the context of educational Twitter chats. To predict whether a user who attended her first session in a particular Twitter chat group will return to the group, we build the 5F Model, which captures five different factors: individual initiative, group characteristics, perceived receptivity, linguistic affinity and geographical proximity. Through statistical data analysis of thirty Twitter chats over a two-year period as well as a survey study, our work provides many insights about group dynamics in Twitter chats. We show similarities between Twitter chats and traditional groups, such as the importance of social inclusion and linguistic similarity, while also identifying important distinctions such as the insignificance of geographical proximity. We also show that informational support is more important than emotional support in educational Twitter chats, but this does not reduce the sense of community as suggested in earlier studies. | |||
| The role of web hosting providers in detecting compromised websites | | BIBA | Full-Text | 177-188 | |
| Davide Canali; Davide Balzarotti; Aurélien Francillon | |||
| Compromised websites are often used by attackers to deliver malicious content or to host phishing pages designed to steal private information from their victims. Unfortunately, most of the targeted websites are managed by users with little security background -- often unable to detect this kind of threat or to afford an external professional security service.
In this paper we test the ability of web hosting providers to detect compromised websites and react to user complaints. We also test six specialized services that provide security monitoring of web pages for a small fee. During a period of 30 days, we hosted our own vulnerable websites on 22 shared hosting providers, including 12 of the most popular ones. We repeatedly ran five different attacks against each of them. Our tests included a bot-like infection, a drive-by download, the upload of malicious files, an SQL injection stealing credit card numbers, and a phishing kit for a famous American bank. In addition, we also generated traffic from seemingly valid victims of phishing and drive-by download sites. We show that most of these attacks could have been detected by free network or file analysis tools. After 25 days, if no malicious activity was detected, we started filing abuse complaints with the providers. This allowed us to study the reaction of the web hosting providers to both real and bogus complaints. The general picture we drew from our study is quite alarming. The vast majority of the providers, or "add-on" security monitoring services, are unable to detect the simplest signs of malicious activity on hosted websites. | |||
| Your browsing behavior for a big mac: economics of personal information online | | BIBA | Full-Text | 189-200 | |
| Juan Pablo Carrascal; Christopher Riederer; Vijay Erramilli; Mauro Cherubini; Rodrigo de Oliveira | |||
| Most online service providers offer free services to users and, in part, these services collect and monetize personally identifiable information (PII), primarily via targeted advertisements. Against this backdrop of economic exploitation of PII, it is vital to understand the value that users put on their own PII. Although studies have tried to discover how users value their privacy, little is known about how users value their PII while browsing, or the exploitation of their PII. Extracting valuations of PII from users is non-trivial -- surveys cannot be relied on as they do not gather information about the context where PII is being released, thus reducing the validity of answers. In this work, we rely on refined Experience Sampling -- a data collection method that probes users to valuate their PII at the time and place where it was generated in order to minimize retrospective recall and hence increase measurement validity. For obtaining an honest valuation of PII, we use a reverse second price auction. We developed a web browser plugin and had 168 users -- living in Spain -- install and use this plugin for 2 weeks in order to extract valuations of PII in different contexts.
We found that users value items of their online browsing history for about €7 (~10USD), and they give higher valuations to their offline PII, such as age and address (about €25 or ~36USD). When it comes to PII shared in specific online services, users value information pertaining to financial transactions and social network interactions more than activities like search and shopping. No significant distinction was found between valuations of different quantities of PII (e.g. one vs. 10 search keywords), but deviation was found between types of PII (e.g. photos vs. keywords). Finally, the users' preferred goods for exchanging their PII included money and improvements in service, followed by getting more free services and targeted advertisements. | |||
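For readers unfamiliar with the elicitation device, here is a minimal sketch of a reverse second-price auction of the kind the abstract mentions: the lowest asking bid wins and is paid the second-lowest ask, which makes asking one's true valuation the dominant strategy. The study's exact auction setup may differ; the bids below are made up.

```python
def reverse_second_price(bids):
    """bids: participant -> asking price for selling a piece of PII.
    Returns (winner, payment): the lowest ask wins and is paid the second-lowest ask."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1])
    if len(ranked) < 2:
        raise ValueError("need at least two bids")
    winner, _ = ranked[0]
    _, second_lowest = ranked[1]
    return winner, second_lowest

print(reverse_second_price({"alice": 7.0, "bob": 12.0, "carol": 9.5}))
# ('alice', 9.5): alice sells her browsing-history item and is paid 9.5
```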
| Is this app safe for children?: a comparison study of maturity ratings on Android and iOS applications | | BIBA | Full-Text | 201-212 | |
| Ying Chen; Heng Xu; Yilu Zhou; Sencun Zhu | |||
| There is a rising concern among parents who have experienced unreliable content maturity ratings for mobile applications (apps) that result in inappropriate risk exposure for their children and adolescents. In reality, there is no consistent maturity rating policy for mobile applications. The maturity ratings of Android apps are provided purely by developers' self-disclosure and are rarely verified. While Apple's iOS app ratings are considered to be more accurate, they can also be inconsistent with Apple's published policies. To address these issues, this research aims to systematically uncover the extent and severity of unreliable maturity ratings for mobile apps. Specifically, we develop mechanisms to verify the maturity ratings of mobile apps and investigate possible reasons behind the incorrect ratings. We believe that our findings have important implications for platform providers (e.g., Google or Apple) as well as for regulatory bodies and application developers. | |||
| Traveling the silk road: a measurement analysis of a large anonymous online marketplace | | BIBA | Full-Text | 213-224 | |
| Nicolas Christin | |||
| We perform a comprehensive measurement analysis of Silk Road, an anonymous, international online marketplace that operates as a Tor hidden service and uses Bitcoin as its exchange currency. We gather and analyze data over eight months between the end of 2011 and 2012, including daily crawls of the marketplace for nearly six months in 2012. We obtain a detailed picture of the type of goods sold on Silk Road, and of the revenues made both by sellers and Silk Road operators. Through examining over 24,400 separate items sold on the site, we show that Silk Road is overwhelmingly used as a market for controlled substances and narcotics, and that most items sold are available for less than three weeks. The majority of sellers disappear within roughly three months of their arrival, but a core of 112 sellers has been present throughout our measurement interval. We estimate the total revenue made by all sellers, from public listings, at slightly over USD 1.2 million per month; this corresponds to about USD 92,000 per month in commissions for the Silk Road operators. We further show that the marketplace has been operating steadily, with daily sales and number of sellers overall increasing over our measurement interval. We discuss economic and policy implications of our analysis and results, including ethical considerations for future research in this area. | |||
| Group chats on Twitter | | BIBA | Full-Text | 225-236 | |
| James Cook; Krishnaram Kenthapadi; Nina Mishra | |||
| We report on a new kind of group conversation on Twitter that we call a group chat. These chats are periodic, synchronized group conversations focused on specific topics and they exist at a massive scale. The groups and the members of these groups are not explicitly known. Rather, members agree on a hashtag and a meeting time (e.g., 3pm Pacific Time every Wednesday) to discuss a subject of interest. Topics of these chats are numerous and varied. Some are support groups, for example, post-partum depression and mood disorder groups. Others are about a passionate interest: topics include skiing, photography, movies, wine and foodie communities. We develop a definition of a group that is inspired by how sociologists define groups and present an algorithm for discovering groups. We prove that our algorithms find all groups under certain assumptions. While these groups are of course known to the people who participate in the discussions, what we believe is not known is the scale and variety of these groups. We provide some insight into the nature of these groups based on over two years of tweets. Finally, we show that group chats are a growing phenomenon on Twitter and hope that reporting their existence propels their growth even further. | |||
| How to grow more pairs: suggesting review targets for comparison-friendly review ecosystems | | BIBA | Full-Text | 237-248 | |
| James Cook; Alex Fabrikant; Avinatan Hassidim | |||
| We consider the algorithmic challenges behind a novel interface that simplifies consumer research of online reviews by surfacing relevant comparable review bundles: reviews for two or more of the items being researched, all generated in similar enough circumstances to provide for easy comparison. These can be reviews by the same reviewer, by the same demographic category of reviewer, or reviews focusing on the same aspect of the items. But such an interface will work only if the review ecosystem often has comparable review bundles for common research tasks.
Here, we develop and evaluate practical algorithms for suggesting additional review targets to reviewers to maximize comparable pair coverage, the fraction of co-researched pairs of items that have both been reviewed by the same reviewer (or more generally are comparable in one of several ways). We show the exact problem and many subcases to be intractable, and give a greedy online, linear-time 2-approximation for a very general setting, and an offline 1.583-approximation for a narrower setting. We evaluate the algorithms on the Google+ Local reviews dataset, yielding more than 10x gain in pair coverage from six months of simulated replacement of existing reviews by suggested reviews. Even allowing for 90% of reviewers ignoring the suggestions, the pair coverage grows more than 2x in the simulation. To explore other parts of the parameter space, we also evaluate the algorithms on synthetic models. | |||
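A toy sketch of the marginal pair-coverage idea (not the paper's 2-approximation or 1.583-approximation algorithms): for one reviewer, rank candidate items by how many co-researched pairs their review would newly cover. All data below is illustrative.

```python
def best_suggestion(reviewer_items, all_reviews, coresearched_pairs, catalog):
    """reviewer_items: items this reviewer already reviewed.
    all_reviews: reviewer -> set of reviewed items.
    coresearched_pairs: list of item pairs users research together."""
    pair_set = {frozenset(p) for p in coresearched_pairs}
    covered = {frozenset((a, b)) for items in all_reviews.values()
               for a in items for b in items if a != b} & pair_set
    def gain(candidate):
        if candidate in reviewer_items:
            return -1
        return sum(1 for j in reviewer_items
                   if frozenset((candidate, j)) in pair_set
                   and frozenset((candidate, j)) not in covered)
    return max(catalog, key=gain)

all_reviews = {"r1": {"cafe_a", "cafe_b"}, "r2": {"cafe_c"}}
pairs = [("cafe_a", "cafe_c"), ("cafe_b", "cafe_c"), ("cafe_a", "cafe_b")]
print(best_suggestion({"cafe_a", "cafe_b"}, all_reviews, pairs, ["cafe_c", "cafe_d"]))
# 'cafe_c': reviewing it would newly cover two co-researched pairs
```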
| A framework for benchmarking entity-annotation systems | | BIBA | Full-Text | 249-260 | |
| Marco Cornolti; Paolo Ferragina; Massimiliano Ciaramita | |||
| In this paper we design and implement a benchmarking framework for fair and exhaustive comparison of entity-annotation systems. The framework is based upon the definition of a set of problems related to the entity-annotation task, a set of measures to evaluate system performance, and a systematic comparative evaluation involving all publicly available datasets, containing texts of various types such as news, tweets and Web pages. Our framework is easily extensible with novel entity annotators, datasets and evaluation measures for comparing systems, and it has been released to the public as open source. We use this framework to perform the first extensive comparison among all available entity annotators over all available datasets, and draw many interesting conclusions about their efficiency and effectiveness. We also compare academic and commercial annotators. | |||
| A framework for learning web wrappers from the crowd | | BIBA | Full-Text | 261-272 | |
| Valter Crescenzi; Paolo Merialdo; Disheng Qiu | |||
| The development of solutions to scale the extraction of data from Web sources is still a challenging issue. High accuracy can be achieved by supervised approaches but the costs of training data, i.e., annotations over a set of sample pages, limit their scalability. Crowdsourcing platforms are making the manual annotation process more affordable. However, the tasks assigned to these platforms should be extremely simple, so that they can be performed by non-expert people, and their number should be minimized to contain the costs. We introduce a framework to support a supervised wrapper inference system with training data generated by the crowd. Training data are labeled values generated by means of membership queries, the simplest form of queries, posed to the crowd. We show that the costs of producing the training data are strongly affected by the expressiveness of the wrapper formalism and by the choice of the training set. Traditional supervised wrapper inference approaches use a statically defined formalism, assuming it is able to express the wrapper. Conversely, we present an inference algorithm that dynamically chooses the expressiveness of the wrapper formalism and actively selects the training set, while minimizing the number of membership queries to the crowd. We report the results of experiments on real web sources to confirm the effectiveness and the feasibility of the approach. | |||
| Lightweight server support for browser-based CSRF protection | | BIBA | Full-Text | 273-284 | |
| Alexei Czeskis; Alexander Moshchuk; Tadayoshi Kohno; Helen J. Wang | |||
| Cross-Site Request Forgery (CSRF) attacks are one of the top threats on the web today. These attacks exploit ambient authority in browsers (e.g., cookies, HTTP authentication state), turning them into confused deputies and causing undesired side effects on vulnerable web sites. Existing defenses against CSRFs fall short in their coverage and/or ease of deployment. In this paper, we present a browser/server solution, Allowed Referrer Lists (ARLs), that addresses the root cause of CSRFs and removes ambient authority for participating web sites that want to be resilient to CSRF attacks. Our solution is easy for web sites to adopt and does not affect any functionality on non-participating sites. We have implemented our design in Firefox and have evaluated it with real-world sites. We found that ARLs successfully block CSRF attacks, are simpler to implement than existing defenses, and do not significantly impact browser performance. | |||
| Aggregating crowdsourced binary ratings | | BIBA | Full-Text | 285-294 | |
| Nilesh Dalvi; Anirban Dasgupta; Ravi Kumar; Vibhor Rastogi | |||
| In this paper we analyze a crowdsourcing system consisting of a set of users and a set of binary choice questions. Each user has an unknown, fixed reliability that determines the user's error rate in answering questions. The problem is to determine the truth values of the questions solely based on the user answers. Although this problem has been studied extensively, theoretical error bounds have been shown only for restricted settings: when the graph between users and questions is either random or complete. In this paper we consider a general setting of the problem where the user-question graph can be arbitrary. We obtain bounds on the error rate of our algorithm and show it is governed by the expansion of the graph. We demonstrate, using several synthetic and real datasets, that our algorithm outperforms the state of the art. | |||
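For contrast with the paper's approach, here is a simple iterative weighted-vote baseline; it is only a sketch of the aggregation problem and has nothing to do with the graph-expansion analysis above. It alternates between estimating answers from reliability-weighted votes and estimating reliabilities from agreement with the current answers.

```python
def aggregate(answers, iterations=10):
    """answers: dict (user, question) -> +1/-1. Returns question -> +1/-1."""
    users = {u for u, _ in answers}
    questions = {q for _, q in answers}
    weight = {u: 1.0 for u in users}
    truth = {}
    for _ in range(iterations):
        for q in questions:
            score = sum(weight[u] * a for (u, qq), a in answers.items() if qq == q)
            truth[q] = 1 if score >= 0 else -1
        for u in users:
            votes = [(q, a) for (uu, q), a in answers.items() if uu == u]
            agree = sum(1 for q, a in votes if a == truth[q]) / len(votes)
            weight[u] = 2 * agree - 1   # accuracy mapped to [-1, 1]
    return truth

ans = {("u1", "q1"): 1, ("u2", "q1"): 1, ("u3", "q1"): -1,
       ("u1", "q2"): -1, ("u2", "q2"): -1, ("u3", "q2"): 1}
print(aggregate(ans))   # q1 -> 1, q2 -> -1 (u3 ends up down-weighted)
```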
| Optimal hashing schemes for entity matching | | BIBA | Full-Text | 295-306 | |
| Nilesh Dalvi; Vibhor Rastogi; Anirban Dasgupta; Anish Das Sarma; Tamas Sarlos | |||
| In this paper, we consider the problem of devising blocking schemes for entity matching. There is a lot of work on blocking techniques for supporting various kinds of predicates, e.g. exact matches, fuzzy string-similarity matches, and spatial matches. However, given a complex entity matching function in the form of a Boolean expression over several such predicates, we show that it is an important and non-trivial problem to combine the individual blocking techniques into an efficient blocking scheme for the entity matching function, a problem that has not been studied previously.
In this paper, we make fundamental contributions to this problem. We consider an abstraction for modeling complex entity matching functions as well as blocking schemes. We present several results of theoretical and practical interest for the problem. We show that in general, the problem of computing the optimal blocking strategy is NP-hard in the size of the DNF formula describing the matching function. We also present several algorithms for computing the exact optimal strategies (with exponential complexity, but often feasible in practice) as well as fast approximation algorithms. We experimentally demonstrate, using commercially deployed rule-based matching systems on real datasets at Yahoo! as well as on synthetic datasets, that our blocking strategies can be an order of magnitude faster than the baseline methods, and our algorithms can efficiently find good blocking strategies. | |||
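As a concrete (and heavily simplified) illustration of blocking driven by a DNF matching function: give each disjunct one blocking-key function, bucket records by key, and take the union of within-bucket pairs as candidates. This only illustrates the setting, not the paper's optimal or approximate strategy-selection algorithms; the records and keys are made up.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, disjunct_keys):
    """records: id -> record. disjunct_keys: one key function per DNF disjunct,
    mapping a record to a hashable blocking key (or None to skip it)."""
    candidates = set()
    for key_fn in disjunct_keys:
        buckets = defaultdict(list)
        for rid, rec in records.items():
            k = key_fn(rec)
            if k is not None:
                buckets[k].append(rid)
        for bucket in buckets.values():
            candidates.update(combinations(sorted(bucket), 2))
    return candidates

records = {1: {"zip": "10001", "name": "Acme Corp"},
           2: {"zip": "10001", "name": "ACME Corporation"},
           3: {"zip": "94301", "name": "Acme Corp"}}
disjunct_keys = [lambda r: r["zip"],                      # disjunct 1: same zip code
                 lambda r: r["name"].lower().split()[0]]  # disjunct 2: same first name token
print(candidate_pairs(records, disjunct_keys))  # {(1, 2), (1, 3), (2, 3)}
```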
| No country for old members: user lifecycle and linguistic change in online communities | | BIBA | Full-Text | 307-318 | |
| Cristian Danescu-Niculescu-Mizil; Robert West; Dan Jurafsky; Jure Leskovec; Christopher Potts | |||
| Vibrant online communities are in constant flux. As members join and depart, the interactional norms evolve, stimulating further changes to the membership and its social dynamics. Linguistic change -- in the sense of innovation that becomes accepted as the norm -- is essential to this dynamic process: it both facilitates individual expression and fosters the emergence of a collective identity.
We propose a framework for tracking linguistic change as it happens and for understanding how specific users react to these evolving norms. By applying this framework to two large online communities we show that users follow a determined two-stage lifecycle with respect to their susceptibility to linguistic change: a linguistically innovative learning phase in which users adopt the language of the community followed by a conservative phase in which users stop changing and the evolving community norms pass them by. Building on this observation, we show how this framework can be used to detect, early in a user's career, how long she will stay active in the community. Thus, this work has practical significance for those who design and maintain online communities. It also yields new theoretical insights into the evolution of linguistic norms and the complex interplay between community-level and individual-level linguistic change. | |||
| Crowdsourced judgement elicitation with endogenous proficiency | | BIBA | Full-Text | 319-330 | |
| Anirban Dasgupta; Arpita Ghosh | |||
| Crowdsourcing is now widely used to replace judgement or evaluation by an expert authority with an aggregate evaluation from a number of non-experts, in applications ranging from rating and categorizing online content all the way to evaluation of student assignments in massively open online courses (MOOCs) via peer grading. A key issue in these settings, where direct monitoring of both effort and accuracy is infeasible, is incentivizing agents in the 'crowd' to put in effort to make good evaluations, as well as to truthfully report their evaluations. We study the design of mechanisms for crowdsourced judgement elicitation when workers strategically choose both their reports and the effort they put into their evaluations. This leads to a new family of information elicitation problems with unobservable ground truth, where an agent's proficiency -- the probability with which she correctly evaluates the underlying ground truth -- is endogenously determined by her strategic choice of how much effort to put into the task.
Our main contribution is a simple, new mechanism for binary information elicitation for multiple tasks when agents have endogenous proficiencies, with the following properties: (i) Exerting maximum effort followed by truthful reporting of observations is a Nash equilibrium. (ii) This is the equilibrium with maximum payoff to all agents, even when agents have different maximum proficiencies, can use mixed strategies, and can choose a different strategy for each of their tasks. Our information elicitation mechanism requires only minimal bounds on the priors, asks agents to only report their own evaluations, and does not require any conditions on a diverging number of agent reports per task to achieve its incentive properties. The main idea behind our mechanism is to use the presence of multiple tasks and ratings to estimate a reporting statistic to identify and penalize low-effort agreement -- the mechanism rewards agents for agreeing with another 'reference' report on the same task, but also penalizes for blind agreement by subtracting out this statistic term, designed so that agents obtain rewards only when they put effort into their observations. | |||
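To make the "agreement minus blind-agreement statistic" idea concrete, here is a stripped-down numerical sketch; it mirrors the description above but is not the paper's exact mechanism, payoff scaling, or analysis, and all report values are invented.

```python
def reward(agent_reports, reference_reports, shared_task, agent_other, reference_other):
    """All reports are 0/1. `agent_other` and `reference_other` are disjoint task
    lists used only to estimate the blind-agreement penalty."""
    agree = 1.0 if agent_reports[shared_task] == reference_reports[shared_task] else 0.0
    p_agent = sum(agent_reports[t] for t in agent_other) / len(agent_other)
    p_ref = sum(reference_reports[t] for t in reference_other) / len(reference_other)
    blind = p_agent * p_ref + (1 - p_agent) * (1 - p_ref)   # chance of agreeing blindly
    return agree - blind

agent = {"t1": 1, "t2": 1, "t3": 0}
reference = {"t1": 1, "t4": 0, "t5": 1}
print(reward(agent, reference, "t1", ["t2", "t3"], ["t4", "t5"]))  # 1.0 - 0.5 = 0.5
```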
| Timespent based models for predicting user retention | | BIBA | Full-Text | 331-342 | |
| Kushal S. Dave; Vishal Vaingankar; Sumanth Kolar; Vasudeva Varma | |||
| Content discovery is fast becoming the preferred tool for user engagement on the web. Discovery allows users to get educated and entertained about their topics of interest. StumbleUpon is the largest personalized content discovery engine on the Web, delivering more than 1 billion personalized recommendations per month. As a recommendation system, one of the primary metrics we track is whether the user returns (retention) to use the product after their initial experience (session) with StumbleUpon.
In this paper, we attempt to address the problem of predicting user retention based on the user's previous sessions. The paper first explores the different user and content features that are helpful in predicting user retention. This involved mapping the user and the user's recommendations (stumbles) into a descriptive feature space with features such as the time spent by the user, the number of stumbles, and content features of the recommendations. To model the diversity in user behaviour, we also generated normalized features that account for the user's speed of stumbling. Using these features, we built a decision tree classifier to predict retention. We find that a model that uses both the user and content features achieves higher prediction accuracy than a model that uses the two feature sets separately. Further, we used information theoretical analysis to find a subset of recommendations that are most indicative of user retention. A classifier trained on this subset of recommendations achieves the highest prediction accuracy. This indicates that not every recommendation seen by the user is predictive of whether the user will be retained; instead, a subset of the most informative recommendations is more useful in predicting retention. | |||
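A minimal sketch of the kind of decision-tree retention model described above, assuming scikit-learn is available; the feature names and rows are illustrative stand-ins, not the production feature set.

```python
from sklearn.tree import DecisionTreeClassifier

# Each row: [total time spent (s), number of stumbles, mean time per stumble (s),
#            fraction of recommendations from the user's top topic]
X = [[600, 30, 20.0, 0.6],
     [ 45,  3, 15.0, 0.1],
     [900, 25, 36.0, 0.8],
     [ 60,  8,  7.5, 0.2]]
y = [1, 0, 1, 0]   # 1 = the user returned after the first session

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)
print(clf.predict([[500, 20, 25.0, 0.5]]))   # predicted retention label
```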
| Attributing authorship of revisioned content | | BIBA | Full-Text | 343-354 | |
| Luca de Alfaro; Michael Shavlovsky | |||
| A considerable portion of web content, from wikis to collaboratively edited documents, to code posted online, is revisioned. We consider the problem of attributing authorship to such revisioned content, and we develop scalable attribution algorithms that can be applied to very large bodies of revisioned content, such as the English Wikipedia.
Since content can be deleted, only to be later re-inserted, we introduce a notion of authorship that requires comparing each new revision with the entire set of past revisions. For each portion of content in the newest revision, we search the entire history for content matches that are statistically unlikely to occur spontaneously, thus denoting common origin. We use these matches to compute the earliest possible attribution of each word (or each token) of the new content. We show that this "earliest plausible attribution" can be computed efficiently via compact summaries of the past revision history. This leads to an algorithm that runs in time proportional to the sum of the size of the most recent revision, and the total amount of change (edit work) in the revision history. This amount of change is typically much smaller than the total size of all past revisions. The resulting algorithm can scale to very large repositories of revisioned content, as we show via experimental data over the English Wikipedia. | |||
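A much-simplified sketch of the attribution idea (the paper's algorithm uses statistically justified matches and compact history summaries; this toy version just matches token trigrams and will misattribute tokens near match boundaries):

```python
def attribute(revisions):
    """revisions: list of (author, text) from oldest to newest.
    Returns (token, credited_author) pairs for the newest revision."""
    def trigrams(tokens):
        return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}
    history = [(author, text.split()) for author, text in revisions]
    newest_author, newest = history[-1]
    out = []
    for i, tok in enumerate(newest):
        ctx = tuple(newest[i:i + 3])
        credited = newest_author
        if len(ctx) == 3:
            for author, tokens in history[:-1]:   # earliest revision first
                if ctx in trigrams(tokens):
                    credited = author             # earliest plausible origin of this span
                    break
        out.append((tok, credited))
    return out

revs = [("alice", "the quick brown fox jumps"),
        ("bob", "the quick brown fox jumps over the lazy dog")]
print(attribute(revs))
```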
| ClausIE: clause-based open information extraction | | BIBA | Full-Text | 355-366 | |
| Luciano Del Corro; Rainer Gemulla | |||
| We propose ClausIE, a novel, clause-based approach to open information extraction, which extracts relations and their arguments from natural language text. ClausIE fundamentally differs from previous approaches in that it separates the detection of "useful" pieces of information expressed in a sentence from their representation in terms of extractions. In more detail, ClausIE exploits linguistic knowledge about the grammar of the English language to first detect clauses in an input sentence and to subsequently identify the type of each clause according to the grammatical function of its constituents. Based on this information, ClausIE is able to generate high-precision extractions; the representation of these extractions can be flexibly customized to the underlying application. ClausIE is based on dependency parsing and a small set of domain-independent lexica, operates sentence by sentence without any post-processing, and requires no training data (whether labeled or unlabeled). Our experimental study on various real-world datasets suggests that ClausIE obtains higher recall and higher precision than existing approaches, both on high-quality text as well as on noisy text as found in the web. | |||
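Not ClausIE itself, but a drastically simplified subject-verb-object extractor over a dependency parse, to make the clause-centric idea concrete; it assumes spaCy and its small English model (en_core_web_sm) are installed, and it ignores clause typing, non-verbal clauses, and most of what ClausIE actually handles.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def svo_triples(text):
    triples = []
    for sent in nlp(text).sents:
        for tok in sent:
            if tok.pos_ in ("VERB", "AUX"):
                subjects = [w for w in tok.lefts if w.dep_ in ("nsubj", "nsubjpass")]
                objects = [w for w in tok.rights if w.dep_ in ("dobj", "attr", "obj")]
                if subjects and objects:
                    triples.append((subjects[0].text, tok.lemma_, objects[0].text))
    return triples

print(svo_triples("Bell makes electronic products. The company was founded in 1985."))
# e.g. [('Bell', 'make', 'products')]
```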
| Pick-a-crowd: tell me what you like, and i'll tell you what to do | | BIBA | Full-Text | 367-374 | |
| Djellel Eddine Difallah; Gianluca Demartini; Philippe Cudré-Mauroux | |||
| Crowdsourcing makes it possible to build hybrid online platforms that combine scalable information systems with the power of human intelligence to complete tasks that are difficult for current algorithms to tackle. Examples include hybrid database systems that use the crowd to fill missing values or to sort items according to subjective dimensions such as picture attractiveness. Current approaches to Crowdsourcing adopt a pull methodology in which tasks are published on specialized Web platforms, where workers can pick their preferred tasks on a first-come-first-served basis. While this approach has many advantages, such as simplicity and short completion times, it does not guarantee that the task is performed by the most suitable worker. In this paper, we propose and extensively evaluate a different Crowdsourcing approach based on a push methodology. Our proposed system carefully selects which workers should perform a given task based on worker profiles extracted from social networks. Workers and tasks are automatically matched using an underlying categorization structure that exploits entities extracted from the task descriptions on the one hand, and categories liked by the user on social platforms on the other. We experimentally evaluate our approach on tasks of varying complexity and show that our push methodology consistently yields better results than the usual pull strategies. | |||
| Compact explanation of data fusion decisions | | BIBA | Full-Text | 379-390 | |
| Xin Luna Dong; Divesh Srivastava | |||
| Despite the abundance of useful information on the Web, different Web sources often provide conflicting data, some being out-of-date, inaccurate, or erroneous. Data fusion aims at resolving conflicts and finding the truth. Advanced fusion techniques apply iterative MAP (Maximum A Posteriori) analysis that reasons about trustworthiness of sources and copying relationships between them. Providing explanations for such decisions is important for a better understanding, but can be extremely challenging because of the complexity of the analysis during decision making.
This paper proposes two types of explanations for data-fusion results: snapshot explanations take the provided data and any other decision inferred from the data as evidence and provide a high-level understanding of a fusion decision; comprehensive explanations take only the data as evidence and provide an in-depth understanding of a fusion decision. We propose techniques that can efficiently generate correct and compact explanations. Experimental results show that (1) we generate correct explanations, (2) our techniques can significantly reduce the sizes of the explanations, and (3) we can generate the explanations efficiently. | |||
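To ground the discussion, here is a bare-bones, accuracy-only fusion baseline (the paper's fusion additionally reasons about copying between sources via MAP analysis and then explains those decisions; none of that is reflected here): alternate between picking the value with the highest total source trust and re-estimating each source's trust from agreement with the current picks.

```python
from collections import defaultdict

def fuse(claims, iterations=10):
    """claims: dict (source, data_item) -> claimed value. Returns item -> value."""
    sources = {s for s, _ in claims}
    items = {d for _, d in claims}
    trust = {s: 0.8 for s in sources}
    truth = {}
    for _ in range(iterations):
        for d in items:
            votes = defaultdict(float)
            for (s, dd), v in claims.items():
                if dd == d:
                    votes[v] += trust[s]
            truth[d] = max(votes, key=votes.get)
        for s in sources:
            mine = [(d, v) for (ss, d), v in claims.items() if ss == s]
            trust[s] = sum(1 for d, v in mine if truth[d] == v) / len(mine)
    return truth

claims = {("s1", "capital_of_X"): "A", ("s2", "capital_of_X"): "A",
          ("s3", "capital_of_X"): "B", ("s1", "pop_of_X"): "1M",
          ("s3", "pop_of_X"): "2M"}
print(fuse(claims))   # decisions end up following the more trusted sources
```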
| From query to question in one click: suggesting synthetic questions to searchers | | BIBA | Full-Text | 391-402 | |
| Gideon Dror; Yoelle Maarek; Avihai Mejer; Idan Szpektor | |||
| In Web search, users may remain unsatisfied for several reasons: the search engine may not be effective enough or the query might not reflect their intent. Years of research have focused on providing the best user experience for the data available to the search engine. However, little has been done to address the cases in which relevant content for the specific user need has not been posted on the Web yet. One obvious solution is to directly ask other users to generate the missing content using Community Question Answering services such as Yahoo! Answers or Baidu Zhidao. However, formulating a full-fledged question after having issued a query requires some effort. Some previous work proposed to automatically generate natural language questions from a given query, but not for scenarios in which a searcher is presented with a list of questions to choose from. We propose here to generate synthetic questions that can actually be clicked by the searcher so as to be directly posted as questions on a Community Question Answering service. This imposes new constraints, as questions will be actually shown to searchers, who will not appreciate an awkward style or redundancy. To this end, we introduce a learning-based approach that improves not only the relevance of the suggested questions to the original query, but also their grammatical correctness. In addition, since queries are often underspecified and ambiguous, we put a special emphasis on increasing the diversity of suggestions via a novel diversification mechanism. We conducted several experiments to evaluate our approach by comparing it to prior work. The experiments show that our algorithm improves question quality by 14% over prior work and that adding diversification reduces redundancy by 55%. | |||
| Perception and understanding of social annotations in web search | | BIBA | Full-Text | 403-412 | |
| Jennifer Fernquist; Ed H. Chi | |||
| As web search increasingly becomes reliant on social signals, it is imperative for us to understand the effect of these signals on users' behavior. There are multiple ways in which social signals can be used in search: (a) to surface and rank important social content; (b) to signal to users which results are more trustworthy and important by placing annotations on search results. We focus on the latter problem of understanding how social annotations affect user behavior.
In previous work, through eyetracking research we learned that users do not generally seem to fixate on social annotations when they are placed at the bottom of the search result block, with 11% probability of fixation [22]. A second eyetracking study showed that placing the annotation on top of the snippet block might mitigate this issue [22], but this study was conducted using mock-ups and with expert searchers. In this paper, we describe a study conducted with a new eyetracking mixed-method using a live traffic search engine with the suggested design changes on real users using the same experimental procedures. The study comprised 11 subjects with an average of 18 tasks per subject using an eyetrace-assisted retrospective think-aloud protocol. Using a funnel analysis, we found that users are indeed more likely to notice the annotations with a 60% probability of fixation (if the annotation was in view). Moreover, we found no learning effects across search sessions but found significant differences in query types, with subjects having a lower chance of fixating on annotations for queries in the news category. In the interview portion of the study, users reported interesting "wow" moments as well as usefulness in recalling or re-finding content previously shared by oneself or friends. The results not only shed light on how social annotations should be designed in search engines, but also on how users make use of social annotations to make decisions about which pages are useful and potentially trustworthy. | |||
| AMIE: association rule mining under incomplete evidence in ontological knowledge bases | | BIBA | Full-Text | 413-422 | |
| Luis Antonio Galárraga; Christina Teflioudi; Katja Hose; Fabian Suchanek | |||
| Recent advances in information extraction have led to huge knowledge bases (KBs), which capture knowledge in a machine-readable format. Inductive Logic Programming (ILP) can be used to mine logical rules from the KB. These rules can help deduce and add missing knowledge to the KB. While ILP is a mature field, mining logical rules from KBs is different in two aspects: First, current rule mining systems are easily overwhelmed by the amount of data (state-of-the-art systems cannot even run on today's KBs). Second, ILP usually requires counterexamples. KBs, however, implement the open world assumption (OWA), meaning that absent data cannot be used as counterexamples. In this paper, we develop a rule mining model that is explicitly tailored to support the OWA scenario. It is inspired by association rule mining and introduces a novel measure for confidence. Our extensive experiments show that our approach outperforms state-of-the-art approaches in terms of precision and coverage. Furthermore, our system, AMIE, mines rules orders of magnitude faster than state-of-the-art approaches. | |||
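As a sketch of the open-world-aware confidence idea the abstract alludes to (not necessarily AMIE's exact measure): count a body instance as a counterexample only when the KB already knows some object for the head relation and that subject, so missing facts are not treated as negative evidence. The KB and relation names below are toy examples.

```python
def confidences(kb, body_rel, head_rel):
    """kb: set of (subject, relation, object) facts.
    Rule under evaluation: body_rel(x, y) => head_rel(x, y)."""
    body = {(s, o) for s, r, o in kb if r == body_rel}
    head = {(s, o) for s, r, o in kb if r == head_rel}
    head_subjects = {s for s, _ in head}
    support = len(body & head)
    std_conf = support / len(body)                      # closed-world style denominator
    owa_body = {(s, o) for s, o in body if s in head_subjects}
    owa_conf = support / len(owa_body) if owa_body else 0.0
    return std_conf, owa_conf

kb = {("alice", "worksIn", "paris"), ("alice", "livesIn", "paris"),
      ("bob", "worksIn", "lyon"),                       # livesIn(bob, ?) is unknown
      ("carol", "worksIn", "rome"), ("carol", "livesIn", "milan")}
print(confidences(kb, "worksIn", "livesIn"))   # (0.33..., 0.5): bob is not penalized
```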
| PrefixSolve: efficiently solving multi-source multi-destination path queries on RDF graphs by sharing suffix computations | | BIBA | Full-Text | 423-434 | |
| Sidan Gao; Kemafor Anyanwu | |||
| Uncovering the "nature" of the connections between a set of entities e.g.
passengers on a flight and organizations on a watchlist can be viewed as a
Multi-Source Multi-Destination (MSMD) Path Query problem on labeled graph data
models such as RDF. Using existing graph-navigational path finding techniques
to solve MSMD problems will require queries to be decomposed into multiple
single-source or destination path subqueries, each of which is solved
independently. Navigational techniques on disk-resident graphs typically
generate very poor I/O access patterns for large, disk-resident graphs and for
MSMD path queries, such poor access patterns may be repeated if common graph
exploration steps exist across subqueries.
In this paper, we propose an optimization technique for general MSMD path queries that generalizes an efficient algebraic approach for solving a variety of single-source path problems. The generalization enables holistic evaluation of MSMD path queries without the need for query decomposition. We present a conceptual framework for sharing computation in the algebraic framework that is based on "suffix equivalence". Suffix equivalence amongst subqueries captures the fact that multiple subqueries with different prefixes can share a suffix, allowing their prefix path computations to share common suffix path computations. This approach offers orders of magnitude better performance than existing techniques, as demonstrated by a comprehensive experimental evaluation over real and synthetic datasets. | |||
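The suffix-sharing idea can be illustrated outside the paper's algebraic framework with a much simpler sketch: when many sources ask for paths to the same destination set, one Dijkstra run over the reversed graph computes the shared "suffix" distances once, and every source reuses them instead of re-exploring the graph per subquery. This is only a minimal illustration of why sharing suffix computations pays off, not the PrefixSolve algorithm itself; the toy graph and weights below are invented.

```python
import heapq
from collections import defaultdict

def reverse_dijkstra(graph, destinations):
    """Shared 'suffix' computation: distance from every node to the nearest
    destination, obtained with one Dijkstra run over the reversed graph.
    `graph` maps u -> list of (v, weight) edges."""
    rev = defaultdict(list)
    for u, edges in graph.items():
        for v, w in edges:
            rev[v].append((u, w))
    dist = {d: 0.0 for d in destinations}
    heap = [(0.0, d) for d in destinations]
    heapq.heapify(heap)
    while heap:
        d_u, u = heapq.heappop(heap)
        if d_u > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v, w in rev[u]:
            nd = d_u + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Every source reuses the same suffix distances rather than running a
# separate search per (source, destination) subquery.
graph = {"a": [("b", 1.0)], "b": [("c", 2.0), ("d", 5.0)], "c": [("d", 1.0)], "d": []}
suffix = reverse_dijkstra(graph, {"d"})
print({s: suffix.get(s) for s in ("a", "b")})   # {'a': 4.0, 'b': 3.0}
```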
| When tolerance causes weakness: the case of injection-friendly browsers | | BIBA | Full-Text | 435-446 | |
| Yossi Gilad; Amir Herzberg | |||
| We present a practical off-path TCP-injection attack for connections between
current, non-buggy browsers and web-servers. The attack allows web-cache
poisoning with malicious objects; these objects can be cached for long time
periods, exposing any user of that cache to XSS, CSRF and phishing attacks.
In contrast to previous TCP-injection attacks, we assume neither vulnerabilities such as client malware nor a predictable choice of client port or IP-ID. We only exploit subtle details of the HTTP and TCP specifications, and features of legitimate (and common) browser implementations. An empirical evaluation of our techniques with current versions of browsers shows that connections with popular websites are vulnerable. Our attack is modular, and its modules may improve other off-path attacks on TCP communication. We present practical patches against the attack; however, the best defense is surely the adoption of TLS, which ensures security even against the stronger Man-in-the-Middle attacker. | |||
| Exploiting innocuous activity for correlating users across sites | | BIBA | Full-Text | 447-458 | |
| Oana Goga; Howard Lei; Sree Hari Krishnan Parthasarathi; Gerald Friedland; Robin Sommer; Renata Teixeira | |||
| We study how potential attackers can identify accounts on different social network sites that all belong to the same user, exploiting only innocuous activity that inherently comes with posted content. We examine three specific features on Yelp, Flickr, and Twitter: the geo-location attached to a user's posts, the timestamp of posts, and the user's writing style as captured by language models. We show that among these three features the location of posts is the most powerful feature to identify accounts that belong to the same user in different sites. When we combine all three features, the accuracy of identifying Twitter accounts that belong to a set of Flickr users is comparable to that of existing attacks that exploit usernames. Our attack can identify 37% more accounts than using usernames when we instead correlate Yelp and Twitter. Our results have significant privacy implications as they present a novel class of attacks that exploit users' tendency to assume that, if they maintain different personas with different names, the accounts cannot be linked together; whereas we show that the posts themselves can provide enough information to correlate the accounts. | |||
| The cost of annoying ads | | BIBA | Full-Text | 459-470 | |
| Daniel G. Goldstein; R. Preston McAfee; Siddharth Suri | |||
| Display advertisements vary in the extent to which they annoy users. While publishers know the payment they receive to run annoying ads, little is known about the cost such ads incur due to user abandonment. We conducted a two-experiment investigation to analyze ad features that relate to annoyingness and to put a monetary value on the cost of annoying ads. The first experiment asked users to rate and comment on a large number of ads taken from the Web. This allowed us to establish sets of annoying and innocuous ads for use in the second experiment, in which users were given the opportunity to categorize emails for a per-message wage and quit at any time. Participants were randomly assigned to one of three different pay rates and also randomly assigned to categorize the emails in the presence of no ads, annoying ads, or innocuous ads. Since each email categorization constituted an impression, this design, inspired by Toomim et al., allowed us to determine how much more one must pay a person to generate the same number of impressions in the presence of annoying ads compared to no ads or innocuous ads. We conclude by proposing a theoretical model which relates ad quality to publisher market share, illustrating how our empirical findings could affect the economics of Internet advertising. | |||
| Researcher homepage classification using unlabeled data | | BIBA | Full-Text | 471-482 | |
| Sujatha Das Gollapalli; Cornelia Caragea; Prasenjit Mitra; C. Lee Giles | |||
| A classifier that determines if a webpage is relevant to a specified set of
topics comprises a key component for focused crawling. Can a classifier that is
tuned to perform well on training datasets continue to filter out irrelevant
pages in the face of changed content on the Web? We investigate this question
in the context of researcher homepage crawling.
We show experimentally that classifiers trained on existing datasets for homepage identification underperform when classifying "irrelevant" pages on current-day academic websites. As an alternative to obtaining datasets to retrain the classifier for the new content, we propose to use effectively unlimited amounts of unlabeled data readily available from these websites in a co-training scenario. To this end, we design novel URL-based features and use them in conjunction with content-based features as complementary views of the data to obtain remarkable improvements in accurately identifying homepages on current-day university websites. In addition, we propose a novel technique for "learning a conforming pair of classifiers" using mini-batch gradient descent. Our algorithm seeks to minimize a loss (objective) function quantifying the difference in predictions from the two views afforded by co-training. We demonstrate that tuning the classifiers so that they make "similar" predictions on unlabeled data strongly corresponds to the effect achieved by co-training algorithms. We argue that this loss formulation provides insight into understanding the co-training process and can be used even in the absence of a validation set. | |||
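For readers unfamiliar with co-training, the sketch below shows the generic loop the abstract alludes to: two classifiers, one per view (URL features and content features), repeatedly pseudo-label the unlabeled pages they are most confident about for each other and are then refit. It uses scikit-learn logistic regression for convenience and omits the paper's conforming-pair loss and mini-batch gradient descent; all names and parameters are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X_url, X_text, y, labeled_idx, unlabeled_idx, rounds=5, per_round=10):
    """Generic co-training loop over two feature views.
    y holds 0/1 labels for the labeled indices (values elsewhere are ignored)."""
    labeled, unlabeled = list(labeled_idx), list(unlabeled_idx)
    labels = np.asarray(y).copy()
    clf_url = LogisticRegression(max_iter=1000)
    clf_text = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf_url.fit(X_url[labeled], labels[labeled])
        clf_text.fit(X_text[labeled], labels[labeled])
        if not unlabeled:
            break
        for clf, X in ((clf_url, X_url), (clf_text, X_text)):
            proba = clf.predict_proba(X[unlabeled])
            picks = np.argsort(-proba.max(axis=1))[:per_round]   # most confident pages
            chosen = [unlabeled[i] for i in picks]
            labels[chosen] = clf.classes_[proba[picks].argmax(axis=1)]  # pseudo-labels
            labeled.extend(chosen)
            unlabeled = [u for u in unlabeled if u not in set(chosen)]
    return clf_url, clf_text
```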
| Google+ or Google-?: dissecting the evolution of the new OSN in its first year | | BIBA | Full-Text | 483-494 | |
| Roberto Gonzalez; Ruben Cuevas; Reza Motamedi; Reza Rejaie; Angel Cuevas | |||
| In the era when Facebook and Twitter dominate the market for social media,
Google has introduced Google+ (G+) and reported a significant growth in its
size while others called it a ghost town. This raises the question of whether
G+ can really attract a significant number of connected and active users
despite the dominance of Facebook and Twitter.
This paper tackles the above question by presenting a detailed characterization of G+ based on large-scale measurements. We identify the main components of the G+ structure and characterize the key features of their users and their evolution over time. We then conduct a detailed analysis of the evolution of connectivity and activity among users in the largest connected component (LCC) of the G+ structure, and compare their characteristics with those of other major OSNs. We show that despite the dramatic growth in the size of G+, the relative size of the LCC has been decreasing and its connectivity has become less clustered. While the aggregate user activity has gradually increased, only a very small fraction of users exhibit any type of activity. To our knowledge, our study offers the most comprehensive characterization of G+ based on the largest collected data sets. | |||
| Probabilistic group recommendation via information matching | | BIBA | Full-Text | 495-504 | |
| Jagadeesh Gorla; Neal Lathia; Stephen Robertson; Jun Wang | |||
| Increasingly, web recommender systems face scenarios where they need to serve suggestions to groups of users; for example, when families share e-commerce or movie rental web accounts. Research to date in this domain has proposed two approaches: computing recommendations for the group by merging any members' ratings into a single profile, or computing ranked recommendations for each individual that are then merged via a range of heuristics. However, none of these past approaches reasons about the preferences that arise in individuals when they are members of a group. In this work, we present a probabilistic framework, based on the notion of information matching, for group recommendation. This model defines group relevance as a combination of the item's relevance to each user as an individual and as a member of the group; it can then seamlessly incorporate any group recommendation strategy in order to rank items for a set of individuals. We evaluate the model's efficacy at generating recommendations for both single individuals and groups using the MovieLens and MoviePilot data sets. In both cases, we compare our results with baselines and state-of-the-art collaborative filtering algorithms, and show that the model outperforms all others over a variety of ranking metrics. | |||
| WTF: the who to follow service at Twitter | | BIBA | Full-Text | 505-514 | |
| Pankaj Gupta; Ashish Goel; Jimmy Lin; Aneesh Sharma; Dong Wang; Reza Zadeh | |||
| WTF ("Who to Follow") is Twitter's user recommendation service, which is responsible for creating millions of connections daily between users based on shared interests, common connections, and other related factors. This paper provides an architectural overview and shares lessons we learned in building and running the service over the past few years. Particularly noteworthy was our design decision to process the entire Twitter graph in memory on a single server, which significantly reduced architectural complexity and allowed us to develop and deploy the service in only a few months. At the core of our architecture is Cassovary, an open-source in-memory graph processing engine we built from scratch for WTF. Besides powering Twitter's user recommendations, Cassovary is also used for search, discovery, promoted products, and other services as well. We describe and evaluate a few graph recommendation algorithms implemented in Cassovary, including a novel approach based on a combination of random walks and SALSA. Looking into the future, we revisit the design of our architecture and comment on its limitations, which are presently being addressed in a second-generation system under development. | |||
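The flavor of graph recommendation described here can be sketched in a few lines: run Monte Carlo random walks with restarts from the target user over an in-memory follow graph and recommend the most frequently visited accounts the user does not already follow. This is a generic personalized-random-walk sketch, not Cassovary's implementation or Twitter's SALSA-based algorithm; the toy graph and parameters are invented.

```python
import random
from collections import Counter

def who_to_follow(following, user, walks=2000, steps=6, restart=0.3, k=5):
    """Monte Carlo personalized random walk over a follow graph.
    `following` maps a user to the list of accounts they follow.
    Frequently visited accounts the user does not already follow are recommended."""
    visits = Counter()
    for _ in range(walks):
        node = user
        for _ in range(steps):
            outs = following.get(node, [])
            if not outs or random.random() < restart:
                node = user                      # restart at the seed user
            else:
                node = random.choice(outs)
            visits[node] += 1
    already = set(following.get(user, [])) | {user}
    return [u for u, _ in visits.most_common() if u not in already][:k]

graph = {"alice": ["bob", "carol"], "bob": ["carol", "dave"], "carol": ["dave"], "dave": []}
print(who_to_follow(graph, "alice"))   # likely recommends 'dave'
```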
| Mining expertise and interests from social media | | BIBA | Full-Text | 515-526 | |
| Ido Guy; Uri Avraham; David Carmel; Sigalit Ur; Michal Jacovi; Inbal Ronen | |||
| The rising popularity of social media in the enterprise presents new opportunities for one of the organization's most important needs -- expertise location. Social media data can be very useful for expertise mining due to the variety of existing applications, the rich metadata, and the diversity of user associations with content. In this work, we provide an extensive study that explores the use of social media to infer expertise within a large global organization. We examine eight different social media applications by evaluating the data they produce through a large user survey, with 670 enterprise social media users. We distinguish between two semantics that relate a user to a topic, namely expertise in the topic and interest in it, and compare these two semantics across the different social media applications. | |||
| Measuring personalization of web search | | BIBA | Full-Text | 527-538 | |
| Aniko Hannak; Piotr Sapiezynski; Arash Molavi Kakhki; Balachander Krishnamurthy; David Lazer; Alan Mislove; Christo Wilson | |||
| Web search is an integral part of our daily lives. Recently, there has been
a trend of personalization in Web search, where different users receive
different results for the same search query. The increasing personalization is
leading to concerns about Filter Bubble effects, where certain users are simply
unable to access information that the search engines' algorithms decide is
irrelevant. Despite these concerns, there has been little quantification of the
extent of personalization in Web search today, or the user attributes that
cause it.
In light of this situation, we make three contributions. First, we develop a methodology for measuring personalization in Web search results. While conceptually simple, there are numerous details that our methodology must handle in order to accurately attribute differences in search results to personalization. Second, we apply our methodology to 200 users on Google Web Search; we find that, on average, 11.7% of results show differences due to personalization, but that this varies widely by search query and by result ranking. Third, we investigate the causes of personalization on Google Web Search. Surprisingly, we find measurable personalization only as a result of searching with a logged-in account and of the IP address of the searching user. Our results are a first step towards understanding the extent and effects of personalization on Web search engines today. | |||
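Conceptually, measuring personalization comes down to comparing the result list served to a real user with a control list fetched without personal signals. The sketch below computes two simple list-comparison scores, Jaccard overlap of URLs and average rank displacement; the paper's actual methodology controls for many more confounds (time, data center, carry-over effects), so treat this only as an illustration with made-up URLs.

```python
def jaccard(a, b):
    """Overlap between two sets of result URLs."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def rank_change(personalized, control):
    """Average absolute change in rank for URLs present in both lists;
    URLs appearing in only one list are ignored here."""
    pos = {url: i for i, url in enumerate(control)}
    deltas = [abs(i - pos[url]) for i, url in enumerate(personalized) if url in pos]
    return sum(deltas) / len(deltas) if deltas else 0.0

control      = ["u1", "u2", "u3", "u4"]
personalized = ["u2", "u1", "u5", "u3"]
print(jaccard(personalized, control), rank_change(personalized, control))  # 0.6 1.0
```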
| Estimating clustering coefficients and size of social networks via random walk | | BIBA | Full-Text | 539-550 | |
| Stephen J. Hardiman; Liran Katzir | |||
| Online social networks have become a major force in today's society and
economy. The largest of today's social networks may have hundreds of millions
to more than a billion users. Such networks are too large to be downloaded or
stored locally, even if terms of use and privacy policies were to permit doing
so. This limitation complicates even simple computational tasks. One such task
is computing the clustering coefficient of a network. Another task is to
compute the network size (number of registered users) or a subpopulation size.
The clustering coefficient, a classic measure of network connectivity, comes in two flavors, global and network average. In this work, we provide efficient algorithms for estimating these measures which (1) assume no prior knowledge about the network; and (2) access the network using only the publicly available interface. More precisely, this work provides three new estimation algorithms: (a) the first external access algorithm for estimating the global clustering coefficient; (b) an external access algorithm that improves on the accuracy of previous network average clustering coefficient estimation algorithms; and (c) an improved external access network size estimation algorithm. The main insight offered by this work is that only a relatively small number of public interface calls are required to allow our algorithms to achieve a high accuracy estimation. Our approach is to view a social network as an undirected graph and use the public interface to retrieve a random walk. To estimate the clustering coefficient, the connectivity of each node in the random walk sequence is tested in turn. We show that the error of this estimation drops exponentially in the number of random walk steps. Another insight of this work is the fact that, although the proposed algorithms can be used to estimate the clustering coefficient of any undirected graph, they are particularly efficient on social network-like graphs. To improve on prior-art network size estimation algorithms, we count node collisions one step before they actually occur. In our experiments we validate our algorithms on several publicly available social network datasets. Our results validate the theoretical claims and demonstrate the effectiveness of our algorithms. | |||
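A simplified version of the random-walk idea can be written down directly: at each step, test whether the nodes immediately before and after the current node are adjacent, and reweight observations by inverse degree because the walk visits nodes proportionally to degree. The sketch below is an importance-reweighting approximation that assumes a connected, non-bipartite graph stored as an adjacency dictionary and ignores burn-in; it is not the paper's exact estimator or error analysis.

```python
import random

def estimate_avg_clustering(neighbors, start, steps=200000):
    """Estimate the network-average clustering coefficient from one random walk.
    `neighbors` maps each node to a list of its neighbors (undirected graph).
    Checking whether the previous and next walk nodes are adjacent yields an
    estimate of the local clustering at the current node; 1/degree weights
    correct for the degree-proportional sampling of the walk."""
    prev, cur = None, start
    num = den = 0.0
    for _ in range(steps):
        nxt = random.choice(neighbors[cur])
        d = len(neighbors[cur])
        if prev is not None:
            if d > 1:
                closes = nxt != prev and nxt in neighbors[prev]
                num += (1.0 if closes else 0.0) / (d - 1)
            den += 1.0 / d
        prev, cur = cur, nxt
    return num / den if den else 0.0
```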
| Exploiting annotations for the rapid development of collaborative web applications | | BIBA | Full-Text | 551-560 | |
| Matthias Heinrich; Franz Josef Grüneberger; Thomas Springer; Martin Gaedke | |||
| Web application frameworks are a proven means to accelerate the development of interactive web applications. However, implementing collaborative real-time applications like Google Docs requires specific concurrency control services (i.e. document synchronization and conflict resolution) that are not included in prevalent general-purpose frameworks like jQuery or Knockout. Hence, developers have to get familiar with specific collaboration frameworks (e.g. ShareJS) which substantially increases the development effort. To ease the development of collaborative web applications, we propose a set of source code annotations representing a lightweight mechanism to introduce concurrency control services into mature web frameworks. Those annotations are interpreted at runtime by a dedicated collaboration engine to sync documents and resolve conflicts. We enhanced the general-purpose framework Knockout with a collaboration engine and conducted a developer study comparing our approach to a traditional concurrency control library. The evaluation results show that the effort to incorporate collaboration capabilities into a web application can be reduced by up to 40 percent using the annotation-based solution. | |||
| Web usage mining with semantic analysis | | BIBA | Full-Text | 561-570 | |
| Laura Hollink; Peter Mika; Roi Blanco | |||
| Web usage mining has traditionally focused on the individual queries or query words leading to a web site or web page visit, mining patterns in such data. In our work, we aim to characterize websites in terms of the semantics of the queries that lead to them by linking queries to large knowledge bases on the Web. We demonstrate how to exploit such links for more effective pattern mining on query log data. We also show how such patterns can be used to qualitatively describe the differences between competing websites in the same domain and to quantitatively predict website abandonment. | |||
| Organizational overlap on social networks and its applications | | BIBA | Full-Text | 571-582 | |
| Cho-Jui Hsieh; Mitul Tiwari; Deepak Agarwal; Xinyi (Lisa) Huang; Sam Shah | |||
| Online social networks have become important for networking, communication, sharing, and discovery. A considerable challenge these networks face is the fact that an online social network is partially observed because two individuals might know each other, but may not have established a connection on the site. Therefore, link prediction and recommendations are important tasks for any online social network. In this paper, we address the problem of computing edge affinity between two users on a social network, based on the users belonging to organizations such as companies, schools, and online groups. We present experimental insights from social network data on organizational overlap, a novel mathematical model to compute the probability of connection between two people based on organizational overlap, and experimental validation of this model based on real social network data. We also present novel ways in which the organization overlap model can be applied to link prediction and community detection, which in itself could be useful for recommending entities to follow and generating personalized news feed. | |||
| Space-efficient data structures for Top-k completion | | BIBA | Full-Text | 583-594 | |
| Bo-June (Paul) Hsu; Giuseppe Ottaviano | |||
| Virtually every modern search application, either desktop, web, or mobile,
features some kind of query auto-completion. In its basic form, the problem
consists in retrieving from a string set a small number of completions, i.e.
strings beginning with a given prefix, that have the highest scores according
to some static ranking. In this paper, we focus on the case where the string
set is so large that compression is needed to fit the data structure in memory.
This is a compelling case for web search engines and social networks, where it
is necessary to index hundreds of millions of distinct queries to guarantee a
reasonable coverage; and for mobile devices, where the amount of memory is
limited.
We present three different trie-based data structures to address this problem, each one with different space/time/complexity trade-offs. Experiments on large-scale datasets show that it is possible to compress the string sets, including the scores, down to spaces competitive with the gzip'ed data, while supporting efficient retrieval of completions at about a microsecond per completion. | |||
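The uncompressed core of top-k completion is easy to sketch: a trie whose nodes store the best score found in their subtree, plus a best-first search from the prefix node that emits completed strings in score order. The compression that is the paper's actual contribution is omitted here, and the strings and scores are invented.

```python
import heapq

class TrieNode:
    __slots__ = ("children", "score", "best")
    def __init__(self):
        self.children = {}           # char -> TrieNode
        self.score = None            # score if a complete string ends here
        self.best = float("-inf")    # best score anywhere in this subtree

def insert(root, s, score):
    node = root
    node.best = max(node.best, score)
    for ch in s:
        node = node.children.setdefault(ch, TrieNode())
        node.best = max(node.best, score)
    node.score = score

def top_k(root, prefix, k):
    """Best-first search below the prefix node, guided by subtree best scores."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    heap = [(-node.best, prefix, node)]
    out = []
    while heap and len(out) < k:
        neg, s, n = heapq.heappop(heap)
        if n is None:                               # a finished completion
            out.append((s, -neg))
            continue
        if n.score is not None:                     # re-queue the terminal score
            heapq.heappush(heap, (-n.score, s, None))
        for ch, child in n.children.items():
            heapq.heappush(heap, (-child.best, s + ch, child))
    return out

root = TrieNode()
for s, sc in [("twitter", 90), ("twitch", 85), ("twilight", 40), ("two", 70)]:
    insert(root, s, sc)
print(top_k(root, "tw", 2))   # [('twitter', 90), ('twitch', 85)]
```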
| Personalized recommendation via cross-domain triadic factorization | | BIBA | Full-Text | 595-606 | |
| Liang Hu; Jian Cao; Guandong Xu; Longbing Cao; Zhiping Gu; Can Zhu | |||
| Collaborative filtering (CF) is a major technique in recommender systems to help users find their potentially desired items. Since the data sparsity problem is quite commonly encountered in real-world scenarios, Cross-Domain Collaborative Filtering (CDCF) has become an emerging research topic in recent years. However, due to the lack of sufficiently dense explicit feedback, and even the absence of any feedback, in users' uninvolved domains, current CDCF approaches may not perform satisfactorily in user preference prediction. In this paper, we propose a generalized Cross Domain Triadic Factorization (CDTF) model over the triadic relation user-item-domain, which can better capture the interactions between domain-specific user factors and item factors. In particular, we devise two CDTF algorithms to leverage users' explicit and implicit feedback, respectively, along with a genetic-algorithm-based procedure for tuning the weight parameters to optimally trade off influence among domains. Finally, we conduct experiments to evaluate our models and compare them with other state-of-the-art models using two real-world datasets. The results show the superiority of our models over the comparison models. | |||
| Unsupervised sentiment analysis with emotional signals | | BIBA | Full-Text | 607-618 | |
| Xia Hu; Jiliang Tang; Huiji Gao; Huan Liu | |||
| The explosion of social media services presents a great opportunity to understand the sentiment of the public via analyzing its large-scale and opinion-rich data. In social media, it is easy to amass vast quantities of unlabeled data, but very costly to obtain sentiment labels, which makes unsupervised sentiment analysis essential for various applications. It is challenging for traditional lexicon-based unsupervised methods due to the fact that expressions in social media are unstructured, informal, and fast-evolving. Emoticons and product ratings are examples of emotional signals that are associated with sentiments expressed in posts or words. Inspired by the wide availability of emotional signals in social media, we propose to study the problem of unsupervised sentiment analysis with emotional signals. In particular, we investigate whether the signals can potentially help sentiment analysis by providing a unified way to model two main categories of emotional signals, i.e., emotion indication and emotion correlation. We further incorporate the signals into an unsupervised learning framework for sentiment analysis. In the experiment, we compare the proposed framework with the state-of-the-art methods on two Twitter datasets and empirically evaluate our proposed framework to gain a deep understanding of the effects of emotional signals. | |||
| An analysis of socware cascades in online social networks | | BIBA | Full-Text | 619-630 | |
| Ting-Kai Huang; Md Sazzadur Rahman; Harsha V. Madhyastha; Michalis Faloutsos | |||
| Online social networks (OSNs) have become a popular new vector for
distributing malware and spam, which we refer to as socware. Unlike email spam,
which is sent by spammers directly to intended victims, socware cascades
through OSNs as compromised users spread it to their friends. In this paper, we
analyze data from the walls of roughly 3 million Facebook users over five
months, with the goal of developing a better understanding of socware cascades.
We study socware cascades to understand: (a) their spatio-temporal properties, (b) the underlying motivations and mechanisms, and (c) the social engineering tricks used to con users. First, we identify an evolving trend in which cascades appear to be throttling their rate of growth to evade detection, and thus, lasting longer. Second, our forensic investigation into the infrastructure that supports these cascades shows that, surprisingly, Facebook seems to be inadvertently enabling most cascades; 44% of cascades are disseminated via Facebook applications. At the same time, we observe large groups of synergistic Facebook apps (more than 144 groups of size 5 or more) that collaborate to support multiple cascades. Lastly, we find that hackers rely on two social engineering tricks in equal measure, luring users with free products and appealing to users' social curiosity, to enable socware cascades. Our findings present several promising avenues towards reducing socware on Facebook, but also highlight associated challenges. | |||
| Measurement and analysis of child pornography trafficking on P2P networks | | BIBA | Full-Text | 631-642 | |
| Ryan Hurley; Swagatika Prusty; Hamed Soroush; Robert J. Walls; Jeannie Albrecht; Emmanuel Cecchet; Brian Neil Levine; Marc Liberatore; Brian Lynn; Janis Wolak | |||
| Peer-to-peer networks are the most popular mechanism for the criminal
acquisition and distribution of child pornography (CP). In this paper, we
examine observations of peers sharing known CP on the eMule and Gnutella
networks, which were collected by law enforcement using forensic tools that we
developed. We characterize a year's worth of network activity and evaluate
different strategies for prioritizing investigators' limited resources. The
highest impact research in criminal forensics works within, and is evaluated
under, the constraints and goals of investigations. We follow that principle,
rather than presenting a set of isolated, exploratory characterizations of
users.
First, we focus on strategies for reducing the number of CP files available on the network by removing a minimal number of peers. We present a metric for peer removal that is more effective than simply selecting peers with the largest libraries or the most days online. Second, we characterize six aggressive peer subgroups, including: peers using Tor, peers that bridge multiple p2p networks, and the top 10% of peers contributing to file availability. We find that these subgroups are more active in their trafficking, having more known CP and more uptime, than the average peer. Finally, while in theory Tor presents a challenge to investigators, we observe that in practice offenders use Tor inconsistently. Over 90% of regular Tor users send traffic from a non-Tor IP at least once after first using Tor. | |||
| HeteroMF: recommendation in heterogeneous information networks using context dependent factor models | | BIBA | Full-Text | 643-654 | |
| Mohsen Jamali; Laks Lakshmanan | |||
| With the growing amount of information available online, recommender systems are starting to provide a viable alternative and complement to search engines, in helping users to find objects of interest. Methods based on Matrix Factorization (MF) models are the state-of-the-art in recommender systems. The input to MF is user feedback, in the form of a rating matrix. However, users can be engaged in interactions with multiple types of entities across different contexts, leading to multiple rating matrices. In other words, users can have interactions in a heterogeneous information network. Generally, in a heterogeneous network, entities from any two entity types can have interactions with a weight (rating) indicating the level of endorsement. Collective Matrix Factorization (CMF) has been proposed to address the recommendation problem in heterogeneous networks. However, a main issue with CMF is that entities share the same latent factor across different contexts. This is particularly problematic in two cases: Latent factors for entities that are cold-start in a context will be learnt mainly based on the data from other contexts where these entities are not cold-start, and therefore the factors are not properly learned for the cold-start context. Also, if a context has more data compared to another context, then the dominant context will dominate the learning process for the latent factors for entities shared in these two contexts. In this paper, we propose a context-dependent matrix factorization model, HeteroMF, that considers a general latent factor for entities of every entity type and context-dependent latent factors for every context in which the entities are involved. We learn a general latent factor for every entity and transfer matrices for every context to convert the general latent factors into a context-dependent latent factor. Experiments on two real-life datasets from Epinions and Flixster demonstrate that HeteroMF substantially outperforms CMF, particularly for cold-start entities and for contexts where interactions in one context are dominated by other contexts. | |||
| Interactive exploratory search for multi page search results | | BIBA | Full-Text | 655-666 | |
| Xiaoran Jin; Marc Sloan; Jun Wang | |||
| Modern information retrieval interfaces typically involve multiple pages of search results, and users who are recall-minded or engaging in exploratory search using ad hoc queries are likely to access more than one page. Document rankings for such queries can be improved by allowing additional context to the query to be provided by the user herself using explicit ratings or implicit actions such as clickthroughs. Existing methods using this information usually involve detrimental UI changes that can lower user satisfaction. Instead, we propose a new feedback scheme that makes use of existing UIs and does not alter the user's browsing behaviour; to maximise retrieval performance over multiple result pages, we propose a novel retrieval optimisation framework and show that the optimal ranking policy should choose a diverse, exploratory ranking to display on the first page. Then, a personalised re-ranking of the next pages can be generated based on the user's feedback from the first page. We show that document correlations used in result diversification have a significant impact on relevance feedback and its effectiveness over a search session. TREC evaluations demonstrate that our optimal rank strategy (including approximative Monte Carlo Sampling) can naturally optimise the trade-off between exploration and exploitation and maximise the user's overall satisfaction over time against a number of similar baselines. | |||
| Spatio-temporal dynamics of online memes: a study of geo-tagged tweets | | BIBA | Full-Text | 667-678 | |
| Krishna Y. Kamath; James Caverlee; Kyumin Lee; Zhiyuan Cheng | |||
| We conduct a study of the spatio-temporal dynamics of Twitter hashtags through a sample of 2 billion geo-tagged tweets. In our analysis, we (i) examine the impact of location, time, and distance on the adoption of hashtags, which is important for understanding meme diffusion and information propagation; (ii) examine the spatial propagation of hashtags through their focus, entropy, and spread; and (iii) present two methods that leverage the spatio-temporal propagation of hashtags to characterize locations. Based on this study, we find that although hashtags are a global phenomenon, the physical distance between locations is a strong constraint on the adoption of hashtags, both in terms of the hashtags shared between locations and in the timing of when these hashtags are adopted. We find both spatial and temporal locality as most hashtags spread over small geographical areas but at high speeds. We also find that hashtags are mostly a local phenomenon with long-tailed life spans. These (and other) findings have important implications for a variety of systems and applications, including targeted advertising, location-based services, social media search, and content delivery networks. | |||
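Plausible implementations of the three spatial measures mentioned (focus, entropy, spread) are sketched below for a single hashtag: focus as the share of the dominant region, entropy as the Shannon entropy of the regional distribution, and spread as the mean haversine distance from the geographic midpoint of the occurrences. These definitions are assumptions for illustration and may differ in detail from the paper's.

```python
import math
from collections import Counter

def focus_and_entropy(regions):
    """`regions` lists the region label of each observed occurrence of a hashtag.
    Focus = share of the dominant region; entropy = Shannon entropy (bits)."""
    counts = Counter(regions)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    return max(probs), -sum(p * math.log2(p) for p in probs)

def spread_km(points):
    """Mean haversine distance (km) of occurrences from their (approximate)
    geographic midpoint, taken here as the arithmetic mean of lat/lon."""
    lat0 = sum(p[0] for p in points) / len(points)
    lon0 = sum(p[1] for p in points) / len(points)
    def haversine(lat1, lon1, lat2, lon2):
        r = 6371.0
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))
    return sum(haversine(lat, lon, lat0, lon0) for lat, lon in points) / len(points)

print(focus_and_entropy(["austin", "austin", "nyc"]))     # (0.667, ~0.918 bits)
print(spread_km([(30.27, -97.74), (40.71, -74.00)]))
```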
| Accountable key infrastructure (AKI): a proposal for a public-key validation infrastructure | | BIBA | Full-Text | 679-690 | |
| Tiffany Hyun-Jin Kim; Lin-Shung Huang; Adrian Perrig; Collin Jackson; Virgil Gligor | |||
| Recent trends in public-key infrastructure research explore the tradeoff between decreased trust in Certificate Authorities (CAs), resilience against attacks, communication overhead (bandwidth and latency) for setting up an SSL/TLS connection, and availability with respect to verifiability of public key information. In this paper, we propose AKI as a new public-key validation infrastructure, to reduce the level of trust in CAs. AKI integrates an architecture for key revocation of all entities (e.g., CAs, domains) with an architecture for accountability of all infrastructure parties through checks-and-balances. AKI efficiently handles common certification operations, and gracefully handles catastrophic events such as domain key loss or compromise. We propose AKI to make progress towards a public-key validation infrastructure with key revocation that reduces trust in any single entity. | |||
| DIGTOBI: a recommendation system for Digg articles using probabilistic modeling | | BIBA | Full-Text | 691-702 | |
| Younghoon Kim; Yoonjae Park; Kyuseok Shim | |||
| Digg is a social news website that lets people submit articles to share
their favorite web pages (e.g. blog postings or news articles) and vote on the
articles posted by others. The Digg service currently lists the articles on the
front page by popularity, without considering each user's preference for the
topics in the articles. Helping users to find the most interesting Digg
articles tailored to each user's own interests will be very useful, but it is
not an easy task to classify the articles according to their topics in order to
recommend the articles differently to each user.
In this paper, we propose DIGTOBI, a personalized recommendation system for Digg articles based on a novel probabilistic model. Our model also treats relevant articles with low Digg scores as important. We show that our model can handle both warm-start and cold-start scenarios seamlessly through a single model. We next propose an EM algorithm to learn the parameters of our probabilistic model. Our performance study with Digg data confirms the effectiveness of DIGTOBI compared to traditional recommendation algorithms. | |||
| Understanding latency variations of black box services | | BIBA | Full-Text | 703-714 | |
| Darja Krushevskaja; Mark Sandler | |||
| Data centers run many services that impact millions of users daily. In
reality, the latency of each service varies from one request to another.
Existing tools make it possible to monitor services for performance glitches or
service disruptions, but they typically do not help in understanding the
variations in latency.
We propose a general framework for understanding the performance of arbitrary black-box services. We consider a stream of requests to a given service with their monitored attributes, as well as the latency of serving each request. We propose what we call the multi-dimensional f-measure, which, for a given latency interval, helps identify the subset of monitored attributes that explains it. We design algorithms that use this measure not only for a fixed latency interval, but also to explain the entire range of latencies of the service by segmenting it into smaller intervals. We perform a detailed experimental study with synthetic data, as well as real data from a large search engine. Our experiments show that our methods automatically identify significant latency intervals together with request attributes that explain them, and that they are robust. | |||
| Diversified recommendation on graphs: pitfalls, measures, and algorithms | | BIBA | Full-Text | 715-726 | |
| Onur Küçüktunç; Erik Saule; Kamer Kaya; Ümit V. Çatalyürek | |||
| Result diversification has gained a lot of attention as a way to answer ambiguous queries and to tackle the redundancy problem in the results. In the last decade, diversification has been applied on or integrated into the process of PageRank- or eigenvector-based methods that run on various graphs, including social networks, collaboration networks in academia, web and product co-purchasing graphs. For these applications, the diversification problem is usually addressed as a bicriteria objective optimization problem of relevance and diversity. However, such an approach is questionable since a query-oblivious diversification algorithm that recommends most of its results without even considering the query may perform the best on these commonly used measures. In this paper, we show the deficiencies of popular evaluation techniques of diversification methods, and investigate multiple relevance and diversity measures to understand whether they have any correlations. Next, we propose a novel measure called expanded relevance which combines both relevance and diversity into a single function in order to measure the coverage of the relevant part of the graph. We also present a new greedy diversification algorithm called BestCoverage, which optimizes the expanded relevance of the result set with (1-1/e)-approximation. With a rigorous experimentation on graphs from various applications, we show that the proposed method is efficient and effective for many use cases. | |||
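A generic greedy routine of the kind the abstract describes can be sketched as weighted coverage maximization: repeatedly pick the node whose one-hop neighborhood adds the most uncovered relevance. Because weighted coverage is monotone and submodular, greedy selection carries the usual (1-1/e) guarantee; the exact expanded-relevance objective of BestCoverage is not reproduced here, and the toy graph and relevance scores are invented.

```python
def greedy_best_coverage(neighbors, relevance, k):
    """Greedily pick k nodes maximizing the total relevance of the nodes they
    cover (a node covers itself and its neighbors)."""
    covered = set()
    selected = []
    candidates = set(neighbors)
    for _ in range(min(k, len(candidates))):
        def gain(v):
            return sum(relevance.get(u, 0.0)
                       for u in ({v} | set(neighbors[v])) - covered)
        best = max(candidates - set(selected), key=gain)
        selected.append(best)
        covered |= {best} | set(neighbors[best])
    return selected

g = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"], "e": []}
rel = {"a": 0.9, "b": 0.5, "c": 0.8, "d": 0.4, "e": 0.3}
print(greedy_best_coverage(g, rel, 2))   # ['a', 'c'] or ['a', 'd'], depending on ties
```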
| What is the added value of negative links in online social networks? | | BIBA | Full-Text | 727-736 | |
| Jérôme Kunegis; Julia Preusse; Felix Schwagereit | |||
| We investigate the "negative link" feature of social networks that allows users to tag other users as foes or as distrusted in addition to the usual friend and trusted links. To answer the question whether negative links have an added value for an online social network, we investigate the machine learning problem of predicting the negative links of such a network using only the positive links as a basis, with the idea that if this problem can be solved with high accuracy, then the "negative link" feature is redundant. In doing so, we also present a general methodology for assessing the added value of any new link type in online social networks. Our evaluation is performed on two social networks that allow negative links: The technology news website Slashdot and the product review site Epinions. In experiments with these two datasets, we come to the conclusion that a combination of centrality-based and proximity-based link prediction functions can be used to predict the negative edges in the networks we analyse. We explain this result by an application of the models of preferential attachment and balance theory to our learning problem, and show that the "negative link" feature has a small but measurable added value for these social networks. | |||
| Voices of victory: a computational focus group framework for tracking opinion shift in real time | | BIBA | Full-Text | 737-748 | |
| Yu-Ru Lin; Drew Margolin; Brian Keegan; David Lazer | |||
| Social media have been employed to assess public opinions on events, markets, and policies. Most current work focuses on either developing aggregated measures or opinion extraction methods like sentiment analysis. These approaches suffer from unpredictable turnover in the participants and the information they react to, making it difficult to distinguish meaningful shifts from those that follow from known information. We propose a novel approach to tame these sources of uncertainty through the introduction of "computational focus groups" to track opinion shifts in social media streams. Our approach uses prior user behaviors to detect users' biases, then groups users with similar biases together. We track the behavior streams from these like-minded sub-groups and present time-dependent collective measures of their opinions. These measures control for the response rate and base attitudes of the users, making shifts in opinion both easier to detect and easier to interpret. We test the effectiveness of our system by tracking groups' Twitter responses to a common stimulus set: the 2012 U.S. presidential election debates. While our groups' behavior is consistent with their biases, there are numerous moments and topics on which they behave "out of character," suggesting precise targets for follow-up inquiry. We also demonstrate that tracking elite users with well-established biases does not yield such insights, as they are insensitive to the stimulus and simply reproduce expected patterns. The effectiveness of our system suggests a new direction both for researchers and data-driven journalists interested in identifying opinion shifting processes in real-time. | |||
| Rethinking the web as a personal archive | | BIBA | Full-Text | 749-760 | |
| Siân E. Lindley; Catherine C. Marshall; Richard Banks; Abigail Sellen; Tim Regan | |||
| In recent years the Web has evolved substantially, transforming from a place where we primarily find information to a place where we also leave, share and keep it. This presents a fresh set of challenges for the management of personal information, which include how to underpin greater awareness and more control over digital belongings and other personally meaningful content that is hosted online. In the study reported here, we follow up on research that suggests a sense of ownership and control can be reinforced by federating online content as a virtual, single store; we do this by conducting interviews with 14 individuals about their Web-based content. Participants were asked to give the researchers a tour of online content that is personally meaningful to them; to perform a search for themselves in order to uncover additional content; and to respond to a series of design envisionments. We examine whether there is any value in an integrated personal archive that would automatically update and serve firstly, as a source of information regarding the content within it (e.g. where it is stored, who has the rights to it), and secondly, as a resource for crafting personal artefacts such as scrapbooks, CVs and gifts for others. Our analysis leads us to reject the concept of a single archive. Instead, we present a framework of five different types of online content, each of which has separate implications for personal information management. | |||
| Expressive languages for selecting groups from graph-structured data | | BIBA | Full-Text | 761-770 | |
| Vitaliy Liptchinsky; Benjamin Satzger; Rostyslav Zabolotnyi; Schahram Dustdar | |||
| Many query languages for graph-structured data are based on regular path expressions, which describe relations among pairs of nodes. We propose an extension that allows to retrieve groups of nodes based on group structural characteristics and relations to other nodes or groups. It allows to express group selection queries in a concise and natural style, and can be integrated into any query language based on regular path queries. We present an efficient algorithm for evaluating group queries in polynomial time from an input data graph. Evaluations using real-world social networks demonstrate the practical feasibility of our approach. | |||
| Modeling/predicting the evolution trend of OSN-based applications | | BIBA | Full-Text | 771-780 | |
| Han Liu; Atif Nazir; Jinoo Joung; Chen-Nee Chuah | |||
| While various models have been proposed for generating social/friendship network graphs, the dynamics of user interactions through online social network (OSN) based applications remain largely unexplored. We previously developed a growth model to capture static weekly snapshots of user activity graphs (UAGs) using data from popular Facebook gifting applications. This paper presents a new continuous graph evolution model aimed to capture microscopic user-level behaviors that govern the growth of the UAG and collectively define the overall graph structure. We demonstrate the utility of our model by applying it to forecast the number of active users over time as the application transitions from initial growth to peak/mature and decline/fatigue phase. Using empirical evaluations, we show that our model can accurately reproduce the evolution trend of active user population for gifting applications, or other OSN applications that employ similar growth mechanisms. We also demonstrate that the predictions from our model can guide the generation of synthetic graphs that accurately represent empirical UAG snapshots sampled at different evolution stages. | |||
| SoCo: a social network aided context-aware recommender system | | BIBA | Full-Text | 781-802 | |
| Xin Liu; Karl Aberer | |||
| Contexts and social network information have been proven to be valuable for building accurate recommender systems. However, to the best of our knowledge, no existing work systematically combines diverse types of such information to further improve recommendation quality. In this paper, we propose SoCo, a novel context-aware recommender system incorporating elaborately processed social network information. We handle contextual information by applying random decision trees to partition the original user-item-rating matrix such that the ratings with similar contexts are grouped. Matrix factorization is then employed to predict the missing preferences of a user for an item using the partitioned matrix. In order to incorporate social network information, we introduce an additional social regularization term to the matrix factorization objective function to infer a user's preference for an item by learning opinions from his/her friends who are expected to share similar tastes. A context-aware version of the Pearson Correlation Coefficient is proposed to measure user similarity. Experiments on real datasets show that SoCo improves the performance (in terms of root mean square error) of the state-of-the-art context-aware recommender system and social recommendation model by 15.7% and 12.2%, respectively. | |||
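The social regularization idea can be made concrete with a minimal SGD sketch: a standard matrix factorization update plus a term that pulls each user's latent factors toward those of similar friends. The context partitioning via random decision trees and the context-aware similarity measure are omitted; all parameter names and values are illustrative, assuming a squared social regularizer.

```python
import numpy as np

def mf_social(ratings, friends, sim, n_users, n_items,
              k=8, lr=0.01, reg=0.05, beta=0.1, epochs=30, seed=0):
    """Matrix factorization with a social regularization term (sketch).
    ratings: list of (user, item, rating); friends[u]: list of u's friends;
    sim[(u, f)]: similarity weight between user u and friend f."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))
    V = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]
            # gradient of (beta/2) * sum_f sim(u,f) * ||U[u] - U[f]||^2
            social = sum(sim.get((u, f), 0.0) * (U[u] - U[f])
                         for f in friends.get(u, []))
            u_old = U[u].copy()
            U[u] += lr * (err * V[i] - reg * u_old - beta * social)
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V
```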
| Using stranger as sensors: temporal and geo-sensitive question answering via social media | | BIBA | Full-Text | 803-814 | |
| Yefeng Liu; Todorka Alexandrova; Tatsuo Nakajima | |||
| MoboQ is a location-based, real-time social question answering service deployed in the field in China. Using MoboQ, people can ask temporal and geo-sensitive questions, such as how long the line at a popular business is right now, and then receive answers crowdsourced from other users in a timely fashion. To obtain answers for questions, the system analyzes the live stream of the public microblogging service Sina Weibo to identify people who are likely to currently be at the place that is associated with a question and sends them the unsolicited question through the microblogging service from which they were identified. MoboQ was deployed in China at the beginning of 2012; by October of the same year, 35,214 registered users had asked 15,224 questions through it, and it had gathered 29,491 answers; 74.6% of the questions received at least one answer, 28% received a first response within 10 minutes, and 51% of the questions got a first answer within 20 minutes. In total, 91% of the questions successfully found at least one answer candidate, and they were sent to 162,954 microblogging service users. We analyze the usage patterns and behaviors of the real-world end-users, discuss the lessons learned, and outline future directions and possible applications that could be built on top of MoboQ. | |||
| Imagen: runtime migration of browser sessions for javascript web applications | | BIBA | Full-Text | 815-826 | |
| James Teng Kin Lo; Eric Wohlstadter; Ali Mesbah | |||
| Due to the increasing complexity of web applications and emerging HTML5 standards, a large amount of runtime state is created and managed in the user's browser. While such complexity is desirable for user experience, it makes it hard for developers to implement mechanisms that provide users with ubiquitous access to the data they create during application use. This paper presents our research into browser session migration for JavaScript-based web applications. Session migration is the act of transferring a session between browsers at runtime. Without burden to developers, our system allows users to create a snapshot image that captures all runtime state needed to resume the session elsewhere. Our system works completely in the JavaScript layer and thus snapshots can be transferred between different browser vendors and hardware devices. We report on performance metrics of the system using five applications, four different browsers, and three different devices. | |||
| Gender swapping and user behaviors in online social games | | BIBA | Full-Text | 827-836 | |
| Jing-Kai Lou; Kunwoo Park; Meeyoung Cha; Juyong Park; Chin-Laung Lei; Kuan-Ta Chen | |||
| Modern Massively Multiplayer Online Role-Playing Games (MMORPGs) provide lifelike virtual environments in which players can conduct a variety of activities including combat, trade, and chat with other players. While the game world and the available actions therein are inspired by their offline counterparts, the games' popularity and dedicated fan base are testaments to the allure of novel social interactions granted to people by allowing them an alternative life as a new character and persona. In this paper we investigate the phenomenon of "gender swapping," which refers to players choosing avatars of genders opposite to their natural ones. We report the behavioral patterns observed in players of Fairyland Online, a globally serviced MMORPG, during social interactions when playing as in-game avatars of their own real gender or gender-swapped. We also discuss the effect of gender role and self-image in virtual social situations and the potential of our study for improving MMORPG quality and detecting online identity frauds. | |||
| Mining structural hole spanners through information diffusion in social networks | | BIBA | Full-Text | 837-847 | |
| Tiancheng Lou; Jie Tang | |||
The theory of structural holes suggests that individuals would benefit from
filling the "holes" between people or groups that are otherwise disconnected;
such individuals are called structural hole spanners. A few empirical studies have verified
that structural hole spanners play a key role in the information diffusion.
However, there is still a lack of a principled methodology for detecting
structural hole spanners in a given social network.
In this work, we precisely define the problem of mining top-k structural hole spanners in large-scale social networks and provide an objective (quality) function to formalize the problem. Two instantiation models have been developed to implement the objective function. For the first model, we present an exact algorithm to solve it and prove its convergence. As for the second model, the optimization is proved to be NP-hard, and we design an efficient algorithm with provable approximation guarantees. We test the proposed models on three different networks: Coauthor, Twitter, and Inventor. Our study provides evidence for the theory of structural holes, e.g., 1% of Twitter users who span structural holes control 25% of the information diffusion on Twitter. We compare the proposed models with several alternative methods and the results show that our models clearly outperform the comparison methods. Our experiments also demonstrate that the detected structural hole spanners can help other social network applications, such as community kernel detection and link prediction. To the best of our knowledge, this is the first attempt to address the problem of mining structural hole spanners in large social networks. | |||
| On the evolution of the internet economic ecosystem | | BIBA | Full-Text | 849-860 | |
| Richard T. B. Ma; John C. S. Lui; Vishal Misra | |||
| The evolution of the Internet has manifested itself in many ways: the traffic characteristics, the interconnection topologies and the business relationships among the autonomous components. It is important to understand why (and how) this evolution came about, and how the interplay of these dynamics may affect future evolution and services. We propose a network aware, macroscopic model that captures the characteristics and interactions of the application and network providers, and show how it leads to a market equilibrium of the ecosystem. By analyzing the driving forces and the dynamics of the market equilibrium, we obtain some fundamental understandings of the cause and effect of the Internet evolution, which explain why some historical and recent evolutions have happened. Furthermore, by projecting the likely future evolutions, our model can help application and network providers to make informed business decisions so as to succeed in this competitive ecosystem. | |||
| Two years of short URLs internet measurement: security threats and countermeasures | | BIBA | Full-Text | 861-872 | |
| Federico Maggi; Alessandro Frossi; Stefano Zanero; Gianluca Stringhini; Brett Stone-Gross; Christopher Kruegel; Giovanni Vigna | |||
| URL shortening services have become extremely popular. However, it is still unclear whether they are an effective and reliable tool that can be leveraged to hide malicious URLs, and to what extent these abuses can impact the end users. With these questions in mind, we first analyzed existing countermeasures adopted by popular shortening services. Surprisingly, we found such countermeasures to be ineffective and trivial to bypass. This first measurement motivated us to proceed further with a large-scale collection of the HTTP interactions that originate when web users access live pages that contain short URLs. To this end, we monitored 622 distinct URL shortening services between March 2010 and April 2012, and collected 24,953,881 distinct short URLs. With this large dataset, we studied the abuse of short URLs. Although short URLs are a significant new security risk, we found, in accordance with reports based on observations of overall phishing and spamming activity, that only a relatively small fraction of users ever encountered malicious short URLs. Interestingly, during the second year of measurement, we noticed an increased percentage of short URLs being abused for drive-by download campaigns and a decreased percentage of short URLs being abused for spam campaigns. In addition to these security-related findings, our unique monitoring infrastructure and large dataset allowed us to complement previous research on short URLs and analyze these web services from the user's perspective. | |||
| Know your personalization: learning topic level personalization in online services | | BIBA | Full-Text | 873-884 | |
| Anirban Majumder; Nisheeth Shrivastava | |||
| Online service platforms (OSPs), such as search engines, news-websites,
ad-providers, etc., serve highly personalized content to the user, based on the
profile extracted from her history with the OSP. In this paper, we capture the
OSP's personalization for a user in a new data structure called the
personalization vector (η), which is a weighted vector over a set of topics,
and present efficient algorithms to learn it.
Our approach treats OSPs as black boxes, and extracts η by mining only their output, specifically, the personalized (for a user) and vanilla (without any user information) content served, and the differences in this content. We believe that such treatment of OSPs is a unique aspect of our work, not just enabling access to (so far hidden) profiles in OSPs, but also providing a novel and practical approach for retrieving information from OSPs by mining differences in their outputs. We formulate a new model called Latent Topic Personalization (LTP) that captures the personalization vector in a learning framework and present efficient inference algorithms for determining it. We perform extensive experiments targeting search engine personalization, using data from both real Google users and a synthetic setup. Our results indicate that LTP achieves high accuracy (R-pre = 84%) in discovering personalized topics. For Google data, our qualitative results demonstrate that the topics determined by LTP for a user correspond well to his ad categories determined by Google. | |||
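At an intuition level, a topic-weighted personalization vector can be approximated by contrasting how often each topic appears in the personalized output versus the vanilla output. The sketch below does exactly that with simple counts; LTP itself is a latent-variable model with proper inference, so this is only a rough illustration with made-up topic labels.

```python
from collections import Counter

def personalization_vector(personalized_topics, vanilla_topics):
    """Crude approximation of a topic-level personalization vector: how much
    more often each topic appears in the personalized results than in the
    vanilla ones, normalized over the positive gaps."""
    p, v = Counter(personalized_topics), Counter(vanilla_topics)
    n_p, n_v = sum(p.values()), sum(v.values())
    gaps = {t: max(p[t] / n_p - v[t] / n_v, 0.0) for t in set(p) | set(v)}
    total = sum(gaps.values())
    return {t: g / total for t, g in gaps.items() if g > 0} if total else {}

pers = ["sports", "sports", "tech", "travel"]
vani = ["tech", "travel", "news", "news"]
print(personalization_vector(pers, vani))   # {'sports': 1.0}
```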
| Saving, reusing, and remixing web video: using attitudes and practices to reveal social norms | | BIBA | Full-Text | 885-896 | |
| Catherine C. Marshall; Frank M. Shipman | |||
| The growth of online videos has spurred a concomitant increase in the storage, reuse, and remix of this content. As we gain more experience with video content, social norms about ownership have evolved accordingly, spelling out what people think is appropriate use of content that is not necessarily their own. We use a series of three studies, each centering on a different genre of recordings, to probe 634 participants' attitudes toward video storage, reuse, and remix; we also question participants about their own experiences with online video. The results allow us to characterize current practice and emerging social norms and to establish the relationship between the two. Hypotheticals borrowed from legal research are used as the primary vehicle for testing attitudes, and for identifying boundaries between socially acceptable and unacceptable behavior. | |||
| From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews | | BIBA | Full-Text | 897-908 | |
| Julian John McAuley; Jure Leskovec | |||
| Recommending products to consumers means not only understanding their tastes, but also understanding their level of experience. For example, it would be a mistake to recommend the iconic film Seven Samurai simply because a user enjoys other action movies; rather, we might conclude that they will eventually enjoy it -- once they are ready. The same is true for beers, wines, gourmet foods -- or any products where users have acquired tastes: the 'best' products may not be the most 'accessible'. Thus our goal in this paper is to recommend products that a user will enjoy now, while acknowledging that their tastes may have changed over time, and may change again in the future. We model how tastes change due to the very act of consuming more products -- in other words, as users become more experienced. We develop a latent factor recommendation system that explicitly accounts for each user's level of experience. We find that such a model not only leads to better recommendations, but also allows us to study the role of user experience and expertise on a novel dataset of fifteen million beer, wine, food, and movie reviews. | |||
| The FLDA model for aspect-based opinion mining: addressing the cold start problem | | BIBA | Full-Text | 909-918 | |
| Samaneh Moghaddam; Martin Ester | |||
| Aspect-based opinion mining from online reviews has attracted a lot of attention recently. The main goal of all of the proposed methods is extracting aspects and/or estimating aspect ratings. Recent works, which are often based on Latent Dirichlet Allocation (LDA), consider both tasks simultaneously. These models are normally trained at the item level, i.e., a model is learned for each item separately. Learning a model per item is fine when the item has been reviewed extensively and has enough training data. However, in real-life data sets such as those from Epinions.com and Amazon.com more than 90% of items have less than 10 reviews, so-called cold start items. State-of-the-art LDA models for aspect-based opinion mining are trained at the item level and therefore perform poorly for cold start items due to the lack of sufficient training data. In this paper, we propose a probabilistic graphical model based on LDA, called Factorized LDA (FLDA), to address the cold start problem. The underlying assumption of FLDA is that aspects and ratings of a review are influenced not only by the item but also by the reviewer. It further assumes that both items and reviewers can be modeled by a set of latent factors which represent their aspect and rating distributions. Different from state-of-the-art LDA models, FLDA is trained at the category level and learns the latent factors using the reviews of all the items of a category, in particular the non cold start items, and uses them as prior for cold start items. Our experiments on three real-life data sets demonstrate the improved effectiveness of the FLDA model in terms of likelihood of the held-out test set. We also evaluate the accuracy of FLDA based on two application-oriented measures. | |||
| Iolaus: securing online content rating systems | | BIBA | Full-Text | 919-930 | |
| Arash Molavi Kakhki; Chloe Kliman-Silver; Alan Mislove | |||
| Online content ratings services allow users to find and share content
ranging from news articles (Digg) to videos (YouTube) to businesses (Yelp).
Generally, these sites allow users to create accounts, declare friendships,
upload and rate content, and locate new content by leveraging the aggregated
ratings of others. These services are becoming increasingly popular; Yelp alone
has over 33 million reviews. Unfortunately, this popularity is leading to
increasing levels of malicious activity, including multiple identity (Sybil)
attacks and the "buying" of ratings from users.
In this paper, we present Iolaus, a system that leverages the underlying social network of online content rating systems to defend against such attacks. Iolaus uses two novel techniques: (a) weighing ratings to defend against multiple identity attacks and (b) relative ratings to mitigate the effect of "bought" ratings. An evaluation of Iolaus using microbenchmarks, synthetic data, and real-world content rating data demonstrates that Iolaus is able to outperform existing approaches and serve as a practical defense against multiple-identity and rating-buying attacks. | |||
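The two defenses named above can be illustrated with a small sketch. This is not the Iolaus algorithm: in Iolaus the per-rater weights come from the underlying social network, whereas here they are supplied as a plain `trust` dict, and the "relative rating" idea is approximated by centering each rater's scores on their own mean so that accounts that rate everything highly carry less absolute signal.

```python
from collections import defaultdict

def weighted_rating(ratings, trust):
    """Trust-weighted mean rating for one item.

    ratings: list of (user, score) pairs for the item.
    trust:   dict user -> weight in [0, 1]; in Iolaus these weights would be
             derived from the social graph, here they are simply given.
    """
    num = sum(trust.get(u, 0.0) * s for u, s in ratings)
    den = sum(trust.get(u, 0.0) for u, _ in ratings)
    return num / den if den > 0 else None

def relative_ratings(all_ratings):
    """Center each user's scores on their own mean rating.

    all_ratings: dict item -> list of (user, score).
    Returns:     dict item -> list of (user, centered score).
    """
    per_user = defaultdict(list)
    for rs in all_ratings.values():
        for u, s in rs:
            per_user[u].append(s)
    user_mean = {u: sum(v) / len(v) for u, v in per_user.items()}
    return {item: [(u, s - user_mean[u]) for u, s in rs]
            for item, rs in all_ratings.items()}
```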
| On cognition, emotion, and interaction aspects of search tasks with different search intentions | | BIBA | Full-Text | 931-942 | |
| Yashar Moshfeghi; Joemon M. Jose | |||
| The complex and dynamic nature of search processes surrounding information seeking has been exhaustively studied. Recent studies have highlighted that search processes with different intentions, such as those for entertainment purposes or re-finding a visited information object, are fundamentally different in nature from typical information seeking intentions. Despite the popularity of such search processes on the Web, they have not yet been thoroughly explored. Using a video retrieval system as a use case, we study the characteristics of four different search task types: seeking information, re-finding a particular information object, and two different entertainment intentions (i.e. entertainment by adjusting arousal level, and entertainment by adjusting mood). In particular, we looked at the cognition, emotion and action aspects of these search tasks at different phases of a search process. This follows the common assumption in the information seeking and retrieval community that a complex search process can be broken down into a relatively small number of activity phases. Our experimental results show significant differences in the characteristics of the studied search tasks. Furthermore, we investigate whether we can predict these search tasks given the user's interaction with the system. Results show that we can learn a model that predicts the search task types with reasonable accuracy. Overall, these findings may help to steer search engines to better satisfy searchers' needs beyond typically assumed information seeking processes. | |||
| Ad impression forecasting for sponsored search | | BIBA | Full-Text | 943-952 | |
| Abhirup Nath; Shibnath Mukherjee; Prateek Jain; Navin Goyal; Srivatsan Laxman | |||
| A typical problem for a search engine (hosting a sponsored search service) is to provide an advertiser with a forecast of the number of impressions his/her ad is likely to obtain for a given bid. Accurate forecasts have high business value, since they enable advertisers to select bids that lead to better returns on their investment. They also play an important role in services such as automatic campaign optimization. Despite its importance, the problem has remained relatively unexplored in the literature. Existing methods typically overfit to the training data, leading to inconsistent performance. Furthermore, some of the existing methods cannot provide predictions for new ads, i.e., for ads that are not present in the logs. In this paper, we develop a generative model based approach that addresses these drawbacks. We design a Bayes net to capture inter-dependencies between the query traffic features and the competitors in an auction. Furthermore, we account for variability in the volume of query traffic by using a dynamic linear model. Finally, we implement our approach on a production-grade MapReduce framework and conduct extensive large scale experiments on substantial volumes of sponsored search data from Bing. Our experimental results demonstrate significant advantages over existing methods as measured using several accuracy/error criteria, improved ability to provide estimates for new ads and more consistent performance with smaller variance in accuracies. Our method can also be adapted to several other related forecasting problems such as predicting average position of ads or the number of clicks under budget constraints. | |||
| Measurement and modeling of eye-mouse behavior in the presence of nonlinear page layouts | | BIBA | Full-Text | 953-964 | |
| Vidhya Navalpakkam; LaDawn Jentzsch; Rory Sayres; Sujith Ravi; Amr Ahmed; Alex Smola | |||
| As search pages are becoming increasingly complex, with images and nonlinear page layouts, understanding how users examine the page is important. We present a lab study on the effect of a rich informational panel to the right of the search result column, on eye and mouse behavior. Using eye and mouse data, we show that the flow of user attention on nonlinear page layouts is different from the widely believed top-down linear examination order of search results. We further demonstrate that the mouse, like the eye, is sensitive to two key attributes of page elements -- their position (layout), and their relevance to the user's task. We identify mouse measures that are strongly correlated with eye movements, and develop models to predict user attention (eye gaze) from mouse activity. These findings show that mouse tracking can be used to infer user attention and information flow patterns on search pages. Potential applications include ranking, search page optimization, and UI evaluation. | |||
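A minimal sketch of the kind of eye-mouse analysis described above: given time-aligned vertical gaze and mouse positions (the toy numbers below are illustrative, not the study's data), one can measure how strongly a mouse signal tracks attention with a simple Pearson correlation; the paper's models go further and predict gaze from several such mouse measures.

```python
import numpy as np

# Toy, time-aligned traces (one sample per 50 ms); real data would come from
# an eye tracker and a mouse logger synchronized on the same clock.
gaze_y  = np.array([120, 135, 150, 180, 240, 300, 310, 305], dtype=float)
mouse_y = np.array([100, 110, 140, 175, 230, 280, 300, 310], dtype=float)

# Pearson correlation between vertical gaze and mouse position.
r = np.corrcoef(gaze_y, mouse_y)[0, 1]
print(f"gaze-mouse correlation: {r:.2f}")
```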
| Understanding and decreasing the network footprint of catch-up TV | | BIBA | Full-Text | 965-976 | |
| Gianfranco Nencioni; Nishanth Sastry; Jigna Chandaria; Jon Crowcroft | |||
| "Catch-up", or on-demand access of previously broadcast TV content over the
public Internet, constitutes a significant fraction of peak time network
traffic. This paper analyses consumption patterns of nearly 6 million users of
a nationwide deployment of a catch-up TV service, to understand the network
support required. We find that catch-up has certain natural scaling properties
compared to traditional TV: The on-demand nature spreads load over time, and
users have much higher completion rates for content streams than previously
reported. Users exhibit strong preferences for serialised content, and for
specific genres.
Exploiting this, we design a Speculative Content Offloading and Recording Engine (SCORE) that predictively records a personalised set of shows on user-local storage, and thereby offloads traffic that might result from subsequent catch-up access. Evaluations show that even with a modest storage of 32GB, an oracle with complete knowledge of user consumption can save up to 74% of the energy, and 97% of the peak bandwidth compared to the current IP streaming-based architecture. In the best case, optimising for energy consumption, SCORE can recover more than 60% of the traffic and energy savings achieved by the oracle. Optimising purely for traffic rather than energy can reduce bandwidth by an additional 5%. | |||
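A rough sketch of the speculative-recording idea under a storage budget follows; the show list, sizes and watch probabilities are made up, and SCORE's actual optimization (which can target energy or traffic) is considerably richer than this greedy heuristic.

```python
def select_recordings(shows, capacity_gb=32.0):
    """Greedily pick shows that maximize expected offloaded gigabytes per
    gigabyte of storage, subject to a capacity budget.

    shows: list of (name, size_gb, watch_probability) -- placeholder values.
    """
    ranked = sorted(shows, key=lambda s: s[2], reverse=True)  # by watch prob.
    picked, used = [], 0.0
    for name, size, prob in ranked:
        if used + size <= capacity_gb:
            picked.append(name)
            used += size
    return picked, used

shows = [("drama_ep1", 1.2, 0.8), ("news", 0.6, 0.3), ("film", 3.5, 0.6)]
print(select_recordings(shows))
```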
| Sorry, I don't speak SPARQL: translating SPARQL queries into natural language | | BIBA | Full-Text | 977-988 | |
| Axel-Cyrille Ngonga Ngomo; Lorenz Bühmann; Christina Unger; Jens Lehmann; Daniel Gerber | |||
| Over the past years, Semantic Web and Linked Data technologies have reached the backend of a considerable number of applications. Consequently, large amounts of RDF data are constantly being made available across the planet. While experts can easily gather information from this wealth of data by using the W3C standard query language SPARQL, most lay users lack the expertise necessary to proficiently interact with these applications. Consequently, non-expert users usually have to rely on forms, query builders, question answering or keyword search tools to access RDF data. However, these tools have so far been unable to explicate the queries they generate to lay users, making it difficult for these users to (i) assess the correctness of the query generated from their input, (ii) adapt their queries, or (iii) choose in an informed manner between possible interpretations of their input. This paper addresses this drawback by presenting SPARQL2NL, a generic approach that allows verbalizing SPARQL queries, i.e., converting them into natural language. Our framework can be integrated into applications where lay users are required to understand SPARQL or to generate SPARQL queries in a direct (forms, query builders) or an indirect (keyword search, question answering) manner. We evaluate our approach on the DBpedia question set provided by QALD-2 within a survey setting with both SPARQL experts and lay users. The results of the 115 completed surveys show that SPARQL2NL can generate complete and easily understandable natural language descriptions. In addition, our results suggest that even SPARQL experts can process the natural language representation of SPARQL queries computed by our approach more efficiently than the corresponding SPARQL queries. Moreover, non-experts are enabled to reliably understand the content of SPARQL queries. | |||
| Bitsquatting: exploiting bit-flips for fun, or profit? | | BIBA | Full-Text | 989-998 | |
| Nick Nikiforakis; Steven Van Acker; Wannes Meert; Lieven Desmet; Frank Piessens; Wouter Joosen | |||
| Over the last fifteen years, several types of attacks against domain names
and the companies relying on them have been observed. The well-known
cybersquatting of domain names gave way to typosquatting, the abuse of a user's
mistakes when typing a URL in her browser's address bar. Recently, a new attack
against domain names surfaced, namely bitsquatting. In bitsquatting, an
attacker leverages random bit-errors occurring in the memory of commodity
computers and smartphones, to redirect Internet traffic to attacker-controlled
domains.
In this paper, we report on a large-scale experiment, measuring the adoption of bitsquatting by the domain-squatting community through the tracking of registrations of bitsquatting domains targeting popular web sites over a 9-month period. We show how new bitsquatting domains are registered daily and how attackers are trying to monetize their domains through the use of ads, abuse of affiliate programs and even malware installations. Lastly, given the discovered prevalence of bitsquatting, we review possible defense measures that companies, software developers and Internet Service Providers can use to protect against it. | |||
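The core of bitsquatting is easy to illustrate: enumerate the names that differ from a target domain by a single flipped bit and are still syntactically plausible. A minimal sketch, assuming a lowercase, ASCII-only input:

```python
import string

VALID = set(string.ascii_lowercase + string.digits + "-")

def bitsquats(domain):
    """Single-bit-flip variants of a (lowercase, ASCII) domain that remain
    plausible domain names; these are the names a bitsquatter would register.
    """
    out = set()
    for i, ch in enumerate(domain):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit)).lower()
            if flipped in VALID and flipped != ch:
                out.add(domain[:i] + flipped + domain[i + 1:])
    return sorted(out)

print(bitsquats("example.com")[:10])
```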
| One-class collaborative filtering with random graphs | | BIBA | Full-Text | 999-1008 | |
| Ulrich Paquet; Noam Koenigstein | |||
| The bane of one-class collaborative filtering is interpreting and modelling the latent signal from the missing class. In this paper we present a novel Bayesian generative model for implicit collaborative filtering. It forms a core component of the Xbox Live architecture, and unlike previous approaches, delineates the odds of a user disliking an item from simply being unaware of it. The latent signal is treated as an unobserved random graph connecting users with items they might have encountered. We demonstrate how large-scale distributed learning can be achieved through a combination of stochastic gradient descent and mean field variational inference over random graph samples. A fine-grained comparison is done against a state of the art baseline on real world data. | |||
| Latent credibility analysis | | BIBA | Full-Text | 1009-1020 | |
| Jeff Pasternack; Dan Roth | |||
| A frequent problem when dealing with data gathered from multiple sources on the web (ranging from booksellers to Wikipedia pages to stock analyst predictions) is that these sources disagree, and we must decide which of their (often mutually exclusive) claims we should accept. Current state-of-the-art information credibility algorithms known as "fact-finders" are transitive voting systems with rules specifying how votes iteratively flow from sources to claims and then back to sources. While this is quite tractable and often effective, fact-finders also suffer from substantial limitations; in particular, a lack of transparency obfuscates their credibility decisions and makes them difficult to adapt and analyze: knowing the mechanics of how votes are calculated does not readily tell us what those votes mean, and finding, for example, that a source has a score of 6 is not informative. We introduce a new approach to information credibility, Latent Credibility Analysis (LCA), constructing strongly principled, probabilistic models where the truth of each claim is a latent variable and the credibility of a source is captured by a set of model parameters. This gives LCA models clear semantics and modularity that make extending them to capture additional observed and latent credibility factors straightforward. Experiments over four real-world datasets demonstrate that LCA models can outperform the best fact-finders in both unsupervised and semi-supervised settings. | |||
| Predicting group stability in online social networks | | BIBA | Full-Text | 1021-1030 | |
| Akshay Patil; Juan Liu; Jie Gao | |||
| Social groups often exhibit a high degree of dynamism. Some groups thrive, while many others die over time. Modeling group stability dynamics and understanding whether/when a group will remain stable or shrink over time can be important in a number of social domains. In this paper, we study two different types of social networks as exemplar platforms for modeling and predicting group stability dynamics. We build models to predict if a group is going to remain stable or is likely to shrink over a period of time. We observe that both the level of member diversity and social activities are critical in maintaining the stability of groups. We also find that certain 'prolific' members play a more important role in maintaining the group stability. Our study shows that group stability can be predicted with high accuracy, and feature diversity is critical to prediction performance. | |||
| Predictive web automation assistant for people with vision impairments | | BIBA | Full-Text | 1031-1040 | |
| Yury Puzis; Yevgen Borodin; Rami Puzis; I. V. Ramakrishnan | |||
| The Web is far less usable and accessible for people with vision impairments
than it is for sighted people. Web automation, a process of automating browsing
actions on behalf of the user, has the potential to bridge the divide between
the ways sighted people and people with vision impairments access the Web;
specifically, it can enable the latter to breeze through web browsing tasks
that beforehand were slow, hard, or even impossible to accomplish. Typical web
automation requires that the user record a macro, a sequence of browsing steps,
so that these steps can be automated in the future by replaying the macro.
However, for people with vision impairment, automation with macros is not
usable.
In this paper, we propose a novel model-based approach that facilitates web automation without having to either record or replay macros. Using the past browsing history and the current web page as the browsing context, the proposed model can predict the most probable browsing actions that the user can do. The model construction is "unsupervised". More importantly, the model is continuously and incrementally updated as history evolves, thereby, ensuring the predictions are not "outdated". We also describe a novel interface that lets the user focus on the objects associated with the most probable predicted browsing steps (e.g., clicking links and filling out forms), and facilitates automatic execution of the selected steps. A study with 19 blind participants showed that the proposed approach dramatically reduced the interaction time needed to accomplish typical browsing tasks, and the user interface was perceived to be much more usable than the standard screen-reading interfaces. | |||
| Mining collective intelligence in diverse groups | | BIBA | Full-Text | 1041-1052 | |
| Guo-Jun Qi; Charu C. Aggarwal; Jiawei Han; Thomas Huang | |||
| Collective intelligence, which aggregates the shared information from large crowds, is often negatively impacted by unreliable information sources with the low quality data. This becomes a barrier to the effective use of collective intelligence in a variety of applications. In order to address this issue, we propose a probabilistic model to jointly assess the reliability of sources and find the true data. We observe that different sources are often not independent of each other. Instead, sources are prone to be mutually influenced, which makes them dependent when sharing information with each other. High dependency between sources makes collective intelligence vulnerable to the overuse of redundant (and possibly incorrect) information from the dependent sources. Thus, we reveal the latent group structure among dependent sources, and aggregate the information at the group level rather than from individual sources directly. This can prevent the collective intelligence from being inappropriately dominated by dependent sources. We will also explicitly reveal the reliability of groups, and minimize the negative impacts of unreliable groups. Experimental results on real-world data sets show the effectiveness of the proposed approach with respect to existing algorithms. | |||
| Trade area analysis using user generated mobile location data | | BIBA | Full-Text | 1053-1064 | |
| Yan Qu; Jun Zhang | |||
| In this paper, we illustrate how User Generated Mobile Location Data (UGMLD) like Foursquare check-ins can be used in Trade Area Analysis (TAA) by introducing a new framework and corresponding analytic methods. Three key processes were created: identifying the activity center of a mobile user, profiling users based on their location history, and modeling users' preference probability. Extensions to traditional TAA are introduced, including customer-centric distance decay analysis and check-in sequence analysis. Adopting the rich content and context of UGMLD, these methods introduce new dimensions to modeling and delineating trade areas. Analyzing customers' visits to a business in the context of their daily life sheds new light on the nature and performance of the venue. This work has important business implications in the field of mobile computing. | |||
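As a small illustration of the first step above (identifying a user's activity center), one plausible definition, assumed here since the abstract does not spell it out, is the check-in that minimizes total great-circle distance to all of the user's other check-ins:

```python
import math

def activity_center(checkins):
    """Geometric medoid of a user's check-ins: the check-in minimizing the
    summed great-circle distance to all others (more robust to a few
    far-away trips than a plain coordinate mean).

    checkins: non-empty list of (lat, lon) pairs in degrees.
    """
    def haversine(a, b):
        lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
        dlat, dlon = lat2 - lat1, lon2 - lon1
        h = (math.sin(dlat / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
        return 2 * 6371.0 * math.asin(math.sqrt(h))  # distance in km

    return min(checkins,
               key=lambda c: sum(haversine(c, other) for other in checkins))
```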
| Psychological maps 2.0: a web engagement enterprise starting in London | | BIBA | Full-Text | 1065-1076 | |
| Daniele Quercia; Joao Paulo Pesce; Virgilio Almeida; Jon Crowcroft | |||
| Planners and social psychologists have suggested that the recognizability of the urban environment is linked to people's socio-economic well-being. We build a web game that puts the recognizability of London's streets to the test. It follows as closely as possible one experiment done by Stanley Milgram in 1972. The game picks up random locations from Google Street View and tests users to see if they can judge the location in terms of closest subway station, borough, or region. Each participant dedicates only few minutes to the task (as opposed to 90 minutes in Milgram's). We collect data from 2,255 participants (one order of magnitude a larger sample) and build a recognizability map of London based on their responses. We find that some boroughs have little cognitive representation; that recognizability of an area is explained partly by its exposure to Flickr and Foursquare users and mostly by its exposure to subway passengers; and that areas with low recognizability do not fare any worse on the economic indicators of income, education, and employment, but they do significantly suffer from social problems of housing deprivation, poor living conditions, and crime. These results could not have been produced without analyzing life off- and online: that is, without considering the interactions between urban places in the physical world and their virtual presence on platforms such as Flickr and Foursquare. This line of work is at the crossroad of two emerging themes in computing research -- a crossroad where "web science" meets the "smart city" agenda. | |||
| Towards realistic team formation in social networks based on densest subgraphs | | BIBA | Full-Text | 1077-1088 | |
| Syama Sundar Rangapuram; Thomas Bühler; Matthias Hein | |||
| Given a task T, a set of experts V with multiple skills and a social network G(V, W) reflecting the compatibility among the experts, team formation is the problem of identifying a team C ⊆ V that is both competent in performing the task T and compatible in working together. Existing methods for this problem make overly restrictive assumptions and thus cannot model practical scenarios. The goal of this paper is to consider the team formation problem in a realistic setting and present a novel formulation based on densest subgraphs. Our formulation allows modeling of many natural requirements such as (i) inclusion of a designated team leader and/or a group of given experts, (ii) restriction of the size or, more generally, the cost of the team, and (iii) enforcing locality of the team, e.g., in a geographical or social sense. The proposed formulation leads to a generalized version of the classical densest subgraph problem with cardinality constraints (DSP), which is an NP-hard problem and has many applications in social network analysis. In this paper, we present a new method for (approximately) solving the generalized DSP (GDSP). Our method, FORTE, is based on solving an equivalent continuous relaxation of GDSP. The solution found by our method has a quality guarantee and always satisfies the constraints of GDSP. Experiments show that the proposed formulation (GDSP) is useful in modeling a broader range of team formation problems and that our method produces more coherent and compact teams of high quality. We also show, with the help of an LP relaxation of GDSP, that our method gives close to optimal solutions to GDSP. | |||
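For context, the classical, unconstrained densest-subgraph problem already admits a simple greedy 2-approximation (Charikar's peeling), sketched below with networkx; the paper's GDSP adds the leader, size and locality constraints and is solved via a continuous relaxation (FORTE), not this baseline.

```python
import networkx as nx

def greedy_densest_subgraph(G):
    """Charikar's peeling heuristic: repeatedly delete the minimum-degree
    node and remember the densest intermediate subgraph (density = |E|/|V|).
    """
    H = G.copy()
    best_nodes, best_density = list(H.nodes()), 0.0
    while H.number_of_nodes() > 0:
        density = H.number_of_edges() / H.number_of_nodes()
        if density > best_density:
            best_density, best_nodes = density, list(H.nodes())
        v = min(H.degree(), key=lambda kv: kv[1])[0]  # minimum-degree node
        H.remove_node(v)
    return best_nodes, best_density
```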
| Efficient community detection in large networks using content and links | | BIBA | Full-Text | 1089-1098 | |
| Yiye Ruan; David Fuhry; Srinivasan Parthasarathy | |||
| In this paper we discuss a very simple approach of combining content and
link information in graph structures for the purpose of community discovery, a
fundamental task in network analysis. Our approach hinges on the basic
intuition that many networks contain noise in the link structure and that
content information can help strengthen the community signal. This enables one
to eliminate the impact of noise (false positives and false negatives), which
is particularly prevalent in online social networks and Web-scale information
networks.
Specifically we introduce a measure of signal strength between two nodes in the network by fusing their link strength with content similarity. Link strength is estimated based on whether the link is likely (with high probability) to reside within a community. Content similarity is estimated through cosine similarity or Jaccard coefficient. We discuss a simple mechanism for fusing content and link similarity. We then present a biased edge sampling procedure which retains edges that are locally relevant for each graph node. The resulting backbone graph can be clustered using standard community discovery algorithms such as Metis and Markov clustering. Through extensive experiments on multiple real-world datasets (Flickr, Wikipedia and CiteSeer) with varying sizes and characteristics, we demonstrate the effectiveness and efficiency of our methods over state-of-the-art learning and mining approaches several of which also attempt to combine link and content analysis for the purposes of community discovery. Specifically we always find a qualitative benefit when combining content with link analysis. Additionally our biased graph sampling approach realizes a quantitative benefit in that it is typically several orders of magnitude faster than competing approaches. | |||
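A minimal sketch of the fusion-plus-sampling pipeline described above, with illustrative choices standing in for the paper's exact rules: a fixed mixing weight `alpha`, cosine similarity over supplied feature vectors, and a keep-top-k-edges-per-node rule in place of the biased edge sampling.

```python
import networkx as nx
import numpy as np

def fuse_and_sample(G, content, alpha=0.5, top_k=5):
    """Fuse link weight with content (cosine) similarity on every edge, then
    keep each node's top-k strongest edges as a sparse backbone graph that a
    standard community-discovery algorithm can cluster.

    content: dict node -> numpy feature vector (e.g., tf-idf of its text).
    """
    def cosine(u, v):
        a, b = content[u], content[v]
        d = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b) / d if d > 0 else 0.0

    def fused(u, v):
        link = G[u][v].get("weight", 1.0)
        return alpha * link + (1 - alpha) * cosine(u, v)

    backbone = nx.Graph()
    backbone.add_nodes_from(G.nodes())
    for u in G.nodes():
        nbrs = sorted(G[u], key=lambda v: fused(u, v), reverse=True)[:top_k]
        for v in nbrs:
            backbone.add_edge(u, v, weight=fused(u, v))
    return backbone
```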
| Learning joint query interpretation and response ranking | | BIBA | Full-Text | 1099-1110 | |
| Uma Sawant; Soumen Chakrabarti | |||
| Thanks to information extraction and semantic Web efforts, search on unstructured text is increasingly refined using semantic annotations and structured knowledge bases. However, most users cannot become familiar with the schema of knowledge bases and ask structured queries. Interpreting free-format queries into a more structured representation is of much current interest. The dominant paradigm is to segment or partition query tokens by purpose (references to types, entities, attribute names, attribute values, relations) and then launch the interpreted query on structured knowledge bases. Given that structured knowledge extraction is never complete, here we choose a less trodden path: a data representation that retains the unstructured text corpus, along with structured annotations (mentions of entities and relationships) on it. We propose two new, natural formulations for joint query interpretation and response ranking that exploit bidirectional flow of information between the knowledge base and the corpus. One, inspired by probabilistic language models, computes expected response scores over the uncertainties of query interpretation. The other is based on max-margin discriminative learning, with latent variables representing those uncertainties. In the context of typed entity search, both formulations bridge a considerable part of the accuracy gap between a generic query that does not constrain the type at all, and the upper bound where the "perfect" target entity type of each query is provided by humans. Our formulations are also superior to a two-stage approach of first choosing a target type using recent query type prediction techniques, and then launching a type-restricted entity search query. | |||
| A model for green design of online news media services | | BIBA | Full-Text | 1111-1122 | |
| Daniel Schien; Paul Shabajee; Stephen G. Wood; Chris Preist | |||
| The use of information and communication technology and the web-based products it provides is responsible for significant emissions of greenhouse gases. In order to enable the reduction of emissions during the design of such products, it is necessary to estimate as accurately as possible their carbon impact over the entire product system. In this work we describe a new method which combines models of energy consumption during the use of digital media with models of the behavior of the audience. We apply this method to conduct an assessment of the annual carbon emissions for the product suite of a major international news organization. We then demonstrate its use for green design by evaluating the impacts of five different interventions on the product suite. We find that the carbon footprint of the online newspaper amounts to approximately 7700 tCO2e per year, of which 75% is caused by user devices. Among the evaluated scenarios, a significant uptake of eReaders in place of PCs has the greatest reduction potential. Our results also show that even a significant reduction of data volume on a web page would only result in small overall energy savings. | |||
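The accounting behind such an assessment can be sketched as device power draw times audience device-hours, converted with a grid carbon-intensity factor. All figures below are placeholders for illustration, not the paper's measured values.

```python
# Back-of-the-envelope footprint: energy per device class during use,
# multiplied by audience hours and a grid carbon-intensity factor.
GRID_KG_CO2E_PER_KWH = 0.5  # placeholder grid intensity

devices = {
    # device class: (watts while browsing, audience device-hours per year)
    "desktop_pc": (70.0, 40e6),
    "laptop":     (25.0, 30e6),
    "smartphone": ( 2.0, 50e6),
    "eReader":    ( 1.0,  5e6),
}

total_kwh = sum(watts * hours / 1000.0 for watts, hours in devices.values())
print(f"~{total_kwh * GRID_KG_CO2E_PER_KWH / 1000.0:.0f} tCO2e per year")
```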
| Potential networks, contagious communities, and understanding social network structure | | BIBA | Full-Text | 1123-1132 | |
| Grant Schoenebeck | |||
| In this paper we study how the network of agents adopting a particular
technology relates to the structure of the underlying network over which the
technology adoption spreads. We develop a model and show that the network of
agents adopting a particular technology may have characteristics that differ
significantly from the social network of agents over which the technology
spreads. For example, the network induced by a cascade may have a heavy-tailed
degree distribution even if the original network does not.
This provides evidence that online social networks created by technology adoption over an underlying social network may look fundamentally different from social networks and indicates that using data from many online social networks may mislead us if we try to use it to directly infer the structure of social networks. Our results provide an alternate explanation for certain properties repeatedly observed in data sets, for example: heavy-tailed degree distribution, network densification, shrinking diameter, and network community profile. These properties could be caused by a sort of sampling bias rather than by attributes of the underlying social structure. By generating networks using cascades over traditional network models that do not themselves contain these properties, we can nevertheless reliably produce networks that contain all these properties. An opportunity for interesting future research is developing new methods that correctly infer underlying network structure from data about a network that is generated via a cascade spread over the underlying network. | |||
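The setup can be reproduced in a few lines: spread a cascade over a network that is not heavy-tailed (Erdos-Renyi here) and inspect the degree distribution of the induced adopter subgraph. The independent-cascade spreading rule below is an illustrative stand-in, not necessarily the cascade model developed in the paper.

```python
import random
import networkx as nx

def independent_cascade(G, seeds, p=0.15):
    """Simple independent-cascade simulation: each newly active node gets
    one chance to activate each neighbor with probability p."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in G.neighbors(u):
                if v not in active and random.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active

# Underlying network: Erdos-Renyi, whose own degree distribution is light-tailed.
G = nx.gnp_random_graph(5000, 0.002, seed=1)
adopters = independent_cascade(G, seeds=[0])
induced = G.subgraph(adopters)                     # the observed adopter network
degrees = sorted((d for _, d in induced.degree()), reverse=True)
print(len(adopters), degrees[:10])                 # compare against G's degree tail
```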
| Do social explanations work?: studying and modeling the effects of social explanations in recommender systems | | BIBA | Full-Text | 1133-1144 | |
| Amit Sharma; Dan Cosley | |||
| Recommender systems associated with social networks often use social explanations (e.g. "X, Y and 2 friends like this") to support the recommendations. We present a study of the effects of these social explanations in a music recommendation context. We start with an experiment with 237 users, in which we show explanations with varying levels of social information and analyze their effect on users' decisions. We distinguish between two key decisions: the likelihood of checking out the recommended artist, and the actual rating of the artist based on listening to several songs. We find that while the explanations do have some influence on the likelihood, there is little correlation between the likelihood and actual (listening) rating for the same artist. Based on these insights, we present a generative probabilistic model that explains the interplay between explanations and background information on music preferences, and how that leads to a final likelihood rating for an artist. Acknowledging the impact of explanations, we discuss a general recommendation framework that models external informational elements in the recommendation interface, in addition to inherent preferences of users. | |||
| Question answering on interlinked data | | BIBA | Full-Text | 1145-1156 | |
| Saeedeh Shekarpour; Axel-Cyrille Ngonga Ngomo; Sören Auer | |||
| The Data Web contains a wealth of knowledge on a large number of domains. Question answering over interlinked data sources is challenging due to two inherent characteristics. First, different datasets employ heterogeneous schemas and each one may only contain a part of the answer for a certain question. Second, constructing a federated formal query across different datasets requires exploiting links between the different datasets on both the schema and instance levels. We present a question answering system, which transforms user-supplied queries (i.e. natural language sentences or keywords) into conjunctive SPARQL queries over a set of interlinked data sources. The contribution of this paper is two-fold: Firstly, we introduce a novel approach for determining the most suitable resources for a user-supplied query from different datasets (disambiguation). We employ a hidden Markov model, whose parameters were bootstrapped with different distribution functions. Secondly, we present a novel method for constructing federated formal queries using the disambiguated resources and leveraging the linking structure of the underlying datasets. This approach essentially relies on a combination of domain and range inference as well as a link traversal method for constructing a connected graph which ultimately renders a corresponding SPARQL query. The results of our evaluation with three life-science datasets and 25 benchmark queries demonstrate the effectiveness of our approach. | |||
| Pricing mechanisms for crowdsourcing markets | | BIBA | Full-Text | 1157-1166 | |
| Yaron Singer; Manas Mittal | |||
| Every day millions of crowdsourcing tasks are performed in exchange for
payments. Despite the important role pricing plays in crowdsourcing campaigns
and the complexity of the market, most platforms do not provide requesters
appropriate tools for effective pricing and allocation of tasks.
In this paper, we introduce a framework for designing mechanisms with provable guarantees in crowdsourcing markets. The framework enables automating the process of pricing and allocation of tasks for requesters in complex markets like Amazon's Mechanical Turk where workers arrive in an online fashion and requesters face budget constraints and task completion deadlines. We present constant-competitive incentive compatible mechanisms for maximizing the number of tasks under a budget, and for minimizing payments given a fixed number of tasks to complete. To demonstrate the effectiveness of this framework we created a platform that enables applying pricing mechanisms in markets like Mechanical Turk. The platform allows us to show that the mechanisms we present here work well in practice, as well as to give experimental evidence to workers' strategic behavior in absence of appropriate incentive schemes. | |||
| Truthful incentives in crowdsourcing tasks using regret minimization mechanisms | | BIBA | Full-Text | 1167-1178 | |
| Adish Singla; Andreas Krause | |||
| What price should be offered to a worker for a task in an online labor
market? How can one enable workers to express the amount they desire to receive
for the task completion? Designing optimal pricing policies and determining the
right monetary incentives is central to maximizing requester's utility and
workers' profits. Yet, current crowdsourcing platforms only offer a limited
capability to the requester in designing the pricing policies and often rules
of thumb are used to price tasks. This limitation could result in inefficient
use of the requester's budget or workers becoming disinterested in the task.
In this paper, we address these questions and present mechanisms using the approach of regret minimization in online learning. We exploit a link between procurement auctions and multi-armed bandits to design mechanisms that are budget feasible, achieve near-optimal utility for the requester, are incentive compatible (truthful) for workers and make minimal assumptions about the distribution of workers' true costs. Our main contribution is a novel, no-regret posted price mechanism, BP-UCB, for budgeted procurement in stochastic online settings. We prove strong theoretical guarantees about our mechanism, and extensively evaluate it in simulations as well as on real data from the Mechanical Turk platform. Compared to the state of the art, our approach leads to a 180% increase in utility. | |||
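To make the bandit view of pricing concrete, here is a toy posted-price simulation with a UCB-style index over a discrete price grid and a hard budget. It only illustrates the multi-armed-bandit connection described above; it is not the BP-UCB mechanism, and it carries none of its incentive or regret guarantees.

```python
import math
import random

def posted_price_ucb(prices, worker_costs, budget):
    """Post one price per arriving worker; the worker accepts iff the price
    covers her private cost. Prices are chosen by a UCB-style index on
    acceptances per unit price, and we stop when the budget runs out."""
    n, succ = [0] * len(prices), [0] * len(prices)
    spent, tasks, t = 0.0, 0, 0
    for cost in worker_costs:           # workers arrive online
        t += 1
        def index(i):
            if n[i] == 0:
                return float("inf")     # try every price at least once
            return succ[i] / n[i] / prices[i] + math.sqrt(2 * math.log(t) / n[i])
        i = max(range(len(prices)), key=index)
        if spent + prices[i] > budget:
            break
        n[i] += 1
        if prices[i] >= cost:           # worker accepts the posted price
            succ[i] += 1
            spent += prices[i]
            tasks += 1
    return tasks, spent

costs = [random.uniform(0.05, 0.5) for _ in range(2000)]
print(posted_price_ucb([0.1, 0.2, 0.3, 0.4], costs, budget=50.0))
```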
| A predictive model for advertiser value-per-click in sponsored search | | BIBA | Full-Text | 1179-1190 | |
| Eric Sodomka; Sébastien Lahaie; Dustin Hillard | |||
| Sponsored search is a form of online advertising where advertisers bid for placement next to search engine results for specific keywords. As search engines compete for the growing share of online ad spend, it becomes important for them to understand what keywords advertisers value most, and what characteristics of keywords drive value. In this paper we propose an approach to keyword value prediction that draws on advertiser bidding behavior across the terms and campaigns in an account. We provide original insights into the structure of sponsored search accounts that motivate the use of a hierarchical modeling strategy. We propose an economically meaningful loss function which allows us to implicitly fit a linear model for values given observables such as bids and click-through rates. The model draws on demographic and textual features of keywords and takes advantage of the hierarchical structure of sponsored search accounts. Its predictive quality is evaluated on several high-revenue and high-exposure advertising accounts on a major search engine. Besides the general evaluation of advertiser welfare, our approach has potential applications to keyword and bid suggestion. | |||
| I know the shortened URLs you clicked on Twitter: inference attack using public click analytics and Twitter metadata | | BIBA | Full-Text | 1191-1200 | |
| Jonghyuk Song; Sangho Lee; Jong Kim | |||
| Twitter is a popular social network service for sharing messages among friends. Because Twitter restricts the length of messages, many Twitter users use URL shortening services, such as bit.ly and goo.gl, to share long URLs with friends. Some URL shortening services also provide click analytics of the shortened URLs, including the number of clicks, countries, platforms, browsers and referrers. To protect visitors' privacy, they do not reveal identifying information about individual visitors. In this paper, we propose a practical attack technique that can infer who clicks what shortened URLs on Twitter. Unlike the conventional browser history stealing attacks, our attack methods only need publicly available information provided by URL shortening services and Twitter. Evaluation results show that our attack technique can compromise Twitter users' privacy with high accuracy. | |||
| Exploring and exploiting user search behavior on mobile and tablet devices to improve search relevance | | BIBA | Full-Text | 1201-1212 | |
| Yang Song; Hao Ma; Hongning Wang; Kuansan Wang | |||
| In this paper, we present a log-based study comparing user search behavior on three different platforms: desktop, mobile and tablet. We use three-month search logs in 2012 from a commercial search engine for our study. Our objective is to better understand how and to what extent mobile and tablet searchers behave differently from desktop users. Our study spans a variety of aspects including query categorization, query length, search time distribution, search location distribution, user click patterns and so on. From our data set, we reveal that there are significant differences between user search patterns on these three platforms, and therefore using the same ranking system is not an optimal solution for all of them. Consequently, we propose a framework that leverages a set of domain-specific features, along with the training data from desktop search, to further improve the search relevance for mobile and tablet platforms. Experimental results demonstrate that by transferring knowledge from desktop search, search relevance on mobile and tablet can be greatly improved. | |||
| Evaluating and predicting user engagement change with degraded search relevance | | BIBA | Full-Text | 1213-1224 | |
| Yang Song; Xiaolin Shi; Xin Fu | |||
| User engagement in search refers to how frequently users (re-)use the search engine to accomplish their tasks. Among the factors that affect users' visit frequency, relevance of search results is believed to play a pivotal role. While multiple studies in the past have demonstrated the correlation between search success and user engagement based on longitudinal analysis, we examine this problem from a different perspective in this work. Specifically, we carefully designed a large-scale controlled experiment on users of a large commercial Web search engine, in which users were separated into control and treatment groups, where users in the treatment group were presented with search results that were deliberately degraded in relevance. We studied users' responses to the relevance degradation through tracking several behavioral metrics (such as queries per user, clicks per session) over an extended period of time both during and following the experiment. By quantifying the relationship between user engagement and search relevance, we observe significant differences between users' short-term search behavior and long-term engagement change. By leveraging some of the key findings from the experiment, we developed a machine learning model to predict the long term impact of relevance degradation on user engagement. Overall, our model achieves over 67% accuracy in predicting user engagement drop. In addition, our model is capable of predicting engagement change for low-frequency users with very few user signals. We believe that insights from this study can be leveraged by search engine companies to detect and intervene in search relevance degradation and to prevent long term user engagement drop. | |||
| Data-Fu: a language and an interpreter for interaction with read/write linked data | | BIBA | Full-Text | 1225-1236 | |
| Steffen Stadtmüller; Sebastian Speiser; Andreas Harth; Rudi Studer | |||
| An increasing number of applications build their functionality on the utilisation and manipulation of web resources. Consequently, REST gains popularity with a resource-centric interaction architecture that draws its flexibility from links between resources. Linked Data offers a uniform data model for REST with self-descriptive resources that can be leveraged to avoid a manual ad-hoc development of web-based applications. For declaratively specifying interactions between web resources, we introduce Data-Fu, a lightweight declarative rule language with state transition systems as formal grounding. Data-Fu enables the development of data-driven applications that facilitate the RESTful manipulation of read/write Linked Data resources. Furthermore, we describe an interpreter for Data-Fu as a general purpose engine that performs the described interactions with web resources orders of magnitude faster than a comparable Linked Data processor. | |||
| NIFTY: a system for large scale information flow tracking and clustering | | BIBA | Full-Text | 1237-1248 | |
| Caroline Suen; Sandy Huang; Chantat Eksombatchai; Rok Sosic; Jure Leskovec | |||
| The real-time information on news sites, blogs and social networking sites
changes dynamically and spreads rapidly through the Web. Developing methods for
handling such information at a massive scale requires that we think about how
information content varies over time, how it is transmitted, and how it mutates
as it spreads.
We describe the News Information Flow Tracking, Yay! (NIFTY) system for large scale real-time tracking of "memes" -- short textual phrases that travel and mutate through the Web. NIFTY is based on a novel highly-scalable incremental meme-clustering algorithm that efficiently extracts and identifies mutational variants of a single meme. NIFTY runs orders of magnitude faster than our previous Memetracker system, while also maintaining better consistency and quality of extracted memes. We demonstrate the effectiveness of our approach by processing a 20 terabyte dataset of 6.1 billion blog posts and news articles that we have been continuously collecting for the last four years. NIFTY extracted 2.9 billion unique textual phrases and identified more than 9 million memes. Our meme-tracking algorithm was able to process the entire dataset in less than five days using a single machine. Furthermore, we also provide a live deployment of the NIFTY system that allows users to explore the dynamics of online news in near real-time. | |||
| When relevance is not enough: promoting diversity and freshness in personalized question recommendation | | BIBA | Full-Text | 1249-1260 | |
| Idan Szpektor; Yoelle Maarek; Dan Pelleg | |||
| What makes a good question recommendation system for community
question-answering sites? First, to maintain the health of the ecosystem, it
needs to be designed around answerers, rather than exclusively for askers.
Next, it needs to scale to many questions and users, and be fast enough to
route a newly-posted question to potential answerers within the few minutes
before the asker's patience runs out. It also needs to show each answerer
questions that are relevant to his or her interests. We have designed and built
such a system for Yahoo! Answers, but realized, when testing it with live
users, that it was not enough.
We found that those drawing-board requirements fail to capture user's interests. The feature that they really missed was diversity. In other words, showing them just the main topics they had previously expressed interest in was simply too dull. Adding the spice of topics slightly outside the core of their past activities significantly improved engagement. We conducted a large-scale online experiment in production in Yahoo! Answers that showed that recommendations driven by relevance alone perform worse than a control group without question recommendations, which is the current behavior. However, an algorithm promoting both diversity and freshness improved the number of answers by 17%, daily session length by 10%, and had a significant positive impact on peripheral activities such as voting. | |||
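A common way to trade relevance against diversity, shown here purely for illustration, is Maximal Marginal Relevance re-ranking; the production algorithm described above also promotes freshness and differs in its details.

```python
def mmr_rerank(candidates, relevance, similarity, k=10, lam=0.7):
    """Maximal Marginal Relevance: balance relevance to the answerer against
    similarity to questions already selected.

    relevance:  dict question -> relevance score for this answerer.
    similarity: function (q1, q2) -> similarity in [0, 1].
    lam:        weight on relevance vs. diversity (illustrative default).
    """
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr(q):
            redundancy = max((similarity(q, s) for s in selected), default=0.0)
            return lam * relevance[q] - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected
```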
| Mining acronym expansions and their meanings using query click log | | BIBA | Full-Text | 1261-1272 | |
| Bilyana Taneva; Tao Cheng; Kaushik Chakrabarti; Yeye He | |||
| Acronyms are abbreviations formed from the initial components of words or phrases. Acronym usage is becoming more common in web searches, email, text messages, tweets, blogs and posts. Acronyms are typically ambiguous and often disambiguated by context words. Given either just an acronym as a query or an acronym with a few context words, it is immensely useful for a search engine to know the most likely intended meanings, ranked by their likelihood. To support such online scenarios, we study the offline mining of acronyms and their meanings in this paper. For each acronym, our goal is to discover all distinct meanings and for each meaning, compute the expanded string, its popularity score and a set of context words that indicate this meaning. Existing approaches are inadequate for this purpose. Our main insight is to leverage "co-clicks" in search engine query click log to mine expansions of acronyms. There are several technical challenges such as ensuring 1:1 mapping between expansions and meanings, handling of "tail meanings" and extracting context words. We present a novel, end-to-end solution that addresses the above challenges. We further describe how web search engines can leverage the mined information for prediction of intended meaning for queries containing acronyms. Our experiments show that our approach (i) discovers the meanings of acronyms with high precision and recall, (ii) significantly complements existing meanings in Wikipedia and (iii) accurately predicts intended meaning for online queries with over 90% precision. | |||
| Groundhog day: near-duplicate detection on Twitter | | BIBA | Full-Text | 1273-1284 | |
| Ke Tao; Fabian Abel; Claudia Hauff; Geert-Jan Houben; Ujwal Gadiraju | |||
| With more than 340 million messages that are posted on Twitter every day, the amount of duplicate content as well as the demand for appropriate duplicate detection mechanisms is increasing tremendously. Yet there exists little research that aims at detecting near-duplicate content on microblogging platforms. We investigate the problem of near-duplicate detection on Twitter and introduce a framework that analyzes the tweets by comparing (i) syntactical characteristics, (ii) semantic similarity, and (iii) contextual information. Our framework provides different duplicate detection strategies that, among others, make use of external Web resources which are referenced from microposts. Machine learning is exploited in order to learn patterns that help identify duplicate content. We put our duplicate detection framework into practice by integrating it into Twinder, a search engine for Twitter streams. An in-depth analysis shows that it allows Twinder to diversify search results and improve the quality of Twitter search. We conduct extensive experiments in which we (1) evaluate the quality of different strategies for detecting duplicates, (2) analyze the impact of various features on duplicate detection, (3) investigate the quality of strategies that classify to what exact level two microposts can be considered as duplicates and (4) optimize the process of identifying duplicate content on Twitter. Our results prove that semantic features which are extracted by our framework can boost the performance of detecting duplicates. | |||
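The syntactic layer of such a framework can be sketched with word-shingle Jaccard similarity over normalized tweets (URLs and mentions stripped); the semantic and contextual features, and the learned combination, are beyond this toy example.

```python
import re

def shingles(text, n=3):
    """Word n-grams of a normalized tweet (URLs and @mentions removed)."""
    tokens = re.sub(r"https?://\S+|@\w+", "", text.lower()).split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def jaccard(a, b):
    s1, s2 = shingles(a), shingles(b)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

t1 = "Breaking: major quake hits the coast http://t.co/abc via @newsbot"
t2 = "Breaking: major quake hits the coast - more soon"
print(jaccard(t1, t2))   # flag as near-duplicate above a chosen threshold
```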
| Uncovering locally characterizing regions within geotagged data | | BIBA | Full-Text | 1285-1296 | |
| Bart Thomee; Adam Rae | |||
| We propose a novel algorithm for uncovering the colloquial boundaries of locally characterizing regions present in collections of labeled geospatial data. We address the problem by first modeling the data using scale-space theory, allowing us to represent it simultaneously across different scales as a family of increasingly smoothed density distributions. We then derive region boundaries by applying localized label weighting and image processing techniques to the scale-space representation of each label. Important insights into the data can be acquired by visualizing the shape and size of the resulting boundaries for each label at multiple scales. We demonstrate our technique operating at scale by discovering the boundaries of the most geospatially salient tags associated with a large collection of georeferenced photos from Flickr and compare our characterizing regions that emerge from the data with those produced by a recent technique from the research literature. | |||
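A minimal sketch of the scale-space step, assuming SciPy is available: rasterize one label's geotagged points into a 2-D histogram and smooth it with Gaussians of increasing width. Boundary extraction and the paper's localized label weighting are omitted, and the sample points below are synthetic.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def label_scale_space(points, bins=200, sigmas=(1, 2, 4, 8)):
    """Family of increasingly smoothed density grids for one label.

    points: array of shape (n, 2) holding (lon, lat) pairs.
    """
    H, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=bins)
    return {s: gaussian_filter(H, sigma=s) for s in sigmas}

rng = np.random.default_rng(0)
pts = rng.normal(loc=[-0.12, 51.5], scale=[0.05, 0.03], size=(1000, 2))
densities = label_scale_space(pts)
print({s: float(d.max()) for s, d in densities.items()})
```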
| Spectral analysis of communication networks using Dirichlet eigenvalues | | BIBA | Full-Text | 1297-1306 | |
| Alexander Tsiatas; Iraj Saniee; Onuttom Narayan; Matthew Andrews | |||
| Good clustering can provide critical insight into potential locations where congestion in a network may occur. A natural measure of congestion for a collection of nodes in a graph is its Cheeger ratio, defined as the ratio of the size of its boundary to its volume. Spectral methods provide effective means to estimate the smallest Cheeger ratio via the spectral gap of the graph Laplacian. Here, we compute the spectral gap of the truncated graph Laplacian, with the so-called Dirichlet boundary condition, for the graphs of a dozen communication networks at the IP-layer, which are subgraphs of the much larger global IP-layer network. We show that i) the Dirichlet spectral gap of these networks is substantially larger than the standard spectral gap and is therefore a better indicator of the true expansion properties of the graph, ii) unlike the standard spectral gap, the Dirichlet spectral gaps of progressively larger subgraphs converge to that of the global network, thus allowing properties of the global network to be efficiently obtained from them, and (iii) the (first two) eigenvectors of the Dirichlet graph Laplacian can be used for spectral clustering with arguably better results than standard spectral clustering. We first demonstrate these results analytically for finite regular trees. We then perform spectral clustering on the IP-layer networks using Dirichlet eigenvectors and show that it yields cuts near the network core, thus creating genuine single-component clusters. This is much better than traditional spectral clustering where several disjoint fragments near the network periphery are liable to be misleadingly classified as a single cluster. Since congestion in communication networks is known to peak at the core due to large-scale curvature and geometry, identification of core congestion and its localization are important steps in analysis and improved engineering of networks. Thus, spectral clustering with Dirichlet boundary condition is seen to be more effective at finding bona-fide bottlenecks and congestion than standard spectral clustering. | |||
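The two quantities this work builds on are easy to compute for small graphs: the Cheeger ratio of a node set and the spectral gap of a graph Laplacian. The sketch below uses the standard normalized Laplacian; the Dirichlet variant studied in the paper instead truncates the Laplacian to an interior node set before taking eigenvalues.

```python
import numpy as np
import networkx as nx

def cheeger_ratio(G, S):
    """Size of the edge boundary of S divided by the volume (degree sum) of S."""
    S = set(S)
    boundary = sum(1 for u, v in G.edges() if (u in S) != (v in S))
    volume = sum(d for n, d in G.degree() if n in S)
    return boundary / volume if volume else float("inf")

def spectral_gap(G):
    """Second-smallest eigenvalue of the normalized graph Laplacian."""
    L = nx.normalized_laplacian_matrix(G).toarray()
    return float(np.sort(np.linalg.eigvalsh(L))[1])

G = nx.karate_club_graph()
print(cheeger_ratio(G, range(17)), spectral_gap(G))
```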
| Subgraph frequencies: mapping the empirical and extremal geography of large graph collections | | BIBA | Full-Text | 1307-1318 | |
| Johan Ugander; Lars Backstrom; Jon Kleinberg | |||
| A growing set of on-line applications are generating data that can be viewed
as very large collections of small, dense social graphs -- these range from
sets of social groups, events, or collaboration projects to the vast collection
of graph neighborhoods in large social networks. A natural question is how to
usefully define a domain-independent 'coordinate system' for such a collection
of graphs, so that the set of possible structures can be compactly represented
and understood within a common space. In this work, we draw on the theory of
graph homomorphisms to formulate and analyze such a representation, based on
computing the frequencies of small induced subgraphs within each graph. We find
that the space of subgraph frequencies is governed both by its combinatorial
properties -- based on extremal results that constrain all graphs -- as well as
by its empirical properties -- manifested in the way that real social graphs
appear to lie near a simple one-dimensional curve through this space.
We develop flexible frameworks for studying each of these aspects. For capturing empirical properties, we characterize a simple stochastic generative model, a single-parameter extension of Erdos-Renyi random graphs, whose stationary distribution over subgraphs closely tracks the one-dimensional concentration of the real social graph families. For the extremal properties, we develop a tractable linear program for bounding the feasible space of subgraph frequencies by harnessing a toolkit of known extremal graph theory. Together, these two complementary frameworks shed light on a fundamental question pertaining to social graphs: what properties of social graphs are 'social' properties and what properties are 'graph' properties? We conclude with a brief demonstration of how the coordinate system we examine can also be used to perform classification tasks, distinguishing between structures arising from different types of social graphs. | |||
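The "coordinate system" can be illustrated at its smallest scale by the frequencies of the four induced 3-node subgraphs (0, 1, 2 or 3 edges) within a graph or neighborhood; the paper works with larger subgraph sizes and with extremal bounds, which this sketch does not cover.

```python
from itertools import combinations
from collections import Counter
import networkx as nx

def triad_frequencies(G, nodes=None):
    """Empirical frequencies of the four undirected 3-node induced subgraphs,
    indexed by their edge count (0 through 3)."""
    nodes = list(nodes if nodes is not None else G.nodes())
    counts = Counter()
    for trio in combinations(nodes, 3):
        counts[G.subgraph(trio).number_of_edges()] += 1
    total = sum(counts.values())
    return {edges: counts[edges] / total for edges in range(4)}

G = nx.les_miserables_graph()   # a small, dense-ish social graph for illustration
print(triad_frequencies(G, list(G.nodes())[:30]))
```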
| The self-feeding process: a unifying model for communication dynamics in the web | | BIBA | Full-Text | 1319-1330 | |
| Pedro Olmo S. Vaz de Melo; Christos Faloutsos; Renato Assunção; Antonio Loureiro | |||
| How often do individuals perform a given communication activity in the Web, such as posting comments on blogs or news? Could we have a generative model to create communication events with realistic inter-event time distributions (IEDs)? Which properties should we strive to match? Current literature has seemingly contradictory results for IEDs: some studies claim good fits with power laws; others with non-homogeneous Poisson processes. Given these two approaches, we ask: which is the correct one? Can we reconcile them all? We show here that, surprisingly, both approaches are correct, being corner cases of the proposed Self-Feeding Process (SFP). We show that the SFP (a) exhibits a unifying power, which generates power law tails (including the so-called "top-concavity" that real data exhibits), as well as short-term Poisson behavior; (b) avoids the "i.i.d. fallacy", which none of the prevailing models have studied before; and (c) is extremely parsimonious, requiring usually only one, and in general, at most two parameters. Experiments conducted on eight large, diverse real datasets (e.g., YouTube and blog comments, e-mails, SMSs, etc.) reveal that the SFP mimics their properties very well. | |||
| Whom to mention: expand the diffusion of tweets by @ recommendation on micro-blogging systems | | BIBA | Full-Text | 1331-1340 | |
| Beidou Wang; Can Wang; Jiajun Bu; Chun Chen; Wei Vivian Zhang; Deng Cai; Xiaofei He | |||
| Nowadays, micro-blogging systems like Twitter have become one of the most important ways for information sharing. In Twitter, a user posts a message (tweet) and the others can forward the message (retweet). Mention is a new feature in micro-blogging systems. Users mentioned in a tweet receive notifications, and their possible retweets may help to initiate large cascade diffusion of the tweet. To enhance a tweet's diffusion by finding the right persons to mention, we propose in this paper a novel recommendation scheme named whom-to-mention. Specifically, we present an in-depth study of the mention mechanism and propose a recommendation scheme to solve the essential question of whom to mention in a tweet. In this paper, whom-to-mention is formulated as a ranking problem and we try to address several new challenges which are not well studied in traditional information retrieval tasks. By adopting features including user interest match, content-dependent user relationship and user influence, a machine-learned ranking function is trained based on a newly defined information-diffusion-based relevance. The extensive evaluation using data gathered from real users demonstrates the advantage of our proposed algorithm compared with traditional recommendation methods. | |||
| Wisdom in the social crowd: an analysis of quora | | BIBA | Full-Text | 1341-1352 | |
| Gang Wang; Konark Gill; Manish Mohanlal; Haitao Zheng; Ben Y. Zhao | |||
| Efforts such as Wikipedia have shown the ability of user communities to
collect, organize and curate information on the Internet. Recently, a number of
question and answer (Q&A) sites have successfully built large growing
knowledge repositories, each driven by a wide range of questions and answers
from its user community. While sites like Yahoo Answers have stalled and begun
to shrink, one site still going strong is Quora, a rapidly growing service that
augments a regular Q&A system with social links between users. Despite its
success, however, little is known about what drives Quora's growth, and how it
continues to connect visitors and experts to the right questions as it grows.
| In this paper, we present results of a detailed measurement-based analysis of Quora. We shed light on the impact of three different connection networks (or graphs) inside Quora: a graph connecting topics to users, a social graph connecting users, and a graph connecting related questions. Our results show that heterogeneity in the user and question graphs is a significant contributor to the quality of Quora's knowledge base. One drives the attention and activity of users, and the other directs them to a small set of popular and interesting questions. | |||
| Learning to extract cross-session search tasks | | BIBA | Full-Text | 1353-1364 | |
| Hongning Wang; Yang Song; Ming-Wei Chang; Xiaodong He; Ryen W. White; Wei Chu | |||
| Search tasks, comprising a series of search queries serving the same information need, have recently been recognized as an accurate atomic unit for modeling user search intent. Most prior research in this area has focused on short-term search tasks within a single search session, and heavily depends on human annotations for supervised classification model learning. In this work, we target the identification of long-term, or cross-session, search tasks (transcending session boundaries) by investigating inter-query dependencies learned from users' searching behaviors. A semi-supervised clustering model is proposed based on the latent structural SVM framework, and a set of effective automatic annotation rules is proposed as weak supervision to relieve the burden of manual annotation. Experimental results based on a large-scale search log collected from Bing.com confirm the effectiveness of the proposed model in identifying cross-session search tasks and the utility of the introduced weak supervision signals. Our learned model enables a more comprehensive understanding of users' search behaviors via search logs and facilitates the development of dedicated search-engine support for long-term tasks. | |||
| Content-aware click modeling | | BIBA | Full-Text | 1365-1376 | |
| Hongning Wang; ChengXiang Zhai; Anlei Dong; Yi Chang | |||
| Click models aim at extracting intrinsic relevance of documents to queries
from biased user clicks. One basic modeling assumption made in existing work is
to treat such intrinsic relevance as an atomic query-document-specific
parameter, which is solely estimated from historical clicks without using any
content information about a document or relationship among the clicked/skipped
documents under the same query. Due to this overly simplified assumption,
existing click models can neither fully explore the information about a
document's relevance quality nor make predictions of relevance for any unseen
documents.
| In this work, we propose a novel Bayesian Sequential State model for modeling user click behaviors, where the document content and dependencies among the sequential click events within a query are characterized by a set of descriptive features via a probabilistic graphical model. By applying the posterior regularized Expectation Maximization algorithm for parameter learning, we tailor the model to meet specific ranking-oriented properties, e.g., pairwise click preferences, so as to exploit richer information buried in the user clicks. Experimental results on a large set of real click logs demonstrate the effectiveness of the proposed model compared with several state-of-the-art click models. | |||
| Is it time for a career switch? | | BIBA | Full-Text | 1377-1388 | |
| Jian Wang; Yi Zhang; Christian Posse; Anmol Bhasin | |||
| Tenure is a critical factor for an individual to consider when making a job
transition. For instance, software engineers make a job transition to senior
software engineers in a span of 2 years on average, or it takes approximately
3 years for realtors to switch to brokers. While most existing
work on recommender systems focuses on finding what to recommend to a user,
this paper places emphasis on when to make appropriate recommendations and its
impact on the item selection in the context of a job recommender system. The
approach we propose, however, is general and can be applied to any
recommendation scenario where the decision-making process is dependent on the
tenure (i.e., the time interval) between successive decisions.
Our approach is inspired by the proportional hazards model in statistics. It models the tenure between two successive decisions and related factors. We further extend the model with a hierarchical Bayesian framework to address the problem of data sparsity. The proposed model estimates the likelihood of a user's decision to make a job transition at a certain time, which is denoted as the tenure-based decision probability. New and appropriate evaluation metrics are designed to analyze the model's performance on deciding when is the right time to recommend a job to a user. We validate the soundness of our approach by evaluating it with an anonymous job application dataset across 140+ industries on LinkedIn. Experimental results show that the hierarchical proportional hazards model has better predictability of the user's decision time, which in turn helps the recommender system to achieve higher utility/user satisfaction. | |||
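To make the proportional hazards idea behind this abstract concrete, a generic (non-hierarchical) sketch of the quantity being modeled, the probability that a user has made a transition by time t given covariates, is shown below. The baseline hazard, coefficients, and covariates are all made-up illustrations, not the paper's fitted model.

```python
# Illustrative sketch of a discrete-time proportional hazards calculation:
# h(t | x) = h0(t) * exp(beta . x), S(t) = prod_{s <= t} (1 - h(s | x)).
# All numbers below are hypothetical, not values from the paper.
import numpy as np

h0 = np.array([0.02, 0.03, 0.05, 0.08, 0.10, 0.12])   # baseline hazard per month (made up)
beta = np.array([0.8, -0.3])                           # covariate coefficients (made up)
x = np.array([1.0, 0.5])                               # one user's covariates (made up)

hazard = np.clip(h0 * np.exp(beta @ x), 0.0, 1.0)      # per-period transition hazard
survival = np.cumprod(1.0 - hazard)                    # P(no transition up to t)
decision_prob = 1.0 - survival                         # tenure-based decision probability

for t, p in enumerate(decision_prob, start=1):
    print(f"month {t}: P(job transition by now) = {p:.3f}")
```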
| Google+Ripples: a native visualization of information flow | | BIBA | Full-Text | 1389-1398 | |
| Fernanda Viégas; Martin Wattenberg; Jack Hebert; Geoffrey Borggaard; Alison Cichowlas; Jonathan Feinberg; Jon Orwant; Christopher Wren | |||
| G+ Ripples is a visualization of information flow that shows users how public posts are shared on Google+. Unlike other social network visualizations, Ripples exists as a "native" visualization: it is directly accessible from public posts on Google+. This unique position leads to both new constraints and new possibilities for design. We describe the visualization technique, which is a new mix of node-and-link and circular treemap metaphors. We then describe user reactions as well as some of the patterns of sharing that are made evident by the Ripples visualization. | |||
| From cookies to cooks: insights on dietary patterns via analysis of web usage logs | | BIBA | Full-Text | 1399-1410 | |
| Robert West; Ryen W. White; Eric Horvitz | |||
| Nutrition is a key factor in people's overall health. Hence, understanding the nature and dynamics of population-wide dietary preferences over time and space can be valuable in public health. To date, studies have leveraged small samples of participants via food intake logs or treatment data. We propose a complementary source of population data on nutrition obtained via Web logs. Our main contribution is a spatiotemporal analysis of population-wide dietary preferences through the lens of logs gathered by a widely distributed Web-browser add-on, using the access volume of recipes that users seek via search as a proxy for actual food consumption. We discover that variation in dietary preferences as expressed via recipe access has two main periodic components, one yearly and the other weekly, and that there exist characteristic regional differences in terms of diet within the United States. In a second study, we identify users who show evidence of having made an acute decision to lose weight. We characterize the shifts in interests that they express in their search queries and focus on changes in their recipe queries in particular. Last, we correlate nutritional time series obtained from recipe queries with time-aligned data on hospital admissions, aimed at understanding how behavioral data captured in Web logs might be harnessed to identify potential relationships between diet and acute health problems. In this preliminary study, we focus on patterns of sodium identified in recipes over time and patterns of admission for congestive heart failure, a chronic illness that can be exacerbated by increases in sodium intake. | |||
| Enhancing personalized search by mining and modeling task behavior | | BIBA | Full-Text | 1411-1420 | |
| Ryen W. White; Wei Chu; Ahmed Hassan; Xiaodong He; Yang Song; Hongning Wang | |||
| Personalized search systems tailor search results to the current user intent using historic search interactions. This relies on being able to find pertinent information in that user's search history, which can be challenging for unseen queries and for new search scenarios. Building richer models of users' current and historic search tasks can help improve the likelihood of finding relevant content and enhance the relevance and coverage of personalization methods. The task-based approach can be applied to the current user's search history, or as we focus on here, all users' search histories as so-called "groupization" (a variant of personalization whereby other users' profiles can be used to personalize the search experience). We describe a method whereby we mine historic search-engine logs to find other users performing similar tasks to the current user and leverage their on-task behavior to identify Web pages to promote in the current ranking. We investigate the effectiveness of this approach versus query-based matching and finding related historic activity from the current user (i.e., group versus individual). As part of our studies we also explore the use of the on-task behavior of particular user cohorts, such as people who are expert in the topic currently being searched, rather than all other users. Our approach yields promising gains in retrieval performance, and has direct implications for improving personalization in search systems. | |||
| Inferring dependency constraints on parameters for web services | | BIBA | Full-Text | 1421-1432 | |
| Qian Wu; Ling Wu; Guangtai Liang; Qianxiang Wang; Tao Xie; Hong Mei | |||
| Recently many popular websites such as Twitter and Flickr expose their data through web service APIs, enabling third-party organizations to develop client applications that provide functionalities beyond what the original websites offer. These client applications should follow certain constraints in order to correctly interact with the web services. One common type of such constraints is Dependency Constraints on Parameters. Given a web service operation O and its parameters Pi, Pj, these constraints describe the requirement on one parameter Pi that is dependent on the conditions of some other parameter(s) Pj. For example, when requesting the Twitter operation "GET statuses/user_timeline", a user_id parameter must be provided if a screen_name parameter is not provided. Violations of such constraints can cause fatal errors or incorrect results in the client applications. However, these constraints are often not formally specified and thus not available for automatic verification of client applications. To address this issue, we propose a novel approach, called INDICATOR, to automatically infer dependency constraints on parameters for web services, via a hybrid analysis of heterogeneous web service artifacts, including the service documentation, the service SDKs, and the web services themselves. To evaluate our approach, we applied INDICATOR to infer dependency constraints for four popular web services. The results showed that INDICATOR effectively infers constraints with an average precision of 94.4% and recall of 95.5%. | |||
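The constraint quoted in the abstract for Twitter's "GET statuses/user_timeline" (a user_id must be provided if a screen_name is not) is easy to express as a client-side runtime check. A minimal sketch of what a validator for such an inferred constraint could look like follows; the function and exception names are hypothetical, only the example constraint comes from the abstract.

```python
# Illustrative sketch: validating an inferred dependency constraint on request
# parameters before calling a web service operation. The constraint shown is
# the one quoted in the abstract; the validator itself is hypothetical.

class ConstraintViolation(Exception):
    pass

def check_user_timeline_params(params: dict) -> None:
    """Require user_id whenever screen_name is not provided."""
    if not params.get("screen_name") and not params.get("user_id"):
        raise ConstraintViolation(
            "GET statuses/user_timeline: user_id must be provided "
            "if screen_name is not provided"
        )

# Usage: the first call passes, the second violates the constraint.
check_user_timeline_params({"screen_name": "example_user", "count": 10})
try:
    check_user_timeline_params({"count": 10})
except ConstraintViolation as err:
    print(f"rejected request: {err}")
```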
| Predicting advertiser bidding behaviors in sponsored search by rationality modeling | | BIBA | Full-Text | 1433-1444 | |
| Haifeng Xu; Bin Gao; Diyi Yang; Tie-Yan Liu | |||
| We study how an advertiser changes his/her bid prices in sponsored search, by modeling his/her rationality. Predicting the bid changes of advertisers with respect to their campaign performances is a key capability of search engines, since it can be used to improve the offline evaluation of new advertising technologies and the forecast of future revenue of the search engine. Previous work on advertiser behavior modeling heavily relies on the assumption of perfect advertiser rationality; however, in most cases, this assumption does not hold in practice. Advertisers may be unwilling, incapable, and/or constrained to achieve their best response. In this paper, we explicitly model these limitations in the rationality of advertisers, and build a probabilistic advertiser behavior model from the perspective of a search engine. We then use the expected payoff to define the objective function for an advertiser to optimize given his/her limited rationality. By solving the optimization problem with Monte Carlo, we get a prediction of mixed bid strategy for each advertiser in the next period of time. We examine the effectiveness of our model both directly using real historical bids and indirectly using revenue prediction and click number prediction. Our experimental results based on the sponsored search logs from a commercial search engine show that the proposed model can provide a more accurate prediction of advertiser bid behaviors than several baseline methods. | |||
| A biterm topic model for short texts | | BIBA | Full-Text | 1445-1456 | |
| Xiaohui Yan; Jiafeng Guo; Yanyan Lan; Xueqi Cheng | |||
| Uncovering the topics within short texts, such as tweets and instant messages, has become an important task for many content analysis applications. However, directly applying conventional topic models (e.g. LDA and PLSA) on such short texts may not work well. The fundamental reason is that conventional topic models implicitly capture the document-level word co-occurrence patterns to reveal topics, and thus suffer from the severe data sparsity in short documents. In this paper, we propose a novel way for modeling topics in short texts, referred to as the biterm topic model (BTM). Specifically, in BTM we learn the topics by directly modeling the generation of word co-occurrence patterns (i.e. biterms) in the whole corpus. The major advantages of BTM are that 1) BTM explicitly models the word co-occurrence patterns to enhance the topic learning; and 2) BTM uses the aggregated patterns in the whole corpus for learning topics to solve the problem of sparse word co-occurrence patterns at the document level. We carry out extensive experiments on real-world short text collections. The results demonstrate that our approach can discover more prominent and coherent topics, and significantly outperform baseline methods on several evaluation metrics. Furthermore, we find that BTM can outperform LDA even on normal texts, showing the potential generality and wider usage of the new topic model. | |||
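Since BTM learns from "biterms" (word co-occurrence patterns aggregated over the whole corpus), the extraction step the abstract alludes to can be sketched as follows, treating a biterm as an unordered pair of distinct words appearing in the same short text. The toy corpus is fabricated and topic inference itself is omitted.

```python
# Illustrative sketch of the biterm-extraction step behind BTM: every unordered
# pair of distinct words co-occurring in the same short text becomes a biterm,
# and biterms are aggregated over the whole corpus.
from collections import Counter
from itertools import combinations

corpus = [                                   # hypothetical short texts
    "topic models for short texts",
    "short texts are sparse",
    "sparse topic models",
]

biterms = Counter()
for doc in corpus:
    words = sorted(set(doc.lower().split()))
    for w1, w2 in combinations(words, 2):
        biterms[(w1, w2)] += 1

for pair, count in biterms.most_common(5):
    print(pair, count)
```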
| Unified entity search in social media community | | BIBA | Full-Text | 1457-1466 | |
| Ting Yao; Yuan Liu; Chong-Wah Ngo; Tao Mei | |||
| The search for entities is the most common search behavior on the Web, especially in social media communities where entities (such as images, videos, people, locations, and tags) are highly heterogeneous and correlated. While previous research usually deals with these social media entities separately, we are investigating in this paper a unified, multi-level, and correlative entity graph to represent the unstructured social media data, through which various applications (e.g., friend suggestion, personalized image search, image tagging, etc.) can be realized more effectively in one single framework. We regard the social media objects equally as "entities" and all of these applications as an "entity search" problem which searches for entities of different types. We first construct a multi-level graph which organizes the heterogeneous entities into multiple levels, with one type of entities as vertices in each level. The edges connect entities pairwise, weighted by intra-relations within the same level and by inter-links across two different levels distilled from social behaviors (e.g., tagging, commenting, and joining communities). To infer the strength of intra-relations, we propose a circular propagation scheme, which reinforces the mutual exchange of information across different entity types in a cyclic manner. Based on the constructed unified graph, we explicitly formulate entity search as a global optimization problem in a unified Bayesian framework, in which various applications are efficiently realized. Empirically, we validate the effectiveness of our unified entity graph for various social media applications on a million-scale real-world dataset. | |||
| MATRI: a multi-aspect and transitive trust inference model | | BIBA | Full-Text | 1467-1476 | |
| Yuan Yao; Hanghang Tong; Xifeng Yan; Feng Xu; Jian Lu | |||
| Trust inference, which is the mechanism to build new pair-wise
trustworthiness relationship based on the existing ones, is a fundamental
integral part in many real applications, e.g., e-commerce, social networks,
peer-to-peer networks, etc. State-of-the-art trust inference approaches mainly
employ the transitivity property of trust by propagating trust along connected
users (a.k.a. trust propagation), but largely ignore other important
properties, e.g., prior knowledge, multi-aspect, etc.
In this paper, we propose a multi-aspect trust inference model by exploring an equally important property of trust, i.e., the multi-aspect property. The heart of our method is to view the problem as a recommendation problem, and hence opens the door to the rich methodologies in the field of collaborative filtering. The proposed multi-aspect model directly characterizes multiple latent factors for each trustor and trustee from the locally-generated trust relationships. Moreover, we extend this model to incorporate the prior knowledge as well as trust propagation to further improve inference accuracy. We conduct extensive experimental evaluations on real data sets, which demonstrate that our method achieves significant improvement over several existing benchmark approaches. Overall, the proposed method (MaTrI) leads to 26.7% -- 40.7% improvement over its best known competitors in prediction accuracy; and up to 7 orders of magnitude speedup with linear scalability. | |||
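The core idea of treating trust inference as a collaborative-filtering-style factorization (latent factor vectors for each trustor and trustee, constrained by observed trust scores) can be illustrated with a bare-bones gradient sketch. This is a generic low-rank model on a fabricated trust matrix, not MATRI itself, which additionally incorporates prior knowledge and trust propagation.

```python
# Illustrative sketch: fit latent trustor/trustee factors to observed pairwise
# trust scores by stochastic gradient descent, the CF-style core that MATRI
# builds on. Observations, rank, and hyper-parameters are all made up.
import numpy as np

observed = {(0, 1): 0.9, (0, 2): 0.2, (1, 2): 0.8, (2, 0): 0.4}   # (trustor, trustee) -> trust
n_users, rank, lr, reg = 3, 2, 0.05, 0.01
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, rank))    # trustor factors
V = rng.normal(scale=0.1, size=(n_users, rank))    # trustee factors

for _ in range(2000):
    for (i, j), t in observed.items():
        err = t - U[i] @ V[j]
        Ui = U[i].copy()                           # keep old value for V's update
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * Ui - reg * V[j])

print("predicted trust (1 -> 0):", float(U[1] @ V[0]))   # an unobserved pair
```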
| Predicting positive and negative links in signed social networks by transfer learning | | BIBA | Full-Text | 1477-1488 | |
| Jihang Ye; Hong Cheng; Zhe Zhu; Minghua Chen | |||
| Different from a large body of research on social networks that has focused
almost exclusively on positive relationships, we study signed social networks
with both positive and negative links. Specifically, we focus on how to
reliably and effectively predict the signs of links in a newly formed signed
social network (called a target network). Since usually only a very small
amount of edge sign information is available in such newly formed networks,
this small quantity is not adequate to train a good classifier. To address this
challenge, we need assistance from an existing, mature signed network (called a
source network) which has abundant edge sign information. We adopt the transfer
learning approach to leverage the edge sign information from the source
network, which may have a different yet related joint distribution of the edge
instances and their class labels.
As there is no predefined feature vector for the edge instances in a signed network, we construct generalizable features that can transfer the topological knowledge from the source network to the target. With the extracted features, we adopt an AdaBoost-like transfer learning algorithm with instance weighting to utilize more useful training instances in the source network for model learning. Experimental results on three real large signed social networks demonstrate that our transfer learning algorithm can improve the prediction accuracy by 40% over baseline methods. | |||
| Sparse online topic models | | BIBA | Full-Text | 1489-1500 | |
| Aonan Zhang; Jun Zhu; Bo Zhang | |||
| Topic models have shown great promise in discovering latent semantic structures from complex data corpora, ranging from text documents and web news articles to images, videos, and even biological data. In order to deal with massive data collections and dynamic text streams, probabilistic online topic models such as online latent Dirichlet allocation (OLDA) have recently been developed. However, due to normalization constraints, OLDA can be ineffective in controlling the sparsity of discovered representations, a desirable property for learning interpretable semantic patterns, especially when the total number of topics is large. In contrast, sparse topical coding (STC) has been successfully introduced as a non-probabilistic topic model for effectively discovering sparse latent patterns by using sparsity-inducing regularization. But, unfortunately STC cannot scale to very large datasets or deal with online text streams, partly due to its batch learning procedure. In this paper, we present a sparse online topic model, which directly controls the sparsity of latent semantic patterns by imposing sparsity-inducing regularization and learns the topical dictionary by an online algorithm. The online algorithm is efficient and guaranteed to converge. Extensive empirical results of the sparse online topic model as well as its collapsed and supervised extensions on a large-scale Wikipedia dataset and the medium-sized 20Newsgroups dataset demonstrate appealing performance. | |||
| TopRec: domain-specific recommendation through community topic mining in social network | | BIBA | Full-Text | 1501-1510 | |
| Xi Zhang; Jian Cheng; Ting Yuan; Biao Niu; Hanqing Lu | |||
| Traditionally, Collaborative Filtering assumes that similar users have similar responses to similar items. However, human activities exhibit heterogeneous features across multiple domains, so that users with similar tastes in one domain may behave quite differently in other domains. Moreover, highly sparse data presents a crucial challenge in preference prediction. Intuitively, if users' domains of interest are captured first, the recommender system is more likely to provide enjoyable items while filtering out uninteresting ones. Therefore, it is necessary to learn preference profiles from the correlated domains instead of the entire user-item matrix. In this paper, we propose a unified framework, TopRec, which detects topical communities to construct interpretable domains for domain-specific collaborative filtering. In order to mine communities as well as the corresponding topics, a semi-supervised probabilistic topic model is utilized by integrating user guidance with the social network. Experimental results on real-world data from Epinions and Ciao demonstrate the effectiveness of the proposed framework. | |||
| Localized matrix factorization for recommendation based on matrix block diagonal forms | | BIBA | Full-Text | 1511-1520 | |
| Yongfeng Zhang; Min Zhang; Yiqun Liu; Shaoping Ma; Shi Feng | |||
| Matrix factorization on user-item rating matrices has achieved significant success in collaborative filtering based recommendation tasks. However, it also encounters the problems of data sparsity and scalability when applied in real-world recommender systems. In this paper, we present the Localized Matrix Factorization (LMF) framework, which attempts to meet the challenges of sparsity and scalability by factorizing Block Diagonal Form (BDF) matrices. In the LMF framework, a large sparse matrix is first transformed into Recursive Bordered Block Diagonal Form (RBBDF), which is an intuitively interpretable structure for user-item rating matrices. Smaller and denser submatrices are then extracted from this RBBDF matrix to construct a BDF matrix for more effective collaborative prediction. We show formally that the LMF framework is suitable for matrix factorization and that any decomposable matrix factorization algorithm can be integrated into this framework. It has the potential to improve prediction accuracy by factorizing smaller and denser submatrices independently, which is also suitable for parallelization and contributes to system scalability at the same time. Experimental results based on a number of real-world public-access benchmarks show the effectiveness and efficiency of the proposed LMF framework. | |||
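The LMF idea of factorizing the denser diagonal blocks of a block-diagonalized rating matrix independently (and in parallel) can be sketched generically. The block partition and ratings below are fabricated, and each per-block factorization is a plain truncated SVD rather than the paper's algorithm.

```python
# Illustrative sketch: factorize the diagonal blocks of a block-diagonal-form
# rating matrix independently, the idea behind LMF. Blocks, ratings, and rank
# are hypothetical; any decomposable factorization could replace the SVD.
import numpy as np

blocks = [                                      # denser submatrices along the diagonal
    np.array([[5., 4., 0.], [4., 0., 3.], [0., 5., 4.]]),
    np.array([[2., 0., 1.], [0., 3., 4.]]),
]
rank = 2

for k, B in enumerate(blocks):
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]          # low-rank approximation per block
    rmse = np.sqrt(np.mean((B - approx) ** 2))
    print(f"block {k}: shape {B.shape}, reconstruction RMSE = {rmse:.3f}")
```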
| Predicting purchase behaviors from social media | | BIBA | Full-Text | 1521-1532 | |
| Yongzheng Zhang; Marco Pennacchiotti | |||
| In the era of social commerce, users often connect from e-commerce websites to social networking venues such as Facebook and Twitter. However, there have been few efforts on understanding the correlations between users' social media profiles and their e-commerce behaviors. This paper presents a system for predicting a user's purchase behaviors on e-commerce websites from the user's social media profile. We specifically aim at understanding if the user's profile information in a social network (for example Facebook) can be leveraged to predict what categories of products the user will buy from (for example eBay Electronics). The paper provides an extensive analysis on how users' Facebook profile information correlates to purchases on eBay, and analyzes the performance of different feature sets and learning algorithms on the task of purchase behavior prediction. | |||
| Anatomy of a web-scale resale market: a data mining approach | | BIBA | Full-Text | 1533-1544 | |
| Yuchen Zhao; Neel Sundaresan; Zeqian Shen; Philip S. Yu | |||
| Reuse and remarketing of content and products is an integral part of the internet. As E-commerce has grown, online resale and secondary markets form a significant part of the commerce space. The intentions and methods for reselling are diverse. In this paper, we study an instance of such markets that affords interesting data at large scale for mining purposes to understand the properties and patterns of this online market. As part of knowledge discovery of such a market, we first formally propose criteria to reveal unseen resale behaviors by elastic matching identification (EMI) based on the account transfer and item similarity properties of transactions. Then, we present a large-scale system that leverages the MapReduce paradigm to mine millions of online resale activities from petabyte-scale heterogeneous e-commerce data. With the collected data, we show that the number of resale activities follows a power law distribution with a 'long tail', where a significant share of users only resell in very low numbers and a large portion of resales come from a small number of highly active resellers. We further conduct a comprehensive empirical study of different aspects of resales, including temporal and spatial patterns, user demographics, reputation and the content of sale postings. Based on these observations, we explore the features related to "successful" resale transactions and evaluate whether they are predictable. We also discuss uses of this mined information for business insights and user experience on a real-world online marketplace. | |||
| Questions about questions: an empirical analysis of information needs on Twitter | | BIBA | Full-Text | 1545-1556 | |
| Zhe Zhao; Qiaozhu Mei | |||
| Conventional studies of online information seeking behavior usually focus on
the use of search engines or question answering (Q&A) websites. Recently,
the fast growth of online social platforms such as Twitter and Facebook has
made it possible for people to utilize them for information seeking by asking
questions to their friends or followers. We anticipate a better understanding
of Web users' information needs by investigating research questions about these
questions. How are they distinctive from daily tweeted conversations? How are
they related to search queries? Can users' information needs on one platform
predict those on the other?
In this study, we take the initiative to extract and analyze information needs from billions of online conversations collected from Twitter. With an automatic text classifier, we can accurately detect real questions in tweets (i.e., tweets conveying real information needs). We then present a comprehensive analysis of the large-scale collection of information needs we extracted. We found that questions being asked on Twitter are substantially different from the topics being tweeted in general. Information needs detected on Twitter have a considerable power of predicting the trends of Google queries. Many interesting signals emerge through longitudinal analysis of the volume, spikes, and entropy of questions on Twitter, which provide insights to the understanding of the impact of real world events and user behavioral patterns in social platforms. | |||
| Which vertical search engines are relevant? | | BIBA | Full-Text | 1557-1568 | |
| Ke Zhou; Ronan Cummins; Mounia Lalmas; Joemon M. Jose | |||
| Aggregating search results from a variety of heterogeneous sources, so-called verticals, such as news, image and video, into a single interface is a popular paradigm in web search. Current approaches that evaluate the effectiveness of aggregated search systems are based on rewarding systems that return highly relevant verticals for a given query, where this relevance is assessed under different assumptions. It is difficult to evaluate or compare those systems without fully understanding the relationship between those underlying assumptions. To address this, we present a formal analysis and a set of extensive user studies to investigate the effects of various assumptions made for assessing query vertical relevance. A total of more than 20,000 assessments on 44 search tasks across 11 verticals are collected through Amazon Mechanical Turk and subsequently analysed. Our results provide insights into various aspects of query vertical relevance and allow us to explain in more depth, as well as to question, the evaluation results published in the literature. | |||
| Making the most of your triple store: query answering in OWL 2 using an RL reasoner | | BIBA | Full-Text | 1569-1580 | |
| Yujiao Zhou; Bernardo Cuenca Grau; Ian Horrocks; Zhe Wu; Jay Banerjee | |||
| Triple stores implementing the RL profile of OWL 2 are becoming increasingly popular. In contrast to unrestricted OWL 2, the RL profile is known to enjoy favourable computational properties for query answering, and state-of-the-art RL reasoners such as OWLim and Oracle's native inference engine of Oracle Spatial and Graph have proved extremely successful in industry-scale applications. The expressive restrictions imposed by OWL 2 RL may, however, be problematical for some applications. In this paper, we propose novel techniques that allow us (in many cases) to compute exact query answers using an off-the-shelf RL reasoner, even when the ontology is outside the RL profile. Furthermore, in the cases where exact query answers cannot be computed, we can still compute both lower and upper bounds on the exact answers. These bounds allow us to estimate the degree of incompleteness of the RL reasoner on the given query, and to optimise the computation of exact answers using a fully-fledged OWL 2 reasoner. A preliminary evaluation using the RDF Semantic Graph feature in Oracle Database has shown very promising results with respect to both scalability and tightness of the bounds. | |||
| Security implications of password discretization for click-based graphical passwords | | BIBA | Full-Text | 1581-1591 | |
| Bin B. Zhu; Dongchen Wei; Maowei Yang; Jeff Yan | |||
| Discretization is a standard technique used in click-based graphical passwords for tolerating input variance so that approximately correct passwords are accepted by the system. In this paper, we show for the first time that two representative discretization schemes leak a significant amount of password information, undermining the security of such graphical passwords. We exploit such information leakage for successful dictionary attacks on Persuasive Cued Click Points (PCCP), which is to date the most secure click-based graphical password scheme and was considered to be resistant to such attacks. In our experiments, our purely automated attack successfully guessed 69.2% of the passwords when Centered Discretization was used to implement PCCP, and 39.4% of the passwords when Robust Discretization was used. Each attack dictionary we used was of approximately 2^35 entries, whereas the full password space was of 2^43 entries. For Centered Discretization, our attack still successfully guessed 50% of the passwords when the dictionary size was reduced to approximately 2^30 entries. Our attack is also applicable to common implementations of other click-based graphical password systems such as PassPoints and Cued Click Points -- both have been extensively studied in the research communities. | |||
| The linked data platform (LDP) | | BIBA | Full-Text | 1-2 | |
| Arnaud J. Le Hors; Steve Speicher | |||
| As a result of the Linked Data Basic Profile submission, made by several
organizations including IBM, EMC, and Oracle, the W3C launched in June 2012 the
Linked Data Platform (LDP) Working Group (WG).
The LDP WG is chartered to produce a W3C Recommendation for HTTP-based (RESTful) application integration patterns using read/write Linked Data. This work will benefit both small-scale in-browser applications (WebApps) and large-scale Enterprise Application Integration (EAI) efforts. It will complement SPARQL and will be compatible with standards for publishing Linked Data, bringing the data integration features of RDF to RESTful, data-oriented software development. This presentation introduces developers to the Linked Data Platform, explains its origins in the Open Services Lifecycle Collaboration (OSLC) initiative, describes how it fits with other existing Semantic Web technologies and the problems developers will be able to address using LDP, based on use cases such as the integration challenge the industry faces in the Application Lifecycle Management (ALM) space. By attending this presentation developers will get an understanding of this upcoming W3C Recommendation, which is poised to become a major stepping stone in enabling broader adoption of Linked Data in the industry, not only for publishing data but also for integrating applications. | |||
| Quill: a collaborative design assistant for cross platform web application user interfaces | | BIBA | Full-Text | 3-6 | |
| Vivian Genaro Motti; Dave Raggett | |||
| Web application development teams face an increasing burden when they need to come up with a consistent user interface across different platforms with different characteristics, for example, desktop, smart phone and tablet devices. This is going to get even worse with the adoption of HTML5 on TVs and cars. This short paper describes a browser-based collaborative design assistant that does the drudge work of ensuring that the user interfaces are kept in sync across all of the target platforms and with changes to the domain data and task models. This is based upon an expert system that dynamically updates the user interface design to reflect the developer's decisions. This is implemented in terms of constraint propagation and search through the design space. An additional benefit is the ease of providing accessible user interfaces in conjunction with assistive technologies. | |||
| Linked services infrastructure: a single entry point for online media related to any linked data concept | | BIBA | Full-Text | 7-10 | |
| Lyndon Nixon | |||
| In this submission, we describe the Linked Services Infrastructure (LSI). It uses Semantic Web Service technology to map individual concepts (identified by Linked Data URIs) to sets of online media content aggregated from heterogeneous Web APIs. It exposes this mapping service in a RESTful API and returns RDF-based responses for further processing if desired. The LSI can be used as a general purpose tool for user agents to retrieve different online media resources to illustrate a concept to a user. | |||
| ResourceSync: leveraging sitemaps for resource synchronization | | BIBA | Full-Text | 11-14 | |
| Bernhard Haslhofer; Simeon Warner; Carl Lagoze; Martin Klein; Robert Sanderson; Michael L. Nelson; Herbert Van de Sompel | |||
| Many applications need up-to-date copies of collections of changing Web resources. Such synchronization is currently achieved using ad-hoc or proprietary solutions. We propose ResourceSync, a general Web resource synchronization protocol that leverages XML Sitemaps. It provides a set of capabilities that can be combined in a modular manner to meet local or community requirements. We report on work to implement this protocol for arXiv.org and also provide an experimental prototype for the English Wikipedia as well as a client API. | |||
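Because ResourceSync leverages XML Sitemaps, a destination ultimately consumes documents listing resources with their locations and modification times. A rough sketch of emitting such a Sitemap-style list with Python's standard library is shown below; the URLs and timestamps are placeholders, and any ResourceSync-specific extension elements are intentionally omitted.

```python
# Illustrative sketch: emit a plain XML Sitemap-style resource list, the base
# format that ResourceSync builds on. URLs and timestamps are placeholders;
# ResourceSync-specific capability elements are omitted.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", NS)
urlset = ET.Element(f"{{{NS}}}urlset")

resources = [
    ("http://example.org/resource1", "2013-05-01T12:00:00Z"),
    ("http://example.org/resource2", "2013-05-02T08:30:00Z"),
]
for loc, lastmod in resources:
    url = ET.SubElement(urlset, f"{{{NS}}}url")
    ET.SubElement(url, f"{{{NS}}}loc").text = loc
    ET.SubElement(url, f"{{{NS}}}lastmod").text = lastmod

print(ET.tostring(urlset, encoding="unicode"))
```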
| Static typing & JavaScript libraries: towards a more considerate relationship | | BIBA | Full-Text | 15-18 | |
| Benjamin Canou; Emmanuel Chailloux; Vincent Botbol | |||
| In this paper, after relating a short history of the mostly unhappy relationship between static typing and JavaScript (JS), we explain a new attempt at reconciling them that is more respectful of both worlds than other approaches. As an example, we present Onyo, an advanced binding of the Enyo JS library for the OCaml language. Onyo exploits the expressiveness of OCaml's type system to properly encode the structure of the library, preserving its design while statically checking that it is used correctly, and without introducing runtime overhead. | |||
| Client-server web applications widgets | | BIBA | Full-Text | 19-22 | |
| Vincent Balat | |||
| The evolution of the Web from a content platform into an application platform has raised many new issues for developers. One of the most significant is that we are now developing distributed applications, in the specific context of the underlying Web technologies. In particular, one should be able to compute some parts of the page on either the server or the client side, depending on the needs of developers, and preferably in the same language, with the same functions. This paper deals with the particular problem of user interface generation in this client-server setting. Many widget libraries for browsers are fully written in JavaScript and do not allow the interface to be generated on the server side, making it more difficult for search engines to index pages. We propose a solution that makes it possible to generate widgets on either the client side or the server side in a very flexible way. It is implemented in the Ocsigen framework. | |||
| Effective web scraping with OXPath | | BIBA | Full-Text | 23-26 | |
| Giovanni Grasso; Tim Furche; Christian Schallhart | |||
| Even in the third decade of the Web, scraping web sites remains a
challenging task: Most scraping programs are still developed as ad-hoc
solutions using a complex stack of languages and tools. Where comprehensive
extraction solutions exist, they are expensive, heavyweight, and proprietary.
OXPath is a minimalistic wrapping language that is nevertheless expressive and versatile enough for a wide range of scraping tasks. In this presentation, we want to introduce you to a new paradigm of scraping: declarative navigation -- instead of complex scripting or heavyweight, limited visual tools, OXPath turns scraping into a simple two step process: pick the relevant nodes through an XPath expression and then specify which action to apply to those nodes. OXPath takes care of browser synchronisation, page and state management, making scraping as easy as node selection with XPath. To achieve this, OXPath does not require a complex or heavyweight infrastructure. OXPath is an open source project and has seen first adoption in a wide variety of scraping tasks. | |||
| CSS browser selector plus: a JavaScript library to support cross-browser responsive design | | BIBA | Full-Text | 27-30 | |
| Richard Duchatsch Johansen; Talita Cristina Pagani Britto; Cesar Augusto Cusin | |||
| Developing websites for multiple devices has been a rough task for the past ten years. Device features -- such as screen size, resolution, internet access, operating system, etc. -- change frequently and new devices emerge every day. Since W3C introduced media queries in CSS3, it has been possible to develop tailored interfaces for multiple devices using a single HTML document. The Responsive Web Design approach uses media queries to build adaptive and flexible layouts; however, media queries are not supported in legacy browsers. In this paper, we present CSS Browser Selector Plus, a cross-browser alternative method that uses JavaScript to support CSS3 media queries for responsive web development on older browsers. | |||
| A meteoroid on steroids: ranking media items stemming from multiple social networks | | BIBA | Full-Text | 31-34 | |
| Thomas Steiner | |||
| We have developed an application called Social Media Illustrator that allows for finding media items on multiple social networks, clustering them by visual similarity, ranking them by different criteria, and finally arranging them in media galleries that were evaluated to be perceived as aesthetically pleasing. In this paper, we focus on the ranking aspect and show how, for a given set of media items, the most adequate ranking criterion combination can be found by interactively applying different criteria and seeing their effect on-the-fly. This leads us to an empirically optimized media item ranking formula, which takes social network interactions into account. While the ranking formula is not universally applicable, it can serve as a good starting point for an individually adapted formula, all within the context of Social Media Illustrator. A demo of the application is available publicly online at the URL http://social-media-illustrator.herokuapp.com/. | |||
| Creating 3rd generation web APIs with hydra | | BIBA | Full-Text | 35-38 | |
| Markus Lanthaler | |||
| In this paper we describe a novel approach to build hypermedia-driven Web APIs based on Linked Data technologies such as JSON-LD. We also present the result of implementing a first prototype featuring both a RESTful Web API and a generic API client. To the best of our knowledge, no comparable integrated system to develop Linked Data-based APIs exists. | |||
| Scaling matrix factorization for recommendation with randomness | | BIBA | Full-Text | 39-40 | |
| Lei Tang; Patrick Harrington | |||
| Recommendation is one of the core problems in eCommerce. In our application, different from conventional collaborative filtering, one user can engage in various types of activities in a sequence. Meanwhile, the number of users and items involved are quite huge, entailing scalable approaches. In this paper, we propose one simple approach to integrate multiple types of user actions for recommendation. A two-stage randomized matrix factorization is presented to handle large-scale collaborative filtering where alternating least squares or stochastic gradient descent is not viable. Empirical results show that the method is quite scalable, and is able to effectively capture correlations between different actions, thus making more relevant recommendations. | |||
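A generic randomized low-rank factorization (random projection to capture the approximate range, followed by an SVD of the small projected matrix) gives a feel for why randomness helps where alternating least squares or SGD do not scale. This is a textbook sketch on a synthetic matrix, not the authors' two-stage method.

```python
# Illustrative sketch of randomized low-rank matrix factorization: project onto
# a random subspace, orthonormalize, then SVD the small projection. The matrix
# and rank are synthetic; this is not the paper's two-stage scheme.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2000, 40)) @ rng.normal(size=(40, 500))   # synthetic low-rank matrix
rank, oversample = 40, 10

Omega = rng.normal(size=(A.shape[1], rank + oversample))       # random test matrix
Q, _ = np.linalg.qr(A @ Omega)                                 # approximate range of A
B = Q.T @ A                                                    # small projected matrix
Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
U = Q @ Ub[:, :rank]                                           # lift back to original space

rel_err = np.linalg.norm(A - (U * s[:rank]) @ Vt[:rank]) / np.linalg.norm(A)
print(f"relative approximation error: {rel_err:.2e}")
```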
| Link prediction in social networks based on hypergraph | | BIBA | Full-Text | 41-42 | |
| Dong Li; Zhiming Xu; Sheng Li; Xin Sun | |||
| In recent years, online social networks have undergone significant growth and attracted much attention. In these online social networks, link prediction is a critical task that not only offers insights into the factors behind the creation of individual social relationships but also plays an essential role in the growth of the whole network. In this paper, we propose a novel link prediction method based on hypergraphs. In contrast with conventional methods that use an ordinary graph, we model the social network as a hypergraph, which can fully capture all types of objects and both the pairwise and higher-order relations among these objects in the network. The link prediction task is then formulated as a ranking problem on this hypergraph. Experimental results on a Sina-Weibo dataset demonstrate the effectiveness of our method. | |||
| Inferring audience partisanship for YouTube videos | | BIBA | Full-Text | 43-44 | |
| Ingmar Weber; Venkata Rama Kiran Garimella; Erik Borra | |||
| Political campaigning and the corresponding advertisement money are increasingly moving online. Some analysts claim that the U.S. elections were partly won through a smart use of (i) targeted advertising and (ii) social media. But what type of information do politicized users consume online? And, the other way around, for a given content, e.g. a YouTube video, is it possible to predict its political audience? To address this latter question, we present a large scale study of anonymous YouTube video consumption of politicized users, where political orientation is derived from visits to "beacon pages", namely, political partisan blogs. Though our techniques are relevant for targeted political advertising, we believe that our findings are also of a wider interest. | |||
| Cross-region collaborative filtering for new point-of-interest recommendation | | BIBA | Full-Text | 45-46 | |
| Ning Zheng; Xiaoming Jin; Lianghao Li | |||
| With the rapid growth of location-based social networks (LBSNs), Point-of-Interest (POI) recommendation has been in increasingly high demand in recent years. In this paper, our aim is to recommend new POIs to a user in regions where he has rarely been before. Unlike classical memory-based recommendation algorithms, which use user rating data to compute similarity between users or items to make recommendations, we propose a cross-region collaborative filtering method that recommends new POIs based on hidden topics mined from user check-in records. Experimental results on a real-world LBSN dataset show that our method consistently outperforms the naive CF method. | |||
| Incorporating author preference in sentiment rating prediction of reviews | | BIBA | Full-Text | 47-48 | |
| Subhabrata Mukherjee; Gaurab Basu; Sachindra Joshi | |||
| Traditional works in sentiment analysis do not incorporate author preferences during sentiment classification of reviews. In this work, we show that the inclusion of author preferences in sentiment rating prediction of reviews improves the correlation with ground ratings, over a generic author independent rating prediction model. The overall sentiment rating prediction for a review has been shown to improve by capturing facet level rating. We show that this can be further developed by considering author preferences in predicting the facet level ratings, and hence the overall review rating. To the best of our knowledge, this is the first work to incorporate author preferences in rating prediction. | |||
| Board coherence in Pinterest: non-visual aspects of a visual site | | BIBA | Full-Text | 49-50 | |
| Krishna Y. Kamath; Ana-Maria Popescu; James Caverlee | |||
| Pinterest is a fast-growing interest network with significant user engagement and monetization potential. This paper explores quality signals for Pinterest boards, in particular the notion of board coherence. We find that coherence can be assessed with promising results and we explore its relation to quality signals based on social interaction. | |||
| Fragmented social media: a look into selective exposure to political news | | BIBA | Full-Text | 51-52 | |
| Jisun An; Daniele Quercia; Jon Crowcroft | |||
| The hypothesis of selective exposure assumes that people crave like-minded information and eschew information that conflicts with their beliefs, and that this has negative consequences on political life. Yet, despite decades of research, this hypothesis remains theoretically promising but empirically difficult to test. We look into news articles shared on Facebook and examine whether selective exposure exists or not in social media. We find concrete evidence of a tendency for users to predominantly share like-minded news articles and avoid conflicting ones, with partisans more likely to do so. Building tools to counter partisanship on social media would require the ability to identify partisan users first. We show that those users cannot be distinguished from the average user, as the two subgroups do not show any demographic difference. | |||
| Utility discounting explains informational website traffic patterns before a hurricane | | BIBA | Full-Text | 53-54 | |
| Ben Priest; Kevin Gold | |||
| We demonstrate that psychological models of utility discounting can explain the pattern of increased hits to weather websites in the days preceding a predicted weather disaster. We parsed the HTTP request lines issued by the web proxy for a mid-sized enterprise leading up to a hurricane, filtering for visits to weather-oriented websites. We fit four discounting models to the observed activity and found that our data matched hyperboloid models extending hyperbolic discounting. | |||
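As a generic illustration of the kind of model being fit here (not the authors' data or their exact hyperboloid variants), the standard hyperbolic discount curve V(D) = A / (1 + kD) can be fit to observed attention levels with a least-squares routine. The hit counts below are fabricated.

```python
# Illustrative sketch: fit a standard hyperbolic discounting curve
# V(D) = A / (1 + k * D) to (fabricated) weather-site hit levels observed
# D days before an event. The paper fits richer hyperboloid variants.
import numpy as np
from scipy.optimize import curve_fit

days_before = np.array([7, 6, 5, 4, 3, 2, 1, 0], dtype=float)
hits = np.array([12, 15, 19, 26, 38, 60, 105, 200], dtype=float)   # made-up counts

def hyperbolic(D, A, k):
    return A / (1.0 + k * D)

(A_hat, k_hat), _ = curve_fit(hyperbolic, days_before, hits, p0=(200.0, 1.0))
print(f"fitted amplitude A = {A_hat:.1f}, discount rate k = {k_hat:.3f}")
```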
| Political hashtag hijacking in the U.S | | BIBA | Full-Text | 55-56 | |
| Asmelash Teka Hadgu; Kiran Garimella; Ingmar Weber | |||
| We study the change in polarization of hashtags on Twitter over time and show that certain jumps in polarity are caused by "hijackers" engaged in a particular type of hashtag war. | |||
| Learning to annotate tweets with crowd wisdom | | BIBA | Full-Text | 57-58 | |
| Wei Feng; Jianyong Wang | |||
| In Twitter, users can annotate tweets with hashtags to indicate the ongoing topics. Hashtags provide users a convenient way to categorize tweets. However, two problems remain unsolved during an annotation: (1) Users have no way to know whether some related hashtags have already been created. (2) Users have their own way to categorize tweets. Thus personalization is needed. To address the above problems, we develop a statistical model for Personalized Hashtag Recommendation. With millions of "tweet, hashtag" pairs being generated everyday, we are able to learn the complex mappings from tweets to hashtags with the wisdom of the crowd. Our model considers rich auxiliary information like URLs, locations, social relation, temporal characteristics of hashtag adoption, etc. We show our model successfully outperforms existing methods on real datasets crawled from Twitter. | |||
| To follow or not to follow: a feature evaluation | | BIBA | Full-Text | 59-60 | |
| Yanan Zhu; Nazli Goharian | |||
| The features available in Twitter provide meaningful information that can be harvested to provide a ranked list of followees to each user. We hypothesize that retweet and mention features can be further enriched by incorporating both temporal and additional/indirect links from within a user's community. Our empirical results provide insights into the effectiveness of each feature, and evaluate our proposed similarity measures in ranking the followees. Utilizing temporal information and indirect links improves the effectiveness of retweet and mention features in terms of nDCG. | |||
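Since the evaluation is reported in nDCG, a minimal sketch of that metric (using one common graded-relevance variant with an exponential gain and log2 discount; the relevance labels are fabricated) is:

```python
# Illustrative sketch of nDCG@k with the common exponential-gain / log2-discount
# variant; the graded relevance labels for a ranked followee list are fabricated.
import math

def dcg(relevances):
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances, k):
    ideal = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal if ideal > 0 else 0.0

system_ranking = [3, 2, 0, 1, 2]          # relevance of recommended followees, in rank order
print(f"nDCG@5 = {ndcg(system_ranking, 5):.3f}")
```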
| Topical organization of user comments and application to content recommendation | | BIBA | Full-Text | 61-62 | |
| Vidit Jain; Esther Galbrun | |||
| On a news website, an article may receive thousands of comments from its readers on a variety of topics. The usual display of these comments in a ranked list, e.g. by popularity, does not allow the user to follow discussions on a particular topic. Organizing them by semantic topics enables the user not only to selectively browse comments on a topic, but also to discover other significant topics of discussion in comments. This topical organization further allows us to explicitly capture the immediate interests of the user even when she is not logged in. Here we use this information to recommend content that is relevant in the context of the comments being read by the user. We present an algorithm for building such a topical organization in a practical setting and study different recommendation schemes. In a pilot study, we observe these comments-to-article recommendations to be preferred over the standard article-to-article recommendations. | |||
| History-aware critiquing-based conversational recommendation | | BIBA | Full-Text | 63-64 | |
| Yasser Salem; Jun Hong | |||
| In this paper we present a new approach to critiquing-based conversational
recommendation, which we call History-Aware Critiquing (HAC). It takes a
case-based reasoning approach by reusing relevant recommendation sessions of
past users to short-cut the recommendation session of the current user. It
selects relevant recommendation sessions from a case base that contains the
successful recommendation sessions of past users. A past recommendation session
can be selected if it contains similar recommended items to the ones in the
current session and its critiques sufficiently overlap with the critiques so
far in the current session. HAC extends experience-based critiquing (EBC).
Our experimental results show that, in terms of recommendation efficiency, while EBC performs better than standard critiquing (STD), it does not perform as well as more recent techniques such as incremental critiquing (IC), whereas HAC achieves better recommendation efficiency over both STD and IC. | |||
| An effective general framework for localized content optimization | | BIBA | Full-Text | 65-66 | |
| Yoshiyuki Inagaki; Jiang Bian; Yi Chang | |||
| Local search services have been gaining interest from Web users who seek information near certain geographical locations. Particularly, those users usually want to find interesting information about what is happening nearby. In this poster, we introduce the localized content optimization problem to provide Web users with authoritative, attractive and fresh information that is really interesting to people around a certain location. To address this problem, we propose a general learning framework and develop a variety of features. Our evaluations based on the data set from a commercial localized Web service demonstrate that our framework is highly effective at providing contents that are more relevant to users' localized information needs. | |||
| Unfolding dynamics in a social network: co-evolution of link formation and user interaction | | BIBA | Full-Text | 67-68 | |
| Zhi Yang; Ji long Xue; Han Xiao Zhao; Xiao Wang; Ben Y. Zhao; Yafei Dai | |||
| Measurement studies of online social networks show that all social links are not equal, and the strength of each link is best characterized by the frequency of interactions between the linked users. To date, few studies have been able to examine detailed interaction data over time, and none have studied the problem of modeling user interactions. This paper proposes a generative model of social interactions that captures the inherently heterogeneous strengths of social links, thus having broad implications on the design of social network algorithms such as friend recommendation, information diffusion and viral marketing. | |||
| Mining emotions in short films: user comments or crowdsourcing? | | BIBA | Full-Text | 69-70 | |
| Claudia Orellana-Rodriguez; Ernesto Diaz-Aviles; Wolfgang Nejdl | |||
| Short films are regarded as an alternative form of artistic creation, and they express, in a few minutes, a whole gamut of different emotions aimed at impacting the audience and communicating a story. In this paper, we exploit a multi-modal sentiment analysis approach to extract emotions in short films, based on the film criticism expressed through social comments from the video-sharing platform YouTube. We go beyond traditional polarity detection (i.e., positive/negative), and extract, for each analyzed film, four opposing pairs of primary emotions: joy-sadness, anger-fear, trust-disgust, and anticipation-surprise. We found that YouTube comments are a valuable source of information for automatic emotion detection when compared to human analysis elicited via crowdsourcing. | |||
| Offering language based services on social media by identifying user's preferred language(s) from romanized text | | BIBA | Full-Text | 71-72 | |
| Mitesh M. Khapra; Salil Joshi; Ananthakrishnan Ramanathan; Karthik Visweswariah | |||
| With the increase of multilingual content and multilingual users on the web, it is prudent to offer personalized services and ads to users based on their language profile (i.e., the list of languages that a user is conversant with). Identifying the language profile of a user is often non-trivial because (i) users often do not specify all the languages known to them while signing up for an online service and (ii) users of many languages (especially Indian languages) largely use Latin/Roman script to write content in their native language. This makes it non-trivial for a machine to distinguish the language of one comment from another. This situation presents an opportunity for offering the following language-based services for romanized content: (i) hide romanized comments which belong to a language not known to the user, (ii) translate romanized comments which belong to a language not known to the user, (iii) transliterate romanized comments which belong to a language known to the user, and (iv) show language-based ads by identifying languages known to a user based on the romanized comments that he wrote/read/liked. We first use a simple bootstrapping-based semi-supervised algorithm to identify the language of a romanized comment. We then apply this algorithm to all the comments written/read/liked by a user to build a language profile of the user and propose that this profile can be used to offer the services mentioned above. | |||
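The abstract above describes a bootstrapping-based, semi-supervised language identifier for romanized comments. The sketch below illustrates that general idea under assumed seed vocabularies; the seed word lists, overlap scoring and frequency threshold are illustrative choices, not the paper's actual algorithm.

```python
# A minimal bootstrapping sketch for identifying the language of romanized
# comments and aggregating a per-user language profile (assumed details).
from collections import Counter, defaultdict

# Hypothetical seed vocabularies of distinctive romanized words per language.
SEEDS = {
    "hindi": {"hai", "nahi", "kya", "aur"},
    "tamil": {"enna", "illa", "romba", "vanakkam"},
}

def identify_language(comment, vocab):
    """Label a comment with the language whose vocabulary it overlaps most."""
    tokens = comment.lower().split()
    scores = {lang: sum(t in words for t in tokens) for lang, words in vocab.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def bootstrap(comments, n_rounds=3):
    """Iteratively label comments and grow each language's vocabulary."""
    vocab = {lang: set(words) for lang, words in SEEDS.items()}
    for _ in range(n_rounds):
        counts = defaultdict(Counter)
        for c in comments:
            lang = identify_language(c, vocab)
            if lang:
                counts[lang].update(c.lower().split())
        for lang, counter in counts.items():
            vocab[lang] |= {w for w, n in counter.items() if n >= 3}
    return vocab

def language_profile(user_comments, vocab):
    """Aggregate per-comment labels into a user's language profile."""
    labels = [identify_language(c, vocab) for c in user_comments]
    return Counter(l for l in labels if l)
```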
| Zero-cost labelling with web feeds for weblog data extraction | | BIBA | Full-Text | 73-74 | |
| George Gkotsis; Karen Stepanyan; Alexandra I. Cristea; Mike S. Joy | |||
| Data extraction from web pages often involves either human intervention for training a wrapper or a reduced level of granularity in the information acquired. Even though the study of social media has drawn the attention of researchers, weblogs remain a part of the web that cannot be harvested efficiently. In this paper, we propose a fully automated approach for generating a wrapper for weblogs, which exploits web feeds for cheap labelling of weblog properties. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. Our evaluation shows that our approach is robust, accurate and efficient in handling different types of weblogs. | |||
| On using inter-document relations in microblog retrieval | | BIBA | Full-Text | 75-76 | |
| Jesus A. Rodriguez Perez; Yashar Moshfeghi; Joemon M. Jose | |||
| Microblog ad-hoc retrieval has received much attention in recent years. As a result of the high vocabulary diversity of the publishing users, a mismatch is formed between the queries being formulated and the tweets representing the actual topics. In this work, we present a re-ranking approach relying on inter-document relations, which attempts to bridge this gap. Experiments with TREC's Microblog 2012 collection show that including such information in the retrieval process statistically significantly improves retrieval effectiveness in terms of Precision and MAP, when the baseline performs well as a starting point. | |||
| Towards focused knowledge extraction: query-based extraction of structured summaries | | BIB | Full-Text | 77-78 | |
| Besnik Fetahu; Bernardo Pereira Nunes; Stefan Dietze | |||
| Complexity and algorithms for composite retrieval | | BIB | Full-Text | 79-80 | |
| Sihem Amer-Yahia; Francesco Bonchi; Carlos Castillo; Esteban Feuerstein; Isabel Méndez-Díaz; Paula Zabala | |||
| RESLVE: leveraging user interest to improve entity disambiguation on short text | | BIBA | Full-Text | 81-82 | |
| Elizabeth L. Murnane; Bernhard Haslhofer; Carl Lagoze | |||
| We address the Named Entity Disambiguation (NED) problem for short, user-generated texts on the social Web. In such settings, the lack of linguistic features and sparse lexical context result in a high degree of ambiguity and sharp performance drops of nearly 50% in the accuracy of conventional NED systems. We handle these challenges by developing a general model of user-interest with respect to a personal knowledge context and instantiate it using Wikipedia. We conduct systematic evaluations using individuals' posts from Twitter, YouTube, and Flickr and demonstrate that our novel technique is able to achieve performance gains beyond state-of-the-art NED methods. | |||
| A hybrid approach for spotting, disambiguating and annotating places in user-generated text | | BIBA | Full-Text | 83-84 | |
| Karen Stepanyan; George Gkotsis; Vangelis Banos; Alexandra I. Cristea; Mike Joy | |||
| We introduce a geolocation-aware semantic annotation model that extends the existing solutions for spotting and disambiguation of places within user-generated texts. The implemented prototype processes the text of weblog posts and annotates the places and toponyms. It outperforms existing solutions by taking into consideration the embedded geolocation data. The evaluation of the model is based on a set of randomly selected 3,165 geolocation embedded weblog posts, obtained from 1,775 web feeds. The results demonstrate a high degree of accuracy in annotation (87.7%) and a considerable gain (27.8%) in identifying additional entities, and therefore support the adoption of the model for supplementing the existing solutions. | |||
| HIGGINS: knowledge acquisition meets the crowds | | BIBA | Full-Text | 85-86 | |
| Sarath Kumar Kondreddi; Peter Triantafillou; Gerhard Weikum | |||
| We present HIGGINS, a system for Knowledge Acquisition (KA), placing emphasis on its architecture. The distinguishing characteristic and novelty of HIGGINS lies in its blending of two engines: an automated Information Extraction (IE) engine, aided by semantic resources and statistics, and a game-based Human Computing (HC) engine. We focus on KA from web pages and text sources and, in particular, on deriving relationships between entities. As a running application we utilize movie narratives, from which we wish to derive relationships among movie characters. | |||
| AELA: an adaptive entity linking approach | | BIBA | Full-Text | 87-88 | |
| Bianca Pereira; Nitish Aggarwal; Paul Buitelaar | |||
| The number of available Linked Data datasets has been increasing over time. Despite this, their use to recognise entities in unstructured plain text (Entity Linking task) is still limited to a small number of datasets. In this paper we propose a framework adaptable to the structure of generic Linked Data datasets. This adaptability allows a broader use of Linked Data datasets for the Entity Linking task. | |||
| Content extraction using diverse feature sets | | BIBA | Full-Text | 89-90 | |
| Matthew E. Peters; Dan Lecocq | |||
| The goal of content extraction or boilerplate detection is to separate the main content from navigation chrome, advertising blocks, copyright notices and the like in web pages. In this paper we explore a machine learning approach to content extraction that combines diverse feature sets and methods. Our main contributions are: a) preliminary results that show combining feature sets generally improves performance; and b) a method for including semantic information via id and class attributes applicable to HTML5. We also show that performance decreases on a new benchmark data set that better represents modern chrome. | |||
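As a rough illustration of how semantic information from id and class attributes can become features for boilerplate detection, the sketch below tokenizes those attributes and combines them with shallow text statistics; the hint lists and feature names are assumptions for illustration, not the paper's feature set.

```python
# A minimal sketch (not the authors' pipeline) of deriving content-extraction
# features from id/class attributes, assuming BeautifulSoup is available.
import re
from bs4 import BeautifulSoup

# Hypothetical token lists hinting at chrome vs. content blocks.
CHROME_HINTS = {"nav", "footer", "header", "sidebar", "ad", "banner", "share"}
CONTENT_HINTS = {"article", "content", "post", "story", "body", "main", "text"}

def id_class_tokens(tag):
    """Split id/class attribute values into lowercase word tokens."""
    raw = " ".join([tag.get("id", "")] + tag.get("class", []))
    return set(re.findall(r"[a-z]+", raw.lower()))

def block_features(tag):
    """Semantic plus shallow text features for one candidate block."""
    tokens = id_class_tokens(tag)
    text = tag.get_text(" ", strip=True)
    return {
        "chrome_hits": len(tokens & CHROME_HINTS),
        "content_hits": len(tokens & CONTENT_HINTS),
        "text_length": len(text),
        "link_density": len(tag.find_all("a")) / (len(text.split()) + 1),
    }

html = "<div class='post-content'><p>Main article text...</p></div>"
soup = BeautifulSoup(html, "html.parser")
print([block_features(div) for div in soup.find_all("div")])
```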
| Predicting relevant news events for timeline summaries | | BIBA | Full-Text | 91-92 | |
| Giang Binh Tran; Mohammad Alrifai; Dat Quoc Nguyen | |||
| This paper presents a framework for automatically constructing timeline summaries from collections of web news articles. We also evaluate our solution against manually created timelines and in comparison with related work. | |||
| Collective matrix factorization for co-clustering | | BIBA | Full-Text | 93-94 | |
| Mrinmaya Sachan; Shashank Srivastava | |||
| We outline some matrix factorization approaches for co-clustering polyadic data (like publication data) using non-negative factorization (NMF). NMF approximates the data as a product of non-negative low-rank matrices, and can induce desirable clustering properties in the matrix factors through a flexible range of constraints. We show that simultaneous factorization of one or more matrices provides potent approaches for co-clustering. | |||
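One common way to realize simultaneous factorization of several matrices is to share a row factor across them and apply multiplicative updates. The sketch below shows that generic formulation for two matrices sharing one dimension; the update rules and toy data are illustrative, not the authors' code.

```python
# Collective NMF sketch: X1 ~= W @ H1 and X2 ~= W @ H2 with a shared W,
# minimized with standard multiplicative updates (assumed setup).
import numpy as np

def collective_nmf(X1, X2, k=5, n_iter=200, eps=1e-9):
    """Jointly factor two nonnegative matrices that share their row entities."""
    rng = np.random.default_rng(0)
    n = X1.shape[0]
    W = rng.random((n, k))
    H1 = rng.random((k, X1.shape[1]))
    H2 = rng.random((k, X2.shape[1]))
    for _ in range(n_iter):
        W *= (X1 @ H1.T + X2 @ H2.T) / (W @ (H1 @ H1.T + H2 @ H2.T) + eps)
        H1 *= (W.T @ X1) / (W.T @ W @ H1 + eps)
        H2 *= (W.T @ X2) / (W.T @ W @ H2 + eps)
    return W, H1, H2

# Co-cluster assignment for the shared dimension (e.g. authors across
# an author-term matrix and an author-venue matrix).
X1 = np.random.rand(20, 30)
X2 = np.random.rand(20, 10)
W, H1, H2 = collective_nmf(X1, X2, k=3)
print(W.argmax(axis=1))  # cluster label per shared row entity
```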
| Walk and learn: a two-stage approach for opinion words and opinion targets co-extraction | | BIBA | Full-Text | 95-96 | |
| Liheng Xu; Kang Liu; Siwei Lai; Yubo Chen; Jun Zhao | |||
| This paper proposes a novel two-stage method for opinion words and opinion targets co-extraction. In the first stage, a Sentiment Graph Walking algorithm is proposed, which naturally incorporates syntactic patterns in a graph to extract opinion word/target candidates. In the second stage, we adopt a self-learning strategy to refine the results from the first stage, especially for filtering out high-frequency noise and capturing long-tail terms. Preliminary experimental evaluation shows that considering pattern confidence in the graph is beneficial and our approach achieves promising improvement over three competitive baselines. | |||
| Discovery of technical expertise from open source code repositories | | BIBA | Full-Text | 97-98 | |
| Rahul Venkataramani; Atul Gupta; Allahbaksh Asadullah; Basavaraju Muddu; Vasudev Bhat | |||
| Online Question and Answer websites for developers have emerged as the main forums for interaction during the software development process. The veracity of an answer in such websites is typically verified by the number of 'upvotes' that the answer garners from peer programmers using the same forum. Although this mechanism has proved to be extremely successful in rating the usefulness of the answers, it does not lend itself very elegantly to modeling the expertise of a user in a particular domain. In this paper, we propose a model to rank the expertise of the developers in a target domain by mining their activity in different open-source projects. To demonstrate the validity of the model, we built a recommendation system for StackOverflow which uses the data mined from GitHub. | |||
| Power dynamics in spoken interactions: a case study on 2012 republican primary debates | | BIBA | Full-Text | 99-100 | |
| Vinodkumar Prabhakaran; Ajita John; Dorée D. Seligmann | |||
| In this paper, we explore how the power differential between participants of an interaction affects the way they interact in the context of political debates. We analyze the 2012 Republican presidential primary debates where we model the power index of each candidate in terms of their poll standings. We find that the candidates' power indices affected the way they interacted with others in the debates as well as how others interacted with them. | |||
| A non-learning approach to spelling correction in web queries | | BIBA | Full-Text | 101-102 | |
| Jason Soo | |||
| We describe an adverse environment spelling correction algorithm, known as Segments. Segments is language and domain independent and does not require any training data. We evaluate Segments' correction rate of transcription errors in web query logs against a state-of-the-art learning approach. We show that in environments where learning approaches are not applicable, such as multilingual documents, Segments has an F1-score within 0.005 of the learning approach. | |||
| Extracting implicit features in online customer reviews for opinion mining | | BIBA | Full-Text | 103-104 | |
| Yu Zhang; Weixiang Zhu | |||
| As the number of customer reviews grows very rapidly, it is essential to summarize useful opinions for buyers, sellers and producers. One key step of opinion mining is feature extraction. Most existing research focuses on finding explicit features; only a few attempts have been made to extract implicit features. Moreover, nearly all existing research concentrates only on product features, and little attention has been paid to other features that relate to sellers, services and logistics. Therefore, in this paper, we propose a novel co-occurrence association-based method, which aims to extract implicit features in customer reviews and provide more comprehensive and fine-grained mining results. | |||
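A minimal sketch of the co-occurrence association idea follows: opinion words are associated with explicit features via a PMI-like score, and a sentence with an opinion word but no explicit feature is mapped to its most associated feature. The toy reviews, tokenization and scoring are assumptions for illustration, not the paper's exact method.

```python
# Toy co-occurrence association between opinion words and explicit features.
import math
from collections import Counter
from itertools import product

reviews = [
    ("fast delivery and friendly seller", {"delivery", "seller"}),
    ("cheap and great quality", {"price", "quality"}),
    ("very cheap with quick delivery", {"price", "delivery"}),
    ("great quality but slow delivery", {"quality", "delivery"}),
]
opinion_words = {"fast", "friendly", "cheap", "great", "quick", "slow"}

pair_counts, op_counts, feat_counts = Counter(), Counter(), Counter()
for text, features in reviews:
    ops = set(text.split()) & opinion_words
    op_counts.update(ops)
    feat_counts.update(features)
    pair_counts.update(product(ops, features))
n = len(reviews)

def association(op, feat):
    """Pointwise mutual information between an opinion word and a feature."""
    joint = pair_counts[(op, feat)]
    if joint == 0:
        return float("-inf")
    return math.log(joint * n / (op_counts[op] * feat_counts[feat]))

def implicit_feature(opinion_word):
    """Guess the implied feature for a sentence with no explicit feature."""
    return max(feat_counts, key=lambda f: association(opinion_word, f))

print(implicit_feature("cheap"))  # prints "price" on this toy data
```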
| Co-training and visualizing sentiment evolvement for tweet events | | BIBA | Full-Text | 105-106 | |
| Shenghua Liu; Wenjun Zhu; Ning Xu; Fangtao Li; Xue-qi Cheng; Yue Liu; Yuanzhuo Wang | |||
| Sentiment classification on tweet events has attracted more interest in recent years. The large tweet stream prevents people from reading the whole classified list to understand the insights. We employ the co-training framework in the proposed algorithm. Features are split into text-view features and non-text-view features. Two Random Forest (RF) classifiers are trained with the common labeled data on the two views of features separately. Then for each specific event, they collaboratively and periodically train together to boost the classification performance. Finally, we propose a "river" graph to visualize the intensity and evolvement of sentiment on an event, which demonstrates the intensity by both color gradient and opinion labels, and the ups and downs of confronting opinions by the river flow. Compared with well-known sentiment classifiers, our algorithm achieves consistent increases in accuracy on tweet events from the TREC 2011 Microblog track and our database. The visualization helps people recognize turning and bursting patterns, and predict sentiment trends in an intuitive way. | |||
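The co-training loop described above can be sketched roughly as follows, with two Random Forests (one per feature view) pseudo-labeling the most confident unlabeled examples for each other; the data layout, confidence criterion and round counts here are assumptions, not the paper's exact procedure.

```python
# A co-training sketch with a text view and a non-text view (assumed setup:
# X_text and X_meta are dense numpy feature matrices over the same tweets).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def co_train(X_text, X_meta, y, labeled_idx, n_rounds=5, add_per_round=20):
    """Each view's RF pseudo-labels confident examples for the shared pool."""
    y = np.array(y, dtype=object)          # pseudo-labels written into a copy
    labeled = set(labeled_idx)
    rf_text = RandomForestClassifier(n_estimators=100, random_state=0)
    rf_meta = RandomForestClassifier(n_estimators=100, random_state=0)
    for _ in range(n_rounds):
        idx = sorted(labeled)
        rf_text.fit(X_text[idx], y[idx])
        rf_meta.fit(X_meta[idx], y[idx])
        unlabeled = [i for i in range(len(y)) if i not in labeled]
        if not unlabeled:
            break
        for clf, X in ((rf_text, X_text), (rf_meta, X_meta)):
            proba = clf.predict_proba(X[unlabeled])
            order = np.argsort(-proba.max(axis=1))[:add_per_round]
            for j in order:
                i = unlabeled[j]
                if i in labeled:
                    continue
                y[i] = clf.classes_[proba[j].argmax()]  # most confident label
                labeled.add(i)
    return rf_text, rf_meta
```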
| Cost-effective node monitoring for online hot event detection in sina weibo microblogging | | BIBA | Full-Text | 107-108 | |
| Kai Chen; Yi Zhou; Hongyuan Zha; Jianhua He; Pei Shen; Xiaokang Yang | |||
| We propose a cost-effective hot event detection system over the Sina Weibo platform, currently the dominant microblogging service provider in China. The problem of finding a proper subset of microbloggers under resource constraints is formulated as a mixed-integer problem for which heuristic algorithms are developed to compute approximate solutions. Preliminary results show that by tracking about 500 out of 1.6 million candidate microbloggers and processing 15,000 microposts daily, 62% of the hot events can be detected, on average, five hours earlier than they are published by Weibo. | |||
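The paper formulates monitor selection as a mixed-integer program solved heuristically; as a simple stand-in, the sketch below uses a greedy, cost-effectiveness heuristic for picking microbloggers under a budget. The coverage sets and costs are illustrative assumptions.

```python
# Greedy budgeted-coverage stand-in for selecting microbloggers to monitor.
def greedy_monitor_selection(coverage, costs, budget):
    """coverage: {user: set of events the user posted about early};
    costs: {user: monitoring cost}; maximize covered events under the budget."""
    selected, covered, spent = [], set(), 0.0
    while True:
        best_user, best_gain = None, 0.0
        for user, events in coverage.items():
            if user in selected or spent + costs[user] > budget:
                continue
            gain = len(events - covered) / costs[user]  # cost-effectiveness
            if gain > best_gain:
                best_user, best_gain = user, gain
        if best_user is None:
            break
        selected.append(best_user)
        covered |= coverage[best_user]
        spent += costs[best_user]
    return selected, covered

coverage = {"u1": {1, 2, 3}, "u2": {2, 4}, "u3": {5}, "u4": {1, 4, 5}}
costs = {"u1": 2.0, "u2": 1.0, "u3": 1.0, "u4": 2.5}
print(greedy_monitor_selection(coverage, costs, budget=3.0))
```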
| Solving electrical networks to incorporate supervision in random walks | | BIBA | Full-Text | 109-110 | |
| Mrinmaya Sachan; Dirk Hovy; Eduard Hovy | |||
| The random walk is one of the most popular ideas in computer science. A critical assumption in random walks is that the probability of the walk being at a given vertex at a time instance converges to a limit independent of the start state. While this makes random walks computationally efficient to solve, it limits their ability to incorporate label information. In this paper, we exploit the connection between Random Walks and Electrical Networks to incorporate label information in classification, ranking, and seed expansion. | |||
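A standard instantiation of the random-walk/electrical-network connection treats labeled nodes as fixed voltages and solves for harmonic values at the unlabeled nodes; the sketch below shows that textbook formulation, which is not necessarily the paper's exact model.

```python
# Harmonic (electrical-network) label propagation: labeled nodes are voltage
# sources, unlabeled nodes take current-conserving values.
import numpy as np

def harmonic_labels(W, labels):
    """W: symmetric adjacency (conductance) matrix; labels: {node: voltage}."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian
    labeled = sorted(labels)
    unlabeled = [i for i in range(n) if i not in labels]
    f = np.zeros(n)
    f[labeled] = [labels[i] for i in labeled]
    # Solve L_uu f_u = -L_ul f_l  (Kirchhoff's law at unlabeled nodes).
    L_uu = L[np.ix_(unlabeled, unlabeled)]
    L_ul = L[np.ix_(unlabeled, labeled)]
    f[unlabeled] = np.linalg.solve(L_uu, -L_ul @ f[labeled])
    return f

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(harmonic_labels(W, {0: 1.0, 3: 0.0}))  # nodes 1, 2 get values in (0, 1)
```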
| Information current in Twitter: which brings hot events to the world | | BIBA | Full-Text | 111-112 | |
| Peilei Liu; Jintao Tang; Ting Wang | |||
| In this paper we investigate information propagation in Twitter from a geographical view on the global scale. We discover an information propagation phenomenon that we call "information current". Based on this phenomenon, we propose the hypothesis that changes of information flows may be related to real-time events. Through analysis of retweets, we show that our hypothesis is supported by experimental results. Moreover, we find that retweet texts are more effective than common tweet texts for real-time event detection. This means that Twitter could be a good filter of texts for event detection. | |||
| Traffic quality based pricing in paid search using two-stage regression | | BIBA | Full-Text | 113-114 | |
| Rouben Amirbekian; Ye Chen; Alan Lu; Tak W. Yan; Liangzhong Yin | |||
| While the cost-per-click (CPC) pricing model is mainstream in sponsored search, the quality of clicks with respect to conversion rates, and hence their value to advertisers, may vary considerably from publisher to publisher in a large syndication network. Traffic quality can therefore be used to establish price discounts for clicks from different publishers. These discounts are intended to maintain incentives for high-quality online traffic and to make it easier for advertisers to maintain long-term bid stability. The conversion signal is noisy, as each advertiser defines conversion in their own way. It is also very sparse. The traditional way of overcoming signal sparseness is to allow for a longer time in accumulating modeling data. However, due to fast-changing conversion trends, such a longer window leads to deterioration of the precision in measuring quality. To allow models to adjust to fast-changing trends with sufficient speed, we had to limit the time window for conversion data collection and make it much shorter than the several-week window commonly used. Such a shorter window makes conversions in the training set extremely sparse. To overcome the resulting obstacles, we used two-stage regression similar to hurdle regression. First, we employed logistic regression to predict zero conversion outcomes. Next, conditioned on non-zero outcomes, we used random forest regression to predict the value of the quotient of two conversion rates. The two-stage model accounts for the zero inflation due to the sparseness of the conversion signal. The combined model maintains good precision and allows faster reaction to temporal changes in traffic quality, including changes due to certain actions by publishers that may lead to click-price inflation. | |||
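The two-stage, hurdle-style idea described above can be sketched as a logistic-regression gate followed by a random-forest regressor on the non-zero cases; the feature layout, data and hyperparameters below are illustrative assumptions, not the production model.

```python
# Hurdle-style two-stage sketch: classify zero-conversion segments, then
# regress the conversion-rate quotient on the non-zero ones.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

def fit_two_stage(X, conv_ratio):
    """conv_ratio: observed conversion-rate quotient per segment (0 if none)."""
    nonzero = conv_ratio > 0
    stage1 = LogisticRegression(max_iter=1000).fit(X, nonzero.astype(int))
    stage2 = RandomForestRegressor(n_estimators=200, random_state=0)
    stage2.fit(X[nonzero], conv_ratio[nonzero])
    return stage1, stage2

def predict_quality(stage1, stage2, X):
    """Expected quotient = P(non-zero) * E[quotient | non-zero]."""
    p_nonzero = stage1.predict_proba(X)[:, 1]
    return p_nonzero * stage2.predict(X)

rng = np.random.default_rng(0)
X = rng.random((500, 8))                                  # synthetic features
conv_ratio = np.where(rng.random(500) < 0.7, 0.0, rng.random(500) * 2)
s1, s2 = fit_two_stage(X, conv_ratio)
print(predict_quality(s1, s2, X[:5]))
```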
| Dynamic evaluation of online display advertising with randomized experiments: an aggregated approach | | BIBA | Full-Text | 115-116 | |
| Joel Barajas; Ram Akella; Marius Holtan; Jaimie Kwon; Aaron Flores; Victor Andrei | |||
| We perform a randomized experiment to estimate the effects of a display advertising campaign on online user conversions. We present a time series approach using Dynamic Linear Models to decompose the daily aggregated conversions into seasonal and trend components. We attribute the difference between control and study trends to the campaign. We test the method using two real campaigns run for 28 and 21 days respectively from the Advertising.com ad network. | |||
| New features for query dependent sponsored search click prediction | | BIBA | Full-Text | 117-118 | |
| Ilya Trofimov | |||
| Click prediction for sponsored search is an important problem for commercial search engines. A good click prediction algorithm greatly affects the revenue of the search engine and the user experience, and brings more clicks to the landing pages of advertisers. This paper presents new query-dependent features for the click prediction algorithm based on treating the query and the advertisement as bags of words. The new features can improve prediction accuracy both for ads with many views and for ads with few views. | |||
| Modeling click and relevance relationship for sponsored search | | BIBA | Full-Text | 119-120 | |
| Wei Vivian Zhang; Ye Chen; Mitali Gupta; Swaraj Sett; Tak W. Yan | |||
| Click-through rate (CTR) prediction and relevance ranking are two
fundamental problems in web advertising. In this study, we address the problem
of modeling the relationship between CTR and relevance for sponsored search. We
used normalized relevance scores comparable across all queries to represent
relevance when modeling with CTR, instead of directly using human judgment
labels or relevance scores valid only within the same query. We classified clicks
by identifying their relevance quality using dwell time and session
information, and compared all clicks versus selective clicks effects when
modeling relevance.
| Our results showed that the cleaned click signal outperforms the raw click signal and the other signals we explored in terms of relevance score fitting. The cleaned clicks include clicks with dwell time greater than 5 seconds and last clicks in a session. Beyond the traditional view that there is no linear relation between clicks and relevance, we showed that the cleaned-click-based CTR can be fitted well to the normalized relevance scores using a quadratic regression model. This relevance-click model could help to train ranking models using processed click feedback to complement expensive human editorial relevance labels, or to better leverage relevance signals in CTR prediction. | |||
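A rough sketch of the click cleaning and the quadratic relevance-CTR fit follows; the dwell-time threshold mirrors the abstract, while the field names and the toy numbers are illustrative assumptions, not the paper's data.

```python
# Clean clicks (dwell > 5s or last in session), compute CTR, fit a quadratic.
import numpy as np

def clean_clicks(clicks):
    """Keep clicks with dwell time > 5 seconds or that end a session."""
    return [c for c in clicks if c["dwell_seconds"] > 5 or c["is_last_in_session"]]

def cleaned_ctr(clicks, impressions):
    return len(clean_clicks(clicks)) / max(impressions, 1)

# Fit normalized relevance ~ a*ctr^2 + b*ctr + c over query-ad pairs
# (the arrays below are synthetic placeholders).
ctr = np.array([0.01, 0.03, 0.05, 0.08, 0.12, 0.20])
normalized_relevance = np.array([0.15, 0.25, 0.33, 0.45, 0.58, 0.80])
a, b, c = np.polyfit(ctr, normalized_relevance, deg=2)
print(a, b, c)
```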
| Optimization of ads allocation in sponsored search | | BIBA | Full-Text | 121-122 | |
| Alexey Chervonenkis; Anna Sorokina; Valery A. Topinsky | |||
| We introduce the optimization problem of target-specific ads allocation, and present a technique for solving this problem for different target-constraint structures. This technique allows us to find an optimal ads allocation that maximizes a target such as CTR, revenue, or another system performance measure subject to linear constraints. We show that the optimal ads allocation depends on both the target and the constraint variables. | |||
| A joint optimization of incrementality and revenue to satisfy both advertiser and publisher | | BIBA | Full-Text | 123-124 | |
| Dmitry Pechyony; Rosie Jones; Xiaojing Li | |||
| A long-standing goal in advertising is to reduce wasted costs due to advertising to people who are unlikely to buy, as well as to those who would make a purchase whether they saw an ad or not. The ideal audience for the advertiser is the incremental users who would buy if shown an ad, and would not buy if not shown the ad. On the other hand, for publishers who are paid when the user clicks or buys, revenue may be maximized by showing ads to those users who are most likely to click or purchase. We show analytically and empirically that an optimization towards one metric might result in inferior performance on the other. We present a novel algorithm, called SLC, that performs a joint optimization towards both advertisers' and publishers' goals and provides superior results in both. | |||
| A case-based analysis of the effect of offline media on online conversion actions | | BIBA | Full-Text | 125-126 | |
| Damir Vandic; Didier Nibbering; Flavius Frasincar | |||
| In this paper, we investigate how offline advertising, by means of TV and radio, influences online search engine advertisement. Our research is based on the search engine-driven conversion actions of a 2012 marketing campaign of the potato chips manufacturer Lays. In our analysis we use several models, including linear regression (linear model) and Support Vector Regression (non-linear model). Our results confirm that offline commercials have a positive effect on the number of conversion actions from online marketing campaigns. This effect is especially visible in the first 50 minutes after the advertisement is broadcast. | |||
| An error driven approach to query segmentation | | BIBA | Full-Text | 127-128 | |
| Wei Zhang; Yunbo Cao; Chin-Yew Lin; Jian Su; Chew-Lim Tan | |||
| Query segmentation is the task of splitting a query into a sequence of non-overlapping segments that completely cover all tokens in the query. The majority of query segmentation methods are unsupervised. In this paper, we propose an error-driven approach to query segmentation (EDQS) with the help of search logs, which enables unsupervised training with guidance from the system-specific errors. In EDQS, we first detect the system's errors by examining the consistency among the segmentations of similar queries. Then, a model is trained by the detected errors to select the correct segmentation of a new query from the top-n outputs of the system. Our evaluation results show that EDQS can significantly boost the performance of state-of-the-art query segmentation methods on a publicly available data set. | |||
| Introducing search behavior into browsing based models of page's importance | | BIBA | Full-Text | 129-130 | |
| Maxim Zhukovskiy; Andrei Khropov; Gleb Gusev; Pavel Serdyukov | |||
| The BrowseRank algorithm and its modifications are based on analyzing users' browsing trails. Our paper proposes a new method for computing page importance using a more realistic and effective search-aware model of user browsing behavior than the one used in BrowseRank. | |||
| Learning to shorten query sessions | | BIBA | Full-Text | 131-132 | |
| Cristina Ioana Muntean; Franco Maria Nardini; Fabrizio Silvestri; Marcin Sydow | |||
| We propose the use of learning to rank techniques to shorten query sessions by maximizing the probability that the query we predict is the "final" query of the current search session. We present a preliminary evaluation showing that this approach is a promising research direction. | |||
| The ACE theorem for querying the web of data | | BIBA | Full-Text | 133-134 | |
| Jürgen Umbrich; Claudio Gutierrez; Aidan Hogan; Marcel Karnstedt; Josiane Xavier Parreira | |||
| Inspired by the CAP theorem, we identify three desirable properties when querying the Web of Data: Alignment (results up-to-date with sources), Coverage (results covering available remote sources), and Efficiency (bounded resources). In this short paper, we show that no system querying the Web can meet all three ACE properties, but instead must make practical trade-offs that we outline. | |||
| Towards leveraging closed captions for news retrieval | | BIBA | Full-Text | 135-136 | |
| Roi Blanco; Gianmarco De Francisci Morales; Fabrizio Silvestri | |||
| IntoNow from Yahoo! is a second screen application that enhances the way of watching TV programs. The application uses audio from the TV set to recognize the program being watched, and provides several services for different use cases. For instance, while watching a football game on TV it can show statistics about the teams playing, or show the title of the song performed by a contestant in a talent show. The additional content provided by IntoNow is a mix of editorially curated and automatically selected content. From a research perspective, one of the most interesting and challenging use cases addressed by IntoNow is related to news programs (newscasts). When a user is watching a newscast, IntoNow detects it and starts showing online news articles from the Web. This work presents a preliminary study of this problem, i.e., to find an online news article that matches the piece of news discussed in the newscast currently airing on TV, and display it in real-time. | |||
| Searching the deep web using proactive phrase queries | | BIBA | Full-Text | 137-138 | |
| Wensheng Wu; Tingting Zhong | |||
| This paper proposes ipq, a novel search engine that proactively transforms query forms of Deep Web sources into phrase queries, constructs query evaluation plans, and caches results for popular queries offline. Then at query time, keyword queries are simply matched with phrase queries to retrieve results. ipq embodies a novel dual-ranking framework for query answering and novel solutions for discovering frequent attributes and queries. Preliminary experiments show the great potential of ipq. | |||
| Graded relevance ranking for synonym discovery | | BIBA | Full-Text | 139-140 | |
| Andrew Yates; Nazli Goharian; Ophir Frieder | |||
| Interest in domain-specific search is steadfastly increasing, yielding a growing need for domain-specific synonym discovery. Existing synonym discovery methods perform poorly when faced with the realistic task of identifying a target term's synonyms from among many candidates. We approach domain-specific synonym discovery as a graded relevance ranking problem in which a target term's synonym candidates are ranked by their quality. In this scenario a human editor uses each ranked list of synonym candidates to build a domain-specific thesaurus. We evaluate our method for graded relevance ranking of synonym candidates and find that it outperforms existing methods. | |||
| Ranking method specialized for content descriptions of classical music | | BIBA | Full-Text | 141-142 | |
| Taku Kuribayashi; Yasuhito Asano; Masatoshi Yoshikawa | |||
| In this paper, we propose novel ranking methods for effectively finding content descriptions of classical music compositions. In addition to rather naive methods using technical term frequency and latent Dirichlet allocation (LDA), we propose a novel classification of web pages about classical music and use the characteristics of the classification in our method of search by labeled LDA (L-LDA). The experimental results show that our method performs well at finding content descriptions of classical music compositions. | |||
| Towards a development process for geospatial information retrieval and search | | BIBA | Full-Text | 143-144 | |
| Dirk Ahlers | |||
| Geospatial search, as a special type of vertical search, has specific requirements and challenges. While the general principle of resource discovery, extraction, indexing, and search holds, geospatial search systems are tailored to the specific use case at hand with many individual adaptations. In this short overview, we collect and organize the main principles for the multitude of challenges and adaptations to be considered within the development process, working towards a more formal description. | |||
| Searching for interestingness in Wikipedia and Yahoo! Answers | | BIBA | Full-Text | 145-146 | |
| Yelena Mejova; Ilaria Bordino; Mounia Lalmas; Aristides Gionis | |||
| In many cases, when browsing the Web, users are searching for specific information. Sometimes, though, users are also looking for something interesting, surprising, or entertaining. Serendipitous search puts interestingness on par with relevance. We investigate how interesting are the results one can obtain via serendipitous search, and what makes them so, by comparing entity networks extracted from two prominent social media sites, Wikipedia and Yahoo! Answers. | |||
| A click model for time-sensitive queries | | BIBA | Full-Text | 147-148 | |
| Seung Eun Lee; Dongug Kim | |||
| User behavior on search results pages provides a clue about the query intent and the relevance of documents. To incorporate this information into search rankings, a variety of click modeling techniques have been proposed, and they are now widely used in commercial search engines. For time-sensitive queries, however, applying click models can degrade search relevance because the best document in the past may not be the current best answer. To address this problem, we need to detect the time point, a turning point, at which the search intent for a given query changes, and to reflect it in click models. In this work, we devised a method to detect the turning point of a query from its search volume history. The proposed click model is designed to take into account only user behavior observed after the turning point. We applied our model in a commercial search engine and evaluated its relevance. | |||
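As an illustration of detecting a turning point from a query's search volume history, the sketch below picks the time index with the largest mean shift and keeps only clicks observed after it; the detection rule and log fields are assumptions, not necessarily the authors' method.

```python
# Detect a turning point in search volume and filter click-log data by it.
import numpy as np

def turning_point(volume, min_seg=7):
    """Return the index t that maximizes |mean(v[t:]) - mean(v[:t])|."""
    v = np.asarray(volume, dtype=float)
    best_t, best_shift = None, 0.0
    for t in range(min_seg, len(v) - min_seg):
        shift = abs(v[t:].mean() - v[:t].mean())
        if shift > best_shift:
            best_t, best_shift = t, shift
    return best_t

def clicks_after_turning_point(click_log, volume):
    """Feed the click model only behavior observed after the turning point."""
    t = turning_point(volume)
    return click_log if t is None else [c for c in click_log if c["day"] >= t]

volume = [10] * 30 + [200, 220, 210, 190, 230, 250, 240, 260, 255, 245]
print(turning_point(volume))  # prints 30, where the simulated volume jumps
```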
| Intent classification of voice queries on mobile devices | | BIBA | Full-Text | 149-150 | |
| Subhabrata Mukherjee; Ashish Verma; Kenneth W. Church | |||
| Mobile query classification faces the usual challenges of short and noisy queries, as in web search. However, the task of mobile query classification is made more difficult by the presence of more interactive and personalized queries such as map, command-and-control, dialogue, and joke queries. Voice queries are made more difficult than typed queries by the errors introduced by the automatic speech recognizer. This is the first paper, to the best of our knowledge, to bring the complexities of voice search and intent classification together. In this paper, we propose some novel features for intent classification, such as the URLs of the search engine results for the given query. We also show the effectiveness of other features derived from the part-of-speech information of the query and search engine results, in proposing a multi-stage classifier for intent classification. We evaluate the classifier using tagged data collected from a voice search Android application, where we achieve an average F-score improvement of 22% per category over the commonly used bag-of-words baseline. | |||
| Leveraging geographical metadata to improve search over social media | | BIBA | Full-Text | 151-152 | |
| Alexander Kotov; Yu Wang; Eugene Agichtein | |||
| We propose methods for document, query and relevance model expansion that leverage geographical metadata provided by social media. In particular, we propose a geographically-aware extension of the LDA topic model and utilize the resulting topics and language models in our expansion methods. The proposed approach has been experimentally evaluated over a large sample of Twitter, demonstrating significant improvements in search accuracy over traditional (geographically-unaware) retrieval models. | |||
| Place value: word position shifts vital to search dynamics | | BIBA | Full-Text | 153-154 | |
| Rishiraj Saha Roy; Anusha Suresh; Niloy Ganguly; Monojit Choudhury | |||
| With fast changing information needs in today's world, it is imperative that search engines precisely understand and exploit temporal changes in Web queries. In this work, we look at shifts in preferred positions of segments in queries over an interval of four years. We find that such shifts can predict key changes in usage patterns, and explain the observed increase in query lengths. Our findings indicate that recording positional statistics can be vital for understanding user intent in Web search queries. | |||
| Synthetic review spamming and defense | | BIBA | Full-Text | 155-156 | |
| Alex Morales; Huan Sun; Xifeng Yan | |||
| Online reviews are widely adopted on many websites such as Amazon, Yelp, and TripAdvisor. Positive reviews can bring significant financial gains, while negative ones often cause sales loss. This fact, unfortunately, results in strong incentives for opinion spam to mislead readers. Instead of hiring humans to write deceptive reviews, in this work we call attention to an automated, low-cost process for generating fake reviews, variations of which could easily be employed by malicious attackers in reality. To the best of our knowledge, we are the first to expose the potential risk of machine-generated deceptive reviews. Our simple review synthesis model uses one truthful review as a template, and replaces its sentences with those from other reviews in a repository. The fake reviews generated by this mechanism are extremely hard to detect: both state-of-the-art machine detectors and human readers have an error rate of 35%-48%. A novel defense method that leverages the difference of semantic flows between fake and truthful reviews is developed, reducing the detection error rate to approximately 22%. Nevertheless, it remains a challenging research task to further decrease the error rate. | |||
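A toy version of the template-based synthesis mechanism described above is sketched below: each sentence of a truthful template review is replaced by its most similar sentence from a review repository. The use of TF-IDF cosine similarity here is an assumption for illustration, not necessarily the paper's choice.

```python
# Replace each template sentence with its nearest repository sentence.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def synthesize_review(template_sentences, repository_sentences):
    """Swap every template sentence for its closest match in the repository."""
    vec = TfidfVectorizer().fit(template_sentences + repository_sentences)
    T = vec.transform(template_sentences)
    R = vec.transform(repository_sentences)
    sims = cosine_similarity(T, R)
    return [repository_sentences[row.argmax()] for row in sims]

template = ["The room was clean and spacious.", "Staff were very helpful."]
repository = [
    "Our room was spotless and surprisingly spacious.",
    "The pool area was crowded.",
    "Hotel staff went out of their way to help us.",
]
print(" ".join(synthesize_review(template, repository)))
```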
| REDACT: a framework for sanitizing RDF data | | BIBA | Full-Text | 157-158 | |
| Jyothsna Rachapalli; Vaibhav Khadilkar; Murat Kantarcioglu; Bhavani Thuraisingham | |||
| Resource Description Framework (RDF) is the foundational data model of the Semantic Web, and is essentially designed for the integration of heterogeneous data from varying sources. However, the lack of security features for managing sensitive RDF data during sharing may result in privacy breaches, which, in turn, result in loss of user trust. Therefore, it is imperative to provide an infrastructure to secure RDF data. We present a set of graph sanitization operations that are built as an extension to SPARQL. These operations allow one to sanitize sensitive parts of an RDF graph and further enable one to build more sophisticated security and privacy features, thus allowing RDF data to be shared securely. | |||
| Framework for evaluation of text captchas | | BIBA | Full-Text | 159-160 | |
| Achint Thomas; Kunal Punera; Lyndon Kennedy; Belle Tseng; Yi Chang | |||
| Interactive websites use text-based Captchas to prevent unauthorized automated interactions. These Captchas must be easy for humans to decipher while being difficult to crack by automated means. In this work we present a framework for the systematic study of Captchas along these two competing objectives. We begin by abstracting a set of distortions that characterize current and past commercial text-based Captchas. By means of user studies, we quantify the way human Captcha solving performance varies with changes in these distortion parameters. To quantify the effect of these distortions on the accuracy of automated solvers (bots), we propose a learning-based algorithm that performs automated Captcha segmentation driven by character recognition. Results show that our proposed algorithm is generic enough to solve text-based Captchas with widely varying distortions without requiring the use of hand-coded image processing or heuristic rules. | |||
| A probability-based trust prediction model using trust-message passing | | BIBA | Full-Text | 161-162 | |
| Hyun-Kyo Oh; Jin-Woo Kim; Sang-Wook Kim; Kichun Lee | |||
| We propose a probability-based trust prediction model based on trust-message passing, which takes advantage of two kinds of information: explicit information and implicit information. | |||
| RepRank: reputation in a peer-to-peer online system | | BIBA | Full-Text | 163-164 | |
| Zeqian Shen; Neel Sundaresan | |||
| Peer-to-peer e-commerce networks exemplify online lemon markets. Trust is key to sustaining these networks. We present a reputation system named RepRank that approaches trust with an intuition that in the peer-to-peer e-commerce world consisting of buyers and sellers, good buyers are those who buy from good sellers, and good sellers are those from whom good buyers buy. We propagate trust and distrust in a network using this mutually recursive definition. We discuss the algorithms and present the evaluation results. | |||
| The STAC (security toolbox: attacks & countermeasures) ontology | | BIBA | Full-Text | 165-166 | |
| Amelie Gyrard; Christian Bonnet; Karima Boudaoud | |||
| We present a security ontology to help software designers or developers who are not security experts to (1) design secure software and (2) understand and be aware of the main security concepts and issues. Our security ontology defines the main security concepts such as attacks, countermeasures, security properties and their relationships. Countermeasures can be cryptographic concepts (encryption algorithms, key management, digital signatures, hash functions), security tools or security protocols. The purpose of this ontology is to be reused in numerous domains such as the security of web applications, network management or communication networks (sensor, cellular and wireless). The ontology and a user interface (to use the ontology) are available online. | |||
| Modeling uncertain provenance and provenance of uncertainty in W3C PROV | | BIBA | Full-Text | 167-168 | |
| Tom De Nies; Sam Coppens; Erik Mannens; Rik Van de Walle | |||
| This paper describes how to model uncertain provenance and provenance of uncertain things in a flexible and unintrusive manner using PROV, W3C's new standard for provenance. Three new attributes with clearly defined values and semantics are proposed. Modeling this information is an important step towards the modeling and derivation of trust from resources whose provenance is described using PROV. | |||
| Scalable processing of flexible graph pattern queries on the cloud | | BIBA | Full-Text | 169-170 | |
| Padmashree Ravindra; Kemafor Anyanwu | |||
| Flexible exploration of large RDF datasets with unknown relationships can be enabled using 'unbound-property' graph pattern queries. Relational-style processing of such queries using normalized relations results in redundant information in intermediate results due to the repetition of adjoining bound (fixed) properties. Such redundancy negatively impacts the disk I/O, network transfer costs, and the required disk space while processing RDF query workloads on MapReduce-based systems. This work proposes packing and lazy unpacking strategies to minimize the redundancy in intermediate results while processing unbound-property queries. In addition to keeping the results compact, this work evaluates RDF queries using the Nested TripleGroup Data Model and Algebra (NTGA) that enables shorter MapReduce execution workflows. Experimental results demonstrate the benefit of this work over RDF query processing using relational-style systems such as Apache Pig and Hive. | |||
| Computing semantic relatedness from human navigational paths on Wikipedia | | BIBA | Full-Text | 171-172 | |
| Philipp Singer; Thomas Niebler; Markus Strohmaier; Andreas Hotho | |||
| This paper presents a novel approach for computing semantic relatedness between concepts on Wikipedia by using human navigational paths for this task. Our results suggest that human navigational paths provide a viable source for calculating semantic relatedness between concepts on Wikipedia. We also show that we can improve accuracy by intelligent selection of path corpora based on path characteristics, indicating that not all paths are equally useful. Our work makes an argument for expanding the existing arsenal of data sources for calculating semantic relatedness and for considering the utility of human navigational paths for this task. | |||
| Discovering multilingual concepts from unaligned web documents by exploring associated images | | BIBA | Full-Text | 173-174 | |
| Xiaochen Zhang; Xiaoming Jin; Lianghao Li; Dou Shen | |||
| The Internet is experiencing an explosion of information presented in different languages. Though written in different languages, some articles implicitly share common concepts. In this paper, we propose a novel framework to mine cross-language common concepts from unaligned web documents. Specifically, visual words of images are used to bridge articles in different languages, and common concepts of multiple languages are then learned by using an existing topic modeling algorithm. We conduct cross-lingual text classification on a real-world data set using the multilingual concepts mined by our method. The experimental results show that our approach is effective at mining cross-lingual common concepts. | |||
| Fria: fast and robust instance alignment | | BIBA | Full-Text | 175-176 | |
| Sanghoon Lee; Jongwuk Lee; Seung-won Hwang | |||
| This paper proposes Fria, a fast and robust instance alignment framework across two independently built knowledge bases (KBs). Our objective is two-fold: (1) to design an effective instance similarity measure and (2) to build a fast and robust alignment framework. Specifically, Fria consists of two phases. Fria first achieves high-precision alignment for seed matches which have strong evidence for aligning. To obtain high-recall alignment, Fria then divides non-matched instances according to the types identified from seeds, and gives additional chances to the same-typed instances to be matched. Experimental results show that Fria is fast and robust, achieving accuracy comparable to state-of-the-art methods with a 10-fold speedup. | |||
| Popularity prediction in microblogging network: a case study on sina weibo | | BIBA | Full-Text | 177-178 | |
| Peng Bao; Hua-Wei Shen; Junming Huang; Xue-Qi Cheng | |||
| Predicting the popularity of content is important for both the hosts and users of social media sites. The challenge of this problem comes from the inequality of the popularity of content. Existing methods for popularity prediction are mainly based on the quality of content, the interface of the social media site to highlight contents, and the collective behavior of users. However, little attention is paid to the structural characteristics of the networks spanned by early adopters, i.e., the users who view or forward the content in the early stage of content dissemination. In this paper, taking Sina Weibo as a case, we empirically study whether structural characteristics can provide clues for the popularity of short messages. We find that the popularity of content is well reflected by the structural diversity of the early adopters. Experimental results demonstrate that the prediction accuracy is significantly improved by incorporating the factor of structural diversity into existing methods. | |||
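One common operationalization of the structural diversity of early adopters is the number of connected components in the subgraph they induce; the sketch below computes that quantity with networkx. The graph construction and the choice of early adopters are illustrative assumptions, not the paper's dataset.

```python
# Structural diversity of early adopters as connected components of their
# induced subgraph in the follower network.
import networkx as nx

def structural_diversity(follower_graph, early_adopters):
    """Connected components of the subgraph induced by the early adopters."""
    sub = follower_graph.subgraph(early_adopters)
    return nx.number_connected_components(sub)

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("d", "e"), ("f", "g")])
G.add_node("h")                      # an isolated user
early = ["a", "c", "d", "f", "h"]    # first users who forwarded the message
print(structural_diversity(G, early))  # prints 5: all adopters are disconnected
```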
| The power of local information in PageRank | | BIBA | Full-Text | 179-180 | |
| Marco Bressan; Enoch Peserico; Luca Pretto | |||
| Can one assess, by visiting only a small portion of a graph, if a given node has a significantly higher PageRank score than another? We show that the answer strongly depends on the interplay between the required correctness guarantees (is one willing to accept a small probability of error?) and the graph exploration model (can one only visit parents and children of already visited nodes?). | |||
| Semantically sampling in heterogeneous social networks | | BIBA | Full-Text | 181-182 | |
| Cheng-Lun Yang; Perng-Hwa Kung; Chun-An Chen; Shou-De Lin | |||
| Online social network sampling identifies a representative subnetwork that preserves certain graph properties given heterogeneous semantics, with the full network not observed during sampling. This study presents a property, the Relational Profile, to account for the conditional dependency of node and relation type semantics in a network, and a sampling method to preserve this property. We show that the proposed sampling method better preserves the Relational Profile. Next, the Relational Profile can be used to design features that boost network prediction. Finally, our sampled network trains more accurate prediction models than other sampling baselines. | |||
| Sampling bias in user attribute estimation of OSNs | | BIBA | Full-Text | 183-184 | |
| Hosung Park; Sue Moon | |||
| Recent work on unbiased sampling of OSNs has focused on estimation of the network characteristics such as degree distributions and clustering coefficients. In this work we shift the focus to node attributes. We show that existing sampling methods produce biased outputs and need modifications to alleviate the bias. | |||
| Link recommendation for promoting information diffusion in social networks | | BIBA | Full-Text | 185-186 | |
| Dong Li; Zhiming Xu; Sheng Li; Xin Sun; Anika Gupta; Katia Sycara | |||
| Online social networks mainly have two functions: social interaction and information diffusion. Most current link recommendation research focuses only on strengthening the social interaction function, and ignores the problem of how to enhance the information diffusion function. To solve this problem, this paper introduces the concept of user diffusion degree, proposes an algorithm for calculating it, and combines it with traditional recommendation methods for re-ranking recommended links. Experimental results on an Email dataset and an Amazon dataset under the Independent Cascade Model and the Linear Threshold Model show that our method noticeably outperforms the traditional methods in terms of promoting information diffusion. | |||
| Domain-sensitive opinion leader mining from online review communities | | BIBA | Full-Text | 187-188 | |
| Qingliang Miao; Shu Zhang; Yao Meng; Hao Yu | |||
| In this paper, we investigate how to identify domain-sensitive opinion leaders in online review communities, and present a model to rank domain-sensitive opinion leaders. To evaluate the effectiveness of the proposed model, we conduct preliminary experiments on a real-world dataset from Amazon.com. Experimental results indicate that the proposed model is effective in identifying domain-sensitive opinion leaders. | |||
| Understanding election candidate approval ratings using social media data | | BIBA | Full-Text | 189-190 | |
| Danish Contractor; Tanveer Afzal Faruquie | |||
| The last few years have seen an exponential increase in the amount of social media data generated daily. Thus, researchers have started exploring the use of social media data in building recommendation systems, prediction models, improving disaster management, discovering trending topics, etc. An interesting application of social media is the prediction of election results. The recently conducted 2012 US Presidential election was the "most tweeted" election in history and provides a rich source of social media posts. Previous work on predicting election outcomes from social media has largely been based on sentiment about candidates, total volumes of tweets expressing electoral polarity, and the like. In this paper we use a collection of tweets to predict the daily approval ratings of the two US presidential candidates and also identify topics that were causal to the approval ratings. | |||
| Extracting the multilevel communities based on network structural and nonstructural information | | BIBA | Full-Text | 191-192 | |
| Xin Liu; Tsuyoshi Murata; Ken Wakita | |||
| Many real-world networks contain nonstructural information on nodes, such as the spatial coordinate of a location, profile of a person, or contents of a web page. In this paper, we propose Dist-Modularity, a unified modularity measure, which is useful in extracting the multilevel communities based on network structural and nonstructural information. | |||
| Structural-interaction link prediction in microblogs | | BIBA | Full-Text | 193-194 | |
| Jia Yantao; Wang Yuanzhuo; Li Jingyuan; Feng Kai; Cheng Xueqi; Li Jianchen | |||
| Unsupervised link prediction in microblogs aims to find an appropriate similarity measure between users in the network. However, the measures used by existing work lack a simple way to incorporate the structure of the network and the interactions between users. In this work, we define the retweet similarity to measure the interactions between users in Twitter, and propose a structural-interaction based matrix factorization model for following-link prediction. Experiments on real-world Twitter data show that our model outperforms state-of-the-art methods. | |||
| Fast anomaly detection despite the duplicates | | BIBA | Full-Text | 195-196 | |
| Jay Yoon Lee; U. Kang; Danai Koutra; Christos Faloutsos | |||
| Given a large cloud of multi-dimensional points and an off-the-shelf outlier detection method, why does it take a week to finish? After careful analysis, we discovered that duplicate points create subtle issues that the literature has ignored: if dmax is the multiplicity of the most over-plotted point, typical algorithms are quadratic in dmax. We propose several ways to eliminate the problem; we report wall-clock times and our time savings; and we show that our methods give either exact results, or highly accurate approximate ones. | |||
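A simple way to avoid the quadratic dependence on dmax is to collapse duplicates into unique points with multiplicities before computing a kNN-style outlier score; the sketch below is an illustrative stand-in, not the authors' algorithm.

```python
# Deduplicate points, then compute a k-distance outlier score that counts
# each duplicate as a neighbor at distance zero.
import numpy as np
from collections import Counter

def knn_outlier_scores(points, k=5):
    """k-distance outlier score on unique points, duplicates counted as neighbors."""
    uniques = Counter(map(tuple, points))
    X = np.array(list(uniques.keys()), dtype=float)
    counts = np.array(list(uniques.values()))
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = {}
    for i in range(len(X)):
        order = np.argsort(D[i])
        remaining, kth = k, 0.0
        for j in order:
            c = counts[j] - (1 if j == i else 0)  # exclude the point itself once
            if c <= 0:
                continue
            kth = D[i, j]
            remaining -= c
            if remaining <= 0:
                break
        scores[tuple(X[i])] = kth  # distance to the k-th neighbor
    return scores

pts = [(0, 0)] * 1000 + [(0.1, 0.0)] * 500 + [(5, 5)]
print(max(knn_outlier_scores(pts, k=5).items(), key=lambda kv: kv[1]))
```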
| Recommendation for online social feeds by exploiting user response behavior | | BIBA | Full-Text | 197-198 | |
| Ping-Han Soh; Yu-Chieh Lin; Ming-Syan Chen | |||
| In recent years, online social networks have expanded dramatically. Active users spend hours communicating with each other via these networks, such that an enormous amount of data is created every second. The tremendous amount of newly created information costs users much time to discover interesting messages in their online social feeds. The problem is exacerbated if users access these networks via mobile devices. To assist users in discovering interesting messages efficiently, in this paper we propose a new approach to recommend interesting messages to each user by exploiting the user's response behavior. We extract data from the most popular social network, and the experimental results show that the proposed approach is effective and efficient. | |||
| Lists as coping strategy for information overload on Twitter | | BIBA | Full-Text | 199-200 | |
| Simon de la Rouviere; Kobus Ehlers | |||
| When following too many users on microblogging services, information overload occurs due to increased and varied communication activity. Users then either leave, or employ coping strategies to continue benefiting from the service. Through a crawl of 31 684 random users from Twitter and a qualitative survey with 115 respondents, it has been determined that by using lists as an information management coping strategy (filtering and compartmentalising varied communication activity), users can follow more users and experience fewer symptoms of information overload. | |||
| To crop, or not to crop: compiling online media galleries | | BIBA | Full-Text | 201-202 | |
| Thomas Steiner; Christopher Chedeau | |||
| We have developed an application for the automatic generation of media galleries that visually and audibly summarize events based on media items like videos and photos from multiple social networks. Further, we have evaluated different media gallery styles with online surveys and examined their pros and cons. Besides the survey results, our contribution is also the application itself, where media galleries of different styles can be created on-the-fly. A demo is available at http://social-media-illustrator.herokuapp.com/. | |||
| Unsupervised approach to generate informative structured snippets for job search engines | | BIBA | Full-Text | 203-204 | |
| Nikita Spirin; Karrie Karahalios | |||
| Aiming to improve the user experience for a job search engine, in this paper we propose switching from the query-biased snippets used by most web search engines to rich structured snippets associated with the main sections of a job posting page, which are more appropriate for job search due to specific user needs and the structure of job pages. We present a very simple yet actionable approach to generate such snippets in an unsupervised way. The advantages of the proposed approach are two-fold: it doesn't require manual annotation and therefore can be easily deployed to many languages, which is a desirable property for a job search engine operating internationally; and it fuses naturally with the trend towards the Mobile Web, where content needs to be optimized for small-screen devices and informativeness. | |||
| Learning to recommend with multi-faceted trust in social networks | | BIBA | Full-Text | 205-206 | |
| Lei Guo; Jun Ma; Zhumin Chen | |||
| Traditionally, trust-aware recommendation methods that utilize trust relations for recommender systems assume a single type of trust between users. However, this assumption ignores the fact that trust, as a social concept, inherently has many facets. A user may place different kinds of trust in different people. Motivated by this observation, we propose a novel probabilistic factor analysis method, which learns the multi-faceted trust relations and user profiles through a shared user latent feature space. Experimental results on a real product rating data set show that our approach outperforms state-of-the-art methods on the RMSE measure. | |||
| Hidden view game: designing human computation games to update maps and street views | | BIBA | Full-Text | 207-208 | |
| Jongin Lee; John Kim; KwanHong Lee | |||
| Although the Web has abundant information, it does not necessarily contain the latest, most recently updated information. In particular, interactive map websites and the accompanying street view applications often contain information that is a few years old and are somewhat outdated because street views can change quickly. In this work, we propose Hidden View -- a human computation mobile game that enables the updating of maps and street views with the latest information. The preliminary implementation of the game is described and some results collected from a sample user study are presented. This work is the first step towards leveraging human computation and an individual's familiarity with different points-of-interest to keep maps and street views up to date. | |||
| ASQ: interactive web presentations for hybrid MOOCs | | BIBA | Full-Text | 209-210 | |
| Vasileios Triglianos; Cesare Pautasso | |||
| ASQ is a Web application for creating and delivering interactive HTML5 presentations. It is designed to support teachers that need to gather real-time feedback from the students while delivering their lectures. Presentation slides are delivered to viewers that can answer the questions embedded in the slides. The objective is to maximize the efficiency of bi-directional communication between the lecturer and a large audience. More specifically, in the context of a hybrid MOOC classroom, a teacher can use ASQ to get feedback in real time about the level of comprehension of the presented material while reducing the time for gathering survey data, monitoring attendance and assessing solutions. | |||
| QMapper: a tool for SQL optimization on hive using query rewriting | | BIBA | Full-Text | 211-212 | |
| Yingzhong Xu; Songlin Hu | |||
| Although HiveQL offers features similar to SQL, it is still difficult to map complex SQL queries into HiveQL, and manual translation often leads to poor performance. A tool named QMapper is developed to address this problem by utilizing query rewriting rules and cost-based MapReduce flow evaluation on the basis of column statistics. Evaluation demonstrates that, while assuring correctness, QMapper improves performance by up to 42% in terms of execution time. | |||
| Partitioning RDF exploiting workload information | | BIBA | Full-Text | 213-214 | |
| Rebeca Schroeder; Raqueline Penteado; Carmem Satie Hara | |||
| One approach to leverage scalable systems for RDF management is partitioning large datasets across distributed servers. In this paper we consider workload data, given in the form of query patterns and their frequencies, for determining how to partition RDF datasets. Our experimental study shows that our workload-aware method is an effective way to cluster related data and provides better query response times compared to an elementary fragmentation method. | |||
| Correlation discovery in web of things | | BIBA | Full-Text | 215-216 | |
| Lina Yao; Quan Z. Sheng | |||
| With recent advances in radio-frequency identification (RFID), wireless sensor networks, and Web services, the Web of Things (WoT) is gaining considerable momentum as an emerging paradigm in which billions of physical objects will be interconnected and present on the World Wide Web. One inevitable challenge in the new era of WoT lies in how to efficiently and effectively manage things, which is critical for a number of important applications such as object search, recommendation, and composition. In this paper, we propose a novel approach to discover the correlations of things by constructing a relational network of things (RNT), where similar things are linked via virtual edges according to their latent correlations, mined from three dimensions of information in thing-usage events: user, time, and space. With RNT, many problems centered around thing management, such as object classification, discovery and recommendation, can be solved by exploiting graph-based algorithms. We conducted experiments using real-world data collected over a period of four months to verify and evaluate our model, and the results demonstrate the feasibility of our approach. | |||
| The atomic web browser | | BIBA | Full-Text | 217-218 | |
| Cesare Pautasso; Masiar Babazadeh | |||
| The Atomic Web Browser achieves atomicity for distributed transactions across multiple RESTful APIs. Assuming that the participant APIs feature support for the Try-Confirm/Cancel pattern, the user may navigate with the Atomic Web Browser among multiple Web sites to perform local resource state transitions (e.g., reservations or bookings). Once the user indicates that the navigation has successfully completed, the Atomic Web browser takes care of confirming the local transitions to achieve the atomicity of the global transaction. | |||
| XML validation: looking backward -- strongly typed and flexible XML processing are not incompatible | | BIBA | Full-Text | 219-220 | |
| Pierre Geneves; Nabil Layaida | |||
| One major concept in web development using XML is validation: checking
whether some document instance fulfills structural constraints described by
some schema. Over the last few years, there has been a growing debate about XML
validation, and two main schools of thought emerged about the way it should be
done. On the one hand, some advocate the use of validation with respect to
complete grammar-based descriptions such as DTDs and XML Schemas. On the other
hand, motivated by a need for greater flexibility, others argue for no
validation at all, or prefer the use of lightweight constraint languages such
as Schematron with the aim of validating only required constraints, while
making schema descriptions more compositional and more reusable.
Owing to a logical compilation, we show that validators used in each of these approaches share the same theoretical foundations, meaning that the two approaches are far from being incompatible. Our findings include that the logic in [2] can be seen as a unifying formal ground for the construction of robust and efficient validators and static analyzers using any of these schema description techniques. This reconciles the two approaches from both a theoretical and a practical perspective, therefore facilitating any combination of them. | |||
| Co-operative content adaptation framework: satisfying consumer and content creator in resource constrained browsing | | BIBA | Full-Text | 221-222 | |
| Ayush Dubey; Pradipta De; Kuntal Dey; Sumit Mittal; Vikas Agarwal; Malolan Chetlur; Sougata Mukherjea | |||
| The Mobile Web is characterized by two salient features: ubiquitous access to content, and limited resources such as bandwidth and battery. Since most web pages are designed for the wired Internet, it is challenging to adapt the pages seamlessly to ensure a satisfactory mobile web experience. Content-heavy web pages lead to longer load times on mobile browsers. A pre-defined load order of items in a page does not adapt to mobile browsing habits, where the user wants different snippets of a page to load in different contexts. Web content adaptation for the mobile web has mainly relied on the user to define her content preferences. We propose a framework in which the content creator is additionally included in guiding the adaptation. Allowing the content creator to specify the importance of items in a page also helps factor in her incentives by pushing revenue-generating content. We present mechanisms to enable cooperative content adaptation. Preliminary results show the efficacy of cooperative content adaptation in resource-constrained mobile browsing scenarios. | |||
| An effective class-centroid-based dimension reduction method for text classification | | BIBA | Full-Text | 223-224 | |
| Guansong Pang; Huidong Jin; Shengyi Jiang | |||
| Motivated by the effectiveness of centroid-based text classification techniques, we propose a classification-oriented class-centroid-based dimension reduction (DR) method, called CentroidDR. Basically, CentroidDR projects high-dimensional documents into a low-dimensional space spanned by class centroids. On this class-centroid-based space, the centroid-based classifier essentially becomes CentroidDR plus a simple linear classifier. Other classification techniques, such as K-Nearest Neighbor (KNN) classifiers, can be used to replace the simple linear classifier to form much more effective text classification algorithms. Though CentroidDR is simple, non-parametric and runs in linear time, preliminary experimental results show that it can improve the accuracy of the classifiers and perform better than general DR methods such as Latent Semantic Indexing (LSI). | |||
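A minimal sketch of the class-centroid projection described above (hypothetical names; the "simple linear classifier" here is a nearest-centroid rule standing in for whatever the paper uses):

```python
import numpy as np

def centroid_dr_fit(X, y):
    """Compute class centroids from document vectors X (n_docs x n_terms, e.g. TF-IDF)."""
    classes = np.unique(y)
    centroids = np.vstack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids                 # centroids: n_classes x n_terms

def centroid_dr_transform(X, centroids):
    """Project documents into the low-dimensional space spanned by the class centroids."""
    return X @ centroids.T                    # n_docs x n_classes

def centroid_dr_predict(X, classes, centroids):
    """In the reduced space, a simple linear rule (or KNN) assigns the class."""
    Z = centroid_dr_transform(X, centroids)
    return classes[np.argmax(Z, axis=1)]
```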
| Harnessing web page directories for large-scale classification of tweets | | BIBA | Full-Text | 225-226 | |
| Arkaitz Zubiaga; Heng Ji | |||
| Classification is paramount for optimal processing of tweets, although classifier performance is hindered by the need for large training sets that encompass the diversity of content found on Twitter. In this paper, we introduce an inexpensive way of labeling large sets of tweets, which can be easily regenerated or updated when needed. We use human-edited web page directories to infer categories from URLs contained in tweets. By experimenting with a large set of more than 5 million tweets categorized accordingly, we show that our proposed model for tweet classification can achieve 82% accuracy, performing only 12.2% worse than web page classification. | |||
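One way to picture the labeling step, with a hypothetical directory lookup (not the authors' exact pipeline or data):

```python
import re

# Hypothetical mapping from a human-edited web directory: URL prefix -> category.
DIRECTORY = {
    "bbc.co.uk/sport": "Sports",
    "techcrunch.com": "Technology",
    "imdb.com": "Arts",
}

URL_RE = re.compile(r"https?://(?:www\.)?(\S+)")

def label_tweet(tweet_text):
    """Return a category for a tweet by looking up the URL it contains, if any."""
    match = URL_RE.search(tweet_text)
    if not match:
        return None
    url = match.group(1).lower()
    for prefix, category in DIRECTORY.items():
        if url.startswith(prefix):
            return category
    return None

# Tweets labeled this way become (text, category) pairs for training a classifier.
print(label_tweet("Great match today https://www.bbc.co.uk/sport/football"))  # -> "Sports"
```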
| Scalable k-nearest neighbor graph construction based on greedy filtering | | BIBA | Full-Text | 227-228 | |
| Youngki Park; Sungchan Park; Sang-goo Lee; Woosung Jung | |||
| K-Nearest Neighbor Graph (K-NNG) construction is a primitive operation in the field of Information Retrieval and Recommender Systems. However, existing approaches to K-NNG construction do not perform well as the number of nodes or dimensions scales up. In this paper, we present greedy filtering, an efficient and scalable algorithm for selecting the candidates for nearest neighbors by matching only the dimensions of large values. The experimental results show that our K-NNG construction scheme, based on greedy filtering, guarantees a high recall while also being 5 to 6 times faster than state-of-the-art algorithms for large, high-dimensional data. | |||
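A rough sketch of the candidate-selection idea, keeping only each vector's largest dimensions and pairing vectors that share one of them (an illustrative reading of "matching only the dimensions of large values", not the published algorithm):

```python
from collections import defaultdict
import numpy as np

def greedy_filter_candidates(X, prefix_size=3):
    """Select candidate neighbor pairs by matching only the dimensions with the
    largest values in each vector. X: 2-D numpy array (n_vectors x n_dims)."""
    inverted = defaultdict(set)                    # dimension -> ids of vectors keeping it
    for i, v in enumerate(X):
        top_dims = np.argsort(v)[-prefix_size:]    # indices of the largest entries
        for d in top_dims:
            inverted[d].add(i)
    candidates = set()
    for ids in inverted.values():                  # vectors sharing a kept dimension
        ids = sorted(ids)
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                candidates.add((ids[a], ids[b]))
    return candidates   # exact distances are then computed only for these pairs
```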
| Numeric query ranking approach | | BIBA | Full-Text | 229-230 | |
| Jie Wu; Yi Liu; Ji-Rong Wen | |||
| We handle a special category of Web queries: queries containing numeric terms, which we call numeric queries. Motivated by issues in the ranking of numeric queries, we detect numeric-sensitive queries by mining retrieved documents using the phrase operator. We also propose features based on numeric terms by extracting reliable numeric terms for each document. Finally, a ranking model is trained for numeric-sensitive queries, combining the proposed numeric-related features and traditional features. Experiments show that our model can significantly improve relevance for numeric-sensitive queries. | |||
| Collaborative filtering meets next check-in location prediction | | BIBA | Full-Text | 231-232 | |
| Defu Lian; Vincent W. Zheng; Xing Xie | |||
| With the increasing popularity of Location-based Social Networks, a vast amount of location check-ins has been accumulated. Though location prediction in terms of check-ins has recently been studied, the phenomenon that users often check in at novel locations has not been addressed. To this end, in this paper, we leveraged collaborative filtering techniques for check-in location prediction and proposed a short- and long-term preference model. We extensively evaluated it on two large-scale check-in datasets from Gowalla and Dianping, with 6M and 1M check-ins respectively, and showed that the proposed model can outperform the competing baselines. | |||
| TCRec: product recommendation via exploiting social-trust network and product category information | | BIBA | Full-Text | 233-234 | |
| Yu Jiang; Jing Liu; Xi Zhang; Zechao Li; Hanqing Lu | |||
| In this paper, we develop a novel product recommendation method called TCRec, which takes advantage of consumer rating history, the social-trust network and product category information simultaneously. Comparative experiments are conducted on two real-world datasets and outstanding performance is achieved, which demonstrates the effectiveness of TCRec. | |||
| Regional analysis of user interactions on social media in times of disaster | | BIBA | Full-Text | 235-236 | |
| Takeshi Sakaki; Fujio Toriumi; Kosuke Shinoda; Kazuhiro Kazama; Satoshi Kurihara; Itsuki Noda; Yutaka Matsuo | |||
| Social media attract attention for sharing information; Twitter, in particular, is now being used in times of disasters. In this paper, we perform a regional analysis of user interactions on Twitter during the Great East Japan Earthquake and arrive at the following two conclusions: people diffused much more information after the earthquake, especially in the heavily damaged areas; and people communicated with nearby users but diffused information posted by distant users. We conclude that social media users changed their behavior to diffuse information widely. | |||
| Improving consensus clustering of texts using interactive feature selection | | BIBA | Full-Text | 237-238 | |
| Ricardo M. Marcacini; Marcos A. Domingues; Solange O. Rezende | |||
| Consensus clustering and interactive feature selection are very useful methods to extract and manage knowledge from texts. While consensus clustering allows the aggregation of different clustering solutions into a single robust clustering solution, interactive feature selection facilitates the incorporation of users' experience in text clustering tasks through the selection of a set of high-level features. In this paper, we propose an approach to improve the robustness of consensus clustering using interactive feature selection. We report experimental results on real-world datasets that show the effectiveness of our approach. | |||
| Live migration of JavaScript web apps | | BIBA | Full-Text | 241-244 | |
| James Lo; Eric Wohlstadter; Ali Mesbah | |||
| Due to the increasing complexity of web applications and emerging HTML5 standards, a large amount of runtime state is created and managed in the user's browser. While such complexity is desirable for user experience, it makes it hard for developers to implement mechanisms that provide users ubiquitous access to the data they create during application use. This work showcases Imagen, our implemented platform for browser session migration of JavaScript-based web applications. Session migration is the act of transferring a session between browsers at runtime. Without burden to developers, Imagen allows users to create a snapshot image that captures the runtime state needed to resume the session elsewhere. Our approach works completely in the JavaScript layer and we demonstrate that snapshots can be transferred between different browser vendors and hardware devices. The demo will illustrate our system's performance and interoperability using two HTML5 apps, four different browsers and three different devices. | |||
| Automated exploration and analysis of Ajax web applications with WebMole | | BIBA | Full-Text | 245-248 | |
| Gabriel Le Breton; Fabien Maronnaud; Sylvain Hallé | |||
| WebMole is a browser-based tool that automatically and exhaustively explores all pages inside a web application. Unlike classical web crawlers, which only explore pages accessible through regular anchors, WebMole can find its way through Ajax applications that use JavaScript-triggered links, and handles state changes that do not involve a page reload. User-defined functions called oracles can be used to restrict the range of pages explored by WebMole to specific parts of an application, as well as to evaluate Boolean test conditions on all visited pages. Overall, WebMole can prove a more flexible alternative to automated testing suites such as Selenium WebDriver. | |||
| Analyzing the suitability of web applications for a single-user to multi-user transformation | | BIBA | Full-Text | 249-252 | |
| Matthias Heinrich; Franz Lehmann; Franz Josef Grüneberger; Thomas Springer; Martin Gaedke | |||
| Multi-user web applications like Google Docs or Etherpad are crucial to efficiently support collaborative work (e.g. jointly create texts, graphics, or presentations). Nevertheless, enhancing single-user web applications with multi-user capabilities (i.e. document synchronization and conflict resolution) is a time-consuming and intricate task since traditional approaches adopting concurrency control libraries (e.g. Apache Wave) require numerous scattered source code changes. Therefore, we devised the Generic Collaboration Infrastructure (GCI) [8] that is capable of converting single-user web applications non-invasively into collaborative ones, i.e. no source code changes are required. In this paper, we present a catalog of vital application properties that allows determining if a web application is suitable for a GCI transformation. On the basis of the introduced catalog, we analyze 12 single-user web applications and show that 6 are eligible for a GCI transformation. Moreover, we demonstrate (1) the transformation of one qualified application, namely, the prominent text editor TinyMCE, and (2) showcase the resulting multi-user capabilities. Both demo parts are illustrated in a dedicated screencast that is available at http://vsr.informatik.tu-chemnitz.de/demo/TinyMCE/. | |||
| Crowdsourcing MapReduce: JSMapReduce | | BIBA | Full-Text | 253-256 | |
| Philipp Langhans; Christoph Wieser; François Bry | |||
| JSMapReduce is an implementation of MapReduce which exploits the computing power available in the computers of the users of a web platform by giving tasks to the JavaScript engines of their web browsers. This article describes the implementation of JSMapReduce exploiting HTML 5 features, the heuristics it uses for distributing tasks to workers, and reports on an experimental evaluation of JSMapReduce. | |||
| Large-scale social-media analytics on stratosphere | | BIBA | Full-Text | 257-260 | |
| Christoph Boden; Marcel Karnstedt; Miriam Fernandez; Volker Markl | |||
| The importance of social-media platforms and online communities -- in business as well as public context -- is more and more acknowledged and appreciated by industry and researchers alike. Consequently, a wide range of analytics has been proposed to understand, steer, and exploit the mechanics and laws driving their functionality and creating the resulting benefits. However, analysts usually face significant problems in scaling existing and novel approaches to match the data volume and size of modern online communities. In this work, we propose and demonstrate the usage of the massively parallel data processing system Stratosphere, based on second order functions as an extended notion of the MapReduce paradigm, to provide a new level of scalability to such social-media analytics. Based on the popular example of role analysis, we present and illustrate how this massively parallel approach can be leveraged to scale out complex data-mining tasks, while providing a programming approach that eases the formulation of complete analytical workflows. | |||
| Optimizing RDF(S) queries on cloud platforms | | BIBA | Full-Text | 261-264 | |
| HyeongSik Kim; Padmashree Ravindra; Kemafor Anyanwu | |||
| Scalable processing of Semantic Web queries has become a critical need given
the rapid upward trend in availability of Semantic Web data. The MapReduce
paradigm is emerging as a platform of choice for large scale data processing
and analytics due to its ease of use, cost effectiveness, and potential for
unlimited scaling. Processing queries on Semantic Web triple models is a
challenge on the mainstream MapReduce platform called Apache Hadoop, and its
extensions such as Pig and Hive. This is because such queries require numerous
joins, which lead to lengthy and expensive MapReduce workflows. Further, in
this paradigm, cloud resources are acquired on demand and the traditional join
optimization machinery, such as statistics and indexes, is often absent or not
easily supported.
In this demonstration, we will present RAPID+, an extended Apache Pig system that uses an algebraic approach for optimizing queries on RDF data models including queries involving inferencing. The basic idea is that by using logical and physical operators that are more natural to MapReduce processing, we can reinterpret such queries in a way that leads to more concise execution workflows and small intermediate data footprints that minimize disk I/Os and network transfer overhead. RAPID+ evaluates queries using the Nested TripleGroup Data Model and Algebra (NTGA). The demo will show comparative performance of NTGA query plans vs. relational algebra-like query plans used by Apache Pig and Hive. | |||
| TagVisor: extending web pages with interaction events to support presentation in digital signage | | BIBA | Full-Text | 265-268 | |
| Marcio dos Santos Galli; Eduardo Pezutti Beletato Santos | |||
| New interaction experiences are fundamentally changing the way we interact with the web. Emerging touch-based devices and a variety of web-connected appliances represent challenges that prevent the seamless reach of web resources originally tailored for the standard browser experience. This paper explores how web pages can be re-purposed to become interactive presentations that effectively support communication in scenarios such as digital signage and other presentation use cases. We cover the TagVisor project, a JavaScript run-time that uses modern animation effects and provides an HTML5 extension approach to support the authoring of visual narratives using plain web pages. | |||
| Complementary assistance mechanisms for end user mashup composition | | BIBA | Full-Text | 269-272 | |
| Soudip Roy Chowdhury; Olexiy Chudnovskyy; Matthias Niederhausen; Stefan Pietschmann; Paul Sharples; Florian Daniel; Martin Gaedke | |||
| Despite several efforts to simplify the composition process, the learning effort required to develop mashups with existing mashup editors remains high. In this paper, we describe how this barrier can be lowered by means of an assisted development approach that seamlessly integrates automatic composition and interactive pattern recommendation techniques into existing mashup platforms, supporting easy mashup development by end users. We showcase the use of such an assisted development environment in the context of the open-source mashup platform Apache Rave. Results of our user studies demonstrate the benefits of our approach for end-user mashup development. | |||
| uTrack: track yourself! monitoring information on online social media | | BIBA | Full-Text | 273-276 | |
| Tiago Rodrigues; Prateek Dewan; Ponnurangam Kumaraguru; Raquel Melo Minardi; Virgílio Almeida | |||
| The past decade has witnessed an astounding outburst in the number of online social media (OSM) services, and many of these services have enthralled millions of users across the globe. With such a tremendous number of users, the amount of content being generated and shared on OSM services is also enormous. As a result, trying to visualize this overwhelming amount of content and gain useful insights from it has become a challenge. In this work, we present uTrack, a personalized web service to analyze and visualize the diffusion of content shared by users across multiple OSM platforms. To the best of our knowledge, no existing work concentrates on monitoring information diffusion for personal accounts. Currently, uTrack supports logging in from, and monitoring, Facebook, Twitter, and Google+. Once granted permissions by the user, uTrack monitors all URLs (such as videos, photos, and news articles) the user has shared on the supported OSM services, and generates useful visualizations and statistics from the collected data. | |||
| DFT-extractor: a system to extract domain-specific faceted taxonomies from wikipedia | | BIBA | Full-Text | 277-280 | |
| Bifan Wei; Jun Liu; Jian Ma; Qinghua Zheng; Wei Zhang; Boqin Feng | |||
| Extracting faceted taxonomies from the Web has received increasing attention in recent years from the web mining community. We demonstrate in this study a novel system called DFT-Extractor, which automatically constructs domain-specific faceted taxonomies from Wikipedia in three steps: 1) It crawls domain terms from Wikipedia by using a modified topical crawler. 2) Then it exploits a classification model to extract hyponym relations with the use of motif-based features. 3) Finally, it constructs a faceted taxonomy by applying a community detection algorithm and a group of heuristic rules. DFT-Extractor also provides a graphical user interface to visualize the learned hyponym relations and the tree structure of taxonomies. | |||
| Temporal summarization of event-related updates in wikipedia | | BIBA | Full-Text | 281-284 | |
| Mihai Georgescu; Dang Duc Pham; Nattiya Kanhabua; Sergej Zerr; Stefan Siersdorfer; Wolfgang Nejdl | |||
| Wikipedia is a free multilingual online encyclopedia covering a wide range of general and specific knowledge. Its content is continuously kept up-to-date and extended by a supporting community. In many cases, real-world events influence the collaborative editing of the Wikipedia articles of the involved or affected entities. In this paper, we present Wikipedia Event Reporter, a web-based system that supports entity-centric, temporal analytics of event-related information in Wikipedia by analyzing the whole history of article updates. For a given entity, the system first identifies peaks of update activity using burst detection and automatically extracts event-related updates using a machine-learning approach. Further, the system determines distinct events by clustering updates, exploiting different types of information such as update time, textual similarity, and the position of the updates within an article. Finally, the system generates a meaningful temporal summarization of event-related updates and automatically annotates the identified events in a timeline. | |||
| Live topic generation from event streams | | BIBA | Full-Text | 285-288 | |
| Vuk Milicic; Giuseppe Rizzo; José Luis Redondo Garcia; Raphaël Troncy; Thomas Steiner | |||
| Social platforms constantly record streams of heterogeneous data about humans' activities, feelings, emotions and conversations, opening a window onto the world in real time. Trends can be computed, but making sense of them is an extremely challenging task due to the heterogeneity of the data and its dynamics, which often make trends short-lived phenomena. We develop a framework which collects microposts shared on social platforms that contain media items as a result of a query, for example a trending event. It automatically creates different visual storyboards that reflect what users have shared about this particular event. More precisely, it leverages: (i) visual features from media items for near-duplicate detection, and (ii) textual features from status updates to interpret, cluster, and visualize media items. A screencast showing an example of these functionalities is published at http://youtu.be/8iRiwz7cDYY, while the prototype is publicly available at http://mediafinder.eurecom.fr. | |||
| Serefind: a social networking website for classifieds | | BIBA | Full-Text | 289-292 | |
| Pramod Verma | |||
| This paper presents the design and implementation of a social networking website for classifieds, called Serefind. We designed search interfaces with a focus on security, privacy, usability, design, ranking, and communication. We deployed this site at the Johns Hopkins University, and the results show it can be used as a self-sustaining classifieds site for public or private communities. | |||
| MASFA: mass-collaborative faceted search for online communities | | BIBA | Full-Text | 293-296 | |
| Seth B. Cleveland; Byron J. Gao | |||
| Faceted search combines faceted navigation with direct keyword search, providing exploratory search capabilities that allow progressive query refinement. It has become the de facto standard for e-commerce and product-related websites such as amazon.com and ebay.com. However, faceted search has not been effectively incorporated into non-commercial online community portals such as craigslist.org. This is mainly because, unlike keyword search, faceted search systems require metadata that constantly evolve, making them very costly to build and maintain. In this paper, we propose a framework, MASFA, that utilizes a set of non-domain-specific techniques to build and maintain effective, portable, and cost-free faceted search systems in a mass-collaborative manner. We have implemented and deployed the framework on selected categories of Craigslist to demonstrate its utility. | |||
| ALFRED: crowd assisted data extraction | | BIBA | Full-Text | 297-300 | |
| Valter Crescenzi; Paolo Merialdo; Disheng Qiu | |||
| The development of solutions to scale the extraction of data from Web sources is still a challenging issue. High accuracy can be achieved by supervised approaches, but the costs of training data, i.e., annotations over a set of sample pages, limit their scalability. Crowdsourcing platforms are making the manual annotation process more affordable. However, the tasks submitted to these platforms should be extremely simple, so they can be performed by non-expert people, and their number should be minimized to contain the costs. We demonstrate ALFRED, a wrapper inference system supervised by the workers of a crowdsourcing platform. Training data are labeled values generated by means of membership queries, the simplest form of queries, posed to the crowd. ALFRED includes several original features: it automatically selects a representative sample set from the input collection of pages; in order to minimize the wrapper inference costs, it dynamically sets the expressiveness of the wrapper formalism and adopts an active learning algorithm to select the queries posed to the crowd; and it is able to manage inaccurate answers provided by the workers engaged through crowdsourcing platforms. | |||
| SHERLOCK: a system for location-based services in wireless environments using semantics | | BIBA | Full-Text | 301-304 | |
| Roberto Yus; Eduardo Mena; Sergio Ilarri; Arantza Illarramendi | |||
| Nowadays people are exposed to huge amounts of information that are generated continuously. However, current mobile applications, Web pages, and Location-Based Services (LBSs) are designed for specific scenarios and goals. In this demo we show the system SHERLOCK, which searches and shares up-to-date knowledge from nearby devices to relieve the user from knowing and managing such knowledge directly. Besides, the system guides the user in the process of selecting the service that best fits his/her needs in the given context. | |||
| Tailored news in the palm of your hand: a multi-perspective transparent approach to news recommendation | | BIBA | Full-Text | 305-308 | |
| Mozhgan Tavakolifard; Jon Atle Gulla; Kevin C. Almeroth; Jon Espen Ingvaldesn; Gaute Nygreen; Erik Berg | |||
| Mobile news recommender systems help users retrieve news that is relevant in their particular context and can be presented in ways that require minimal user interaction. In spite of the availability of contextual information about mobile users, though, current mobile news applications employ rather simple strategies for news recommendation. Our multi-perspective approach unifies temporal, locational, and preferential information to provide a more fine-grained recommendation strategy. This demo paper presents the implementation of our solution to efficiently recommend specific news articles from a large corpus of newly-published press releases in a way that closely matches a reader's reading preferences. | |||
| Connected media experiences: web based interactive video using linked data | | BIBA | Full-Text | 309-312 | |
| Lyndon Nixon; Matthias Bauer; Cristian Bara | |||
| This demo submission presents a set of tools and an extended framework with an API for enabling the semantically empowered enrichment of online video with Web content. As audiovisual media is increasingly transmitted online, new services deriving added value from such material can be imagined, for example by combining it with other material elsewhere on the Web which is related to it or enhances it in a meaningful way, to the benefit of the owner of the original content, the providers of the content enhancing it, and the end consumer who can access and interact with these new services. Since the services are built around providing new experiences through connecting different related media together, we consider such services to be Connected Media Experiences (ConnectME). This paper presents a toolset for ConnectME -- an online annotation tool for video and an HTML5-based enriched video player -- as well as the ConnectME framework, which enables these media experiences to be generated on the server side with semantic technology. | |||
| Radialize: a tool for social listening experience on the web based on radio station programs | | BIBA | Full-Text | 313-316 | |
| Álvaro R., Jr. Pereira; Diego Dutra; Milton, Jr. Stiilpen; Alex Amorim Dutra; Felipe Martins Melo; Paulo H. C. Mendonça; Ângelo Magno de Jesus; Kledilson Ferreira | |||
| Radialize is a service for listening to music and radio programs through the web. The service allows discovery of the content being played by radio stations on the web, either by managing explicit information made available by those stations or by means of our technology for automatic recognition of audio content in a stream. Radialize then offers a service in which the user can search, receive recommendations, and provide feedback on artists and songs being played on traditional radio stations, either explicitly or implicitly, in order to compose an individual profile. The recommender system utilizes every user interaction as a data source, as well as the similarity abstraction extracted from the radios' musical programs, making use of the wisdom of crowds implicitly present in the radio programs. | |||
| FANS: face annotation by searching large-scale web facial images | | BIBA | Full-Text | 317-320 | |
| Steven C. H. Hoi; Dayong Wang; I. Yeu Cheng; Elmer Weijie Lin; Jianke Zhu; Ying He; Chunyan Miao | |||
| Auto face annotation is an important technique for many real-world applications, such as online photo album management, news video summarization, and so on. It aims to automatically detect human faces in a photo image and further name the faces with the corresponding human names. Recently, mining web facial images on the internet has emerged as a promising paradigm towards auto face annotation. In this paper, we present a demonstration system for search-based face annotation: FANS -- Face ANnotation by Searching large-scale web facial images. Given a query facial image for annotation, we first retrieve a short list of the most similar facial images from a web facial image database, and then annotate the query facial image by mining the top-ranking facial images and their corresponding labels with sparse representation techniques. Our demo system was built upon a large-scale real-world web facial image database with a total of 6,025 persons and about 1 million facial images. This paper demonstrates the potential of searching and mining web-scale weakly labeled facial images on the internet to tackle the challenging face annotation problem, and discusses some open problems for future exploration by researchers in the web community. The live demo of FANS is available online at http://msm.cais.ntu.edu.sg/FANS/. | |||
| Search the past with the Portuguese web archive | | BIBA | Full-Text | 321-324 | |
| Daniel Gomes; David Cruz; João Miranda; Miguel Costa; Simão Fontes | |||
| The web was invented to quickly exchange data between scientists, but it
became a crucial communication tool to connect the world. However, the web is
extremely ephemeral. Most of the information published online becomes quickly
unavailable and is lost forever. There are several initiatives worldwide that
struggle to archive information from the web before it vanishes. However,
search mechanisms to access this information are still limited and do not
satisfy their users who demand performance similar to live-web search engines.
This demo presents the Portuguese Web Archive, which enables search over 1.2 billion files archived from 1996 to 2012. It is the largest full-text searchable web archive publicly available [17]. The software developed to support this service is also publicly available as a free open source project at Google Code, so that it can be reused and enhanced by other web archivists. A short video about the Portuguese Web Archive is available at vimeo.com/59507267. The service can be tried live at archive.pt. | |||
| Inside YOGO2s: a transparent information extraction architecture | | BIBA | Full-Text | 325-328 | |
| Joanna Biega; Erdal Kuzey; Fabian M. Suchanek | |||
| YAGO [9, 6] is one of the largest public ontologies constructed by information extraction. In a recent refactoring called YAGO2s, the system has been given a modular and completely transparent architecture. In this demo, users can see how more than 30 individual modules of YAGO work in parallel to extract facts, to check facts for their correctness, to deduce facts, and to merge facts from different sources. A GUI allows users to play with different input files, to trace the provenance of individual facts to their sources, to change deduction rules, and to run individual extractors. Users can see step by step how the extractors work together to combine the individual facts to the coherent whole of the YAGO ontology. | |||
| SPARQL2NL: verbalizing sparql queries | | BIBA | Full-Text | 329-332 | |
| Axel-Cyrille Ngonga Ngomo; Lorenz Bühmann; Christina Unger; Jens Lehmann; Daniel Gerber | |||
| Linked Data technologies are now being employed by a large number of
applications. While experts can query the backend of these applications using
the standard query language SPARQL, most lay users lack the expertise necessary
to proficiently interact with these applications. Consequently, non-expert
users usually have to rely on forms, query builders, question answering or
keyword search tools to access RDF data. Yet, these tools are usually unable to
make the meaning of the queries they generate plain to lay users, making it
difficult for these users to i) assess the correctness of the query generated
from their input, ii) adapt their queries, or iii) choose in an informed
manner between possible interpretations of their input.
We present SPARQL2NL, a generic approach that allows verbalizing SPARQL queries, i.e., converting them into natural language. In addition to generating verbalizations, our approach can also explain the output of queries by providing a natural-language description of the reasons that led to each element of the result set being selected. Our evaluation of SPARQL2NL within a large-scale user survey shows that SPARQL2NL generates complete and easily understandable natural language descriptions. In addition, our results suggest that even SPARQL experts can process the natural language representation of SPARQL queries computed by our approach more efficiently than the corresponding SPARQL queries. Moreover, non-experts are enabled to reliably understand the content of SPARQL queries. Within the demo, we present the results generated by our approach on arbitrary questions to the DBpedia and MusicBrainz datasets. Moreover, we present how our framework can be used to explain results of SPARQL queries in natural language. | |||
| G-path: flexible path pattern query on large graphs | | BIBA | Full-Text | 333-336 | |
| Yiyuan Bai; Chaokun Wang; Yuanchi Ning; Hanzhao Wu; Hao Wang | |||
| With the socialization trend of web sites and applications, techniques for the effective management of graph-structured data have become one of the most important modern web technologies. In this paper, we present a system for path queries on large graphs, known as G-Path. Based on the Hadoop distributed framework and the bulk synchronous parallel model, the system can process generic queries without preprocessing or building indices. To demonstrate the system, we developed a web-based application which allows searching entities and relationships on a large social network, e.g., the DBLP publication network or a Twitter dataset. With the flexibility of G-Path, the application is able to handle different kinds of queries. For example, a user may want to search for the publication graph of an author, while another user may want to search for all publications of the author's co-authors. All these queries can be issued through an interactive user interface and the results are shown as a visual graph. | |||
| Mockup driven web development | | BIBA | Full-Text | 337-342 | |
| Edward Benson | |||
| Dynamic web development still borrows heavily from its origins in CGI
scripts: modern web applications are largely designed and developed as programs
that happen to output HTML. This thesis proposes to investigate taking
a mockup-centric approach instead, in which self-contained, full-page web
mockups are the central artifact driving the application development process.
In some cases, these mockups are sufficient to infer the dynamic application
structure completely.
This approach to mockup driven development is made possible by a language the thesis develops, called Cascading Tree Sheets (CTS), that enables a mockup to be annotated with enough information that many common web development tasks and workflows can be eliminated or vastly simplified. CTS describes and encapsulates a web page's design structure the same way CSS describes its styles. This enables mockups to serve as the input of a web application rather than simply a design artifact. Using this capability, I will study the feasibility and usability of mockup driven development for a range of novice and expert authorship tasks. The thesis aims to finish by demonstrating that the functionality of a domain-specific content management system can be inferred automatically from site mockups. | |||
| Structured summarization for news events | | BIBA | Full-Text | 343-348 | |
| Giang Binh Tran | |||
| Helping users understand the news is an acute problem nowadays, as users struggle to keep up with the tremendous amount of information published every day on the Internet. In this research, we focus on modelling the content of news events by their semantic relations with other events, and on generating structured summaries. | |||
| Multimedia information retrieval on the social web | | BIBA | Full-Text | 349-354 | |
| Teresa Bracamonte | |||
| Efforts have been made to obtain more accurate results for multimedia searches on the Web. Nevertheless, not all multimedia objects have related text descriptions available. This makes bridging the semantic gap more difficult. Approaches that combine context and content information of multimedia objects are the most popular for indexing and later retrieving these objects. However, scaling these techniques to Web environments is still an open problem. In this thesis, we propose the use of user-generated content (UGC) from the Web and social platforms as well as multimedia content information to describe the context of multimedia objects. We aim to design tag-oriented algorithms to automatically tag multimedia objects, filter irrelevant tags, and cluster tags in semantically-related groups. The novelty of our proposal is centered on the design of Web-scalable algorithms that enrich multimedia context using the social information provided by users as a result of their interaction with multimedia objects. We validate the results of our proposal with a large-scale evaluation in crowdsourcing platforms. | |||
| Effective analysis, characterization, and detection of malicious web pages | | BIBA | Full-Text | 355-360 | |
| Birhanu Eshete | |||
| The steady evolution of the Web has paved the way for miscreants to take advantage of vulnerabilities to embed malicious content into web pages. Upon a visit, malicious web pages steal sensitive data, redirect victims to other malicious targets, or seize control of the victim's system to mount future attacks. Approaches to detect malicious web pages have been reactively effective against specific classes of attacks such as drive-by-downloads. However, the prevalence and complexity of attacks by malicious web pages is still worrisome. The main challenges in this problem domain are (1) fine-grained capturing and characterization of attack payloads, (2) evolution of web page artifacts, and (3) flexibility and scalability of detection techniques in a fast-changing threat landscape. To this end, we propose a holistic approach that leverages static analysis, dynamic analysis, machine learning, and evolutionary searching and optimization to effectively analyze and detect malicious web pages. We do so by introducing novel features to capture a fine-grained snapshot of malicious web pages, holistically characterizing malicious web pages, and applying evolutionary techniques to fine-tune learning-based detection models to the evolution of attack payloads. In this paper, we present the key intuition and details of our approach, results obtained so far, and future work. | |||
| Identifying, understanding and detecting recurring, harmful behavior patterns in collaborative Wikipedia editing: doctoral proposal | | BIBA | Full-Text | 361-366 | |
| Fabian Flöck | |||
| In this doctoral proposal, we describe an approach to identify recurring, collective behavioral mechanisms in the collaborative interactions of Wikipedia editors that have the potential to undermine the ideals of quality, neutrality and completeness of article content. We outline how we plan to parametrize these patterns in order to understand their emergence and evolution and measure their effective impact on content production in Wikipedia. On top of these results we intend to build end-user tools to increase the transparency of the evolution of articles and equip editors with more elaborated quality monitors. We also sketch out our evaluation plans and report on already accomplished tasks. | |||
| Ontology based feature level opinion mining for Portuguese reviews | | BIBA | Full-Text | 367-370 | |
| Larissa A. Freitas; Renata Vieira | |||
| This paper presents a thesis whose goal is to propose and evaluate methods to identify polarity in Portuguese user-generated reviews according to features described in domain ontologies (experiments will consider the movie and hotel ontologies Movie Ontology and Hontology). | |||
| A machine-to-machine architecture to merge semantic sensor measurements | | BIBA | Full-Text | 371-376 | |
| Amelie Gyrard | |||
| The emerging Machine-to-Machine (M2M) field enables machines to communicate with each other without human intervention. Existing semantic sensor networks are domain-specific and add semantics to the context. We design a Machine-to-Machine (M2M) architecture to merge heterogeneous sensor networks, and we propose to add semantics to the measured data rather than to the context. This architecture makes it possible to: (1) collect sensor measurements, (2) enrich sensor measurements with semantic web technologies, domain ontologies and the Linked Open Data, and (3) reason on these semantic measurements with semantic tools, machine learning algorithms and recommender systems to provide promising applications. | |||
| Deep web entity monitoring | | BIB | Full-Text | 377-382 | |
| Mohammadreza Khelghati; Djoerd Hiemstra; Maurice Van Keulen | |||
| Context mining and integration into predictive web analytics | | BIBA | Full-Text | 383-388 | |
| Julia Kiseleva | |||
| Predictive Web Analytics is aimed at understanding the behavioural patterns of users of various web-based applications: e-commerce, ubiquitous and mobile computing, and computational advertising. Within these applications, business decisions often rely on two types of predictions: demand predictions for the overall population or for particular user segments, and individualised recommendations for visitors. Visitor behaviour is inherently sensitive to context, which can be defined as a collection of external factors. Context-awareness allows integrating external explanatory information into the learning process and adapting user behaviour accordingly. The importance of context-awareness has been recognised by researchers and practitioners in many disciplines, including recommender systems, information retrieval, personalisation, data mining, and marketing. We focus on studying ways of discovering context and integrating it into predictive analytics. | |||
| A proximity-based fallback model for hybrid web recommender systems | | BIBA | Full-Text | 389-394 | |
| Jaeseok Myung | |||
| Although there are numerous websites that provide recommendation services
for various items such as movies, music, and books, most studies on
recommender systems only focus on one specific item type. As recommender sites
expand to cover several types of items, though, it is important to build a
hybrid web recommender system that can handle multiple types of items.
The switch hybrid recommender model provides a solution to this problem by choosing an appropriate recommender system according to given selection criteria, thereby facilitating cross-domain recommendations supported by individual recommender systems. This paper seeks to answer the question of how to deal with situations where no appropriate recommender system exists for a required type of item. In such cases, the switch model cannot generate recommendation results, leading to the need for a fallback model that can satisfy most users most of the time. Our fallback model exploits a graph-based proximity search, ranking every entity on the graph according to a given proximity measure. We study how to incorporate the fallback model into the switch model, and propose a general architecture and simple algorithms for implementing these ideas. Finally, we present our research results and discuss remaining challenges and possibilities for future research. | |||
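The abstract leaves the proximity measure open; personalized PageRank is one common graph-proximity choice, sketched here over a simple adjacency-list graph (hypothetical structure, not the paper's algorithm):

```python
def personalized_pagerank(graph, seeds, alpha=0.15, iters=50):
    """Rank every entity in an item/user graph by proximity to the seed nodes.
    graph: dict node -> list of neighbor nodes (every neighbor must also be a key);
    seeds: nodes the user has interacted with. Dangling-node mass is simply dropped,
    which is fine for an illustrative sketch."""
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}   # restart mass toward the seeds
        for n in nodes:
            if graph[n]:
                share = (1 - alpha) * rank[n] / len(graph[n])
                for m in graph[n]:
                    nxt[m] += share                    # spread the rest along edges
        rank = nxt
    return sorted(rank.items(), key=lambda kv: -kv[1])  # highest proximity first

# Usage sketch: rank all entities around two items the user already liked.
graph = {"u1": ["movieA"], "movieA": ["u1", "u2"], "u2": ["movieA", "bookB"], "bookB": ["u2"]}
print(personalized_pagerank(graph, seeds={"u1", "movieA"}))
```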
| Analyzing linguistic structure of web search queries | | BIBA | Full-Text | 395-400 | |
| Rishiraj Saha Roy | |||
| It is believed that Web search queries are becoming more structurally complex over time. However, there has been no systematic study that quantifies such characteristics. In this thesis, we propose that queries are evolving into a unique linguistic system. We support this hypothesis by examining the structure of Web queries with well-established techniques from natural language understanding. Preliminary results of these experiments provide quantitative and qualitative evidence that queries are not just some form of text between random sequences of words and natural language -- they have distinct properties of their own. | |||
| Understanding and analysing microblogs | | BIBA | Full-Text | 401-406 | |
| Pinar Yanardag Delul | |||
| Microblogging is a form of blogging where posts typically consist of short
content such as quick comments, phrases, URLs, or media, like images and
videos. Because of the fast and compact nature of microblogs, users have
adopted them for novel purposes, including sharing personal updates, spreading
breaking news, promoting political views, marketing and tracking real time
events. Thus, finding relevant information sources out of the rapidly growing
content is an essential task.
In this paper, we study the problem of understanding and analysing microblogs. We present a novel 2-stage framework to find potentially relevant content by extracting topics from the tweets and by taking advantage of submodularity. | |||
| LILE2013 Welcome and organization | |||
| Linking data in and outside a scientific publishing house | | BIBA | Full-Text | 411-412 | |
| Sweitze Roffel | |||
| Publishing has undergone many changes since the 1960s, often driven by
rapid technological development. Technology impacts the creation and
dissemination of knowledge only to a certain extent, and in this talk I'll try
to give a publisher's perspective on some of the technological drivers impacting
academic publishing today, and on how the many actors involved are learning to
cooperate as well as compete in an increasingly distributed environment to
better turn information into knowledge, technically, organizationally, and with
regard to shared standards and infrastructure.
Publishing has been called many different things by many different people. A simple definition could be that publishing is 'organizing content', so the focus of this talk will be on Elsevier's current use of Linked Data & Semantic technology in organizing scientific content, including some early lessons learned. This view from a publisher aims to help the discussion on how we can all contribute to better disseminate and promote the enormous creativity embodied in core research contributions. | |||
| Exploring student predictive model that relies on institutional databases and open data instead of traditional questionnaires | | BIBA | Full-Text | 413-418 | |
| Farhana Sarker; Thanassis Tiropanis; Hugh C. Davis | |||
| Research in student retention and progression to completion is traditionally survey-based, where researchers collect data through questionnaires and by interviewing students. The major issues with survey-based studies are potentially low response rates and cost. Nevertheless, much of the data that could inform the questions students are explicitly asked in surveys is commonly available in external open datasets. This paper describes a new predictive model for student progression that relies on data available in institutional internal databases and external open data, without the need for surveys. The results of an empirical study of undergraduate students in their first year of study show that this model can perform as well as or even out-perform traditional survey-based ones. | |||
| Towards integration of web data into a coherent educational data graph | | BIBA | Full-Text | 419-424 | |
| Davide Taibi; Besnik Fetahu; Stefan Dietze | |||
| Personalisation, adaptation and recommendation are central aims of Technology Enhanced Learning (TEL) environments. In this context, information retrieval and clustering techniques are increasingly applied to filter and deliver learning resources according to user preferences and requirements. However, the suitability and scope of possible recommendations is fundamentally dependent on the available data, such as metadata about learning resources as well as users, and the quantity and quality of both are still limited. On the other hand, throughout the last years, the Linked Data (LD) movement has succeeded in providing a vast body of well-interlinked and publicly accessible Web data, which in particular includes Linked Data of explicit or implicit educational nature. In this paper, we propose a large-scale educational dataset which has been generated by exploiting Linked Data methods together with clustering and interlinking techniques to extract, import and interlink a wide range of educationally relevant data. We also introduce a set of reusable techniques which were developed to realise scalable integration and alignment of Web data in educational settings. | |||
| Finding relevant missing references in learning courses | | BIBA | Full-Text | 425-430 | |
| Patrick Siehndel; Ricardo Kawase; Asmelash Teka Hadgu; Eelco Herder | |||
| Reference sites play an increasingly important role in learning processes. Teachers use these sites in order to identify topics that should be covered by a course or a lecture. Learners visit online encyclopedias and dictionaries to find alternative explanations of concepts, to learn more about a topic, or to better understand the context of a concept. Ideally, a course or lecture should cover all key concepts of the topic that it encompasses, but often time constraints prevent complete coverage. In this paper, we propose an approach to identify missing references and key concepts in a corpus of educational lectures. For this purpose, we link concepts in educational material to the organizational and linking structure of Wikipedia. Identifying missing resources enables learners to improve their understanding of a topic, and allows teachers to investigate whether their learning material covers all necessary concepts. | |||
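A simplified sketch of the coverage check described above (the actual approach links lecture material to Wikipedia's article and link structure; the concept sets below are invented): the topic's key concepts not yet covered by the course are reported as candidate missing references.

```python
# A simplified sketch (toy concept sets): report the topic's key concepts that
# the course material does not yet cover, as candidate missing references.
course_concepts = {"linked data", "rdf", "sparql", "ontology"}            # extracted from the lectures (toy)
topic_key_concepts = {"rdf", "sparql", "ontology", "owl", "triplestore"}  # e.g. derived from Wikipedia links (toy)

missing = sorted(topic_key_concepts - course_concepts)
print("candidate missing references:", missing)
```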
| Interactive learning resources and linked data for online scientific experimentation | | BIBA | Full-Text | 431-434 | |
| Alexander Mikroyannidis; John Domingue | |||
| There is currently huge potential for eLearning in several new online learning initiatives such as Massive Open Online Courses (MOOCs) and Open Educational Resources (OERs). These initiatives enable learners to self-regulate their learning by providing them with an abundance of free, high-quality learning materials. This paper presents FORGE, a new European initiative for online learning using Future Internet Research and Experimentation (FIRE) facilities. FORGE is a step towards turning FIRE into a pan-European educational platform for the Future Internet through Linked Data. This will benefit learners and educators by giving both of them access to world-class facilities for carrying out experiments on, e.g., new Internet protocols. In turn, this supports constructivist and self-regulated learning approaches through the use of interactive learning resources, such as eBooks. | |||
| Learning from quizzes using intelligent learning companions | | BIBA | Full-Text | 435-438 | |
| Danica Damljanovic; David Miller; Daniel O'Sullivan | |||
| It is widely recognised that engaging games can have a profound impact on learning. Integrating a conversational Artificial Intelligence (AI) into the mix makes the experience of learning even more engaging and enriching. In this paper we describe a conversational agent which is built with the purpose of acting as a personal tutor. The tutor can prompt, question, stimulate and guide a learner and then adapt exercises and challenges to specific needs. We illustrate how automatic generation of quizzes can be used to build learning exercises and activities. | |||
| Linked data selectors | | BIBA | Full-Text | 439-444 | |
| Kai Michael Höver; Max Mühlhäuser | |||
| In the world of Linked Data, HTTP URIs are names. A URI is dereferenced to obtain a copy or description of the referred resource. If only a fragment of a resource is to be referred to, pointing to the whole resource is not sufficient. It is therefore necessary to be able to refer to fragments of resources, and to name them with URIs so that they can be interlinked in the Web of Data. This is especially helpful in the educational context, where learning processes including discussion and social interaction demand exact references to, and granular selections of, media. This paper presents the specification of Linked Data Selectors, an OWL ontology for describing dereferenceable fragments of Web resources. | |||
| OpenScout: harvesting business and management learning objects from the web of data | | BIBA | Full-Text | 445-450 | |
| Ricardo Kawase; Marco Fisichella; Katja Niemann; Vassilis Pitsilis; Aristides Vidalis; Philipp Holtkamp; Bernardo Nunes | |||
| Already existing open educational resources in the field of Business and Management have a high potential for enterprises to address the increasing training needs of their employees. However, it is difficult to act on OERs as some data is hidden. Meanwhile, numerous repositories provide Linked Open Data in this field, yet users have to search a number of repositories with heterogeneous interfaces in order to retrieve the desired content. In this paper, we present strategies to gather heterogeneous learning objects from the Web of Data, and we provide an overview of the benefits of the OpenScout platform. Although not all data repositories strictly follow Linked Data principles, OpenScout addresses individual variations in order to harvest, align, and provide a single end-point. In the end, OpenScout provides a full-fledged environment that leverages the Linked Open Data available on the Web and additionally exposes it in a homogeneous format. | |||
| LiME'13 Welcome and organization | |||
| The importance of linked media to the future web: lime 2013 keynote talk -- a proposal for the linked media research agenda | | BIBA | Full-Text | 455-456 | |
| Lyndon Nixon | |||
| If the future Web is to fully leverage the scale and quality of online media, a Web-scale layer of structured, interlinked media annotations is needed, which we will call Linked Media, inspired by the Linked Data movement for making structured, interlinked descriptions of resources better available online. Mobile and tablet devices, as well as connected TVs, introduce novel application domains that will benefit from broad understanding and acceptance of Linked Media standards. In the keynote, I will provide an overview of current practices and specification efforts in the domain of video and Web content integration, drawing from the LinkedTV and MediaMixer projects. From this, I will present a vision for a Linked Media layer on the future Web that can empower new media-centric applications in a world of ubiquitous online multimedia. | |||
| Linking inside a video collection: what and how to measure? | | BIBA | Full-Text | 457-460 | |
| Robin Aly; Roeland J. F. Ordelman; Maria Eskevich; Gareth J. F. Jones; Shu Chen | |||
| Although linking video to additional information sources seems to be a sensible approach to satisfying the information needs of users, the user perspective has not yet been analyzed at a fundamental level in real-life scenarios. However, a better understanding of what motivates users to follow links in video, which anchors users prefer to link from within a video, and what type of link targets users are typically interested in, is important for modelling the automatic linking of audiovisual content appropriately. In this paper we report on our methodology for eliciting user requirements with respect to video linking, in the course of a broader study on user requirements in searching and a series of benchmark evaluations on searching and linking. | |||
| Using explicit discourse rules to guide video enrichment | | BIBA | Full-Text | 461-464 | |
| Michiel Hildebrand; Lynda Hardman | |||
| Video content analysis and named entity extraction are increasingly used to automatically generate content annotations for TV programs. A potential use of these annotations is to provide an entry point to background information that users can consume on a second screen. Automatic enrichments are, however, meaningless when it is unclear to the user what they can do with them and why they would want to. We propose to contextualize the annotations by an explicit representation of discourse in the form of scene templates. Through content rules these templates are populated with the relevant annotations. We illustrate this idea with an example video and annotations generated in the LinkedTV project. | |||
| Second screen interaction: an approach to infer tv watcher's interest using 3d head pose estimation | | BIBA | Full-Text | 465-468 | |
| Julien Leroy; François Rocca; Matei Mancas; Bernard Gosselin | |||
| In this paper, we present our "work-in-progress" approach to implicitly tracking user interaction and inferring the interest a user has in TV media. The aim is to identify moments of attentive focus, noninvasively and continuously, to dynamically improve the user profile by detecting which annotated media have drawn the user's attention. Our method is based on the detection and estimation of 3D face pose using a consumer depth camera. This allows us to determine when a user is or is not looking at the television. This study is realized in the scenario of second-screen interaction (tablet, smartphone), a behavior that has become common for spectators. We present our progress on the system and its integration in the LinkedTV project. | |||
| Enriching media fragments with named entities for video classification | | BIBA | Full-Text | 469-476 | |
| Yunjia Li; Giuseppe Rizzo; José Luis Redondo García; Raphaël Troncy; Mike Wald; Gary Wills | |||
| With the steady increase of videos published on media sharing platforms such as Dailymotion and YouTube, more and more efforts are spent to automatically annotate and organize these videos. In this paper, we propose a framework for classifying video items using both textual features such as named entities extracted from subtitles, and temporal features such as the duration of the media fragments where particular entities are spotted. We implement four automatic machine learning algorithms for multiclass classification problems, namely Logistic Regression (LG), K-Nearest Neighbour (KNN), Naive Bayes (NB) and Support Vector Machine (SVM). We study the temporal distribution patterns of named entities extracted from 805 Dailymotion videos. The results show that the best performance using the entity distribution is obtained with KNN (overall accuracy of 46.58%) while the best performance using the temporal distribution of named entities for each type is obtained with SVM (overall accuracy of 43.60%). We conclude that this approach is promising for automatically classifying online videos. | |||
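A minimal sketch, not the authors' implementation, of the classification idea: each video is represented by the named entities spotted in its subtitles together with the duration of the fragments where they appear, and standard multiclass classifiers such as KNN and SVM are trained on these features. The toy videos, entity names, and durations below are assumptions for illustration only.

```python
# Toy sketch (not the authors' code): videos represented by spotted entities and
# the seconds of media fragments in which they appear, fed to KNN and SVM.
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

videos = [  # hypothetical entity -> on-screen duration (seconds) per video
    {"Barack_Obama": 42.0, "White_House": 10.5},
    {"FC_Barcelona": 60.0, "Lionel_Messi": 33.0},
    {"Lady_Gaga": 25.0, "Grammy_Award": 12.0},
    {"Real_Madrid": 55.0, "Cristiano_Ronaldo": 40.0},
]
labels = ["news", "sport", "music", "sport"]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(videos)

new_video = {"Lionel_Messi": 20.0, "Camp_Nou": 15.0}   # unseen entities are simply ignored
for clf in (KNeighborsClassifier(n_neighbors=1, metric="cosine"), SVC(kernel="linear")):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(vec.transform([new_video])))
```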
| DataConf: enriching conference publications with a mobile mashup application | | BIBA | Full-Text | 477-478 | |
| Lionel Médini; Florian Bâcle; Hoang Duy Tan Nguyen | |||
| This paper describes a mobile Web application that allows browsing conference publications, their authors, authors' organizations, and even authors' other publications or publications related to the same keywords. It queries a main SPARQL endpoint that serves the conference metadata set, as well as other endpoints to enrich and explore data. It provides extra functions, such as scanning a publication's QR code from the Web browser and accessing external resources about the publications, and it can be linked to external Web services. This application exploits the Linked Data paradigm and performs client-side reasoning. It follows recent W3C technical advances and, as a mashup, requires few server resources. It can easily be deployed for any conference with metadata available on the Web. | |||
| The chrooma+ approach to enrich video content using HTML5 | | BIBA | Full-Text | 479-480 | |
| Philipp Oehme; Michael Krug; Fabian Wiedemann; Martin Gaedke | |||
| The Internet has become an important source of media content. Content types are not limited to text and pictures but also include video and audio. Currently, audiovisual media is presented as-is; it does not integrate the huge amount of related information available on the Web. In this paper we present the Chrooma+ approach to improve the user experience of media consumption by enriching media content with additional information from various sources on the Web. Our approach focuses on the aggregation and combination of this related information with audiovisual media. It involves using new HTML5 technologies and, with WebVTT, a new annotation format to display relevant information at definite times. Some of the advantages of this approach are the use of a rich annotation format and the extensibility to include heterogeneous information sources. | |||
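Since the approach relies on WebVTT to time-align annotations with the video, a tiny sketch of the format may help; the cue texts and timings below are invented, and a real deployment would load the generated file through an HTML5 track element.

```python
# Minimal sketch of a WebVTT annotation track (cue texts and timings invented);
# an HTML5 <track> element could load the generated file alongside the video.
cues = [
    ("00:00:05.000", "00:00:12.000", "Related article: history of the venue"),
    ("00:01:30.000", "00:01:40.000", "Background information on the speaker"),
]

with open("enrichment.vtt", "w", encoding="utf-8") as f:
    f.write("WEBVTT\n\n")                      # mandatory file header
    for i, (start, end, text) in enumerate(cues, start=1):
        f.write(f"{i}\n{start} --> {end}\n{text}\n\n")
```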
| Linking and visualizing television heritage: the EUscreen virtual exhibitions and the linked open data pilot | | BIBA | Full-Text | 481-484 | |
| Johan Oomen; Vassilis Tzouvaras; Kati Hyyppaä | |||
| The EUscreen initiative represents the European television archives and acts as a domain aggregator for Europeana, Europe's digital library, which provides access to over 20 million digitized cultural objects. The main motivation for the initiative is to provide unified access to a representative collection of television programs, secondary sources and articles, and in this way to allow students, scholars and the general public to study the history of television in its wider context. This paper explores the EUscreen activities related to novel ways to present curated content and publishing EUscreen metadata as Linked Open Data. | |||
| LSNA'13 Welcome and organization | |||
| Online social networks: beyond popularity | | BIBA | Full-Text | 489-490 | |
| Ricardo Baeza-Yates; Diego Saez-Trumper | |||
| One of the main differences between traditional Web analysis and online Social Networks (OSNs) studies is that in the first case the information is organized around content, while in the second case it is organized around people. While search engines have done a good job finding relevant content across billions of pages, nowadays we do not have an equivalent tool to find relevant people in OSNs. Even though an impressive amount of research has been done in this direction, there are still a lot of gaps to cover. Although the first intuition could be (and was!) to search for popular people, previous research has shown that users' in-degree (e.g. number of friends or followers) is important but not enough to represent the importance and reputation of a person. Another approach is to study the content of the messages exchanged between users, trying to identify topical experts. However, the computational cost of such an approach -- including language diversity -- is a big limitation. In our work we take a content-agnostic approach, focusing on the frequency, type, and time properties of user actions rather than on content, mixing their static characteristics (social graph) and their activities (dynamic graphs). Our goal is to understand the role of popular users in OSNs, and also to find "hidden important users": do popular users create new trends and cascades? Do they add value to the network? And, if they don't, who does? Our research provides preliminary answers to these questions. | |||
| Aggregating information from the crowd and the network | | BIBA | Full-Text | 491-492 | |
| Anirban Dasgupta | |||
| In social systems, information often exists in a dispersed manner, as
individual opinions, local insights and preferences. In order to make a global
decision however, we need to be able to aggregate such local pieces of
information into a global description of the system. Such information
aggregation problems are key in setting up crowdsourcing or human computation
systems. How do we formally build and analyze such information aggregation
systems? In this talk we will discuss three different vignettes based on the
particular information aggregation problem and the "social system" that we are
extracting the information from.
In our first result, we will analyze a crowdsourcing system consisting of a set of users and binary choice questions. Each user has a specific reliability that determines the user's error rate in answering the questions. We give an unsupervised algorithm for aggregating the user answers in order to simultaneously derive the user expertise as well as the truth values of the questions. Our second result will deal with the case when there is an interacting user community on a question answering forum. User preferences of quality are now expressed in terms of ("best answer" and "thumbs up/down") votes cast on each other's content. We will analyze a set of possible factors that indicate bias in user voting behavior -- these factors encompass different gaming behavior, as well as other eccentricities. We address the problem of aggregating user preferences (votes) using a supervised machine learning framework to calibrate such votes. We will see that this supervised learning method of content-agnostic vote calibration can significantly improve the performance of answer ranking and expert ranking. The last part of the talk will describe how it is possible to exploit local insights that users have about their friends in order to improve the efficiency of surveying in a (networked) population. We will describe the notion of "social sampling", where participants in a poll respond with a summary of their friends' putative responses to the poll. The analysis of social sampling leads to novel trade-off questions: the savings in the number of samples (roughly the average size of the neighborhood of participants) vs. the systematic bias in the poll due to the network structure. We show bounds on the variances of a few such estimators -- experiments on real-world networks show this to be a useful paradigm for obtaining accurate information with a small number of samples. | |||
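For the first vignette, a compact sketch (not the speaker's actual algorithm) of unsupervised aggregation over binary questions: answers and user reliabilities are estimated alternately, with answers obtained by reliability-weighted voting and reliabilities by agreement with the current answer estimates. The answer matrix is toy data.

```python
# Toy sketch of unsupervised aggregation for binary questions: alternate between
# estimating answers by reliability-weighted voting and estimating reliabilities
# by agreement with the current answers. Not the speaker's actual algorithm.
import numpy as np

# answers[u, q] in {+1, -1}: rows are users, columns are binary questions (toy data).
answers = np.array([
    [+1, +1, -1, +1],
    [+1, +1, -1, -1],
    [-1, +1, -1, +1],
    [+1, -1, +1, +1],
])

reliability = np.ones(answers.shape[0])
for _ in range(10):
    truth = np.sign(reliability @ answers)        # weighted vote per question
    agreement = (answers == truth).mean(axis=1)   # per-user fraction of matches
    reliability = 2 * agreement - 1               # map agreement in [0,1] to a weight in [-1,1]

print("estimated answers:      ", truth)
print("estimated reliabilities:", np.round(reliability, 2))
```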
| The social meanings of social networks: integrating SNA and ethnography of social networking | | BIBA | Full-Text | 493-494 | |
| Rogério de Paula | |||
| In this talk, I examine the manifest, emic meanings of social networking in the context of social network analysis, and use this to discuss how the confluence of social science and computational sociology can contribute to a richer understanding of how emerging social technologies shape and are shaped by people's everyday practices. | |||
| Detecting malware with graph-based methods: traffic classification, botnets, and Facebook scams | | BIBA | Full-Text | 495-496 | |
| Michalis Faloutsos | |||
| In this talk, we highlight two topics on security from our lab. First, we
address the problem of Internet traffic classification (e.g. web, filesharing,
or botnet?). We present a fundamentally different approach to classifying
traffic that studies the network wide behavior by modeling the interactions of
users as a graph. By contrast, most previous approaches use statistics such as
packet sizes and inter-packet delays. We show how our approach gives rise to
novel and powerful ways to: (a) visualize the traffic, (b) model the behavior
of applications, and (c) detect abnormalities and attacks. Extending this
approach, we develop ENTELECHEIA, a botnet-detection method. Tests with real
data suggest that our graph-based approach is very promising.
Second, we present MyPageKeeper, a security Facebook app with 13K downloads, which we deployed to: (a) quantify the presence of malware on Facebook, and (b) protect end-users. We designed MyPageKeeper in a way that strikes a balance between accuracy and scalability. Our initial results are scary and interesting: (a) malware is widespread, with 49% of our users exposed to at least one malicious post from a friend, and (b) roughly 74% of all malicious posts contain links that point back to Facebook, and thus would evade any of the current web-based filtering approaches. | |||
| Mining and analyzing the enterprise knowledge graph | | BIBA | Full-Text | 497-498 | |
| Ido Guy | |||
| Today's enterprises hold ever-growing amounts of public data, stemming from different organizational systems, such as development environments, CRM systems, business intelligence systems, and enterprise social media. This data unlocks rich and diverse information about entities, people, terms, and the relationships among them. A lot of insight can be gained through analyzing this knowledge graph, both by individual employees and by the organization as a whole. In this talk, I will review recent work done by the Social Technologies & Analytics group at IBM Research-Haifa to mine these relationships, represent them in a generalized model, and use the model for different aims within the enterprise, including social search [5], expertise location [1], social recommendation [2, 3], and network analysis [4]. | |||
| Scaling graph computations at Facebook | | BIBA | Full-Text | 499-500 | |
| Johan Ugander | |||
| With over a billion nodes and hundreds of billions of edges, scalability is at the forefront of concerns when dealing with the Facebook social graph. This talk will focus on two recent advances in graph computations at Facebook. The first focus concerns the development of a novel graph sharding algorithm -- Balanced Label Propagation -- for load-balancing distributed graph computations. Using Balanced Label Propagation, we were able to reduce by 50% the query time of Facebook's 'People You May Know' service, the realtime distributed system responsible for the feature extraction and ranking of the friends-of-friends of all active Facebook users. The second focus concerns the 2011 computation of the average distance distribution between all active Facebook users. This computation, which produced an average distance of 4.74, was made possible by two recent computational advances: Hyper-ANF, a modern probabilistic algorithm for computing distance distributions, and Layered Label Propagation, a modern compression scheme suited for social graphs. The details of how this computation was coordinated will be described. The talk describes joint work with Lars Backstrom, Paolo Boldi, Marco Rosa, and Sebastiano Vigna. | |||
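Hyper-ANF itself is a probabilistic counting algorithm; as a far simpler stand-in (an assumption, not Facebook's method), the average distance can be estimated by running breadth-first searches from a random sample of sources, as in this sketch on a synthetic small-world graph.

```python
# Simple stand-in (not Hyper-ANF): estimate the average pairwise distance by
# running BFS from a random sample of source nodes on a synthetic graph.
import random
import networkx as nx

G = nx.watts_strogatz_graph(5000, 10, 0.1, seed=1)   # small-world stand-in graph
sources = random.Random(1).sample(list(G.nodes), 50)

distances = []
for s in sources:
    lengths = nx.single_source_shortest_path_length(G, s)   # BFS distances from s
    distances.extend(d for t, d in lengths.items() if t != s)

print("estimated average distance:", sum(distances) / len(distances))
```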
| Towards highly scalable pregel-based graph processing platform with x10 | | BIBA | Full-Text | 501-508 | |
| Nguyen Thien Bao; Toyotaro Suzumura | |||
| Many practical computing problems concern large graphs. Standard problems include web graph analysis and the analysis of social networks such as Facebook and Twitter. The scale of these graphs poses a challenge to their efficient processing. To efficiently process large-scale graphs, we create X-Pregel, a graph processing system based on Google's Pregel computing model [1], using the state-of-the-art PGAS programming language X10. We do not merely implement Google's Pregel in the X10 language; we also introduce two new features that do not exist in the original model to optimize performance: (1) an optimization that reduces the number of messages exchanged among workers, and (2) a dynamic re-partitioning scheme that effectively reassigns vertices to different workers during the computation. Our performance evaluation demonstrates that our optimized message-sending method achieves up to a 200% speedup on PageRank, reducing network I/O by a factor of up to 10 in comparison with the default method of sending messages when processing a SCALE 20 Kronecker graph [2] (vertices = 1,048,576, edges = 33,554,432). It also demonstrates that our system processes large graphs faster than prior implementations of Pregel such as GPS [3] (graph processing system) and Giraph [4]. | |||
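A single-process sketch of the Pregel-style computation being optimized (an illustration, not X-Pregel or X10 code): in each superstep every vertex sends its PageRank share along its out-edges, and messages destined for the same vertex are combined by summation, the kind of message reduction the paper targets. The graph is a toy adjacency list.

```python
# Single-process toy sketch of Pregel-style PageRank supersteps in which
# messages to the same target are combined (summed) before delivery.
from collections import defaultdict

graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}      # hypothetical adjacency lists
n = len(graph)
rank = {v: 1.0 / n for v in graph}

for superstep in range(20):
    inbox = defaultdict(float)                   # combiner: one summed message per target
    for v, neighbours in graph.items():
        if neighbours:
            share = rank[v] / len(neighbours)
            for u in neighbours:
                inbox[u] += share
    rank = {v: 0.15 / n + 0.85 * inbox[v] for v in graph}

print(rank)
```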
| A first view of exedra: a domain-specific language for large graph analytics workflows | | BIBA | Full-Text | 509-516 | |
| Miyuru Dayarathna; Toyotaro Suzumura | |||
| In recent years, many programming models, software libraries, and middleware have appeared for processing large graphs of various forms. However, there exists a significant usability gap between graph analysis scientists and High Performance Computing (HPC) application programmers due to the complexity of HPC graph analysis software. In this paper we provide a basic view of Exedra, a domain-specific language (DSL) for large graph analysis in which we aim to eliminate the aforementioned complexities. Exedra consists of high-level language constructs for specifying different graph analysis tasks in distributed environments. We implemented the Exedra DSL on a scalable graph analysis platform called Dipper. Dipper uses the Igraph/R interface for creating graph analysis workflows, which are in turn translated into Exedra statements. Exedra statements are interpreted by the Dipper interpreter and mapped to user-specified libraries/middleware. The Exedra DSL allows for the synthesis of graph algorithms that are more efficient than the bare use of graph libraries, while maintaining a standard interface that could accommodate even future graph analysis software. We evaluated Exedra's feasibility for expressing graph analysis tasks by running Dipper on a cluster of four nodes. We observed that Dipper is able to reduce the time taken for graph analysis when the workflow is distributed over all four nodes, despite the communication and data format conversion overhead of the Dipper framework. | |||
| Analysis of large scale climate data: how well climate change models and data from real sensor networks agree? | | BIBA | Full-Text | 517-526 | |
| Santiago A. Nunes; Luciana A. S. Romani; Ana M. H. Avila; Priscila P. Coltri; Caetano, Jr. Traina; Robson L. F. Cordeiro; Elaine P. M. de Sousa; Agma J. M. Traina | |||
| Research on global warming and climate change has attracted huge attention from the scientific community and the media in general, mainly due to the social and economic impacts they pose over the entire planet. Climate change simulation models have been developed and improved to provide reliable data, which are employed to forecast the effects of increasing emissions of greenhouse gases on the future global climate. The data generated by each model simulation amount to terabytes, and demand fast and scalable methods to process them. In this context, we propose a new process of analysis aimed at discriminating between the temporal behavior of the data generated by climate models and the real climate observations gathered from ground-based meteorological station networks. Our approach combines fractal data analysis and the monitoring of real and model-generated data streams to detect deviations in the intrinsic correlation among the time series defined by different climate variables. Our measurements were made using series from a regional climate model and the corresponding real data from a network of sensors from meteorological stations existing in the analyzed region. The results show that our approach can correctly discriminate the data either as real or as simulated, even when statistical tests fail. These results suggest that there is still room for improvement of the state-of-the-art climate change models, and that the fractal-based concepts may contribute to their improvement, besides being a fast, parallelizable, and scalable approach. | |||
| Model of complex networks based on citation dynamics | | BIBA | Full-Text | 527-530 | |
| Lovro Šubelj; Marko Bajec | |||
| Complex networks of real-world systems are believed to be controlled by common phenomena, producing structures far from regular or random. These include scale-free degree distributions, small-world structure and assortative mixing by degree, which are also the properties captured by different random graph models proposed in the literature. However, many (non-social) real-world networks are in fact disassortative by degree. Thus, we here propose a simple evolving model that generates networks with most common properties of real-world networks including degree disassortativity. Furthermore, the model has a natural interpretation for citation networks with different practical applications. | |||
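For readers who want to check disassortativity on their own data, a small illustration (not the authors' model) using networkx; the Barabási-Albert graph is only a stand-in for a growing citation-like network, and its assortativity coefficient typically comes out near zero or slightly negative.

```python
# Checking degree assortativity with networkx on a stand-in growing network;
# a negative coefficient indicates degree disassortativity.
import networkx as nx

G = nx.barabasi_albert_graph(1000, 3, seed=42)
print("degree assortativity:", nx.degree_assortativity_coefficient(G))
```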
| How social network is evolving?: a preliminary study on billion-scale Twitter network | | BIBA | Full-Text | 531-534 | |
| Masaru Watanabe; Toyotaro Suzumura | |||
| Recently, social network services such as Twitter, Facebook, MySpace, and LinkedIn have been growing remarkably. There are various studies on social network analysis. Haewoon Kwak analyzed the Twitter network in 2009 and reported its degree of separation. However, with about 41.7 million users in 2009, that graph is not very large compared with the current one. In this paper, we conduct a Twitter network analysis in terms of growth by region, scale-freeness, reciprocity, degree of separation, and diameter, using Twitter user data with 469.9 million users and 28.7 billion relationships. Through our experiments, we find that the degree of separation in the current Twitter network is 4.59. | |||
| MABSDA'13 Welcome and organization | |||
| The web as a laboratory | | BIBA | Full-Text | 539-540 | |
| Bebo White | |||
| Insights from Web Science and Big Data Analysis have led many researchers to the conclusion that the Web not only represents an almost unlimited data store but also a remarkable multi-disciplinary laboratory environment. A new challenge is how to best leverage the potential of this experimental space. What are the procedures for defining, implementing and evaluating "Web-scale" experiments? What are acceptable measures of robustness and repeatability? What are the opportunities for experimental collaboration? What disciplines are likely to benefit from this new research model? The Web Laboratory provides an exciting new and fertile model for future research. | |||
| Like prediction: modeling like counts by bridging Facebook pages with linked data | | BIBA | Full-Text | 541-548 | |
| Shohei Ohsawa; Yutaka Matsuo | |||
| Recent growth of social media has produced a new market for branding of
people and businesses. Facebook provides Facebook Pages (Pages in short) for
public figures and businesses (we call entities) to communicate with their fans
through a Like button. Because Like counts sometimes reflect the popularity of
entities, techniques to increase the Like count can be a matter of interest,
and might be known as social media marketing. From an academic perspective,
Like counts of Pages depend not only on the popularity of the entity, but also
on the popularity of semantically related entities. For example, Lady Gaga's
Page has many Likes; her song "Poker Face" does too. We can infer that her next
song will acquire many Likes immediately. Important questions are these: How
does the Like count of Lady Gaga affect the Like count of her song?
Alternatively, how does the Like count of her song constitute some fraction of
the Like count of Lady Gaga herself?
As described in this paper, we strive to reveal the mutual influences of Like counts among semantically related entities. To measure the influence of related entities, we propose a problem called the Like prediction problem (LPP). It models the Like counts of a given entity using information about related entities. The semantic relations among entities, expressed as RDF predicates, are obtained by linking each Page with the most similar DBpedia entity. Using the model learned by support vector regression (SVR) on LPP, we can estimate the Like count of a new entity, e.g., Lady Gaga's new song. More importantly, we can analyze which RDF predicates are important for inferring Like counts, providing a mutual influence network among entities. Our study comprises three parts: (1) crawling the Pages and their Like counts, (2) linking Pages to DBpedia, and (3) constructing features to solve the LPP. Our study, based on 20 million Pages with 30 billion Likes, is the largest-scale study of Facebook Likes ever reported. This research constitutes a new attempt to integrate unstructured emotional data such as Likes with Linked Data, and to provide new insights for branding with social media. | |||
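A hedged sketch of just the regression step: predicting an entity's Like count from features derived from semantically related entities with support vector regression, as the paper does. The tiny feature matrix, feature choice, and target values below are invented for illustration.

```python
# Invented toy data: features per Page derived from related entities, e.g.
# [log Likes of the artist, log Likes of the album, number of related entities];
# target: log of the Page's own Like count.
import numpy as np
from sklearn.svm import SVR

X = np.array([[6.2, 5.1, 12], [7.5, 6.8, 30], [4.0, 3.2, 5], [6.9, 6.0, 22]])
y = np.array([5.4, 7.1, 3.5, 6.3])

model = SVR(kernel="rbf", C=10.0).fit(X, y)
print(model.predict([[7.0, 6.1, 25]]))   # estimate for a new, related entity
```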
| Tower of babel: a crowdsourcing game building sentiment lexicons for resource-scarce languages | | BIBA | Full-Text | 549-556 | |
| Yoonsung Hong; Haewoon Kwak; Youngmin Baek; Sue Moon | |||
| With the growing amount of textual data produced by online social media today, the demand for sentiment analysis is also rapidly increasing, and this is true worldwide. However, non-English languages often lack sentiment lexicons, a core resource for performing sentiment analysis. Our solution, Tower of Babel (ToB), is a language-independent, sentiment-lexicon-generating crowdsourcing game. We conducted an experiment with 135 participants to explore the difference between our solution and a conventional manual annotation method. We evaluated ToB in terms of effectiveness, efficiency, and satisfaction. Based on the results of the evaluation, we conclude that sentiment classification via ToB is accurate, productive and enjoyable. | |||
| Rule-based opinion target and aspect extraction to acquire affective knowledge | | BIBA | Full-Text | 557-564 | |
| Stefan Gindl; Albert Weichselbraun; Arno Scharl | |||
| Opinion holder and opinion target extraction are among the most popular and challenging problems tackled by opinion mining researchers, who recognize the significant business value of such components and their importance for applications such as media monitoring and Web intelligence. This paper describes an approach that combines opinion target extraction with aspect extraction using syntactic patterns. It expands previous work limited by sentence boundaries and includes a heuristic for anaphora resolution to identify targets across sentences. Furthermore, it demonstrates the application of concepts known from research on open information extraction to the identification of relevant opinion aspects. Qualitative analyses performed on a corpus of 100,000 Amazon product reviews show that the approach is promising. The extracted opinion targets and aspects are useful for enriching common knowledge resources and opinion mining ontologies, and support practitioners and researchers in identifying opinions in document collections. | |||
| A graph-based approach to commonsense concept extraction and semantic similarity detection | | BIBA | Full-Text | 565-570 | |
| Dheeraj Rajagopal; Erik Cambria; Daniel Olsher; Kenneth Kwok | |||
| Commonsense knowledge representation and reasoning support a wide variety of potential applications in fields such as document auto-categorization, Web search enhancement, topic gisting, social process modeling, and concept-level opinion and sentiment analysis. Solutions to these problems, however, demand robust knowledge bases capable of supporting flexible, nuanced reasoning. Populating such knowledge bases is highly time-consuming, making it necessary to develop techniques for deconstructing natural language texts into commonsense concepts. In this work, we propose an approach for effective multi-word commonsense expression extraction from unrestricted English text, in addition to a semantic similarity detection technique allowing additional matches to be found for specific concepts not already present in knowledge bases. | |||
| Spanish knowledge base generation for polarity classification from masses | | BIBA | Full-Text | 571-578 | |
| Arturo Montejo-Ráez; Manuel Carlos Díaz-Galiano; José Manuel Perea-Ortega; Luis Alfonso Ureña-López | |||
| This work presents a novel method for the generation of a knowledge base oriented to Sentiment Analysis from the continuous stream of micro-blogs published on social media services like Twitter. The method is simple in its approach and has been shown to be effective compared to other knowledge-based methods for Polarity Classification. Due to its independence from language, the method has been tested on different Spanish corpora, with minimal effort in the lexical resources involved. Although for two of the three studied corpora the obtained results did not improve on those officially reported for the same corpora, it should be noted that this is an unsupervised approach and the accuracy levels achieved were close to those obtained with well-known supervised algorithms. | |||
| Revised mutual information approach for German text sentiment classification | | BIBA | Full-Text | 579-586 | |
| Farag Saad; Brigitte Mathiak | |||
| The significant increase in online social media content such as product reviews, blogs, forums, etc. has led to increasing attention to sentiment analysis tools and approaches that mine this substantially growing content. The aim of this paper is to develop a robust classification approach for customer reviews based on a self-annotated, domain-specific corpus by applying a statistical approach, i.e., mutual information. First, subjective words in each test sentence are identified. Second, ambiguous adjectives such as high, low, large, many, etc. are disambiguated based on their accompanying noun using a conditional mutual information approach. Third, a mutual information approach is applied to find the sentiment orientation (polarity) of the identified subjective words by analyzing their statistical relationship with the manually annotated sentiment labels within a sizeable sentiment training dataset. Fourth, since negation plays a significant role in flipping the sentiment polarity of an identified sentiment word, we estimate the role of negation in affecting classification accuracy. Finally, the identified polarity for each test sentence is evaluated against experts' annotations. | |||
| MSM'13 Welcome and organization | |||
| Urban: crowdsourcing for the good of London | | BIBA | Full-Text | 591-592 | |
| Daniele Quercia | |||
| For the last few years, we have been studying existing social media sites
and created new ones in the context of London. By combining what Twitter users
in a variety of London neighborhoods talk about with census data, we showed
that neighborhood deprivation was associated (positively and negatively) with
use of emotion words (sentiment) [2] and with specific topics [5]. Users in
more deprived neighborhoods tweeted about wedding parties, matters expressed in
Spanish/Portuguese, and celebrity gossips. By contrast, those in less deprived
neighborhoods tweeted about vacations, professional use of social media,
environmental issues, sports, and health issues. Also, using data about 76
million London underground and overground rail journeys, we found that people
from deprived areas visited both other deprived areas and prosperous areas,
while residents of better-off communities tended to only visit other privileged
neighborhoods -- suggesting a geographic segregation effect [1, 6]. More
recently, we created and launched two crowdsourcing websites. First, we
launched urbanopticon.org, which extracts Londoners' mental images of the city.
By testing which places are remarkable and unmistakable and which places
represent faceless sprawl, we were able to draw the recognizability map of
London. We found that areas with low recognizability did not fare any worse on
the economic indicators of income, education, and employment, but they did
significantly suffer from social problems of housing deprivation, poor living
conditions, and crime [4]. Second, we launched urbangems.org. This crowdsources
visual perceptions of quiet, beauty and happiness across the city using Google
Street View pictures.
The aim is to identify the visual cues that are generally associated with concepts that are difficult to define, such as beauty, happiness, quietness, or even deprivation. By using state-of-the-art image processing techniques, we determined the visual cues that make a place appear beautiful, quiet, and happy [3]: the amount of greenery was the visual cue most positively associated with each of the three qualities; by contrast, broad streets, fortress-like buildings, and council houses tended to be negatively associated. These two sites offer the ability to conduct specific urban sociological experiments at scale. More generally, this line of work is at the crossroad of two emerging themes in computing research -- a crossroad where "web science" meets the "smart city" agenda. | |||
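As a crude stand-in for one of the visual cues mentioned above (not the paper's image-processing pipeline), the fraction of green-dominant pixels in a street-level image can serve as a simple greenery score; the image here is synthetic.

```python
# Crude greenery cue on a synthetic RGB image: fraction of green-dominant pixels.
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(240, 320, 3))       # stand-in street-level image

r, g, b = image[..., 0], image[..., 1], image[..., 2]
green_fraction = float(np.mean((g > r) & (g > b)))
print("greenery cue:", round(green_fraction, 3))
```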
| Using topic models for Twitter hashtag recommendation | | BIBA | Full-Text | 593-596 | |
| Fréderic Godin; Viktor Slavkovikj; Wesley De Neve; Benjamin Schrauwen; Rik Van de Walle | |||
| Since the introduction of microblogging services, there has been a continuous growth of short-text social networking on the Internet. With the generation of large amounts of microposts, there is a need for effective categorization and search of the data. Twitter, one of the largest microblogging sites, allows users to make use of hashtags to categorize their posts. However, the majority of tweets do not contain tags, which hinders the quality of the search results. In this paper, we propose a novel method for unsupervised and content-based hashtag recommendation for tweets. Our approach relies on Latent Dirichlet Allocation (LDA) to model the underlying topic assignment of language classified tweets. The advantage of our approach is the use of a topic distribution to recommend general hashtags. | |||
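A minimal sketch of the general idea, with toy tweets and an invented hashtag inventory: fit LDA on tweet text, then recommend the hashtags attached to training tweets that share the new tweet's dominant topic. On data this small the topics may not split cleanly; the code only shows the shape of the pipeline.

```python
# Toy pipeline: fit LDA on tweets, then recommend hashtags of training tweets
# that share the new tweet's dominant topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "great goal in the champions league final",
    "new phone launch with a bigger screen",
    "amazing free kick and a late winner",
    "battery life and camera specs leaked",
]
hashtags = ["#football", "#tech", "#football", "#tech"]   # invented inventory

vec = CountVectorizer()
X = vec.fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
topics = lda.transform(X).argmax(axis=1)                  # dominant topic per training tweet

def recommend(new_tweet):
    topic = lda.transform(vec.transform([new_tweet])).argmax()
    return {h for h, t in zip(hashtags, topics) if t == topic}

print(recommend("late goal in the league final"))
```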
| FS-NER: a lightweight filter-stream approach to named entity recognition on Twitter data | | BIBA | Full-Text | 597-604 | |
| Diego Marinho de Oliveira; Alberto H. F. Laender; Adriano Veloso; Altigran S. da Silva | |||
| Microblog platforms such as Twitter are being increasingly adopted by Web users, yielding an important source of data for web search and mining applications. Tasks such as Named Entity Recognition are at the core of many of these applications, but the effectiveness of existing tools is seriously compromised when applied to Twitter data, since messages are terse, poorly worded and posted in many different languages. Also, Twitter follows a streaming paradigm, imposing that entities must be recognized in real-time. In view of these challenges and the inappropriateness of existing tools, we propose a novel approach for Named Entity Recognition on Twitter data called FS-NER (Filter-Stream Named Entity Recognition). FS-NER is characterized by the use of filters that process unlabeled Twitter messages, being much more practical than existing supervised CRF-based approaches. Such filters can be combined either in sequence or in parallel in a flexible way. Moreover, because these filters are not language dependent, FS-NER can be applied to different languages without requiring a laborious adaptation. Through a systematic evaluation using three Twitter collections and considering seven types of entity, we show that FS-NER performs 3% better than a CRF-based baseline, besides being orders of magnitude faster and much more practical. | |||
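A loose sketch of the filter-stream idea (the filters, gazetteer, and combination rule are invented, not FS-NER's actual filters): lightweight, language-independent filters each propose candidate entity terms for a tweet, and terms supported by enough filters are accepted.

```python
# Invented filters: each proposes candidate entity terms for a tweet; terms
# supported by at least two filters are accepted as entities.
import re

def capitalization_filter(tweet):
    # words capitalized after the first token are weak entity evidence
    return {w.lower() for w in re.findall(r"\b[A-Z][a-z]+\b", tweet)[1:]}

def dictionary_filter(tweet, gazetteer=frozenset({"barcelona", "messi", "madrid"})):
    return {w for w in re.findall(r"\w+", tweet.lower()) if w in gazetteer}

def recognize(tweet, filters, min_votes=2):
    votes = {}
    for f in filters:
        for term in f(tweet):
            votes[term] = votes.get(term, 0) + 1
    return {t for t, v in votes.items() if v >= min_votes}

print(recognize("Great match tonight, Messi carried Barcelona again",
                [capitalization_filter, dictionary_filter]))
```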
| Nerding out on Twitter: fun, patriotism and #curiosity | | BIBA | Full-Text | 605-612 | |
| Victoria Uren; Aba-Sah Dadzie | |||
| This paper presents an analysis of tweets collected over six days before, during and after the landing of the Mars Science Laboratory, known as Curiosity, in the Gale Crater on the 6th of August 2012. A sociological application of web science is demonstrated by use of parallel coordinate visualization as part of a mixed methods study. The results show strong, predominantly positive, international interest in the event. Scientific details dominated the stream, but, following the successful landing, other themes emerged such as fun, and national pride. | |||
| ET: events from tweets | | BIBA | Full-Text | 613-620 | |
| Ruchi Parikh; Kamalakar Karlapalem | |||
| Social media sites such as Twitter and Facebook have emerged as popular tools for people to express their opinions on various topics. The large amount of data provided by these media is extremely valuable for mining trending topics and events. In this paper, we build an efficient, scalable system to detect events from tweets (ET). Our approach detects events by exploring their textual and temporal components. ET does not require any target entity or domain knowledge to be specified; it automatically detects events from a set of tweets. The key components of ET are (1) an extraction scheme for event-representative keywords, (2) an efficient storage mechanism to store their appearance patterns, and (3) a hierarchical clustering technique based on the common co-occurring features of keywords. The events are determined through the hierarchical clustering process. We evaluate our system on two data-sets; one is provided by the VAST challenge 2011, and the other was published by US-based users in January 2013. Our results show that we are able to detect events of relevance efficiently. | |||
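A rough sketch of one component described above, with invented counts: keywords are clustered hierarchically by the similarity of their appearance patterns over time, so that keywords bursting together end up in the same candidate event.

```python
# Invented per-hour mention counts: keywords with similar temporal patterns are
# grouped by average-linkage hierarchical clustering on cosine distance.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

keywords = ["earthquake", "tsunami", "election", "vote", "goal"]
counts = np.array([
    [0, 1, 9, 8, 7, 1],
    [0, 0, 8, 9, 6, 1],
    [5, 6, 1, 0, 0, 0],
    [6, 5, 2, 0, 0, 0],
    [0, 0, 0, 0, 0, 9],
])

Z = linkage(counts, method="average", metric="cosine")
labels = fcluster(Z, t=0.3, criterion="distance")   # cut the dendrogram into candidate events
for keyword, cluster in zip(keywords, labels):
    print(keyword, "-> candidate event", cluster)
```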
| Meaning as collective use: predicting semantic hashtag categories on Twitter | | BIBA | Full-Text | 621-628 | |
| Lisa Posch; Claudia Wagner; Philipp Singer; Markus Strohmaier | |||
| This paper sets out to explore whether data about the usage of hashtags on Twitter contains information about their semantics. Towards that end, we perform initial statistical hypothesis tests to quantify the association between usage patterns and semantics of hashtags. To assess the utility of pragmatic features -- which describe how a hashtag is used over time -- for semantic analysis of hashtags, we conduct various hashtag stream classification experiments and compare their utility with the utility of lexical features. Our results indicate that pragmatic features indeed contain valuable information for classifying hashtags into semantic categories. Although pragmatic features do not outperform lexical features in our experiments, we argue that pragmatic features are important and relevant for settings in which textual information might be sparse or absent (e.g., in social video streams). | |||
| Towards linking buyers and sellers: detecting commercial Intent on Twitter | | BIBA | Full-Text | 629-632 | |
| Bernd Hollerit; Mark Kröll; Markus Strohmaier | |||
| Since more and more people use the micro-blogging platform Twitter to convey their needs and desires, it has become a particularly interesting medium for the task of identifying commercial activities. Potential buyers and sellers can be contacted directly, thereby opening up novel perspectives and economic possibilities. By detecting commercial intent in tweets, this work is considered a first step towards bringing together buyers and sellers. We present an automatic method for detecting commercial intent in tweets, with which we achieve reasonable precision (57%) and recall (77%) scores. In addition, we provide insights into the nature and characteristics of tweets exhibiting commercial intent, thereby contributing to our understanding of how people express commercial activities on Twitter. | |||
| MicroFilter: real time filtering of microblogging content | | BIBA | Full-Text | 633-634 | |
| Ryadh Dahimene; Cédric du Mouza | |||
| Microblogging systems have become a major trend on the Web. After only 7 years of existence, Twitter, for instance, claims more than 500 million users and more than 350 billion delivered updates each day. As a consequence, users must today manage possibly extremely large feeds, resulting in poor data readability and loss of valuable information, and the system must face a huge network load. In this demonstration, we present and illustrate the features of MicroFilter (MF in the following), an inverted-list-based filtering engine that nicely extends existing centralized microblogging systems by adding a real-time filtering feature. The proposed demonstration illustrates how the user experience is improved, the impact on traffic for the overall system, and how the characteristics of microblogs drove the design of the indexing structures. | |||
| Some clues on irony detection in tweets | | BIBA | Full-Text | 635-636 | |
| Aline A. Vanin; Larissa A. Freitas; Renata Vieira; Marco Bochernitsan | |||
| MSND'13 Welcome | |||
| Detection of spam tipping behaviour on foursquare | | BIBA | Full-Text | 641-648 | |
| Anupama Aggarwal; Jussara Almeida; Ponnurangam Kumaraguru | |||
| In Foursquare, one of the currently most popular online location based
social networking sites (LBSNs), users may not only check-in at specific venues
but also post comments (or tips), sharing their opinions and previous
experiences at the corresponding physical places. Foursquare tips, which are
visible to everyone, provide venue owners with valuable user feedback besides
helping other users to make an opinion about the specific venue. However, they
have been the target of spamming activity by users who exploit this feature to
spread tips with unrelated content.
In this paper, we present what, to our knowledge, is the first effort to identify and analyze different patterns of tip spamming activity in Foursquare, with the goal of developing automatic tools to detect users who post spam tips -- tip spammers. A manual investigation of a real dataset collected from Foursquare led us to identify four categories of spamming behavior, viz. Advertising/Spam, Self-promotion, Abusive and Malicious. We then applied machine learning techniques, jointly with a selected set of user, social and tip's content features associated with each user, to develop automatic detection tools. Our experimental results indicate that we are able to not only correctly distinguish legitimate users from tip spammers with high accuracy (89.76%) but also correctly identify a large fraction (at least 78.88%) of spammers in each identified category. | |||
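A schematic example of the final supervised step, with made-up features and data: per-user features combining user, social, and tip-content signals are fed to a classifier that separates tip spammers from legitimate users.

```python
# Made-up per-user features: [tips per day, fraction of tips with URLs,
# followers/following ratio, mean tip length]; label 1 marks a tip spammer.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([
    [0.4, 0.05, 1.20, 80],
    [9.0, 0.90, 0.10, 25],
    [0.7, 0.10, 0.90, 60],
    [12.0, 0.95, 0.05, 20],
])
y = np.array([0, 1, 0, 1])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[8.5, 0.8, 0.2, 30]]))   # classify a new, unseen user
```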
| The role of research leaders on the evolution of scientific communities | | BIBA | Full-Text | 649-656 | |
| Bruno Leite Alves; Fabrício Benevenuto; Alberto H. F. Laender | |||
| There have been considerable efforts in the literature towards understanding and modeling dynamic aspects of scientific communities. Despite the great interest, little is known about the roles that different members play in the formation of the underlying network structure of such communities. In this paper, we provide a wide investigation of the roles that members of the core of scientific communities play in the formation and evolution of the collaboration network structure. To do that, we define a community core based on an individual metric, the core score, which is an h-index derived metric that captures both the prolificness and the involvement of researchers in a community. Our results provide a number of key observations related to community formation and evolution patterns. In particular, we show that members of the community core work as bridges that connect smaller clustered research groups. Furthermore, these members are responsible for an increase in the average degree of the whole community's underlying network and a decrease in the overall network assortativeness. More importantly, we note that variations in the members of the community core tend to be strongly correlated with variations in these metrics. We argue that our observations are important for shedding light on the role of key members in community formation and structure. | |||
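The paper's core score weighs both prolificness and community involvement; as a rough point of reference only, the plain h-index computation it derives from looks like this (the citation counts are hypothetical).

```python
# Plain h-index over hypothetical within-community citation counts; the paper's
# core score additionally weighs community involvement.
def h_index(citation_counts):
    """Largest h such that at least h items have at least h citations."""
    counts = sorted(citation_counts, reverse=True)
    return sum(1 for i, c in enumerate(counts, start=1) if c >= i)

print(h_index([10, 8, 5, 4, 3, 0]))   # -> 4
```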
| Analyzing and predicting viral tweets | | BIBA | Full-Text | 657-664 | |
| Maximilian Jenders; Gjergji Kasneci; Felix Naumann | |||
| Twitter and other microblogging services have become indispensable sources
of information in today's web. Understanding the main factors that make certain
pieces of information spread quickly in these platforms can be decisive for the
analysis of opinion formation and many other opinion mining tasks.
This paper addresses important questions concerning the spread of information on Twitter. What makes Twitter users retweet a tweet? Is it possible to predict whether a tweet will become "viral", i.e., will be frequently retweeted? To answer these questions we provide an extensive analysis of a wide range of tweet and user features regarding their influence on the spread of tweets. The most impactful features are chosen to build a learning model that predicts viral tweets with high accuracy. All experiments are performed on a real-world dataset, extracted through a public Twitter API based on user IDs from the TREC 2011 microblog corpus. | |||
| Resolving homonymy with correlation clustering in scholarly digital libraries | | BIBA | Full-Text | 665-672 | |
| Jeongin Ju; Hosung Park; Sue Moon | |||
| As scholarly data increases rapidly, scholarly digital libraries (SDLs), supplying publication data through convenient online interfaces, have become popular and important tools for researchers. Researchers use SDLs for various purposes, including searching the publications of an author, assessing one's impact by the citations, and identifying one's research topics. However, common names among authors cause difficulties in correctly identifying one's works among a large number of scholarly publications. Abbreviated first and middle names make it even harder to identify and distinguish authors with the same representation (i.e. spelling) of names. Several disambiguation methods have solved the problem under their own assumptions. The assumptions are usually that inputs such as the number of same-named authors, training sets, or rich and clear information about papers are given. Considering the size of scholarship records today and their inconsistent formats, we expect these assumptions to be very hard to meet. We use the common assumption that coauthors are likely to write more than one paper together and propose an unsupervised approach to group papers from the same author using only the most common information: author lists. We represent each paper as a point in an author name space, apply dimension reduction to find author names that frequently appear together in papers, and cluster papers with a vector similarity measure well suited to the name disambiguation task. The main advantage of our approach is that it uses only coauthor information as input. We evaluate our method using publication records collected from DBLP, and show that our approach results in better disambiguation than five other clustering methods in terms of cluster purity and fragmentation. | |||
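A hedged sketch of the pipeline described above, with invented papers and parameters: each ambiguous paper is represented by its coauthor list, the representation is reduced in dimension, and papers are clustered so that each cluster is hoped to correspond to one real author.

```python
# Invented papers by two different authors who share the same name; each paper
# is represented by its coauthor list, reduced with SVD, then clustered.
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

papers = [
    "H_Park S_Moon",      # ambiguous author, networks group (toy)
    "S_Moon J_Ju",        # ambiguous author, networks group (toy)
    "A_Gupta M_Varma",    # ambiguous author, machine learning group (toy)
    "M_Varma R_Agrawal",  # ambiguous author, machine learning group (toy)
]

X = CountVectorizer(token_pattern=r"\S+").fit_transform(papers)   # coauthor-name space
X_red = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X_red)
print(labels)   # papers with the same label are attributed to the same person
```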
| Examining lists on Twitter to uncover relationships between following, membership and subscription | | BIBA | Full-Text | 673-676 | |
| Srikar Velichety; Sudha Ram | |||
| We report on an exploratory analysis of pairwise relationships between three different forms of information consumption on Twitter, viz. following, listing, and subscribing. We develop a systematic framework to examine the relationships between these three forms. Using our framework, we conducted an empirical analysis of a dataset from Twitter. Our results show that people consume information not only by explicitly following others, but also by listing and subscribing to lists, and that the people they list or subscribe to are not the same as the ones they follow. Our work has implications for understanding information propagation and diffusion via Twitter and for generating recommendations for adding users to lists, subscribing to lists, and merging or splitting them. | |||
| PHDA'13 Welcome and organization | |||
| A proposal for automatic diagnosis of malaria: extended abstract | | BIBA | Full-Text | 681-682 | |
| Allisson D. Oliveira; Giordano Cabral; D. López; Caetano Firmo; F. Zarzuela Serrat; J. Albuquerque | |||
| This paper presents a methodology for automatic diagnosis of malaria using computer vision techniques combined with artificial intelligence. We obtained an accuracy rate of 74% with the detection system. | |||
| Vaccine attitude surveillance using semantic analysis: constructing a semantically annotated corpus | | BIBA | Full-Text | 683-686 | |
| Stephanie Brien; Nona Naderi; Arash Shaban-Nejad; Luke Mondor; Doerthe Kroemker; David L. Buckeridge | |||
| This paper reports work in progress to semantically annotate blog posts about vaccines to use in the Vaccine Attitude Surveillance using Semantic Analysis (VASSA) framework. The VASSA framework combines semantic web and natural language processing (NLP) tools and techniques to provide a coherent semantic layer across online social media for assessment and analysis of vaccination attitudes and beliefs. We describe how the blog posts were sampled and selected, our schema to semantically annotate concepts defined in our ontology, details of the annotation process, and inter-annotator agreement on a sample of blog posts. | |||
| A roadmap to integrated digital public health surveillance: the vision and the challenges | | BIBA | Full-Text | 687-694 | |
| Patty Kostkova | |||
| The exponentially increasing stream of real time big data produced by Web
2.0 Internet and mobile networks created radically new interdisciplinary
challenges for public health and computer science. Traditional public health
disease surveillance systems have to utilize the potential created by new
situation-aware realtime signals from social media, mobile/sensor networks and
citizens' participatory surveillance systems providing invaluable free realtime
event-based signals for epidemic intelligence. However, rather than improving
existing isolated systems, an integrated solution bringing together existing
epidemic intelligence systems scanning news media (e.g., GPHIN, MedISys) with
real-time social media intelligence (e.g., Twitter, participatory systems) is
required to substantially improve and automate early warning, outbreak
detection and preparedness operations. However, automatic monitoring and novel
verification methods for these multichannel event-based real time signals have
to be integrated with traditional case-based surveillance systems from
microbiological laboratories and clinical reporting. Finally, the system needs
to effectively support coordination of epidemiological teams, risk communication
with citizens and implementation of prevention measures.
However, from a computational perspective, signal detection, analysis and verification of very high-noise real-time big data provide a number of interdisciplinary challenges for computer science. Novel approaches integrating current systems into a digital public health dashboard can enhance signal verification methods and automate the processes assisting public health experts in providing a better informed and more timely response. In this paper, we describe the roadmap to such a system, the components of an integrated public health surveillance service and the computing challenges to be resolved to create an integrated real-world solution. | |||
| Participatory disease surveillance in Latin America | | BIBA | Full-Text | 695-696 | |
| Michael Johansson; Oktawia Wojcik; Rumi Chunara; Mark Smolinski; John Brownstein | |||
| Participatory disease surveillance systems are dynamic, sensitive, and accurate. They also offer an opportunity to directly connect the public to public health. Implementing them in Latin America requires targeting multiple acute febrile illnesses, designing a system that is appropriate and scalable, and developing local strategies for encouraging participation. | |||
| Crowdsourced risk factors of influenza-like-illness in Mexico | | BIBA | Full-Text | 697-698 | |
| Natalia Barbara Mantilla-Beniers; Rocio Rodriguez-Ramirez; Christopher Rhodes Stephens | |||
| Monitoring of influenza like illnesses (ILI) using the Internet has become
more common since its beginnings nearly a decade ago. The initial project of
Der Grote Griep Meting was launched in 2003 in the Netherlands and Belgium. It
was designed as a means of engaging people in matters of scientific and public
health importance, and indeed attracted participation from over 30,000 people
in its first year. Its success thus gathered a wealth of potentially valuable
epidemiological data complementary to those obtained through the established
disease surveillance networks, and linked to rich background information on
each participant. Since then, there has been an accelerated increase in the
number of countries hosting similar websites, and many of these have generated
rather promising results.
In this talk, an analysis of the data from the Mexican monitoring website "Reporta" is presented, and the risk factors that are linked to reporting of ILI symptoms among its participants are determined and analyzed. The database gathered from the launch of Reporta in May 2009 to September 2011 is used for this purpose. The definition of a suspect ILI case employed by the Mexican Health Ministry is applied to distinguish a class C of participants; the traits gathered in the background questionnaire are labeled Xi. The risk associated with any given trait Xi is evaluated by considering the difference between the frequency with which C occurs among participants with trait Xi and in the general population. This difference is then normalized to assess its statistical significance. Interestingly, while some of the results confirm the suspected importance of certain traits indicative of enhanced susceptibility or a large contact network, others are unexpected and must be interpreted within an adequate framework. Thus, a taxonomy of background traits is proposed to aid interpretation, and tested through a new assessment of the associated risks. This work illustrates a way in which Internet-based monitoring can contribute to our understanding of disease spread. | |||
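The risk measure described in this abstract amounts to comparing the frequency of suspect ILI cases among carriers of a trait with the frequency in the whole sample and normalizing the difference. As a hedged illustration only (the paper's exact normalization is not given here), the sketch below uses a two-proportion z-score; all counts are hypothetical.

```python
from math import sqrt

def trait_risk_z(cases_trait, n_trait, cases_all, n_all):
    """Normalized difference between the ILI-case frequency among participants
    with a trait and the frequency in the whole sample (two-proportion z-score,
    one plausible choice of normalization)."""
    p_trait = cases_trait / n_trait          # frequency of class C given the trait
    p_all = cases_all / n_all                # frequency of class C overall
    # pooled standard error of the difference in proportions
    p_pool = (cases_trait + cases_all) / (n_trait + n_all)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_trait + 1 / n_all))
    return (p_trait - p_all) / se if se > 0 else 0.0

# Hypothetical example: 120 of 800 trait carriers vs 900 of 10,000 overall.
print(trait_risk_z(120, 800, 900, 10_000))
```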
| Validating models for disease detection using Twitter | | BIBA | Full-Text | 699-702 | |
| Todd Bodnar; Marcel Salathé | |||
| Data mining social media has become a valuable resource for infectious disease surveillance. However, there are considerable risks associated with incorrectly predicting an epidemic. The large amount of social media data combined with the small amount of ground truth data and the general dynamics of infectious diseases present unique challenges when evaluating model performance. In this paper, we look at several methods that have been used to assess influenza prevalence using Twitter. We then validate them with tests that are designed to avoid and illustrate issues with the standard k-fold cross validation method. We also find that small modifications to the way that data are partitioned can have major effects on a model's reported performance. | |||
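To illustrate the abstract's point that small changes to how the data are partitioned can change a model's reported performance, the sketch below contrasts shuffled k-fold cross-validation with a forward-chaining (time-ordered) split on a synthetic autocorrelated "prevalence" series. This is not the authors' validation protocol, only a minimal reproduction of the partitioning effect under assumed data.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical weekly "ILI prevalence" with strong autocorrelation.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=106))
# Use the two previous weeks as features, a stand-in for Twitter-derived signals.
X = np.column_stack([y[1:-1], y[:-2]])
y = y[2:]

def mean_cv_r2(splitter):
    scores = []
    for train, test in splitter.split(X):
        model = LinearRegression().fit(X[train], y[train])
        scores.append(r2_score(y[test], model.predict(X[test])))
    return float(np.mean(scores))

# Shuffled k-fold interleaves past and future weeks, which tends to inflate
# scores for autocorrelated epidemic data; forward-chaining CV does not.
print("shuffled 5-fold :", mean_cv_r2(KFold(5, shuffle=True, random_state=0)))
print("forward-chaining:", mean_cv_r2(TimeSeriesSplit(5)))
```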
| Combining Twitter and media reports on public health events in MedISys | | BIBA | Full-Text | 703-718 | |
| Erik van der Goot; Hristo Tanev; Jens P. Linge | |||
| We describe the harvesting and subsequent analysis of tweets that are linked to media reports on public health events in order to identify which Internet resources are being referred to in these tweets. The aim was to automatically detect resources that are traditionally not considered mainstream media, but play a role in the discussion of public health events on the Internet. Interestingly, our initial evaluation of the results showed that most references related to public health events lead to traditional news media sites, even though URLs to non-traditional media receive a higher rank. We will briefly describe the Medical Information System (MedISys) and the methodology used to obtain and analyse tweets. | |||
| PSOM'13 Welcome and organization | |||
| Preserving user privacy from third-party applications in online social networks | | BIBA | Full-Text | 723-728 | |
| Yuan Cheng; Jaehong Park; Ravi Sandhu | |||
| Online social networks (OSNs) facilitate many third-party applications (TPAs) that offer users additional functionality and services. However, they also pose serious user privacy risk as current OSNs provide little control over disclosure of user data to TPAs. Addressing the privacy and security issues related to TPAs (and the underlying social networking platforms) requires solutions beyond a simple all-or-nothing strategy. In this paper, we outline an access control framework that provides users flexible controls over how TPAs can access user data and activities in OSNs while still retaining the functionality of TPAs. The proposed framework specifically allows TPAs to utilize some private data without actually transmitting this data to TPAs. Our approach determines access from TPAs based on user-specified policies in terms of relationships between the user and the application. | |||
| Faking Sandy: characterizing and identifying fake images on Twitter during Hurricane Sandy | | BIBA | Full-Text | 729-736 | |
| Aditi Gupta; Hemank Lamba; Ponnurangam Kumaraguru; Anupam Joshi | |||
| In today's world, online social media play a vital role during real-world events, especially crisis events. There are both positive and negative effects of social media coverage of events: it can be used by authorities for effective disaster management or by malicious entities to spread rumors and fake news. The aim of this paper is to highlight the role of Twitter during Hurricane Sandy (2012) in spreading fake images about the disaster. We identified 10,350 unique tweets containing fake images that were circulated on Twitter during Hurricane Sandy. We performed a characterization analysis to understand the temporal, social reputation and influence patterns of the spread of fake images. Eighty-six percent of the tweets spreading the fake images were retweets; hence very few were original tweets. Our results showed that the top thirty users out of 10,215 (0.3%) accounted for 90% of the retweets of fake images; network links such as Twitter follower relationships contributed very little (only 11%) to the spread of these fake-photo URLs. Next, we used classification models to distinguish fake images from real images of Hurricane Sandy. The best results were obtained with a Decision Tree classifier, which achieved 97% accuracy in separating fake images from real ones. Tweet-based features were very effective in distinguishing tweets with fake images from those with real ones, while the performance of user-based features was very poor. Our results show that automated techniques can be used to separate real images from fake images posted on Twitter. | |||
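A minimal sketch of the kind of classification step mentioned in this abstract: a decision tree trained on tweet-based features to separate tweets carrying fake images from the rest. The feature set and labelling rule below are hypothetical toys, not the features used in the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy data: tweet-based features [tweet length, word count, hashtag count,
# contains URL (0/1), is retweet (0/1)]; label 1 = tweet carries a fake image.
rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([
    rng.integers(10, 140, n),     # tweet length
    rng.integers(2, 30, n),       # word count
    rng.integers(0, 5, n),        # hashtag count
    rng.integers(0, 2, n),        # contains URL
    rng.integers(0, 2, n),        # is retweet
]).astype(float)
# Toy labelling rule so the tree has something to learn.
y = ((X[:, 4] == 1) & (X[:, 2] >= 2)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
clf = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```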
| A pilot study of cyber security and privacy related behavior and personality traits | | BIBA | Full-Text | 737-744 | |
| Tzipora Halevi; James Lewis; Nasir Memon | |||
| Recent research has begun to focus on the factors that cause people to
respond to phishing attacks as well as affect user behavior on social networks.
This study examines the correlation between the Big Five personality traits and
email phishing response. Another aspect examined is how these factors relate to
users' tendency to share information and protect their privacy on Facebook
(which is one of the most popular social networking sites).
This research shows that when a prize phishing email is used, neuroticism is the factor most correlated with responding to it, in addition to a gender-based difference in the response. This study also found that people who score high on the openness factor tend both to post more information on Facebook and to have less strict privacy settings, which may make them susceptible to privacy attacks. In addition, this work detected no correlation between participants' estimates of their vulnerability to phishing attacks and actually being phished, which suggests that susceptibility to phishing is not due to a lack of awareness of phishing risks and that real-time responses to phishing are hard for online users to predict in advance. The goal of this study is to better understand the traits that contribute to online vulnerability, for the purpose of developing customized user interfaces and security awareness education designed to increase users' privacy and security in the future. | |||
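The correlation analysis described here pairs a continuous personality score with a binary outcome (responded to the phishing email or not), for which a point-biserial correlation is a natural fit. The sketch below is illustrative only; the scores and responses are simulated, not the study's data.

```python
import numpy as np
from scipy.stats import pointbiserialr

# Hypothetical data: neuroticism scores (1-5 scale) and whether each
# participant responded to the prize phishing email (1) or not (0).
rng = np.random.default_rng(2)
neuroticism = rng.uniform(1, 5, size=100)
responded = (rng.uniform(size=100) < (neuroticism - 1) / 8).astype(int)

r, p = pointbiserialr(responded, neuroticism)
print(f"point-biserial r = {r:.2f}, p = {p:.3f}")
```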
| Twitter (R)evolution: privacy, free speech and disclosure | | BIBA | Full-Text | 745-750 | |
| Lilian Edwards; Andrea M. Matwyshyn | |||
| Using Twitter as a case study, this paper sets forth the legal tensions faced by social networks that seek to defend the privacy interests of users. Recent EC and UN initiatives have begun to suggest an increased role for corporations as protectors of human rights. But, as yet, binding rather than voluntary obligations of this kind under international human rights law seem either non-existent or highly conflicted, and structural limitations to such a shift may currently exist under both U.S. and UK law. Companies do not face decisions regarding disclosure in a vacuum; rather, they face them constrained by existing obligations under (sometimes conflicting) legal demands. Yet, companies such as Twitter are well-positioned to be advocates for consumers' interests in these legal debates. Using several recent corporate disclosure decisions regarding user identity as illustration, this paper places questions of privacy, free speech and disclosure in a broader legal context. More scholarship is needed on the mechanics of how online intermediaries, especially social media, manage their position as crucial speech platforms in democratic as well as less democratic regimes. | |||
| How to hack into Facebook without being a hacker | | BIBA | Full-Text | 751-754 | |
| Tarun Parwani; Ramin Kholoussi; Panagiotis Karras | |||
| The proliferation of online social networking services has aroused privacy
concerns among the general public. The focus of such concerns has typically
revolved around providing explicit privacy guarantees to users and letting
users take control of the privacy-threatening aspects of their online behavior,
so as to ensure that private personal information and materials are not made
available to other parties and not used for unintended purposes without the
user's consent. As such protective features are usually opt-in, users have to
explicitly opt in to them in order to avoid compromising their privacy.
Besides, third-party applications may acquire a user's personal information,
but only after they have been granted consent by the user. If we also consider
potential network security attacks that intercept or misdirect a user's online
communication, it would appear that the discussion of user vulnerability has
accurately delimited the ways in which a user may be exposed to privacy
threats.
In this paper, we expose and discuss a previously unconsidered avenue by which a user's privacy can be gravely exposed. Using this exploit, we were able to gain complete access to some popular online social network accounts without using any conventional method like phishing, brute force, or trojans. Our attack merely involves a legitimate exploitation of the vulnerability created by the existence of obsolete web-based email addresses. We present the results of an experimental study on the spread that such an attack can reach, and the ethical dilemmas we faced in the process. Last, we outline our suggestions for defense mechanisms that can be employed to enhance online security and thwart the kind of attacks that we expose. | |||
| A cross-cultural framework for protecting user privacy in online social media | | BIBA | Full-Text | 755-762 | |
| Blase Ur; Yang Wang | |||
| Social media has become truly global in recent years. We argue that support for users' privacy, however, has not been extended equally to all users from around the world. In this paper, we survey existing literature on cross-cultural privacy issues, giving particular weight to work specific to online social networking sites. We then propose a framework for evaluating the extent to which social networking sites' privacy options are offered and communicated in a manner that supports diverse users from around the world. One aspect of our framework focuses on cultural issues, such as norms regarding the use of pseudonyms or posting of photographs. A second aspect of our framework discusses legal issues in cross-cultural privacy, including data-protection requirements and questions of jurisdiction. The final part of our framework delves into user expectations regarding the data-sharing practices and the communication of privacy information. The framework can enable service providers to identify potential gaps in support for user privacy. It can also help researchers, regulators, or consumer advocates reason systematically about cultural differences related to privacy in social media. | |||
| Privacy nudges for social media: an exploratory Facebook study | | BIBA | Full-Text | 763-770 | |
| Yang Wang; Pedro Giovanni Leon; Kevin Scott; Xiaoxuan Chen; Alessandro Acquisti; Lorrie Faith Cranor | |||
| Anecdotal evidence and scholarly research have shown that a significant portion of Internet users experience regrets over their online disclosures. To help individuals avoid regrettable online disclosures, we employed lessons from behavioral decision research and research on soft paternalism to design mechanisms that "nudge" users to consider the content and context of their online disclosures before posting them. We developed three such privacy nudges on Facebook. The first nudge provides visual cues about the audience for a post. The second nudge introduces time delays before a post is published. The third nudge gives users feedback about their posts. We tested the nudges in a three-week exploratory field trial with 21 Facebook users, and conducted 13 follow-up interviews. Our system logs, results from exit surveys, and interviews suggest that privacy nudges could be a promising way to prevent unintended disclosure. We discuss limitations of the current nudge designs and future directions for improvement. | |||
| RAMSS'13 Welcome and organization | |||
| Real-time user modeling and prediction: examples from YouTube | | BIBA | Full-Text | 775-776 | |
| Ramesh R. Sarukkai | |||
| Real-time analysis and modeling of users for improving engagement and interaction is a burgeoning area of interest, with applications to web sites, social networks and mobile applications. Apart from scalability issues, this domain poses a number of modeling and algorithmic challenges. In this talk, as an illustrative example, we present DAL, a system that leverages real-time user activity/signals for dynamic ad loads and is designed to improve the overall user experience on YouTube. This system uses machine learning to optimize for user activity during a visit and helps decide on real-time advertising policies dynamically for the user. We conclude the talk with challenges and opportunities in this important area of real-time user analysis and social modeling. | |||
| SAMOA: a platform for mining big data streams | | BIBA | Full-Text | 777-778 | |
| Gianmarco De Francisci Morales | |||
| Social media and user generated content are causing an ever growing data
deluge. The rate at which we produce data is growing steadily, thus creating
larger and larger streams of continuously evolving data. Online news,
micro-blogs, search queries are just a few examples of these continuous streams
of user activities. The value of these streams lies in their freshness and
relatedness to ongoing events. However, current (de-facto standard) solutions
for big data analysis are not designed to deal with evolving streams.
In this talk, we offer a sneak preview of SAMOA, an upcoming platform for mining big data streams. SAMOA is a platform for online mining in a cluster/cloud environment. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as S4 and Storm. SAMOA includes algorithms for the most common machine learning tasks such as classification and clustering. Finally, SAMOA will soon be open sourced in order to foster collaboration and research on big data stream mining. | |||
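SAMOA itself is a Java platform that runs on engines such as S4 and Storm, so the snippet below is only a language-agnostic illustration, in Python, of the test-then-train (prequential) loop that online stream learners of this kind typically follow. The synthetic stream and the use of scikit-learn's incremental SGDClassifier are assumptions made for the sake of a runnable example, not SAMOA's API.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Prequential (test-then-train) evaluation over a synthetic binary stream.
rng = np.random.default_rng(3)
model = SGDClassifier(random_state=3)
classes = np.array([0, 1])
correct = 0

for t in range(10_000):
    x = rng.normal(size=(1, 10))              # one arriving instance
    y = np.array([int(x[0, 0] + x[0, 1] > 0)])
    if t > 0:                                 # test first ...
        correct += int(model.predict(x)[0] == y[0])
    model.partial_fit(x, y, classes=classes)  # ... then train incrementally
print("prequential accuracy:", correct / 9999)
```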
| Towards real-time collaborative filtering for big fast data | | BIBA | Full-Text | 779-780 | |
| Ernesto Diaz-Aviles; Wolfgang Nejdl; Lucas Drumond; Lars Schmidt-Thieme | |||
| The Web of people is highly dynamic, and the life experiences spanning our on-line and "real-world" interactions are increasingly interconnected. For example, users engaged in the Social Web increasingly rely upon continuous social streams for real-time access to information and fresh knowledge about current affairs. However, given the deluge of data items, it is a challenge for individuals to find relevant and appropriately ranked information at the right time. Using Twitter as a test bed, we tackle this information overload problem by following an online collaborative approach. That is, we go beyond the general perspective of information finding on Twitter, which asks "What is happening right now?", towards an individual user perspective, and ask "What is interesting to me right now within the social media stream?". In this paper, we review our recently proposed online collaborative filtering algorithms and outline potential research directions. | |||
| Detecting real-time burst topics in microblog streams: how sentiment can help | | BIBA | Full-Text | 781-782 | |
| Lumin Zhang; Yan Jia; Bin Zhou; Yi Han | |||
| Microblogs have become an increasingly valuable resource of up-to-date topics about what is happening in the world. In this paper, we propose a novel approach to detecting real-time events in microblog streams based on bursty sentiment detection. Instead of traditional sentiment orientations like positive, negative and neutral, we use a sentiment vector as our sentiment model to abstract subjective messages, which are then used to detect bursts and clustered into new events. Experimental evaluations show that our approach performs effectively for online event detection. Although we worked with Chinese in our research, the technique can be used with any other language. | |||
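A hedged sketch of the burst-detection idea: flag a time bin when the volume of strongly subjective messages jumps well above its recent rolling baseline. The rolling z-score rule and the synthetic counts below are assumptions for illustration, not the paper's sentiment-vector model.

```python
import numpy as np

def detect_bursts(series, window=24, z_thresh=3.0):
    """Flag time bins whose value exceeds the rolling mean of the previous
    `window` bins by more than `z_thresh` standard deviations."""
    bursts = []
    for t in range(window, len(series)):
        hist = series[t - window:t]
        mu, sigma = hist.mean(), hist.std()
        if sigma > 0 and (series[t] - mu) / sigma > z_thresh:
            bursts.append(t)
    return bursts

# Hypothetical hourly counts of strongly negative messages about one topic.
rng = np.random.default_rng(4)
counts = rng.poisson(20, size=200).astype(float)
counts[150:155] += 120                        # injected burst
print("burst bins:", detect_bursts(counts))
```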
| Sub-event detection during natural hazards using features of social media data | | BIBA | Full-Text | 783-788 | |
| Dhekar Abhik; Durga Toshniwal | |||
| Social networking sites such as Flickr, YouTube, Facebook, etc. contain a huge amount of user-contributed data for a variety of real-world events. These events can be natural calamities such as earthquakes, floods, forest fires, etc. or man-made hazards like riots. This work focuses on gaining better knowledge about a natural hazard event using the data available from social networking sites. Rescue and relief activities in emergency situations can be enhanced by identifying sub-events of a particular event. Traditional topic discovery techniques used for event identification in news data cannot be used for social media data because social network data may be unstructured. To address this problem, the features or metadata associated with social media data can be exploited. These features can be user-provided annotations (e.g., title, description) and automatically generated information (e.g., content creation time). Considerable improvement in performance is observed by using multiple features of social media data for sub-event detection rather than an individual feature. The proposed method is a two-step process. In the first step, clusters are formed from social network data using each relevant feature individually, and weights are assigned to the features based on their significance. In the second step, all the clustering solutions formed in the first step are combined in a principled, weighted manner to give the final clustering solution. Each cluster represents a sub-event of a particular natural hazard. | |||
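The two-step procedure described above (cluster on each feature individually, then combine the clusterings with feature weights) can be illustrated with a weighted co-association matrix. The views, weights and cluster counts below are hypothetical; this is a sketch of the general technique, not the paper's exact algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Step 1: cluster items separately on each (toy) feature view.
# Step 2: combine the clusterings into a weighted co-association matrix and
# cluster that matrix to obtain the final sub-event assignment.
rng = np.random.default_rng(5)
n = 60
views = {                        # hypothetical feature views with weights
    "title_tfidf": (rng.normal(size=(n, 8)), 0.5),
    "description": (rng.normal(size=(n, 8)), 0.3),
    "timestamp":   (rng.normal(size=(n, 1)), 0.2),
}

coassoc = np.zeros((n, n))
for features, weight in views.values():
    labels = KMeans(n_clusters=4, n_init=10, random_state=5).fit_predict(features)
    same = (labels[:, None] == labels[None, :]).astype(float)
    coassoc += weight * same     # items clustered together add their view's weight

dist = 1.0 - coassoc             # distance = 1 - weighted agreement
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
final = fcluster(Z, t=4, criterion="maxclust")
print(final[:10])
```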
| MediaFinder: collect, enrich and visualize media memes shared by the crowd | | BIBA | Full-Text | 789-790 | |
| Raphaël Troncy; Vuk Milicic; Giuseppe Rizzo; José Luis Redondo García | |||
| Social networks play an increasingly important role in sharing media items related to people's activities, feelings, emotions and conversations, opening a window to the world in real time. However, these images and videos are spread over multiple social networks. In this paper, we first describe a so-called media server that collects recent images and videos which can potentially be attached to an event. These media items can then be used for the automatic generation of visual summaries. However, making sense out of the resulting media galleries is an extremely challenging task. We present a framework that leverages (i) visual features from media items for near-deduplication and (ii) textual features from status updates to enrich, cluster and generate storyboards. A prototype is publicly available at http://mediafinder.eurecom.fr. | |||
| MJ no more: using concurrent wikipedia edit spikes with social network plausibility checks for breaking news detection | | BIBA | Full-Text | 791-794 | |
| Thomas Steiner; Seth van Hooland; Ed Summers | |||
| We have developed an application called Wikipedia Live Monitor that monitors article edits on different language versions of Wikipedia -- as they happen in realtime. Wikipedia articles in different languages are highly interlinked. For example, the English article "en:2013_Russian_meteor_event" on the topic of the February 15 meteoroid that exploded over the region of Chelyabinsk Oblast, Russia, is interlinked with "ru:Падение_метеорита_на_Урале_в_2013_году", the Russian article on the same topic. As we monitor multiple language versions of Wikipedia in parallel, we can exploit this fact to detect concurrent edit spikes of Wikipedia articles covering the same topics, both in only one, and in different languages. We treat such concurrent edit spikes as signals for potential breaking news events, whose plausibility we then check with full-text cross-language searches on multiple social networks. Unlike the reverse approach of monitoring social networks first, and potentially checking plausibility on Wikipedia second, the approach proposed in this paper has the advantage of being less prone to false-positive alerts, while being equally sensitive to true-positive events, however, at only a fraction of the processing cost. A live demo of our application is available online at the URL http://wikipedia-irc.herokuapp.com/; the source code is available under the terms of the Apache 2.0 license at https://github.com/tomayac/wikipedia-irc. | |||
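A toy sketch of the concurrent-edit-spike idea: keep a sliding window of edits per (topic, language) pair and flag a topic when enough languages spike at the same time. The thresholds and the topic mapping are hypothetical, and the plausibility check on social networks is omitted.

```python
from collections import defaultdict, deque

WINDOW = 300          # seconds
SPIKE_EDITS = 5       # edits within the window that count as a spike
MIN_LANGUAGES = 2     # concurrent spikes in at least this many languages

def breaking_news_candidates(events, topic_of):
    """events: iterable of (timestamp_seconds, language, article);
    topic_of maps (language, article) to a shared topic key."""
    recent = defaultdict(deque)               # (topic, language) -> edit timestamps
    for ts, lang, article in events:
        key = (topic_of[(lang, article)], lang)
        q = recent[key]
        q.append(ts)
        while q and ts - q[0] > WINDOW:       # drop edits outside the window
            q.popleft()
        topic = key[0]
        spiking = {lng for (top, lng), dq in recent.items()
                   if top == topic and len(dq) >= SPIKE_EDITS}
        if len(spiking) >= MIN_LANGUAGES:
            yield ts, topic, sorted(spiking)

# Toy usage with a two-language topic mapping.
topic_of = {("en", "2013_Russian_meteor_event"): "meteor",
            ("ru", "Падение_метеорита_на_Урале_в_2013_году"): "meteor"}
events = [(t, "en", "2013_Russian_meteor_event") for t in range(0, 100, 10)]
events += [(t + 1, "ru", "Падение_метеорита_на_Урале_в_2013_году") for t in range(0, 100, 10)]
events.sort()
print(next(breaking_news_candidates(events, topic_of)))
```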
| Real time discussion retrieval from Twitter | | BIBA | Full-Text | 795-800 | |
| Dmitrijs Milajevs; Gosse Bouma | |||
| While social media receive a lot of attention from the scientific community in general, there is little work on high-recall retrieval of messages relevant to a discussion. Hashtag-based search is widely used for data retrieval from social media. This work shows the limitations of this approach: the majority of the relevant messages do not contain any hashtag at all, and unpredictable hashtags are used as the conversation evolves over time. To overcome these limitations, we propose an alternative retrieval method. Given an input stream of messages as an example of the discussion, our method extracts the most relevant words from it and queries the social network for more messages with these words. Our method filters messages that do not belong to the discussion using an LDA topic model. We demonstrate this concept on manually built collections of tweets about major sport and music events. | |||
| SIMPLEX'13 Welcome and organization | |||
| Characterizing branching processes from sampled data | | BIBA | Full-Text | 805-812 | |
| Fabricio Murai; Bruno Ribeiro; Donald Towsley; Krista Gile | |||
| Branching processes model the evolution of populations of agents that randomly generate offspring (children). These processes, in particular Galton-Watson processes, are widely used to model biological, social, cognitive, and technological phenomena, such as the diffusion of ideas, knowledge, chain letters, viruses, and the evolution of humans through their Y-chromosome DNA or mitochondrial RNA. A practical challenge of modeling real phenomena using a Galton-Watson process is the choice of the offspring distribution, which must be measured from the population. In most cases, however, directly measuring the offspring distribution is unrealistic due to lack of resources or the death of agents. So far, researchers have relied on informed guesses to guide their choice of offspring distribution. In this work we propose two methods to estimate the offspring distribution from real sampled data. Using a small sampled fraction of the agents, instrumented with the identity of the ancestors of the sampled agents, we show that accurate offspring distribution estimates can be obtained by sampling as little as 14% of the population. | |||
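As a hedged illustration of the estimation setting only (a sampled fraction of agents, each reporting its parent), the sketch below simulates a Poisson Galton-Watson process and recovers the offspring mean from a 14% sample by correcting for the sampling probability. This naive moment estimate is an assumption for illustration and is not one of the two estimators proposed in the paper.

```python
import numpy as np

rng = np.random.default_rng(6)
MEAN_OFFSPRING, ROOTS, GENERATIONS, SAMPLE_P = 1.1, 200, 8, 0.14

# Simulate a Poisson Galton-Watson process with several independent roots,
# recording each agent's parent and generation.
parents = {i: None for i in range(ROOTS)}
generation = {i: 0 for i in range(ROOTS)}
current, next_id = list(range(ROOTS)), ROOTS
for g in range(1, GENERATIONS + 1):
    nxt = []
    for node in current:
        for _ in range(rng.poisson(MEAN_OFFSPRING)):
            parents[next_id], generation[next_id] = node, g
            nxt.append(next_id)
            next_id += 1
    current = nxt

# Sample ~14% of the agents; assume each sampled agent reports its parent.
sampled = [n for n in parents if rng.uniform() < SAMPLE_P]
sampled_children = {}
for n in sampled:
    if parents[n] is not None:
        sampled_children[parents[n]] = sampled_children.get(parents[n], 0) + 1

# Agents in generations 0..GENERATIONS-1 had a chance to reproduce; each child
# is observed with probability SAMPLE_P, so the observed average child count
# underestimates the offspring mean by exactly that factor.
eligible = [n for n in parents if generation[n] < GENERATIONS]
obs_mean = sum(sampled_children.get(n, 0) for n in eligible) / len(eligible)
print("naive estimate of the offspring mean:", round(obs_mean / SAMPLE_P, 2))
```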
| Resilience of dynamic overlays through local interactions | | BIBA | Full-Text | 813-820 | |
| Stefano Ferretti | |||
| This paper presents a self-organizing protocol for dynamic (unstructured P2P) overlay networks, which allows the overlay to react to the variability of node arrivals and departures. Through local interactions, the protocol prevents the departure of nodes from partitioning the overlay. We show that knowledge of first- and second-hop neighbours, plus a simple P2P interaction protocol, is sufficient to make unstructured networks resilient to node faults. A simulation assessment over different kinds of overlay networks demonstrates the viability of the proposal. | |||
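A toy sketch of the local-repair idea: when a node departs, its former neighbours use their knowledge of the departed node's other neighbours (their second-hop contacts through it) to re-link locally. The ring-style reconnection below is an assumption for illustration, not the paper's protocol.

```python
import networkx as nx

def depart(G, node):
    """Remove `node`; its former neighbours reconnect among themselves using
    only second-hop knowledge obtained through the departed node."""
    neighbours = list(G.neighbors(node))
    G.remove_node(node)
    if len(neighbours) < 2:
        return
    for i, u in enumerate(neighbours):        # each survivor links to the next
        v = neighbours[(i + 1) % len(neighbours)]
        if u != v:
            G.add_edge(u, v)                  # local repair, no global knowledge

G = nx.random_regular_graph(4, 50, seed=9)
for n in list(G.nodes())[:10]:                # ten departures
    depart(G, n)
print("still connected:", nx.is_connected(G))
```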
| Fast centrality-driven diffusion in dynamic networks | | BIBA | Full-Text | 821-828 | |
| Abraão Guimarães; Alex B. Vieira; Ana Paula Couto Silva; Artur Ziviani | |||
| Diffusion processes in complex dynamic networks arise, for instance, in data search, data routing, and information spreading. Therefore, understanding how to speed up the diffusion process is an important topic in the study of complex dynamic networks. In this paper, we shed light on how centrality measures and node dynamics, coupled with simple diffusion models, can help accelerate the cover time in dynamic networks. Using data from systems with different characteristics, we show that if dynamics is disregarded, network cover time is highly underestimated. Moreover, using centrality accelerates the diffusion process over a different set of complex dynamic networks when compared with the random walk approach. In the best case, centrality-driven diffusion covers 80% of the nodes 60% faster, i.e. when next-hop nodes are selected using centrality measures. Additionally, we also propose and present the first results on how link prediction can help speed up the diffusion process in dynamic networks. | |||
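A small sketch contrasting a plain random walk with a centrality-driven walk that prefers high-degree, not-yet-covered neighbours, measuring steps to cover 80% of the nodes. The graph model and the use of degree as the centrality measure are assumptions for illustration, not the paper's setup.

```python
import random
import networkx as nx

def diffusion_cover_steps(G, start, next_hop, target=0.8, max_steps=100_000):
    """Walk the graph choosing hops with `next_hop`; return the number of
    steps needed to visit `target` fraction of the nodes."""
    covered, current = {start}, start
    for step in range(1, max_steps + 1):
        current = next_hop(G, current, covered)
        covered.add(current)
        if len(covered) >= target * G.number_of_nodes():
            return step
    return max_steps

def random_hop(G, node, _covered):
    return random.choice(list(G.neighbors(node)))

def centrality_hop(G, node, covered):
    fresh = [n for n in G.neighbors(node) if n not in covered]
    if not fresh:
        return random.choice(list(G.neighbors(node)))   # escape covered regions
    return max(fresh, key=G.degree)                     # prefer high-degree hops

random.seed(7)
G = nx.barabasi_albert_graph(500, 3, seed=7)
print("random walk:", diffusion_cover_steps(G, 0, random_hop))
print("centrality :", diffusion_cover_steps(G, 0, centrality_hop))
```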
| Unveiling Zeus: automated classification of malware samples | | BIBA | Full-Text | 829-832 | |
| Abedelaziz Mohaisen; Omar Alrawi | |||
| Malware family classification is an age-old problem that many Anti-Virus (AV) companies have tackled. There are two common techniques used for classification, signature based and behavior based. Signature-based classification uses a common sequence of bytes that appears in the binary code to identify and detect a family of malware. Behavior-based classification uses artifacts created by malware during execution for identification. In this paper we report on a unique dataset we obtained from our operations and classified using several machine learning techniques following the behavior-based approach. The main class of malware we are interested in classifying is the popular Zeus malware. For its classification we identify 65 features that are unique and robust for identifying malware families. We show that artifacts like file system, registry, and network features can be used to identify distinct malware families with high accuracy -- in some cases as high as 95 percent. | |||
| Using link semantics to recommend collaborations in academic social networks | | BIBA | Full-Text | 833-840 | |
| Michele A. Brandão; Mirella M. Moro; Giseli Rabello Lopes; José P. M. Oliveira | |||
| Social network analysis (SNA) has been explored in many contexts with different goals. Here, we use concepts from SNA to recommend collaborations in academic networks. Recent work shows that research groups with well connected academic networks tend to be more prolific. Hence, recommending collaborations is useful for increasing a group's connections, boosting the group's research as a collateral advantage. In this work, we propose two new metrics for recommending new collaborations or the intensification of existing ones. Each metric considers a social principle (homophily and proximity) that is relevant within the academic context. The focus is to verify how these metrics influence the resulting recommendations. We also propose new metrics for evaluating the recommendations based on social concepts (novelty, diversity and coverage) that have never been used for such a goal. Our experimental evaluation shows that considering our new metrics improves the quality of the recommendations when compared to the state of the art. | |||
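A hedged sketch of metrics in the spirit of the two social principles named above: a homophily score from topic overlap and a proximity score from shared co-authors, combined into a ranking of candidate collaborations. The data, weights and exact formulas are hypothetical stand-ins, not the metrics proposed in the paper.

```python
from itertools import combinations

# Toy academic network: each researcher has a set of topics (for homophily)
# and a set of current co-authors (for proximity).
topics = {
    "alice": {"databases", "social networks"},
    "bob":   {"social networks", "recommender systems"},
    "carol": {"databases", "information retrieval"},
    "dave":  {"recommender systems", "social networks"},
}
coauthors = {
    "alice": {"carol"},
    "bob":   {"dave"},
    "carol": {"alice", "dave"},
    "dave":  {"bob", "carol"},
}

def homophily(a, b):
    """Jaccard overlap of research interests."""
    return len(topics[a] & topics[b]) / len(topics[a] | topics[b])

def proximity(a, b):
    """Fraction of the pair's co-authors that are shared (common neighbours)."""
    union = coauthors[a] | coauthors[b]
    return len(coauthors[a] & coauthors[b]) / len(union) if union else 0.0

def score(a, b, w_h=0.6, w_p=0.4):
    return w_h * homophily(a, b) + w_p * proximity(a, b)

# Rank not-yet-collaborating pairs by the combined score.
pairs = [(a, b) for a, b in combinations(topics, 2) if b not in coauthors[a]]
for a, b in sorted(pairs, key=lambda p: score(*p), reverse=True):
    print(f"{a}-{b}: {score(a, b):.2f}")
```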
| Addressing the privacy management crisis in online social networks | | BIBA | Full-Text | 841-842 | |
| Krishna P. Gummadi | |||
| The sharing of personal data has emerged as a popular activity over online social networking sites like Facebook. As a result, the issue of online social network privacy has received significant attention in both the research literature and the mainstream media. Our overarching goal is to improve defaults and provide better tools for managing privacy, but we are limited by the fact that the full extent of the privacy problem remains unknown; there is little quantification of the incidence of incorrect privacy settings or the difficulty users face when managing their privacy. In this talk, I will first focus on measuring the disparity between the desired and actual privacy settings, quantifying the magnitude of the problem of managing privacy. Later, I will discuss how social network analysis techniques can be leveraged towards addressing the privacy management crisis. | |||
| SNOW'13 Welcome and organization | |||
| Social media, journalism and the public | | BIBA | Full-Text | 847-848 | |
| Steve Schifferes | |||
| This paper draws on the parallels between the current period and other periods of historic change in journalism to examine what is new in today's world of social media and what continuities there are with the past. It examines the changing relationship between the public and the press and how it is being continuously reinterpreted. It addresses the question of whether we are at the beginning or the end of a process of revolutionary media change. | |||
| Weaving a safe web of news | | BIBA | Full-Text | 849-852 | |
| Kanak Biscuitwala; Willem Bult; Mathias Lécuyer; T. J. Purtell; Madeline K. B. Ross; Augustin Chaintreau; Chris Haseman; Monica S. Lam; Susan E. McGregor | |||
| The rise of social media and data-capable mobile devices in recent years has
transformed the face of global journalism, supplanting the broadcast news
anchor with a new source for breaking news: the citizen reporter. Social
media's decentralized networks and instant re-broadcasting mechanisms mean that
the reach of a single tweet can easily trump that of the most powerful
broadcast satellite. Brief, text-based and easy to translate, social messages
allow news audiences to skip the middleman and get news "straight from the
source."
Whether used by "citizen" or professional reporters, however, social media technologies can also pose risks that endanger these individuals and, by extension, the press as a whole. First, social media platforms are usually proprietary, leaving users' data and activities on the system open to scrutiny by collaborating companies and/or governments. Second, the networks upon which social media reporting relies are inherently fragile, consisting of easily targeted devices and relatively centralized message-routing systems that authorities may block or simply shut down. Finally, this same privileged access can be used to flood the network with inaccurate or discrediting messages, drowning the signal of real events in misleading noise. A citizen journalist can be anyone who is simply in the right place at the right time. Typically untrained and unevenly tech-savvy, citizen reporters are unaccustomed to thinking of their social media activities as high-risk, and may not consider the need to defend themselves against potential threats. Though often part of a crowd, they may have no formal affiliations; if targeted for retaliation, they may have nowhere to turn for help. The dangers citizen journalists face are personal and physical. They may be targeted in the act of reporting, and/or online through the tracking of their digital communications. Addressing their needs for protection, resilience, and recognition requires a move away from the major assumptions of in vitro communication security. For citizen journalists using social networks, the adversary is already inside, as the network itself may be controlled or influenced by the threatening party, while "outside" nodes, such as public figures, protest organizers, and other journalists can be trusted to handle content appropriately. In these circumstances there can be no seamless, guaranteed solution. Yet the need remains for technologies that improve the security of these journalists who in many cases may constitute a region's only independent press. In this paper, we argue that a comprehensive and collaborative effort is required to make publishing and interacting with news websites more secure. Journalists typically enjoy stronger legal protection at least in some countries, such as the United States. However, this protection may prove ineffective, as many online tools compromise source protection. In the remaining sections, we identify a set of discussion topics and challenges to encourage a broader research agenda aiming to address jointly the need for social features and security for citizens journalists and readers alike. We believe communication technologies should embrace the methods and possibilities of social news rather than treating this as a pure security problem. We briefly touch upon a related initiative, Dispatch, that focuses on providing security to citizen journalists for publisihing content. | |||
| Traffic prediction and discovery of news via news crowds | | BIB | Full-Text | 853-854 | |
| Carlos Castillo | |||
| Who broke the news?: an analysis on first reports of news events | | BIBA | Full-Text | 855-862 | |
| Matthias Gallé; Jean-Michel Renders; Eric Karstens | |||
| We present a data-driven study on which sources were the first to report on
news events. For this, we implemented a news-aggregator that included a large
number of established news sources and covered one year of data. We present a
novel framework that is able to retrieve a large number of events and not only
the most salient ones, while at the same time making sure that they are not
exclusively of local impact.
Our analysis then focuses on different aspects of the news cycle. In particular, we analyze which sources break most of the news. By looking at when certain events become bursty, we are able to perform a finer analysis on those events and the associated sources that dominate global news attention. Finally, we study the time it takes news outlets to report on these events and how this reflects different strategies of which news to report. A general finding of our study is that big news agencies remain an important threshold to cross to bring global attention to particular news, but it also shows the importance of outlets focused by region or topic. | |||
| Finding news curators in Twitter | | BIBA | Full-Text | 863-870 | |
| Janette Lehmann; Carlos Castillo; Mounia Lalmas; Ethan Zuckerman | |||
| Users interact with online news in many ways, one of them being sharing
content through online social networking sites such as Twitter. There is a
small but important group of users that devote a substantial amount of effort
and care to this activity. These users monitor a large variety of sources on a
topic or around a story, carefully select interesting material on this topic,
and disseminate it to an interested audience ranging from thousands to
millions. These users are news curators, and are the main subject of study of
this paper. We adopt the perspective of a journalist or news editor who wants
to discover news curators among the audience engaged with a news site.
We look at the users who shared a news story on Twitter and attempt to identify news curators who may provide more information related to that story. In this paper we describe how to find this specific class of curators, which we refer to as news story curators. We compute a set of features for each user and demonstrate that they can be used to automatically find relevant curators among the audience of two large news organizations. | |||
| Towards automatic assessment of the social media impact of news content | | BIBA | Full-Text | 871-874 | |
| Tom De Nies; Gerald Haesendonck; Fréderic Godin; Wesley De Neve; Erik Mannens; Rik Van de Walle | |||
| In this paper, we investigate the possibilities to estimate the impact the content of a news article has on social media, and in particular on Twitter. We propose an approach that makes use of captured and temporarily stored microposts found in social media, and compares their relevance to an arbitrary news article. These results are used to derive key indicators of the social media impact of the specified content. We describe each step of our approach, provide a first implementation, and discuss the most imminent challenges and discussion points. | |||
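One simple way to realize the comparison described here is to score stored microposts against the article text with TF-IDF cosine similarity and aggregate the scores into an impact indicator. The threshold and toy texts below are assumptions; the paper's actual relevance measure may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy illustration: score captured microposts against one news article by
# TF-IDF cosine similarity, then aggregate into a simple impact indicator.
article = "central bank raises interest rates to curb inflation"
microposts = [
    "rates going up again, thanks central bank",
    "inflation is out of control",
    "great match last night, what a goal",
    "curbing inflation by raising interest rates, bold move",
]

vec = TfidfVectorizer().fit([article] + microposts)
sims = cosine_similarity(vec.transform([article]), vec.transform(microposts))[0]

RELEVANCE_THRESHOLD = 0.2       # hypothetical cut-off
relevant = sims >= RELEVANCE_THRESHOLD
print("per-post relevance:", sims.round(2))
print("impact indicator (relevant posts):", int(relevant.sum()))
```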
| Verifying news on the social web: challenges and prospects | | BIBA | Full-Text | 875-878 | |
| Steve Schifferes; Nic Newman | |||
| The problem of verification is the key issue for journalists who use social media. This paper argues for the importance of a user-centered approach in finding solutions to this problem. Because journalists have different needs for different types of stories, there is no one magic bullet that can verify social media. Any tool will need to have a multi-faceted approach to the problem, and will have to be adjustable to suit the particular needs of individual journalists and news organizations. | |||
| Newspaper editors vs the crowd: on the appropriateness of front page news selection | | BIBA | Full-Text | 879-880 | |
| Arkaitz Zubiaga | |||
| The front page is the showcase that might condition whether one buys a newspaper, and so editors carefully select the news of the day that they believe will attract as many readers as possible. Little is known about the extent to which editors' criteria for front page news selection are appropriate so as to match the actual interests of the crowd. In this paper, we compare the news stories in The New York Times over the period of a year to their popularity on Twitter and Facebook. Our study questions the current news selection criteria, revealing that while editors focus on picking hard news such as politics for the front page, social media users are rather into soft news such as science and fashion. | |||
| SOCM'13 welcome and organization | |||
| Social machines: a unified paradigm to describe social web-oriented systems | | BIBA | Full-Text | 885-890 | |
| Vanilson Buregio; Silvio Meira; Nelson Rosa | |||
| Blending computational and social elements into software has gained significant attention in key conferences and journals. In this context, "Social Machines" appears as a promising model for unifying both computational and social processes. However, it is a fresh topic, with concepts and definitions coming from different research fields, making a unified understanding of the concept a somewhat challenging endeavor. This paper aims to investigate efforts related to this topic and build a preliminary classification scheme to structure the science of Social Machines. We provide a preliminary overview of this research area through the identification of the main visions, concepts, and approaches; we additionally examine the result of the convergence of existing contributions. With the field still in its early stage, we believe that this work can contribute to the process of providing a more common and coherent conceptual basis for understanding Social Machines as a paradigm. Furthermore, this study helps detect important research issues and gaps in the area. | |||
| Crime applications and social machines: crowdsourcing sensitive data | | BIBA | Full-Text | 891-896 | |
| Maire Byrne Evans; Kieron O'Hara; Thanassis Tiropanis; Craig Webber | |||
| The authors explore some issues with the United Kingdom (U.K.) crime reporting and recording systems which currently produce Open Crime Data. The availability of Open Crime Data seems to create a potential data ecosystem which would encourage crowdsourcing, or the creation of social machines, in order to counter some of these issues. While such solutions are enticing, we suggest that in fact the theoretical solution brings to light fairly compelling problems, which highlight some limitations of crowdsourcing as a means of addressing Berners-Lee's "social constraint." The authors present a thought experiment -- a Gedankenexperiment -- in order to explore the implications, both good and bad, of a social machine in such a sensitive space and suggest a Web Science perspective to pick apart the ramifications of this thought experiment as a theoretical approach to the characterisation of social machines. | |||
| Pseudonymity in social machines | | BIBA | Full-Text | 897-900 | |
| Ben Dalton | |||
| This paper describes the potential of systems in which many people collectively control a single constructed identity mediated by socio-technical networks. By looking to examples of identities that have spontaneously emerged from anonymous communities online, a model for pseudonym design in social machines is proposed. A framework of identity dimensions is presented as a means of exploring the functional types of identity encountered in social machines, and design guidelines are outlined that suggest possible approaches to this task. | |||
| Observing social machines part 1: what to observe? | | BIBA | Full-Text | 901-904 | |
| David De Roure; Clare Hooper; Megan Meredith-Lobay; Kevin Page; Ségolène Tarte; Don Cruickshank; Catherine De Roure | |||
| As a scoping exercise in the design of our Social Machines Observatory we consider the observation of Social Machines "in the wild", as illustrated through two scenarios. More than identifying and classifying individual machines, we argue that we need to study interactions between machines and observe them throughout their lifecycle. We suggest that purpose may be a key notion to help identify individual Social Machines in composed systems, and that mixed observation methods will be required. This exercise provides a basis for later work on how we instrument and observe the ecosystem. | |||
| Towards a classification framework for social machines | | BIBA | Full-Text | 905-912 | |
| Nigel R. Shadbolt; Daniel A. Smith; Elena Simperl; Max Van Kleek; Yang Yang; Wendy Hall | |||
| The state of the art in human interaction with computational systems blurs the line between computations performed by machine logic and algorithms, and those that result from input by humans, arising from their own psychological processes and life experience. Current socio-technical systems, known as "social machines", exploit the large-scale interaction of humans with machines, interactions that are motivated by numerous goals and purposes including financial gain, charitable aid, and simple fun. In this paper we explore the landscape of social machines, both past and present, with the aim of defining an initial classificatory framework. Through a number of knowledge elicitation and refinement exercises we have identified the polyarchical relationship between infrastructure, social machines, and large-scale social initiatives. Our initial framework describes classification constructs in the areas of contributions, participants, and motivation. We present an initial characterisation of some of the most popular social machines, as a demonstration of the use of the identified constructs. We believe that it is important to undertake an analysis of the behaviour and phenomenology of social machines, and of their growth and evolution over time. Our future work will seek to elicit additional opinions, classifications and validation from a wider audience, to produce a comprehensive framework for the description, analysis and comparison of social machines. | |||
| Linked data in crowdsourcing purposive social network | | BIBA | Full-Text | 913-918 | |
| Priyanka Singh; Nigel Shadbolt | |||
| The Internet is an easy medium for people to collaborate, and crowdsourcing is an efficient feature of the social web where people with common interests and expertise come together to solve specific problems through collective thinking and create a community. Crowdsourcing can also be used to filter important information out of large data sets and to remove spam, while gamification techniques reward users for their contributions and keep a sustainable environment for the growth of the community. Semantic web technologies can be used to structure the community data so that it can be combined, decentralized and used across platforms. Using such tools, knowledge can be enhanced, easily discovered and merged together. This paper discusses the concept of a purposive social network where people with similar interests and varied expertise come together, use crowdsourcing techniques to solve a common problem and build tools for a common purpose. The StackOverflow website is chosen to study such a purposive network, and the different network ties and user roles are studied. Linked Data is used for the disambiguation of keyword and topic names, easing the search for and discovery of experts in a field and providing useful information that is otherwise unavailable on the website. | |||
| A few thoughts on engineering social machines: extended abstract | | BIBA | Full-Text | 919-920 | |
| Markus Strohmaier | |||
| Social machines are integrated systems of people and computers. What
distinguishes social machines from other types of software systems -- such as
software for cars or air planes -- is the unprecedented involvement of data
about user behavior, -goals and -motivations into the software system's
structure. In social machines, the interaction between a user and the system is
mediated by the aggregation of explicit or implicit data from other users. This
is the case with systems where, for example, user data is used to suggest
search terms (e.g. Google Autosuggest), to recommend products (e.g. Amazon
recommendations), to aid navigation (e.g. tag-based navigation) or to filter
content (e.g. Digg.com). This makes social machines a novel class of software
systems (as opposed to for example safety-related software that is being used
in cars) and unique in a sense that potentially essential system properties and
functions -- such as navigability -- are dynamically influenced by aggregate
user behavior. Such properties cannot be satisfied through the implementation
of requirements alone; what is needed is regulation, i.e. a dynamic integration
of users' goals and behavior into the continuous process of engineering.
Functional and non-functional properties of software systems have been the subject of software engineering research for decades [1]. The notion of non-functional requirements (softgoals) captures a recognition by the software engineering community that software requirements can be subjective and interdependent; they can lack clear-cut success criteria, exhibit different priorities and require decomposition or operationalization. Resulting approaches to analyzing and designing software systems emphasize the role of users (or more generally: agents) in this process (such as [1]). i*, for example, has been used to capture and represent user goals during system design and run time. With the emergence of social machines, such as the WWW, and social-focussed applications running on top of the web, such as facebook.com, delicious.com and others, social machines and their emergent properties have become a crucial infrastructure for many aspects of our daily lives. To give an example: the navigability of the web depends on the behavior of web editors who are interlinking documents, or the usefulness of tags for classification depends on the tagging behavior of users [2]. The rise of social machines can be expected to fundamentally change the way in which such properties and functions of software systems are designed and maintained. Rather than planning for certain system properties (such as navigability, or usefulness for certain tasks) and functions at design time, the task of engineers is to build a platform which allows them to influence and regulate emergent user behavior in such a way that desired system attributes are achieved at run time. It is through the process of social computation, i.e. the combination of social behavior and algorithmic computation, that desired system properties and functions emerge. For a science of social machines, understanding specifically the relationship between individual and social behavior on the one hand, and desired system properties and functions on the other, is crucial. In order to maintain control, research must focus on understanding a wide variety of social machine properties such as semantic, intentional and navigational properties across different systems and applications including -- but not limited to -- social media. Summarizing, the full implications of the genesis of social machines for related domains including software engineering, knowledge acquisition or peer production systems are far from being well understood, and warrant future work. For example, the interaction between the pragmatics of such systems (how they are used) and the semantics emerging in those systems (what the words, symbols, etc. mean) is a fundamental issue that deserves greater attention. Equipping engineers of social machines with the right tools to achieve and maintain desirable system properties is a problem of practical relevance that needs to be addressed by future research. | |||
| The HTP model: understanding the development of social machines | | BIBA | Full-Text | 921-926 | |
| Ramine Tinati; Leslie Carr; Susan Halford; Catherine J. Pope | |||
| The Web represents a collection of socio-technical activities inter-operating using a set of common protocols and standards. Online banking, web TV, internet shopping, e-government and social networking are all different kinds of human interaction that have recently leveraged the capabilities of the Web architecture. Activities that have human and computer components are referred to as social machines. This paper introduces HTP, a socio-technical model to understand, describe and analyze the formation and development of social machines and other web activities. HTP comprises three components: heterogeneous networks of actors involved in a social machine; the iterative process of translation of the actors' activities into a temporarily stable and sustainable social machine; and the different phases of this machine's adaptation from one stable state to another as the surrounding networks restructure and global agendas ebb and flow. The HTP components are drawn from an interdisciplinary range of theoretical positions and concepts. HTP provides an analytical framework to explain why different Web activities remain stable and functional, whilst others fail. We illustrate the use of HTP by examining the formation of a classic social machine (Wikipedia), and the stabilization points corresponding to its different phases of development. | |||
| "the crowd keeps me in shape": social psychology and the present and future of health social machines | | BIBA | Full-Text | 927-932 | |
| Max Van Kleek; Daniel A. Smith; Wendy Hall; Nigel Shadbolt | |||
| Can the Web help people live healthier lives? This paper seeks to answer this question through an examination of sites, apps and online communities designed to help people improve their fitness, better manage their disease(s) and conditions, and to solve the often elusive connections between the symptoms they experience, diseases and treatments. These health social machines employ a combination of both simple and complex social and computational processes to provide such support. We first provide a descriptive classification of the kinds of machines currently available, and the support each class offers. We then describe the limitations exhibited by these systems and potential ways around them, towards the design of more effective machines in the future. | |||
| SRS'13 welcome and organization | |||
| How status and reputation shape human evaluations: consequences for recommender systems | | BIBA | Full-Text | 937-938 | |
| Jure Leskovec | |||
| Recommender systems are inherently driven by evaluations and reviews provided by the users of these systems. Understanding ways in which users form judgments and produce evaluations can provide insights for modern recommendation systems. Many online social applications include mechanisms for users to express evaluations of one another, or of the content they create. In a variety of domains, mechanisms for evaluation allow one user to say whether he or she trusts another user, or likes the content they produced, or wants to confer special levels of authority or responsibility on them. We investigate a number of fundamental ways in which user and item characteristics affect evaluations in online settings. For example, evaluations are not unidimensional but include multiple aspects that together contribute to a user's overall rating. We investigate methods for modeling attitudes and attributes from online reviews that help us better understand users' individual preferences. We also examine how to create a composite description of evaluations that accurately reflects some type of cumulative opinion of a community. Natural applications of these investigations include predicting evaluation outcomes based on user characteristics and estimating the chance of a favorable overall evaluation from a group knowing only the attributes of the group's members, but not their expressed opinions. | |||
| Large-scale social recommender systems: challenges and opportunities | | BIBA | Full-Text | 939-940 | |
| Mitul Tiwari | |||
| Online social networks have become very important for networking,
communication, sharing, and content discovery. Recommender systems play a
significant role on any online social network for engaging members, recruiting
new members, and recommending other members to connect with. This talk presents
challenges in recommender systems, graph analysis, social stream relevance and
virality on large-scale social networks such as LinkedIn, the largest
professional network with more than 200M members.
| First, social recommender systems for recommending jobs, groups, companies to follow, and other members to connect with are a very important part of a professional network like LinkedIn [1, 6, 7, 9]. Each of these entity recommender systems presents novel challenges in using social and member-generated data. Second, various problems, such as link prediction, visualizing connection networks, finding the strength of each connection, and finding the best path among members, require large-scale social graph analysis and present unique research opportunities [2, 5]. Third, social stream relevance and capturing virality in social products are crucial for engaging users on any online social network [4]. Finally, systems challenges must be addressed in scaling recommender systems on large-scale social networks [3, 8, 10]. This talk presents challenges and interesting problems in large-scale social recommender systems, and describes some of the solutions. | |||
| Signal-based user recommendation on Twitter | | BIBA | Full-Text | 941-944 | |
| Giuliano Arru; Davide Feltoni Gurini; Fabio Gasparetti; Alessandro Micarelli; Giuseppe Sansonetti | |||
| In recent years, social networks have become one of the best ways to access information. The ease with which users connect to each other and the opportunity provided by Twitter and other social tools to follow other people's activities are increasing the use of such platforms for gathering information. The amount of available digital data is at the core of the new challenges we now face. Social recommender systems can suggest both relevant content and users with common social interests. Our approach relies on a signal-based model, which explicitly includes a time dimension in the representation of user interests. Specifically, this model takes advantage of a signal processing technique, namely the wavelet transform, for defining an efficient pattern-based similarity function among users. Experimental comparisons with other approaches show the benefits of the proposed approach. | |||
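The abstract describes a pattern-based similarity computed over wavelet-transformed user signals. The sketch below is only an illustration of that general idea, not the authors' model: it assumes each user is represented by a fixed-length time series of activity around a topic, applies a Haar wavelet decomposition, and compares users by cosine similarity over the resulting coefficients.

```python
# Minimal sketch (assumptions): each user's interest in a topic is a
# length-2^k time series; a Haar wavelet transform summarizes its
# temporal pattern, and users are compared by cosine similarity of
# their coefficient vectors. The paper's actual model may differ.
import numpy as np

def haar_transform(signal):
    """Full Haar wavelet decomposition of a length-2^k signal."""
    coeffs = []
    s = np.asarray(signal, dtype=float)
    while len(s) > 1:
        avg = (s[0::2] + s[1::2]) / np.sqrt(2)   # approximation part
        diff = (s[0::2] - s[1::2]) / np.sqrt(2)  # detail part
        coeffs.append(diff)
        s = avg
    coeffs.append(s)
    return np.concatenate(coeffs[::-1])

def user_similarity(series_a, series_b):
    a, b = haar_transform(series_a), haar_transform(series_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Hypothetical daily activity counts for two users on the same topic.
u1 = [0, 1, 4, 9, 7, 3, 1, 0]
u2 = [1, 2, 5, 8, 6, 2, 0, 0]
print(user_similarity(u1, u2))
```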
| Generation of coalition structures to provide proper groups' formation in group recommender systems | | BIBA | Full-Text | 945-950 | |
| Lucas Augusto M. C. Carvalho; Hendrik T. Macedo | |||
| Group recommender systems usually provide recommendations to a fixed and predetermined set of members. In some situations, however, there is a set of people (N) that should be organized into smaller, cohesive groups so that more effective recommendations can be provided to each of them. This is not a trivial task. In this paper we propose an innovative approach for grouping people within the recommendation problem context. The problem is modeled as a coalitional game from Game Theory. The goal is to group people into exhaustive and disjoint coalitions so as to maximize the social welfare function of the group. The optimal coalition structure is the one with the highest summation over all social welfare values. Similarities between recommendation system users are used to define the social welfare function. We compare our approach with K-Means clustering for a dataset from Movielens. Results have shown that the proposed approach performs better than K-Means for both average group satisfaction and Davies-Bouldin index metrics when the number of coalitions found is not greater than 4 (K N = 12). | |||
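The coalition-structure objective described above can be illustrated with a brute-force sketch. This is not the paper's algorithm: it assumes a coalition's social welfare is the mean pairwise similarity of its members and simply enumerates all set partitions of a small user set, which is only feasible for tiny N.

```python
# Sketch under assumptions: welfare of a coalition = mean pairwise
# similarity of its members; the optimal coalition structure maximizes
# the sum of coalition welfares. Exhaustive enumeration is exponential
# (Bell-number many partitions), so this only illustrates the objective.
from itertools import combinations

def partitions(items):
    """Yield all set partitions of a list."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def welfare(coalition, sim):
    pairs = list(combinations(coalition, 2))
    if not pairs:
        return 0.0
    return sum(sim[a][b] for a, b in pairs) / len(pairs)

def best_structure(users, sim):
    return max(partitions(users),
               key=lambda struct: sum(welfare(c, sim) for c in struct))

# Hypothetical symmetric similarity matrix for four users (0..3).
sim = {0: {1: 0.9, 2: 0.1, 3: 0.2},
       1: {0: 0.9, 2: 0.2, 3: 0.1},
       2: {0: 0.1, 1: 0.2, 3: 0.8},
       3: {0: 0.2, 1: 0.1, 2: 0.8}}
print(best_structure([0, 1, 2, 3], sim))  # expected: {0,1} and {2,3}
```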
| Users' satisfaction in recommendation systems for groups: an approach based on noncooperative games | | BIBA | Full-Text | 951-958 | |
| Lucas Augusto Montalvão Costa Carvalho; Hendrik Teixeira Macedo | |||
| A major difficulty in a recommendation system for groups is to use a group aggregation strategy to ensure, among other things, the maximization of the average satisfaction of group members. This paper presents an approach based on the theory of noncooperative games to solve this problem. While group members can be seen as game players, the items for potential recommendation for the group comprise the set of possible actions. Achieving group satisfaction as a whole becomes, then, a problem of finding the Nash equilibrium. Experiments with a MovieLens dataset and a function of arithmetic mean to compute the prediction of group satisfaction for the generated recommendation have shown statistically significant results when compared to state-of-the-art aggregation strategies, in particular when evaluations among group members are more heterogeneous. The feasibility of this unique approach is shown by the development of an application for Facebook, which recommends movies to groups of friends. | |||
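A minimal sketch of the game-theoretic framing, under assumptions that are not taken from the paper: group members are players, candidate items are actions, and a hypothetical payoff combines a member's own predicted rating with a bonus for agreeing with the rest of the group; pure-strategy Nash equilibria are then found by brute force.

```python
# Sketch under assumptions: each group member (player) selects one
# candidate item (action); a hypothetical payoff rewards the member's
# own predicted rating and agreement with the other members. We
# brute-force pure-strategy Nash equilibria over all joint profiles.
from itertools import product

ratings = {  # hypothetical predicted ratings: member -> item -> score
    "ana":  {"m1": 4.5, "m2": 3.0, "m3": 2.0},
    "bob":  {"m1": 2.5, "m2": 4.0, "m3": 3.5},
    "carl": {"m1": 3.0, "m2": 3.5, "m3": 4.5},
}
members = list(ratings)
items = ["m1", "m2", "m3"]

def payoff(member, profile):
    chosen = profile[member]
    agreement = sum(profile[m] == chosen for m in members if m != member)
    return ratings[member][chosen] + 1.0 * agreement  # weight is arbitrary

def pure_nash_equilibria():
    equilibria = []
    for combo in product(items, repeat=len(members)):
        profile = dict(zip(members, combo))
        stable = all(
            payoff(m, profile) >= payoff(m, {**profile, m: alt})
            for m in members for alt in items
        )
        if stable:
            equilibria.append(profile)
    return equilibria

print(pure_nash_equilibria())
```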
| Recommending collaborators using keywords | | BIBA | Full-Text | 959-962 | |
| Sara Cohen; Lior Ebel | |||
| This paper studies the problem of recommending collaborators in a social network, given a set of keywords. Formally, given a query q, consisting of a researcher s (who is a member of a social network) and a set of keywords k (e.g., an article name or topic of future work), the collaborator recommendation problem is to return a high-quality ranked list of possible collaborators for s on the topic k. Extensive effort was expended to define ranking functions that take into consideration a variety of properties, including structural proximity to s, textual relevance to k, and importance. The effectiveness of our methods has been experimentally demonstrated over two large subsets of the social network determined by DBLP co-authorship data. The results show that the ranking methods developed in this paper work well in practice. | |||
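One plausible way to combine the three ranking ingredients named in the abstract (structural proximity, textual relevance, importance) is a weighted score; the sketch below uses illustrative component definitions and weights that are assumptions, not the paper's functions.

```python
# Illustrative sketch: score a candidate collaborator by a weighted sum
# of (i) inverse shortest-path distance to the querying researcher s,
# (ii) keyword overlap with the query k, and (iii) degree as a crude
# importance proxy. All definitions and weights are assumptions.
from collections import deque

def shortest_path_len(graph, src, dst):
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return float("inf")

def score(candidate, s, k, graph, keywords, w=(0.4, 0.4, 0.2)):
    proximity = 1.0 / (1.0 + shortest_path_len(graph, s, candidate))
    relevance = len(k & keywords.get(candidate, set())) / max(len(k), 1)
    importance = len(graph.get(candidate, ())) / max(
        len(v) for v in graph.values())
    return w[0] * proximity + w[1] * relevance + w[2] * importance

# Hypothetical co-authorship graph and author keyword profiles.
graph = {"s": {"a", "b"}, "a": {"s", "c"}, "b": {"s"}, "c": {"a"}}
keywords = {"a": {"recommender", "graphs"}, "b": {"databases"},
            "c": {"recommender", "ranking"}}
query = {"recommender", "ranking"}
print(sorted(["a", "b", "c"],
             key=lambda u: -score(u, "s", query, graph, keywords)))
```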
| A recommender system for job seeking and recruiting website | | BIBA | Full-Text | 963-966 | |
| Yao Lu; Sandy El Helou; Denis Gillet | |||
| In this paper, a hybrid recommender system for job seeking and recruiting websites is presented. The various interaction features designed on the website help the users organize the resources they need as well as express their interest. The hybrid recommender system exploits the job and user profiles and the actions undertaken by users in order to generate personalized recommendations of candidates and jobs. The data collected from the website is modeled using a directed, weighted, and multi-relational graph, and the 3A ranking algorithm is exploited to rank items according to their relevance to the target user. A preliminary evaluation is conducted based on simulated data and production data from a job hunting website in Switzerland. | |||
| Weighted slope one predictors revisited | | BIBA | Full-Text | 967-972 | |
| Danilo Menezes; Anisio Lacerda; Leila Silva; Adriano Veloso; Nivio Ziviani | |||
| Recommender systems are used to help people in specific life choices, like what items to buy, what news to read or what movies to watch. A relevant work in this context is the Slope One algorithm, which is based on the concept of differential popularity between items (i.e., how much better one item is liked than another). This paper proposes new approaches to extend Slope One based predictors for collaborative filtering, in which the predictions are weighted based on the number of users that co-rated items. We propose to improve collaborative filtering by exploiting the web of trust concept, as well as an item utility measure based on the error of predictions based on specific items to specific users. We performed experiments using three application scenarios, namely Movielens, Epinions, and Flixter. Our results demonstrate that, in most cases, exploiting the web of trust is beneficial to prediction performance, and improvements are reported when comparing the proposed approaches against the original Weighted Slope One algorithm. | |||
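For reference, a minimal sketch of the Weighted Slope One baseline that the paper extends: the deviation dev(j, i) is the average rating difference between items j and i over users who rated both, and predictions weight each deviation by the number of co-raters c(j, i). The trust- and utility-based extensions proposed in the paper are not reproduced here.

```python
# Weighted Slope One baseline: predict r(u, j) as a weighted average of
# (dev(j, i) + r(u, i)) over items i rated by u, weighted by the number
# of users c(j, i) who co-rated i and j.
from collections import defaultdict

def train(ratings):
    """ratings: {user: {item: rating}} -> (dev, count) tables."""
    dev = defaultdict(lambda: defaultdict(float))
    cnt = defaultdict(lambda: defaultdict(int))
    for user_ratings in ratings.values():
        for i, ri in user_ratings.items():
            for j, rj in user_ratings.items():
                if i == j:
                    continue
                dev[j][i] += rj - ri
                cnt[j][i] += 1
    for j in dev:
        for i in dev[j]:
            dev[j][i] /= cnt[j][i]
    return dev, cnt

def predict(user_ratings, j, dev, cnt):
    num = sum((dev[j][i] + user_ratings[i]) * cnt[j][i]
              for i in user_ratings if i in cnt[j])
    den = sum(cnt[j][i] for i in user_ratings if i in cnt[j])
    return num / den if den else None

# Hypothetical toy ratings.
ratings = {"u1": {"a": 5, "b": 3, "c": 2},
           "u2": {"a": 3, "b": 4},
           "u3": {"b": 2, "c": 5}}
dev, cnt = train(ratings)
print(predict({"a": 2, "c": 4}, "b", dev, cnt))
```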
| Profile diversity in search and recommendation | | BIBA | Full-Text | 973-980 | |
| Maximilien Servajean; Esther Pacitti; Sihem Amer-Yahia; Pascal Neveu | |||
| We investigate profile diversity, a novel idea in searching scientific documents. Combining keyword relevance with popularity in a scoring function has been the subject of different forms of social relevance [2, 6, 9]. Content diversity has been thoroughly studied in search and advertising [4, 11], database queries [16, 5, 8], and recommendations [17, 10, 18]. We believe our work is the first to investigate profile diversity to address the problem of returning highly popular but too-focused documents. We show how to adapt Fagin's threshold-based algorithms to return the most relevant and most popular documents that satisfy content and profile diversities and run preliminary experiments on two benchmarks to validate our scoring function. | |||
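The threshold-algorithm machinery being adapted can be sketched as follows. This shows plain Fagin-style top-k aggregation over per-criterion sorted lists with a monotone (sum) scoring function; the content- and profile-diversity constraints that are the paper's contribution are deliberately omitted.

```python
# Sketch of the classic threshold algorithm (TA): each criterion (e.g.,
# keyword relevance, popularity) provides a list sorted by that score;
# TA does sorted access in round-robin, random access to complete each
# seen document's aggregate, and stops once the current k-th best score
# reaches the threshold formed by the scores at the current depth.
import heapq

def threshold_topk(sorted_lists, score_of, k):
    """sorted_lists: per criterion, [(doc, score), ...] sorted descending.
    score_of[i][doc]: random access to doc's score on criterion i."""
    seen, topk = set(), []  # topk: min-heap of (aggregate, doc)
    depth = 0
    while depth < max(len(lst) for lst in sorted_lists):
        threshold = 0.0
        for i, lst in enumerate(sorted_lists):
            if depth >= len(lst):
                continue
            doc, s = lst[depth]
            threshold += s
            if doc not in seen:
                seen.add(doc)
                agg = sum(score_of[j].get(doc, 0.0)
                          for j in range(len(sorted_lists)))
                heapq.heappush(topk, (agg, doc))
                if len(topk) > k:
                    heapq.heappop(topk)
        if len(topk) == k and topk[0][0] >= threshold:
            break  # no unseen doc can beat the current top-k
        depth += 1
    return sorted(topk, reverse=True)

# Hypothetical scores: criterion 0 = relevance, criterion 1 = popularity.
score_of = [{"d1": 0.9, "d2": 0.6, "d3": 0.3},
            {"d1": 0.2, "d2": 0.8, "d3": 0.7}]
sorted_lists = [sorted(c.items(), key=lambda x: -x[1]) for c in score_of]
print(threshold_topk(sorted_lists, score_of, k=2))
```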
| Does social contact matter?: modelling the hidden web of trust underlying Twitter | | BIBA | Full-Text | 981-988 | |
| Mozhgan Tavakolifard; Kevin C. Almeroth; Jon Atle Gulla | |||
| Social recommender systems aim to alleviate the information overload problem on social network sites. The social network structure is often an important input to these recommender systems. Typically, this structure cannot be inferred directly from declared relationships among users. The goal of our work is to extract an underlying hidden and sparse network which more strongly represents the actual interactions among users. We study how to leverage Twitter activities like micro-blogging and the network structure to find a simple, efficient, but accurate method to infer and expand this hidden network. We measure and compare the performance of several different modeling strategies using a crawled data set from Twitter. Our results reveal that the structural similarity in the network generated by users' retweeting behavior outweighs the other discussed methods. | |||
| Understanding user spatial behaviors for location-based recommendations | | BIBA | Full-Text | 989-992 | |
| Jun Zhang; Chun-yuen Teng; Yan Qu | |||
| In this paper, we introduce a network-based method to study user spatial behaviors based on check-in histories. The results of this study have direct implications for location-based recommendation systems. | |||
| SWDM'13 welcome and organization | |||
| Disasters response using social life networks | | BIBA | Full-Text | 997-998 | |
| Ramesh C. Jain | |||
| Connecting people to required resources efficiently, effectively and promptly is one of the most important challenges for our society. During disasters it becomes a challenge of life and death. During disasters many normal sources of information for assessing situations, as well as for distributing vital information to individuals, break down. Unfortunately, during disastrous situations, most current practices are forced to follow bureaucratic processes and procedures that may delay help in critical life-and-death moments. Social media brings together different media as well as modes of distribution -- focused, narrowcast, and broadcast -- and has revolutionized communication among people. Mobile phones, equipped with myriads of sensors, are bringing the next generation of social networks, which connect people not only with other people but also with essential life resources based on the disaster situation and personal context. We believe that such Social Life Networks (SLN) may play a very important role in solving some essential human problems, including providing vital help to people during disasters. We will present the early design of such systems and a few examples explored in our group during disasters. Focused Micro Blogs (FMBs) will be discussed as a less noisy and more direct alternative to current microblogs, such as Tweets and Status Updates. An important part of our discussion will be to list challenges and opportunities in this area. | |||
| A sensitive Twitter earthquake detector | | BIBA | Full-Text | 999-1002 | |
| Bella Robinson; Robert Power; Mark Cameron | |||
| This paper describes early work at developing an earthquake detector for
Australia and New Zealand using Twitter. The system is based on the Emergency
Situation Awareness (ESA) platform which provides all-hazard information
captured, filtered and analysed from Twitter. The detector sends email
notifications of evidence of earthquakes from Tweets to the Joint Australian
Tsunami Warning Centre.
| The earthquake detector uses the ESA platform to monitor Tweets and checks for specific earthquake related alerts. The Tweets that contribute to an alert are then examined to determine their locations: when the Tweets are identified as being geographically close and the retweet percentage is low, an email notification is generated. The earthquake detector has been in operation since December 2012, with 31 notifications generated, of which 17 corresponded to real, although minor, earthquake events. The remaining 14 were a result of discussions about earthquakes but not prompted by an event. A simple modification to our algorithm results in 20 notifications identifying the same 17 real events and reducing the false positives to 3. Our detector is sensitive in that it can generate alerts from only a few Tweets when they are determined to be geographically close. | |||
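The alert rule described above (geographic closeness plus a low retweet share) can be illustrated with a toy filter. The thresholds and data structures below are placeholders, not the ESA implementation.

```python
# Illustrative sketch (not the ESA system): given the tweets behind an
# alert, notify only if the geolocated tweets lie within a small radius
# of their centroid and the retweet share is low. Radius and retweet
# thresholds are arbitrary placeholders.
import math

def haversine_km(p, q):
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2 +
         math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def should_notify(tweets, max_radius_km=100.0, max_retweet_ratio=0.2):
    coords = [t["coords"] for t in tweets if t.get("coords")]
    if len(coords) < 3:          # too little geo evidence
        return False
    centroid = (sum(c[0] for c in coords) / len(coords),
                sum(c[1] for c in coords) / len(coords))
    if any(haversine_km(c, centroid) > max_radius_km for c in coords):
        return False
    retweets = sum(1 for t in tweets if t.get("is_retweet"))
    return retweets / len(tweets) <= max_retweet_ratio

# Hypothetical alert backed by three original geotagged tweets.
tweets = [{"coords": (-37.81, 144.96), "is_retweet": False},
          {"coords": (-37.90, 145.10), "is_retweet": False},
          {"coords": (-37.70, 144.80), "is_retweet": False},
          {"coords": None, "is_retweet": False}]
print(should_notify(tweets))  # True under these placeholder thresholds
```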
| Text vs. images: on the viability of social media to assess earthquake damage | | BIBA | Full-Text | 1003-1006 | |
| Yuan Liang; James Caverlee; John Mander | |||
| In this paper, we investigate the potential of social media to provide rapid insights into the location and extent of damage associated with two recent earthquakes -- the 2011 Tohoku earthquake in Japan and the 2011 Christchurch earthquake in New Zealand. Concretely, we (i) assess and model the spatial coverage of social media; and (ii) study the density and dynamics of social media in the aftermath of these two earthquakes. We examine the difference between text tweets and media tweets (containing links to images and videos), and investigate tweet density, re-tweet density, and user tweeting count to estimate the epicenter and to model the intensity attenuation of each earthquake. We find that media tweets provide more valuable location information, and that the relationship between social media activity vs. loss/damage attenuation suggests that social media following a catastrophic event can provide rapid insight into the extent of damage. | |||
| Comparing web feeds and tweets for emergency management | | BIBA | Full-Text | 1007-1010 | |
| Robert Power; Bella Robinson; Catherine Wise | |||
| This paper describes ongoing work with the Australian Government to assemble
information from a collection of web feeds describing emergency incidents of
interest for emergency managers. The developed system, the Emergency Response
Intelligence Capability (ERIC) tool, has been used to gather information about
emergency events during the Australian summer of 2012/13. The web feeds are an
authoritative source of structured information summarising incidents that
includes links to emergency services web sites containing further details about
the events underway.
The intelligence obtained using ERIC for a specific fire event has been compared with information that was available in Twitter using the Emergency Situation Awareness (ESA) platform. This information would have been useful as a new source of intelligence: it was reported faster than via the web feed, contained more specific event information, included details of impact to the community, was updated more frequently, included information from the public and remains available as a source of information long after the web feed contents have been removed. | |||
| Leveraging on social media to support the global building resilient cities campaign | | BIBA | Full-Text | 1011-1012 | |
| David Stevens | |||
| This paper presents a summary of the main points put forward during the presentation delivered at the 2nd International Workshop on Social Web for Disaster Management which was held in conjunction with WWW 2013 on May 14th 2013 in Rio de Janeiro, Brazil. | |||
| Location-based insights from the social web | | BIBA | Full-Text | 1013-1016 | |
| Yohei Ikawa; Maja Vukovic; Jakob Rogstadius; Akiko Murakami | |||
| Citizens, news reporters, relief organizations, and governments are increasingly relying on the Social Web to report on and respond to disasters as they occur. The capability to rapidly react to important events, which can be identified from high-volume streams even when the sources are unknown, still requires precise localization of the events and verification of the reports. In this paper, we propose a framework for classifying location elements and a method for their extraction from Social Web data. We describe the framework in the context of existing Social Web systems used for disaster management. We present a new location-inferencing architecture and evaluate its performance with a data set from a real-world disaster. | |||
| Location extraction from disaster-related microblogs | | BIBA | Full-Text | 1017-1020 | |
| John Lingad; Sarvnaz Karimi; Jie Yin | |||
| Location information is critical to understanding the impact of a disaster, including where the damage is, where people need assistance and where help is available. We investigate the feasibility of applying Named Entity Recognizers to extract locations from microblogs, at the level of both geo-location and point-of-interest. Our experimental results show that such tools, once retrained on microblog data, have great potential to detect the 'where' information, even at the granularity of point-of-interest. | |||
| Practical extraction of disaster-relevant information from social media | | BIBA | Full-Text | 1021-1024 | |
| Muhammad Imran; Shady Elbassuoni; Carlos Castillo; Fernando Diaz; Patrick Meier | |||
| During times of disaster, online users generate a significant amount of data, some of which is extremely valuable for relief efforts. In this paper, we study the nature of social-media content generated during two different natural disasters. We also train a model based on conditional random fields to extract valuable information from such content. We evaluate our techniques over our two datasets through a set of carefully designed experiments. We also test our methods over a non-disaster dataset to show that our extraction model is useful for extracting information from socially-generated content in general. | |||
| Information sharing on Twitter during the 2011 catastrophic earthquake | | BIBA | Full-Text | 1025-1028 | |
| Fujio Toriumi; Takeshi Sakaki; Kosuke Shinoda; Kazuhiro Kazama; Satoshi Kurihara; Itsuki Noda | |||
| Such large disasters as earthquakes and hurricanes are very unpredictable. During a disaster, we must collect information to save lives. However, in times of disaster it is difficult to collect information useful to ourselves from such traditional mass media as TV and newspapers, which carry information for the general public. Social media attract attention for sharing information, especially Twitter, a hugely popular social medium that is now being used during disasters. In this paper, we focus on information sharing behaviors on Twitter during disasters. We collected data before and during the Great East Japan Earthquake and arrived at the following conclusions: many users with little experience with such specific functions as reply and retweet did not continue to use them after the disaster; retweets were widely used to share information on Twitter; and retweets were used not only for sharing information provided by general users but also for relaying information from the mass media.
We conclude that social media users changed their behavior to widely diffuse important information and decreased non-emergency tweets to avoid interrupting critical information. | |||
| Information verification during natural disasters | | BIBA | Full-Text | 1029-1032 | |
| Abdulfatai Popoola; Dmytro Krasnoshtan; Attila-Peter Toth; Victor Naroditskiy; Carlos Castillo; Patrick Meier; Iyad Rahwan | |||
| Large amounts of unverified and at times contradictory information often appear on social media following natural disasters. Timely verification of this information can be crucial to saving lives and for coordinating relief efforts. Our goal is to enable this verification by developing an online platform that involves ordinary citizens in the evidence gathering and evaluation process. The output of this platform will provide reliable information to humanitarian organizations, journalists, and decision makers involved in relief efforts. | |||
| TempWeb'13 welcome and organization | |||
| Timelines as summaries of popular scheduled events | | BIBA | Full-Text | 1037-1044 | |
| Omar Alonso; Kyle Shiells | |||
| Known events that are scheduled in advance, such as popular sports games, usually get a lot of attention from the public. Communications media like TV, radio, and newspapers will report the salient aspects of such events live or post-hoc for general consumption. However, certain actions, facts, and opinions would likely be omitted from those objective summaries. Our approach is to construct a particular game's timeline in such a way that it can be used as a quick summary of the main events that happened along with popular subjective and opinionated items that the public inject. Peaks in the volume of posts discussing the event reflect both objectively recognizable events in the game -- in the sports example, a change in score -- and subjective events such as a referee making a call fans disagree with. In this work, we introduce a novel timeline design that captures a more complete story of the event by placing the volume of Twitter posts alongside keywords that are driving the additional traffic. We demonstrate our approach using events of major international social impact from the World Cup 2010 and evaluate against professional liveblog coverage of the same events. | |||
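The timeline construction described here, volume peaks annotated with the terms driving them, can be sketched roughly as follows; the peak test (mean plus one standard deviation) and the term-scoring heuristic are assumptions for illustration, not the paper's method.

```python
# Sketch under assumptions: tweets are binned per minute, a bin is a
# peak if its volume exceeds the mean plus one standard deviation of
# all bins, and each peak is labeled with the terms most
# over-represented in it relative to the whole stream.
from collections import Counter
import statistics

def timeline(tweets, top_terms=3):
    """tweets: list of (minute, text). Returns [(minute, count, terms)]."""
    bins, term_bins, overall = Counter(), {}, Counter()
    for minute, text in tweets:
        tokens = text.lower().split()
        bins[minute] += 1
        term_bins.setdefault(minute, Counter()).update(tokens)
        overall.update(tokens)
    counts = list(bins.values())
    cutoff = statistics.mean(counts) + (statistics.pstdev(counts) or 1)
    peaks = []
    for minute in sorted(bins):
        if bins[minute] > cutoff:
            scored = {t: c / (1 + overall[t])
                      for t, c in term_bins[minute].items()}
            terms = sorted(scored, key=scored.get, reverse=True)[:top_terms]
            peaks.append((minute, bins[minute], terms))
    return peaks

# Hypothetical minute-stamped tweets around a goal in a match.
tweets = ([(1, "kickoff at last"), (2, "slow start")] +
          [(10, "goal goal amazing strike")] * 8 +
          [(11, "replay of the goal")] * 2)
print(timeline(tweets))
```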
| A survey of web archive search architectures | | BIBA | Full-Text | 1045-1050 | |
| Miguel Costa; Daniel Gomes; Francisco Couto; Mário Silva | |||
| Web archives already hold more than 282 billion documents and users demand full-text search to explore this historical information. This survey provides an overview of web archive search architectures designed for time-travel search, i.e. full-text search on the web within a user-specified time interval. Performance, scalability and ease of management are important aspects to take into consideration when choosing a system architecture. We compare these aspects and initiate the discussion of which search architecture is more suitable for a large-scale web archive. | |||
| Archival HTTP redirection retrieval policies | | BIBA | Full-Text | 1051-1058 | |
| Ahmed AlSum; Michael L. Nelson; Robert Sanderson; Herbert Van de Sompel | |||
| When retrieving archived copies of web resources (mementos) from web archives, the original resource's URI-R is typically used as the lookup key in the web archive. This is straightforward until the resource on the live web issues a redirect: R → R′. Then it is not clear whether R or R′ should be used as the lookup key to the web archive. In this paper, we report on a quantitative study to evaluate a set of policies to help the client discover the correct memento when faced with redirection. We studied the stability of 10,000 resources and found that 48% of the sample URIs tested were not stable, with respect to their status and redirection location. 27% of the resources were not perfectly reliable in terms of the number of mementos of successful responses over the total number of mementos, and 2% had a reliability score of less than 0.5. We tested two retrieval policies. The first policy covered resources that currently issue redirects and successfully resolved 17 out of 77 URIs that did not have mementos of the original URI but did have mementos of the resource being redirected to. The second policy covered archived copies with HTTP redirection and helped the client in 58% of the cases tested to discover the nearest memento to the requested datetime. | |||
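One of the retrieval policies can be sketched in the following spirit: try the archive with the original URI-R, and if no memento is found, follow the live redirect to R′ and retry. The `find_memento` helper below is a hypothetical stand-in for a real Memento TimeGate/TimeMap lookup, not an API from the paper.

```python
# Sketch of a redirect-aware lookup policy (not the paper's exact
# algorithm): look up URI-R in the archive first; if nothing is found
# and the live web now redirects R to R', retry the lookup with R'.
import requests

def find_memento(uri, datetime_str):
    """Hypothetical archive lookup: return the URI of a memento of
    `uri` nearest `datetime_str`, or None. A real implementation would
    query a Memento TimeGate/TimeMap; stubbed out here."""
    return None

def resolve_with_redirect_policy(uri_r, datetime_str):
    memento = find_memento(uri_r, datetime_str)
    if memento:
        return memento
    try:  # the live web may now redirect R to some R'; retry with R'
        live = requests.head(uri_r, allow_redirects=True, timeout=10)
        if live.url != uri_r:
            return find_memento(live.url, datetime_str)
    except requests.RequestException:
        pass
    return None
```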
| Creating a billion-scale searchable web archive | | BIBA | Full-Text | 1059-1066 | |
| Daniel Gomes; Miguel Costa; David Cruz; João Miranda; Simão Fontes | |||
| Web information is ephemeral. Several organizations around the world are struggling to archive information from the web before it vanishes. However, users demand efficient and effective search mechanisms to access the already vast collections of historical information held by web archives. The Portuguese Web Archive is the largest full-text searchable web archive publicly available. It supports search over 1.2 billion files archived from the web since 1996. This study contributes with an overview of the lessons learned while developing the Portuguese Web Archive, focusing on web data acquisition, ranking search results and user interface design. The developed software is freely available as an open source project. We believe that sharing our experience obtained while developing and operating a running service will enable other organizations to start or improve their web archives. | |||
| Discovering temporal hidden contexts in web sessions for user trail prediction | | BIBA | Full-Text | 1067-1074 | |
| Julia Kiseleva; Hoang Thanh Lam; Mykola Pechenizkiy; Toon Calders | |||
| In many web information systems such as e-shops and information portals, predictive modeling is used to understand users' intentions based on their browsing behaviour. User behavior is inherently sensitive to various hidden contexts. It has been shown in different experimental studies that exploiting contextual information can significantly improve prediction performance. It is reasonable to assume that users may change their intents during one web session and that these changes are influenced by external factors such as a switch in temporal context, e.g., 'users want to find information about a specific product' and after a while 'they want to buy this product'. A web session can be represented as a sequence of user actions ordered by time. The generation of a web session might be influenced by several hidden temporal contexts, so each session can be represented as a concatenation of independent segments, each of which is influenced by one corresponding context. In this work, we show how to learn and apply different predictive models for each segment. We define the problem of discovering temporal hidden contexts in such a way that we directly optimize the accuracy of the predictive models (e.g., user trail prediction) during the process of context acquisition. Our empirical study on a real dataset demonstrates the effectiveness of our method. | |||
| Carbon dating the web: estimating the age of web resources | | BIBA | Full-Text | 1075-1082 | |
| Hany M. SalahEldeen; Michael L. Nelson | |||
| In the course of web research it is often necessary to estimate the creation datetime for web resources (in the general case, this value can only be estimated). While it is feasible to manually establish likely datetime values for small numbers of resources, this becomes infeasible if the collection is large. We present "carbon date", a simple web application that estimates the creation date for a URI by polling a number of sources of evidence and returning a machine-readable structure with their respective values. To establish a likely datetime, we poll bitly for the first time someone shortened the URI, topsy for the first time someone tweeted the URI, a Memento aggregator for the first time it appeared in a public web archive, Google's time of last crawl, and the Last-Modified HTTP response header of the resource itself. We also examine the backlinks of the URI as reported by Google and apply the same techniques for the resources that link to the URI. We evaluated our tool on a gold standard data set of 1200 URIs in which the creation date was manually verified. We were able to estimate a creation date for 75.90% of the resources, with 32.78% having the correct value. Given the different nature of the URIs, the union of the various methods produces the best results. While the Google last crawl date and topsy account for nearly 66% of the closest answers, eliminating the web archives or Last-Modified from the results produces the largest overall negative impact on th | |||