# Proceedings of the 2015 ACM International Conference on Multimedia

Fullname: Proceedings of the 23rd ACM International Conference on Multimedia

Xiaofang Zhou; Alan F. Smeaton; Qi Tian; Dick C.A. Bulterman; Heng Tao Shen; Ketan Mayer-Patel; Shuicheng Yan

Brisbane, Australia, 2015-Oct-26 to 2015-Oct-30

ACM ISBN: 978-1-4503-3459-4; hcibib: MM15 272 1349

### Keynote 1

#### Harnessing Big Personal Data, with Scrutable User Modelling for Privacy and Control

Judy Kay. Pages 1-2.

My work aims to enable people to harness, control and manage their big personal data. This is challenging because people are generating vast, and growing, collections of personal data. That data is captured by a rich personal digital ecosystem of devices, some worn or carried, and others fixed or embedded in the environment. Users explicitly store some data but systems also capture the user's digital footprints, ranging from simple clicks and touches, to images, audio and video. This personal data resides in a quite bewildering range of places, from personal devices to cloud stores, in multitudes of silos.

Big personal data differs from scientific big data in important ways. Because it is personal, it should be handled in ways that enable people to ensure it is managed and used as they wish. It may be of modest size compared with scientific big data, but people consider their data stores as big, because they are complex and hard to manage. A driving goal for my research has been to tackle the challenges of big personal data by creating infrastructures, representations and interfaces that enable a user to scrutinize and control their personal data in a scrutable user model.

One important role for user models is personalisation, where the user model is a dynamic set of evidence-based beliefs about the user. This is the foundation for personalisation, ranging from recommenders to teaching systems. User models may represent anything from the user's attributes to their knowledge, beliefs, goals, plans and preferences.

### Keynote 2

#### Vision-enhanced Immersive Interaction and Remote Collaboration with Large Touch Displays

Zhengyou Zhang. Pages 3-4.

Large displays are becoming a commodity, and more and more, they are touch-enabled. In this keynote, we describe a system called ViiBoard (Vision-enhanced Immersive Interaction with touch Board) that enables natural interaction and immersive remote collaboration with large touch displays by adding a commodity color plus depth sensor. It consists of two parts. The first part, called VTouch, augments touch input with visual understanding of the user to improve interaction with a large touch-sensitive display such as Microsoft Surface Hub. An RGBD sensor such as Microsoft Kinect adds the visual modality and enables new interactions beyond touch. Through visual analysis, the system understands where the user is, who the user is, and what the user is doing even before the user touches the display. Such information is used to enhance interaction in multiple ways. For example, a user can use simple gestures to bring up menu items such as a color palette and soft keyboard; menu items can be shown where the user is and can follow the user; hovering can show information to the user before the user commits to touch; the user can perform different functions (for example writing and erasing) with different hands; and the user's preference profile can be maintained, distinct from other users. User studies were conducted, and the users very much appreciated the value of these and other enhanced interactions.

The second part is called ImmerseBoard. ImmerseBoard is a system for remote collaboration through a digital whiteboard that gives participants a 3D immersive experience, enabled by an RGBD sensor mounted on the side of a large touch display (the same setup as in VTouch). Using 3D processing of the depth images, life-sized rendering, and novel visualizations, ImmerseBoard emulates writing side-by-side on a physical whiteboard, or alternatively on a mirror. User studies involving three tasks show that compared to standard video conferencing with a digital whiteboard, ImmerseBoard provides participants with a quantitatively better ability to estimate their remote partners' eye gaze direction, gesture direction, intention, and level of agreement. Moreover, these quantitative capabilities translate qualitatively into a heightened sense of being together and a more enjoyable experience. ImmerseBoard's form factor is suitable for practical and easy installation in homes and offices.

### Best Paper Session

#### Analyzing Free-standing Conversational Groups: A Multimodal Approach

Xavier Alameda-Pineda; Yan Yan; Elisa Ricci; Oswald Lanz; Nicu Sebe. Pages 5-14.

During natural social gatherings, humans tend to organize themselves in so-called free-standing conversational groups. In this context, robust head and body pose estimates can facilitate the higher-level description of the ongoing interplay. Importantly, visual information typically obtained with a distributed camera network might not suffice to achieve the robustness sought. In this line of thought, recent advances in wearable sensing technology open the door to multimodal and richer information flows. In this paper we propose to cast the head and body pose estimation problem into a matrix completion task. We introduce a framework able to fuse multimodal data emanating from a combination of distributed and wearable sensors, taking into account the temporal consistency, the head/body coupling and the noise inherent to the scenario. We report results on the novel and challenging SALSA dataset, containing visual, auditory and infrared recordings of 18 people interacting in a regular indoor environment. We demonstrate the soundness of the proposed method and the usability for higher-level tasks such as the detection of F-formations and the discovery of social attention attractors.
#### An Affordable Solution for Binocular Eye Tracking and Calibration in Head-mounted Displays

Michael Stengel; Steve Grogorick; Martin Eisemann; Elmar Eisemann; Marcus A. Magnor. Pages 15-24.

Immersion is the ultimate goal of head-mounted displays (HMD) for Virtual Reality (VR) in order to produce a convincing user experience. Two important aspects in this context are motion sickness, often due to imprecise calibration, and the integration of reliable eye tracking. We propose an affordable hard- and software solution for drift-free eye-tracking and user-friendly lens calibration within an HMD. The use of dichroic mirrors leads to a lean design that provides the full field-of-view (FOV) while using commodity cameras for eye tracking. Our prototype supports personalizable lens positioning to accommodate different interocular distances. On the software side, a model-based calibration procedure adjusts the eye tracking system and gaze estimation to varying lens positions. Challenges such as partial occlusions due to the lens holders and eyelids are handled by a novel robust monocular pupil-tracking approach. We present four applications of our work: gaze map estimation, foveated rendering for depth of field, gaze-contingent level-of-detail, and gaze control of virtual avatars.
#### SINGA: Putting Deep Learning in the Hands of Multimedia Users

Wei Wang; Gang Chen; Anh Tien Tuan Dinh; Jinyang Gao; Beng Chin Ooi; Kian-Lee Tan; Sheng Wang. Pages 25-34.

Recently, deep learning techniques have enjoyed success in various multimedia applications, such as image classification and multi-modal data analysis. Two key factors behind deep learning's remarkable achievement are the immense computing power and the availability of massive training datasets, which enable us to train large models to capture complex regularities of the data. There are two challenges to overcome before deep learning can be widely adopted in multimedia and other applications. One is usability: non-experts must be able to implement different models and training algorithms without much effort. The other is scalability: the deep learning system must be able to provision for a huge demand of computing resources for training large models with massive datasets. To address these two challenges, in this paper, we design a distributed deep learning platform called SINGA which has an intuitive programming model and good scalability. Our experience with developing and training deep learning models for real-life multimedia applications in SINGA shows that the platform is both usable and scalable.
#### Weakly-Shared Deep Transfer Networks for Heterogeneous-Domain Knowledge Propagation

Xiangbo Shu; Guo-Jun Qi; Jinhui Tang; Jingdong Wang. Pages 35-44.

In recent years, deep networks have been successfully applied to model image concepts and achieved competitive performance on many data sets. In spite of impressive performance, conventional deep networks can suffer degraded performance when training examples are insufficient. This problem becomes extremely severe for deep networks with powerful representation structure, making them prone to overfitting by capturing nonessential or noisy information in a small data set. In this paper, to address this challenge, we develop a novel deep network structure, capable of transferring labeling information across heterogeneous domains, especially from the text domain to the image domain. These weakly-shared Deep Transfer Networks (DTNs) can adequately mitigate the problem of insufficient image training data by bringing in rich labels from the text domain.

Specifically, we present a novel architecture of DTNs to translate cross-domain information from text to image. To share the labels between two domains, we build multiple weakly shared layers of features. This allows representing both shared inter-domain features and domain-specific features, making this structure more flexible and powerful in capturing complex data of different domains jointly than strongly shared layers. Experiments on a real-world dataset show its competitive performance as compared with other state-of-the-art methods.

### Panel 1

#### Opportunities and Challenges of Industry-Academic Collaborations in Multimedia Research

Shih-Fu Chang; Matt Cooper; Denver Dash; Funda Kivran-Swaine; Jia Li; David A. Shamma. Page 45.

This ACM MM panel aims to redefine the state of research between Academia and Industry.

### Panel 2

#### Opportunities and Challenges of Global Network Cameras

Joanna Batstone; Touradj Ebrahimi; Tiejun Huang; Yung-Hsiang Lu; Yonggang Wen. Pages 47-48.

Since the introduction of consumer digital cameras, user-created multimedia content has become increasingly popular. Digital cameras, together with inexpensive editing tools and free hosting sites, have made multimedia an integral part of everyday life. Today, hundreds of hours of video are uploaded to hosting sites every minute. Video-on-demand through wireless networks and smartphones has profoundly changed how people consume multimedia content. Meanwhile, the widely deployed network cameras can provide live views of many parts of the world. These cameras can provide rich sources for creating multimedia content. This panel will explore the opportunities and discuss the challenges of using global network cameras for creating multimedia content and understanding the world. Every year, millions of network cameras are deployed. The data from some of these network cameras are publicly available, continuously streaming live views of national parks, city halls, streets, highways, and shopping malls. A person may see multiple tourist attractions through these cameras, without leaving home. Researchers may observe the weather in different cities. Using the data from the cameras, it is possible to observe natural disasters, such as volcanic eruptions or tsunamis, at a safe distance. News reporters may obtain instant views of an unfolding riot without risking their lives. A spectator may watch a celebration parade from multiple locations using the street cameras. Despite the many promising applications, the opportunities of using global network cameras for creating multimedia content have not been fully exploited.

### Session 1: Multimedia Indexing and Search

#### Fast and Accurate Content-based Semantic Search in 100M Internet Videos

Lu Jiang; Shoou-I Yu; Deyu Meng; Yi Yang; Teruko Mitamura; Alexander G. Hauptmann. Pages 49-58.

Large-scale content-based semantic search in video is an interesting and fundamental problem in multimedia analysis and retrieval. Existing methods index a video by the raw concept detection score that is dense and inconsistent, and thus cannot scale to "big data" that are readily available on the Internet. This paper proposes a scalable solution. The key is a novel step called concept adjustment that represents a video by a few salient and consistent concepts that can be efficiently indexed by the modified inverted index. The proposed adjustment model relies on a concise optimization framework with interpretations. The proposed index leverages the text-based inverted index for video retrieval. Experimental results validate the efficacy and the efficiency of the proposed method. The results show that our method can scale up the semantic search while maintaining state-of-the-art search performance. Specifically, the proposed method (with reranking) achieves the best result on the challenging TRECVID Multimedia Event Detection (MED) zero-example task. It takes only 0.2 seconds on a single CPU core to search a collection of 100 million Internet videos.
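The indexing idea in this abstract can be illustrated with a small sketch. It is not the authors' method: the paper's concept adjustment is an optimization model, whereas the hypothetical `adjust_concepts` below simply keeps the k highest-scoring concepts per video as a stand-in. The function names, the toy detection scores and the additive scoring in `search` are all assumptions made for illustration.

```python
from collections import defaultdict


def adjust_concepts(scores, k=3):
    """Keep the k highest-scoring concepts of one video.

    Assumption: plain top-k truncation stands in for the paper's
    optimization-based concept adjustment.
    """
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)


def build_inverted_index(videos, k=3):
    """Map each concept to the (video_id, score) pairs that retained it."""
    index = defaultdict(list)
    for video_id, scores in videos.items():
        for concept, score in adjust_concepts(scores, k).items():
            index[concept].append((video_id, score))
    return index


def search(index, query_concepts):
    """Rank videos by the summed scores of the query concepts they contain."""
    totals = defaultdict(float)
    for concept in query_concepts:
        for video_id, score in index.get(concept, []):
            totals[video_id] += score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical dense detection scores for two videos.
videos = {
    "v1": {"dog": 0.9, "park": 0.7, "car": 0.05, "cake": 0.02},
    "v2": {"cake": 0.8, "candle": 0.6, "dog": 0.1, "car": 0.03},
}
index = build_inverted_index(videos, k=2)
results = search(index, ["dog", "park"])
```

Because each video contributes only a few postings, a query touches short posting lists instead of a dense score matrix, which is the property the abstract credits for scalability.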
#### Visual Coding in a Semantic Hierarchy

Yang Yang; Hanwang Zhang; Mingxing Zhang; Fumin Shen; Xuelong Li. Pages 59-68.

In recent years, tremendous research endeavours have been dedicated to seeking effective visual representations for facilitating various multimedia applications, such as visual annotation and retrieval. Nonetheless, existing approaches can hardly achieve satisfactory performance because the semantic properties of visual codes have not been fully explored. In this paper, we present a novel visual coding approach, termed hierarchical semantic visual coding (HSVC), to effectively encode visual objects (e.g., image and video) in a semantic hierarchy. Specifically, we first construct a semantic-enriched dictionary hierarchy, which comprises dictionaries corresponding to all concepts in a semantic hierarchy as well as their hierarchical semantic correlation. Moreover, we devise an on-line semantic coding model, which simultaneously 1) exploits the rich hierarchical semantic prior knowledge in the learned dictionary, 2) reflects the semantic sparse property of visual codes, and 3) explores semantic relationships among concepts in the semantic hierarchy. To this end, we propose to integrate a concept-level group sparsity constraint and a semantic correlation matrix into a unified regularization term. We design an effective algorithm to optimize the proposed model, and a rigorous mathematical analysis is provided to guarantee that the algorithm converges to a global optimum. Extensive experiments on various multimedia datasets have been conducted to illustrate the superiority of our proposed approach as compared to state-of-the-art methods.
#### Deep Compositional Cross-modal Learning to Rank via Local-Global Alignment

Xinyang Jiang; Fei Wu; Xi Li; Zhou Zhao; Weiming Lu; Siliang Tang; Yueting Zhuang. Pages 69-78.

Cross-modal retrieval is a very hot research topic that is imperative to many applications involving multi-modal data. Discovering an appropriate representation for multi-modal data and learning a ranking function are essential to boost cross-media retrieval. Motivated by the assumption that a compositional cross-modal semantic representation (pairs of images and text) is more attractive for cross-modal ranking, this paper exploits the existing image-text databases to optimize a ranking function for cross-modal retrieval, called deep compositional cross-modal learning to rank (C2MLR). C2MLR considers learning a multi-modal embedding from the perspective of optimizing a pairwise ranking problem while enhancing both local alignment and global alignment. In particular, the local alignment (i.e., the alignment of visual objects and textual words) and the global alignment (i.e., the image-level and sentence-level alignment) are collaboratively utilized to learn the multi-modal embedding common space in a max-margin learning-to-rank manner. The experiments demonstrate the superiority of our proposed C2MLR due to its nature of multi-modal compositional embedding.
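To make "max-margin pairwise ranking" concrete, here is a minimal sketch of a generic pairwise hinge loss in a shared embedding space. It is a textbook formulation, not the C2MLR objective, which additionally combines local and global alignment terms. The dot-product similarity, the margin of 1.0 and the toy two-dimensional embeddings are assumptions for illustration.

```python
def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def pairwise_hinge_loss(query, positive, negative, margin=1.0):
    """Generic max-margin ranking loss for one (query, positive, negative) triple.

    The loss is zero once the matching item outscores the non-matching
    item by at least `margin`: max(0, margin - s(q, pos) + s(q, neg)).
    """
    return max(0.0, margin - dot(query, positive) + dot(query, negative))


# Hypothetical embeddings of a text query and two candidate images.
text_query = [1.0, 0.0]
matching_image = [2.0, 0.0]
other_image = [0.5, 1.0]

# Correct ordering with enough margin: no penalty.
loss_ok = pairwise_hinge_loss(text_query, matching_image, other_image)
# Swapped roles: the ordering is violated, so the loss is positive.
loss_violated = pairwise_hinge_loss(text_query, other_image, matching_image)
```

Training a ranking model in this style means adjusting the embeddings so that the summed loss over many such triples decreases.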
#### Effective Multi-Query Expansions: Robust Landmark Retrieval

Yang Wang; Xuemin Lin; Lin Wu; Wenjie Zhang. Pages 79-88.

Given a query photo issued by a user (q-user), landmark retrieval returns a set of photos whose landmarks are similar to those of the query. Existing studies on landmark retrieval focus on exploiting the geometries of landmarks for similarity matches between candidate photos and a query photo. We observe that the same landmarks provided by different users may convey different geometry information depending on the viewpoints and/or angles, and may subsequently yield very different results. In fact, dealing with the landmarks with shapes caused by the photography of q-users is often nontrivial and has never been studied.

Motivated by this, in this paper we propose a novel framework, namely multi-query expansions, to retrieve semantically robust landmarks in two steps. Firstly, we identify the top-k photos regarding the latent topics of a query landmark to construct a multi-query set so as to remedy its possible shape. For this purpose, we significantly extend the techniques of Latent Dirichlet Allocation. Secondly, we propose a novel technique to generate a robust yet compact pattern set from the multi-query photos. To avoid redundancy and improve efficiency, we adopt existing pattern mining techniques based on the minimum-description-length principle to remove similar query photos from the (k+1) selected query photos. Then, a landmark retrieval rule is developed to calculate the ranking scores between the mined pattern set and each photo in the database, which are ranked to serve as the final ranking list of landmark retrieval. Extensive experiments are conducted on real-world landmark datasets, validating the significantly higher accuracy of our approach.

### Session 2: Social Multimedia

#### What are Popular: Exploring Twitter Features for Event Detection, Tracking and Visualization

Hongyun Cai; Yang Yang; Xuefei Li; Zi Huang. Pages 89-98.

As one of the most representative social media platforms, Twitter provides various real-life information on social events in real time. Although social event detection has been actively studied, tweet images, which appear in around 36 percent of all tweets, have not been well utilized for this research problem. Most existing event detection methods tend to represent an image as a bag-of-visual-words and then process these visual words in the same way as textual words. This may not fully exploit the visual properties of images. State-of-the-art visual features like convolutional neural network (CNN) features have shown significant performance gains over the traditional bag-of-visual-words in unveiling the image's semantics. Unfortunately, they have not been employed in detecting events from social websites. Hence, how to make the most of tweet images to improve the performance of social event detection and visualization remains open. In this paper, we thoroughly study the impact of tweet images on social event detection for different event categories using various visual features. A novel topic model which jointly models five Twitter features (text, image, location, timestamp and hashtag) is designed to discover events from the sheer amount of tweets. Moreover, the evolutions of events are tracked by linking the events detected on adjacent days and each event is visualized by representative images selected on three predefined criteria. Extensive experiments have been conducted on a real-life tweet dataset to verify the effectiveness of our method.
#### Cross-Domain Collaborative Learning in Social Multimedia

Shengsheng Qian; Tianzhu Zhang; Richang Hong; Changsheng Xu. Pages 99-108.

Cross-domain data analysis is one of the most important tasks in social multimedia. It has a wide range of real-world applications, including cross-platform event analysis, cross-domain multi-event tracking, cross-domain video recommendation, etc. It is also very challenging because the data have multi-modal and multi-domain properties, and there are no explicit correlations to link different domains. To deal with these issues, we propose a generic Cross-Domain Collaborative Learning (CDCL) framework based on a non-parametric Bayesian dictionary learning model for cross-domain data analysis. The proposed CDCL model can make use of the shared domain priors and modality priors to collaboratively learn the data's representations by considering the domain discrepancy and the multi-modal property. As a result, our CDCL model can effectively explore the virtues of different information sources to complement and enhance each other for cross-domain data analysis. To evaluate the proposed model, we apply it to two different applications: cross-platform event recognition and cross-network video recommendation. Extensive experimental evaluations demonstrate the effectiveness of the proposed algorithm for cross-domain data analysis.
#### Learning Socially Embedded Visual Representation from Scratch

Shaowei Liu; Peng Cui; Wenwu Zhu; Shiqiang Yang. Pages 109-118.

Learning image representations with deep models has recently made remarkable achievements for semantic-oriented applications, such as image classification. However, for user-centric tasks, such as image search and recommendation, simply employing the representation learnt from semantic-oriented tasks may fail to capture user intentions. In this paper, we propose a novel Socially Embedded VIsual Representation Learning (SEVIR) approach, where an Asymmetric Multi-task CNN (amtCNN) model is proposed to embed the user intention learning task into the semantic learning task. Specifically, to address the sparsity and unreliability problems in social behavioral data, we propose to use user clustering, reliability evaluation, and random dropout in the output layer of our amtCNN. With its partially shared network architecture, the learnt representation can capture both semantics and user intentions. Comprehensive experiments are conducted to investigate the effectiveness of our approach in applications of user favoring prediction, personalized image recommendation, and image reranking. Compared to the state-of-the-art image representation techniques, our approach achieves significant improvement in performance.
#### Spatial-aware Multimodal Location Estimation for Social Images

Jiewei Cao; Zi Huang; Yang Yang. Pages 119-128.

Nowadays the locations of social images play an important role in geographic knowledge discovery. However, most social images still lack location information, which has recently made location estimation for social images an active research topic. With the rapid growth of social images, new challenges have been posed: 1) data quality of social images is an issue because they are often associated with noisy and error-prone user-generated content, such as junk comments and misspelled words; and 2) data sparsity exists in social images despite the large volume, since most of them are unevenly distributed around the world and their contextual information is often missing or incomplete. In this paper, we propose a spatial-aware multimodal location estimation (SMLE) framework to tackle the above challenges. Specifically, a spatial-aware language model (SLM) is proposed to detect high-quality location-indicative tags from large datasets. We also design a spatial-aware topic model, namely spatial-aware regularized latent semantic indexing (SRLSI), to discover geographic topics and alleviate the data sparseness problem existing in language modeling. Taking the multiple modalities of social images into consideration, we employ the learning to rank approach to fuse multiple pieces of evidence derived from textual features represented by SLM and SRLSI, and visual features represented by bag-of-visual-words (BoVW). Importantly, an ad hoc method is introduced to construct the training dataset with spatial-aware relevance labels for learning to rank training. Finally, given a query image, its location is estimated as the location of its most relevant image returned from the learning to rank model. The proposed framework is evaluated on a public benchmark provided by the MediaEval 2013 Placing Task, which contains more than 8.5 million images crawled from Flickr. Extensive experiments on this dataset demonstrate the superior performance of the proposed methods over the state-of-the-art approaches.

### Session 3: Emotional and Social Signals in Multimedia

#### Collaborative Fashion Recommendation: A Functional Tensor Factorization Approach

Yang Hu; Xi Yi; Larry S. Davis. Pages 129-138.

With the rapid expansion of online shopping for fashion products, effective fashion recommendation has become an increasingly important problem. In this work, we study the problem of personalized outfit recommendation, i.e. automatically suggesting outfits to users that fit their personal fashion preferences. Unlike existing recommendation systems that usually recommend individual items, we suggest sets of items, which interact with each other, to users. We propose a functional tensor factorization method to model the interactions between user and fashion items. To effectively utilize the multi-modal features of the fashion items, we use a gradient boosting based method to learn nonlinear functions to map the feature vectors from the feature space into some low dimensional latent space. The effectiveness of the proposed algorithm is validated through extensive experiments on real-world user data from a popular fashion-focused social network.
#### Predicting and Understanding Urban Perception with Convolutional Neural Networks

Lorenzo Porzi; Samuel Rota Bulò; Bruno Lepri; Elisa Ricci. Pages 139-148.

Cities' visual appearance plays a central role in shaping human perception and response to the surrounding urban environment. For example, the visual qualities of urban spaces affect the psychological states of their inhabitants and can induce negative social outcomes. Hence, it becomes critically important to understand people's perceptions and evaluations of urban spaces. Previous works have demonstrated that algorithms can be used to predict high level attributes of urban scenes (e.g. safety, attractiveness, uniqueness), accurately emulating human perception. In this paper we propose a novel approach for predicting the perceived safety of a scene from Google Street View Images. In contrast to previous works, we formulate the problem of learning to predict high level judgments as a ranking task and we employ a Convolutional Neural Network (CNN), significantly improving the accuracy of predictions over previous methods. Interestingly, the proposed CNN architecture relies on a novel pooling layer, which makes it possible to automatically discover the most important areas of the images for predicting the concept of perceived safety. An extensive experimental evaluation, conducted on the publicly available Place Pulse dataset, demonstrates the advantages of the proposed approach over state-of-the-art methods.
#### A Multimodal Predictive Model of Successful Debaters or How I Learned to Sway Votes

Maarten Brilman; Stefan Scherer. Pages 149-158.

Interpersonal skills such as public speaking are essential assets for a large variety of professions and in everyday life. The ability to communicate in social environments often greatly influences a person's career development, can help resolve conflict, gain the upper hand in negotiations, or sway public opinion. We focus our investigations on a special form of public speaking, namely public debates of socioeconomic issues that affect us all. In particular, we analyze performances of expert debaters recorded through the Intelligence Squared U.S. (IQ2US) organization. IQ2US collects high-quality audiovisual recordings of these debates and publishes them online free of charge. We extract audiovisual nonverbal behavior descriptors, including facial expressions, voice quality characteristics, and surface level linguistic characteristics. Within our experiments we investigate whether it is possible to automatically predict whether a debater or his/her team is going to sway the most votes after the debate using multimodal machine learning and fusion approaches. We identify unimodal nonverbal behaviors that characterize successful debaters and our investigations reveal that multimodal machine learning approaches can reliably predict which individual (~75% accuracy) or team (85% accuracy) is going to win the most votes in the debate. We created a database consisting of over 30 debates with four speakers per debate suitable for public speaking skill analysis and plan to make this database publicly available for the research community.
#### Visual Affect Around the World: A Large-scale Multilingual Visual Sentiment Ontology

Brendan Jou; Tao Chen; Nikolaos Pappas; Miriam Redi; Mercan Topkara; Shih-Fu Chang. Pages 159-168.

Every culture and language is unique. Our work expressly focuses on the uniqueness of culture and language in relation to human affect, specifically sentiment and emotion semantics, and how they manifest in social multimedia. We develop sets of sentiment- and emotion-polarized visual concepts by adapting semantic structures called adjective-noun pairs, originally introduced by Borth et al. (2013), but in a multilingual context. We propose a new language-dependent method for automatic discovery of these adjective-noun constructs. We show how this pipeline can be applied on a social multimedia platform for the creation of a large-scale multilingual visual sentiment concept ontology (MVSO). Unlike the flat structure in Borth et al. (2013), our unified ontology is organized hierarchically by multilingual clusters of visually detectable nouns and subclusters of emotionally biased versions of these nouns. In addition, we present an image-based prediction task to show how generalizable language-specific models are in a multilingual context. A new, publicly available dataset of >15.6K sentiment-biased visual concepts across 12 languages with language-specific detector banks, >7.36M images and their metadata is also released.

### Session: Multimedia Grand Challenge

#### Learning Deep Features For MSR-bing Information Retrieval Challenge

Qiang Song; Sixie Yu; Cong Leng; JiaXiang Wu; Qinghao Hu; Jian Cheng. Pages 169-172.

Two tasks have been put forward in the MSR-bing Grand Challenge 2015. To address the information retrieval task, we propose and integrate a series of methods with visual features obtained by convolutional neural network (CNN) models. In our experiments, we discover that the ranking strategies of hierarchical clustering and PageRank methods are mutually complementary. The other task is fine-grained classification. In contrast to basic-level recognition, fine-grained classification aims to distinguish between different breeds or species or product models, and often requires distinctions that must be conditioned on the object pose for reliable identification. Current state-of-the-art techniques rely heavily upon the use of part annotations, while the bing datasets suffer from both a lack of part annotations and dirty backgrounds. In this paper, we propose a CNN-based feature representation for visual recognition using only image-level information. Our CNN model is pre-trained on a collection of clean datasets and fine-tuned on the bing datasets. Furthermore, a multi-scale training strategy is adopted by simply resizing the input images into different scales and then merging the soft-max posteriors. We then integrate our method into a unified visual recognition system on Microsoft cloud service. Finally, our solution achieved top performance in both tasks of the contest.
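The phrase "merging the soft-max posteriors" can be illustrated with a short sketch. The abstract does not state the merging rule, so the code assumes a plain average of the per-scale posteriors; the logits are hypothetical stand-ins for the outputs of a classifier applied to the same image at three input sizes.

```python
import math


def softmax(logits):
    """Numerically stable soft-max over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_posteriors(logits_per_scale):
    """Average the soft-max posteriors obtained at each input scale.

    Assumption: averaging is used as the merge rule; the source only
    says the posteriors are merged.
    """
    posteriors = [softmax(logits) for logits in logits_per_scale]
    n_scales = len(posteriors)
    n_classes = len(posteriors[0])
    return [sum(p[c] for p in posteriors) / n_scales for c in range(n_classes)]


# Hypothetical logits for one image at three scales, three classes each.
logits_per_scale = [
    [2.0, 1.0, 0.1],
    [1.5, 1.7, 0.2],
    [2.2, 0.9, 0.3],
]
merged = merge_posteriors(logits_per_scale)
predicted = max(range(len(merged)), key=merged.__getitem__)
```

In this toy case one scale narrowly prefers class 1, but the merged posterior still selects class 0, which shows how combining scales can smooth out a single-scale disagreement.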
 Image Retrieval by Cross-Media Relevance Fusion | BIBA | Full-Text 173-176 Jianfeng Dong; Xirong Li; Shuai Liao; Jieping Xu; Duanqing Xu; Xiaoyong Du How to estimate cross-media relevance between a given query and an unlabeled image is a key question in the MSR-Bing Image Retrieval Challenge. We answer the question by proposing cross-media relevance fusion, a conceptually simple framework that exploits the power of individual methods for cross-media relevance estimation. Four base cross-media relevance functions are investigated, and later combined by weights optimized on the development set. With DCG25 of 0.5200 on the test dataset, the proposed image retrieval system secures the first place in the evaluation.
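The fusion idea in the abstract above (several base relevance functions combined by weights tuned on a development set, then evaluated by DCG) can be sketched as follows. This is a minimal sketch and not the authors' implementation: the weights, scores, and labels are made-up values, and `dcg_at_k` is a generic DCG with gain 2^rel - 1, without whatever normalization constant the challenge's official DCG25 may apply.

```python
import math

def fuse(scores_per_function, weights):
    """Weighted sum of the scores produced by several base relevance functions."""
    return sum(w * s for w, s in zip(weights, scores_per_function))

def dcg_at_k(relevances, k=25):
    """Generic DCG@k: gain 2^rel - 1, discounted by log2 of (rank + 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

# Hypothetical candidates for one query:
# image id -> (scores from 4 base functions, graded relevance label)
candidates = {
    "img_a": ([0.9, 0.2, 0.4, 0.7], 3),
    "img_b": ([0.1, 0.8, 0.3, 0.2], 0),
    "img_c": ([0.5, 0.5, 0.6, 0.4], 2),
}
weights = [0.4, 0.1, 0.2, 0.3]  # assumed; in practice tuned on a development set

# Rank images by fused relevance, then score the ranking with DCG.
ranked = sorted(candidates,
                key=lambda img: fuse(candidates[img][0], weights),
                reverse=True)
score = dcg_at_k([candidates[img][1] for img in ranked])
```

Tuning would then amount to searching over `weights` for the setting that maximizes the mean `dcg_at_k` across development-set queries.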
 Who are the Devils Wearing Prada in New York City? | BIBA | Full-Text 177-180 KuanTing Chen; Kezhen Chen; Peizhong Cong; Winston H. Hsu; Jiebo Luo Fashion is a perpetual topic in human social life, and the mass has the penchant to emulate what large city residents and celebrities wear. Undeniably, New York City is such a bellwether large city with all kinds of fashion leadership. Consequently, to study what the fashion trends are during this year, it is very helpful to learn the fashion trends of New York City. Discovering fashion trends in New York City could boost many applications such as clothing recommendation and advertising. Does the fashion trend in the New York Fashion Show actually influence the clothing styles on the public? To answer this question, we design a novel system that consists of three major components: (1) constructing a large dataset from the New York Fashion Shows and New York street chic in order to understand the likely clothing fashion trends in New York, (2) utilizing a learning-based approach to discover fashion attributes as the representative characteristics of fashion trends, and (3) comparing the analysis results from the New York Fashion Shows and street-chic images to verify whether the fashion shows have actual influence on the people in New York City. Through the preliminary experiments over a large clothing dataset, we demonstrate the effectiveness of our proposed system, and obtain useful insights on fashion trends and fashion influence.
 What Makes New York So Noisy?: Reasoning Noise Pollution by Mining Multimodal Geo-Social Big Data | BIBA | Full-Text 181-184 Hsun-Ping Hsieh; Tzu-Chi Yen; Cheng-Te Li Noise pollution in modern cities is getting worse and sound sensors are sparse and costly, yet there is strong demand for a system that can help reason about and present the noise pollution of any region in urban areas. In this work, we leverage multimodal geo-social media data from Foursquare, Twitter, Flickr, and Gowalla in New York City to infer and visualize the volume and the composition of noise pollution for every region in NYC. Using NYC 311 noise complaint records as an approximation of noise pollution for validation, we develop a joint inference and visualization system that integrates multimodal features, including geographical, mobility, visual, and social, with a graph-based learning model to infer the noise compositions of regions. Experimental results show that our model can achieve promising results with substantially less training data than state-of-the-art methods. A NYC Urban Noise Diagnotor system is developed that allows users to understand the noise composition of any region of NYC in an interactive manner.
 EventBuilder: Real-time Multimedia Event Summarization by Visualizing Social Media | BIBA | Full-Text 185-188 Rajiv Ratn Shah; Anwar Dilawar Shaikh; Yi Yu; Wenjing Geng; Roger Zimmermann; Gangshan Wu Due to the ubiquitous availability of smartphones and digital cameras, the number of photos/videos online has increased rapidly. Therefore, it is challenging to efficiently browse multimedia content and obtain a summary of an event from a large collection of photos/videos aggregated in social media sharing platforms such as Flickr and Instagram. To this end, this paper presents the EventBuilder system that enables people to automatically generate a summary for a given event in real-time by visualizing different social media such as Wikipedia and Flickr. EventBuilder has two novel characteristics: (i) leveraging Wikipedia as event background knowledge to obtain more contextual information about an input event, and (ii) visualizing an interesting event in real-time with a diverse set of social media activities. According to our initial experiments on the YFCC100M dataset from Flickr, the proposed algorithm efficiently summarizes knowledge structures based on the metadata of photos/videos and Wikipedia articles.
 Multimodal Graph-based Event Detection and Summarization in Social Media Streams | BIBA | Full-Text 189-192 Manos Schinas; Symeon Papadopoulos; Georgios Petkos; Yiannis Kompatsiaris; Pericles A. Mitkas The paper describes a multimodal graph-based system for addressing the Yahoo-Flickr Event Summarization Challenge of ACM Multimedia 2015. The objective is to automatically uncover structure within a collection of 100 million photos/videos in the form of detecting and identifying events, and summarizing them succinctly for consumer consumption. The presented system uses a sliding window over the stream of multimedia items to build and maintain a multimodal same-event image graph and applies a graph clustering algorithm to detect events. In addition, it makes use of a graph-based diversity-oriented ranking approach and a versatile event retrieval mechanism to access summarized instances of the events of interest. A demo of the system is online at http://mklab.iti.gr/acmmm2015-gc/.
 Evento 360: Social Event Discovery from Web-scale Multimedia Collection | BIBA | Full-Text 193-196 Jaeyoung Choi; Eungchan Kim; Martha Larson; Gerald Friedland; Alan Hanjalic We present Evento 360 (URL: http://evento360.info), an online interactive social event browser, which allows the user to explore events detected within a web-scale multimedia corpus. The system addresses five key aspects of social multimedia event detection and summarization: multimodality, scale, diversity of representations, noise of multimedia items, and missing metadata. The detection algorithm uses an unsupervised clustering approach that exploits temporal, spatial and textual metadata. For each detected event cluster, to choose the best subset of photos that meet both relevance and diversity criteria, the system uses hierarchical clustering that exploits both visual and audio information. Evento 360's user interface provides a search feature that is not limited to a certain set of events, but rather can handle an arbitrary event query. It allows the user to retrieve and explore relevant events. The system scales well and is effective in producing high-quality summaries of the detected events.
 Unsupervised Latent Aspect Discovery for Diverse Event Summarization | BIBA | Full-Text 197-200 Wen-Yu Lee; Yin-Hsi Kuo; Peng-Ju Hsieh; Wen-Feng Cheng; Ting-Hsuan Chao; Hui-Lan Hsieh; Chieh-En Tsai; Hsiao-Ching Chang; Jia-Shin Lan; Winston Hsu Recently, the fast growth of social media communities and mobile devices has encouraged more people to share their media data online than ever before. Analyzing data and summarizing it into useful information have become increasingly popular and important for modern society. Given a set of event keywords and a dataset, this paper performs event summarization, aiming to discover and summarize what people may care about for each event from the given dataset. More specifically, this paper extracts latent sub-events with diverse and representative attributes for each given event. This paper proposes effective methods for detecting events with (1) human attribute discovery, such as human pose and clothes, (2) scene analysis, (3) image aspect discovery, and (4) temporal and semantic analysis, to give people different perspectives on the events they are interested in. For practical implementation, this paper studied and conducted experiments on YFCC100M, a dataset of 100 million photos and videos provided by Yahoo!. Finally, a comprehensive and complete system is created accordingly to support diverse event summarization.

### Session: Brave New Ideas

 How Was It?: Exploiting Smartphone Sensing to Measure Implicit Audience Responses to Live Performances | BIBA | Full-Text 201-210 Claudio Martella; Ekin Gedik; Laura Cabrera-Quiros; Gwenn Englebienne; Hayley Hung In this paper, we present an approach to understand the response of an audience to a live dance performance by the processing of mobile sensor data. We argue that exploiting sensing capabilities already available in smart phones enables a potentially large scale measurement of an audience's implicit response to a performance. In this work, we leverage tri-axial accelerometers, worn by ordinary members of the public during a dance performance, to predict responses to a number of survey answers, comprising enjoyment, immersion, willingness to recommend the event to others, and change in mood. We also analyse how behaviour as a result of seeing a dance performance might be reflected in people's subsequent social behaviour using proximity and acceleration sensing. To our knowledge, this is the first work where pervasive mobile sensing has been used to investigate spontaneous responses to predict the affective evaluation of a live performance. Using a single body worn accelerometer to monitor a set of audience members, we were able to predict whether they enjoyed the event with a balanced classification accuracy of 90%. The collective coordination of the audience's bodily movements also highlighted memorable moments that were reported later by the audience. The effective use of body movements to measure affective responses in such a setting is particularly surprising given that traditionally, physiological signals such as skin conductance or brain-based signals are the more commonly accepted methods to measure implicit affective response.    Our experiments open interesting new directions for research on both automated techniques and applications for the implicit tagging of real world events via spontaneous and implicit audience responses during as well as after a performance.
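The abstract above reports a balanced classification accuracy. For readers unfamiliar with the metric, here is a minimal sketch of its usual definition, the mean of per-class recall. The labels below are invented for illustration and are not the paper's data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, which is insensitive to class imbalance."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        # Indices of all samples whose true label is class c.
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical labels: 1 = enjoyed the event, 0 = did not (imbalanced on purpose).
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
```

Here plain accuracy is 0.9 while balanced accuracy is 0.75, because the minority class is recalled only half the time. That gap is why balanced accuracy is often preferred when far more respondents give one answer than the other.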
 Loud and Trendy: Crowdsourcing Impressions of Social Ambiance in Popular Indoor Urban Places | BIBA | Full-Text 211-220 Darshan Santani; Daniel Gatica-Perez New research cutting across architecture, urban studies, and psychology is contextualizing the understanding of urban spaces according to the perceptions of their inhabitants. One fundamental construct that relates place and experience is ambiance, which is defined as "the mood or feeling associated with a particular place". We posit that the systematic study of ambiance dimensions in cities is a new domain for which multimedia research can make pivotal contributions. We present a study to examine how images collected from social media can be used for the crowdsourced characterization of indoor ambiance impressions in popular urban places. We design a crowdsourcing framework to understand suitability of social images as data source to convey place ambiance, to examine what type of images are most suitable to describe ambiance, and to assess how people perceive places socially from the perspective of ambiance along 13 dimensions. Our study is based on 50,000 Foursquare images collected from 300 popular places across six cities worldwide. The results show that reliable estimates of ambiance can be obtained for several of the dimensions. Furthermore, we found that most aggregate impressions of ambiance are similar across popular places in all studied cities. We conclude by presenting a multidisciplinary research agenda for future research in this domain.
 Bringing Deep Causality to Multimedia Data Streams | BIBA | Full-Text 221-230 Laleh Jalali; Ramesh Jain We live in a data abundance era. Large volumes of diverse multimedia data streams (ranging from video, to tweets, to activity, and to PM2.5) can now be used to solve many critical societal problems. Causal modeling across multimedia data streams is essential to reap the potential of this data. However, effective frameworks combining formal abstract approaches with practical computational algorithms for causal inference from such data are needed to utilize available data from diverse sensors. We propose a causal modeling framework that builds on data-driven techniques while emphasizing and including the appropriate human knowledge in causal inference. We show that this formal framework can help in designing a causal model with a systematic approach that facilitates framing sharper scientific questions, incorporating expert knowledge as causal assumptions, and evaluating the plausibility of these assumptions. We show the applicability of the framework in an important asthma management application using meteorological and pollution data streams.
 Analytic Quality: Evaluation of Performance and Insight in Multimedia Collection Analysis | BIBA | Full-Text 231-240 Jan Zahálka; Stevan Rudinac; Marcel Worring In this paper, we present analytic quality (AQ), a novel paradigm for the design and evaluation of multimedia analysis methods. AQ complements the existing evaluation methods based on either machine-driven benchmarks or user studies. AQ includes the notion of user insight gain and the time needed to acquire it, both critical aspects of large-scale multimedia collections analysis. To incorporate insight, AQ introduces a novel user model. In this model, each simulated user, or artificial actor, builds its insight over time, at any time operating with multiple categories of relevance. The methods are evaluated in timed sessions. The artificial actors interact with each method and steer the course by indicating relevant items throughout the session. AQ measures not only precision and recall, but also throughput, diversity of the results, and the accuracy of estimating the percentage of relevant items in the collection. AQ is shown to provide a wide picture of analytic capabilities of the evaluated methods and enumerate how their strengths differ for different purposes. The AQ time plots provide design suggestions for improving the evaluated methods. AQ is demonstrated to be more insightful than the classic benchmark evaluation paradigm both in terms of method comparison and suggestions for further design.

### Session 4: Multimedia and Vision

 Dancing with Turks | BIBA | Full-Text 241-250 I-Kao Chiang; Ian Spiro; Seungkyu Lee; Alyssa Lees; Jingchen Liu; Chris Bregler; Yanxi Liu Dance is a dynamic art form that reflects a wide range of cultural diversity and individuality. With the advancement of motion-capture technology combined with crowd-sourcing and machine learning algorithms, we explore the complex relationship between perceived dance quality/dancer's gender and dance movements/music respectively. As a feasibility study, we construct a computational framework for an analysis-synthesis-feedback loop using a novel multimedia dance-music texture representation. Furthermore, we integrate crowd-sourcing, music and motion-capture data, and machine learning-based methods for dance segmentation, analysis and synthesis of new dancers. A quantitative validation of this framework on a motion-capture dataset of 172 dancers evaluated by more than 400 independent on-line raters demonstrates significant correlation between human perception and the algorithmically intended dance quality or gender of synthesized dancers. The technology illustrated in this work has a high potential to advance the multimedia entertainment industry via dancing with Turks.
 Single Image Spectral Reconstruction for Multimedia Applications | BIBA | Full-Text 251-260 Antonio Robles-Kelly In this paper, we present a method which can perform spectral reconstruction and illuminant recovery from a single colour image making use of an unlabelled training set of hyperspectral images. Our method employs colour and appearance information to drive the reconstruction process subject to the material properties of the objects in the scene. The idea is to reconstruct the image spectral irradiance making use of a set of prototypes extracted from the training set. These spectra, together with a set of convolutional features are hence obtained using sparse coding so as to reconstruct the image irradiance. With the reconstructed spectra in hand, we proceed to compute the illuminant power spectrum using a quadratic optimisation approach. We provide a quantitative analysis for our method and compare to a number of alternatives. We also show sample results on illuminant substitution and transfer, film simulation and image recolouring using mood board colour schemes.
 SkyStitch: A Cooperative Multi-UAV-based Real-time Video Surveillance System with Stitching | BIBA | Full-Text 261-270 Xiangyun Meng; Wei Wang; Ben Leong Recent advances in unmanned aerial vehicle (UAV) technologies have made it possible to deploy an aerial video surveillance system to provide an unprecedented aerial perspective for ground monitoring in real time. Multiple UAVs would be required to cover a large target area, and it is difficult for users to visualize the overall situation if they were to receive multiple disjoint video streams. To address this problem, we designed and implemented SkyStitch, a multiple-UAV video surveillance system that provides a single and panoramic video stream to its users by stitching together multiple aerial video streams. SkyStitch addresses two key design challenges: (i) the high computational cost of stitching and (ii) the difficulty of ensuring good stitching quality under dynamic conditions. To improve the speed and quality of video stitching, we incorporate several practical techniques like distributed feature extraction to reduce workload at the ground station, the use of hints from the flight controller to improve stitching efficiency and a Kalman filter-based state estimation model to mitigate jerkiness. Our results show that SkyStitch can achieve a stitching rate that is 4 times faster than existing state-of-the-art methods and also improve perceptual stitching quality. We also show that SkyStitch can be easily implemented using commercial off-the-shelf hardware.
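The abstract above mentions a Kalman filter-based state estimation model used to mitigate jerkiness. The paper's actual state model is not given here, so the following is only a generic one-dimensional sketch with a constant-state assumption; the variances and the noisy signal are hypothetical values chosen for illustration.

```python
def kalman_smooth(measurements, process_var=1e-3, measurement_var=0.25):
    """Filter a 1-D signal with a constant-state Kalman filter.

    Assumed model: the true value barely changes between frames
    (small process_var) while each measurement is noisy (measurement_var).
    """
    estimate = measurements[0]
    error_var = 1.0  # initial uncertainty of the estimate (assumed)
    smoothed = [estimate]
    for z in measurements[1:]:
        # Predict: state is carried over, uncertainty grows slightly.
        error_var += process_var
        # Update: blend prediction and measurement by the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= (1.0 - gain)
        smoothed.append(estimate)
    return smoothed

def jitter(signal):
    """Total frame-to-frame variation, a crude proxy for jerkiness."""
    return sum(abs(b - a) for a, b in zip(signal, signal[1:]))

# Hypothetical noisy per-frame parameter (e.g. one entry of a stitching transform).
noisy = [10.0, 10.8, 9.4, 10.5, 9.7, 10.3, 9.6, 10.4]
smoothed = kalman_smooth(noisy)
```

Comparing `jitter(noisy)` with `jitter(smoothed)` shows the filtered sequence varying much less from frame to frame, at the cost of reacting more slowly to genuine changes.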
 Eye of the Dragon: Exploring Discriminatively Minimalist Sketch-based Abstractions for Object Categories | BIBA | Full-Text 271-280 Ravi Kiran Sarvadevabhatla; Venkatesh Babu R As a form of visual representation, freehand line sketches are typically studied as an end product of the sketching process. However, from a recognition point of view, one can also study various orderings and properties of the primitive strokes that compose the sketch. Studying sketches in this manner has enabled us to create novel sparse yet discriminative sketch-based representations for object categories which we term category-epitomes. Concurrently, the epitome construction provides a natural measure for quantifying the sparseness underlying the original sketch, which we term epitome-score. We analyze category-epitomes and epitome-scores for hand-drawn sketches from a sketch dataset of 160 object categories commonly encountered in daily life. Our analysis provides a novel viewpoint for examining the complexity of representation for visual object categories.

### Session 5: Multimedia Art, Entertainment and Culture

 A Distributed Theatre Experiment with Shakespeare | BIBA | Full-Text 281-290 Douglas L. Williams; Ian C. Kegel; Marian Ursu; Pablo Cesar; Jack Jansen; Erik Geelhoed; Andras Horti; Michael Frantzis; Bill Scott This paper reports on an experimental production of The Tempest that was developed in collaboration with Miracle Theatre Company and realised as a distributed performance from two separate stages through a dynamically configured telepresence system. The production allowed an exploration of the way a range of technologies, including consumer grade broadband, cameras and projection technologies, could affect the development and delivery of live theatre by a regional touring company. The architecture of the communication platform used to deliver the performance is introduced, as are two novel software tools that are used to describe and control the way the play should be captured and represented.    The experimental production was thoroughly evaluated and the feedback from audience and theatre professionals is presented in some detail.    A considered observation of the process and the way it differs from film, TV and theatre suggests that distributed theatre can be treated as a new genre of storytelling.
 Image Profiling for History Events on the Fly | BIBA | Full-Text 291-300 Jia Chen; Qin Jin; Yong Yu; Alexander G. Hauptmann History event related knowledge is precious and imagery is a powerful medium that records diverse information about the event. In this paper, we propose to automatically construct an image profile given a one sentence description of the historic event which contains where, when, who and what elements. Such a simple input requirement makes our solution easy to scale up and support a wide range of culture preservation and curation related applications ranging from wikipedia enrichment to history education. However, history relevant information on the web is available as "wild and dirty" data, which is quite different from clean, manually curated and structured information sources. There are two major challenges to build our proposed image profiles: 1) unconstrained image genre diversity. We categorize images into genres of documents/maps, paintings or photos. Image genre classification involves a full-spectrum of features from low-level color to high-level semantic concepts. 2) image content diversity. It can include faces, objects and scenes. Furthermore, even within the same event, the views and subjects of images are diverse and correspond to different facets of the event. To solve this challenge, we group images at two levels of granularity: iconic image grouping and facet image grouping. These require different types of features and analysis from near exact matching to soft semantic similarity. We develop a full-range feature analysis module which is composed of several levels, each suitable for different types of image analysis tasks. The wide range of features are based on both classical hand-crafted features and different layers of a convolutional neural network. We compare and study the performance of the different levels in the full-range features and show their effectiveness on handling such a wild, unconstrained dataset.
 Modeling Perspective Effects in Photographic Composition | BIBA | Full-Text 301-310 Zihan Zhou; Siqiong He; Jia Li; James Z. Wang Automatic understanding of photo composition is a valuable technology in multiple areas including digital photography, multimedia advertising, entertainment, and image retrieval. In this paper, we propose a method to model geometrically the compositional effects of linear perspective. Compared with existing methods, which have focused on basic rules of design such as simplicity, visual balance, golden ratio, and the rule of thirds, our new quantitative model is more comprehensive whenever perspective is relevant. We first develop a new hierarchical segmentation algorithm that integrates classic photometric cues with a new geometric cue inspired by perspective geometry. We then show how these cues can be used directly to detect the dominant vanishing point in an image without extracting any line segments, a technique with implications for multimedia applications beyond this work. Finally, we demonstrate an interesting application of the proposed method for providing on-site composition feedback through an image retrieval system.
 Who's Afraid of Itten: Using the Art Theory of Color Combination to Analyze Emotions in Abstract Paintings | BIBA | Full-Text 311-320 Andreza Sartori; Dubravko Culibrk; Yan Yan; Nicu Sebe Color plays an essential role in everyday life and is one of the most important visual cues in human perception. In abstract art, color is one of the essential means to convey the artist's intention and to affect the viewer emotionally. However, colors are rarely experienced in isolation, rather, they are usually presented together with other colors. In fact, the expressive properties of two-color combinations have been extensively studied by artists. It is intriguing to try to understand how color combinations in abstract paintings might affect the viewer emotionally, and to investigate if a computer algorithm can learn this mechanism.    In this work, we propose a novel computational approach able to analyze the color combinations in abstract paintings and use this information to infer whether a painting will evoke positive or negative emotions in an observer. We exploit art theory concepts to design our features and the learning algorithm. To make use of the color-group information, we propose inferring the emotions elicited by paintings based on the sparse group lasso approach. Our results show that a relative improvement of between 6% and 8% can be achieved in this way. Finally, as an application, we employ our method to generate Mondrian-like paintings and do a prospective user study to evaluate the ability of our method as an automatic tool for generating abstract paintings able to elicit positive and negative emotional responses in people.
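For reference, the sparse group lasso named in the abstract above is usually written, in its standard least-squares form, as the objective below. This is the textbook formulation, not necessarily the exact variant used in the paper (which may use a different loss for classification). The symbols are assumptions introduced here: $m$ feature groups, $X^{(l)}$ and $\beta^{(l)}$ the columns and coefficients of group $l$, $p_l$ the group size, $\lambda \ge 0$ the regularization strength, and $\alpha \in [0,1]$ the mix between the two penalties.

```latex
\min_{\beta}\;
  \frac{1}{2n}\,\Bigl\lVert y - \sum_{l=1}^{m} X^{(l)}\beta^{(l)} \Bigr\rVert_2^2
  \;+\; (1-\alpha)\,\lambda \sum_{l=1}^{m} \sqrt{p_l}\,\bigl\lVert \beta^{(l)} \bigr\rVert_2
  \;+\; \alpha\,\lambda\,\lVert \beta \rVert_1
```

The group term can switch off whole groups of features at once (for instance, all features tied to one color group), while the $\ell_1$ term additionally sparsifies coefficients inside the groups that remain.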

### Session 6: Telepresence, Virtual, and Augmented Reality

 Image2Scene: Transforming Style of 3D Room | BIBA | Full-Text 321-330 Xiaowu Chen; Jianwei Li; Qing Li; Bo Gao; Dongqing Zou; Qinping Zhao We propose a style transformation system to transform a 3D room into one that resembles the style of a photograph. We focus on two major components of interior scene style: layout and color. Using an interior image database, we learn the related style guidelines. Given a reference image and a 3D room of two different interior rooms, we first establish semantic correspondence between the two scenes. The styles of the reference image are then extracted in the form of layout constraints and color schemes. Finally, our framework performs layout rearrangement followed by recoloring of the scene to match the learned style of the reference image. We show style transformation results on numerous examples to demonstrate the effectiveness and efficiency of our system.
 Gradient-based 2D-to-3D Conversion for Soccer Videos | BIBA | Full-Text 331-340 Kiana Calagari; Mohamed Elgharib; Piotr Didyk; Alexandre Kaspar; Wojciech Matusik; Mohamed Hefeeda Widespread adoption of 3D videos and technologies is hindered by the lack of high-quality 3D content. One promising solution to address this problem is to use automated 2D-to-3D conversion. However, current conversion methods, while general, produce low-quality results with artifacts that are not acceptable to many viewers. We address this problem by showing how to construct a high-quality, domain-specific conversion method for soccer videos. We propose a novel, data-driven method that generates stereoscopic frames by transferring depth information from similar frames in a database of 3D stereoscopic videos. Creating a database of 3D stereoscopic videos with accurate depth is, however, very difficult. One of the key findings in this paper is showing that computer generated content in current sports computer games can be used to generate a high-quality 3D video reference database for 2D-to-3D conversion methods. Once we retrieve similar 3D video frames, our technique transfers depth gradients to the target frame while respecting object boundaries. It then computes depth maps from the gradients, and generates the output stereoscopic video. We implement our method and validate it by conducting user studies that evaluate depth perception and visual comfort of the converted 3D videos. We show that our method produces high-quality 3D videos that are almost indistinguishable from videos shot by stereo cameras. In addition, our method significantly outperforms the current state-of-the-art method. For example, up to 20% improvement in the perceived depth is achieved by our method, which translates to improving the mean opinion score from Good to Excellent.
 Ubii: Towards Seamless Interaction between Digital and Physical Worlds | BIBA | Full-Text 341-350 Zhanpeng Huang; Weikai Li; Pan Hui We present Ubii (Ubiquitous interface and interaction), an interface system that aims to expand people's perception and interaction from the digital space to the physical world. The centralized user interface is broken into pieces woven into the domain environment. The augmented user interface is paired with physical objects, where physical and digital presentations are displayed in the same context. The augmented interface and physical affordance respond as one control to provide seamless interaction. By connecting the digital interface with physical objects, the system presents a nearby embodiment that affords users a sense of awareness when interacting with domain objects. Integrated on wearable devices such as Google Glass, it affords a less intrusive and more convenient interaction. Our research illustrates the great potential of direct mapping of interaction between digital interfaces and physical affordance by converging wearable devices and augmented reality (AR) technology.
 Smart Beholder: An Open-Source Smart Lens for Mobile Photography | BIBA | Full-Text 351-360 Chun-Ying Huang; Chih-Fan Hsu; Tsung-Han Tsai; Ching-Ling Fan; Cheng-Hsin Hsu; Kuan-Ta Chen Smart lenses are detachable lenses connected to mobile devices via wireless networks, which are not constrained by the small form factor of mobile devices, and have potential to deliver better photo (video) quality. However, the viewfinder previews of smart lenses on mobile devices are difficult to optimize, due to the strict resource constraints on smart lenses and fluctuating wireless network conditions. In this paper, we design, implement, and evaluate an open-source smart lens, called Smart Beholder. It achieves three design goals: (i) cost effectiveness, (ii) low interaction latency, and (iii) high preview quality by: (i) selecting an embedded system board that is just powerful enough, (ii) minimizing per-component latency, and (iii) dynamically adapting the video coding parameters to maximize Quality of Experience (QoE), respectively. Several optimization techniques, such as an anti-drifting mechanism for video frames and a QoE-driven resolution/frame rate adaptation algorithm, are proposed in this paper. Our measurement study shows that Smart Beholder outperforms Altek Cubic and Sony QX100 in terms of lower bitrate, lower latency, slightly higher frame rate, and better preview quality. We also demonstrate that Smart Beholder adapts to network dynamics. Smart Beholder has been made public at http://www.smartbeholder.org as an experimental platform for researchers and developers to optimize smart lenses and other embedded real-time video streaming systems.

### Session 7: Actions and Events

 Coherent Motion Detection with Collective Density Clustering | BIBA | Full-Text 361-370 Yunpeng Wu; Yangdong Ye; Chenyang Zhao Detecting coherent motion is significant for analysing crowd motion in video applications. In this study, we propose the Collective Density Clustering (CDC) approach to recognize both local and global coherent motion having arbitrary shapes and varying densities. Firstly, the collective density is defined to reveal the underlying patterns with varying levels of density. Based on collective density, the collective clustering algorithm is further presented to recognize the local consistency, where density-based clustering is more adaptive to recognize clusters with arbitrary shapes. This algorithm has salient properties including a single-step clustering process, automatic determination of the number of clusters, and accurate identification of outliers. Finally, the collective merging algorithm is introduced to fully characterize the global consistency. Experiments on diverse crowd scenes, including pedestrians, traffic and bacterial colonies, demonstrate its effectiveness for coherent motion detection. The comparisons show that our approach outperforms state-of-the-art coherent motion detection techniques.
 Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images | BIBA | Full-Text 371-380 Chen Sun; Sanketh Shetty; Rahul Sukthankar; Ram Nevatia We address the problem of fine-grained action localization from temporally untrimmed web videos. We assume that only weak video-level annotations are available for training. The goal is to use these weak labels to identify temporal segments corresponding to the actions, and learn models that generalize to unconstrained web videos. We find that web images queried by action names serve as well-localized highlights for many actions, but are noisily labeled. To solve this problem, we propose a simple yet effective method that takes weak video labels and noisy image labels as input, and generates localized action frames as output. This is achieved by cross-domain transfer between video frames and web images, using pre-trained deep convolutional neural networks. We then use the localized action frames to train action recognition models with long short-term memory networks. We collect a fine-grained sports action data set FGA-240 of more than 130,000 YouTube videos. It has 240 fine-grained actions under 85 sports activities. Convincing results are shown on the FGA-240 data set, as well as the THUMOS 2014 localization data set with untrimmed training videos.
 Temporal Matching Kernel with Explicit Feature Maps | BIBA | Full-Text 381-390 Sébastien Poullot; Shunsuke Tsukatani; Anh Phuong Nguyen; Hervé Jégou; Shin'Ichi Satoh This paper proposes a framework for content-based video retrieval that addresses various tasks such as particular event retrieval, copy detection or video synchronization. Given a video query, the method is able to efficiently retrieve, from a large collection, similar video events or near-duplicates with temporally consistent excerpts. As a byproduct of the representation, it provides a precise temporal alignment of the query and the detected video excerpts.    Our method converts a series of frame descriptors into a single visual-temporal descriptor, called a temporal invariant match kernel. This representation takes into account the relative positions of the visual frames: the frame descriptors are jointly encoded with their timestamps. When matching two videos, the method produces a score function for all possible relative timestamps, which is maximized to obtain both the similarity score and the relative time offset.    Then, we propose two complementary contributions to further improve the detection and localization performance.    The first is a novel query expansion method that takes advantage of the joint descriptor/timestamp representation to automatically align the first result set and produce an enriched temporal query. In contrast to other query expansion methods proposed for videos, it preserves the localization capability. Second, we improve the localization trade-off between quality and representation size by using several complementary temporal match kernels.    We evaluate our approach on benchmarks for particular event retrieval, copy detection and video synchronization. Our experiments show that our approach achieves excellent detection and localization results.
 Efficient Activity Retrieval through Semantic Graph Queries | BIBA | Full-Text 391-400 Gregory Castanon; Yuting Chen; Ziming Zhang; Venkatesh Saligrama We present an efficient retrieval approach for activity detection in large surveillance video datasets based on semantic graph queries. Unlike conventional approaches, our zero-shot retrieval method does not require knowledge of the activity classes contained in the video. We propose a novel user-centric approach that models queries through the creation of sparse semantic graphs based on attributes and discriminative relationships. We then pose search as a ranked subgraph matching problem and leverage the fact that the attributes and relationships in the query have different levels of discriminability to filter out bad matches. Rather than solving the NP-hard exact subgraph matching problem, we develop a novel maximally discriminative spanning tree (MDST) as the relaxation of a given query graph, and then describe a matching algorithm that recovers matches to this tree in linear time using maximally discriminative subgraph matching (MDSM).    We utilize the MDST to minimize the number of possible matches to the original query while guaranteeing that the best matches are within this set. We test this algorithm on two large video datasets: the 35-GB Virat Ground dataset and a 1-TB aerial data collection from Yuma. These datasets yield graphs with 200,000 nodes and 1 million nodes, respectively, with an average degree of 5. Our approach finds complex, large-scale queries in seconds while maintaining comparable precision and recall to slower current approaches.
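
The "Temporal Matching Kernel with Explicit Feature Maps" entry above describes producing a score for every relative timestamp and maximizing it to obtain both a similarity score and a time offset. The sketch below illustrates only that idea with a brute-force search over integer frame shifts. It is not the paper's explicit-feature-map kernel; the function names (`best_offset`, `normalize`, `dot`), the random 16-dimensional descriptors and the normalization by query length are all assumptions made for illustration.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def best_offset(query, ref, max_shift):
    """Brute-force search for the shift that best aligns query with ref.

    A shift d pairs query frame t with reference frame t + d. The score
    is the summed frame similarity divided by the query length, so
    shifts with only partial overlap are penalized.
    """
    best_d, best_score = 0, float("-inf")
    for d in range(-max_shift, max_shift + 1):
        total = 0.0
        for t, q in enumerate(query):
            if 0 <= t + d < len(ref):
                total += dot(q, ref[t + d])
        score = total / len(query)
        if score > best_score:
            best_d, best_score = d, score
    return best_d, best_score


# Hypothetical data: 50 random unit-norm frame descriptors, and a query
# that is an exact excerpt starting at reference frame 12.
random.seed(0)
ref = [normalize([random.gauss(0, 1) for _ in range(16)]) for _ in range(50)]
query = ref[12:32]
offset, score = best_offset(query, ref, max_shift=40)
```

This search costs time proportional to the number of shifts times the query length for each pair of videos, which is the kind of per-pair cost that a single compact visual-temporal descriptor is meant to avoid.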

### Session 8: Video Systems

 Video Killed The Data Store: Extending the n-Dimensional Display Interface for Full Screen Video | BIBA | Full-Text 401-410 Charles D. Estes; Ketan Mayer-Patel Prior research introduced the n-Dimensional Display Interface (NDDI) as a new "narrow waist" for the display pipeline. In this paper, we extend the NDDI architecture to provide a blending feature. We then utilize that new feature for full screen video playback, leveraging application-level framing to realize a significant data transmission reduction. We then explore new NDDI configurations for effective rate control under highly constrained transmission budgets.
 Dependency-Aware Unequal Error Protection for Layered Video Coding | BIBA | Full-Text 411-420 Mohammad Reza Zakerinasab; Mea Wang Layered video coding standards encode a high-quality video into multiple layers of unequal importance. Dependent layers that provide higher quality rely on their respective reference layers for successful reconstruction of transmitted video frames. Hence, if a video packet in a reference layer is corrupted or lost during transmission, all its dependent layers cannot be reconstructed successfully, and the resources consumed to transmit them are wasted. To address this problem, unequal error protection (UEP) techniques have been proposed to provide protection to each layer according to their importance. Nonetheless, the importance of a piece of video content is determined by not only the layering structure, but also visual features and encoding decisions. In this paper, we look deeper into the coding and prediction structure of layered encoded videos and model the dependency among macroblocks and submacroblocks (the finest processing units of H.264 video coding standard) as a weighted graph. Based on this graph, we propose a dependency-aware UEP model that protects macroblocks according to their importance. Our simulation results show that the proposed UEP model outperforms the conventional UEP models for layered SVC videos by 3.76 dB of peak signal-to-noise ratio (PSNR) when the channel packet loss rate is as high as 28%.
 HiFi: A Hierarchical Filtering Algorithm for Caching of Online Video | BIBA | Full-Text 421-430 Shahid Akhtar; Andre Beck; Ivica Rimac Online video presents new challenges to traditional caching, with an over thousand-fold increase in the number of assets, rapidly changing popularity of assets and much higher throughput requirements.    We propose a new hierarchical filtering algorithm for caching online video, called HiFi. Our algorithm is designed to optimize hit-rate, replacement rate and cache throughput. It has an associated implementation complexity comparable to that of LRU.    Our results show that under typical operator conditions, HiFi can increase edge cache byte hit-rate by 5-24% over an LRU policy, but more importantly can increase RAM or memory byte hit-rate by 80% to 200% and reduce replacement rate by 90%. These two factors combined can dramatically increase throughput for most caches. If SSDs are used for storage, the much lower replacement rate may also allow substitution of lower cost MLC based SSDs for SLC based SSDs.    We extend previous multi-tier analytical models for LRU caches to caches with filtering. We develop a realistic simulation environment for online video using statistics from operator traces. We show that HiFi performs within a few percentage points of the optimal solution, which was simulated by Belady's MIN algorithm, under typical operator conditions.
 Exploring QoE for Power Efficiency: A Field Study on Mobile Videos with LCD Displays | BIBA | Full-Text 431-440 Zhisheng Yan; Qian Liu; Tong Zhang; Chang Wen Chen Display power consumption has become a major concern for both mobile users and design engineers, especially considering the prevalence of today's video-rich mobile services. The power consumption of liquid crystal display (LCD), a dominant mobile display technology, can be reduced by dynamic backlight scaling (DBS). However, such dynamic changes of screen brightness may degrade users' quality of experience (QoE) in viewing videos. How QoE is impacted by different DBS strategies is not yet clearly understood, which obscures the way to achieve systematic power saving. In this paper, we take a first step to explore the QoE of DBS on smartphones and aim at maximally enhancing the display power performance without negatively impacting users' QoE. In particular, we conduct three motivational studies to uncover the inherent relationship between QoE and backlight scaling frequency, magnitude, and temporal consistency, respectively. Motivated by the findings of these studies, we design a suite of techniques to implement a comprehensive DBS strategy. We demonstrate an example application of the proposed DBS designs in a mobile video streaming system. Measurements and user evaluations show that more than 40% system power reduction, or equivalently, 20% more power savings than the non-QoE approaches, can be achieved without QoE impairment.
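
The HiFi entry above evaluates against two reference points: an LRU policy as the baseline and Belady's MIN algorithm as the optimal solution. The following minimal sketch shows how those two reference points can be simulated on a request trace. It does not implement HiFi itself, it counts request hits rather than byte hits, and the function names and the toy trace are assumptions made for illustration.

```python
from collections import OrderedDict


def lru_hits(trace, capacity):
    """Count cache hits under least-recently-used replacement."""
    cache = OrderedDict()
    hits = 0
    for item in trace:
        if item in cache:
            hits += 1
            cache.move_to_end(item)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[item] = True
    return hits


def belady_hits(trace, capacity):
    """Count cache hits under Belady's MIN (evict the item reused furthest ahead)."""
    # next_use[i] is the index of the next request for trace[i], or infinity.
    next_use = [0] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # item -> index of its next request
    hits = 0
    for i, item in enumerate(trace):
        if item in cache:
            hits += 1
        elif len(cache) >= capacity:
            victim = max(cache, key=cache.get)  # furthest next use
            del cache[victim]
        cache[item] = next_use[i]
    return hits


# Toy cyclic trace (hypothetical): three assets requested in rotation
# with room for only two, a classic worst case for LRU.
trace = list("abcabcabc")
lru = lru_hits(trace, capacity=2)
optimal = belady_hits(trace, capacity=2)
```

Belady's MIN needs the whole future trace, so it serves only as an offline upper bound on hit count for a given cache size, which is how the abstract uses it.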

### Session 9: Deep Learning and Multimedia

 Automatic Image Dataset Construction from Click-through Logs Using Deep Neural Network | BIBA | Full-Text 441-450 Yalong Bai; Kuiyuan Yang; Wei Yu; Chang Xu; Wei-Ying Ma; Tiejun Zhao Labelled image datasets are the backbone for high-level image understanding tasks with wide application scenarios, and continuously drive and evaluate the progress of feature design and supervised learning models. Recently, million-scale labelled image datasets have further contributed to the rebirth of deep convolutional neural networks and made it possible to bypass the manual design of handcrafted features. However, the construction of image datasets is mainly manual and quite labor intensive, often taking years of effort to construct a million-scale dataset of high quality. In this paper, we propose a deep learning based method to construct large-scale image datasets automatically. Specifically, word representation and image representation are learned in a deep neural network from a large amount of click-through logs, and further used to define word-word similarity and image-word similarity. These two similarities are used to automate the two labor intensive steps in manual image dataset construction: query formation and noisy image removal. With a newly proposed cross convolutional filter regularizer, we can construct a million-scale image dataset in one week. Finally, two image datasets are constructed to verify the effectiveness of the method. In addition to scale, the automatically constructed dataset has accuracy, diversity and cross-dataset generalization comparable to manually labelled image datasets.
 DeepFont: Identify Your Font from An Image | BIBA | Full-Text 451-459 Zhangyang Wang; Jianchao Yang; Hailin Jin; Eli Shechtman; Aseem Agarwala; Jonathan Brandt; Thomas S. Huang As font is one of the core design concepts, automatic font identification and similar font suggestion from an image or photo has been on the wish list of many designers. We study the Visual Font Recognition (VFR) problem, and advance the state-of-the-art remarkably by developing the DeepFont system. First of all, we build up the first available large-scale VFR dataset, named AdobeVFR, consisting of both labeled synthetic data and partially labeled real-world data. Next, to combat the domain mismatch between available training and testing data, we introduce a Convolutional Neural Network (CNN) decomposition approach, using a domain adaptation technique based on a Stacked Convolutional Auto-Encoder (SCAE) that exploits a large corpus of unlabeled real-world text images combined with synthetic data preprocessed in a specific way. Moreover, we study a novel learning-based model compression approach, in order to reduce the DeepFont model size without sacrificing its performance. The DeepFont system achieves an accuracy of higher than 80% (top-5) on our collected dataset, and also produces a good font similarity measure for font selection and suggestion. We also achieve around 6 times compression of the model without any visible loss of recognition accuracy.
 Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification | BIBA | Full-Text 461-470 Zuxuan Wu; Xi Wang; Yu-Gang Jiang; Hao Ye; Xiangyang Xue Classifying videos according to content semantics is an important problem with a wide range of applications. In this paper, we propose a hybrid deep learning framework for video classification, which is able to model static spatial information, short-term motion, as well as long-term temporal clues in the videos. Specifically, the spatial and the short-term motion features are extracted separately by two Convolutional Neural Networks (CNN). These two types of CNN-based features are then combined in a regularized feature fusion network for classification, which is able to learn and utilize feature relationships for improved performance. In addition, Long Short Term Memory (LSTM) networks are applied on top of the two features to further model longer-term temporal clues. The main contribution of this work is the hybrid learning framework that can model several important aspects of the video data. We also show that (1) combining the spatial and the short-term motion features in the regularized fusion network is better than direct classification and fusion using the CNN with a softmax layer, and (2) the sequence-based LSTM is highly complementary to the traditional classification strategy without considering the temporal frame orders. Extensive experiments are conducted on two popular and challenging benchmarks, the UCF-101 Human Actions and the Columbia Consumer Videos (CCV). On both benchmarks, our framework achieves very competitive performance: 91.3% on the UCF-101 and 83.5% on the CCV.
 EventNet: A Large Scale Structured Concept Library for Complex Event Detection in Video | BIBA | Full-Text 471-480 Guangnan Ye; Yitong Li; Hongliang Xu; Dong Liu; Shih-Fu Chang Event-specific concepts are the semantic concepts specifically designed for the events of interest, which can be used as a mid-level representation of complex events in videos. Existing methods only focus on defining event-specific concepts for a small number of pre-defined events, but cannot handle novel unseen events. This motivates us to build a large scale event-specific concept library that covers as many real-world events and their concepts as possible. Specifically, we choose WikiHow, an online forum containing a large number of how-to articles on human daily life events. We perform a coarse-to-fine event discovery process and discover 500 events from WikiHow articles. Then we use each event name as query to search YouTube and discover event-specific concepts from the tags of returned videos. After an automatic filter process, we end up with 95,321 videos and 4,490 concepts. We train a Convolutional Neural Network (CNN) model on the 95,321 videos over the 500 events, and use the model to extract deep learning feature from video content. With the learned deep learning feature, we train 4,490 binary SVM classifiers as the event-specific concept library. The concepts and events are further organized in a hierarchical structure defined by WikiHow, and the resultant concept library is called EventNet. Finally, the EventNet concept library is used to generate concept based representation of event videos. To the best of our knowledge, EventNet represents the first video event ontology that organizes events and their concepts into a semantic structure. It offers great potential for event retrieval and browsing. 
Extensive experiments over the zero-shot event retrieval task when no training samples are available show that the proposed EventNet concept library consistently and significantly outperforms the state-of-the-art (such as the 20K ImageNet concepts trained with CNN) by a large margin up to 207%. We will also show that EventNet structure can help users find relevant concepts for novel event queries that cannot be well handled by conventional text based semantic analysis alone. The unique two-step approach of first applying event detection models followed by detection of event-specific concepts also provides great potential to improve the efficiency and accuracy of Event Recounting since only a very small number of event-specific concept classifiers need to be fired after event detection.

### Session 10: Multimedia Quality Perception

 Modelling Human Factors in Perceptual Multimedia Quality: On The Role of Personality and Culture | BIBA | Full-Text 481-490 Michael James Scott; Sharath Chandra Guntuku; Yang Huan; Weisi Lin; Gheorghita Ghinea Perception of multimedia quality is shaped by a rich interplay between system, context and human factors. While system and context factors are widely researched, few studies consider human factors as sources of systematic variance. This paper presents an analysis on the influence of personality and cultural traits on the perception of multimedia quality. A set of 144 video sequences (from 12 short movie excerpts) were rated by 114 participants from a cross-cultural population, producing 1232 ratings. On this data, three models are compared: a baseline model that only considers system factors; an extended model that includes personality and culture as human factors; and an optimistic model in which each participant is modelled as a random effect. An analysis shows that personality and cultural traits represent 9.3% of the variance attributable to human factors while human factors overall predict an equal or higher proportion of variance compared to system factors. In addition, the quality-enjoyment correlation varied across the excerpts. This suggests that human factors play an important role in perceptual multimedia quality, but further research to explore moderation effects and a broader range of human factors is warranted.
 Biologically Inspired Media Quality Modeling | BIBA | Full-Text 491-500 Luming Zhang; Meng Wang; Liqiang Nie; Richang Hong; Yingjie Xia; Roger Zimmermann A successful quality model is indispensable in a rich variety of multimedia applications, e.g., image classification and video summarization. Conventional approaches have developed many features to assess media quality at both low and high levels. However, they cannot reflect the process of the human visual cortex in media perception. It is generally accepted that an ideal quality model should be biologically plausible, i.e., capable of mimicking human gaze shifting as well as the complicated visual cognition. In this paper, we propose a biologically inspired quality model, focusing on interpreting how humans perceive visually and semantically important regions in an image (or a video clip). Particularly, we first extract local descriptors (graphlets in this work) from an image/frame. They are projected onto the perceptual space, which is built upon a set of low-level and high-level visual features. Then, an active learning algorithm is utilized to select graphlets that are both visually and semantically salient. The algorithm is based on the observation that each graphlet can be linearly reconstructed by its surrounding ones, and spatially nearer ones make a greater contribution. In this way, both the local and global geometric properties of an image/frame can be encoded in the selection process. These selected graphlets are linked into a so-called biological viewing path (BVP) to simulate human visual perception. Finally, the quality of an image or a video clip is predicted by a probabilistic model. Experiments show that 1) the predicted BVPs are over 90% consistent with real human gaze shifting paths on average; and 2) our quality model outperforms many of its competitors remarkably.
 QoE Modelling for VP9 and H.265 Videos on Mobile Devices | BIBA | Full-Text 501-510 Wei Song; Yao Xiao; Dian Tjondronegoro; Antonio Liotta Current mobile devices and streaming video services support high definition (HD) video, increasing expectations for more content. HD video streaming generally requires large bandwidth, exerting pressure on existing networks. A new generation of video compression codecs, such as VP9 and H.265/HEVC, is expected to be more effective at reducing bandwidth. Existing studies measuring the impact of their compression on users' perceived quality have not focused on mobile devices. Here we propose new Quality of Experience (QoE) models that consider both subjective and objective assessments of mobile video quality. We introduce novel predictors, such as the correlations between video resolution and size of coding unit, and achieve a high goodness-of-fit to the collected subjective assessment data (adjusted R-square >83%). The performance analysis shows that H.265 can potentially achieve 44% to 59% bit rate saving compared to H.264/AVC, slightly better than VP9 at 33% to 53%, depending on video content and resolution.
 Towards Solving the Bottleneck of Pitch-based Singing Voice Separation | BIBA | Full-Text 511-520 Bilei Zhu; Wei Li; Linwei Li Singing voice separation from accompaniment in monaural music recordings is a crucial technique in music information retrieval. A majority of existing algorithms are based on singing pitch detection, and take the detected pitch as the cue to identify and separate the harmonic structure of the singing voice. However, as a key yet undependable premise, vocal pitch detection makes the separation performance of these algorithms rather limited. To overcome the inherent weakness of pitch-based inference algorithms, two novel methods based on non-negative matrix factorization (NMF) are devised in this paper. The first one combines NMF with the distribution regularities of vocals under different time frequency resolutions, so that many vocal unrelated portions are eliminated and the singing voice is hence enhanced. In consequence, the accuracy of vocal pitch detection is significantly improved. The second method applies NMF to decompose the spectrogram into non-overlapping and indivisible segments, which can be used as another cue besides the pitch to help identify the vocal harmonic structure. The two proposed methods are integrated into the framework of pitch-based inference. Extensive testing on the MIR-1K public dataset shows that both of them are rather effective, and the overall performances outperform other state-of-the-art singing separation algorithms.

### Session 11: Multimedia Networking

 Enhancing the Quality of Interactive Multimedia Services by Proactive Monitoring and Failure Prediction | BIBA | Full-Text 521-530 Mohammed Shatnawi; Mohamed Hefeeda Online multimedia communication services, such as Skype and Google Hangout, are used by millions of users every day. Although these services provide acceptable quality on average, users occasionally suffer from reduced audio quality, dropped video streams, and even failed sessions. To mitigate some of these problems, service providers closely monitor the performance of different parts of the system. However, most current techniques for monitoring and managing the quality of service (QoS) of online multimedia communication services are reactive and lack the ability to adapt to dynamic changes in real time. We propose a novel proactive approach for continuously monitoring the health of large-scale multimedia communication services, and dynamically managing and improving the quality of the multimedia sessions. The proposed approach, called Proactive QoS Manager, has novel light-weight methods for estimating the capacity of different components of the system and for using this capacity estimation in allocating resources to multimedia sessions in real time. We implement the proposed approach in one of the largest online multimedia communication services in the world and evaluate its performance on more than 100 million audio, video, and conferencing sessions. Our empirical results show that substantial quality improvements can be achieved using our proactive approach, without changing the production code of the service or imposing significant overheads. For example, in our experiments, the Proactive QoS Manager reduced the number of failed sessions by up to 25% and improved the quality (in terms of the Mean Opinion Score (MOS)) of the succeeded sessions by up to 12%. 
These improvements are achieved for the well-engineered and highly-provisioned online service examined in this paper; we expect higher gains for other similar services.
 Distributed Optimal Datacenter Bandwidth Allocation for Dynamic Adaptive Video Streaming | BIBA | Full-Text 531-540 Fanxin Kong; Xingjian Lu; Mingyuan Xia; Xue Liu; Haibing Guan Video streaming systems such as YouTube and Netflix are usually supported by content delivery networks and datacenters that can consume many megawatts of power. Most existing works independently study the issues of improving quality of experience (QoE) for viewers and reducing the cost and emissions associated with the enormous energy usage of datacenters. By contrast, this paper addresses them both, and jointly optimizes the QoE, the energy cost and emissions by intelligently allocating datacenter bandwidth among different client groups. Specifically, we propose a distributed algorithm for achieving the optimal bandwidth allocation. The algorithm takes a novel approach, decomposing the optimization process into separate subproblems that are solved iteratively across datacenters and clients. We demonstrate its convergence by both theoretical proof and experimental validation. The experimental results show that the proposed algorithm converges very fast and achieves a much better QoE-cost balance than existing approaches.
 HTTP/2-Based Methods to Improve the Live Experience of Adaptive Streaming | BIBA | Full-Text 541-550 Rafael Huysegems; Tom Bostoen; Patrice Rondao Alface; Jeroen van der Hooft; Stefano Petrangeli; Tim Wauters; Filip De Turck HTTP Adaptive Streaming (HAS) is today the number one video technology for over-the-top video distribution. In HAS, video content is temporally divided into multiple segments and encoded at different quality levels. A client selects and retrieves per segment the most suited quality version to create a seamless playout. Despite the ability of HAS to deal with changing network conditions, HAS-based live streaming often suffers from freezes in the playout due to buffer under-run, low average quality, large camera-to-display delay, and large initial/channel-change delay. Recently, IETF has standardized HTTP/2, a new version of the HTTP protocol that provides new features for reducing the page load time in Web browsing. In this paper, we present ten novel HTTP/2-based methods to improve the quality of experience of HAS. Our main contribution is the design and evaluation of a push-based approach for live streaming in which super-short segments are pushed from server to client as soon as they become available. We show that with an RTT of 300 ms, this approach can reduce the average server-to-display delay by 90.1% and the average start-up delay by 40.1%.
 Bandwidth-aware Prefetching for Proactive Multi-video Preloading and Improved HAS Performance | BIBA | Full-Text 551-560 Vengatanathan Krishnamoorthi; Niklas Carlsson; Derek Eager; Anirban Mahanti; Nahid Shahmehri This paper considers the problem of providing users playing one streaming video the option of instantaneous and seamless playback of alternative videos. Recommendation systems can easily provide a list of alternative videos, but there is little research on how to best eliminate the startup time for these alternative videos. The problem is motivated by services that want to retain increasingly impatient users, who frequently watch the beginning of multiple videos, before viewing a video to the end. We present the design, implementation, and evaluation of an HTTP-based Adaptive Streaming (HAS) solution that provides careful prefetching and buffer management. We also present the design and evaluation of three fundamental policy classes that provide different tradeoffs between how aggressively new alternative videos are prefetched versus the importance of ensuring high playback quality. We show that our solution allows us to reduce the startup times of alternative videos by an order of magnitude and effectively adapt the quality such as to ensure the highest possible playback quality of the video being viewed. By improving the channel utilization we also address the discrimination problem that HAS clients often suffer from, allowing us to in some cases simultaneously improve the playback quality of the video being viewed and provide the value-added service of allowing instantaneous playback of the prefetched alternative videos.

### Session 12: Data Imperfectness for Multimedia

 Multi-View Visual Recognition of Imperfect Testing Data | BIBA | Full-Text 561-570 Qilin Zhang; Gang Hua A practical yet under-explored problem often encountered by multimedia researchers is the recognition of imperfect testing data, where multiple sensing channels are deployed but interference or transmission distortion corrupts some of them. Typical cases of imperfect testing data include missing features and feature misalignments. To address these challenges, we choose the latent space model and introduce a new similarity learning canonical-correlation analysis (SLCCA) method to capture the semantic consensus between views. The consensus information is preserved by projection matrices learned with modified canonical-correlation analysis (CCA) optimization terms with new, explicit class-similarity constraints. To make it computationally tractable, we propose to combine a practical relaxation and an alternating scheme to solve the optimization problem. Experiments on four challenging multi-view visual recognition datasets demonstrate the efficacy of the proposed method.
 If You Can't Beat Them, Join Them: Learning with Noisy Data | BIBA | Full-Text 571-580 Pravin Kakar; Alex Yong-Sang Chia Vision capabilities have been significantly enhanced in recent years due to the availability of powerful computing hardware and sufficiently large and varied databases. However, the labelling of these image databases prior to training still involves considerable effort and is a roadblock for truly scalable learning. For instance, it has been shown that tag noise levels in Flickr images are as high as 80%. Therefore, to exploit large image datasets, extensive effort has been invested in reducing the tag noise of the data by refining the image tags or by developing robust learning frameworks. In this work, we follow the latter approach, where we propose a multi-layer neural network-based noisy learning framework that incorporates noise probabilities of a training dataset. These are then utilized effectively to perform learning with sustained levels of accuracy, even in the presence of significant noise levels. We present results on several datasets of varying sizes and complexity and demonstrate that the proposed mechanism is able to outperform existing methods, despite often employing weaker constraints and assumptions.
 Searching Persuasively: Joint Event Detection and Evidence Recounting with Limited Supervision | BIBA | Full-Text 581-590 Xiaojun Chang; Yao-Liang Yu; Yi Yang; Alexander G. Hauptmann Multimedia event detection (MED) and multimedia event recounting (MER) are fundamental tasks in managing large amounts of unconstrained web videos, and have attracted a lot of attention in recent years. Most existing systems perform MER as a post-processing step on top of the MED results. In order to leverage the mutual benefits of the two tasks, we propose a joint framework that simultaneously detects high-level events and localizes the indicative concepts of the events. Our premise is that a good recounting algorithm should not only explain the detection result, but should also be able to assist detection in the first place. Coupled in a joint optimization framework, recounting improves detection by pruning irrelevant noisy concepts while detection directs recounting to the most discriminative evidences. To better utilize the powerful and interpretable semantic video representation, we segment each video into several shots and exploit the rich temporal structures at shot level. The consequent computational challenge is carefully addressed through a significant improvement of the current ADMM algorithm, which, after eliminating all inner loops and equipping novel closed-form solutions for all intermediate steps, enables us to efficiently process extremely large video corpora. We test the proposed method on the large scale TRECVID MEDTest 2014 and MEDTest 2013 datasets, and obtain very promising results for both MED and MER.
 Beyond Doctors: Future Health Prediction from Multimedia and Multimodal Observations | BIBA | Full-Text 591-600 Liqiang Nie; Luming Zhang; Yi Yang; Meng Wang; Richang Hong; Tat-Seng Chua Although chronic diseases cannot be cured, they can be effectively controlled as long as we understand their progressions based on the current observational health records, which are often in the form of multimedia data. A large and growing body of literature has investigated the disease progression problem. However, far too little attention to date has been paid to jointly considering the following three observations of chronic disease progression: 1) the health statuses at different time points are chronologically similar; 2) the future health statuses of each patient can be comprehensively revealed from the current multimedia and multimodal observations, such as visual scans, digital measurements and textual medical histories; and 3) the discriminative capabilities of different modalities vary significantly in accordance with specific diseases. In the light of these, we propose an adaptive multimodal multi-task learning model to co-regularize the modality agreement, temporal progression and discriminative capabilities of different modalities. We theoretically show that our proposed model is a linear system. Before training our model, we address the missing-data problem via the matrix factorization approach. Extensive evaluations on a real-world Alzheimer's disease dataset verify our proposed model. It should be noted that our model is also applicable to other chronic diseases.

### Session 13: Multimedia Experiences and Expectations

 Multi-sensor Self-Quantification of Presentations | BIBA | Full-Text 601-610 Tian Gan; Yongkang Wong; Bappaditya Mandal; Vijay Chandrasekhar; Mohan S. Kankanhalli Presentations have been an effective means of delivering information to groups for ages. Over the past few decades, technological advancements have revolutionized the way humans deliver presentations. Despite that, the quality of presentations can be varied and affected by a variety of reasons. Conventional presentation evaluation usually requires painstaking manual analysis by experts. Although the expert feedback can definitely assist users in improving their presentation skills, manual evaluation suffers from high cost and is often not accessible to most people. In this work, we propose a novel multi-sensor self-quantification framework for presentations. Utilizing conventional ambient sensors (i.e., static cameras, Kinect sensor) and the emerging wearable egocentric sensors (i.e., Google Glass), we first analyze the efficacy of each type of sensor with various nonverbal assessment rubrics, which is followed by our proposed multi-sensor presentation analytics framework. The proposed framework is evaluated on a new presentation dataset, namely NUS Multi-Sensor Presentation (NUSMSP) dataset, which consists of 51 presentations covering a diverse set of topics. The dataset was recorded with ambient static cameras, Kinect sensor, and Google Glass. In addition to multi-sensor analytics, we have conducted a user study with the speakers to verify the effectiveness of our system generated analytics, which has received positive and promising feedback.
 HyperMeeting: Supporting Asynchronous Meetings with Hypervideo | BIBA | Full-Text 611-620 Andreas Girgensohn; Jennifer Marlow; Frank Shipman; Lynn Wilcox While synchronous meetings are an important part of collaboration, it is not always possible for all stakeholders to meet at the same time. We created the concept of hypermeetings for meetings with asynchronous attendance. Such hypermeetings consist of a chain of video-recorded meetings with hyperlinks for navigating through them. Our HyperMeeting system supports the viewing of prior meetings during a videoconference. Natural viewing behavior such as pausing video generates hyperlinks between previous and current meetings. During playback, automatic link-following guided by playback plans presents the relevant content to users. Playback plans take into account the user's meeting attendance and viewing history and match them with features such as topic and speaker segmentation. A user study showed that participants found hyperlinks useful but did not always understand where the links would take them. Experiences from longer-term use and the study results provide a good basis for future system improvements.
 MMToC: A Multimodal Method for Table of Content Creation in Educational Videos | BIBA | Full-Text 621-630 Arijit Biswas; Ankit Gandhi; Om Deshmukh In this paper we propose a multimodal method called MMToC for automatically creating a table of content for educational videos. MMToC defines and quantifies word saliency for visual words extracted from the slides and spoken words obtained from the speech transcript. The saliency scores from these two modalities are combined to obtain a ranked list of salient words. These ranked words along with their saliency scores are used to formulate a topic segmentation cost function. The cost function is optimized using a dynamic program framework to obtain the topic segments of the video. These segments are labelled with their corresponding topic names for creating the table of content. We perform experiments on 24 hours of lectures spread across 23 videos ranging over 20-75 minutes duration each. We compare the proposed method with LDA-based video segmentation approaches and show that the proposed MMToC method is significantly better (F-score improvement of 0.19 and 0.24 on two datasets). We also perform a user study to demonstrate the effectiveness of MMToC for navigating educational videos.
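The dynamic-programming step mentioned in the MMToC abstract can be illustrated with a minimal sketch. The cost used here is an assumption made for illustration only: it is the within-segment squared deviation of a single per-unit score, not the saliency-based cost function of MMToC, which the abstract does not specify. The names `segment` and `seg_cost` are hypothetical.

```python
def segment(scores, k):
    """Split `scores` into k contiguous segments with minimal total cost.

    Assumed cost (illustrative, not MMToC's): the sum of squared deviations
    from the mean inside each segment. Returns a list of (start, end)
    half-open index pairs.
    """
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(scores)")

    # Prefix sums let us evaluate any segment cost in O(1).
    ps, ps2 = [0.0], [0.0]
    for s in scores:
        ps.append(ps[-1] + s)
        ps2.append(ps2[-1] + s * s)

    def seg_cost(i, j):
        """Squared deviation of scores[i:j] around its mean."""
        total = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[c][j]: minimal cost of splitting scores[:j] into c segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = best[c - 1][i] + seg_cost(i, j)
                if cand < best[c][j]:
                    best[c][j] = cand
                    back[c][j] = i

    # Walk the back-pointers from the end to recover the boundaries.
    bounds, j = [], n
    for c in range(k, 0, -1):
        i = back[c][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return bounds
```

For example, `segment([1, 1, 1, 5, 5, 9, 9, 9], 3)` places the boundaries where the score level changes. In a lecture-video setting the units would plausibly be slides or transcript windows, and the cost would be built from word saliency; that mapping is left open here.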
 Interactive Scene Flow Editing for Improved Image-based Rendering and Virtual Spacetime Navigation | BIBA | Full-Text 631-640 Kai Ruhl; Martin Eisemann; Anna Hilsmann; Peter Eisert; Marcus Magnor High-quality stereo and optical flow maps are essential for a multitude of tasks in visual media production, e.g. virtual camera navigation, disparity adaptation or scene editing. Rather than estimating stereo and optical flow separately, scene flow is a valid alternative since it combines both spatial and temporal information and recently surpassed the former two in terms of accuracy. However, since automated scene flow estimation is inaccurate in a number of situations, the resulting rendering artifacts have to be corrected manually in each output frame, an elaborate and time-consuming task. We propose a novel workflow to edit the scene flow itself, catching the problem at its source and yielding a more flexible instrument for further processing. By integrating user edits in early stages of the optimization, we allow the use of approximate scribbles instead of accurate editing, thereby reducing interaction times. Our results show that editing the scene flow improves the quality of visual results considerably while requiring vastly less editing effort.

### Doctoral Symposium

 Real-Time Assistance in Multimedia Capture Using Social Media | BIBA | Full-Text 641-644 Yogesh Singh Rawat In the last decade, we have seen significant improvement in the ease and cost of capturing multimedia content. However, the aesthetic quality of the content captured by an amateur user still needs substantial improvement. This doctoral research aims at providing real-time assistance to amateur users so that they can capture high quality photographs and home videos. Our approach is focused on learning the art of photography and videography from multimedia content shared on social media. We have proposed a context-based photography learning method which can assist a user in capturing high quality photographs. The photography learning is augmented with contextual information such as time, geo-location, environmental conditions and type of image, which have an impact on photography. The proposed method can provide real-time feedback to the user regarding scene composition, camera parameters, camera movement and viewpoint. We have presented some preliminary results and also described the planned future work.
 Intuitive Input Methods for Interactive Segmentation on Mobile Touch-Based Devices | BIBA | Full-Text 645-648 Christoph Korinke Existing interactive image segmentation approaches mainly focus on the algorithms rather than the input methods. The literature regarding the input methods is mainly applied to desktop PCs. However, there is a transition to mobile touch-based devices. In this paper we describe our approach to identify a set of intuitive input methods for interactive segmentation on mobile devices. Preliminary results of two user studies are presented. A description of our planned research is given, divided into initial and refinement input methods and the exploitation of the input for segmentation algorithms.
 Exploiting Contextual Information to Enable Efficient Content Delivery for 3D Tele-Immersion Applications | BIBA | Full-Text 649-652 Shannon Chen The tradeoff relationship between resource requirement, content complexity, and user satisfaction is magnified when more and more modern 3D Tele-immersive (3DTI) applications with higher quality demands and/or scalability requirements come into the picture. These demanding applications introduce challenges in different phases throughout the delivery chain of 3DTI systems. To tackle them, we propose to exploit contextual information of 3DTI systems, such as contextual information on resource, content, and satisfaction aspects. Understanding contextual information will improve the utilization of different computing environments to fulfill the objectives of targeted application.
 Socializing Multimodal Sensors for Information Fusion | BIBA | Full-Text 653-656 Yuhui Wang In the modern big data world, development of a huge number of physical or social sensors provides us a great opportunity to explore both cyber and physical situation awareness. However, social sensor fusion for situation awareness is still in its infancy and lacks a unified framework to aggregate and composite real-time media streams from diverse sensors and social network platforms. We propose a new paradigm where sensor and social information are fused together facilitating event detection or customized services. Our proposal consists of 1) a tweeting camera framework where cameras can tweet event related information; 2) a hybrid social sensor fusion algorithm utilizing spatio-temporal-semantic information from multimodal sensors and 3) a new social-cyber-physical paradigm where human and sensors are collaborating for event fusion. Our research progress and preliminary results are presented and future directions are discussed.
 Learn to Recognize Actions Through Neural Networks | BIBA | Full-Text 657-660 Zhenzhong Lan This research seeks to develop neural network techniques to effectively recognize actions in videos. The proposed study will lead to a deeper understanding of how neural network algorithms can help AI systems to understand motions. It will also realize an action recognition system that significantly outperforms the current state of the art. In addition, we will investigate the extent to which our work can be beneficial to understanding how brains perceive and analyze actions. Recent data-driven neural network approaches such as convolutional neural networks have been successful for object recognition. However, learning to recognize actions has proven to be quite a challenge due to the difficulty of getting enough labels, processing large-scale video data, and capturing motion information from videos. Therefore, we leverage effective techniques from local hand-crafted methods to help neural network algorithms learn motion features. These techniques include learning from video and optical flow volumes that follow motion trajectories, pooling features from videos played at multiple frame rates to achieve speed invariance, extending the local descriptors with normalized locations to incorporate spatial-temporal information, and a training-free re-ranking technique to exploit the relationship among classes. We also discuss the fundamental problem of whether we should learn time-aware models and what our models actually capture when we feed them temporal data. Finally, we discuss the connection of our research with how brains perceive and recognize motions.
 Challenge for Manga Processing: Sketch-based Manga Retrieval | BIBA | Full-Text 661-664 Yusuke Matsui We propose a sketch-based system for manga image retrieval. In the system, users simply draw sketches, and similar images are retrieved in real time from a manga database. The results are updated every time the user draws a stroke and therefore users can intuitively interact with the system. The proposed method consists of a simple and efficient sliding window-based feature description framework and interactive re-ranking schemes, which are introduced because the characteristics of manga images are different from those of naturalistic images, and thus, traditional image retrieval methods are not effective. Additionally, the future directions of improving the current retrieval system, including by using a combination of text features and image features, and the construction of a large database are discussed.
 Captioning Images Using Different Styles | BIBA | Full-Text 665-668 Alexander Patrick Mathews I develop techniques that can be used to incorporate stylistic objectives into existing image captioning systems. Style is generally a very tricky concept to define, thus I concentrate on two specific components of style. First, I develop a technique for predicting how people will name visual objects. I demonstrate that this technique could be used to generate captions with human-like naming conventions. Full details are available in a recent publication. Second, I outline a system for generating sentences which express a strong positive or negative sentiment. Finally, I present two possible future directions which are aimed at modelling style more generally. These are learning to imitate an individual's captioning style and generating a diverse set of captions for a single image.
 Weakly Supervised Learning of Part-based Models for Interaction Prediction via LDA | BIBA | Full-Text 669-671 Jia-Lin Chen In this paper, we focus on interaction prediction, which infers what interaction might happen in the near future. Each interaction is modeled by mixtures of deformable parts in order to provide higher tolerance to part configurations. In our weakly supervised learning setting, part detectors are learned from training data without bounding boxes around the true locations of the people in each frame. The discriminating features are obtained using a two-layer Linear Discriminant Analysis (LDA) classification to ensure maximal separability for parts and interactions respectively. Experimental results demonstrate that the proposed system is effective in learning part-based models with less annotated information and achieves comparable performance to state-of-the-art fully supervised approaches.

### Open Source Software Competition

 The fertilized forests Decision Forest Library | BIBA | Full-Text 681-684 Christoph Lassner; Rainer Lienhart Since the introduction of Random Forests in the 80's they have been a frequently used statistical tool for a variety of machine learning tasks. Many different training algorithms and model adaptions demonstrate the versatility of the forests. This variety resulted in a fragmentation of research and code, since each adaption requires its own algorithms and representations. In 2011, Criminisi and Shotton developed a unifying Decision Forest model for many tasks. By identifying the reusable parts and specifying clear interfaces, we extend this approach to an object oriented representation and implementation. This has the great advantage that research on specific parts of the Decision Forest model can be done 'locally' by reusing well-tested and high-performance components.    Our fertilized forests library is open source and easy to extend. It provides components allowing for parallelization up to node optimization level to exploit modern many core architectures. Additionally, the library provides consistent and easy-to-maintain interfaces to C++, Python and Matlab and offers cross-platform and cross-interface persistence.
 SINGA: A Distributed Deep Learning Platform | BIBA | Full-Text 685-688 Beng Chin Ooi; Kian-Lee Tan; Sheng Wang; Wei Wang; Qingchao Cai; Gang Chen; Jinyang Gao; Zhaojing Luo; Anthony K. H. Tung; Yuan Wang; Zhongle Xie; Meihui Zhang; Kaiping Zheng Deep learning has shown outstanding performance in various machine learning tasks. However, the deep complex model structure and massive training data make it expensive to train. In this paper, we present a distributed deep learning system, called SINGA, for training big models over large datasets. An intuitive programming model based on the layer abstraction is provided, which supports a variety of popular deep learning models. SINGA architecture supports both synchronous and asynchronous training frameworks. Hybrid training frameworks can also be customized to achieve good scalability. SINGA provides different neural net partitioning schemes for training large models. SINGA is an Apache Incubator project released under Apache License 2.
 MatConvNet: Convolutional Neural Networks for MATLAB | BIBA | Full-Text 689-692 Andrea Vedaldi; Karel Lenc MatConvNet is an open source implementation of Convolutional Neural Networks (CNNs) with a deep integration in the MATLAB environment. The toolbox is designed with an emphasis on simplicity and flexibility. It exposes the building blocks of CNNs as easy-to-use MATLAB functions, providing routines for computing convolutions with filter banks, feature pooling, normalisation, and much more. MatConvNet can be easily extended, often using only MATLAB code, allowing fast prototyping of new CNN architectures. At the same time, it supports efficient computation on CPU and GPU, allowing complex models to be trained on large datasets such as ImageNet ILSVRC containing millions of training examples.
 Theia: A Fast and Scalable Structure-from-Motion Library | BIBA | Full-Text 693-696 Christopher Sweeney; Tobias Hollerer; Matthew Turk In this paper, we have presented a comprehensive multi-view geometry library, Theia, that focuses on large-scale SfM. In addition to state-of-the-art scalable SfM pipelines, the library provides numerous tools that are useful for students, researchers, and industry experts in the field of multi-view geometry. Theia contains clean code that is well documented (with code comments and the website) and easy to extend. The modular design allows for users to easily implement and experiment with new algorithms within our current pipeline without having to implement a full end-to-end SfM pipeline themselves. Theia has already gathered a large number of diverse users from universities, startups, and industry and we hope to continue to gather users and active contributors from the open-source community.
 eRS: A System to Facilitate Emotion Recognition in Movies | BIBA | Full-Text 697-700 Joël Dumoulin; Diana Affi; Elena Mugellini; Omar Abou Khaled We present eRS, an open-source system, released under the MIT license, whose purpose is to facilitate the workflow of emotion recognition in movies. The system consists of a Django project and an AngularJS web application. It allows users to easily create emotional video datasets, process the videos, extract the features and model the emotion. All data is exposed by a REST API, making it available not only to the eRS web application, but also to other applications. All visualizations are interactive and linked to the playing video, allowing researchers to easily analyze the results of their algorithms. The system currently runs on Linux and OS X. eRS can be extended to integrate new features and algorithms needed in the different steps of emotion recognition in movies.
 WATTS: a Web Annotation Tool for Surveillance Scenarios | BIBA | Full-Text 701-704 Federico Bartoli; Lorenzo Seidenari; Giuseppe Lisanti; Svebor Karaman; Alberto Del Bimbo In this paper, we present a web-based annotation tool we developed for collaboratively creating a detailed ground truth for datasets related to visual surveillance and behavior understanding. The system persistence is based on a relational database and the user interface is designed using HTML5, Javascript and CSS. Our tool can easily manage datasets with multiple cameras. It allows annotating a person's location in the image, their identity, their body and head gaze, as well as a potential occlusion or group membership. We justify each annotation type with regard to current trends of research in the computer vision community. We further detail how our interface can be used to create each of these annotation types. We conclude the paper with a usability evaluation of our system.
 Aurio: Audio Processing, Analysis and Retrieval | BIBA | Full-Text 705-708 Mario Guggenberger Aurio is an open source software library written for audio-based processing, analysis, and retrieval of audio and video recordings. The novelty of this library is the implementation of a number of fingerprinting and time warping algorithms to retrieve, match and synchronize media streams, which no other library currently offers. It is designed with simplicity, performance and versatility in mind, can be easily integrated into .NET applications, and offers a collection of many basic signal processing methods. It can read many file formats, offers multiple export abilities for further processing, and contains various UI widgets for graphical applications. Built upon the Aurio library, AudioAlign is an additionally released open source application for the (semi-)automatic synchronization of media recordings.
 Amalia.js: An Open-Source Metadata Driven HTML5 Multimedia Player | BIBA | Full-Text 709-712 Nicolas Hervé; Pierre Letessier; Mathieu Derval; Hakim Nabi Amalia.js is a new extensible and versatile HTML5 multimedia player that allows you to view any type of metadata synchronized with your video or audio streams. It manages metadata that are localized both temporally and spatially. They can also be hierarchical. Several visualization plugins have already been developed, enabling amalia.js to be deployed in a huge variety of web applications. We believe it can be used in various research areas to quickly visualize analysis results and to share them with the community. Amalia.js is open-source software under a GPL license. It is available for download at http://ina-foss.github.io/amalia.js.
 SIVA Suite: Framework for Hypervideo Creation, Playback and Management | BIBA | Full-Text 713-716 Britta Meixner; Stefan John; Christian Handschigl Due to their structure, hypervideos are well suited for different scenarios. Compared to traditional linear videos, they have advantages especially in e-learning and training, where the study matter can be fitted to the needs of the viewer. In this paper we present the SIVA Suite, an open source framework for the creation, playback, and administration of hypervideos. The SIVA Suite consists of an authoring tool, an HTML5 hypervideo player, and a Web server for user and video management. This framework has been successfully used for the creation of hypervideos in different use cases, e.g. a medical hypervideo training. It was evaluated in several usability tests and improved step-by-step since 2008.

### Art Exhibit

 Using Handmade Controllers for Interactive Projection Mapping | BIBA | Full-Text 717-719 Alinta K. Krauth In this short paper I will look at the use of hand-made gaming controllers for my own interactive artworks, in particular the technology behind the interactive projection mapping artwork Shadows blister those who try to touch, exhibited at ACM MM 2015 in Brisbane. In this installation, interactions between viewer and artwork give new life to older forms of interactive hardware and traditional art media. Its set up also creates previously unexplored connections between new and old software, and has caused me to postulate that the aesthetic of interactive controllers can, and perhaps should, be considered in the artistic process. In new media art, hardware can be a part of the artwork -- not just a peripheral.
 3D Printing and Camera Mapping: Dialectic of Virtual and Reality | BIBA | Full-Text 721-722 He-Lin Luo; I-Chun Chen; Yi-Ping Hung Projection Mapping, the superimposing of virtual images upon actual objects, is already extensively used in performance arts. Applications of it are already quite mature, therefore, here we wish to achieve the opposite, or specifically speaking, the superimposing of actual objects into virtual images. This method of reverse superimposition is called "camera mapping." Through cameras, camera mapping captures actual objects, and introduces them into a virtual world. Then using superimposition, this allows for actual objects to be rendered as virtual objects. However, the actual objects here must have refined shapes so that they may be superimposed back into the camera. Through the proliferation of 3D printing, virtual 3D models in computers can be created in reality, thereby providing a framework for the limits and demands of "camera mapping." The new media artwork Digital Buddha combines 3D Printing and camera mapping. This work was created by 3-D deformable modeling through a computer, then transforming the model into a sculpture using 3D printing, and then remapping the materially produced sculpture back into the camera. Finally, it uses the already known algorithm to convert the model back into that of the original non-deformed sculpture. From this creation project, in the real world, audiences will see a deformed, abstract sculpture; and in the virtual world, through camera mapping, they will see a concrete sculpture (Buddha). In its representation, this piece of work pays homage to the work TV Buddha produced by video art master Nam June Paik. Using the influence television possesses over people, this work extends into the most important concepts of the digital era, "coding" and "decoding," simultaneously addressing the shock and insecurity people in the digital era feel toward images.
 Drag A Star: The Social Media in Outer Space | BIBA | Full-Text 723-726 James She; Carmen Ng; Desmond Leung "Drag A Star" is an interactive installation artwork that gives audiences an immersive and stunning interactive experience to remember the myth of making wishes upon a shooting star. Through the interactions with the display, audiences can learn about meteorites from outer space based on scientific and artistic perspectives by catching a shooting star with their smartphones. Audiences can send their wishes to a shooting star through their smartphones, while being able to read and reply to the wishes of others at the same time. The piece was created based on the latest technologies of digital display, screen smart-device interactions, mobile applications, and web-based messaging systems. Extensive scientific, artistic and design efforts were integrated to create these cyber-physical interactive experiences between shooting stars and audiences. The artistic statement of this installation, akin to many ancient myths about wishing upon shooting stars, is about the possibility of catching a shooting star physically through technologies, and realizing someone's wishes after reading them. Hence the existence of shooting stars could likely be the social media in outer space -- a world where a connection is made between other beings in the Universe.
 Disturbed System: Recreating Sculptor's Experience of Their Medium With Haptics and Generated Sound | BIBA | Full-Text 727-730 Oksana Krzyhanivska; Simon Fay; Jeffrey E. Boyd This paper discusses a collaborative project that inquires about the possibilities of the tactile sensation in art and its ability to reestablish human sensory relationships with consumer technology. It introduces the visitor to the interaction style practiced by the artists with their medium. The described interactive sculpture predisposes the visitor to explore tactile interaction as an aesthetic experience within a multimodal, multisensory system. Disturbed System takes the visitor's experience with the artwork into unfamiliar sensory territory. Touching the soft silicone surface of the sculpture provides electronic feedback of embedded vibration and directional spatialized sound in an installation format. The artwork presents this sensory information in the form of an unexpected assemblage of pulsing organic sculptural surfaces and emitted sound. It also places visitors in a shared interactive space, an aura of traveling sound warped by their touch of the sculpture. This shared interaction investigates relationships between nature, artifice, technology, human body and human social group behavior.
 The Real Time Rolling Shutter | BIBA | Full-Text 731-734 David S. Monaghan; Noel E. O'Connor; Anne Cleary; Denis Connolly From an early age, children are often told either that they are creative and should do art but stay away from science and maths, or that they are mathematical and should do science but are not that creative. Compounding this, there also exist some traditional barriers of artistic rhetoric that say, "don't touch, don't think and don't be creative, we've already done that for you, you can just look...". The Real Time Rolling Shutter is part of a collaborative Art/Science partnership whose core tenets are in complete contrast to this. The Art/Science exhibitions we have created have invited the public to become part of the exhibition by utilising augmented digital mirrors, Kinects, feed-back camera and projector systems and augmented reality perception helmets. The fundamental underlying principles we are trying to adhere to are to foster curiosity, intrigue, wonderment and amazement, and we endeavour to draw the audience into the interactive nature of our exhibits and exclaim to everyone that you can be whatever you choose to be, and that everyone can be creative, everyone can be an artist, everyone can be a scientist... all it takes is an inquisitive mind, so come and explore the real-time rolling shutter and be creative.

### Videos/Demos 1:

 Query-by-Emoji Video Search | BIBA | Full-Text 735-736 Spencer Cappallo; Thomas Mensink; Cees G. M. Snoek This technical demo presents Emoji2Video, a query-by-emoji interface for exploring video collections. Ideogram-based video search and representation presents an opportunity for an intuitive, visual interface and concise non-textual summary of video contents, in a form factor that is ideal for small screens. The demo allows users to build search strings comprised of ideograms which are used to query a large dataset of YouTube videos. The system returns a list of the top-ranking videos for the user query along with an emoji summary of the video contents so that users may make an informed decision whether to view a video or refine their search terms. The ranking of the videos is done in a zero-shot, multi-modal manner that employs an embedding space to exploit semantic relationships between user-selected ideograms and the video's visual and textual content.
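The embedding-space ranking mentioned in the Emoji2Video abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy vectors, the names `EMOJI_VECS` and `VIDEO_VECS`, and the choice to average the emoji vectors and score by cosine similarity are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def rank_videos(query_emoji, emoji_vecs, video_vecs, top_k=3):
    """Rank videos by similarity to the averaged emoji query (assumed scheme)."""
    query = mean_vector([emoji_vecs[e] for e in query_emoji])
    scored = [(cosine(query, vec), vid) for vid, vec in video_vecs.items()]
    scored.sort(reverse=True)
    return [vid for _, vid in scored[:top_k]]

# Hypothetical 3-dimensional embeddings; a real system would learn these.
EMOJI_VECS = {"dog": [1.0, 0.0, 0.0], "ball": [0.0, 1.0, 0.0], "car": [0.0, 0.0, 1.0]}
VIDEO_VECS = {
    "fetch_clip": [0.7, 0.7, 0.0],
    "race_clip": [0.0, 0.1, 0.9],
    "puppy_clip": [0.9, 0.1, 0.1],
}

ranking = rank_videos(["dog", "ball"], EMOJI_VECS, VIDEO_VECS)
```

Because the query and the videos live in the same space, no video needs to be labelled with emoji in advance, which is what makes a zero-shot setup possible in principle.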
 Dive into Remote Events: Omnidirectional Video Streaming with Acoustic Immersion | BIBA | Full-Text 737-738 Daisuke Ochi; Kenta Niwa; Akio Kameda; Yutaka Kunita; Akira Kojima We propose a system that can provide the physical presence of remote events through a head-mounted display (HMD) and headphones. It can stream omnidirectional video within a limited network bandwidth at a high bitrate without sending regions that users are not viewing. It can also reproduce binaural sounds by convolving head-related transfer functions with angular region-wise separated signals. Technical demos of the system using an Oculus Rift HMD with headphones will be performed to enable users to experience the visual and acoustic immersion it provides.
 Movie's Affect Communication Using Multisensory Modalities | BIBA | Full-Text 739-740 Joël Dumoulin; Diana Affi; Elena Mugellini; Omar Abou Khaled; Marco Bertini; Alberto Del Bimbo The goal of the system presented in this demo is to make it possible for visually and hearing impaired audiences to live empathetic viewing experiences using their home theatre. In this work we suggest the incorporation of new emotion communication modalities into the standard television, to provide the targeted audience with sensations that they do not have the opportunity to enjoy because of their disability.
 QOEYE: A Data Driven Platform for QoE Visualization and System Performance Monitoring | BIBA | Full-Text 741-742 Chao Zhou; Lifeng Sun; Wenming Shi; Shiqiang Yang Video streaming has grown rapidly and has made up a major part of network traffic in the past few years. Content providers need to manage more servers to satisfy the demands of users. Effectively managing large-scale servers and identifying problematic service nodes therefore become crucial to guaranteeing the video user experience. The rule-based approach that traditional service providers use is effective but can hardly be applied to large-scale server clusters, let alone to jointly managing service resources and maximizing user experience. Unlike traditional rule-based QoS monitoring platforms, we design a framework based on user experience metrics to detect the quality of service. Based on the characteristics of IP addresses, we also design a method to quickly pinpoint the fault location.
 AR in Hand: Egocentric Palm Pose Tracking and Gesture Recognition for Augmented Reality Applications | BIBA | Full-Text 743-744 Hui Liang; Junsong Yuan; Daniel Thalmann; Nadia Magnenat Thalmann Wearable devices such as Microsoft HoloLens and Google Glass have become highly popular in recent years. As traditional input hardware is difficult to use on such platforms, vision-based hand pose tracking and gesture control techniques are more suitable alternatives. This demo shows the possibility of interacting with 3D content with bare hands on wearable devices through two Augmented Reality applications: virtual teapot manipulation and fountain animation in hand. Technically, we use a head-mounted depth camera to capture RGB-D images from an egocentric view, and adopt a random forest to regress the palm pose and classify the hand gesture simultaneously via a spatial-voting framework. The predicted pose and gesture are used to render the 3D virtual objects, which are overlaid onto the hand region in the input RGB images using camera calibration parameters for seamless virtual and real scene synthesis.
 PPTLens: Create Digital Objects with Sketch Images | BIBA | Full-Text 745-746 Changcheng Xiao; Changhu Wang; Liqing Zhang In this work, we introduce the PPTLens system to convert sketch images captured by smart phones to digital flowcharts in PowerPoint. Unlike existing sketch recognition systems, which are based on hand-drawn strokes, PPTLens enables users to use sketch images as inputs directly. This is more challenging since strokes extracted from sketch images might not only be very messy, but also lack the temporal information of the drawings. To implement the 'Image to Object' (I2O) scenario, we propose a novel sketch image recognition framework, including an effective stroke extraction strategy and a novel offline sketch parsing algorithm. By enabling sketch images as inputs, our system makes flowchart/diagram production much more convenient and easier.
 A Multi-Modal 3D Capturing Platform for Learning and Preservation of Traditional Sports and Games | BIBA | Full-Text 747-748 Francois Destelle; Amin Ahmadi; Kieran Moran; Noel E. O'Connor; Nikolaos Zioulis; Anargyros Chatzitofis; Dimitrios Zarpalas; Petros Daras; Luis Unzueta; Jon Goenetxea; Mikel Rodriguez; Maria T. Linaza; Yvain Tisserand; Nadia Magnenat Thalmann We present a demonstration of a multi-modal 3D capturing platform coupled to a motion comparison system. This work is focused on the preservation of Traditional Sports and Games, namely the Gaelic sports from Ireland and Basque sports from France and Spain. Users can learn, compare and compete in the performance of sporting gestures and compare themselves to real athletes. Our online gesture database provides a way to preserve and display a wide range of sporting gestures. The capturing devices utilised are Kinect 2 sensors and wearable inertial sensors, where the number required varies based on the requested scenario. The fusion of these two capture modalities, coupled to our inverse kinematic algorithm, allows us to synthesize a fluid and reliable 3D model of the user gestures over time. Our novel comparison algorithms provide the user with a performance score and a set of comparison curves (i.e. joint angles and angular velocities), providing precise and valuable feedback for coaches and players.
 Analysing Audience Response to Performing Events: A Web Platform for Interactive Exploration of Physiological Sensor Data | BIBA | Full-Text 749-750 Thomas Röggla; Chen Wang; Pablo S. César This paper presents a web interface for the exploration of audience response to a performing arts event. The platform temporally synchronises data obtained from physiological sensors with video recordings. More concretely, it is geared towards people from the creative industry, e.g. theatre directors, who want to gain deeper insights into how audiences perceive their performances. The platform takes the raw data from sensors and corresponding video recordings and visualises both synchronised in a more digestible manner. This document presents the major features of the application and explains the reasoning behind its various visualisations and interactive capabilities and how it can benefit performing artists.
 MPEG-DASH for Low Latency and Hybrid Streaming Services | BIBA | Full-Text 751-752 Jean Le Feuvre; Cyril Concolato; Nassima Bouzakaria; Viet-Thanh-Trung Nguyen While over-the-top video distribution is now widely deployed, it still suffers from much higher latencies than traditional broadcast, typically from a few seconds up to half a minute. In this paper, we demonstrate a novel DASH system with latency close to broadcast channels, and show how such a system can be used to enable combined broadcast and broadband services while keeping the client buffering requirements on the broadcast link low.
 eMosic: Mobile Media Pushing through Social Emotion Sensing | BIB | Full-Text 753-754 Jheng-Wei Peng; Shih-Wei Sun; Wen-Huang Cheng; Yi-Hsuan Yang
 PITAGORA: Recommending Users and Local Experts in an Airport Social Network | BIBA | Full-Text 755-756 Andrea Ferracani; Daniele Pezzatini; Andrea Benericetti; Marco Guiducci; Alberto Del Bimbo In this demo we present PITAGORA (demo video available at http://bit.ly/1GgtUrN): a mobile web contextual social network designed for the check-in area of an airport. The app provides recommendation of potential friends, local experts and targeted services. Recommendation is hybrid and combines social media analysis and collaborative filtering techniques. Users' recommendation has been evaluated through a user study with good results.
 A System for Video Recommendation using Visual Saliency, Crowdsourced and Automatic Annotations | BIBA | Full-Text 757-758 Andrea Ferracani; Daniele Pezzatini; Marco Bertini; Saverio Meucci; Alberto Del Bimbo In this paper we present a system for content-based video recommendation that exploits visual saliency to better represent video features and content (demo video available at http://bit.ly/1FYloeQ). Visual saliency is used to select relevant frames to be presented in a web-based interface to tag and annotate video frames in a social network; it is also employed to summarize video content to create a more effective video representation used in the recommender system. The system exploits automatic annotations from CNN-based classifiers on salient frames and user generated annotations. We evaluate several baseline approaches and show how the proposed method improves over them.
 A Semantic Geo-Tagged Multimedia-Based Routing in a Crowdsourced Big Data Environment | BIBA | Full-Text 759-760 Faizan Ur Rehman; Ahmed Lbath; Abdullah Murad; Md. Abdur Rahman; Bilal Sadiq; Akhlaq Ahmad; Ahmad Qamar; Saleh Basalamah Traditional routing algorithms for calculating the fastest or shortest path become ineffective or difficult to use when both source and destination are dynamic or unknown. To solve the problem, we propose a novel semantic routing system that leverages geo-tagged rich crowdsourced multimedia information such as images, audio, video and text to add semantics to the conventional routing. Our proposed system includes a Semantic Multimedia Routing Algorithm (SMRA) that uses an indexed spatial big data environment to answer multimedia spatio-temporal queries in real-time. The results are customized to the users' smartphone bandwidth and resolution requirements. The system has been designed to be able to handle a very large number of multimedia spatio-temporal requests at any given moment. A proof of concept of the system will be demonstrated through two scenarios. These are 1) multimedia enhanced routing and 2) finding lost individuals in a large crowd using multimedia. We plan to test the system's performance and usability during Hajj 2015, where over four million pilgrims from all over the world gather to perform their rituals.
 Crowdsourced Multimedia Enhanced Spatio-temporal Constraint Based on-Demand Social Network for Group Mobility | BIBA | Full-Text 761-762 Bilal Sadiq; Md. Abdur Rahman; Abdullah Murad; Muhammad Shahid; Faizan Ur Rehman; Ahmed Lbath; Akhlaq Ahmad; Ahmad Qamar This paper presents a system that enables efficient and scalable real-time user and vehicle discovery using textual, audio and video mechanisms. The system allows users to group together for shared intra-city transportation with the aid of multimedia that helps individuals to 1) find a community of common interest (CoCI), 2) locate individual users in a large crowd and 3) locate vehicles for mobility in an efficient and cost-effective manner. The system is a pilot project and will be deployed during Hajj 2015 when over three million pilgrims from all over the world visit Makkah, Saudi Arabia.
 A Multi-sensory Gesture-Based Login Environment | BIBA | Full-Text 763-764 Ahmad Qamar; Abdullah Murad; Md. Abdur Rahman; Faizan Ur Rehman; Akhlaq Ahmad; Bilal Sadiq; Saleh Basalamah Logging on to a system using a conventional keyboard may not be feasible in certain environments, such as a surgical operation theatre or an industrial manufacturing facility. We have developed a multi-sensory gesture-based login system that allows a user to access secure information using body gestures. The system can be configured to use different types of gestures according to the type of sensors available to the user. We have proposed a simple scheme to represent all alphanumeric characters required for password entry as gestures within the multi-sensory environment. Our scheme is scalable enough to support both sensors that detect a large number of gestures and those that can only accept a few. This allows the system to be used in a variety of situations, such as usage by disabled persons with limited ability to perform gestures. We are in the midst of deploying our developed system in a clinical environment.
 Hand-Object Sense: A Hand-held Object Recognition System Based on RGB-D Information | BIBA | Full-Text 765-766 Xiong Lv; Shuqiang Jiang; Luis Herranz; Shuang Wang Hand-held objects play an important role in human-human and human-machine interaction. They can be used as a reference for understanding user intentions or user requirements. In this technical demonstration, we introduce an object recognition system called Hand-Object Sense that can automatically recognize the object held by a user. This system first detects and segments the hand-held object by exploiting skeleton information combined with depth information. Second, in the object recognition stage, this system exploits features computed in different ways and fuses them to improve the recognition accuracy. Our system can recognize objects in real time and has a good tolerance to angle and scale transformation. Furthermore, it has a good generalization capability for unknown objects.
 A Cross-media Sentiment Analytics Platform For Microblog | BIBA | Full-Text 767-769 Chao Chen; Fuhai Chen; Donglin Cao; Rongrong Ji In this demo, a cross-media public sentiment analysis system is presented. The system presents and visualizes the sentiments of microblog data by organizing the results by region, topic, and content, respectively. Such sentiment is obtained by fusing sentiment classification scores from both the visual and textual channels. In this way, social multimedia sentiment is shown in a multi-level and user-friendly form.
 A Unsupervised Person Re-identification Method Using Model Based Representation and Ranking | BIBA | Full-Text 771-774 Chao Liang; Binyue Huang; Ruimin Hu; Chunjie Zhang; Xiaoyuan Jing; Jing Xiao As a core technique supporting the multi-camera tracking task, person re-identification attracts increasing research interests in both academic and industrial communities. Its aim is to match individuals across a group of spatially non-overlapping surveillance cameras, which are usually interfered by various imaging conditions and object motions. Current methods mainly focus on robust feature representation and accurate distance measure, where intensive computations and expensive training samples prohibit their practical applications. To address the above problems, this paper proposes a new unsupervised person re-identification method featured by its competitive accuracy and high efficiency. Both merits stem from model based person image representation and ranking, with which, merely 4-dimension pixel-level features can achieve over 20% matching rate at Rank 1 on the challenging VIPeR dataset.
 Evolution of a Tabletop Telepresence System through Art and Technology | BIBA | Full-Text 775-776 Tony Dunnigan; John Doherty; Daniel Avrahami; Jacob Biehl; Patrick Chiu; Chelhwon Kim; Qiong Liu; Henry Tang; Lynn Wilcox New technologies arise in a number of ways. They may come from advances in scientific research, through new combinations of existing technologies, or by simply imagining what might be possible in the future. This video describes the evolution of Tabletop Telepresence, a system for remote collaboration through desktop videoconferencing combined with a digital desk. Tabletop Telepresence began as a collection of camera, projector, videoconferencing and user interaction technologies. Working together, artists and research scientists combined these technologies into a means of sharing paper documents between remote desktops, interacting with those documents, requesting services (such as translation), and communicating through a videoconference.
 LiveTraj: Real-Time Trajectory Tracking over Live Video Streams | BIBA | Full-Text 777-780 Tom Z. J. Fu; Jianbing Ding; Richard T. B. Ma; Marianne Winslett; Yin Yang; Zhenjie Zhang; Yong Pei; Bingbing Ni We present LiveTraj, a novel system for tracking trajectories in a live video stream in real time, backed by a cloud platform. Although trajectory tracking is a well-studied topic in computer vision, so far most attention has been devoted to improving the accuracy of trajectory tracking, rather than the efficiency. To our knowledge, LiveTraj is the first that achieves real-time efficiency in trajectory tracking, which can be a key enabler in many important applications such as video surveillance, action recognition and robotics. LiveTraj is based on a state-of-the-art approach to (offline) trajectory tracking; its main innovation is to adapt this base solution to run on an elastic cloud platform to achieve real-time tracking speed at an affordable cost. The video demo shows the offline base solution and LiveTraj side by side, both running on a video stream containing human actions. Besides demonstrating the real-time efficiency of LiveTraj, our video demo also exhibits important system parameters to the audience such as latency and cloud resource usage for different components of the system. Further, if the conference venue provides sufficiently fast Internet connection to our cloud platform, we also plan to demonstrate LiveTraj on-site, during which we will show LiveTraj identifying and tracking trajectories from a live video stream captured by a camera.
 Automatic Accident Detection and Alarm System | BIBA | Full-Text 781-784 Zhuo Wei; Swee-Won Lo; Yu Liang; Tieyan Li; Jialie Shen; Robert H. Deng An accident detection and alarm system is very important for detecting possible accidents or dangers for people who use their mobile devices while walking, i.e., distracted walking. In this paper, we introduce an automatic accident detection and alarm system, called AutoADAS, which is fully implemented and tested on real mobile devices. The proposed system can be activated either manually or automatically when the user walks. Under the manual mode, the user activates the system before distracted walking, while under the automatic mode, a "user behaviour profiling" module is used to recognize (distracted) walking behaviours and an "object detection" module is activated. Using image processing and the camera field of view (FOV), the distance and angle between the user and detected objects are estimated and then applied to identify whether any potential accidents can happen. The "accident analysis and prediction" module includes a temporal alarm, which takes as input the user's walking speed and distance with respect to the detected objects and outputs a temporal accident prediction, and a spatial alarm, which takes as input the user's walking direction and angle with respect to the detected objects and outputs a spatial accident prediction. Once the proposed system positively predicts a potential accident, the "alarm and suggestion" module alerts the user with text, sound or vibration.
 Visible Light Communication via Temporal Psycho-Visual Modulation | BIBA | Full-Text 785-788 Chunjia Hu; Guangtao Zhai; Zhongpai Gao In this paper we propose a new paradigm for visible light communication (VLC) using the emerging display technology of Temporal Psycho-Visual Modulation (TPVM), which exploits the interaction between the human visual system and modern electro-optical display devices. Unlike traditional VLC, no specifically designed light emitter and receiver are required. In the proposed system, a light projector is used as the information source and digital cameras act as information decoders. The emitted light is designed in a specific way such that it can carry meaningful information (or simply work as an illumination source) for human eyes while another message can be decoded by the digital camera, owing to the fundamental difference in the imaging mechanisms of the human eye and digital devices. We further describe two applications of this new type of VLC in ubiquitous augmented reality and illegal camcorder-recording prevention with extensive experimental results.
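The temporal and spatial alarms described in the AutoADAS abstract can be sketched roughly as follows. The abstract only names the inputs and outputs of each alarm, so everything else here is an assumption: the time-to-collision rule, the heading cone, the threshold values (`horizon_s`, `cone_half_width_deg`), and the decision to combine the two alarms with a logical AND.

```python
def temporal_alarm(distance_m, speed_mps, horizon_s=3.0):
    """Assumed rule: alarm if the user would reach the object within horizon_s."""
    if speed_mps <= 0:
        return False  # a stationary user cannot walk into the object
    return distance_m / speed_mps <= horizon_s

def spatial_alarm(object_angle_deg, heading_deg, cone_half_width_deg=15.0):
    """Assumed rule: alarm if the object lies inside a cone around the heading."""
    # Wrap the angular difference into [-180, 180) before comparing.
    diff = (object_angle_deg - heading_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= cone_half_width_deg

def should_alert(distance_m, speed_mps, object_angle_deg, heading_deg):
    """Combine both predictions; the AND combination is an assumption."""
    return (temporal_alarm(distance_m, speed_mps)
            and spatial_alarm(object_angle_deg, heading_deg))
```

For example, an object 3 m away, 5 degrees off the heading of a user walking at 1.5 m/s, would trigger an alert under these hypothetical thresholds, while the same object 10 m away would not.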

### Demos 2:

 What Shall I Look Like after N Years? | BIBA | Full-Text 789-790 Xiangbo Shu; Jinhui Tang; Luoqi Liu; Zhiheng Niu; Shuicheng Yan "What shall I look like after N years?" In this paper, we present an Auto Age Progression system, which automatically renders a series of aging faces in the future age ranges and generates an aging sequence (aging video) covering the entire life for an individual input. In the offline stage, a set of age-range specific dictionaries are learned from the constructed database, where the dictionary bases corresponding to the same index yet from different dictionaries form a particular aging process pattern across different age groups, and a linear combination of these patterns expresses a particular personalized aging process. In the online stage, for an input face of an individual, our system renders the aging faces corresponding to different age ranges through the aging dictionaries, and then generates an age progression by the presented face morphing technology.
 Searching and Browsing Live, Web-based Meetings | BIBA | Full-Text 791-792 Scott Carter; Laurent Denoue; Matthew Cooper Establishing common ground is one of the key problems for any form of communication. The problem is particularly pronounced in remote meetings, in which participants can easily lose track of the details of dialogue for any number of reasons. In this demo we present a web-based tool, MixMeet, that allows teleconferencing participants to search the contents of live meetings so they can rapidly retrieve previously shared content to get on the same page, correct a misunderstanding, or discuss a new idea.
 Deep Face Beautification | BIBA | Full-Text 793-794 Jianshu Li; Chao Xiong; Luoqi Liu; Xiangbo Shu; Shuicheng Yan The beautification of human photos usually requires professional editing software, which is difficult for most users. In this technical demonstration, we propose a deep face beautification framework, which is able to automatically modify the geometrical structure of a face so as to boost the attractiveness. A learning-based approach is adopted to capture the underlying relations between the facial shape and the attractiveness via training the Deep Beauty Predictor (DBP). Relying on the pre-trained DBP, we construct the BeAuty SHaper (BASH) to infer the "flows" of landmarks towards the maximal aesthetic level. BASH modifies the facial landmarks with the direct guidance of the beauty score estimated by DBP.
 Pan360: INS Assisted 360-Degree Panorama (Demo Description) | BIBA | Full-Text 795-796 Yu-Hsin Lin; Yu-Mei Chen; Lun-Cheng Chu; Andre Chen; Scott Chien-Hung Liao; Edward Y. Chang This article describes Pan360, a 360 X 180 panorama capturing and viewing product developed and launched by our team, and our demo plan.
 HeartHealth: New Adventures in Serious Gaming | BIBA | Full-Text 797-798 David S. Monaghan; Freddie Honohan; Edmond Mitchell; Noel E. O'Connor; Anargyros Chatzitofis; Dimitrios Zarpalas; Petros Daras We present a novel, low-cost, interactive, exercise-based rehabilitation system. Our research involves the investigation and development of patient-centric, sensor-based rehabilitation games and surrounding technologies. HeartHealth is designed to provide a safe, personalised and fun exercise environment that could be deployed in any exercise-based rehabilitation program. HeartHealth utilises a cloud-based patient information management system built on FIWARE Generic Enablers, and motion tracking coupled with our sophisticated motion comparison algorithms. Users can record customised exercises through a doctor's interface and then play the rehabilitation game, where they must perform a sequence of their exercises in order to complete the game scenario. Their exercises are monitored, recorded and compared by our Motion Evaluation software, and real-time feedback is then given based on the user's performance.
 Challenged Content Delivery Network: Eliminating the Digital Divide | BIBA | Full-Text 799-800 Hua-Jun Hong; Shu-Ting Wang; Chih-Pin Tan; Tarek El-Ganainy; Khaled Harras; Cheng-Hsin Hsu; Mohamed Hefeeda We present a complete system, called Challenged Content Delivery Network (CCDN), to efficiently deliver multimedia content to mobile users who live in developing countries, rural areas, or over-populated cities with no or weak network infrastructure. These mobile users do not have always-on Internet access. We demo our CCDN, implemented on a Linux server, Raspberry Pi proxies, and Android phones, from three aspects: multimedia, networking, and machine learning tools. We propose multiple optimization algorithm modules that compute personalized distribution plans and maximize the overall user experience. CCDN allows people living in areas with challenged networks to access multimedia content, like news reports, using mobile devices, such as smartphones. This in turn will help in eliminating the digital divide, which refers to information inequality among persons with different Internet access abilities.
 OmniViewer: Enabling Multi-modal 3D DASH | BIBA | Full-Text 801-802 Zhenhuan Gao; Shannon Chen; Klara Nahrstedt This paper presents OmniViewer, a multi-modal 3D video streaming system based on the Dynamic Adaptive Streaming over HTTP (DASH) standard. OmniViewer allows users to view an arbitrary side of a performer by choosing the view angle from 0° to 360°. In addition, according to the currently available bandwidth, it can adaptively change the bitrate of the rendered 3D video for both smooth and high-quality view rendering. Finally, OmniViewer extends the traditional DASH implementation to support multi-modal data streaming besides video and audio.
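Bandwidth-driven bitrate adaptation of the kind the OmniViewer abstract mentions is often explained with a simple throughput rule. The sketch below is a generic illustration, not OmniViewer's algorithm: the function name `choose_bitrate`, the bitrate ladder and the `safety` margin are hypothetical.

```python
def choose_bitrate(available_kbps, ladder_kbps, safety=0.8):
    """Pick the highest representation that fits within a safety margin.

    Falls back to the lowest representation when nothing fits, so that
    playback can continue at reduced quality.
    """
    budget = available_kbps * safety
    feasible = [b for b in sorted(ladder_kbps) if b <= budget]
    return feasible[-1] if feasible else min(ladder_kbps)

# Hypothetical ladder of available representations, in kbit/s.
LADDER = [500, 1500, 3000, 6000]
```

A client would typically re-run such a rule before requesting each segment, using its most recent throughput estimate as `available_kbps`.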
 Large Video Event Ontology Browsing, Search and Tagging (EventNet Demo) | BIB | Full-Text 803-804 Hongliang Xu; Guangnan Ye; Yitong Li; Dong Liu; Shih-Fu Chang
 MASTER: Multi-platform Application Streaming Toolkits for Elastic Resources | BIB | Full-Text 805-806 Yusen Li; Yunhua Deng; Ronald Seet; Xueyan Tang; Wentong Cai
 smArt: Open and Interactive Indoor Cultural Data | BIBA | Full-Text 807-808 Andrea Ferracani; Daniele Pezzatini; Alberto Del Bimbo; Riccardo Del Chiaro; Franco Yang; Maurizio Sanesi In this demo we present smArt, a low-cost framework to quickly set up indoor exhibits featuring a smart navigation system for museums. The framework is web-based and allows the design on a digital map of a sensorized museum environment and the dynamic and assisted definition of the multimedia materials and sensors associated to the artworks. The knowledge-base uses semantic technologies and it is exploited by museum visitors to get directions and to have multimedia insights in a natural way. Indoor localisation and routing is provided taking advantage of active and passive sensors advertisements and user interactions. In this way we overcome the Global Positioning System (GPS) unavailability issue in indoor environments.
 i-Diary: A Crowdsource-based Spatio-Temporal Multimedia Enhanced Points of Interest Authoring Tool | BIB | Full-Text 809-810 Akhlaq Ahmad; Faizan Ur Rehman; Md. Abdur Rahman; Abdullah Murad; Ahmad Qamar; Bilal Sadiq; Salah Basalamah; Mohamed Ridza Wahiddin
 B-box Mixer: An Interactive UI for Generating B-box Music | BIBA | Full-Text 811-812 Yi-Zhu Dai; Ting-Chia Lee; Xin-Yu Kuo; Tse-Yu Pan; Min-Chun Hu B-box is a form of vocal percussion that imitates rhythms in various types of sound, especially musical instruments. As b-box becomes popular, more and more people want to learn b-box and make their own b-box music. However, not everyone has the talent for generating harmonic b-box music. In this work, we develop an interactive system which helps the user easily compose b-box music given two inputs: an unaccompanied vocal song and a piece of b-box rhythm. The audio signals of the two inputs are analyzed and adaptively matched on the basis of their beats. The state-of-the-art beat detection technique does not perform well on vocal songs. Hence, we propose to partition the song into short segments and estimate the average tempo for each segment so that the adjustment of tempo will not be affected too much by wrongly detected beats. With the proposed system, both people who love b-box and those who are not familiar with it can enjoy producing their own b-box music.
 DeepFont: A System for Font Recognition and Similarity | BIBA | Full-Text 813-814 Zhangyang Wang; Jianchao Yang; Hailin Jin; Jonathan Brandt; Eli Shechtman; Aseem Agarwala; Zhaowen Wang; Yuyan Song; Joseph Hsieh; Sarah Kong; Thomas Huang We develop the DeepFont system, a large-scale learning-based solution for automatic font identification, organization and selection. In this proposed technical demonstration, we will give our audience a tour of the DeepFont system, with a focus on its impact on real consumer products, including but not limited to: 1) a cloud-based iOS App for font recognition; 2) a web-based tool for font similarity evaluation and discovery.
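The per-segment tempo estimate described in the B-box Mixer abstract can be sketched as below. This is a minimal illustration under assumptions: beat timestamps are taken as given (in seconds, sorted), segments have a fixed hypothetical length `segment_s`, and the average tempo of a segment is computed from the mean interval between its first and last detected beats.

```python
def segment_tempos(beat_times, song_length_s, segment_s=4.0):
    """Return one average tempo (BPM) per fixed-length segment.

    beat_times must be sorted in ascending order. A segment with fewer
    than two beats yields None because no interval can be measured.
    """
    tempos = []
    start = 0.0
    while start < song_length_s:
        end = start + segment_s
        beats = [t for t in beat_times if start <= t < end]
        if len(beats) >= 2:
            mean_interval = (beats[-1] - beats[0]) / (len(beats) - 1)
            tempos.append(60.0 / mean_interval)
        else:
            tempos.append(None)
        start = end
    return tempos

# Toy input: half-second beats for 4 s, then one-second beats for 4 s.
beats = [0.5 * i for i in range(8)] + [4.0, 5.0, 6.0, 7.0]
tempos = segment_tempos(beats, song_length_s=8.0)
```

Averaging within a segment means a single spurious or missed beat shifts that segment's estimate only slightly and leaves the other segments untouched, which matches the motivation given in the abstract.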
 ObjectMinutiae: Fingerprinting for Object Authentication | BIBA | Full-Text 815-816 Tzu-Yun Lin; Yu-Chiang Frank Wang; Sean Moss-Pultz In this work, we present ObjectMinutiae, which is a framework for authenticating different objects or materials via extracting and matching their fingerprints. Unlike biometrics fingerprinting processes, which use patterns such as ridge ending and bifurcation points as the interest points, our work applies stereo photometric techniques for reconstructing objects' local image regions that contain the surface texture information. The interest points of the recovered image regions can be detected and described by state-of-the-art computer vision algorithms. Together with dimension reduction and hashing techniques, our proposed system is able to perform object verification using compact image features. With neutral and different torturing conditions, preliminary results on multiple types of papers support the use of our framework for practical object authentication tasks.
 Hyper Video Browser: Search and Hyperlinking in Broadcast Media | BIBA | Full-Text 817-818 Maria Eskevich; Huynh Nguyen; Mathilde Sahuguet; Benoit Huet Massive amounts of digital media are being produced and consumed daily on the Internet. Efficient access to relevant information is of key importance in contemporary society. The Hyper Video Browser provides multiple means of navigation within the content of a media repository. Our system utilizes state-of-the-art multimodal content analysis and indexing techniques, at multiple temporal granularities, in order to satisfy the user need by suggesting relevant material. We integrate two intuitive interfaces: one for search and browsing through the video archive, and one for further hyperlinking to related content while enjoying some video content. The novelty of this work includes a multi-faceted search and browsing interface for navigating in video collections and the dynamic suggestion of hyperlinks related to the content of the media fragment being viewed, rather than the entire video. The approach was evaluated on the MediaEval Search and Hyperlinking task, demonstrating its effectiveness at accurately locating relevant content in a big media archive.

### Poster Session 1

 Joint Modeling of Users' Interests and Mobility Patterns for Point-of-Interest Recommendation | BIBA | Full-Text 819-822 Hongzhi Yin; Bin Cui; Zi Huang; Weiqing Wang; Xian Wu; Xiaofang Zhou Point-of-Interest (POI) recommendation has become an important means to help people discover interesting places, especially when users travel out of town. However, the extreme sparsity of the user-POI matrix creates a severe challenge. To cope with this challenge, we propose a unified probabilistic generative model, the Topic-Region Model (TRM), to simultaneously discover the semantic, temporal and spatial patterns of users' check-in activities, and to model their joint effect on users' decision-making for POIs. We conduct extensive experiments to evaluate the performance of our TRM on two real large-scale datasets, and the experimental results clearly demonstrate that TRM outperforms the state-of-the-art methods.
 SHOE: Sibling Hashing with Output Embeddings | BIBA | Full-Text 823-826 Sravanthi Bondugula; Varun Manjunatha; Larry S. Davis; David Doermann We present a supervised binary encoding scheme for image retrieval that learns projections by taking into account similarity between classes obtained from output embeddings. Our motivation is that binary hash codes learned in this way improve the visual quality of retrieval results by ranking related (or "sibling") class images before unrelated class images. We employ a sequential greedy optimization that learns relationship aware projections by minimizing the difference between inner products of binary codes and output embedding vectors. We develop a joint optimization framework to learn projections which improve the accuracy of supervised hashing over the current state of the art with respect to standard and sibling evaluation metrics. We further obtain discriminative features learned from correlations of kernelized input CNN features and output embeddings, which significantly boosts performance. Experiments are performed on three datasets: CUB-2011, SUN-Attribute and ImageNet ILSVRC 2010, where we show significant improvement in sibling performance metrics over state-of-the-art supervised hashing techniques, while maintaining performance with respect to standard metrics.
 Supervised Hashing with Pseudo Labels for Scalable Multimedia Retrieval | BIBA | Full-Text 827-830 Jingkuan Song; Lianli Gao; Yan Yan; Dongxiang Zhang; Nicu Sebe There is an increasing interest in using hash codes for efficient multimedia retrieval and data storage. The hash functions are learned in such a way that the hash codes can preserve essential properties of the original space or the label information. Then the Hamming distance of the hash codes can approximate the data similarity. Existing works have demonstrated the success of many supervised hashing models. However, labeling data is time- and labor-consuming, especially for large-scale datasets. In order to utilize the supervised hashing models to improve the discriminative power of hash codes, we propose a Supervised Hashing with Pseudo Labels (SHPL) method, which uses the cluster centers of the training data to generate pseudo labels, based on which the hash codes can be generated using the criteria of supervised hashing. More specifically, we utilize linear discriminant analysis (LDA) with the trace ratio criterion as a showcase for hash function learning, and during the optimization, we prove that the pseudo labels and the hash codes can be jointly learned and iteratively updated in a unified framework. The learned hash functions can harness the discriminant power of the trace ratio criterion, and thus can achieve better performance. Experimental results on three large-scale unlabeled datasets (i.e., SIFT1M, GIST1M, and SIFT1B) demonstrate the superior performance of our SHPL over existing hashing methods.
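
The abstract above notes that the Hamming distance between hash codes can approximate data similarity. A minimal Python sketch of that retrieval step follows; it does not show how SHPL learns its codes, and the 8-bit codes and image names are hypothetical values chosen for illustration.

```python
def hamming_distance(code_a, code_b):
    """Count the bit positions in which two integer-encoded hash codes differ."""
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(query, database):
    """Return database keys ordered from smallest to largest Hamming distance."""
    return sorted(database, key=lambda key: hamming_distance(query, database[key]))


# Hypothetical 8-bit codes; learned codes are typically longer.
codes = {"img_a": 0b10110010, "img_b": 0b10110011, "img_c": 0b01001101}
ranking = rank_by_hamming(0b10110010, codes)
```

Because the comparison is an XOR followed by a bit count, ranking a large database stays cheap, which is the usual motivation for hashing-based retrieval.
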
 Multi-view Latent Hashing for Efficient Multimedia Search | BIBA | Full-Text 831-834 Xiaobo Shen; Fumin Shen; Quan-Sen Sun; Yun-Hao Yuan Hashing techniques have attracted broad research interest in recent multimedia studies. However, most existing hashing methods focus on learning binary codes from data with only a single view, and thus cannot fully utilize the rich information from multiple views of data. In this paper, we propose a novel unsupervised hashing approach, dubbed multi-view latent hashing (MVLH), to effectively incorporate multi-view data into hash code learning. Specifically, the binary codes are learned by the latent factors shared by multiple views from a unified kernel feature space, where the weights of different views are adaptively learned according to the reconstruction error with each view. We then propose to solve the associated optimization problem with an efficient alternating algorithm. To obtain high-quality binary codes, we provide a novel scheme to directly learn the codes without resorting to continuous relaxations, where each bit is efficiently computed in a closed form. We evaluate the proposed method on several large-scale datasets and the results demonstrate the superiority of our method over several other state-of-the-art methods.
 Jointly Estimating Interactions and Head, Body Pose of Interactors from Distant Social Scenes | BIBA | Full-Text 835-838 Ramanathan Subramanian; Jagannadan Varadarajan; Elisa Ricci; Oswald Lanz; Stefan Winkler We present joint estimation of F-formations and head, body pose of interactors in a social scene captured by surveillance cameras. Unlike prior works that have focused on (a) discovering F-formations based on head pose and position cues, or (b) jointly learned head and body pose of individuals based on anatomic constraints, we exploit positional and pose cues characterizing interactors and interactions to jointly infer both (a) and (b). We show how the joint inference framework benefits both F-formation and head, body pose estimation accuracy via experiments on two social datasets.
 Exploring Viewable Angle Information in Georeferenced Video Search | BIBA | Full-Text 839-842 Gang Hu; Jie Shao; Lianli Gao; Yang Yang As positioning data and other sensor information such as orientation measurements have become powerful contextual features generated by mobile devices during video recording, a model capturing the geographic field-of-view (FOV) has been developed for georeferenced video search. The accurate representation of an FOV is through the geometric shape of a circular sector. However, previous work simply employed a rectilinear vector model to represent the coverage area of a video scene. In this study, we propose to use a novel circular sector model with beginning-ending vectors for FOV representation which additionally explores viewable angle information. Its major advantage is that it leads to a more accurate georeferenced video search without the false positives or false negatives that occur in the previous model using a single vector. We demonstrate how our model can be applied to perform different types of overlap queries for spatial data selection in a unified framework, while providing competitive performance in terms of efficiency.
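
To illustrate what a circular sector bounded by a beginning and an ending vector means for a query, here is a small Python sketch of a point-in-sector test. It is only an illustration under stated assumptions, not the authors' model: coordinates are planar, angles are measured counter-clockwise in degrees from the positive x-axis, and the function name `in_sector` and its parameters are hypothetical.

```python
import math


def in_sector(camera, begin_deg, end_deg, radius, point):
    """Return True if point lies in the sector swept counter-clockwise
    from begin_deg to end_deg around camera, within the given radius."""
    dx = point[0] - camera[0]
    dy = point[1] - camera[1]
    distance = math.hypot(dx, dy)
    if distance > radius:
        return False
    if distance == 0.0:
        return True  # the camera position itself is treated as covered
    bearing = math.degrees(math.atan2(dy, dx)) % 360.0
    span = (end_deg - begin_deg) % 360.0      # angular width of the sector
    offset = (bearing - begin_deg) % 360.0    # angle of the point past the beginning vector
    return offset <= span
```

Working with offsets modulo 360 keeps the test correct for sectors that straddle the 0-degree direction, for example one that begins at 330 degrees and ends at 30 degrees.
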
 Topic Hypergraph Hashing for Mobile Image Retrieval | BIBA | Full-Text 843-846 Lei Zhu; Jialie Shen; Liang Xie Hashing is one of the promising solutions to support efficient Mobile Image Retrieval (MIR). However, most existing hashing strategies simply rely on low-level features, which inevitably makes the generated hashing codes less semantic. Moreover, many of them fail to exploit complex and high-order semantic correlations of images. Motivated by these observations, we propose a novel unsupervised hashing scheme, Topic Hypergraph Hashing (THH), to address the limitations. A unified topic hypergraph, where images and topics are represented with independent vertices and hyperedges respectively, is first constructed to model latent semantics of images and their correlations. With the topic hypergraph model, hashing codes and functions are then learned by simultaneously preserving similarity consistency and semantic correlation. Experiments on standard datasets demonstrate that THH can achieve superior performance compared with several state-of-the-art techniques, and it is more suitable for MIR.
 Semi-supervised Coupled Dictionary Learning for Cross-modal Retrieval in Internet Images and Texts | BIBA | Full-Text 847-850 Xing Xu; Yang Yang; Atsushi Shimada; Rin-ichiro Taniguchi; Li He Nowadays, massive amounts of images and texts are emerging on the Internet, creating demand for effective cross-modal retrieval such as text-to-image search and image-to-text search. To eliminate the heterogeneity between the modalities of images and texts, the existing subspace learning methods try to learn a common latent subspace under which cross-modal matching can be performed. However, these methods usually require fully paired samples (images with corresponding texts) and also ignore the class label information along with the paired samples. This may inhibit these methods from learning an effective subspace since the correlations between two modalities are implicitly incorporated. Indeed, the class label information can reduce the semantic gap between different modalities and explicitly guide the subspace learning procedure. In addition, the large quantities of unpaired samples (images or texts) may provide useful side information to enrich the representations from the learned subspace. Thus, in this paper we propose a novel model for the cross-modal retrieval problem. It consists of 1) a semi-supervised coupled dictionary learning step to generate homogeneously sparse representations for different modalities based on both paired and unpaired samples; 2) a coupled feature mapping step to project the sparse representations of different modalities into a common subspace defined by class label information to perform cross-modal matching. Experiments on a large-scale web image dataset, MIRFlickr-1M, with both fully paired and unpaired settings show the effectiveness of the proposed model on the cross-modal retrieval task.
 Vocabulary Expansion Using Word Vectors for Video Semantic Indexing | BIBA | Full-Text 851-854 Nakamasa Inoue; Koichi Shinoda We propose vocabulary expansion for video semantic indexing. From many semantic concept detectors obtained by using training data, we make detectors for concepts not included in the training data. First, we introduce Mikolov's word vectors to represent a word by a low-dimensional vector. Second, we represent a new concept by a weighted sum of concepts in the training data in the word vector space. Finally, we use the same weighting coefficients for combining detectors to make a new detector. In our experiments, we evaluate our methods on the TRECVID Video Semantic Indexing (SIN) Task. We train our models with Google News text documents and ImageNet images to generate new semantic detectors for the SIN task. We show that our method performs as well as SVMs trained with 100 TRECVID example videos.
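
The three steps in the abstract above (word vectors, a weighted sum of known concepts, and reuse of the same weights to combine detectors) can be sketched in Python as follows. The abstract does not say how the weights are computed, so the normalised non-negative cosine similarity used here is purely an assumption for illustration, and all function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expansion_weights(new_vec, known_vecs):
    """Assumed weighting: clip cosine similarities at zero, then normalise to sum to 1."""
    sims = {name: max(0.0, cosine(new_vec, vec)) for name, vec in known_vecs.items()}
    total = sum(sims.values())
    return {name: (s / total if total else 0.0) for name, s in sims.items()}


def expanded_score(weights, detector_scores):
    """Score for the new concept: the same weights applied to known detector outputs."""
    return sum(weights[name] * detector_scores[name] for name in weights)


# Toy 2-D word vectors and detector outputs for one video shot (hypothetical values).
known = {"dog": [1.0, 0.0], "car": [0.0, 1.0]}
weights = expansion_weights([1.0, 0.0], known)          # vector of the unseen concept
score = expanded_score(weights, {"dog": 0.8, "car": 0.1})
```

The point of the sketch is only that no new training videos are needed: the unseen concept's detector is assembled from detectors that already exist.
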
 Filter-Invariant Image Classification on Social Media Photos | BIBA | Full-Text 855-858 Yu-Hsiu Chen; Ting-Hsuan Chao; Sheng-Yi Bai; Yen-Liang Lin; Wen-Chin Chen; Winston H. Hsu With the popularity of social media nowadays, tons of photos are uploaded every day. To understand the image content, image classification has become an essential technique for plenty of applications (e.g., object detection, image caption generation). The Convolutional Neural Network (CNN) has been shown to be the state-of-the-art approach for image classification. However, one characteristic of social media photos is that photo filters are often applied to them, especially on Instagram. We find that prior works are not aware of this trend in social media photos and fail on filtered images. Thus, we propose a novel CNN architecture that utilizes the power of pairwise constraints by combining a Siamese network and the proposed adaptive margin contrastive loss with our discriminative pair sampling method to solve the problem of filter bias. To the best of our knowledge, this is the first work to tackle filter bias on CNNs and achieve state-of-the-art performance on a filtered subset of ILSVRC2012.
 Learning Multi-view Deep Features for Small Object Retrieval in Surveillance Scenarios | BIBA | Full-Text 859-862 Haiyun Guo; Jinqiao Wang; Min Xu; Zheng-Jun Zha; Hanqing Lu With the explosive growth of surveillance videos, object retrieval has become a significant task for security monitoring. However, visual objects in surveillance videos are usually of small size with complex light conditions, view changes and partial occlusions, which increases the difficulty of efficiently retrieving objects of interest in a large-scale dataset. Although deep features have achieved promising results on object classification and retrieval and have been verified to contain rich semantic structure properties, they lack adequate color information, which is as crucial as structure information for effective object representation. In this paper, we propose to leverage discriminative Convolutional Neural Networks (CNNs) to learn deep structure and color features to form an efficient multi-view object representation. Specifically, we utilize a CNN trained on ImageNet to abstract rich semantic structure information. Meanwhile, we propose a CNN model supervised by 11 color names to extract deep color features. Compared with traditional color descriptors, deep color features can capture the common color property across different illumination conditions. Then, the complementary multi-view deep features are encoded into short binary codes by Locality-Sensitive Hashing (LSH) and fused to retrieve objects. Retrieval experiments are performed on a dataset of 100k objects extracted from multi-camera surveillance videos. Comparison results with several popular visual descriptors show the effectiveness of the proposed approach.
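
As a rough Python illustration of encoding two feature views into short binary codes with LSH and fusing them, the sketch below uses random-hyperplane (sign) hashing and fuses by concatenation. Both choices are assumptions, since the abstract does not specify the LSH family or the fusion rule, and the feature values, dimensions and function names are hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes of dimension dim (seeded for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_code(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
            for plane in hyperplanes]


def fuse(structure_code, color_code):
    """Assumed fusion rule: concatenate the per-view codes."""
    return structure_code + color_code


# Hypothetical 4-D structure feature and 3-D color feature for one object.
structure_planes = make_hyperplanes(dim=4, n_bits=8, seed=1)
color_planes = make_hyperplanes(dim=3, n_bits=8, seed=2)
code = fuse(lsh_code([0.2, -1.0, 0.5, 0.3], structure_planes),
            lsh_code([0.9, 0.1, -0.4], color_planes))
```

Sign hashing depends only on the direction of a feature vector, so rescaling a feature leaves its code unchanged.
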
 Unsupervised Extraction of Human-Interpretable Nonverbal Behavioral Cues in a Public Speaking Scenario | BIBA | Full-Text 863-866 M. Iftekhar Tanveer; Ji Liu; M. Ehsan Hoque We present a framework for unsupervised detection of nonverbal behavioral cues -- hand gestures, pose, body movements, etc. -- from a collection of motion capture (MoCap) sequences in a public speaking setting. We extract the cues by solving a sparse and shift-invariant dictionary learning problem, known as shift-invariant sparse coding. We find that the extracted behavioral cues are human-interpretable in the context of public speaking. Our technique can be applied to automatically identify the common patterns of body movements and the time instances of their occurrences, minimizing the time and effort needed for manual detection and coding of nonverbal human behaviors.
 Exploiting Word and Visual Word Co-occurrence for Sketch-based Clipart Image Retrieval | BIBA | Full-Text 867-870 Ching-Hsuan Liu; Yen-Liang Lin; Wen-Feng Cheng; Winston H. Hsu With the increasing popularity of touch-screen devices, retrieving images by hand-drawn sketch has become a trend. A human sketch can easily express complex user intentions such as the object shape. However, sketches are sometimes ambiguous due to different drawing styles and inter-class object shape ambiguity. Although adding text queries as semantic information can help remove the ambiguity of a sketch, it requires a huge amount of effort to annotate all database clipart images with text tags. We propose a method that directly models the relationship between text and clipart images through the co-occurrence relationship between words and visual words, which improves traditional sketch-based image retrieval (SBIR), provides a baseline performance and obtains more relevant results even when none of the images in the database has a text tag. Experimental results show that our method helps SBIR obtain better retrieval results, since it learns semantic meaning from the "word-visual word" (W-VW) co-occurrence relationship.
 Heterogeneous Graph-based Video Search Reranking using Web Knowledge via Social Media Network | BIBA | Full-Text 871-874 Soh Yoshida; Takahiro Ogawa; Miki Haseyama Graph-based reranking is effective for refining text-based video search results by making use of the social network structure. Unlike previous works, which focus only on an individual video graph, the proposed method leverages the mutual reinforcement of heterogeneous graphs, such as videos and their associated tags obtained by social influence mining. Specifically, propagation of information relevancy across different modalities is performed by exchanging information of inter- and intra-relations among heterogeneous graphs. The proposed method then formulates video search reranking as an optimization problem from the perspective of a Bayesian framework. Furthermore, in order to model the consistency over the modified video graph topology, a local learning regularization with a social community detection scheme is introduced to the framework. Since videos within the same social community have strong semantic correlation, the consistency score estimation becomes feasible. Experimental results obtained by applying the proposed method to a real-world video collection show its effectiveness.
 Selective K-means Tree Search | BIBA | Full-Text 875-878 Tuan Anh Nguyen; Yusuke Matsui; Toshihiko Yamasaki; Kiyoharu Aizawa In object recognition and image retrieval, an inverted indexing method is used to solve the approximate nearest neighbor search problem. In these tasks, inverted indexing provides a nonexhaustive solution to large-scale search. However, a problem of previous inverted indexing methods is that a large-scale inverted index is required to achieve a high search recall rate. In this study, we address the problem of reducing the time required to build an inverted index without degrading the search accuracy and speed. Thus, we propose a selective k-means tree search method that combines the power of both hierarchical k-means tree and selective nonexhaustive search. Experiments based on approximate nearest neighbor search using a large dataset comprising one billion SIFT features showed that the hierarchical inverted file based on the selective k-means tree method could be built six times faster, while obtaining almost the same recall and search speed as the state-of-the-art inverted indexing methods.
 Predicting Continuous Probability Distribution of Image Emotions in Valence-Arousal Space | BIBA | Full-Text 879-882 Sicheng Zhao; Hongxun Yao; Xiaolei Jiang Previous works on image emotion analysis mainly focused on assigning a dominant emotion category or the average dimension values to an image for affective image classification and regression. However, this is often insufficient in many applications, as the emotions that are evoked in viewers by an image are highly subjective and different. In this paper, we propose to predict the continuous probability distribution of dimensional image emotions represented in valence-arousal space. Based on a statistical analysis of the constructed Image-Emotion-Social-Net dataset, we represent the emotion distribution as a Gaussian mixture model (GMM), which is estimated by the EM algorithm. Then we extract commonly used features of different levels for each image. Finally, we formulate the emotion distribution prediction as a multi-task shared sparse regression (MTSSR) problem, which is optimized by iteratively reweighted least squares. In addition, we introduce three baseline algorithms. Experiments conducted on the Image-Emotion-Social-Net dataset demonstrate the superiority of the proposed method, as compared to some state-of-the-art approaches.
 Towards Distributed Video Summarization | BIBA | Full-Text 883-886 Shayok Chakraborty; Omesh Tickoo; Ravishankar Iyer Video summarization is a fertile topic in multimedia research. While the advent of modern video cameras and several social networking and video sharing websites (like YouTube, Flickr, Facebook) has led to the generation of humongous amounts of redundant video data, video summarization has emerged as an effective methodology to automatically extract a succinct and condensed representation of a given video. The unprecedented increase in the volume of video data necessitates the usage of multiple, independent computers for its storage and processing. In order to understand the overall essence of a video, it is therefore necessary to develop an algorithm which can summarize a video distributed across multiple computers. In this paper, we propose a novel algorithm for distributed video summarization. Our algorithm requires minimal communication among the computers (over which the video is stored) and also enjoys nice theoretical properties. Our empirical results on several challenging, unconstrained videos corroborate the potential of the proposed framework for real-world distributed video summarization applications.
 Semantic Image Search From Multiple Query Images | BIBA | Full-Text 887-890 Gonzalo Vaca-Castano; Mubarak Shah This paper presents a novel search paradigm that uses multiple images as input to perform semantic search of images. While earlier work focuses on using single or multiple query images to retrieve images with views of the same instance, the proposed paradigm uses each query image to discover text-based descriptors that are leveraged to find the common concepts that are implicitly shared by all of the query images, and retrieves images considering the found concepts. Our implementation uses high-level visual features extracted from a deep convolutional network to retrieve images similar to each query input. These images have associated text previously generated by implicit crowdsourcing. A Bag of Words (BoW) textual representation of each query image is built from the associated text of the retrieved similar images. A learned vector space representation of English words extracted from a corpus of 100 billion words allows computing the conceptual similarity of words. The words that represent the input images are used to find new words that share conceptual similarity across all the input images. These new words are combined with the representations of the input images to obtain a BoW textual representation of the search, which is used to perform image retrieval. The retrieved images are re-ranked to enhance visual similarity with respect to any of the input images. Our experiments show that the concepts found are meaningful and that they correctly retrieve 72.43% of the images in the top 25; user ratings collected for the case studies are also reported.
 Geolocation with Subsampled Microblog Social Media | BIBA | Full-Text 891-894 Miriam Cha; Youngjune L. Gwon; H. T. Kung We propose a data-driven geolocation method on microblog text. The key idea underlying our approach is sparse coding, an unsupervised learning algorithm. Unlike conventional positioning algorithms, we geolocate a user by identifying features extracted from her social media text. We also present an enhancement robust to a random erasure of words in the text and report our experimental results with uniformly or randomly subsampled microblog text. Our solution features a novel two-step procedure consisting of upconversion and iterative refinement by joint sparse coding. As a result, we can reduce the computational cost of geolocation while preserving accuracy. In light of information preservation and privacy, we remark on potential applications of this paper.
 Social Tag Relevance Estimation via Ranking-Oriented Neighbour Voting | BIBA | Full-Text 895-898 Chaoran Cui; Jialie Shen; Jun Ma; Tao Lian User-generated tags associated with social images are frequently imprecise and incomplete. Therefore, a fundamental challenge in tag-based applications is the problem of tag relevance estimation, which concerns how to interpret and quantify the relevance of a tag with respect to the contents of an image. In this paper, we address the key problem from a new perspective of learning to rank, and develop a novel approach to facilitate tag relevance estimation to directly optimize the ranking performance of tag-based image search. A supervision step is introduced into the neighbour voting scheme, in which tag relevance is estimated by accumulating votes from visual neighbours. Through explicitly modelling the neighbour weights and tag correlations, the risk of making heuristic assumptions is effectively avoided for conventional methods. Extensive experiments on a benchmark dataset in comparison with the state-of-the-art methods demonstrate the promise of our approach.
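
The neighbour voting scheme mentioned above, in which tag relevance is estimated by accumulating votes from visual neighbours, can be sketched in Python as follows. With uniform weights this is only a plain voting baseline; the paper's supervised, ranking-oriented learning of neighbour weights and tag correlations is not shown, and the `weights` argument is a hypothetical stand-in for such learned values.

```python
def tag_relevance(image_tags, neighbour_tag_sets, weights=None):
    """Score each tag of an image by the (optionally weighted) number of
    visual neighbours that carry the same tag."""
    if weights is None:
        weights = [1.0] * len(neighbour_tag_sets)  # uniform votes: unsupervised baseline
    return {
        tag: sum(w for w, tags in zip(weights, neighbour_tag_sets) if tag in tags)
        for tag in image_tags
    }


# Tags of the three nearest visual neighbours of one image (hypothetical data).
neighbours = [{"beach", "sea"}, {"beach"}, {"dog"}]
uniform = tag_relevance(["beach", "cat"], neighbours)
weighted = tag_relevance(["beach", "cat"], neighbours, weights=[0.5, 0.25, 1.0])
```

A tag that few visually similar images share receives a low score, which is how imprecise user tags are demoted in tag-based search.
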
 EMV-matchmaker: Emotional Temporal Course Modeling and Matching for Automatic Music Video Generation | BIBA | Full-Text 899-902 Jen-Chun Lin; Wen-Li Wei; Hsin-Min Wang This paper presents a novel content-based emotion-oriented music video (MV) generation system, called EMV-matchmaker, which utilizes the emotional temporal phase sequence of the multimedia content as a bridge to connect music and video. Specifically, we adopt an emotional temporal course model (ETCM) to respectively learn the relationship between music and its emotional temporal phase sequence and the relationship between video and its emotional temporal phase sequence from an emotion-annotated MV corpus. Then, given a video clip (or a music clip), the visual (or acoustic) ETCM is applied to predict its emotional temporal phase sequence in a valence-arousal (VA) emotional space from the corresponding low-level visual (or acoustic) features. For MV generation, string matching is applied to measure the similarity between the emotional temporal phase sequences of video and music. The results of objective and subjective experiments demonstrate that EMV-matchmaker performs well and can generate appealing music videos that can enhance the viewing and listening experience.
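
The abstract above says that string matching is used to measure the similarity between the emotional temporal phase sequences of video and music, without naming a particular algorithm. As one possible reading, the Python sketch below encodes each phase as a character and uses Levenshtein edit distance; both the encoding and the choice of edit distance are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        curr = [i]
        for j, item_b in enumerate(b, 1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sequence_similarity(a, b):
    """Map edit distance to a similarity in [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Hypothetical phase strings for a video clip and a candidate music clip.
similarity = sequence_similarity("AABBC", "AABCC")
```

Under this reading, a music clip would be matched to a video by picking the candidate whose phase string has the highest similarity.
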
 Scalable Multimedia Retrieval by Deep Learning Hashing with Relative Similarity Learning | BIBA | Full-Text 903-906 Lianli Gao; Jingkuan Song; Fuhao Zou; Dongxiang Zhang; Jie Shao Learning-based hashing methods are becoming the mainstream for approximate scalable multimedia retrieval. They consist of two main components: hash code learning for training data and hash function learning for new data points. Tremendous efforts have been devoted to designing novel methods for these two components, i.e., supervised and unsupervised methods for learning hash codes, and different models for inferring hashing functions. However, there is little work integrating supervised and unsupervised hash code learning into a single framework. Moreover, the hash function learning component is usually based on hand-crafted visual features extracted from the training images. The performance of a content-based image retrieval system crucially depends on the feature representation, and such hand-crafted visual features may degrade the accuracy of the hash functions. In this paper, we propose a semi-supervised deep learning hashing (DLH) method for fast multimedia retrieval. More specifically, in the first component, we utilize both visual and label information to learn a relative similarity graph that can more precisely reflect the relationship among training data, and then generate the hash codes based on the graph. In the second component, we apply a deep convolutional neural network (CNN) to simultaneously learn a good multimedia representation and hash functions. Extensive experiments on three popular datasets demonstrate the superiority of our DLH over both supervised and unsupervised hashing methods.
 Image Popularity Prediction in Social Media Using Sentiment and Context Features | BIBA | Full-Text 907-910 Francesco Gelli; Tiberio Uricchio; Marco Bertini; Alberto Del Bimbo; Shih-Fu Chang Images in social networks share different destinies: some are going to become popular while others are going to be completely unnoticed. In this paper we propose to use visual sentiment features together with three novel context features to predict a concise popularity score of social images. Experiments on large scale datasets show the benefits of proposed features on the performance of image popularity prediction. Exploiting state-of-the-art sentiment features, we report a qualitative analysis of which sentiments seem to be related to good or poor popularity. To the best of our knowledge, this is the first work understanding specific visual sentiments that positively or negatively influence the eventual popularity of images.
 Subtle Facial Expression Recognition Using Adaptive Magnification of Discriminative Facial Motion | BIBA | Full-Text 911-914 Sung Yeong Park; Seung Ho Lee; Yong Man Ro Recently, recognizing spontaneous facial expressions has gained increasing attention in various emerging applications related to human affect. Spontaneous facial expressions may generally have different temporal characteristics across subjects, emotion types, and so on. In this paper, we propose a facial expression recognition (FER) method which adaptively magnifies a subtle facial motion based on its temporal characteristics. In the training stage, we learn the relations between the temporal characteristics of facial motions and their discriminative temporal filtering. The learned model is used to automatically predict the most discriminative temporal filtering that magnifies the subtle facial motion in a test sequence. Experimental results show that the proposed FER using adaptive motion magnification performs clearly better than FER using non-adaptive motion magnification as well as FER without motion magnification.
 "Clustering of Dancelets": Towards Video Recommendation Based on Dance Styles | BIBA | Full-Text 915-918 Tingting Han; Hongxun Yao; Xiaoshuai Sun; Yanhao Zhang; Sicheng Zhao; Xiusheng Lu; Yinghao Huang; Wenlong Xie Dance is a special and important type of action, composed of abundant and various action elements. However, the recommendation of dance videos on the web are still not well studied. It is hard to realize it in the way of traditional methods using associated texts or static features of video content. In this paper, we study the problem focusing on extraction and representation of action information in dances. We propose to recommend dance videos based on the automatically discovered "Dance Styles", which play a significant role in characterizing different types of dances. To bridge the semantic gap of video content and mid-level concept, style, we take advantage of a mid-level action representation method, and extract representative patches as "Dancelets", a sort of intermediation between videos and the concepts. Furthermore, we propose to employ Motion Boundaries as saliency priors and sparsely extract patches containing more representative information to generate a set of dancelet candidates. Dancelets are then discovered by Normalized-cut method, which is superior in grouping visually similar patterns into the same clusters. For the fast and effective recommendation, a random forest-based index is built, and the ranking results are derived according to the matching results in all the leaf notes. Extensive experiments validated on the web dance videos demonstrate the effectiveness of the proposed methods for dance style discovery and video recommendation based on styles.
 The Quest for Visual Interest | BIBA | Full-Text 919-922 Mohammad Soleymani In this paper, we report on identifying the underlying factors that contribute to the visual interest in digital photos. A set of 1005 digital photos covering different topics and of different qualities was collected from Flickr. Images were annotated by a pool of diverse participants on a crowdsourcing platform. 12 bipolar ratings were collected for each photo on a 7-point semantic differential scale, including dimensions related to interest, emotions and image quality. Every image received 20 annotations from unique participants. The most important appraisals and visual attributes for visual interest in photos were identified. We found that intrinsic pleasantness, arousal, visual quality and coping potential are the most important factors contributing to visual interest in digital photos. We developed a system that automatically detects the important visual attributes from low level visual features and demonstrated their significance in predicting interest at individual level.
 How to Take a Good Selfie? | BIBA | Full-Text 923-926 Mahdi M. Kalayeh; Misrak Seifu; Wesna LaLanne; Mubarak Shah Selfies are now a global phenomenon. This massive number of self-portrait images taken and shared on social media is revolutionizing the way people introduce themselves and the circle of their friends to the world. While taking photos of oneself can be seen simply as recording personal memories, the urge to share them with other people adds an exclusive sensation to the selfies. Due to the Big Data nature of selfies, it is nearly impossible to analyze them manually. In this paper, we provide, to the best of our knowledge, the first selfie dataset for research purposes with more than 46,000 images. We address interesting questions about selfies, including how appearance of certain objects, concepts and attributes influences the popularity of selfies. We also study the correlation between popularity and sentiment in selfie images. In a nutshell, from a large scale dataset, we automatically infer what makes a selfie a good selfie. We believe that this research creates new opportunities for social, psychological and behavioral scientists to study selfies from a large scale point of view, a perspective that best fits the nature of the selfie phenomenon.
 R2P: Recomposition and Retargeting of Photographic Images | BIBA | Full-Text 927-930 Hui-Tang Chang; Po-Cheng Pan; Yu-Chiang Frank Wang; Ming-Syan Chen In this paper, we propose a novel approach for performing joint recomposition and retargeting of photographic images (R2P). Given a reference image of interest, our method is able to automatically alter the composition of the input source image accordingly, while the recomposed output will be jointly retargeted to fit the reference. This is achieved by recomposing the visual components of the source image via graph matching, followed by solving a constrained mesh-warping based optimization problem for retargeting. As a result, the recomposed output image would fit the reference while suppressing possible distortion. Our experiments confirm that our proposed R2P method is able to achieve visually satisfactory results, without the need to use pre-collected labeled data or predetermined aesthetics rules.
 Egocentric Video Summarization of Cultural Tour based on User Preferences | BIBA | Full-Text 931-934 Patrizia Varini; Giuseppe Serra; Rita Cucchiara In this paper, we propose a new method to obtain customized video summarization according to specific user preferences. Our approach is tailored on Cultural Heritage scenario and is designed on identifying candidate shots, selecting from the original streams only the scenes with behavior patterns related to the presence of relevant experiences, and further filtering them in order to obtain a summary matching the requested user's preferences. Our preliminary results show that the proposed approach is able to leverage user's preferences in order to obtain a customized summary, so that different users may extract from the same stream different summaries.
 A Novel Statistical Approach for Image and Video Retrieval and Its Adaption for Active Learning | BIBA | Full-Text 935-938 Moitreya Chatterjee; Anton Leuski The ever expanding multimedia content (such as images and videos), especially on the web, necessitates effective text query-based search (or retrieval) systems. Popular approaches for addressing this issue use the query-likelihood model, which fails to capture the user's information needs. In this work, therefore, we explore a new ranking approach in the context of image and video retrieval from text queries. Our approach assumes two separate underlying distributions for the query and the document, respectively. We then determine the extent of similarity between these two statistical distributions for the task of ranking. Furthermore, we extend our approach, using Active Learning techniques, to address the question of obtaining a good performance without requiring a fully labeled training dataset. This is done by taking Sample Uncertainty, Density and Diversity into account. Our experiments on the popular TRECVID corpus and the open, relatively small-sized USC SmartBody corpus show that we are almost at-par or sometimes better than multiple state-of-the-art baselines.
 Automatically Stereoscopic Camera Control for 3D Animation Production | BIBA | Full-Text 939-942 Dawei Lu; Huadong Ma; Zeyu Wang; Liang Liu; Huiyuan Fu This paper proposes a novel approach for automatically controlling stereoscopic camera parameters that specifically addresses challenges in stereo 3D animation production process. Our proposed camera control method produces stereo contents with preferable depth perception and guarantees visual comfort by optimization of camera parameters. We introduce an attention tracking method to calculate convergence plane, avoiding window violation and minimizing visual conflict. Moreover, we derive a smoothing function on convergence plane that reduces depth jump over time. Then, we calculate the inter-axial separation using a perceived depth mapping. We describe how to implement our method on the Maya plug-in and test the stereo effect using professional stereo 3D animation scenes. The experimental results, including a user study, show that our method enhances the stereo effect. Our controller provides automatic camera control that can be helpful in creating comfortable and faster stereo 3D animations.
 Color Photo Makeover via Crowd Sourcing and Recoloring | BIBA | Full-Text 943-946 Wengang Cheng; Ruru Jiang; Chang Wen Chen It is not always easy for amateur photographers to capture photos with desired colors even on a classic hot spot, as the appearance of a color photo depends on many factors. This paper proposes a novel approach to recolor given photos via a crowdsourcing based makeover scheme. When a user inputs a photo to be recolored, the proposed system will first conduct favorite exemplars suggestion from the images hosted by the social media sites, by jointly leveraging contextual and visual information associated with the images. The recommended exemplars shall reveal the scene and context dependent color compositions and provide users with diverse possible color styles. Then, a novel superpixel-based recoloring scheme, incorporating color statistics, texture characteristics and spatial constraints into soft matching, is applied to generate new photos of desired color. Experiments and a user study demonstrate that the proposed color photo makeover is able to achieve robust recoloring results for various outdoor photos.
 Multi-view Semi-supervised Learning for Web Image Annotation | BIBA | Full-Text 947-950 Mengqiu Hu; Yang Yang; Hanwang Zhang; Fumin Shen; Jie Shao; Fuhao Zou With the explosive increasing of web image data, image annotation has become a critical research issue for image semantic index and search. In this work, we propose a novel model, termed as multi-view semi-supervised learning (MVSSL), for robust image annotation task. Specifically, we exploit both labeled images and unlabeled images to uncover the intrinsic data structural information. Meanwhile, to comprehensively describe an individual datum, we take advantage of the correlated and complemental information derived from multiple facets of image data (i.e., multiple views or features). We devise a robust pair-wise constraint on outcomes of different views to achieve annotation consistency. Furthermore, we integrate a robust classifier learning component via l2,1 loss, which can provide effective noise identification power during the learning process. Finally, we devise an efficient iterative algorithm to solve the optimization problem in MVSSL. We conduct extensive experiments on the NUS-WIDE dataset, and the results illustrate that our proposed approach is promising for large scale web image annotation task.
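The l2,1 loss mentioned in the entry above builds on the l2,1 matrix norm. The following is a minimal sketch of that norm only, assuming the common convention of summing the Euclidean norms of the rows (some papers sum over columns instead). The function name `l21_norm` is illustrative and is not taken from the paper.

```python
import math

def l21_norm(matrix):
    """Sum of the Euclidean (l2) norms of the rows of a matrix.

    Assumption: row-wise convention; this is a generic definition,
    not the specific loss used in MVSSL.
    """
    return sum(math.sqrt(sum(x * x for x in row)) for row in matrix)

# Rows with norms 5, 0 and 13; an all-zero row contributes nothing,
# which is why penalizing this norm tends to zero out whole rows.
example = [[3.0, 4.0], [0.0, 0.0], [5.0, 12.0]]
total = l21_norm(example)
```

Because each row enters through its l2 norm and the rows are then added up, a large error concentrated in one row grows only linearly in the total, which is one common motivation for using this norm with noisy data.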
 Tracking Cultural Differences in News Video Creation | BIBA | Full-Text 951-954 Chun-Yu Tsai; John R. Kender Many videos on the Web are created in different countries about the same international event. Their specialized video content, as well as their viewing and reposting rates, reflect different cultural interests. Effectively tracking cross-cultural visual memes of the same event, in online video repositories of different cultures, can provide users with a more comprehensive understanding of an international event. We propose a new way to use the PageRank algorithm to model cross-cultural visual meme influence, which more accurately captures the rates at which visual memes are re-posted in a specified time period in a specified culture.
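For readers unfamiliar with the PageRank algorithm named in the entry above, here is a minimal power-iteration sketch of standard PageRank. It is not the authors' cross-cultural variant; the graph, the node names, and the reading of an edge as "video A reuses a visual meme from video B" are hypothetical and chosen only for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Standard PageRank by power iteration over {node: [linked nodes]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node starts each round with the teleportation share.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new_rank[t] += damping * rank[v] / n
        rank = new_rank
    return rank

# Hypothetical graph: an edge A -> B means video A reuses a meme from B.
graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In this toy graph, node `c` ends up with the highest score because both other nodes point to it, and `b` ends up lowest because nothing points to it.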
 Click-through-based Deep Visual-Semantic Embedding for Image Search | BIBA | Full-Text 955-958 Yuan Liu; Zhongchao Shi; Xue Li; Gang Wang The problem of image search is mostly considered from the perspectives of feature-based vector model and image ranker learning. A fundamental issue that underlies the success of these approaches is the similarity learning between query and image. The need of image surrounding texts in feature-based vector model, however, makes the similarity sensitive to the quality of text descriptions. On the other hand, the image ranker learning can suffer from robustness problem, originating from the fact that human labeled query-image pairs do not always predict user search intention precisely. We demonstrate in this paper that the above two issues can be well mitigated by jointly exploring visual-semantic embedding and the use of click-through data. Specifically, we propose a novel click-through-based deep visual-semantic embedding (C-DVSE) model for learning query and image similarity. The proposed model consists of two components: a deep convolutional neural network followed by an image embedding layer for learning visual embedding, and a deep neural network for generating query semantic embedding. The objective of our model is to maximize the correlation between semantic (query) and visual (clicked image) embedding. When the visual-semantic embedding is learnt, query-image similarity can be directly computed by cosine similarity on this embedding space. On a large-scale click-based image dataset with 11.7 million queries and one million images, our model is shown to be powerful for keyword-based image search with superior performance over several state-of-the-art methods.
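The final step described in the entry above, scoring query-image pairs by cosine similarity in a shared embedding space, can be sketched as follows. The embeddings here are made-up three-dimensional vectors, and the names `cosine_similarity` and `rank_images` are illustrative. How C-DVSE learns the embeddings is not shown.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def rank_images(query_vec, image_vecs):
    """Return (image name, score) pairs sorted by descending similarity."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in image_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical embeddings standing in for learned query/image vectors.
query = [1.0, 0.0, 1.0]
images = {
    "img_a": [0.9, 0.1, 0.8],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [-1.0, 0.0, -1.0],
}
ranking = rank_images(query, images)
```

Since cosine similarity ignores vector length, only the direction of each embedding affects the ranking.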
 Partially Common-Semantic Pursuit for RGB-D Object Recognition | BIBA | Full-Text 959-962 Lu Jin; Zechao Li; Xiangbo Shu; Shenghua Gao; Jinhui Tang For the RGB-D object recognition task, the robust and rich representations can boost the performance. Most works employ feature learning approaches to learn specific representation for the RGB and depth modalities independently, while some directly learn common property. Different from them, this paper proposes a novel supervised feature learning method for RGB-D object recognition, named Partially Common-Semantic Learning (PCSL), which jointly captures the complementary and consistency semantic information from RGB and depth modalities. The complementary information is revealed by the individual modality, while the consistency is exploited by both modalities simultaneously. In PCSL, Reconstruction Independent Component Analysis (RICA) is extended to integrate the supervised information and learn both of the complementary and partially shared common semantic information. The proposed approach is evaluated on two public RGB-D datasets and achieves better performance than several state-of-the-art methods.
 Pinterest Board Recommendation for Twitter Users | BIBA | Full-Text 963-966 Xitong Yang; Yuncheng Li; Jiebo Luo Pinboard on Pinterest is an emerging media to engage online social media users, on which users post online images for specific topics. Regardless of its significance, there is little previous work specifically to facilitate information discovery based on pinboards. This paper proposes a novel pinboard recommendation system for Twitter users. In order to associate contents from the two social media platforms, we propose to use MultiLabel classification to map Twitter user followees to pinboard topics and visual diversification to recommend pinboards given user interested topics. A preliminary experiment on a dataset with 2000 users validated our proposed system.
 A Video Timeline with Bookmarks and Prefetch State for Faster Video Browsing | BIBA | Full-Text 967-970 Axel Carlier; Vincent Charvillat; Wei Tsang Ooi Reducing seek latency by predicting what the users will access is important for user experience, particularly during video browsing, where users seek frequently to skim through a video. Much existing research strived to predict user access pattern more accurately to improve the prefetching hit rate. This paper proposed a different approach whereby the prefetch hit rate is improved by biasing the users to seek to prefetched content with higher probability, through changing the video player user interface. Through a user study, we demonstrated that our player interface can lead to up to 4× more seeks to bookmarked segments and reduce seek latency by 40%, compared to a video player interface commonly used today. The user study also showed that the user experience and the understanding of the video content when browsing is not compromised by the changes in seek behavior.
 Giggler: An Intuitive, Real-Time Integrated Wireless In-Ear Monitoring and Personal Mixing System using Mobile Devices | BIBA | Full-Text 971-974 Andries Valstar; Min-Chieh Hsiu; Te-Yen Wu; Mike Y. Chen For live music performances, current Wireless In-Ear Monitoring and Personal Mixing setups require a lot of equipment and wiring. This paper introduces Giggler, a system that makes a Wireless In-Ear Monitoring and Personal Mixing experience easier to set up, easier to use, faster to control and more accessible for musicians than conventional systems by integrating all equipment into one mobile device per musician. The results of the two user studies that we conducted show that Giggler's User Interface performs up to more than twice as fast as a traditional mixer and indicate that Giggler outperforms a physical mixer because visual features are added to streamline the channel identification process.
 Dynamic Adjustment of Subtitles Using Audio Fingerprints | BIBA | Full-Text 975-978 Lucas C. Villa Real; Rodrigo Laiola Guimarães; Priscilla Avegliano Anyone who ever downloaded subtitle files from the Internet has faced problems synchronizing them with the associated media files. Even with the efforts of communities on reviewing user-contributed subtitles and with mechanisms in movie players to automate the discovery of subtitles for a given media, users still face lip synchronization issues. In this work we conduct a study on several subtitle files associated with popular movies and TV series and analyze their differences. Based on that, we propose a two-phase subtitle synchronization method that annotates subtitles with audio fingerprints, which serve as synchronization anchors to the media player. Preliminary results obtained with our prototype suggest that our technique is effective and has minimal impact on the extension of subtitle formats and on media playback performance.
 Octave-dependent Probabilistic Latent Semantic Analysis to Chorus Detection of Popular Song | BIBA | Full-Text 979-982 Sheng Gao; Haizhou Li Content representation of music signal is an essential part of music information retrieval applications, e.g. chorus detection, genre classification, etc. In the paper, we propose the octave-dependent probabilistic latent semantic analysis (OdPlsa) to discover the latent audio patterns (or clusters) through spectral-temporal analysis. Then the audio content of each segment is characterized using the statistical pattern distribution. In OdPlsa, the latent pattern is modeled by multinomial distribution which characterizes the magnitude distribution of 12-dimensional pitch class profiles over a temporal window. It thus effectively models melody information as well as octave relations in music signal. Its efficiency as a feature extraction technique is evaluated on chorus detection of popular songs. In terms of multiple performance metrics such as boundary accuracy, precision, recall and F1, the proposed technique is much superior to the widely accepted chroma feature.
 Subjectivity in Aesthetic Quality Assessment of Digital Photographs: Analysis of User Comments | BIBA | Full-Text 983-986 Won-Hee Kim; Jun-Ho Choi; Jong-Seok Lee While most of the existing work in aesthetic image quality assessment focuses on the overall (or average) opinion of users, this paper raises the issue of subjectivity (or taste) of aesthetic quality. We argue that subjectivity differs among different images, and investigate what causes such difference. We first analyze statistics of the user ratings of photos in a photo contest website, DPChallenge, in the viewpoint of average and standard deviation values of the ratings. Then, more importantly, we analyze the users' comments in order to identify sources contributing to subjectivity. When considering the importance of personalization in photo applications, we believe that our findings will be a valuable first step in the relevant future research.
 EEG Connectivity Analysis in Perception of Tone-mapped High Dynamic Range Videos | BIBA | Full-Text 987-990 Seong-Eun Moon; Jong-Seok Lee High dynamic range (HDR) imaging has attracted attention as a new technology for immersive multimedia experience. In comparison to conventional low dynamic range (LDR) contents, HDR contents are expected to provide better quality of experience (QoE). In this paper, we investigate implicit QoE measurement of tone-mapped HDR videos by using connectivity-based EEG features that convey information about simultaneous activations of different brain regions and thus can explain better the cognitive process than the conventional features using single channel powers. Through the experiment classifying EEG signals into tone-mapped HDR and LDR, it is shown that the connectivity features, particularly those representing directed information flows between brain regions, are effective in both subject-dependent and subject-independent scenarios.
 Polyphonic Music Modelling with LSTM-RTRBM | BIBA | Full-Text 991-994 Qi Lyu; Zhiyong Wu; Jun Zhu Recent interest in music information retrieval and related technologies is exploding. However, very few of the existing techniques take advantage of the recent advancements in neural networks. The challenges of developing effective browsing, searching and organization techniques for the growing bodies of music collections call for more powerful statistical models. In this paper, we present LSTM-RTRBM, a new neural network model for the problem of creating accurate yet flexible models of polyphonic music. Our model integrates the ability of Long Short-Term Memory (LSTM) in memorizing and retrieving useful history information, together with the advantage of Restricted Boltzmann Machine (RBM) in high dimensional data modelling. Our approach greatly improves the performance of polyphonic music sequence modelling, achieving the state-of-the-art results on multiple datasets.
 Multi-Sensor Cello Recordings for Instantaneous Frequency Estimation | BIBA | Full-Text 995-998 Fabian-Robert Stöter; Michael Müller; Bernd Edler Estimating the fundamental frequency (F0) of a signal is a well studied task in audio signal processing with many applications. If the F0 varies over time, the complexity increases, and it is also more difficult to provide ground truth data for evaluation. In this paper we present a novel dataset of cello recordings addressing the lack of reference annotations for musical instruments. Besides audio data, we include sensor recordings capturing the finger position on the fingerboard which is converted into an instantaneous frequency estimate. In speech processing, the electroglottograph (EGG) is able to capture the excitation signal of the vocal tract, which is then used to generate a reference instantaneous F0. Inspired by this approach, we included high speed video camera recordings to extract the excitation signal originating from the moving string. The derived data can be used to analyze vibratos -- a very commonly used playing style. The dataset is released under a Creative Commons license.
 An Elicitation Study on Gesture Attitudes and Preferences Towards an Interactive Hand-Gesture Vocabulary | BIBA | Full-Text 999-1002 Haiwei Dong; Nadia Figueroa; Abdulmotaleb El Saddik With the introduction of new depth sensing technologies, interactive hand-gesture devices are rapidly emerging. However, the hand-gestures used in these devices do not follow a common vocabulary, making certain control command device-specific. In this paper we present an initial effort to create a standardized interactive hand-gesture vocabulary for the next generation of television applications. We conduct a user-elicitation study using a survey in order to define a common vocabulary for specific control commands, such as Volume up/down, Menu open/close, etc. This survey is entirely user-oriented and thus it has two phases. In the first phase, we ask open questions about specific commands. In the second phase, we use the answers suggested from the first phase to create a multiple choice questionnaire. Based on the results from the survey, we study the gesture attitudes and preferences between gender groups, and between age groups with a quantitative and qualitative statistical analysis. Finally, the hand-gesture vocabulary is derived after applying an agreement analysis on the user-elicited gestures. The proposed methodology for gesture set design is comparable with existing methodologies and yields higher agreement levels than relevant user-elicited studies in the field.
 Automated Video Editing for Aesthetic Quality Improvement | BIBA | Full-Text 1003-1006 Jun-Ho Choi; Jong-Seok Lee These days, a large number of videos are taken by various kinds of handheld devices, but many of them have poor aesthetic quality. In this paper, we present an automated video editing system that uses the shot length, camera motion, and color distribution as key aesthetic features. Given an amateur video, our system computes the original unrefined camera motion as homography and tries to remove some unreliable frames, which consequently splits the video into several shots. It then applies enhancement processes, including reconstruction of the overall camera motions and harmonization of color distributions. We apply our method to some amateur videos and evaluate the results through a subjective test. It is demonstrated that reducing the shot length in our method is a key point of editing that can lead to enhanced satisfaction by viewers for the edited videos.
 Multimodal Dataset for Assessment of Quality of Experience in Immersive Multimedia | BIBA | Full-Text 1007-1010 Anne-Flore Nicole Marie Perrin; He Xu; Eleni Kroupi; Martin Rerábek; Touradj Ebrahimi This paper presents a novel multimodal dataset for the analysis of Quality of Experience (QoE) in emerging immersive multimedia technologies. In particular, the perceived Sense of Presence (SoP) induced by one-minute long video stimuli is explored with respect to content, quality, resolution, and sound reproduction and annotated with subjective scores. Furthermore, a complementary analysis of the acquired physiological signals, such as EEG, ECG, and respiration is carried out, aiming at an alternative evaluation of human experience while consuming immersive multimedia. Presented results confirm the value of the introduced dataset and its consistency for the purposes of QoE assessment for immersive multimedia. More specifically, subjective ratings demonstrate that the created dataset enables distinction between low and high levels of immersiveness, which is also confirmed by a preliminary analysis of recorded physiological signals.
 MIL: Music Exploration and Visualization via Lyric and Image | BIBA | Full-Text 1011-1014 Xixuan Wu; Yu Qiao; Xiaoou Tang In this paper, we introduce MIL: a music exploration prototype which integrates music (M), image (I), and lyrics (L), for efficiently visualizing and browsing music collections. MIL utilizes a novel structure, music semantic graph (MSG), to organize music collections in a hierarchical way by leveraging lyrics and acoustic cues of music. Each node of MSG corresponds to a music concept and is associated with a cluster of music tracks. MIL offers users a novel way to efficiently explore and scan music collections by using lyrics and image information. In addition, the proposed prototype supplies an easy-to-use interface to visualize MSG hierarchically. The user study shows that our prototype can effectively and efficiently help users to browse, search, and scan music collections.
 ESC: Dataset for Environmental Sound Classification | BIBA | Full-Text 1015-1018 Karol J. Piczak One of the obstacles in research activities concentrating on environmental sound classification is the scarcity of suitable and publicly available datasets. This paper tries to address that issue by presenting a new annotated collection of 2000 short clips comprising 50 classes of various common sound events, and an abundant unified compilation of 250000 unlabeled auditory excerpts extracted from recordings available through the Freesound project. The paper also provides an evaluation of human accuracy in classifying environmental sounds and compares it to the performance of selected baseline classifiers using features derived from mel-frequency cepstral coefficients and zero-crossing rate.
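Of the two baseline features named in the entry above, the zero-crossing rate is simple enough to sketch directly. The following is a generic definition over a list of samples, not the paper's exact feature pipeline; treating a sample of exactly zero as non-negative is an assumption made here.

```python
def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ.

    Assumption: a sample equal to 0 counts as non-negative.
    """
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1
        for prev, cur in zip(samples, samples[1:])
        if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(samples) - 1)

# A signal that flips sign at every step versus one that never does.
noisy = zero_crossing_rate([1.0, -1.0, 1.0, -1.0])
smooth = zero_crossing_rate([1.0, 2.0, 3.0])
```

Noise-like sounds tend to produce a high rate and tonal, low-pitched sounds a low one, which is why the feature is often paired with spectral descriptors such as MFCCs.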
 Improving Feature Aggregation for Semantic Music Retrieval | BIBA | Full-Text 1019-1022 Zhouyu Fu Feature aggregation is an important step in semantic music retrieval that accumulates features obtained from local frames to produce a global song-level representation. A good aggregation scheme should capture both feature correlations and temporal information, while existing schemes only focus on one of the two respects and lack in the other. In this paper, we present a new feature aggregation scheme to model the dependencies in both feature and temporal domains. This is achieved by augmenting local feature vectors with second-order monomials that capture the correlations between different variables and performing temporal integration over the augmented features. To cope with increased feature dimensions, we further employ an embedded technique for feature selection by training an l2,1 regularized linear classifier model for all label classes. The use of l2,1 regularization produces a group sparse solution for classifier weight vectors, thus automatically eliminating irrelevant feature variables with vanishing weights. Our preliminary results demonstrate the effectiveness of the proposed feature aggregation scheme over existing aggregation schemes for large-scale music retrieval and annotation.
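The augmentation step described in the entry above can be illustrated with a small sketch. It appends all second-order monomials to each frame's feature vector and then averages over frames. Using a plain mean as the temporal integration is an assumption made here for brevity; the abstract does not specify the integration used, and the function names are illustrative.

```python
from itertools import combinations_with_replacement

def augment(frame):
    """Append all second-order monomials x_i * x_j (i <= j) to a frame."""
    pairs = combinations_with_replacement(range(len(frame)), 2)
    return list(frame) + [frame[i] * frame[j] for i, j in pairs]

def aggregate(frames):
    """Average augmented frame vectors into one song-level vector.

    Assumption: mean pooling stands in for temporal integration.
    """
    augmented = [augment(f) for f in frames]
    n = len(augmented)
    return [sum(column) / n for column in zip(*augmented)]

# Two hypothetical frames with two features each.
frames = [[1.0, 2.0], [3.0, 4.0]]
song_vector = aggregate(frames)  # [2.0, 3.0, 5.0, 7.0, 10.0]
```

A frame with d features grows to d + d(d + 1)/2 entries after augmentation, which is the dimensionality increase that motivates the feature selection step in the abstract.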
 Accelerating Large-scale Image Retrieval on Heterogeneous Architectures with Spark | BIBA | Full-Text 1023-1026 Hanli Wang; Bo Xiao; Lei Wang; Jun Wu Apache Spark is a general-purpose cluster computing system for big data processing and has drawn much attention recently from several fields, such as pattern recognition, machine learning and so on. Unlike MapReduce, Spark is especially suitable for iterative and interactive computations. With the computing power of Spark, a utility library, referred to as IRlib, is proposed in this work to accelerate large-scale image retrieval applications by jointly harnessing the power of GPU. Similar to the built-in machine learning library of Spark, namely MLlib, IRlib fits into the Spark APIs and benefits from the powerful functionalities of Spark. The main contributions of IRlib are twofold. First, IRlib provides a uniform set of APIs for the programming of image retrieval applications. Second, the computational performance of Spark equipped with multiple GPUs is dramatically boosted by developing high performance modules for common image retrieval related algorithms. Comparative experiments concerning large-scale image retrieval are carried out to demonstrate the significant performance improvement achieved by IRlib as compared with single CPU thread implementation as well as Spark without GPUs employed.
 Implementation of Face Recognition for Screen Unlocking on Mobile Device | BIBA | Full-Text 1027-1030 Chung-Hua Chu; Shih-Ming Peng Face recognition is a computer technique to capture the features of the human face for user authentication. With advanced mobile technology, mobile devices take the place of computers to become the major means of human-computer interaction. Related research has become more and more popular. This study discusses the identification of users on mobile devices. Currently, many users still use passwords for user authentication. However, such traditional password identification may not be secure since the passwords are easily intercepted. To solve this problem, this thesis adopts facial recognition for screen unlocking on mobile devices.
 Web-based Interactive Free-Viewpoint Streaming: A framework for high quality interactive free viewpoint navigation | BIBA | Full-Text 1031-1034 Matthias Ueberheide; Felix Klose; Tilak Varisetty; Markus Fidler; Marcus Magnor Recent advances in free-viewpoint rendering techniques as well as the continued improvements of the internet network infrastructure open the door for challenging new applications. In this paper, we present a framework for interactive free-viewpoint streaming with open standards and software. Network bandwidth, encoding strategy as well as codec support for open source browsers are key constraints to be considered for our interactive streaming applications. Our framework is capable of real-time server-side rendering and interactively streaming the output by means of open source streaming. To enable viewer interaction with the free-viewpoint video rendering back-end in a standard browser, user events are captured with Javascript and transmitted using WebSockets. The rendered video is streamed to the browser using the FFmpeg free software project. This paper discusses the applicability of open source streaming and presents timing measurements for video-frame transmission over network.
 Distributed Bandwidth-efficient Packet Scheduling for Live Streaming with Network Coding | BIBA | Full-Text 1035-1038 Shenglan Huang; Ebroul Izquierdo; Pengwei Hao This paper proposed a distributed packet scheduling algorithm for live peer-to-peer streaming system, where network coding is extended to improve the efficiency in bandwidth utilization. A problem of superfluous packet transmission due to the lack of synchronization among peers is identified. This problem often leads to bandwidth inefficiencies. We solve the problem of finding a suitable asynchronous packet scheduling policy by posing and solving a bandwidth allocation problem. This proposed scheduling policy can reduce the superfluous packet transmission, thereby achieving a more efficient bandwidth usage and an improved quality of service in live streaming applications. Experimental results confirm that the proposed scheme demonstrates significantly better video quality, delivery ratio under different network size and different loss rate compared with other push-based schemes.
 Vision-Inertial Hybrid Tracking for Robust and Efficient Augmented Reality on Smartphones | BIBA | Full-Text 1039-1042 Xin Yang; Xun Si; Tangli Xue; Liheng Zhang; Kwang-Ting (Tim) Cheng This paper aims at robust and efficient pose tracking for augmented reality on modern smartphones. Existing methods, relying on either vision analysis or motion sensing, are either too computationally expensive to achieve real-time performance on a smartphone, or too noisy to achieve sufficient robustness. This paper presents a hybrid tracking system which can achieve real-time performance with high robustness. Our system utilizes an efficient featureless method based on pixel-based registration to track the object pose on every frame. The featureless tracking result is revised from time to time by a feature-based method to reduce tracking errors. Both featureless and feature-based tracking results are sensitive to large motion blurs. To improve the robustness, an adaptive Kalman filter is proposed to fuse the visual tracking results with the inertial tracking results computed from the phone's built-in sensors. Our hybrid method is evaluated on a dataset consisting of 16 video clips with synchronized inertial sensing data. Experimental results demonstrated the superior performance of our method to state-of-the-art visual tracking methods [5, 12] on smartphones. The dataset will be made publicly available with the publication of this paper.
 An SDN Controller for Delay and Jitter Reduction in Cloud Gaming | BIBA | Full-Text 1043-1046 Maryam Amiri; Hussein Al Osman; Shervin Shirmohammadi; Maha Abdallah Cloud gaming is an emerging service that has recently started to garner prominence in the gaming industry. Since the significant part of computational processing, including game rendering and video compression, is performed in data centers, controlling the transfer of information within the cloud has an important impact on the quality of cloud gaming services. In this paper, we make two contributions: we propose a design to apply the recent paradigm of Software Defined Networks (SDNs) to Cloud Gaming, and we propose an SDN controller that reduces end-to-end delay and delay variations experienced by players. Our SDN controller adaptively disperses the game traffic load among different network paths according to their corresponding end-to-end delays. Experimental results show that our proposed controller reduces end-to-end delay and delay variation by almost 9% and 50% respectively without engendering additional packet loss, compared to a representative conventional method: Open Shortest Path First (OSPF). These reductions lead to improvements in players' gaming experience.
 The Invisible QR Code | BIBA | Full-Text 1047-1050 Zhongpai Gao; Guangtao Zhai; Chunjia Hu QR (Quick Response) Codes are widely used as a convenient unidirectional communication channel to convey information, such as emails, hyperlinks, or phone numbers, from publicity materials to mobile devices. But the QR Code is not visually appealing and takes up valuable space on publicity materials. In this paper, we propose a new method to embed a QR Code on a digital screen via temporal psychovisual modulation (TPVM). By exploiting the difference between human eyes and semiconductor imaging sensors in temporal convolution of optical signals, we make the QR Code perceptually transparent to humans but detectable by mobile devices. Based on the idea of the invisible QR Code, many applications can be implemented, e.g., "physical hyperlink" for something interesting on TV or digital signage, "invisible watermark" for anti-piracy in theaters. A prototype system introduced in this paper serves as a proof-of-concept of the invisible QR Code and can be improved in future works.
 3D Background Modeling in Multi-view RGB-D Video | BIBA | Full-Text 1051-1054 Yung-Lin Huang; Ku-Chu Wei; Shao-Yi Chien In this paper, we proposed a 3D background modeling system for multi-view 3D video. We first reconstructed a 3D model, and we updated the subsequent frames into it using our proposed updating strategy. The results show that dynamic objects in the model can be excluded, leaving behind a compact 3D background model.
 Audio Routing for Scalable Conferencing using AAC-ELD and Bit Stream Domain Energy Estimation | BIBA | Full-Text 1055-1058 Iaroslav Kryvyi; Nikolaus Färber; Conrad Benndorf; Manfred Lutzky There is an increasing interest in multipoint conferencing but service providers face the challenge of complexity when scaling to thousands of users. This problem can be resolved by a scalable architecture based on central media routers, which allows for low complexity server components and therefore low operational cost. In this paper we describe an audio routing approach using advantages offered by the AAC-ELD bit stream structure. By estimating the signal energy in the bit stream domain, we can detect active speakers at low complexity, which results in substantial bitrate and complexity reduction. Subjective tests with mediated conversations among four participants are conducted to compare the perceived audio quality of the audio router to conventional mixing in a conference bridge. The results show a reduction of complexity by one order of magnitude while maintaining the same subjective audio quality when forwarding the two most active speakers on a 10 ms framing.
 Ciphertext-Only Attack on an Image Homomorphic Encryption Scheme with Small Ciphertext Expansion | BIBA | Full-Text 1063-1066 Yunyu Li; Jiantao Zhou; Yuanman Li The paper "An Efficient Image Homomorphic Encryption Scheme with Small Ciphertext Expansion" (in Proc. ACM MM'13, pp. 803-812) presented a novel image homomorphic encryption approach achieving significant reduction of the ciphertext expansion. In the current work, we study the security of this cryptosystem under a ciphertext-only attack (COA). We show that our proposed COA is effective in generating a sketch of great fidelity of the original image. Experimental results are provided to verify the validity of the proposed attack strategy.

### Poster Session 2

 Sense Beyond Expressions: Cuteness | BIBA | Full-Text 1067-1070 Kang Wang; Tam V. Nguyen; Jiashi Feng; Jose Sepulveda With the development of Internet culture, cute has become a popular concept. Many people are curious about what factors make a person look cute. However, little research has addressed this interesting question. In this work, we construct a dataset of personal images with comprehensively annotated cuteness scores and facial attributes to investigate this high-level concept in depth. Based on this dataset, through an automatic attribute mining process, we find several critical attributes determining the cuteness of a person. We also develop a novel Continuous Latent Support Vector Machine (C-LSVM) method to predict the cuteness score of a person given only an image. Extensive evaluations validate the effectiveness of the proposed method for cuteness prediction.
 Joint Visual-Textual Sentiment Analysis with Deep Neural Networks | BIBA | Full-Text 1071-1074 Quanzeng You; Jiebo Luo; Hailin Jin; Jianchao Yang Sentiment analysis of online user generated content is important for many social media analytics tasks. Researchers have largely relied on textual sentiment analysis to develop systems to predict political elections, measure economic indicators, and so on. Recently, social media users are increasingly using additional images and videos to express their opinions and share their experiences. Sentiment analysis of such large-scale textual and visual content can help better extract user sentiments toward events or topics. Motivated by the need to leverage large-scale social multimedia content for sentiment analysis, we utilize both state-of-the-art visual and textual sentiment analysis techniques for joint visual-textual sentiment analysis. We first fine-tune a convolutional neural network (CNN) for image sentiment analysis and train a paragraph vector model for textual sentiment analysis. We have conducted extensive experiments on both machine weakly labeled and manually labeled image tweets. The results show that joint visual-textual features can achieve state-of-the-art performance, outperforming textual and visual sentiment analysis algorithms alone.
 Attribute Mining for Scalable 3D Human Action Recognition | BIBA | Full-Text 1075-1078 Xingyang Cai; Wengang Zhou; Houqiang Li With the development of depth sensors, skeletal human action recognition from 3D video is paving the way for many practical applications. For most applications, scalable action recognition is desired to identify novel actions without rebuilding the system. To address this problem, a potential solution is to identify those intrinsic attributes which are semantic-aware and shared among known and novel actions. With such motivation, in this paper, we propose an attribute-based skeletal action recognition method and explore scalable action recognition. We first present a new skeletal feature with the representations of static pose and motion of the human skeleton to support a comprehensive action attribute space. Then, a novel action attribute mining method is proposed to discover action attributes for each bone pair across action classes. Finally, we accomplish action recognition based on those mined attributes. Extensive experiments on MSRAction3D and UTKinect-Action demonstrate the effectiveness and superiority of our attribute-based action recognition approach over the existing methods.
 Learning Features from Large-Scale, Noisy and Social Image-Tag Collection | BIBA | Full-Text 1079-1082 Hanwang Zhang; Xindi Shang; Huanbo Luan; Yang Yang; Tat-Seng Chua Feature representation for multimedia content is the key to the progress of many fundamental multimedia tasks. Although recent advances in deep feature learning offer a promising route towards these tasks, they are limited in application to domains where high-quality and large-scale training data are hard to obtain. In this paper, we propose a novel deep feature learning paradigm based on large, noisy and social image-tag collections, which can be acquired from the inexhaustible social multimedia content on the Web. Instead of learning features from high-quality image-label supervision, we propose to learn from the image-word semantic relations by seeking a unified image-word embedding space, where the pairwise feature similarities preserve the semantic relations in the original image-word pairs. We offer an easy-to-use implementation for the proposed paradigm, which is fast and compatible with any state-of-the-art deep architecture. Experiments on the NUSWIDE benchmark demonstrate that the features learned by our method significantly outperform other state-of-the-art ones.
 Saliency Detection Based on Graph-Structural Agglomerative Clustering | BIBA | Full-Text 1083-1086 Youbao Tang; Xiangqian Wu; Wei Bu This paper proposes a novel saliency detection method based on graph-structural agglomerative clustering (GSAC). In this method, a number of intermediate images with consecutive numbers of regions are first created by applying GSAC to the input image with the maximum incremental path integral criterion. Then an initial salient map is computed based on the boundary connectivity of the regions in the intermediate images, with enforcing the early formed objects in the clustering process. Finally, the initial salient map is refined to get the final salient map by using the reconstruction errors of sparse coding and the object-bias prior. The experimental results demonstrate that the proposed method greatly outperforms the state-of-the-art approaches on two standard benchmark datasets.
 Detecting Salient Objects via Spatial and Appearance Compactness Hypotheses | BIBA | Full-Text 1087-1090 Ping Hu; Weiqiang Wang; Ke Lu Object-level saliency detection has been attracting a lot of attention, due to its potential enhancement in many high-level vision tasks. Many previous methods are based on the contrast hypothesis which regards the regions with high contrast in a certain context as salient. Although the contrast hypothesis is valid in many cases, it cannot handle some difficult cases. To make up for the weakness of the contrast hypothesis, we propose a novel compactness hypothesis which assumes salient regions are more compact than background spatially and in appearance. Based on compactness hypotheses, we implement an effective object-level saliency detection method, which is demonstrated to be effective even in difficult cases. In addition, we present an adaptive multiple saliency maps fusion framework which can automatically select saliency maps of high quality according to three quality assessment rules. We evaluate the proposed method on four benchmark datasets and achieve performance comparable to the state-of-the-art methods.
 A Probabilistic Approach for Image Retrieval Using Descriptive Textual Queries | BIBA | Full-Text 1091-1094 Yashaswi Verma; C. V. Jawahar We address the problem of image retrieval using textual queries. In particular, we focus on descriptive queries that can be either in the form of simple captions (e.g., "a brown cat sleeping on a sofa"), or even long descriptions with multiple sentences. We present a probabilistic approach that seamlessly integrates visual and textual information for the task. It relies on linguistically and syntactically motivated mid-level textual patterns (or phrases) that are automatically extracted from available descriptions. At the time of retrieval, the given query is decomposed into such phrases, and images are ranked based on their joint relevance with these phrases. Experiments on two popular datasets (UIUC Pascal Sentence and IAPR-TC12 benchmark) demonstrate that our approach effectively retrieves semantically meaningful images, and outperforms baseline methods.
 Multi-cue Augmented Face Clustering | BIBA | Full-Text 1095-1098 Chengju Zhou; Changqing Zhang; Huazhu Fu; Rui Wang; Xiaochun Cao Face clustering is an important but challenging task since facial images always have huge variation due to changes in facial expressions, head poses, partial occlusions, etc. Moreover, face clustering is actually an unsupervised problem, which makes it more difficult to reach an accurate result. Fortunately, there are some cues that can be used to improve clustering performance. In this paper, two types of cues are employed. The first one is pairwise constraints: must-link and cannot-link constraints, which can be extracted from the temporal and spatial knowledge of data. The other is that each face is associated with a series of attributes (e.g., gender) which can contribute to discrimination among faces. To take advantage of the above cues, we propose a new algorithm, Multi-cue Augmented Face Clustering (McAFC), which effectively incorporates the cues via a graph-guided sparse subspace clustering technique. Specifically, facial images from the same individual are encouraged to be connected while faces from different persons are discouraged from being connected. Experiments on three face datasets from real-world videos show the improvements of our algorithm over the state-of-the-art methods.
 Robust Deep Auto-encoder for Occluded Face Recognition | BIBA | Full-Text 1099-1102 Lele Cheng; Jinjun Wang; Yihong Gong; Qiqi Hou Occlusions by sunglasses, scarves, hats, beards, shadows, etc. can significantly reduce the performance of face recognition systems. Although there exists a rich literature focusing on face recognition with illumination, pose and facial expression variations, there is very limited work reported for occlusion-robust face recognition. In this paper, we present a method to restore occluded facial regions using a deep learning technique to improve face recognition performance. Inspired by SSDA for facial occlusion removal with known occlusion type and explicit occlusion location detection from a preprocessing step, this paper further introduces Double Channel SSDA (DC-SSDA) which requires no prior knowledge of the types and the locations of occlusions. Experimental results based on the CMU-PIE face database show that the proposed method is robust to a variety of occlusion types and locations, and the restored faces could yield significant recognition performance improvements over occluded ones.
 Dissecting Urban Noises from Heterogeneous Geo-Social Media and Sensor Data | BIBA | Full-Text 1103-1106 Hsun-Ping Hsieh; Rui Yan; Cheng-Te Li Geo-social media services, such as Foursquare and Flickr, provide rich data that sense various urban activities of human beings from geographical, mobility, visual, and social aspects. While noise pollution in modern cities is getting worse and sound sensors are sparse and costly, there is high demand to infer and analyze the noise at any region in urban areas. In this paper, we aim to leverage heterogeneous geo-social sensor data on Foursquare, Flickr, and Gowalla, to dissect urban noises for every region in a city. Using NYC 311 noise complaint records as the approximation of urban noises generated by regions, we propose a novel unsupervised framework that integrates the extracted geographical, mobility, visual, and social features to infer the noise composition for regions and time intervals of interest in NYC. Experimental results show that our system can achieve promising results with substantially less training data, compared to state-of-the-art methods.
 Deep Multimodal Speaker Naming | BIBA | Full-Text 1107-1110 Yongtao Hu; Jimmy SJ. Ren; Jingwen Dai; Chang Yuan; Li Xu; Wenping Wang Automatic speaker naming is the problem of localizing as well as identifying each speaking character in a TV/movie/live show video. This problem is challenging mainly because of its multimodal nature: the face cue alone is insufficient to achieve good performance. Previous multimodal approaches to this problem usually process the data of different modalities individually and merge them using handcrafted heuristics. Such approaches work well for simple scenes, but fail to achieve high performance for speakers with large appearance variations. In this paper, we propose a novel convolutional neural network (CNN) based learning framework to automatically learn the fusion function of both face and audio cues. We show that without using face tracking, facial landmark localization or subtitle/transcript, our system with robust multimodal feature extraction is able to achieve state-of-the-art speaker naming performance evaluated on two diverse TV series. The dataset and implementation of our algorithm are publicly available online.
 Cross-Modal Image-Tag Relevance Learning for Social Images | BIBA | Full-Text 1111-1114 Yong Cheng; Zhengxiang Cai; Rui Feng; Cheng Jin; Yuejie Zhang; Tao Zhang A new algorithm is developed in this paper to support more effective cross-modal image-tag relevance learning for large-scale social images, which integrates the multimodal feature representation, multimodal relevance measurement, and cross-modal relevance fusion. The main contribution of our work is that we provide a more reasonable base to learn cross-modal relevance among social images, which can be acquired from integrating multimodal image and tag relevance with multiple features in different modalities. Very positive results were obtained in our experiments using a large quantity of public social image data.
 Local Depth Patterns for Tracking in Depth Videos | BIBA | Full-Text 1115-1118 Sari Awwad; Fairouz Hussein; Massimo Piccardi Conventional video tracking operates over RGB or grey-level data which contain significant clues for the identification of the targets. While this is often desirable in a video surveillance context, use of video tracking in privacy-sensitive environments such as hospitals and care facilities is often perceived as intrusive. Therefore, in this work we present a tracker that provides effective target tracking based solely on depth data. The proposed tracker is an extension of the popular Struck algorithm which leverages a structural SVM framework for tracking. The main contributions of this work are novel depth features based on local depth patterns and a heuristic for effectively handling occlusions. Experimental results over the challenging Princeton Tracking Benchmark (PTB) dataset report a remarkable accuracy compared to the original Struck tracker and other state-of-the-art trackers using depth and RGB data.
 ConvNets-Based Action Recognition from Depth Maps through Virtual Cameras and Pseudocoloring | BIBA | Full-Text 1119-1122 Pichao Wang; Wanqing Li; Zhimin Gao; Chang Tang; Jing Zhang; Philip Ogunbona In this paper, we propose to adopt ConvNets to recognize human actions from depth maps on relatively small datasets based on Depth Motion Maps (DMMs). In particular, three strategies are developed to effectively leverage the capability of ConvNets in mining discriminative features for recognition. Firstly, different viewpoints are mimicked by rotating virtual cameras around the subject represented by the 3D points of the captured depth maps. This not only synthesizes more data from the captured ones, but also makes the trained ConvNets view-tolerant. Secondly, DMMs are constructed and further enhanced for recognition by encoding them into Pseudo-RGB images, turning the spatial-temporal motion patterns into textures and edges. Lastly, through transfer learning from models originally trained over ImageNet for image classification, the three ConvNets are trained independently on the color-coded DMMs constructed in three orthogonal planes. The proposed algorithm was extensively evaluated on MSRAction3D, MSRAction3DExt and UTKinect-Action datasets and achieved the state-of-the-art results on these datasets.
 Spatio-Temporal Learning of Basketball Offensive Strategies | BIBA | Full-Text 1123-1126 Ching-Hang Chen; Tyng-Luh Liu; Yu-Shuen Wang; Hung-Kuo Chu; Nick C. Tang; Hong-Yuan Mark Liao Video-based group behavior analysis is drawing attention for its rich applications in sports, military, surveillance and biological observations. The recent advances in tracking techniques, based on either computer vision methodology or hardware sensors, further provide the opportunity of better solving this challenging task. Focusing specifically on the analysis of basketball offensive strategies, we introduce a systematic approach to establishing unsupervised modeling of group behaviors. Given that a possible group behavior (offensive strategy) could be of different duration and represented by dynamic player trajectories, the crux of our method is to automatically divide training data into meaningful clusters and learn their respective spatio-temporal models, which are established upon Gaussian mixture regression to account for intra-class spatio-temporal variations. The resulting strategy representation turns out to be flexible and can be used not only to establish the discriminant functions but also to improve learning the models. We demonstrate the usefulness of our approach by exploring its effectiveness in analyzing a set of given basketball video clips.
 Weak Labeled Multi-Label Active Learning for Image Classification | BIBA | Full-Text 1127-1130 Shiquan Zhao; Jian Wu; Victor S. Sheng; Chen Ye; Pengpeng Zhao; Zhiming Cui Active learning is suitable for achieving better classification performance with fewer labeled images. Several active learning methods have been proposed for multi-label image classification, but all of them assume that all training images have complete labels. However, it is very difficult to get complete labels for each example, especially when the number of labels in a multi-label domain is huge. Usually, only partial labels are available. This is one kind of "weak label" problem. This paper proposes an ingenious solution to this "weak label" problem in multi-label active learning for image classification (called WLMAL). It explores label correlation in the weak label problem with the help of input features, and then utilizes label correlation to evaluate the informativeness of each example-label pair in a multi-label dataset for active sampling. Our experimental results on three real-world datasets show that our proposed approach WLMAL consistently and significantly outperforms existing approaches.
 Probabilistic Semi-Canonical Correlation Analysis | BIBA | Full-Text 1131-1134 Chie Kamada; Asako Kanezaki; Tatsuya Harada Canonical Correlation Analysis (CCA) requires paired multimodal data to ascertain the relation between two variables. However, it is generally difficult to collect a sufficient amount of paired data of two variables as training samples. This fact leads individual samples of unpaired variables to be additional resources for learning CCA, which are not only able to increase the number of training samples; they are also effective to remove the learning bias caused by the variables' missing patterns. As described in this paper, we propose a novel model of probabilistic CCA by considering the mechanism of data missing. Our method enables widespread applications such as semi-supervised learning via partially labeled training samples and analysis of sensory data which are lacking under certain circumstances. We demonstrate the superior performance of parameter estimation as well as an application of image annotation, compared with existing methods.
 Recognizing Human Activity in Still Images by Integrating Group-Based Contextual Cues | BIBA | Full-Text 1135-1138 Zheng Zhou; Kan Li; Xiangjian He Images with wider angles usually capture more persons in wider scenes, and recognizing individuals' activities in these images based on existing contextual cues usually meets difficulties. We instead construct a novel group-based cue to utilize the context carried by suitable surrounding persons. We propose a global-local cue integration model (GLCIM) to find a suitable group of local cues extracted from individuals and form a corresponding global cue. A fusion restricted Boltzmann machine, a focal subspace measurement and a cue integration algorithm based on entropy are proposed to enable the GLCIM to integrate most of the relevant local cues and the fewest of the irrelevant ones into the group. Our experiments demonstrate in detail how integrating group-based cues improves the activity recognition accuracies and show that all of the key parts of GLCIM contribute positively to the accuracy gains.
 3D Person Tracking In World Coordinates and Attribute Estimation with PDR | BIBA | Full-Text 1139-1142 Yuki Nagai; Daisuke Kamisaka; Naoya Makibuchi; Jianfeng Xu; Shigeyuki Sakazawa In this paper, we propose an online 3D person tracking method and an attribute estimation method with pedestrian dead reckoning (PDR). For person tracking, we employ a structured prediction approach, which extends the Struck algorithm. The mainstream of visual object tracking, including Struck, utilizes only 2D information in image coordinates, which makes it difficult to track an object correctly because of changes in the scale and angle of the target. In contrast, our classifier adaptively learns structural relationships in world coordinates and in image coordinates using Structured SVM. Furthermore, we combine visual tracking results and sensor trajectories based on PDR. Our method estimates a person's attribute: whether the person is an insider, such as a sales staff member, or an outsider, such as a customer. According to experimental results, the proposed method outperforms the existing methods regarding the quality of localization. In addition, experimental results show that our method can estimate the attribute at a ratio of 0.84.
 Image Tagging via Cross-Modal Semantic Mapping | BIBA | Full-Text 1143-1146 Zhi-Hong Deng; Hongliang Yu; Yunlun Yang Images without annotations are ubiquitous on the Internet, and recommending tags for them has become a challenging open task in image understanding. A common bottleneck of related work is the semantic gap between the image and text representations. In this paper, we bridge the gap by introducing a semantic layer, the space of word embeddings that represents the image tags as the word vectors. Our model first learns the optimal mapping from the visual space to the semantic space using training sources. Then we annotate test images by decoding the semantic representations of the visual features. Extensive experiments demonstrate that our model outperforms the state-of-the-art approaches in predicting the image tags.
 Predicting Image Memorability by Multi-view Adaptive Regression | BIBA | Full-Text 1147-1150 Houwen Peng; Kai Li; Bing Li; Haibin Ling; Weihua Xiong; Weiming Hu The images we encounter throughout our lives make different impressions on us: Some are remembered at first glance, while others are forgotten. This phenomenon is caused by the intrinsic memorability of images revealed by recent studies [5,6]. In this paper, we address the issue of automatically estimating the memorability of images by proposing a novel multi-view adaptive regression (MAR) model. The MAR model provides an effective mapping of visual features to memorability scores by taking advantage of robust feature selection and multiple feature integration. It consists of three major components: an adaptive loss function, an adaptive regularization and a multi-view modeling strategy. Moreover, we design an alternating direction method (ADM) optimization algorithm to solve the proposed objective function. Experimental results on the MIT benchmark dataset show the superiority of the proposed model compared with existing image memorability prediction methods.
 Spatio-Temporal Triangular-Chain CRF for Activity Recognition | BIBA | Full-Text 1151-1154 Congqi Cao; Yifan Zhang; Hanqing Lu Understanding human activities in video is a fundamental problem in computer vision. In real life, human activities are composed of temporal and spatial arrangement of actions. Understanding such complex activities requires recognizing not only each individual action, but more importantly, capturing their spatio-temporal relationships. This paper addresses the problem of complex activity recognition with a unified hierarchical model. We expand triangular-chain CRFs (TriCRFs) to the spatial dimension. The proposed architecture can be perceived as a spatio-temporal version of the TriCRFs, in which the labels of actions and activity are modeled jointly and their complex dependencies are exploited. Experiments show that our model generates promising results, outperforming competing methods significantly. The framework also can be applied to model other structured sequential data.
 Query-Adaptive Logo Search using Shape-Aware Descriptors | BIBA | Full-Text 1155-1158 Sreyasee Das Bhattacharjee; Junsong Yuan; Yap-Peng Tan; Lingyu Duan We propose a graph-based optimization framework to leverage category independent object proposals (candidate object regions) for logo search in a large scale image database. The proposed contour-based feature descriptor EdgeBoW is robust to view-angle changes and varying illumination conditions and can implicitly capture the significant object shape information. Having been equipped with a local descriptor, it can handle a fair amount of occlusion and deformation frequently present in a real-life scenario. Given a small set of initially retrieved candidate object proposals, a fast graph-based short-listing scheme is designed to exploit the mutual similarities among these proposals for eliminating outliers. In contrast to a coarse image-level pairwise similarity measure, this search focussed on a few specific image regions provides a more accurate method for matching. The proposed query expansion strategy aims to assess each of the remaining better matched proposals against all its neighbors within the same image for a precise localization. Combined with an efficient feature descriptor EdgeBoW, a set of more insightful edge-weights and node-utility measures can yield promising results, especially for object categories primarily defined by their shape. An extensive set of experiments performed on a number of benchmark datasets demonstrates its effectiveness and superior generalization ability in both clutter intensive real-life images and poor quality binary document images.
 Hyperspectral Image Classification with Convolutional Neural Networks | BIBA | Full-Text 1159-1162 Viktor Slavkovikj; Steven Verstockt; Wesley De Neve; Sofie Van Hoecke; Rik Van de Walle Hyperspectral image (HSI) classification is one of the most widely used methods for scene analysis from hyperspectral imagery. In the past, many different engineered features have been proposed for the HSI classification problem. In this paper, however, we propose a feature learning approach for hyperspectral image classification based on convolutional neural networks (CNNs). The proposed CNN model is able to learn structured features, roughly resembling different spectral band-pass filters, directly from the hyperspectral input data. Our experimental results, conducted on a commonly-used remote sensing hyperspectral dataset, show that the proposed method provides classification results that are among the state-of-the-art, without using any prior knowledge or engineered features.
 Online Object Tracking Based on CNN with Metropolis-Hasting Re-Sampling | BIBA | Full-Text 1163-1166 Xiangzeng Zhou; Lei Xie; Peng Zhang; Yanning Zhang Tracking-by-learning strategies have been effective in solving many challenging problems in visual tracking, in which the learning sample generation and labeling play important roles for final performance. Although deep learning based approaches have shown impressive performance in different vision tasks, how to properly apply a learning model, such as a CNN, to an online tracking framework is still challenging. In this paper, to overcome the overfitting problem caused by straightforward incorporation, we propose an online tracking framework by constructing a CNN based adaptive appearance model to generate more reliable training data over time. With a reformative Metropolis-Hastings re-sampling scheme to reshape particles for a better state posterior representation during online learning, the proposed tracker outperforms most state-of-the-art trackers on challenging benchmark video sequences.
 Progressive Shape-Distribution-Encoder for 3D Shape Retrieval | BIBA | Full-Text 1167-1170 Jin Xie; Fan Zhu; Guoxian Dai; Yi Fang In this paper, we propose a deep shape descriptor by learning the shape distributions at different diffusion times via a progressive deep shape-distribution-encoder. First, we develop a shape distribution representation with the kernel density estimator to characterize the intrinsic geometrical structure of the shape. Then, we propose to learn discriminative shape features through a progressive shape-distribution-encoder. Specifically, the progressive shape-distribution-encoder aims at modeling the complex non-linear transform of the estimated shape distributions between consecutive diffusion times. Furthermore, in order to characterize the intrinsic structure of the shape more efficiently, we stack multiple proposed progressive shape-distribution-encoders to form a neural network structure. Finally, we concatenate all neurons in the hidden layers of the progressive shape-distribution-encoder network to form a discriminative shape descriptor for retrieval. The proposed method is evaluated on three benchmark 3D shape datasets and the experimental results demonstrate the superiority of our method to the existing approaches.
 Cross-media Topic Detection with Refined CNN based Image-Dominant Topic Model | BIBA | Full-Text 1171-1174 Zhiyi Wang; Liang Li; Qingming Huang Online heterogeneous data is proliferating, with rich auxiliary information (e.g. pictures and videos) surrounding the text. However, traditional topic models are limited in discovering topics effectively from cross-media data. Incorporating convolutional neural network (CNN) features, we propose a novel image dominant topic model, which projects both the text modality and the visual modality into a semantic simplex. Further, an improved CNN feature is introduced to capture more visual details by fusing the convolutional layer and fully-connected layer. Experimental comparisons with state-of-the-art methods in the cross-media topic detection task show the effectiveness of our model.
 Human Action Recognition With Trajectory Based Covariance Descriptor In Unconstrained Videos | BIBA | Full-Text 1175-1178 Hanli Wang; Yun Yi; Jun Wu Human action recognition from realistic videos plays a key role in multimedia event detection and understanding. In this paper, a novel Trajectory Based Covariance (TBC) descriptor is proposed, which is formulated along the dense trajectories. To map the descriptor matrix to vector space and trim out the redundancy of data, the TBC descriptor matrix is projected to Euclidean space by the Logarithm Principal Components Analysis (LogPCA). Our method is tested on the challenging Hollywood2 and TV Human Interaction datasets. Experimental results show that the proposed TBC descriptor outperforms three baseline descriptors (i.e., histogram of oriented gradient, histogram of optical flow and motion boundary histogram), and our method achieves better recognition performances than a number of state-of-the-art approaches.
 RECfusion: Automatic Video Curation Driven by Visual Content Popularity | BIBA | Full-Text 1179-1182 Alessandro Ortis; Giovanni Maria Farinella; Valeria D'amico; Luca Addesso; Giovanni Torrisi; Sebastiano Battiato The proliferation of mobile devices and the diffusion of social media have changed the communication paradigm of people that share multimedia data by allowing new interaction models (e.g., social networks). In social events (e.g., concerts), the automatic video understanding goal includes the interpretation of which visual contents are the most popular. The popularity of a visual content depends on how many people are looking at that scene, and therefore it could be obtained through the "visual consensus" among multiple video streams acquired by the different users devices. In this work we present RECfusion, a system able to automatically create a single video from multiple video sources by taking into account the popularity of the acquired scenes. The frames composing the final popular video are selected from the different video streams by considering those visual scenes which are pointed and recorded by the highest number of users' devices. Results on two benchmark datasets confirm the effectiveness of the proposed system.
 Gender Classification Using Pyramid Segmentation for Unconstrained Back-facing Video Sequences | BIBA | Full-Text 1183-1186 Hao Tang; Hong Liu; Wei Xiao This paper presents a pioneering study on gender classification from unconstrained back-facing video sequences in natural scenes. In many cases, classifying gender simply via faces or other biometric cues may fail when the video only contains back-facing people. To address this problem, we propose a novel approach to classify the gender according to back-facing video sequences. For this task, a novel Pyramid Segmentation approach is proposed to divide video sequence into a suite of equal time-length sleeves with different scales. Moreover, a heuristic approach is used to compute weights for different features from each sleeve. Finally, a framework of gender classification based on video sequences is presented. To validate our approach, we introduce a new dataset, called BackFacing dataset, featured by 720 annotated back-facing human video sequences. To our knowledge, this is the first dataset only containing back-facing video shots. Experiments demonstrate that the proposed approach achieves competitive results on VidTIMIT, Cohn-Kanade, CASIA Gait and BackFacing datasets.
 Object Segmentation from Long Video Sequences | BIBA | Full-Text 1187-1190 Bing Luo; Hongliang Li; Tiecheng Song; Chao Huang Most existing video segmentation methods focus on extracting the primary objects in test video sequences. They assume that only one object appears throughout the whole video sequence, which is impractical in many applications. In this paper, we focus on object segmentation from long video sequences which consist of many different scenes, shot cuts and various motion patterns, etc. In order to solve this problem, we propose a framework to segment the objects in relevant video shots, while discarding the irrelevant video shots. A graph is constructed to model the video object detection and the final segmentation is obtained by getting the superpixels in the detection boxes. We also introduce a new long video segmentation dataset with corresponding pixel-wise ground truth. The experiments demonstrate that our proposed method can deal with object segmentation in long video sequences.
 Summarization-based Video Caption via Deep Neural Networks | BIBA | Full-Text 1191-1194 Guang Li; Shubo Ma; Yahong Han Generating appropriate descriptions for visual content has drawn increasing attention recently, with promising progress owing to the breakthroughs in deep neural networks. Different from the traditional SVO (subject, verb, object) based methods, in this paper, we propose a novel framework of video caption via deep neural networks. For each frame, we extract visual features by a fine-tuned deep Convolutional Neural Network (CNN), which are then fed into a Recurrent Neural Network (RNN) to generate novel sentence descriptions for each frame. In order to obtain the most representative and high-quality descriptions for the target video, a well-devised automatic summarization process is incorporated to reduce noise by ranking on the sentence-sequence graph. Moreover, our framework has the merit of describing out-of-sample videos by transferring knowledge from pre-captioned images. Experiments on the benchmark datasets demonstrate that our method has better performance than the state-of-the-art methods of video caption in language generation metrics as well as SVO accuracy.
 Multi-modal & Multi-view & Interactive Benchmark Dataset for Human Action Recognition | BIBA | Full-Text 1195-1198 Ning Xu; Anan Liu; Weizhi Nie; Yongkang Wong; Fuwu Li; Yuting Su Human action recognition is one of the most active research areas in both computer vision and machine learning communities. Several methods for human action recognition have been proposed in the literature and promising results have been achieved on the popular datasets. However, the comparison of existing methods is often limited given the different datasets, experimental settings, feature representations, and so on. In particular, there is no human action dataset that allows concurrent analysis on three popular scenarios, namely single view, cross view, and cross domain. In this paper, we introduce a Multi-modal & Multi-view & Interactive (M2I) dataset, which is designed for the evaluation of the performance of human action recognition under multi-view scenarios. This dataset consists of 1760 action samples, including 9 person-person interaction actions and 13 person-object interaction actions. Moreover, we respectively evaluate three representative methods for the single-view, cross-view, and cross domain human action recognition on this dataset with the proposed evaluation protocol. It is experimentally demonstrated that this dataset is extremely challenging due to large intraclass variation, multiple similar actions, and significant view differences. This benchmark can provide a solid basis for the evaluation of this task and will benefit related computer vision and machine learning research topics.
 A Deep Siamese Network for Scene Detection in Broadcast Videos | BIBA | Full-Text 1199-1202 Lorenzo Baraldi; Costantino Grana; Rita Cucchiara We present a model that automatically divides broadcast videos into coherent scenes by learning a distance measure between shots. Experiments are performed to demonstrate the effectiveness of our approach by comparing our algorithm against recent proposals for automatic scene segmentation. We also propose an improved performance measure that aims to reduce the gap between numerical evaluation and expected results, and propose and release a new benchmark dataset.
 Unsupervised Cosegmentation based on Global Graph Matching | BIBA | Full-Text 1203-1206 Takanori Tamanaha; Hideki Nakamaya Cosegmentation is defined as the task of segmenting a common object from multiple images. Hitherto, graph matching has been known as a promising approach because of its flexibility in matching deformable objects and regions, and several methods based on this approach have been proposed. However, candidate foregrounds obtained by a local matching algorithm in previous methods tend to include false-positive areas, particularly when visually similar backgrounds (e.g., sky) commonly appear across images.    We propose an unsupervised cosegmentation method based on a global graph matching algorithm. Rather than using a local matching algorithm that finds a small common subgraph, we employ global matching that can find a one-to-one mapping for every vertex between input graphs such that we can remove negative regions estimated as background. Experimental results obtained using the iCoseg and MSRC datasets demonstrate that the accuracy of the proposed method is higher than that of previous graph-based methods.
 Facial Age Estimation Based on Structured Low-rank Representation | BIBA | Full-Text 1207-1210 Chenjing Yan; Congyan Lang; Songhe Feng This paper presents an algorithm based on structured, low-rank representation for facial age estimation. The proposed method learns the discriminative feature representation of images with the constraint of the classwise block-diagonal structure to promote discrimination of representations for robust recognition. A block-sparse regularizer is introduced to exploit the similarity and structure information of class. Based on the new representation, we estimate the accurate age using a regression function. By subtly introducing the structured, low-rank representation, we achieve good age estimation performance. Experimental results on three well-known aging faces datasets have demonstrated that the proposed method is superior to the conventional approaches.
 Semantic Segmentation based on Stacked Discriminative Autoencoders and Context-Constrained Weakly Supervised Learning | BIBA | Full-Text 1211-1214 Xiwen Yao; Junwei Han; Gong Cheng; Lei Guo In this paper, we focus on tackling the problem of weakly supervised semantic segmentation. The aim is to predict the class label of image regions under weakly supervised settings, where training images are only provided with image-level labels indicating the classes they contain. The main difficulty of weakly supervised semantic segmentation arises from the complex diversity of visual classes and the lack of supervision information for learning a multi-class classifier. To conquer the challenge, we propose a novel discriminative deep feature learning framework based on stacked autoencoders (SAE) by integrating pairwise constraints to serve as a discriminative term. Furthermore, to mine effective supervision information, global context about co-occurrence of visual classes as well as local context around each image region is exploited as constraints for training a multi-class classifier. Finally, the classifier training is formulated as an ultimate optimization problem, which can be solved efficiently by an alternate iterative optimization method. Comprehensive experiments on the MSRC 21 dataset demonstrate the superior performance compared with several state-of-the-art weakly supervised image segmentation methods.
 Deep Self-taught Hashing for Image Retrieval | BIBA | Full-Text 1215-1218 Ke Zhou; Yu Liu; Jingkuan Song; Linyu Yan; Fuhao Zou; Fumin Shen Hashing is a promising technique to tackle the problem of scalable retrieval, and it generally consists of two major components, namely hash code generation and hash function learning.    The majority of existing hashing methods fall under the shallow model, which is intrinsically weak at mining robust visual features and learning complicated hash functions. In view of the superiority of deep structures, especially Convolutional Neural Networks (CNNs), at extracting high level representations, we propose a deep self-taught hashing (DSTH) framework to combine deep structures with hashing to improve the retrieval performance by automatically learning robust visual features and hash functions. By employing CNNs, more robust and discriminative features of the images can be extracted to benefit the hash code generation. Then, we apply CNNs and a Multi-layer Perceptron under a deep learning scheme to learn the hash function in a supervised process by using the generated hash codes as labels. The experimental results have shown that the DSTH is superior to several state-of-the-art algorithms.
 GPU Accelerated Generalised Subclass Discriminant Analysis for Event and Concept Detection in Video | BIBA | Full-Text 1219-1222 Stavros Arestis-Chartampilas; Nikolaos Gkalelis; Vasileios Mezaris In this paper a discriminant analysis (DA) technique called accelerated generalised subclass discriminant analysis (AGSDA) and its GPU implementation are presented. This method identifies a discriminant subspace of the input space in three steps: a) Gram matrix computation, b) eigenvalue decomposition of the between subclass factor matrix, and c) computation of the solution of a linear matrix system with symmetric positive semidefinite (SPSD) matrix of coefficients. Based on the fact that the computationally intensive parts of AGSDA, i.e. Gram matrix computation and identification of the SPSD linear matrix system solution, are highly parallelisable, a GPU implementation of AGSDA is proposed. Experimental results on large-scale datasets of TRECVID for event and concept detection show that our GPU-AGSDA method combined with LSVM outperforms LSVM alone in training time, memory consumption, and detection accuracy.
 Semi- and Weakly- Supervised Semantic Segmentation with Deep Convolutional Neural Networks | BIBA | Full-Text 1223-1226 Yuhang Wang; Jing Liu; Yong Li; Hanqing Lu Successful semantic segmentation methods typically rely on training datasets containing a large number of pixel-wise labeled images. To alleviate the dependence on such a fully annotated training dataset, in this paper, we propose a semi- and weakly-supervised learning framework by exploring images, most with only image-level labels and very few with pixel-level labels, in which two stages of Convolutional Neural Network (CNN) training are included. First, a pixel-level supervised CNN is trained on very few fully annotated images. Second, given a large number of images with only image-level labels available, a collaborative-supervised CNN is designed to jointly perform the pixel-level and image-level classification tasks, while the pixel-level labels are predicted by the fully-supervised network in the first stage. The collaborative-supervised network can retain the discriminative ability of the fully-supervised model learned with fully labeled images, and further enhance the performance by importing more weakly labeled data. Our experiments on two challenging datasets, i.e., PASCAL VOC 2007 and LabelMe LMO, demonstrate the satisfactory performance of our approach, nearly matching the results achieved when all training images have pixel-level labels.
 Learning Pairwise Neural Network Encoder for Depth Image-based 3D Model Retrieval | BIBA | Full-Text 1227-1230 Jing Zhu; Fan Zhu; Edward K. Wong; Yi Fang With the emergence of RGB-D cameras (e.g., Kinect), the sensing capability of artificial intelligence systems has been dramatically increased, and as a consequence, a wide range of depth image-based human-machine interaction applications have been proposed. In the design industry, a 3D model always contains abundant information, which is required for manufacture. Since depth images can be conveniently acquired, a retrieval system that can return 3D models based on depth image inputs can assist or improve the traditional product design process. In this work, we address the depth image-based 3D model retrieval problem. By extending the neural network to a neural network pair with identical output layers for objects of the same category, unified domain-invariant representations can be learned based on the low-level mismatched depth image features and 3D model features. A unique advantage of the framework is that the correspondence information between depth images and 3D models is not required, so that it can easily be generalized to large-scale databases. In order to evaluate the effectiveness of our approach, depth images (with Kinect-type noise) in the NYU Depth V2 dataset are used as queries to retrieve 3D models of the same categories in the SHREC 2014 dataset. Experimental results suggest that our approach can outperform the state-of-the-art methods, and the paradigm that directly uses the original representations of depth images and 3D models for retrieval.
 Using the Eyes to "See" the Objects | BIBA | Full-Text 1231-1234 Concetto Spampinato; Simone Palazzo; Francesca Murabito; Daniela Giordano This paper investigates how to exploit eye gaze data for understanding visual content. In particular, we propose a human-in-the-loop approach for object segmentation in videos, where humans provide significant cues on spatiotemporal relations between object parts (i.e. superpixels in our approach) by simply looking at video sequences. Such constraints, together with object appearance properties, are encoded into an energy function so as to tackle the segmentation problem as a labeling one. The proposed method uses gaze data from only two people and was tested on two challenging visual benchmarks: 1) SegTrack v2 and 2) FBMS-59. The achieved performance showed how our method outperformed more complex video object segmentation approaches, while reducing the effort needed for collecting human feedback.
 Discriminative Light Unsupervised Learning Network for Image Representation and Classification | BIBA | Full-Text 1235-1238 Le Dong; Ling He; Qianni Zhang This paper proposes a discriminative light unsupervised learning network (DLUN) to counter the image classification challenge. Compared with traditional convolutional networks, which learn filters by the time-consuming stochastic gradient descent, DLUN learns the filter bank from diverse image patches with the classical K-means, which significantly reduces the training complexity while maintaining high discriminative ability. Besides, we design a new pooling strategy named voting pooling which considers the contribution difference of the adjacent activations. In the output layer, DLUN computes histograms in the size-changed dense sliding windows, followed by a max pooling operation on histogram bins at different scales to obtain the most competitive features. The classification performance on two widely used benchmarks verifies that DLUN is competitive with some state-of-the-art methods.
 Ranking Optimization for Person Re-identification via Similarity and Dissimilarity | BIBA | Full-Text 1239-1242 Mang Ye; Chao Liang; Zheng Wang; Qingming Leng; Jun Chen Person re-identification is a key technique to match different persons observed in non-overlapping camera views.    Many researchers treat it as a special object retrieval problem, where ranking optimization plays an important role. Existing ranking optimization methods utilize the similarity relationship between the probe and gallery images to optimize the original ranking list, while the dissimilarity relationship is seldom investigated. In this paper, we propose to use both similarity and dissimilarity cues in a ranking optimization framework for person re-identification. Its core idea is based on the phenomenon that the true match should not only be similar to the strongly similar samples of the probe but also dissimilar to the strongly dissimilar samples. Extensive experiments have shown the great superiority of the proposed ranking optimization method.
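The core idea in the abstract above (a true match should resemble the probe's most similar gallery samples and differ from its most dissimilar ones) can be illustrated with a small sketch. This is not the authors' method: the cosine similarity, the equal weighting of the three terms, the parameter `k`, and the names `cosine` and `rerank` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(probe, gallery, k=1):
    """Return gallery indices, best match first.

    Hypothetical score (an assumption, not the published formula):
    base similarity to the probe
    + mean similarity to the probe's k most similar gallery samples
    - mean similarity to the probe's k most dissimilar gallery samples.
    """
    if not 1 <= k <= len(gallery):
        raise ValueError("k must be between 1 and len(gallery)")
    base = [cosine(probe, g) for g in gallery]
    order = sorted(range(len(gallery)), key=lambda i: base[i], reverse=True)
    strong_similar = order[:k]
    strong_dissimilar = order[-k:]
    scores = []
    for i, g in enumerate(gallery):
        sim_term = sum(cosine(g, gallery[j]) for j in strong_similar) / k
        dis_term = sum(cosine(g, gallery[j]) for j in strong_dissimilar) / k
        scores.append(base[i] + sim_term - dis_term)
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy 2-D features: samples 0 and 1 are equally similar to the probe,
# but sample 1 lies farther from the strongly dissimilar sample 3.
probe = (1.0, 0.0)
gallery = [(0.8, 0.6), (0.8, -0.6), (1.0, 0.0), (-0.6, 0.8)]
ranking = rerank(probe, gallery, k=1)
```

In this toy case the similarity cue alone cannot separate samples 0 and 1; the dissimilarity term breaks the tie in favor of sample 1.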
 Leveraging Knowledge-based Inference for Material Classification | BIBA | Full-Text 1243-1246 Jie Yu; Sandra Skaff; Liang Peng; Francisco Imai Material classification is one of the fundamental problems for multimedia content analysis, computer vision and graphics. Existing efforts mostly focus on extracting representative visual features and training a classifier to recognize unknown materials. Compared with human visual recognition, automatic recognition cannot leverage common sense knowledge regarding material categories and contextual information such as object and scene. In this paper, we propose to first extract such knowledge on material, object and scene from heterogeneous sources, i.e. a public data set of 100 million Flickr images [13] and Bing search results. To improve the material classification task, the knowledge information is further exploited in a probabilistic inference framework. Our method is evaluated on OpenSurfaces [10], the largest public material data set which contains both visual features of physical properties as well as image context information. The quantitative evaluation demonstrates the superior performance of our proposed method.
 Emotion Distribution Recognition from Facial Expressions | BIBA | Full-Text 1247-1250 Ying Zhou; Hui Xue; Xin Geng Most existing facial expression recognition methods assume the availability of a single emotion for each expression in the training set. However, in practical applications, an expression rarely expresses pure emotion, but often a mixture of different emotions. To address this problem, this paper deals with a more common case where multiple emotions are associated to each expression. The key idea is to learn the specific description degrees of all basic emotions for each expression and the mapping from the expression images to the emotion distributions by the proposed emotion distribution learning (EDL) method. Experimental results show that EDL can effectively deal with the emotion distribution recognition problem and perform remarkably better than the state-of-the-art multi-label learning methods.
 Exclusive Constrained Discriminative Learning for Weakly-Supervised Semantic Segmentation | BIBA | Full-Text 1251-1254 Peng Ying; Jin Liu; Hanqing Lu; Songde Ma How to import image-level labels as weak supervision to direct the region-level labeling task is the core task of weakly-supervised semantic segmentation. In this paper, we focus on designing an effective but simple weakly-supervised constraint, and propose an exclusive constrained discriminative learning model for image semantic segmentation. To be specific, we employ a discriminative linear regression model to assign subsets of superpixels with different labels. During the assignment, we construct an exclusive weakly-supervised constraint term to suppress the labeling responses of each superpixel on the labels outside its parent image-level label set. Besides, a spectral smoothing term is integrated to encourage that both visually and semantically similar superpixels have similar labels. Combining these terms, we formulate the problem as a convex objective function, which can be easily optimized via alternative iterations. Extensive experiments on MSRC-21 and LabelMe datasets demonstrate the effectiveness of the proposed model.
 Multimedia Event Detection Using Event-Driven Multiple Instance Learning | BIBA | Full-Text 1255-1258 Sang Phan; Duy-Dinh Le; Shin'ichi Satoh A complex event can be recognized by observing necessary evidences. In the real world scenarios, this is a difficult task because the evidences can happen anywhere in a video. A straightforward solution is to decompose the video into several segments and search for the evidences in each segment. This approach is based on the assumption that segment annotation can be assigned from its video label. However, this is a weak assumption because the importance of each segment is not considered. On the other hand, the importance of a segment to an event can be obtained by matching its detected concepts against the evidential description of that event. Leveraging this prior knowledge, we propose a new method, Event-driven Multiple Instance Learning (EDMIL), to learn the key evidences for event detection. We treat each segment as an instance and quantize the instance-event similarity into different levels of relatedness. Then the instance label is learned by jointly optimizing the instance classifier and its related level. The significant performance improvement on the TRECVID Multimedia Event Detection (MED) 2012 dataset proves the effectiveness of our approach.
 Learning Semantic Correlation of Web Images and Text with Mixture of Local Linear Mappings | BIBA | Full-Text 1259-1262 Youtian Du; Kai Yang This paper proposes a new approach, called mixture of local linear mappings (MLLM), to the modeling of semantic correlation between web images and text. We consider that close examples generally represent a uniform concept and can be supposed to be locally transformed based on a linear mapping into the feature space of another modality. Thus, we use a mixture of local linear transformations, each local component being constrained by a neighborhood model into a finite local space, instead of a more complex nonlinear one. To handle the sparseness of data representation, we introduce the constraints of sparseness and non-negativeness into the approach. MLLM is with good interpretability due to its explicit closed form and concept-related local components, and it avoids the determination of capacity that is often considered for nonlinear transformations. Experimental results demonstrate the effectiveness of the proposed approach.
 Learned vs. Hand-Crafted Features for Pedestrian Gender Recognition | BIBA | Full-Text 1263-1266 Grigory Antipov; Sid-Ahmed Berrani; Natacha Ruchaud; Jean-Luc Dugelay This paper addresses the problem of image features selection for pedestrian gender recognition. Hand-crafted features (such as HOG) are compared with learned features which are obtained by training convolutional neural networks. The comparison is performed on the recently created collection of versatile pedestrian datasets which allows us to evaluate the impact of dataset properties on the performance of features. The study shows that hand-crafted and learned features perform equally well on small-sized homogeneous datasets. However, learned features significantly outperform hand-crafted ones in the case of heterogeneous and unfamiliar (unseen) datasets. Our best model which is based on learned features obtains 79% average recognition rate on completely unseen datasets. We also show that a relatively small convolutional neural network is able to produce competitive features even with little training data.
 Multi-Level Fusion for Person Re-identification with Incomplete Marks | BIBA | Full-Text 1267-1270 Zheng Wang; Ruimin Hu; Yi Yu; Chao Liang; Wenxin Huang Most video surveillance suspect investigation systems rely on the videos taken in different camera views. Actually, besides the videos, in the investigation process, investigators also manually label some marks, which, albeit incomplete, can be quite accurate and helpful in identifying persons. This paper studies the problem of Person Re-identification with Incomplete Marks (PRIM), aiming at ranking the persons in the gallery according to both the videos and incomplete marks. This problem is solved by a multi-step fusion algorithm, which consists of three key steps: (i) The early fusing step exploits both visual features and marked attributes to predict a complete and precise attribute vector. (ii) Based on the statistical attribute dominance and saliency phenomena, a dominance-saliency matching model is suggested for measuring the distance between attribute vectors. (iii) The gallery is ranked separately by using visual features and attribute vectors, and the overall ranking list is the result of a late fusion. Experiments conducted on the VIPeR dataset have validated the effectiveness of the proposed method in all the three key steps. The results also show that through introducing marks, the retrieval accuracy is significantly improved.
 Real-Time Instant Event Detection in Egocentric Videos by Leveraging Sensor-Based Motion Context | BIBA | Full-Text 1275-1278 Pei-Yun Hsu; Wen-Feng Cheng; Peng-Ju Hsieh; Yen-Liang Lin; Winston H. Hsu With the rapid growth of egocentric videos from wearable devices, the need for instant video event detection is emerging. Different from conventional video event detection, it requires more consideration of real-time event detection and immediate video recording due to the computational cost on wearable devices (e.g., Google Glass). Conventional work on video event detection analyzes video content in an offline process, and the visual analysis is time-consuming. Observing that wearable devices are usually equipped with sensors, we propose a novel approach for instant event detection in egocentric videos by leveraging sensor-based motion context. We compute statistics of sensor data as features. Next, we predict the user's current motion context by a hierarchical model, and then choose the corresponding ranking model to rate the importance score of the timestamp. With the importance score provided in real time, the camera on the wearable device can dynamically record micro-videos without wasting power and storage. In addition, we collected a challenging daily-life dataset called EDS (Egocentric Daily-life Videos with Sensor Data), which contains both egocentric videos and sensor data recorded by Google Glass from different subjects. We evaluate the performance of our system on the EDS dataset, and the result shows that our method outperforms other baselines.
 Modeling Temporal Effects in Re-captured Video | BIBA | Full-Text 1279-1282 Philipp Schaber; Sally Dong; Benjamin Guthier; Stephan Kopf; Wolfgang Effelsberg The re-capturing of video content poses significant challenges to algorithms in the fields of video forensics, watermarking, and near-duplicate detection. Using a camera to record a video from a display introduces a variety of artifacts, such as geometric distortions, luminance transformations, and temporal aliasing. A deep understanding of the causes and effects of such phenomena is required for their simulation, and for making the affected algorithms more robust. In this paper, we provide a detailed model of the temporal effects in re-captured video. Such effects typically result in the re-captured frames being a blend of the original video's source frames, where the specific blend ratios are difficult to predict. Our proposed parametric model captures the temporal artifacts introduced by interactions between the video renderer, display device, and camera. The validity of our model is demonstrated through experiments with real re-captured videos containing specially marked frames.
 On the Benefit of Synthetic Data for Company Logo Detection | BIBA | Full-Text 1283-1286 Christian Eggert; Anton Winschel; Rainer Lienhart In this paper we explore the benefits of synthetically generated data for the task of company logo detection with deep-learned features in the absence of a large training set. We use pre-trained deep convolutional neural networks for feature extraction and use a set of support vector machines for classifying those features. In order to generate sufficient training examples we synthesize artificial training images. Using a bootstrapping process, we iteratively add new synthesized examples from an unlabeled dataset to the training set. Using this setup we are able to obtain a performance which is close to the performance of the full training set.
 Retrieving Unfamiliar Faces: Towards Understanding Human Performance | BIBA | Full-Text 1287-1290 Xu Zhou; Baoxin Li Face image retrieval aims to find, in a dataset, all images containing the same person as the query image. Automatic face retrieval has seen fast development in recent years, although humans still appear to be the better performer on this task. This paper reports a study towards understanding human performance on retrieving unfamiliar faces. Wild Web face images are utilized in the study, and two experiments are designed to assess human performance and behavior on the retrieval task. The experiments help to identify a set of important features and also to understand how humans behave when facing the task of retrieving unfamiliar faces. Such observations/conclusions may provide guidelines for improving existing automated algorithms.
 Acoustic Scene Classification based on Sound Textures and Events | BIBA | Full-Text 1291-1294 Jiaxing Ye; Takumi Kobayashi; Masahiro Murakawa; Tetsuya Higuchi Semantic labelling of acoustic scenes has recently emerged as an active topic covering a wide range of applications, e.g. surveillance and audio-based information retrieval. In this paper, we present an effective approach for acoustic scene classification through characterizing both background sound textures and acoustic events. The work takes inspiration from the psychoacoustic definition of acoustic scenes, that is, "skeleton of (acoustic) events on a bed of (sound) texture". In detail, we first employ distinct models to exploit sound textures and events in acoustic scenes individually. Subsequently, based on the fact that the perceptual importance of the two parts varies across scene categories, we develop a class-conditional fusion scheme to aggregate the two-channel information. To validate the proposed approach, we conduct extensive experiments on the Rouen dataset, which includes 19 categories of daily acoustic scenes with 3026 real-world recordings, and the proposed approach outperforms state-of-the-art methods by a large margin.
 Coupled Support Vector Machines for Supervised Domain Adaptation | BIBA | Full-Text 1295-1298 Hemanth Venkateswara; Prasanth Lade; Jieping Ye; Sethuraman Panchanathan Popular domain adaptation (DA) techniques learn a classifier for the target domain by sampling relevant data points from the source and combining it with the target data. We present a Support Vector Machine (SVM) based supervised DA technique, where the similarity between source and target domains is modeled as the similarity between their SVM decision boundaries. We couple the source and target SVMs and reduce the model to a standard single SVM. We test the Coupled-SVM on multiple datasets and compare our results with other popular SVM based DA approaches.
 Deep People Counting in Extremely Dense Crowds | BIBA | Full-Text 1299-1302 Chuan Wang; Hua Zhang; Liang Yang; Si Liu; Xiaochun Cao People counting in extremely dense crowds is an important step for video surveillance and anomaly warning. The problem is especially challenging due to the lack of training samples, severe occlusions, cluttered scenes and variation of perspective. Existing methods either resort to auxiliary human and face detectors or estimate the density of crowds as a surrogate. Most of them rely on hand-crafted features, such as SIFT and HOG, and thus are prone to fail when density grows or training samples are scarce. In this paper we propose an end-to-end deep convolutional neural network (CNN) regression model for counting people in images of extremely dense crowds. Our method has the following characteristics. First, it is a deep model built on a CNN to automatically learn effective features for counting. Second, to weaken the influence of background such as buildings and trees, we purposely enrich the training data with expanded negative samples whose ground-truth count is set to zero. With these negative samples, the robustness can be enhanced. Extensive experimental results show that our method outperforms the state of the art in terms of the mean and variance of absolute difference.
 Gyro-based Camera-motion Detection in User-generated Videos | BIBA | Full-Text 1303-1306 Sophia Bano; Andrea Cavallaro; Xavier Parra We propose a gyro-based camera-motion detection method for videos captured with smartphones. First, the delay between the acquisition of video and gyroscope data is estimated using similarities induced by camera motion in the two sensor modalities. Pan, tilt and shake are then detected using the dominant motions and high frequencies in the gyroscope data. Morphological operations are applied to remove outliers and to identify segments with continuous camera-motion. We compare the proposed method with existing methods that use visual or inertial sensor data.
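The delay estimation step in the abstract above can be illustrated with a minimal sketch. The abstract only says the delay is found from similarities between the two modalities, so the unnormalized cross-correlation below is an assumption and not the authors' procedure. The function name `estimate_delay`, the `max_lag` parameter and the toy signals are hypothetical, and both signals are assumed to be resampled to a common rate beforehand.

```python
def estimate_delay(video_motion, gyro_motion, max_lag):
    """Return the lag (in samples) by which gyro_motion trails video_motion.

    Each candidate lag is scored by the sum of products over the
    overlapping samples, and the best-scoring lag is returned.
    """
    def score(lag):
        total = 0.0
        for i, v in enumerate(video_motion):
            j = i + lag
            if 0 <= j < len(gyro_motion):
                total += v * gyro_motion[j]
        return total

    return max(range(-max_lag, max_lag + 1), key=score)


# Toy signals: the same motion burst, two samples later in the gyro stream.
video = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
gyro = [0, 0, 0, 0, 1, 3, 1, 0, 0, 0]
delay = estimate_delay(video, gyro, max_lag=4)
```

A normalized correlation would be a natural refinement when the two modalities have very different amplitudes. The plain sum of products is kept here for brevity.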
 Human Activity Recognition Using Wearable Sensors by Deep Convolutional Neural Networks | BIBA | Full-Text 1307-1310 Wenchao Jiang; Zhaozheng Yin Human physical activity recognition based on wearable sensors has applications relevant to our daily life such as healthcare. How to achieve high recognition accuracy with low computational cost is an important issue in ubiquitous computing. Rather than exploring handcrafted features from time-series sensor signals, we assemble signal sequences of accelerometers and gyroscopes into a novel activity image, which enables Deep Convolutional Neural Networks (DCNN) to automatically learn the optimal features from the activity image for the activity recognition task. Our proposed approach is evaluated on three public datasets and it outperforms state-of-the-art methods in terms of recognition accuracy and computational cost.
 Image2Emoji: Zero-shot Emoji Prediction for Visual Media | BIBA | Full-Text 1311-1314 Spencer Cappallo; Thomas Mensink; Cees G. M. Snoek We present Image2Emoji, a multi-modal approach for generating emoji labels for an image in a zero-shot manner. Different from existing zero-shot image-to-text approaches, we exploit both image and textual media to learn a semantic embedding for the new task of emoji prediction. We propose that the widespread adoption of emoji suggests a semantic universality which is well-suited for interaction with visual media. We quantify the efficacy of our proposed model on the MSCOCO dataset, and demonstrate the value of visual, textual and multi-modal prediction of emoji. We conclude the paper with three examples of the application potential of emoji in the context of multimedia retrieval.
 Rich Image Description Based on Regions | BIBA | Full-Text 1315-1318 Xiaodan Zhang; Xinhang Song; Xiong Lv; Shuqiang Jiang; Qixiang Ye; Jianbin Jiao Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In contrast to the previous image description methods that focus on describing the whole image, this paper presents a method of generating rich image descriptions from image regions. First, we detect regions with the R-CNN (regions with convolutional neural network features) framework. We then utilize an RNN (recurrent neural network) to generate sentences for image regions. Finally, we propose an optimization method to select one suitable region. The proposed model generates several sentence descriptions of regions in an image, which have sufficient representative power for the whole image and contain more detailed information. Compared to general image-level description, generating more specific and accurate sentences on the different regions can satisfy more personal requirements for different people. Experimental evaluations validate the effectiveness of the proposed method.

### Tutorials

 VM Hub: Building Cloud Service and Mobile Application for Image/Video/Multimedia Services | BIBA | Full-Text 1319-1320 Jin Li In this tutorial, we will teach how to use VM Hub (Visual Media Hub), an open multimedia hub with most of the code in the open source space, to convert a multimedia application to a cloud service, and to build mobile applications that consume the cloud service. The tutorial also covers the architecture and design considerations of VM Hub.
 Interactive Video Search | BIBA | Full-Text 1321-1322 Klaus Schoeffmann; Frank Hopfgartner With an increasing amount of video data in our daily life, the need for content-based search in videos increases as well. Though a lot of research has been spent on video retrieval tools and methods which allow for automatic search in videos through content-based queries, still the performance of automatic video retrieval is far from optimal. In this tutorial we discussed (i) proposed solutions for improved video content navigation, (ii) typical interaction of content-based querying features, and (iii) advanced video content visualization methods. Moreover, we discussed interactive video search systems and ways to evaluate their performance.
 Learning Knowledge Bases for Multimedia in 2015 | BIBA | Full-Text 1323-1324 Lexing Xie; Haixun Wang Knowledge acquisition, representation, and reasoning has been one of the long-standing challenges in artificial intelligence and related application areas. Only in the past few years, massive amounts of structured and semi-structured data that directly or indirectly encode human knowledge became widely available, turning the knowledge representation problems into a computational grand challenge with feasible solutions in sight. The research and development on knowledge bases is becoming a lively fusion area among web information extraction, machine learning, databases and information retrieval, with knowledge over images and multimedia emerging as another new frontier of representation and acquisition. This tutorial aims to present a gentle overview of knowledge bases on text and multimedia, including representation, acquisition, and inference. In particular, the 2015 edition of the tutorial will include recent progress from several active research communities: web, natural language processing, and computer vision and multimedia.
 Image Tag Assignment, Refinement and Retrieval | BIBA | Full-Text 1325-1326 Xirong Li; Tiberio Uricchio; Lamberto Ballan; Marco Bertini; Cees G. M. Snoek; Alberto Del Bimbo This tutorial focuses on challenges and solutions for content-based image annotation and retrieval in the context of online image sharing and tagging. We present a unified review on three closely linked problems, i.e., tag assignment, tag refinement, and tag-based image retrieval. We introduce a taxonomy to structure the growing literature, understand the ingredients of the main works, clarify their connections and difference, and recognize their merits and limitations. Moreover, we present an open-source testbed, with training sets of varying sizes and three test datasets, to evaluate methods of varied learning complexity. A selected set of eleven representative works have been implemented and evaluated. During the tutorial we provide a practice session for hands on experience with the methods, software and datasets. For repeatable experiments all data and code are online at http://www.micc.unifi.it/tagsurvey
 Tutorial on Emotional and Social Signals for Multimedia Research | BIB | Full-Text 1327-1328 Hayley Hung; Hatice Gunes
 An Introduction to Arts and Digital Culture Inside Multimedia | BIBA | Full-Text 1329-1330 David A. Shamma; Daragh Bryne The Arts and Digital Culture program has offered a high quality forum for the presentation of interactive and arts-based multimedia applications at the annual ACM Multimedia conference for over a decade. This tutorial will explore the evolution of this program as a guide to new authors considering future participation in this program. By surveying both past technical and past exhibited contributions, this tutorial will offer guidance to artists, researchers and practitioners on success at this multifaceted, interdisciplinary forum at ACM Multimedia.
 Human-Centric Images and Videos Analysis | BIBA | Full-Text 1331-1332 Si Liu; BingBing Ni; Liang Lin This article summarizes the corresponding half-day tutorial at ACM Multimedia 2015. This tutorial reviews recent progress in human-centric image and video analysis: 1) fashion analysis: parsing, attribute prediction and retrieval; 2) action analysis: discriminative feature selection, pooling and fusion; 3) person verification: cross-domain person verification via learning a generalized similarity measure, and bit-scalable deep hashing with regularized similarity learning.
 User-centric Cross-OSN Multimedia Computing | BIBA | Full-Text 1333-1334 Jitao Sang This article summarizes the corresponding half-day tutorial at ACM Multimedia 2015. This tutorial is divided into two parts as (1) User-centric Social Multimedia Computing; and (2) Cross-OSN Multimedia Computing.

### Workshop Summaries

 AVEC 2015: The 5th International Audio/Visual Emotion Challenge and Workshop | BIBA | Full-Text 1335-1336 Fabien Ringeval; Bjoern Schuller; Michel Valstar; Roddy Cowie; Maja Pantic The fifth Audio-Visual Emotion Challenge and workshop, AVEC 2015, was held in conjunction with ACM Multimedia'15. Like the previous editions of AVEC, the workshop/challenge addresses the detection of affective signals represented in audio-visual data in terms of high-level continuous dimensions. A major novelty was further introduced this year by the inclusion of the physiological modality -- along with the audio and the video modalities -- in the dataset. In this summary, we mainly describe participation and its conditions.
 Multimedia COMMONS -- Community-Organized Multimodal Mining: Opportunities for Novel Solutions (MMCommons Workshop 2015) | BIBA | Full-Text 1337-1338 Gerald Friedland; Chong-Wah Ngo; David Ayman Shamma The Multimedia COMMONS workshop laid the groundwork for developing a research community around the Multimedia Genome Project (MMGP), an initiative initially focused on annotation of -- and research using -- the 99.2 million images and nearly 800,000 videos in the Yahoo Flickr Creative Commons 100 Million dataset (YFCC100M). Current and potential users of the YFCC100M presented new research and systems that used this unprecedentedly large, unprecedentedly open-source dataset; discussed ideas for future data challenges and new benchmarking tasks that would not previously have been possible; and suggested priorities and plans for annotation and distribution based on community needs and interests.
 ImmersiveMe'15: 3rd ACM International Workshop on Immersive Media Experiences | BIBA | Full-Text 1339-1340 Teresa Chambel; Paula Viana; V. Michael Bove; Sharon Strover; Graham Thomas This ACM International Workshop on Immersive Media Experiences is in its 3rd edition. Since 2013 in Barcelona, it has been a meeting point of researchers, students, media producers, service providers and industry players in the area of immersive media environments, applications and experiences. After the successful first edition at ACM Multimedia 2013 and the consolidation of the theme and the team at Orlando in 2014, ImmersiveMe'15 aims at bringing to the stage new ideas and developments that keep this topic as appealing as in the previous editions. ImmersiveMe'15 will now take place in Brisbane and, again, it will be a platform to present interesting and out-of-the-box new work that contributes to make the world more interactive, immersive and engaging.
 CrowdMM 2015: Fourth International ACM Workshop on Crowdsourcing for Multimedia | BIBA | Full-Text 1341-1342 Judith Redi; Stevan Rudinac Crowdsourcing has the potential to address key challenges in multimedia research. Multimedia evaluation, annotation, retrieval and creation can be obtained at a low time and monetary cost from the contribution of large crowds and by leveraging human computation. In fact, the applicative frontiers of this potential are yet to be discovered. And yet, challenges already arise as to how to cautiously exploit it. The crowd, as a community of users (workers), is a complex and dynamic system highly sensitive to changes in the form and the parametrization of its activities. Issues concerning motivation, reliability, and engagement are documented more and more often, and need to be addressed. Since 2012, the International ACM Workshop on Crowdsourcing for Multimedia (CrowdMM) has welcomed new insights on the effective deployment of crowdsourcing towards boosting multimedia research. In its fourth year, CrowdMM 2015 focuses on contributions addressing the key challenges that still hinder widespread adoption of crowdsourcing paradigms in the multimedia research community: identification of optimal crowd members (e.g., user expertise, worker reliability), providing effective explanations (i.e., good task design), controlling noise and quality in the results, designing incentive structures that do not breed cheating, and tackling privacy issues in data collection.
 2nd Workshop on Computational Models of Social Interactions: Human-Computer-Media Communication (HCMC2015) | BIBA | Full-Text 1343-1344 Mohamed R. Amer; Ajay Divakaran; Shih-Fu Chang; Nicu Sebe Communicating ideas and information from and to humans is a very important subject. In our daily life, humans interact with a variety of entities, such as other humans, machines, and media. Constructive interactions are needed for good communication, which would result in successful outcomes, such as answering a query, learning a new skill, getting a service done, and communicating emotions. Each of these entities invokes a set of signals. Current research has focused on analyzing one entity's signals in a unidirectional manner, without regard to the other entities. The computer vision community has focused on detection, classification and recognition of humans and their poses and gestures, progressing onto actions, activities, and events, but it does not go beyond that. The signal processing community has focused on emotion recognition from facial expressions or audio or both combined. The HCI community has focused on making interfaces for machines easier to use. The goal of this workshop is to bring multiple disciplines together, to process human directed signals holistically, in a bidirectional manner, rather than in isolation. This workshop is positioned to display this rich domain of applications, which will provide the necessary next boost for these technologies. At the same time, it seeks to ground computational models on theory that would help achieve the technology goals. This would allow us to leverage decades of research in different fields and to spur interdisciplinary research, thereby opening up new problem domains for the multimedia community.
 About Events, Objects, and their Relationships: Human-centered Event Understanding from Multimedia | BIBA | Full-Text 1345-1346 Ansgar Scherp; Vasileios Mezaris; Bogdan Ionescu; Francesco De Natale HuEvent'15 is a continuation of previous year's successful workshop on events in multimedia. It focuses on the human-centered aspects of understanding events from multimedia content. This includes the notion of objects and their relation to events. The workshop brings together researchers from the different areas in multimedia and beyond that are interested in understanding the concept of events.
 Overview of the 2015 Workshop on Speech, Language and Audio in Multimedia | BIBA | Full-Text 1347-1348 Guillaume Gravier; Gareth F. Jones; Martha Larson; Roeland Ordelman The Workshop on Speech, Language and Audio in Multimedia (SLAM) positions itself at the crossroads of multiple scientific fields (music and audio processing, speech processing, natural language processing and multimedia) to discuss and stimulate research results, projects, datasets and benchmark initiatives where audio, speech and language are applied to multimedia data. While the first two editions were collocated with major speech events, SLAM'15 is deeply rooted in the multimedia community, opening up to computer vision and multimodal fusion. To this end, the workshop emphasizes video hyperlinking as a showcase where computer vision meets speech and language. Such techniques provide a powerful illustration of how multimedia technologies incorporating speech, language and audio can make multimedia content collections more accessible, and thereby more useful, to users.
 ASM'15: The 1st International Workshop on Affect and Sentiment in Multimedia | BIB | Full-Text 1349 Mohammad Soleymani; Yi-Hsuan Yang; Yu-Gang Jiang; Shih-Fu Chang