[1]
Perception-Guided Multimodal Feature Fusion for Photo Aesthetics Assessment
Multimedia HCI and QoE
/
Zhang, Luming
/
Gao, Yue
/
Zhang, Chao
/
Zhang, Hanwang
/
Tian, Qi
/
Zimmermann, Roger
Proceedings of the 2014 ACM International Conference on Multimedia
2014-11-03
p.237-246
© Copyright 2014 ACM
Summary: Photo aesthetic quality evaluation is a challenging task in multimedia and
computer vision fields. Conventional approaches suffer from the following three
drawbacks: 1) the deemphasized role of semantic content that is many times more
important than low-level visual features in photo aesthetics; 2) the
difficulty of optimally fusing low-level and high-level visual cues in photo aesthetics
evaluation; and 3) the absence of a sequential viewing path in the existing
models, as humans perceive visually salient regions sequentially when viewing a
photo.
To solve these problems, we propose a new aesthetic descriptor that mimics
humans sequentially perceiving visually/semantically salient regions in a
photo. In particular, a weakly supervised learning paradigm is developed to
project the local aesthetic descriptors (graphlets in this work) into a
low-dimensional semantic space. Thereafter, each graphlet can be described by
multiple types of visual features, at both the low and high levels. Since
humans usually perceive only a few salient regions in a photo, a
sparsity-constrained graphlet ranking algorithm is proposed that seamlessly
integrates both the low-level and the high-level visual cues. Top-ranked
graphlets are those visually/semantically prominent graphlets in a photo. They
are sequentially linked into a path that simulates how humans actively view a
photo. Finally, we learn a probabilistic aesthetic measure based on
such actively viewing paths (AVPs) from the training photos that are marked as
aesthetically pleasing by multiple users. Experimental results show that: 1)
the AVPs are 87.65% consistent with real human gaze shifting paths, as verified
by the eye-tracking data; and 2) our photo aesthetic measure outperforms many
of its competitors.
[2]
Fused one-vs-all mid-level features for fine-grained visual categorization
Multimedia Analysis and Mining
/
Zhang, Xiaopeng
/
Xiong, Hongkai
/
Zhou, Wengang
/
Tian, Qi
Proceedings of the 2014 ACM International Conference on Multimedia
2014-11-03
p.287-296
© Copyright 2014 ACM
Summary: As an emerging research topic, fine-grained visual categorization has been
attracting growing attention in recent years. Due to the large inter-class
similarity and intra-class variance, recognizing objects in fine-grained
domains is extremely challenging, and sometimes even humans cannot recognize
them accurately. The traditional bag-of-words model can obtain desirable results
for basic-level category classification by weak alignment using spatial pyramid
matching model, but may easily fail in fine-grained domains since the
discriminative features are not only subtle but also extremely localized. The
fine differences often get swamped by those irrelevant features, and it is
virtually impossible to distinguish them. To address the problems above, we
propose a new framework for fine-grained visual categorization. We strengthen
the spatial correspondence among parts by including foreground segmentation and
part localization. Based on the part representations of the images, we learn a
large set of mid-level features which are more suitable for fine-grained tasks.
Compared with the low-level features directly extracted from the images, the
learned one-vs-all mid-level features enjoy the following advantages. First,
the dimension of the mid-level features is relatively small. In order to obtain
high classification accuracy, the dimension of the low-level features usually
reaches several thousand to tens of thousands, and becomes even larger when
the spatial pyramid model is introduced. However, the dimension of our mid-level
features is related to the number of classes, which is far less. Second, each
entry of the proposed mid-level features is meaningful, which forms a more
compact representation of the image. Third, the mid-level features are more
robust than the low-level ones, which is helpful for classification. Fourth,
the learning process of the mid-level features is independent and can be easily
combined with other techniques to boost the performance. We evaluate the
proposed approach on the extensive fine-grained datasets CUB-200-2011 and
Stanford Dogs. By learning the mid-level features based on the popular Fisher
vectors and convolutional neural network features, we boost the classification
accuracy by a considerable margin and advance the state-of-the-art performance
in fine-grained visual categorization.
[3]
Social Embedding Image Distance Learning
Multimedia Recommendations
/
Liu, Shaowei
/
Cui, Peng
/
Zhu, Wenwu
/
Yang, Shiqiang
/
Tian, Qi
Proceedings of the 2014 ACM International Conference on Multimedia
2014-11-03
p.617-626
© Copyright 2014 ACM
Summary: Image distance (similarity) is a fundamental and important problem in image
processing. However, traditional image distance metrics based on visual
features usually fail to capture human cognition. This paper presents a novel Social
embedding Image Distance Learning (SIDL) approach to embed the similarity of
collective social and behavioral information into visual space. The social
similarity is estimated according to multiple social factors. Then a metric
learning method is especially designed to learn the distance of visual features
from the estimated social similarity. In this manner, we can evaluate the
cognitive image distance based on the visual content of images. Comprehensive
experiments are designed to investigate the effectiveness of SIDL, as well as
the performance in the image recommendation and reranking tasks. The
experimental results show that the proposed approach makes a marked improvement
compared to the state-of-the-art image distance metrics. An interesting
observation further shows that the learned image distance better reflects
human cognition.
[4]
Scalable Image Search with Reliable Binary Code
Posters 1
/
Ren, Guangxin
/
Cai, Junjie
/
Li, Shipeng
/
Yu, Nenghai
/
Tian, Qi
Proceedings of the 2014 ACM International Conference on Multimedia
2014-11-03
p.769-772
© Copyright 2014 ACM
Summary: In many existing image retrieval algorithms, the Bag-of-Words (BoW) model has
been widely adopted for image representation. To achieve accurate indexing and
efficient retrieval, local features such as the SIFT descriptor are extracted
and quantized to visual words. One of the most popular quantization schemes is
scalar quantization, which generates a binary signature with an empirical
threshold value. However, such a binarization strategy inevitably suffers from
the quantization loss induced by each quantized bit, which impairs search
performance. In this paper, we investigate the
reliability of each bit in scalar quantization and propose a novel reliable
binary SIFT feature. We go one step further and incorporate the reliability in
both index word expansion and feature similarity. Our proposed approach not
only accelerates the search speed by narrowing search space, but also improves
the retrieval accuracy by alleviating the impact of unreliable quantized bits.
Experimental results demonstrate that the proposed approach achieves
significant improvement in retrieval efficiency and accuracy.
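The bit-reliability idea summarized above can be illustrated with a toy sketch. The threshold, reliability margin, and 8-D descriptors below are illustrative assumptions, not the paper's actual quantizer:

```python
import numpy as np

def scalar_quantize(desc, threshold=0.0, margin=0.05):
    """Binarize a descriptor by thresholding each dimension.
    Bits whose values fall within `margin` of the threshold are
    flagged as unreliable (an illustrative rule, not the paper's)."""
    bits = (desc > threshold).astype(np.uint8)
    reliable = np.abs(desc - threshold) >= margin
    return bits, reliable

def reliable_hamming(b1, r1, b2, r2):
    """Hamming distance counted only over mutually reliable bits."""
    mask = r1 & r2
    return int(np.sum((b1 != b2) & mask))

# Toy 8-D "descriptors"; dimensions 2 and 5 sit near the threshold.
d1 = np.array([0.9, -0.8, 0.02, 0.5, -0.5, 0.01, 0.7, -0.9])
d2 = np.array([0.8, -0.7, -0.03, 0.6, 0.5, -0.02, 0.6, -0.8])
b1, r1 = scalar_quantize(d1)
b2, r2 = scalar_quantize(d2)
print(reliable_hamming(b1, r1, b2, r2))  # prints 1; near-threshold bit flips are ignored
```

A plain Hamming distance over all eight bits would also count the two unstable near-threshold bits (giving 3 here), which is exactly the kind of quantization loss the reliable variant avoids.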
[5]
Personalized Visual Vocabulary Adaption for Social Image Retrieval
Posters 2
/
Niu, Zhenxing
/
Zhang, Shiliang
/
Gao, Xinbo
/
Tian, Qi
Proceedings of the 2014 ACM International Conference on Multimedia
2014-11-03
p.993-996
© Copyright 2014 ACM
Summary: With the popularity of mobile devices and social networks, users can easily
build their personalized image sets. Thus, personalized image analysis,
indexing, and retrieval have become important topics in social media analysis.
Because of users' diverse preferences, their personalized image sets are
usually related to specific topics and show large feature distribution bias
from general Internet images. Therefore, a visual vocabulary trained on
general Internet images may not fit users' personalized image sets very well.
To improve the image retrieval performance on personalized image
sets, we propose personalized visual vocabulary adaption, which removes
non-discriminative visual words and replaces them with more exact and
discriminative ones, i.e., it adapts a general vocabulary toward a specific
user's image set. The proposed algorithm updates the visual vocabulary during
off-line feature quantization and operates on a limited number of visual
words, and hence achieves satisfactory efficiency. Extensive experiments of image search on public
datasets demonstrate the efficiency and superior performance of our approach.
[6]
Image Re-ranking with an Alternating Optimization
Posters 3
/
Pang, Shanmin
/
Xue, Jianru
/
Gao, Zhanning
/
Tian, Qi
Proceedings of the 2014 ACM International Conference on Multimedia
2014-11-03
p.1141-1144
© Copyright 2014 ACM
Summary: In this work, we propose an efficient image re-ranking method, without
additional memory cost compared with the baseline method [8], to re-rank all
retrieved images. The motivation of the proposed method is that there are
usually many visual words in the query image that only give votes to irrelevant
images. With this observation, we propose to only use visual words which can
help to find relevant images to re-rank the retrieved images. To achieve the
goal, we first find some similar images to the query by maximizing a quadratic
function when given an initial ranking of the retrieved images. Then we select
query visual words with an alternating optimization strategy: (1) at each
iteration, select words based on the similar images that we have found and (2)
in turn, update the similar images with the selected words. These two steps are
repeated until convergence. Experimental results on standard benchmark datasets
show that the proposed method outperforms spatial-based re-ranking methods.
[7]
Discriminative coupled dictionary hashing for fast cross-media retrieval
Session 4c: more hashing
/
Yu, Zhou
/
Wu, Fei
/
Yang, Yi
/
Tian, Qi
/
Luo, Jiebo
/
Zhuang, Yueting
Proceedings of the 2014 Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2014-07-06
p.395-404
© Copyright 2014 ACM
Summary: Cross-media hashing, which conducts cross-media retrieval by embedding data
from different modalities into a common low-dimensional Hamming space, has
attracted intensive attention in recent years. The existing cross-media hashing
approaches only aim at learning hash functions to preserve the intra-modality
and inter-modality correlations, but do not directly capture the underlying
semantic information of the multi-modal data. We propose a discriminative
coupled dictionary hashing (DCDH) method in this paper. In DCDH, the coupled
dictionary for each modality is learned with side information (e.g.,
categories). As a result, the coupled dictionaries not only preserve the
intra-similarity and inter-correlation among multi-modal data, but also contain
dictionary atoms that are semantically discriminative (i.e., data from the
same category are reconstructed by similar dictionary atoms). To perform
fast cross-media retrieval, we learn hash functions which map data from the
dictionary space to a low-dimensional Hamming space. In addition, we conjecture
that a balanced representation is crucial in cross-media retrieval. We
introduce multi-view features on the relatively "weak" modalities into DCDH and
extend it to multi-view DCDH (MV-DCDH) in order to enhance their representation
capability. The experiments on two real-world data sets show that our DCDH and
MV-DCDH outperform the state-of-the-art methods significantly on cross-media
retrieval.
[8]
Topology preserving hashing for similarity search
Similarity search
/
Zhang, Lei
/
Zhang, Yongdong
/
Tang, Jinhui
/
Gu, Xiaoguang
/
Li, Jintao
/
Tian, Qi
Proceedings of the 2013 ACM International Conference on Multimedia
2013-10-21
p.123-132
© Copyright 2013 ACM
Summary: Binary hashing has been widely used for efficient similarity search.
Learning effective codes remains a research focus and an open challenge. In
many cases, real-world data often lie on a low-dimensional
manifold, which should be taken into account to capture meaningful neighbors
with hashing. The importance of a manifold is its topology, which represents
the neighborhood relationships between its subregions and the relative
proximities between the neighbors of each subregion, e.g. the relative ranking
of neighbors of each subregion. Most existing hashing methods try to preserve
the neighborhood relationships by mapping similar points to close codes, while
ignoring the neighborhood rankings. Moreover, most hashing methods fail to
provide a good ranking for query results because they use the Hamming distance
as the similarity metric, and in practice many results share the same distance
to a query. In this paper, we propose a novel hashing
method to solve these two issues jointly. The proposed method is referred to as
Topology Preserving Hashing (TPH). TPH is distinct from prior works by
preserving the neighborhood rankings of data points in Hamming space. The
learning stage of TPH is formulated as a generalized eigendecomposition problem
with closed form solutions. Experimental comparisons with other
state-of-the-art methods on three noted image benchmarks demonstrate the
efficacy of the proposed method.
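The learning stage described above reduces to a generalized eigendecomposition, a structure shared by several spectral hashing methods. The sketch below shows only that generic skeleton with stand-in scatter matrices; TPH's actual objective, which also encodes neighborhood rankings, is not reproduced here:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))         # toy data: 200 points, 16-D
X -= X.mean(axis=0)

# Stand-in similarity-weighted scatter matrices; TPH's real A and B
# encode neighborhood relations and rankings, which we omit here.
W = np.exp(-np.square(np.linalg.norm(X[:, None] - X[None], axis=2)))
D = np.diag(W.sum(axis=1))
A = X.T @ W @ X                            # "similar pairs get close codes" term
B = X.T @ D @ X + 1e-6 * np.eye(16)        # normalization term (kept positive definite)

# Generalized eigenproblem A v = lambda B v; the top eigenvectors give
# linear projections, and the binary codes are their signs.
vals, vecs = eigh(A, B)
P = vecs[:, -8:]                           # keep 8 projections -> 8-bit codes
codes = (X @ P > 0).astype(np.uint8)
print(codes.shape)                         # (200, 8)
```

The closed-form solution via `eigh` is what makes this family of formulations attractive: no iterative optimization is needed for the projection matrix.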
[9]
Stereotime: a wireless 2D and 3D switchable video communication system
Demos
/
Yang, You
/
Liu, Qiong
/
Gao, Yue
/
Xiong, Binbin
/
Yu, Li
/
Luan, Huanbo
/
Ji, Rongrong
/
Tian, Qi
Proceedings of the 2013 ACM International Conference on Multimedia
2013-10-21
p.473-474
© Copyright 2013 ACM
Summary: Mobile 3D video communication, especially with 2D/3D compatibility, is a
new paradigm for both video communication and 3D video processing. Current
techniques face challenges on mobile devices, where bundled constraints such as
limited computational resources and compatibility must be considered. In this
work, we present Stereotime, a wireless 2D/3D switchable video communication
system that addresses these challenges. The methods of Zig-Zag fast
object segmentation, depth cue detection and merging, and texture-adaptive
view generation are used for 3D scene reconstruction. We demonstrate the
system's functionality and compatibility on 3D mobile devices in a WiFi
network environment.
[10]
Object coding on the semantic graph for scene classification
Posters
/
Chen, Jingjing
/
Han, Yahong
/
Cao, Xiaochun
/
Tian, Qi
Proceedings of the 2013 ACM International Conference on Multimedia
2013-10-21
p.493-496
© Copyright 2013 ACM
Summary: In scene classification, a scene can be considered as a set of object
cliques. Objects inside each clique have semantic correlations with each other,
while two objects from different cliques are relatively independent. To utilize
these correlations for better recognition performance, we propose a new method,
Object Coding on the Semantic Graph, to address the scene classification
problem. We first exploit prior knowledge by collecting statistics from a large
number of labeled images and calculating the dependency degree between objects.
Then, a graph is built to model the semantic correlations between objects. This
semantic graph captures semantics by treating the objects as vertices and the
object affinities as the weights of edges. By encoding this semantic knowledge
into the semantic graph, object coding is conducted to automatically select a
set of object cliques that have strong semantic correlations to represent a
specific scene. The experimental results show that Object Coding on the
Semantic Graph improves the classification accuracy.
[11]
Beyond bag of words: image representation in sub-semantic space
Posters
/
Zhang, Chunjie
/
Wang, Shuhui
/
Liang, Chao
/
Liu, Jing
/
Huang, Qingming
/
Li, Haojie
/
Tian, Qi
Proceedings of the 2013 ACM International Conference on Multimedia
2013-10-21
p.497-500
© Copyright 2013 ACM
Summary: Due to the semantic gap, the low-level features are not able to semantically
represent images well. Besides, traditional semantics-related image
representations may not be able to cope with large inter-class variations and
are not very robust to noise. To solve these problems, in this paper, we
propose a novel image representation method in the sub-semantic space. First,
exemplar classifiers are trained by separating each training image from the
others and serve as a weak semantic similarity measurement. Then a graph is
constructed by combining the visual similarity and weak semantic similarity of
these training images. We partition this graph into visually and semantically
similar sub-sets. Each sub-set of images is then used to train classifiers
that separate this sub-set from the others. The learned sub-set classifiers
are then used to construct a sub-semantic space based representation of images.
This sub-semantic space is not only more semantically meaningful but also more
reliable and resistant to noise. Finally, we categorize images using this
sub-semantic-space-based representation on several public datasets to
demonstrate the effectiveness of the proposed method.
[12]
What are the distance metrics for local features?
Posters
/
Mao, Zhendong
/
Zhang, Yongdong
/
Tian, Qi
Proceedings of the 2013 ACM International Conference on Multimedia
2013-10-21
p.505-508
© Copyright 2013 ACM
Summary: Previous research has found that the distance metric for similarity
estimation is determined by the underlying data noise distribution. The
well-known Euclidean (L2) and Manhattan (L1) metrics are thus justified when
the additive noise is Gaussian or Exponential, respectively. However, finding a
suitable distance metric for local features is still a challenge when the
underlying noise distribution is unknown and could be neither Gaussian nor
Exponential. To address this issue, we introduce a modeling framework for
arbitrary noise distributions and propose a generalized distance metric for
local features based on this framework. We prove that the proposed distance is
equivalent to the L2 or the L1 distance when the noise is Gaussian or
Exponential, respectively. Furthermore, we justify the Hamming metric when the noise meets
the given conditions. In that case, the proposed distance is a linear mapping
of the Hamming distance. The proposed metric has been extensively tested on a
benchmark data set with five state-of-the-art local features: SIFT, SURF,
BRIEF, ORB and BRISK. Experiments show that our framework better models the
real noise distributions and that more robust results can be obtained by using
the proposed distance metric.
[13]
Locality preserving verification for image search
Posters
/
Pang, Shanmin
/
Xue, Jianru
/
Zheng, Nanning
/
Tian, Qi
Proceedings of the 2013 ACM International Conference on Multimedia
2013-10-21
p.529-532
© Copyright 2013 ACM
Summary: Establishing correct correspondences between two images has a wide range of
applications, such as 2D and 3D registration, structure from motion, and image
retrieval. In this paper, we propose a new matching method based on spatial
constraints. The proposed method has linear time complexity and is efficient
when applied to image retrieval. The main assumption behind our method is
that the local geometric structure among a feature point and its neighbors is
not easily affected by geometric and photometric transformations, and thus
should be preserved in their corresponding images. We model this local
geometric structure by linear coefficients that reconstruct the point from its
neighbors. The method is flexible, as it can not only estimate the number of
correct matches between two images efficiently, but also determine the
correctness of each match accurately. Furthermore, it is simple and easy to
implement. When applied to re-ranking images in an image search engine, the
proposed method outperforms state-of-the-art techniques.
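The key assumption above, that the linear coefficients reconstructing a point from its neighbors survive geometric transformations, can be checked with a small sketch. The three-neighbor configuration and the similarity transform below are illustrative; the paper's actual neighbor selection and verification rule are not reproduced:

```python
import numpy as np

def affine_coeffs(p, neighbors):
    """Coefficients that reconstruct p as an affine combination of its
    neighbors (exact with d+1 neighbors in d dimensions)."""
    M = np.vstack([neighbors.T, np.ones(len(neighbors))])  # append sum-to-1 row
    return np.linalg.solve(M, np.append(p, 1.0))

# A keypoint and its three neighbors in image 1...
p1 = np.array([2.0, 3.0])
N1 = np.array([[1.0, 1.0], [4.0, 2.0], [2.0, 6.0]])

# ...and their correspondences in image 2 under a similarity transform
# (rotation by 0.3 rad, scale 1.5, translation) -- an illustrative warp.
c, s = np.cos(0.3), np.sin(0.3)
S = 1.5 * np.array([[c, -s], [s, c]])
t = np.array([5.0, -2.0])
p2, N2 = S @ p1 + t, N1 @ S.T + t

c1 = affine_coeffs(p1, N1)
c2 = affine_coeffs(p2, N2)
print(np.allclose(c1, c2))  # True: the coefficients survive the transform
```

Because an affine combination commutes with any affine map, a correct match keeps the same coefficients in both images, while a wrong match generally does not; this is the property the verification step exploits.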
[14]
Undo the codebook bias by linear transformation for visual applications
Posters
/
Zhang, Chunjie
/
Zhang, Yifan
/
Wang, Shuhui
/
Pang, Junbiao
/
Liang, Chao
/
Huang, Qingming
/
Tian, Qi
Proceedings of the 2013 ACM International Conference on Multimedia
2013-10-21
p.533-536
© Copyright 2013 ACM
Summary: The bag of visual words model (BoW) and its variants have demonstrated their
effectiveness for visual applications and have been widely used by researchers.
The BoW model first extracts local features and generates the corresponding
codebook, the elements of a codebook are viewed as visual words. The local
features within each image are then encoded to get the final histogram
representation. However, the codebook is dataset dependent and has to be
generated for each image dataset. This incurs substantial computational cost and
weakens the generalization power of the BoW model. To solve these problems, in
this paper, we propose to undo the dataset bias by codebook linear
transformation. To represent every point within the local feature space using
Euclidean distance, the number of bases should be no less than the space
dimensions. Hence, each codebook can be viewed as a linear transformation of
these bases. In this way, we can transform the pre-learned codebooks for a new
dataset. However, not all of the visual words are equally important for the new
dataset; it is more effective to select the most discriminative visual words
for transformation using sparsity constraints. We propose an alternating
optimization algorithm to jointly search for the optimal linear transformation
matrices and the encoding
parameters. Image classification experimental results on several image datasets
show the effectiveness of the proposed method.
[15]
Static saliency vs. dynamic saliency: a comparative study
Scene understanding
/
Nguyen, Tam V.
/
Xu, Mengdi
/
Gao, Guangyu
/
Kankanhalli, Mohan
/
Tian, Qi
/
Yan, Shuicheng
Proceedings of the 2013 ACM International Conference on Multimedia
2013-10-21
p.987-996
© Copyright 2013 ACM
Summary: Recently, visual saliency has attracted wide attention from researchers in
the computer vision and multimedia fields. However, most of the visual
saliency-related research was conducted on still images for studying static
saliency. In this paper, we give a comprehensive comparative study for the
first time of dynamic saliency (video shots) and static saliency (key frames of
the corresponding video shots), and two key observations are obtained: 1) video
saliency is often different from, yet quite related with, image saliency, and
2) camera motions, such as tilting, panning or zooming, affect dynamic saliency
significantly. Motivated by these observations, we propose a novel camera
motion and image saliency aware model for dynamic saliency prediction. The
extensive experiments on two static-vs-dynamic saliency datasets collected by
us show that our proposed method outperforms the state-of-the-art methods for
dynamic saliency prediction. Finally, we also introduce the application of
dynamic saliency prediction to dynamic video captioning, helping people with
hearing impairments better enjoy videos with only off-screen voices,
e.g., documentary films, news videos, and sports videos.
[16]
Scale based region growing for scene text detection
Scene understanding
/
Mao, Junhua
/
Li, Houqiang
/
Zhou, Wengang
/
Yan, Shuicheng
/
Tian, Qi
Proceedings of the 2013 ACM International Conference on Multimedia
2013-10-21
p.1007-1016
© Copyright 2013 ACM
Summary: Scene text is widely observed in our daily life and has many important
multimedia applications. Unlike document text, scene text usually exhibits
large variations in font and language, and suffers from low resolution,
occlusions and complex background. In this paper, we present a novel
scale-based region growing algorithm for scene text detection. We first
distinguish SIFT features in text regions from those in background by exploring
the inter- and intra-statistics of SIFT features. Then scene text regions in
images are identified by scale-based region growing, which explores the
geometric context of SIFT keypoints in local regions. Our algorithm is very
effective at detecting multilingual text in various fonts and sizes against
complex backgrounds. In addition, it offers insights on efficiently deploying local
features in numerous applications, such as visual search. We evaluate our
algorithm on three datasets and achieve the state-of-the-art performance.
[17]
Extraction of Light Stripe Centerline Based on Self-adaptive Thresholding
and Contour Polygonal Representation
Ergonomics of Work with Computers
/
Tian, Qingguo
/
Yang, Yujie
/
Zhang, Xiangyu
/
Ge, Baozhen
DHM 2013: 4th International Conference on Digital Human Modeling and
Applications in Health, Safety, Ergonomics, and Risk Management, Part II: Human
Body Modeling and Ergonomics
2013-07-21
v.2
p.292-301
Keywords: centerline extraction; light stripe; integral image thresholding; polygon
representation; adaptive center of mass
© Copyright 2013 Springer-Verlag
Summary: Extracting the light stripe centerline is the key step in a line-structured
light scanning visual measurement system, as it directly determines the quality
of the three-dimensional point clouds obtained from images. Due to the
reflectivity and/or color of the object surface, illumination changes, and
other factors, the gray value and curvature of the light stripe in an image can
vary greatly, which makes it very difficult to extract the sub-pixel centerline
completely and precisely. This paper presents a novel method for efficient
light stripe centerline extraction. It combines the integral image thresholding
method, polygon representation of the light stripe contour, and the adaptive
center of mass method. It first locates the light stripe region and produces a
binary image regardless of how the gray values of the light stripe change
against the background. Then the contour of the light stripe is extracted and
approximately represented by a polygon. Based on the local orthogonal
relationship between the direction of the light stripe cross-section and the
corresponding polygon segment, the direction of the light stripe cross-section
is calculated quickly. Along this direction, sub-pixel centerline coordinates
are calculated using the adaptive center of mass method. 3D scanning
experiments with a human model dressed in a colorful swimsuit were conducted on
a self-designed line laser 3D scanning system. Comparisons of light stripe
segmentation using three thresholding methods, of the time used, and of the
smoothness show that the proposed method acquires satisfactory data. The mean
time per image does not exceed 5 ms, and the completeness and smoothness of the
point clouds acquired by the presented method are better than those of the
other two methods. This demonstrates the effectiveness and practicability of
the proposed method.
[18]
Automated description generation for indoor floor maps
Posters and demonstrations
/
Paladugu, Devi A.
/
Maguluri, Hima Bindu
/
Tian, Qiongjie
/
Li, Baoxin
Fourteenth Annual ACM SIGACCESS Conference on Assistive Technologies
2012-10-22
p.211-212
© Copyright 2012 ACM
Summary: People with visual impairment generally suffer from diminished freedom in
navigating an environment. A practical need is to navigate through unfamiliar
indoor environments such as school buildings, hotels, etc., for which
commonly-used existing tools like canes, seeing-eye dogs and GPS devices cannot
provide adequate support. We demonstrate a prototype system that aims at
addressing this practical need. The input to the system is the name of the
building/establishment supplied by a user, which is used by a web crawler to
determine the availability of a floor map on the corresponding website. If
available, the map is downloaded and used by the proposed system to generate a
verbal description giving an overview of the locations of key landmarks inside
the map with respect to one another. Our preliminary survey and experiments
indicate that this is a promising direction to pursue in supporting indoor
navigation for the visually impaired.
[19]
Exploring tag relevance for image tag re-ranking
Poster abstracts
/
Xiao, Jie
/
Zhou, Wengang
/
Tian, Qi
Proceedings of the 35th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2012-08-12
p.1069-1070
© Copyright 2012 ACM
Summary: In this paper, we propose to explore the relevance between tags for image
tag re-ranking. The key component is to define a global tag-tag similarity
matrix, which is achieved by analysis in both semantic and visual aspects. The
text semantic relevance is explored by the Latent Semantic Indexing (LSI) model
[1]. For the visual information, the tag-relevance can be propagated by
reconstructing exemplar images with visually and semantically consistent
images. Based on our tag relevance matrix, a random-walk approach is leveraged
to discover the significance of each tag. Finally, all tags in an image are
re-ranked by their significance values. Extensive experiments show its
effectiveness on an image dataset with a large tag vocabulary.
[20]
Efficient lp-norm multiple feature metric learning for image
categorization
Poster session: information retrieval
/
Wang, Shuhui
/
Huang, Qingming
/
Jiang, Shuqiang
/
Tian, Qi
Proceedings of the 2011 ACM Conference on Information and Knowledge
Management
2011-10-24
p.2077-2080
© Copyright 2011 ACM
Summary: Previous metric learning approaches are only able to learn a metric based
on a single concatenated multivariate feature representation. However, for many
real-world problems with multiple feature representations, such as image
categorization, models trained by previous approaches degrade because of the
sparsity brought by significant dimension growth and the uncontrolled influence
of each feature channel. In this paper, we propose an efficient distance
metric learning model which adapts Distance Metric Learning on multiple feature
representations. The aim is to learn the Mahalanobis matrices for each
independent feature and their non-sparse lp-norm weight coefficients
simultaneously, by maximizing the margin between the distances of pairs from
the same class and pairs from different classes under the overall learned
metric. We further extend this method to nonlinear kernel learning and
category-specific metric learning, demonstrating the applicability of using
many existing kernels for image data and of exploring the hierarchical
semantic structures of large-scale image datasets. Experiments on various datasets
demonstrate the promising power of our method.
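The combined metric described above, per-channel Mahalanobis distances mixed by lp-norm-constrained weights, can be sketched as follows. The toy matrices and weights are illustrative stand-ins; the actual margin-maximization learning step is omitted:

```python
import numpy as np

def multi_feature_distance(xs, ys, Ms, w):
    """Weighted sum of per-channel squared Mahalanobis distances.
    xs, ys: per-channel feature vectors; Ms: PSD Mahalanobis matrices;
    w: channel weights (e.g. constrained to unit lp-norm)."""
    return sum(wk * (x - y) @ Mk @ (x - y)
               for x, y, Mk, wk in zip(xs, ys, Ms, w))

rng = np.random.default_rng(1)
# Two feature channels of different dimensions (toy stand-ins).
x = [rng.standard_normal(4), rng.standard_normal(3)]
y = [rng.standard_normal(4), rng.standard_normal(3)]
Ms = [np.eye(4), 2.0 * np.eye(3)]          # toy per-channel metrics

# Non-sparse channel weights normalized to unit lp-norm (p = 2 here).
p = 2.0
w = np.array([0.8, 0.6])
w = w / np.linalg.norm(w, ord=p)

d = multi_feature_distance(x, y, Ms, w)
print(d >= 0)  # a valid squared distance is non-negative
```

Keeping the channels separate like this avoids concatenating them into one high-dimensional vector, which is exactly the sparsity problem the abstract attributes to earlier approaches.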
[21]
Auto-calibration of a Laser 3D Color Digitization System
Advances in Digital Human Modeling
/
Li, Xiaojie
/
Ge, Bao-zhen
/
Zhao, Dan
/
Tian, Qing-guo
/
Young, K. David
DHM 2009: 2nd International Conference on Digital Human Modeling
2009-07-19
p.691-699
Copyright © 2009 Springer-Verlag
Summary: A typical 3D color digitization system is composed of 3D sensors to obtain
3D information, and color sensors to obtain color information. Sensor
calibration plays a key role in determining the correctness and accuracy of the
3D color digitization data. In order to carry out the calibration quickly and
accurately, this paper introduces an automated calibration process that
utilizes 3D dynamic precision fiducials, with which calibration dot pairs are
extracted automatically and the corresponding data are processed via a
calibration algorithm. This automated process was experimentally verified to be
fast and effective. Both the 3D information and the color information are
extracted such that the 3D sensors and the color sensors are calibrated with
one automated calibration process. We believe it is the first such calibration
process for a 3D color digitization system.
[22]
Color 3D Digital Human Modeling and Its Applications to Animation and
Anthropometry
Part I: Shape and Movement Modeling and Anthropometry
/
Ge, Bao-zhen
/
Tian, Qing-guo
/
Young, K. David
/
Sun, Yu-chen
DHM 2007: 1st International Conference on Digital Human Modeling
2007-07-22
p.82-91
Copyright © 2007 Springer-Verlag
Summary: With the rapid advancement in laser technology, computer vision, and
embedded computing, the application of laser scanning to the digitization of
three dimensional physical realities has become increasingly widespread. In
this paper, we focus on research results embodied in a 3D human body color
digitization system developed at Tianjin University, and in collaboration with
the Hong Kong University of Science and Technology. In digital human modeling,
the first step involves the acquisition of the 3D human body data. We have over
the years developed laser scanning technological know-how from first principles
to support our research activities on building the first 3D digital human
database for ethnic Chinese. The disadvantage of the conventional laser
scanning is that surface color information is not contained in the point cloud
data. By adding color imaging sensors to the developed multi-axis laser
scanner, both the 3D human body coordinate data and the body surface color
mapping are acquired. Our latest development is focused on skeleton extraction
which is the key step towards human body animation, and applications to dynamic
anthropometry. For dynamic anthropometric measurements, we first use an
animation algorithm to adjust the 3D digital human to the required standard
posture for measurement, and then fix the feature points and feature planes
based on human body geometric characteristics. Utilizing the feature points,
feature planes, and the extracted human body skeleton, we have measured 40 key
sizes for the standing posture and the squat posture. These experimental
results are given, and the factors that affect the measurement precision are
analyzed through qualitative and quantitative analyses.
[23]
Content-based summarization for personal image library
Posters
/
Lim, Joo-Hwee
/
Li, Jun
/
Mulhem, Philippe
/
Tian, Qi
JCDL'03: Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital
Libraries
2003-05-27
p.393
Summary: With the accumulation of consumers' personal image libraries, the problem of
managing, browsing, querying, and presenting photos effectively and efficiently
becomes critical. We propose a framework for the automatic organization of
personal image libraries based on analysis of image creation time stamps and
image contents to facilitate browsing and summarization of images.