[1]
CCS-TA: quality-guaranteed online task allocation in compressive
crowdsensing
Sensing with crowd
/
Wang, Leye
/
Zhang, Daqing
/
Pathak, Animesh
/
Chen, Chao
/
Xiong, Haoyi
/
Yang, Dingqi
/
Wang, Yasha
Proceedings of the 2015 International Conference on Ubiquitous Computing
2015-09-07
p.683-694
© Copyright 2015 ACM
Summary: Data quality and budget are two primary concerns in urban-scale mobile
crowdsensing applications. In this paper, we leverage the spatial and temporal
correlation among the data sensed in different sub-areas to significantly
reduce the required number of sensing tasks allocated (corresponding to
budget) while still ensuring data quality. Specifically, we propose a novel
framework called CCS-TA, combining the state-of-the-art compressive sensing,
Bayesian inference, and active learning techniques, to dynamically select a
minimum number of sub-areas for sensing task allocation in each sensing cycle,
while deducing the missing data of unallocated sub-areas under a probabilistic
data accuracy guarantee. Evaluations on real-life temperature and air quality
monitoring datasets show the effectiveness of CCS-TA. In the case of
temperature monitoring, CCS-TA allocates 18.0-26.5% fewer tasks than baseline
approaches, allocating tasks to only 15.5% of the sub-areas on average while
keeping overall sensing error below 0.25°C in 95% of the cycles.
[2]
Multi-source Information Fusion for Personalized Restaurant Recommendation
Short Papers
/
Sun, Jing
/
Xiong, Yun
/
Zhu, Yangyong
/
Liu, Junming
/
Guan, Chu
/
Xiong, Hui
Proceedings of the 2015 Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2015-08-09
p.983-986
© Copyright 2015 ACM
Summary: In this paper, we study the problem of personalized restaurant
recommendations. Specifically, we develop a probabilistic factor analysis
framework, named RMSQ-MF, which can exploit multi-source
information, such as the users' task, their friends' preferences, and human
mobility patterns, for personalized restaurant recommendations. The rationale
of this work is motivated by two observations. First, people's preferences can
be affected by their friends. Second, human mobility patterns can reflect the
popularity of restaurants to a certain degree. Finally, empirical studies on
real-world data demonstrate that the proposed method outperforms benchmark
methods with a significant margin.
[3]
Remix in 3D Printing: What your Sources say About You
WebSci Track Papers & Posters
/
Papadimitriou, Spiros
/
Papalexakis, Evangelos
/
Liu, Bin
/
Xiong, Hui
Companion Proceedings of the 2015 International Conference on the World Wide
Web
2015-05-18
v.2
p.367-368
© Copyright 2015 ACM
Summary: Concurrently with the recent, rapid adoption of 3D printing technologies,
online sharing of 3D-printable designs is growing equally rapidly, even though
it has received far less attention. We study remix relationships on
Thingiverse, the dominant online repository and social network for 3D printing.
We collected data of designs published over five years, and we find that remix
ties exhibit both homophily and inverse-homophily across numerous key metrics,
which is stronger compared to other kinds of social and content links. This may
have implications on graph prediction tasks, as well as on the design of
3D-printable content repositories.
[4]
Fused one-vs-all mid-level features for fine-grained visual categorization
Multimedia Analysis and Mining
/
Zhang, Xiaopeng
/
Xiong, Hongkai
/
Zhou, Wengang
/
Tian, Qi
Proceedings of the 2014 ACM International Conference on Multimedia
2014-11-03
p.287-296
© Copyright 2014 ACM
Summary: As an emerging research topic, fine-grained visual categorization has been
attracting growing attention in recent years. Due to the large inter-class
similarity and intra-class variance, recognizing objects in fine-grained
domains is extremely challenging, and sometimes even humans cannot recognize
them accurately. The traditional bag-of-words model can obtain desirable results
for basic-level category classification by weak alignment using the spatial pyramid
matching model, but may easily fail in fine-grained domains since the
discriminative features are not only subtle but also extremely localized. The
fine differences often get swamped by those irrelevant features, and it is
virtually impossible to distinguish them. To address the problems above, we
propose a new framework for fine-grained visual categorization. We strengthen
the spatial correspondence among parts by including foreground segmentation and
part localization. Based on the part representations of the images, we learn a
large set of mid-level features which are more suitable for fine-grained tasks.
Compared with the low-level features directly extracted from the images, the
learned one-vs-all mid-level features enjoy the following advantages. First,
the dimension of the mid-level features is relatively small. In order to obtain
high classification accuracy, the dimension of the low-level features usually
reaches several thousand to tens of thousands, and becomes even larger when
introducing the spatial pyramid model. However, the dimension of our mid-level
features is related to the number of classes, which is far less. Second, each
entry of the proposed mid-level features is meaningful, which forms a more
compact representation of the image. Third, the mid-level features are more
robust than the low-level ones, which is helpful for classification. Fourth,
the learning process of the mid-level features is independent and can be easily
combined with other techniques to boost the performance. We evaluate the
proposed approach on the extensive fine-grained datasets CUB 200-2011 and
Stanford Dogs. By learning the mid-level features based on the popular Fisher
vectors and convolutional neural networks, we boost the classification accuracy
by a considerable margin and advance the state-of-the-art performance in
fine-grained visual categorization.
[5]
Influence Maximization over Large-Scale Social Networks: A Bounded Linear
Approach
KM Session 1: Social Networks & Social Media I
/
Liu, Qi
/
Xiang, Biao
/
Chen, Enhong
/
Xiong, Hui
/
Tang, Fangshuang
/
Yu, Jeffrey Xu
Proceedings of the 2014 ACM Conference on Information and Knowledge
Management
2014-11-03
p.171-180
© Copyright 2014 ACM
Summary: Information diffusion in social networks is emerging as a promising solution
to successful viral marketing, which relies on the effective and efficient
identification of a set of nodes with the maximal social influence. While there
are tremendous efforts on the development of social influence models and
algorithms for social influence maximization, limited progress has been made in
terms of designing both efficient and effective algorithms for finding a set of
nodes with the maximal social influence. To this end, in this paper, we provide
a bounded linear approach for influence computation and influence maximization.
Specifically, we first adopt a linear and tractable approach to describe the
influence propagation. Then, we develop a quantitative metric, named
Group-PageRank, to quickly estimate the upper bound of the social influence
based on this linear approach. More importantly, we provide two algorithms,
Linear and Bound, which exploit the linear approach and Group-PageRank for
social influence maximization. Finally, extensive experimental results
demonstrate that (a) the adopted linear approach has a close relationship with
traditional models and Group-PageRank provides a good estimation of social
influence; (b) Linear and Bound can quickly find a set of the most influential
nodes and both of them are scalable for large-scale social networks.
[6]
Multi-task Multi-view Learning for Heterogeneous Tasks
KM Session 5: Classification II
/
Jin, Xin
/
Zhuang, Fuzhen
/
Xiong, Hui
/
Du, Changying
/
Luo, Ping
/
He, Qing
Proceedings of the 2014 ACM Conference on Information and Knowledge
Management
2014-11-03
p.441-450
© Copyright 2014 ACM
Summary: Multi-task multi-view learning deals with the learning scenarios where
multiple tasks are associated with each other through multiple shared feature
views. All previous works for this problem assume that the tasks use the same
set of class labels. However, in the real world there exist quite a few
applications where the tasks with several views correspond to different sets of
class labels. This new learning scenario is called Multi-task Multi-view
Learning for Heterogeneous Tasks in this study. Then, we propose a Multi-tAsk
MUlti-view Discriminant Analysis (MAMUDA) method to solve this problem.
Specifically, this method collaboratively learns the feature transformations
for different views in different tasks by exploring the shared task-specific
and problem intrinsic structures. Additionally, the MAMUDA method is convenient
for solving multi-class classification problems. Finally, experiments on two
real-world problems demonstrate the effectiveness of MAMUDA for heterogeneous
tasks.
[7]
Predicting the Popularity of Online Serials with Autoregressive Models
KM Session 17: Web Data Mining
/
Chang, Biao
/
Zhu, Hengshu
/
Ge, Yong
/
Chen, Enhong
/
Xiong, Hui
/
Tan, Chang
Proceedings of the 2014 ACM Conference on Information and Knowledge
Management
2014-11-03
p.1339-1348
© Copyright 2014 ACM
Summary: Recent years have witnessed the rapid prevalence of online serials, which
play an important role in our daily entertainment. A critical demand along this
line is to predict the popularity of online serials, which can enable a wide
range of applications, such as online advertising, and serial recommendation.
However, compared with traditional online media such as user-generated content
(UGC), online serials have unique characteristics of sequence dependence,
release date dependence as well as unsynchronized update regularity. Therefore,
the popularity prediction for online serials is a nontrivial task and still
under-addressed. To this end, in this paper we present a comprehensive study
for predicting the popularity of online serials with autoregressive models.
Specifically, we first introduce a straightforward yet effective Naive
Autoregressive (NAR) model based on the correlations of serial episodes.
Furthermore, we develop a sophisticated model, namely Transfer Autoregressive
(TAR) model, to capture the dynamic behaviors of audiences, which can achieve
better prediction performance than the NAR model. Indeed, the two models can
reveal the popularity generation from different perspectives. In addition, as a
derivative of the TAR model, we also design a novel metric, namely favor, for
evaluating the quality of online serials. Finally, extensive experiments on two
real-world data sets clearly show that both models are effective and outperform
baselines in terms of the popularity prediction for online serials. Moreover, the
new metric performs better than other metrics for quality estimation.
[8]
Eye Glance Behavior Associated with Cell-Phone Use: Examination with
Naturalistic Driving Data
Surface Transportation: ST4 -- Naturalistic Driving Research
/
Bao, Shan
/
Flannagan, Carol
/
Xiong, Huimin
/
Sayer, Jim
Proceedings of the Human Factors and Ergonomics Society 2014 Annual Meeting
2014-10-27
p.2112-2116
doi 10.1177/1541931214581444
© Copyright 2014 HFES
Summary: The purpose of this study is to examine eye-glance patterns of drivers
engaged in cell phone related tasks. To observe eye-glance patterns,
researchers used naturalistic driving data from the Integrated Vehicle-Based
Safety Systems field operational test to construct and tabulate two datasets.
One dataset included gaze data that were coded from cell phone conversation
clips by fifty different drivers under different driving conditions. The second
dataset was created in a similar way using video clips from twenty-four drivers
who engaged in visual-manual tasks (e.g., texting and dialing). Mixed-model
analyses were conducted. Results showed that drivers' on-road gazes were longer
when they were engaged in a cell phone conversation than when they were not
engaged. Off-road gaze length was the same, regardless of task involvement. In
contrast, drivers who engaged in visual-manual tasks had substantially shorter
on-road gaze length compared to when those same drivers were not involved in
visual-manual tasks.
[9]
CrowdRecruiter: selecting participants for piggyback crowdsensing under
probabilistic coverage constraint
Sensing the crowd
/
Zhang, Daqing
/
Xiong, Haoyi
/
Wang, Leye
/
Chen, Guanling
Proceedings of the 2014 International Joint Conference on Pervasive and
Ubiquitous Computing
2014-09-13
v.1
p.703-714
© Copyright 2014 ACM
Summary: This paper proposes a novel participant selection framework, named
CrowdRecruiter, for mobile crowdsensing. CrowdRecruiter operates on top of
the energy-efficient Piggyback Crowdsensing (PCS) task model and minimizes
incentive payments by selecting a small number of participants while still
satisfying a probabilistic coverage constraint. In order to achieve the objective
when piggybacking crowdsensing tasks with phone calls, CrowdRecruiter first
predicts the call and coverage probability of each mobile user based on
historical records. It then efficiently computes the joint coverage probability
of multiple users as a combined set and selects the near-minimal set of
participants that meets the coverage ratio requirement in each sensing cycle of
the PCS task. We evaluated CrowdRecruiter extensively using a large-scale
real-world dataset and the results show that the proposed solution
significantly outperforms three baseline algorithms by selecting 10.0%-73.5%
fewer participants on average under the same probabilistic coverage constraint.
[10]
Cost-Aware Collaborative Filtering for Travel Tour Recommendations
/
Ge, Yong
/
Xiong, Hui
/
Tuzhilin, Alexander
/
Liu, Qi
ACM Transactions on Information Systems
2014-01
v.32
n.1
p.4
© Copyright 2014 ACM
Summary: Advances in tourism economics have enabled us to collect massive amounts of
travel tour data. If properly analyzed, this data could be a source of rich
intelligence for providing real-time decision making and for the provision of
travel tour recommendations. However, tour recommendation is quite different
from traditional recommendations, because the tourist's choice is affected
directly by the travel costs, which include both financial and time costs. To
that end, in this article, we provide a focused study of cost-aware tour
recommendation. Along this line, we first propose two ways to represent user
cost preference. One way is to represent user cost preference by a
two-dimensional vector. Another way is to consider the uncertainty about the
cost that a user can afford and introduce a Gaussian prior to model user cost
preference. With these two ways of representing user cost preference, we
develop different cost-aware latent factor models by incorporating the cost
information into the probabilistic matrix factorization (PMF) model, the
logistic probabilistic matrix factorization (LPMF) model, and the maximum
margin matrix factorization (MMMF) model, respectively. When applied to
real-world travel tour data, all the cost-aware recommendation models
consistently outperform existing latent factor models with a significant
margin.
[11]
Ranking fraud detection for mobile apps: a holistic view
KM track: mobile and event mining
/
Zhu, Hengshu
/
Xiong, Hui
/
Ge, Yong
/
Chen, Enhong
Proceedings of the 2013 ACM Conference on Information and Knowledge
Management
2013-10-27
p.619-628
© Copyright 2013 ACM
Summary: Ranking fraud in the mobile App market refers to fraudulent or deceptive
activities which have a purpose of bumping up the Apps in the popularity list.
Indeed, it is becoming more and more frequent for App developers to use shady means,
such as inflating their Apps' sales or posting phony App ratings, to commit
ranking fraud. While the importance of preventing ranking fraud has been widely
recognized, there is limited understanding and research in this area. To this
end, in this paper, we provide a holistic view of ranking fraud and propose a
ranking fraud detection system for mobile Apps. Specifically, we investigate
two types of evidence, ranking-based evidence and rating-based evidence, by
modeling Apps' ranking and rating behaviors through statistical hypothesis
tests. In addition, we propose an optimization-based aggregation method to
integrate all the evidence for fraud detection. Finally, we evaluate the
proposed system with real-world App data collected from Apple's App Store
for a long time period. In the experiments, we validate the effectiveness of
the proposed system, and show the scalability of the detection algorithm as
well as some regularity of ranking fraud activities.
[12]
Drivers' Selected Settings for Adaptive Cruise Control (ACC): Implications
for Long-Term Use
Surface Transportation: ST6 -- In-Vehicle Driver Support Systems
/
Xiong, Huimin
/
Boyle, Linda Ng
Proceedings of the Human Factors and Ergonomics Society 2013 Annual Meeting
2013-09-30
p.1928-1932
doi 10.1177/1541931213571431
© Copyright 2013 HFES
Summary: Adaptive Cruise Control (ACC) is a system that assists drivers on
longitudinal control by automatically adjusting the throttle. Users can set the
speed and gap setting based on their driving preferences. In this study,
drivers' ACC use pattern and selection choices are examined based on their
level of experience and geographical location. Experienced ACC users from urban
settings in Washington are compared to those from less urbanized areas in Iowa.
Information on novice ACC users was also collected in Washington and compared
with experienced ACC users within the same area. The outcomes show that
although similar use patterns do exist, there are differences in geographical
locations and experience levels that impact drivers' choice of ACC settings. In
Iowa, experienced ACC drivers select faster speed and closer time headway
distance than drivers in Washington State. This suggests that use of ACC differs
given environmental surroundings. Within Washington, experienced ACC users set
faster speed, closer time headway distance, and intervened less compared with
novice ACC users. This suggests that drivers' behavior may change with greater
exposure to ACC, which can provide insights on drivers' automation reliance
after extended use.
[13]
effSense: energy-efficient and cost-effective data uploading in mobile
crowdsensing
Workshop: PUCAA: 1st international workshop on pervasive urban crowdsensing
architecture and applications
/
Wang, Leye
/
Zhang, Daqing
/
Xiong, Haoyi
Adjunct Proceedings of the 2013 International Joint Conference on Pervasive
and Ubiquitous Computing
2013-09-08
v.2
p.1075-1086
© Copyright 2013 ACM
Summary: Energy consumption and mobile data cost are two key factors affecting users'
willingness to participate in crowdsensing tasks. While data-plan users are
mostly concerned about the energy consumption, non-data-plan users are more
sensitive to data transmission cost incurred. Traditional ways of data
collection in mobile crowdsensing often go to two extremes: either uploading
the sensed data online in real-time or fully offline after the whole sensing
task is finished. In this paper, we propose effSense -- a novel
energy-efficient and cost-effective data uploading framework leveraging the
delay-tolerant mechanisms. Specifically, effSense reduces the data cost of
non-data-plan users by maximally offloading the data to Bluetooth/WiFi gateways
or data-plan users encountered to relay the data to the server; it reduces
energy consumption of data-plan users by uploading data in parallel with a call
or using less energy-demanding networks (e.g., Bluetooth). By leveraging the
prediction of critical events such as users' future calls or encounters,
effSense selects the optimal uploading scheme for both types of users. Our
evaluation with the MIT Reality Mining and Nodobo datasets shows that effSense can
save 55%~65% energy and 45%~50% data cost for the two types of users,
respectively, compared with the traditional uploading schemes.
[14]
Detecting and Tracking Topics and Events from Web Search Logs
/
Liu, Hongyan
/
He, Jun
/
Gu, Yingqin
/
Xiong, Hui
/
Du, Xiaoyong
ACM Transactions on Information Systems
2012-11
v.30
n.4
p.21
© Copyright 2012 ACM
Summary: Recent years have witnessed increased efforts on detecting topics and events
from Web search logs, since this kind of data not only captures Web content but
also reflects the users' activities. However, the majority of existing work is
focused on exploiting clustering techniques for topic and event detection. Due
to the huge size and the evolving nature of Web data, existing clustering
approaches are limited in their ability to meet real-time demand. To that end, in this
article, we propose a method called LETD to detect evolving topics in a timely
manner. Also, we design the techniques to extract events from topics and to
infer the evolving relationship among the events. For topic detection, we first
provide a measurement to select the important URLs, which are most likely to
describe a real-life topic. Then, starting from these selected URLs, we exploit
the local expansion method to find other topic-related URLs. Moreover, in the
LETD framework, we design algorithms based on Random Walk and Markov Random
Fields (MRF), respectively. Because the LETD method exploits a
divide-and-conquer strategy to process the data, it is more efficient than
existing methods based on clustering techniques. To better illustrate the LETD
framework, we develop a demo system StoryTeller which can discover hot topics
and events, infer the evolving relationships among events, and visualize
information in a storytelling way. This demo system can provide a global view
of the topic development and help users target the interesting events more
conveniently. Finally, experimental results on real-world Microsoft
click-through data have shown that StoryTeller can find real-life hot topics
and meaningful evolving relationships among events, and have also demonstrated
the efficiency and effectiveness of the LETD method.
[15]
Exploiting enriched contextual information for mobile app classification
Knowledge management short paper session
/
Zhu, Hengshu
/
Cao, Huanhuan
/
Chen, Enhong
/
Xiong, Hui
/
Tian, Jilei
Proceedings of the 2012 ACM Conference on Information and Knowledge
Management
2012-10-29
p.1617-1621
© Copyright 2012 ACM
Summary: A key step for the mobile app usage analysis is to classify apps into some
predefined categories. However, it is a nontrivial task to effectively classify
mobile apps due to the limited contextual information available for the
analysis. To this end, in this paper, we propose an approach to first enrich
the contextual information of mobile apps by exploiting the additional Web
knowledge from the Web search engine. Then, inspired by the observation that
different types of mobile apps may be relevant to different real-world
contexts, we also extract some contextual features for mobile apps from the
context-rich device logs of mobile users. Finally, we combine all the enriched
contextual information into a Maximum Entropy model for training a mobile app
classifier. The experimental results based on 443 mobile users' device logs
clearly show that our approach outperforms two state-of-the-art benchmark
methods with a significant margin.
[16]
Influential seed items recommendation
Short papers
/
Liu, Qi
/
Xiang, Biao
/
Chen, Enhong
/
Ge, Yong
/
Xiong, Hui
/
Bao, Tengfei
/
Zheng, Yi
Proceedings of the 2012 ACM Conference on Recommender Systems
2012-09-09
p.245-248
© Copyright 2012 ACM
Summary: In this paper, we present a systematic perspective study on choosing and
evaluating the initial seed items that will be recommended to the cold start
users. We first construct an item consumption correlation network to capture
the existing users' general consumption behaviors. Then, we formalize initial
items recommendation as the influential seed set selection problem. Along this
line, we present several methods, each of which selects seed items according to
different rules. Finally, the experimental results on two real-world data sets
verify that with different seed items, the users' consumption numbers will be
quite different. Meanwhile, the results also provide many deep insights into
these selection methods and their recommended seed items.
[17]
Accelerate TV-L1 optical flow with edge-based image decomposition and its
implementation on mobile phone
/
Wang, Botao
/
Zhu, Qingxiang
/
Xiong, Hongkai
/
Luo, Chuanfei
Proceedings of the 2011 International Conference on Mobile and Ubiquitous
Multimedia
2011-12-07
p.144-151
© Copyright 2011 ACM
Summary: Variational methods are among the most accurate techniques of optical flow
computation. TV-L1 optical flow, which is based on L1-norm data fidelity term
and total variation (TV) regularization term, preserves discontinuities in the
flow field and also can deal with large displacements. However, the TV-L1
optical flow method is inaccurate near edges and computationally intensive. In
this paper, we propose a technique, called Edge-based Image Decomposition
(EID), to improve the accuracy in the edge areas and also accelerate the
original TV-L1 method. EID improves the performance by decomposing the image into
edge regions and flat regions, and also assigns computing power
discriminatively. We evaluated our algorithm on the Middlebury datasets and showed
that by applying EID, 30% of run-time can be saved with no loss in accuracy,
and with the same run-time, accuracy can be improved by 7%. In addition, we
implemented our EID-enhanced TV-L1 optical flow algorithm on a mobile phone running the
Android operating system. Our application calculates the optical flow field
between two images and can be used to generate the disparity map and
reconstruct 3D scenes.
[18]
Towards expert finding by leveraging relevant categories in authority
ranking
Poster session: knowledge management
/
Zhu, Hengshu
/
Cao, Huanhuan
/
Xiong, Hui
/
Chen, Enhong
/
Tian, Jilei
Proceedings of the 2011 ACM Conference on Information and Knowledge
Management
2011-10-24
p.2221-2224
© Copyright 2011 ACM
Summary: How to improve authority ranking is a crucial research problem for expert
finding. In this paper, we propose a novel framework for expert finding based
on the authority information in the target category as well as the relevant
categories. First, we develop a scalable method for measuring the relevancy
between categories through topic models. Then, we provide a link analysis
approach for ranking user authority by considering the information in both the
target category and the relevant categories. Finally, the extensive experiments
on two large-scale real-world Q&A data sets clearly show that the proposed
method outperforms the baseline methods with a significant margin.
[19]
Collaborative filtering with collective training
Poster session 2
/
Ge, Yong
/
Xiong, Hui
/
Tuzhilin, Alexander
/
Liu, Qi
Proceedings of the 2011 ACM Conference on Recommender Systems
2011-10-23
p.281-284
© Copyright 2011 ACM
Summary: Rating sparsity is a critical issue for collaborative filtering. For
example, the well-known Netflix Movie rating data contain ratings for only about
1% of user-item pairs. One way to address this rating sparsity problem is to
develop more effective methods for training rating prediction models. To this
end, in this paper, we introduce a collective training paradigm to
automatically and effectively augment the training ratings. Essentially, the
collective training paradigm builds multiple different Collaborative Filtering
(CF) models separately, and augments the training ratings of each CF model by
using the partial predictions of other CF models for unknown ratings. Along
this line, we develop two algorithms, Bi-CF and Tri-CF, based on collective
training. For Bi-CF and Tri-CF, we collectively train two and three different
CF models, respectively, by iteratively augmenting the training ratings of each
individual CF model. We also design different criteria to guide the selection
of augmented training ratings for Bi-CF and Tri-CF. Finally, the experimental
results show that Bi-CF and Tri-CF algorithms can significantly outperform
baseline methods, such as neighborhood-based and SVD-based models.
[20]
PHD-THESIS
Combining subject expert experimental data with standard data in Bayesian
mixture modeling
/
Xiong, Hui
/
Allen, Theodore
2011
Columbus, Ohio
Ohio State University, Industrial and Systems Engineering
Keywords: Quality engineering
Keywords: Bayesian mixture model
Keywords: Topic model
Keywords: Unstructured data
Keywords: Freestyle text
Keywords: Collapsed Gibbs Sampling
Keywords: Text mining
Keywords: Data mining
Keywords: Human computer interaction
Keywords: Subject matter expert
Summary: Engineers face many quality-related datasets containing free-style text or
images. For example, a database could include summaries of complaints filed by
customers, or descriptions of the causes of rework or maintenance or of the
associated actions taken, or a collection of quality inspection images of
welded tubes. The goal of this dissertation is to enable engineers to input a
database of free-style text or image data and then obtain a set of clusters or
"topics" with intuitive definitions and information about the degree of
commonality that together helps prioritize system improvement. The proposed
methods generate Pareto charts of ranked clusters or topics with their
interpretability improved by input from the analyst or method user. The
combination of subject matter expert data with standard data is the novel
feature of the methods considered. Prior to the methods proposed here, analysts
applied Bayesian mixture models and had limited recourse if the cluster or
topic definitions failed to be interpretable or were at odds with the knowledge
of subject matter experts. The associated "Subject Matter Expert Refined Topic"
(SMERT) model permits on-going knowledge elicitation and high-level human
expert data integration to address the issues regarding: (1) unsupervised topic
models often produce results to user, and (2) to provide a "Hierarchical
Analysis Designed Latency Experiment" (HANDLE) for human experts to interact
with the model results. If groupings are missing key elements, so-called
"boosting" these elements is possible. If certain members of a cluster are
nonsensical or nonphysical, so-called "zapping" these nonsensical elements is
possible. We also describe a fast Collapsed Gibbs Sampling (CGS) algorithm for
the SMERT method, which offers the capacity to fit SMERT models to large
datasets efficiently but which is associated with approximations in certain cases. We use
three case studies to illustrate the proposed methods. The first relates to
scrap text reports for a Chinese manufacturer of stone products. The second
relates to laser welding of tube joints and images characterizing bead shape.
The third case study relates to consumer reports text user reviews of the
Toyota Camry. The user reviews cover 10 years and the widely publicized
acceleration issue. In all cases, the SMERT models help provide interpretable
groupings of records in a way that could facilitate data-driven prioritization
of improvement actions.
[21]
Collaborative Dual-PLSA: mining distinction and commonality across multiple
domains for text classification
KM track: classification and clustering
/
Zhuang, Fuzhen
/
Luo, Ping
/
Shen, Zhiyong
/
He, Qing
/
Xiong, Yuhong
/
Shi, Zhongzhi
/
Xiong, Hui
Proceedings of the 2010 ACM Conference on Information and Knowledge
Management
2010-10-26
p.359-368
© Copyright 2010 ACM
Summary: The distribution difference among multiple data domains has been considered
for the cross-domain text classification problem. In this study, we show two
new observations along this line. First, the data distribution difference may
come from the fact that different domains use different key words to express
the same concept. Second, the association between this conceptual feature and
the document class may be stable across domains. These two issues are actually
the distinction and commonality across data domains.
Inspired by the above observations, we propose a generative statistical
model, named Collaborative Dual-PLSA (CD-PLSA), to simultaneously capture both
the domain distinction and commonality among multiple domains. Unlike
Probabilistic Latent Semantic Analysis (PLSA), which has only one latent variable,
the proposed model has two latent factors, y and z, corresponding to word
concept and document class, respectively. The shared commonality intertwines
with the distinctions over multiple domains, and is also used as the bridge for
knowledge transformation. We exploit an Expectation Maximization (EM) algorithm
to learn this model, and also propose its distributed version to handle the
situation where the data domains are geographically separated from each other.
Finally, we conduct extensive experiments over hundreds of classification tasks
with multiple source domains and multiple target domains to validate the
superiority of the proposed CD-PLSA model over existing state-of-the-art
methods of supervised and transfer learning. In particular, we show that
CD-PLSA is more tolerant of distribution differences.
[22]
Exploiting user interests for collaborative filtering: interests expansion
via personalized ranking
Poster session 3: KM track
/
Liu, Qi
/
Chen, Enhong
/
Xiong, Hui
/
Ding, Chris H. Q.
Proceedings of the 2010 ACM Conference on Information and Knowledge
Management
2010-10-26
p.1697-1700
© Copyright 2010 ACM
Summary: In real applications, a given user buys or rates an item based on his/her
interests. Learning to leverage this interest information is often critical for
recommender systems. However, in existing recommender systems, the information
about latent user interests is largely under-explored. To that end, in this
paper, we propose an interest expansion strategy via personalized ranking based
on the topic model, named iExpand, for building an interest-oriented
collaborative filtering framework. The iExpand method introduces a three-layer,
user-interest-item, representation scheme, which leads to more interpretable
recommendation results and helps the understanding of the interactions among
users, items, and user interests. Moreover, iExpand strategically deals with
several issues, such as the overspecialization and cold-start problems.
Finally, we evaluate iExpand on benchmark data sets, and experimental results
show that iExpand outperforms state-of-the-art methods.
[23]
Top-Eye: top-k evolving trajectory outlier detection
Poster session 3: KM track
/
Ge, Yong
/
Xiong, Hui
/
Zhou, Zhi-hua
/
Ozdemir, Hasan
/
Yu, Jannite
/
Lee, K. C.
Proceedings of the 2010 ACM Conference on Information and Knowledge
Management
2010-10-26
p.1733-1736
© Copyright 2010 ACM
Summary: The increasing availability of large-scale location traces creates
unprecedented opportunities to change the paradigm for identifying abnormal
moving activities. Indeed, various aspects of abnormality of moving patterns
have recently been exploited, such as wrong direction and wandering. However,
there is no recognized way of combining different aspects into a unified
evolving abnormality score that can capture the evolving nature
of abnormal moving trajectories. To that end, in this paper, we provide an
evolving trajectory outlier detection method, named TOP-EYE, which continuously
computes the outlying score for each trajectory in an accumulating way.
Specifically, in TOP-EYE, we introduce a decay function to mitigate the
influence of the past trajectories on the evolving outlying score, which is
defined based on the evolving moving direction and density of trajectories.
This decay function enables the evolving computation of accumulated outlying
scores along the trajectories. An advantage of TOP-EYE is that it identifies
evolving outliers at a very early stage with a relatively low false alarm rate. Finally,
experimental results on real-world location traces show that TOP-EYE can
effectively capture evolving abnormal trajectories.
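Note: the accumulation idea in the summary above can be illustrated with a
minimal sketch. The summary does not give the decay function, so the
exponential form, the function name accumulate_outlying_scores, and the
decay_rate parameter below are all assumptions made for illustration, not the
authors' definitions.

```python
import math


def accumulate_outlying_scores(instant_scores, decay_rate=0.5):
    """Return the running, decayed outlying score after each time step.

    Assumption: the decay is exponential, so the score carried over from
    earlier steps is multiplied by exp(-decay_rate) at every new step.
    """
    accumulated = []
    running = 0.0
    for score in instant_scores:
        # Older evidence is down-weighted before the new score is added.
        running = score + math.exp(-decay_rate) * running
        accumulated.append(running)
    return accumulated
```

With this form, a single early anomaly fades over time unless later steps
keep adding to it, which is one way an evolving score can favor recent
behavior.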
[24]
Enhancing recommender systems under volatile user interest drifts
IR personalization and social search I
/
Cao, Huanhuan
/
Chen, Enhong
/
Yang, Jie
/
Xiong, Hui
Proceedings of the 2009 ACM Conference on Information and Knowledge
Management
2009-11-02
p.1257-1266
© Copyright 2009 ACM
Summary: This paper presents a systematic study of how to enhance recommender systems
under volatile user interest drifts. A key development challenge along this
line is how to track user interests dynamically. To this end, we first define
four types of interest patterns to understand users' rating behaviors and
analyze the properties of these patterns. We also propose a rating graph and
rating chain based approach for detecting these interest patterns. For each
user's rating series, a rating graph and a rating chain are constructed based
on the similarities between rated items. The type of a given user's interest
pattern is identified through the density of the corresponding rating graph and
the continuity of the corresponding rating chain. In addition, we propose a
general algorithm framework for improving recommender systems by exploiting
these identified patterns. Finally, experimental results on a real-world data
set show that the proposed rating graph based approach is effective for
detecting user interest patterns, which in turn helps to improve the performance
of recommender systems.
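Note: a minimal sketch of the two quantities named in the summary above. The
summary does not define them, so the definitions here are assumptions: density
is taken as the fraction of item pairs whose similarity reaches a threshold,
and continuity as the fraction of consecutive items in the rating series that
do so. The function names, the similarity callback, and the threshold are
hypothetical.

```python
from itertools import combinations


def rating_graph_density(items, similarity, threshold=0.5):
    """Fraction of all item pairs linked by a sufficiently high similarity."""
    n = len(items)
    possible = n * (n - 1) // 2
    if possible == 0:
        return 0.0
    edges = sum(1 for a, b in combinations(items, 2)
                if similarity(a, b) >= threshold)
    return edges / possible


def rating_chain_continuity(items, similarity, threshold=0.5):
    """Fraction of consecutive ratings whose items are sufficiently similar."""
    links = list(zip(items, items[1:]))
    if not links:
        return 0.0
    kept = sum(1 for a, b in links if similarity(a, b) >= threshold)
    return kept / len(links)
```

Under these assumed definitions, a user whose interest is stable would show
high density and high continuity, while a user whose interest drifts would
show lower values of one or both.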
[25]
What's behind topic formation and development: a perspective of community
core groups
Poster session 6: IR track
/
Qian, Tieyun
/
Li, Qing
/
Liu, Bing
/
Xiong, Hui
/
Srivastava, Jaideep
/
Sheu, Phillip
Proceedings of the 2009 ACM Conference on Information and Knowledge
Management
2009-11-02
p.1843-1846
© Copyright 2009 ACM
Summary: Over the past several years, there has been great interest in topic
detection and tracking (TDT). Recently, analyzing general research trends from
large collections of historical documents has also attracted considerable attention.
However, existing work on TDT mainly focuses on overall trend analysis, and is
unable to address questions such as "what determines the evolution of a topic?"
and "when and how does a new topic get formed?".
In this paper, we propose a core group model to explain the dynamics and
further segment topic development. According to the division phase and
interphase in the life cycle of a core group, a topic is separated into four
states, i.e., birth state, extending state, saturation state, and shrinkage
state. Experimental results on a real dataset show that the division of a core
group leads to the generation of a new topic, and the progress of an entire
topic is closely correlated to the growth of a core group during its
interphase.