[1]
CricketLinking: Linking Event Mentions from Cricket Match Reports to Ball
Entities in Commentaries
Demonstrations
/
Gupta, Manish
Proceedings of the 2015 Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2015-08-09
p.1033-1034
© Copyright 2015 ACM
Summary: The 2011 Cricket World Cup final match was watched by around 135 million
people. Such a huge viewership demands a great experience for users of online
cricket portals. Many portals like espncricinfo.com host a variety of content
related to recent matches including match reports and ball-by-ball
commentaries. When reading a match report, reader experience can be
significantly improved by augmenting (on demand) the event mentions in the
report with detailed commentaries. We build an event linking system
CricketLinking which first identifies event mentions from the reports and then
links them to a set of balls. Finding linkable mentions is challenging
because, unlike in entity linking settings, we do not have a concrete set of
event entities to link to. Further, depending on the event type, an event
mention may be linked to a single ball or to a set of balls. Hence,
identifying the mention type as well as performing the linking is
challenging. We use a large number of domain-specific features to learn
classifiers for mention and mention-type detection. Further, we leverage
structured matching, context similarity and sequential proximity to perform
accurate linking. Finally, context-based summarization is performed to
provide a concise briefing of the balls linked to each mention.
[2]
Information Retrieval with Verbose Queries
Tutorials
/
Gupta, Manish
/
Bendersky, Michael
Proceedings of the 2015 Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2015-08-09
p.1121-1124
© Copyright 2015 ACM
Summary: Recently, the focus of many novel search applications has shifted from
short keyword queries to verbose natural language queries. Examples include
question answering and dialogue systems, voice search on mobile devices, and
entity search engines like Facebook's Graph Search or Google's Knowledge Graph.
However, the performance of textbook information retrieval techniques for such
verbose queries is not as good as that for their shorter counterparts. Thus,
effective handling of verbose queries has become a critical factor for the
adoption of information retrieval techniques in this new breed of search
applications.
Over the past decade, the information retrieval community has deeply explored
the problem of transforming natural language verbose queries using operations
like reduction, weighting, expansion, reformulation and segmentation into more
effective structural representations. However, thus far there has been no
coherent, organized tutorial on this topic. In this tutorial, we aim to put
together various research pieces of the puzzle, provide a comprehensive and
structured overview of various proposed methods, and also list various
application scenarios where effective verbose query processing can make a
significant difference.
[3]
Characterizing Credit Card Black Markets on the Web
WebQuality 2015
/
Bulakh, Vlad
/
Gupta, Minaxi
Companion Proceedings of the 2015 International Conference on the World Wide
Web
2015-05-18
v.2
p.1435-1440
© Copyright 2015 ACM
Summary: We study carding shops that sell stolen credit and debit card information
online. By bypassing the anti-scraping mechanisms they use, we find that the
prices of cards depend heavily on factors such as the issuing bank, country of
origin, and whether the card can be used in brick-and-mortar stores or not.
Almost 70% of cards sold by these outfits are priced at or below the cost banks
incur in re-issuing them. Ironically, this makes it more economical for the
banks to buy back their own cards than to re-issue them. We also find that the
monthly
revenues for the carding shops we study are high enough to justify the risk
fraudsters take. Further, inventory at carding outfits seems to follow data
breaches and the impact of delayed deployment of the smart chip technology is
evident in the disproportionate share the U.S. commands in the underground card
fraud economy.
[4]
Ballet hero: building a garment for memetic embodiment in dance learning
Design exhibition
/
Hallam, James
/
McKenna, Alison
/
Keen, Emily
/
Gupta, Mudit
/
Lee, Christa
Adjunct Proceedings of the 2014 International Symposium on Wearable
Computers
2014-09-13
v.2
p.49-54
© Copyright 2014 ACM
Summary: This paper describes the analysis and design of a wearable technology
garment intended to aid with the instruction of ballet technique to adult
beginners. A phenomenological framework is developed and used to assess
physiological training tools. Following this, a garment is developed that
incorporates visual feedback inspired by animation techniques that more
directly convey the essential movements of ballet. The garment design is
presented, and a discussion is provided on the challenges of constructing an
e-textile garment using contemporary materials and techniques.
[5]
Modeling the evolution of product entities
Poster session (short papers)
/
Radhakrishnan, Priya
/
Gupta, Manish
/
Varma, Vasudeva
Proceedings of the 2014 Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2014-07-06
p.923-926
© Copyright 2014 ACM
Summary: A large number of web queries are related to product entities. Studying
evolution of product entities can help analysts understand the change in
particular attribute values for these products. However, studying the evolution
of a product requires us to be able to link various versions of a product
together in a temporal order. While it is easy to temporally link recent
versions of products in a few domains manually, solving the problem in general
is challenging. The ability to temporally order and link various versions of a
single product can also improve product search engines. In this paper, we
tackle the problem of finding the previous version (predecessor) of a product
entity. Given a repository of product entities, we first parse the product
names using a CRF model. After identifying entities corresponding to a single
product, we solve the problem of finding the previous version of any given
particular version of the product. For the second task, we leverage innovative
features with a Naïve Bayes classifier. Our methods achieve a precision of
88% in identifying the product version from product entity names, and a
precision of 53% in identifying the predecessor.
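The predecessor-classification step above can be illustrated with a tiny
Bernoulli Naïve Bayes over binary candidate features. This is only a sketch:
the feature names and training data below are invented for demonstration and
are not the paper's features or data.

```python
# Toy Bernoulli Naive Bayes for "is candidate X the predecessor of product
# version Y?". Features and labels are fabricated for illustration.
from collections import defaultdict
import math

class BernoulliNB:
    def __init__(self):
        self.counts = {0: defaultdict(int), 1: defaultdict(int)}
        self.totals = {0: 0, 1: 0}

    def fit(self, X, y):
        # Count how often each binary feature fires per class.
        for feats, label in zip(X, y):
            self.totals[label] += 1
            for i, f in enumerate(feats):
                if f:
                    self.counts[label][i] += 1

    def predict(self, feats):
        # Pick the class with the highest (Laplace-smoothed) log posterior.
        best, best_lp = None, -math.inf
        n = sum(self.totals.values())
        for label in (0, 1):
            lp = math.log((self.totals[label] + 1) / (n + 2))
            for i, f in enumerate(feats):
                p = (self.counts[label][i] + 1) / (self.totals[label] + 2)
                lp += math.log(p if f else 1 - p)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

# Hypothetical features: [version numbers adjacent?, same brand?,
# candidate released earlier?]
X = [[1, 1, 1], [1, 1, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 0]]
y = [1, 1, 0, 0, 0, 0]
nb = BernoulliNB()
nb.fit(X, y)
```

A candidate with all three features set would be classified as a predecessor
under this toy training set.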
[6]
CharBoxes: a system for automatic discovery of character infoboxes from
books
Demo session
/
Gupta, Manish
/
Bansal, Piyush
/
Varma, Vasudeva
Proceedings of the 2014 Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2014-07-06
p.1255-1256
© Copyright 2014 ACM
Summary: Entities are central to a large number of real-world applications. Wikipedia
shows entity infoboxes for a large number of entities. However, not much
structured information is available about character entities in books.
Automatic discovery of characters from books can help in effective
summarization. Such a structured summary, which not only introduces the
characters in the book but also captures the high-level relationships among
them, can be of critical importance for buyers. This task involves the
following challenging
novel problems: 1. automatic discovery of important characters given a book; 2.
automatic social graph construction relating the discovered characters; 3.
automatic summarization of text most related to each of the characters; and 4.
automatic infobox extraction from such summarized text for each character. As
part of this demo, we design mechanisms to address these challenges and
experiment with publicly available books.
[7]
EDIUM: Improving Entity Disambiguation via User Modeling
Short Paper Session 1
/
Bansal, Romil
/
Panem, Sandeep
/
Gupta, Manish
/
Varma, Vasudeva
Proceedings of ECIR'14, the 2014 European Conference on Information
Retrieval
2014-04-13
p.418-423
Keywords: Entity Disambiguation; Knowledge Graph; User Modeling
© Copyright 2014 Springer International Publishing
Summary: Entity Disambiguation is the task of associating entity name mentions in
text to the correct referent entities in the knowledge base, with the goal of
understanding and extracting useful information from the document. Entity
disambiguation is a critical component of systems designed to harness
information shared by users on microblogging sites like Twitter. However, noise
and lack of context in tweets make disambiguation a difficult task. In this
paper, we describe an Entity Disambiguation system, EDIUM, which uses User
interest Models to disambiguate the entities in the user's tweets. Our system
jointly models the user's interest scores and the context disambiguation
scores, thus compensating for the sparse context in the tweets of a given user. We
evaluated the system's entity linking capabilities on tweets from multiple
users and showed that improvement can be achieved by combining the user models
and the context based models.
[8]
Entity Tracking in Real-Time Using Sub-topic Detection on Twitter
Short Paper Session 1
/
Panem, Sandeep
/
Bansal, Romil
/
Gupta, Manish
/
Varma, Vasudeva
Proceedings of ECIR'14, the 2014 European Conference on Information
Retrieval
2014-04-13
p.528-533
Keywords: Sub-Topic Detection; Clustering; Entity Tracking; Text Mining
© Copyright 2014 Springer International Publishing
Summary: The velocity, volume and variety with which Twitter generates text are
increasing exponentially. It is critical to determine latent sub-topics from
such tweet data at any given point of time for providing better topic-wise
search results relevant to users' informational needs. The two main challenges
in mining sub-topics from tweets in real-time are (1) understanding the
semantic and the conceptual representation of the tweets, and (2) the ability
to determine when a new sub-topic (or cluster) appears in the tweet stream. We
address these challenges by proposing two unsupervised clustering approaches.
In the first approach, we generate a semantic space representation for each
tweet by keyword expansion and keyphrase identification. In the second
approach, we transform each tweet into a conceptual space that represents the
latent concepts of the tweet. We empirically show that the proposed methods
outperform the state-of-the-art methods.
[9]
Towards a social media analytics platform: event detection and user
profiling for Twitter
WWW 2014 tutorials
/
Gupta, Manish
/
Li, Rui
/
Chang, Kevin Chen-Chuan
Companion Proceedings of the 2014 International Conference on the World Wide
Web
2014-04-07
v.2
p.193-194
© Copyright 2014 ACM
Summary: Microblog data differs significantly from traditional text data along
a variety of dimensions: it consists of short documents, uses SMS-style
language, and is full of code-mixing. Though much of it is mere social babble,
it also carries fresh news from human sensors at a humongous rate. Given such
interesting characteristics, the world wide web
community has witnessed a large number of research tasks for microblogging
platforms recently. Event detection on Twitter is one of the most popular such
tasks with a large number of applications. The proposed tutorial on social
analytics for Twitter will contain three parts. In the first part, we will
discuss research efforts towards detection of events from Twitter using both
the tweet content as well as other external sources. We will also discuss
various applications for which event detection mechanisms have been put to use.
Merely detecting events is not enough. Applications require that the detector
must be able to provide a good description of the event as well. In the second
part, we will focus on describing events using the best phrase, event type,
event timespan, and credibility. In the third part, we will discuss user
profiling for Twitter with a special focus on user location prediction. We will
conclude with a summary and thoughts on future directions.
[10]
Cross market modeling for query-entity matching
WWW 2014 posters
/
Gupta, Manish
/
Borole, Prashant
/
Hebbar, Praful
/
Mehta, Rupesh
/
Nayak, Niranjan
Companion Proceedings of the 2014 International Conference on the World Wide
Web
2014-04-07
v.2
p.285-286
© Copyright 2014 ACM
Summary: Given a query, the query-entity (QE) matching task involves identifying the
best matching entity for the query. When modeling this task as a binary
classification problem, two issues arise: (1) features in specific global
markets (like de-at: German users in Austria) are quite sparse compared to
other markets like en-us, and (2) training data is expensive to obtain in
multiple markets and hence limited. Can we leverage some form of cross market
data/features for effective query-entity matching in sparse markets? Our
solution consists of three main modules: (1) Cross Market Training Data
Leverage (CMTDL), (2) Cross Market Feature Leverage (CMFL), and (3) Cross
Market Output Data Leverage (CMODL). Each of these modules performs "signal"
sharing at
different points during the classification process. Using a combination of
these strategies, we show significant improvements in query-impression weighted
coverage for the query-entity matching task.
[11]
Identifying fraudulently promoted online videos
WebQuality 2014 workshop
/
Bulakh, Vlad
/
Dunn, Christopher W.
/
Gupta, Minaxi
Companion Proceedings of the 2014 International Conference on the World Wide
Web
2014-04-07
v.2
p.1111-1116
© Copyright 2014 ACM
Summary: Fraudulent product promotion online, including online videos, is on the
rise. In order to understand and defend against this ill, we engage in the
fraudulent video economy for a popular video sharing website, YouTube, and
collect a sample of over 3,300 fraudulently promoted videos and 500 bot
profiles that promote them. We then characterize fraudulent videos and profiles
and train supervised machine learning classifiers that can successfully
differentiate fraudulent videos and profiles from legitimate ones.
[12]
Modeling click and relevance relationship for sponsored search
Posters: internet monetization and incentives
/
Zhang, Wei Vivian
/
Chen, Ye
/
Gupta, Mitali
/
Sett, Swaraj
/
Yan, Tak W.
Companion Proceedings of the 2013 International Conference on the World Wide
Web
2013-05-13
v.2
p.119-120
© Copyright 2013 ACM
Summary: Click-through rate (CTR) prediction and relevance ranking are two
fundamental problems in web advertising. In this study, we address the problem
of modeling the relationship between CTR and relevance for sponsored search. We
used normalized relevance scores comparable across all queries to represent
relevance when modeling with CTR, instead of directly using human judgment
labels or relevance scores valid only within same query. We classified clicks
by identifying their relevance quality using dwell time and session
information, and compared all clicks versus selective clicks effects when
modeling relevance.
Our results showed that the cleaned click signal outperforms the raw click
signal and the other signals we explored, in terms of relevance score fitting.
The cleaned clicks comprise clicks with dwell time greater than 5 seconds and
last clicks in a session. While it is traditionally held that there is no
linear relation between click and relevance, we showed that the cleaned-click
CTR can be fitted well with the normalized relevance scores using a quadratic
regression model.
This relevance-click model could help to train ranking models using processed
click feedback to complement expensive human editorial relevance labels, or
better leverage relevance signals in CTR prediction.
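The quadratic relevance-click fit described above can be sketched with a small
least-squares example. All numbers and the fitting code below are fabricated
for illustration; they are not the study's data or model.

```python
# Toy quadratic least-squares fit of cleaned-click CTR against normalized
# relevance: ctr = a*r^2 + b*r + c.

def _det3(m):
    # Determinant of a 3x3 matrix (cofactor expansion along the first row).
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_quadratic(xs, ys):
    """Solve the degree-2 least-squares normal equations via Cramer's rule."""
    n = len(xs)
    def s(p, q=0):
        return sum(x ** p * y ** q for x, y in zip(xs, ys))
    M = [[s(4), s(3), s(2)],
         [s(3), s(2), s(1)],
         [s(2), s(1), n]]
    v = [s(2, 1), s(1, 1), s(0, 1)]
    d = _det3(M)
    coeffs = []
    for i in range(3):
        Mi = [row[:] for row in M]
        for r in range(3):
            Mi[r][i] = v[r]  # replace column i with the right-hand side
        coeffs.append(_det3(Mi) / d)
    return coeffs  # [a, b, c]

# Hypothetical (query, ad) aggregates: normalized relevance vs. cleaned CTR.
relevance = [0.1, 0.2, 0.35, 0.5, 0.65, 0.8, 0.9]
cleaned_ctr = [0.01, 0.02, 0.05, 0.09, 0.15, 0.24, 0.30]
a, b, c = fit_quadratic(relevance, cleaned_ctr)

def predict_ctr(r):
    return a * r * r + b * r + c
```

Such a fitted curve is what would let processed click feedback stand in for
editorial relevance labels when training ranking models.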
[13]
Fast query evaluation for ad retrieval
Poster presentations
/
Chen, Ye
/
Gupta, Mitali
/
Yan, Tak W.
Proceedings of the 2012 International Conference on the World Wide Web
2012-04-16
v.2
p.479-480
© Copyright 2012 ACM
Summary: We describe a fast query evaluation method for ad document retrieval in
online advertising, based upon the classic WAND algorithm. The key idea is to
localize per-topic term upper bounds into homogeneous ad groups. Our approach
is not only theoretically motivated by a topical mixture model but also
empirically justified by the characteristics of the ad domain, namely short,
semantically focused documents with a natural hierarchy. We report experimental
results using artificial and real-world query-ad retrieval data, and show that
the tighter-bound WAND outperforms the traditional approach by 35.4% reduction
in number of full evaluations.
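The group-local upper-bound idea can be sketched in a few lines. This is an
assumed simplification of WAND-style pruning, not the paper's implementation:
a whole ad group is skipped when the sum of its local term bounds cannot beat
the current top-k threshold.

```python
# Sketch of WAND-style pruning with per-group term upper bounds.
import heapq

def retrieve_top_k(query_terms, groups, k):
    """groups: list of (group_bounds, docs). group_bounds maps term -> the
    maximum score that term attains inside the group; docs is a list of
    (doc_id, {term: score}) pairs belonging to the group."""
    heap = []  # min-heap of (score, doc_id) holding the current top-k
    for bounds, docs in groups:
        # Group-local upper bound on any member document's score.
        ub = sum(bounds.get(t, 0.0) for t in query_terms)
        if len(heap) == k and ub <= heap[0][0]:
            continue  # the tighter local bound lets us skip the whole group
        for doc_id, scores in docs:
            s = sum(scores.get(t, 0.0) for t in query_terms)
            if len(heap) < k:
                heapq.heappush(heap, (s, doc_id))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, doc_id))
    return sorted(heap, reverse=True)
```

Because ads in a group are homogeneous, the local bound is much tighter than a
global one, which is what drives the reduction in full evaluations.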
[14]
Trust analysis with clustering
Poster session
/
Gupta, Manish
/
Sun, Yizhou
/
Han, Jiawei
Proceedings of the 2011 International Conference on the World Wide Web
2011-03-28
v.2
p.53-54
© Copyright 2011 ACM
Summary: The Web provides rich information about a variety of objects, but
trustworthiness is a major concern. Truth establishment is an important task:
it ensures that the user gets the right information from the most trustworthy
source. The trustworthiness of an information provider and the confidence in
the facts it provides are inter-dependent and can hence be expressed
iteratively in terms of each other. However, a single information provider may
not be the most trustworthy for all kinds of information; every provider has
its own area of competence where it can perform better than others. We derive
a model that evaluates the trustworthiness of objects and information
providers based on clusters (groups). We propose a method that groups together
the objects for which similar sets of providers supply "good" facts, and that
provides better accuracy in addition to high-quality object clusters.
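The iterative inter-dependence between provider trustworthiness and fact
confidence can be sketched with one common formulation. The update rules
below are an assumption for illustration, not necessarily the paper's exact
model.

```python
# Toy sketch: provider trustworthiness and fact confidence expressed
# iteratively in terms of each other.
def iterate_trust(provides, iters=10):
    """provides: {provider: set of facts}. Returns (trust, confidence)."""
    facts = {f for fs in provides.values() for f in fs}
    trust = {p: 0.5 for p in provides}  # uninformative prior
    conf = {}
    for _ in range(iters):
        # A fact is likely true unless every provider asserting it is wrong.
        for f in facts:
            miss = 1.0
            for p, fs in provides.items():
                if f in fs:
                    miss *= 1.0 - trust[p]
            conf[f] = 1.0 - miss
        # A provider is as trustworthy as its facts are, on average.
        for p, fs in provides.items():
            trust[p] = sum(conf[f] for f in fs) / len(fs)
    return trust, conf

# "f1" is corroborated by two providers; "f2" has a single provider.
trust, conf = iterate_trust({"A": {"f1"}, "B": {"f1"}, "C": {"f2"}})
```

Clustering would refine this further by running such updates within groups of
objects where the same providers tend to be competent.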
[15]
Connecting the next billion web users
Panel session
/
Rastogi, Rajeev
/
Cutrell, Ed
/
Gupta, Manish
/
Jhunjhunwala, Ashok
/
Narayan, Ramkumar
/
Sanghal, Rajeev
Proceedings of the 2011 International Conference on the World Wide Web
2011-03-28
v.2
p.329-330
© Copyright 2011 ACM
Summary: With 2 billion users, the World Wide Web has indeed come a long way.
However, of the 4.8 billion people living in Asia and Africa, only 1 in 5 has
access to the Web. For instance, in India, the 100 million Web users constitute
less than 10% of the total population of 1.2 billion. It is thus widely
accepted that the next billion users will come from emerging markets like
Brazil, China, India, Indonesia and Russia. Emerging markets have a number of
unique characteristics: large, dense populations with low incomes; lack of
infrastructure in terms of broadband, electricity, etc.; poor PC penetration
due to limited affordability; high illiteracy rates and inability to
read/write; a plethora of local languages and dialects; a general paucity of
local content, especially in local languages; and explosive growth in the
number of mobile phones. The panel will debate the various technical
challenges in
overcoming the digital divide, and potential approaches to bring the Web to the
underserved populations of the developing world.
[16]
Spoken Web: a mobile cloud based parallel web for the masses
Keynote
/
Gupta, Manish
Proceedings of the 2011 International Cross-Disciplinary Conference on Web
Accessibility (W4A)
2011-03-28
v.2
p.1
© Copyright 2011 ACM
Summary: In India and several other countries, most notably in Africa, the
penetration of the personal computer and the internet remains relatively low.
However, there has been a huge surge in the adoption of simple mobile phones
(there are over 700 million mobile phone numbers in India), and this
penetration continues to grow at a fast pace. We will present Spoken Web, an
attempt to create a new world wide web for the masses in these countries,
accessible over the telephone network and hosted in a cloud. The Spoken Web
platform facilitates easy creation of user-generated content that populates
'voice sites', and allows contextual traversal of voice sites interconnected
via hyperlinks based on the Hyperspeech Transfer Protocol. We present our
experience from pilots conducted in villages in Andhra Pradesh, Gujarat, and
other states in India. These pilots demonstrate the ease with which a
semi-literate and non-IT savvy population can create voice sites with locally
relevant content, including schedules of education/training classes,
agricultural information, and entertainment related content, and their strong
interest in accessing this information over the telephone network. We describe
several outstanding challenges and opportunities in creating and using a Spoken
Web for facilitating exchange of information and conducting business
transactions.
[17]
iCollaborate: harvesting value from enterprise web usage
Demonstrations
/
Kale, Ajinkya
/
Burris, Thomas
/
Shah, Bhavesh
/
Venkatesan, T. L. Prasanna
/
Velusamy, Lakshmanan
/
Gupta, Manish
/
Degerattu, Melania
Proceedings of the 33rd Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2010-07-19
p.699
Keywords: enterprise social data, social browsing
© Copyright 2010 ACM
Summary: We are in a phase of the 'Participatory Web', in which users 'add
value' to the information on the web by publishing, tagging and sharing. The
Participatory Web has enormous potential for an enterprise because, unlike the
users of the internet at large, an enterprise is a community that shares
common goals, assumptions, vocabulary and interests, and has reliable user
identification and mutual trust along with central governance and incentives
to collaborate. Every day, the employees of an organization locate content
relevant to their work on the web. Finding this information takes time,
expertise and creativity, which costs an organization money. That is, the web
pages employees find are knowledge assets owned by the enterprise. This
investment in web-based knowledge assets is lost every time the enterprise
fails to capture and reuse them. iCollaborate is built to capture users' web
interactions, persist and analyze them, and feed that interaction back into
the community: the enterprise.
[18]
LINKREC: a unified framework for link recommendation with user attributes
and graph structure
WWW posters
/
Yin, Zhijun
/
Gupta, Manish
/
Weninger, Tim
/
Han, Jiawei
Proceedings of the 2010 International Conference on the World Wide Web
2010-04-26
v.1
p.1211-1212
Keywords: link recommendation, random walk
© Copyright 2010 ACM
Summary: With the phenomenal success of networking sites (e.g., Facebook, Twitter and
LinkedIn), social networks have drawn substantial attention. On online social
networking sites, link recommendation is a critical task that not only helps
improve user experience but also plays an essential role in network growth. In
this paper we propose several link recommendation criteria, based on both user
attributes and graph structure. To discover the candidates that satisfy these
criteria, link relevance is estimated using a random walk algorithm on an
augmented social graph with both attribute and structure information. The
global and local influence of the attributes is leveraged in the framework as
well. Besides link recommendation, our framework can also rank attributes in a
social network. Experiments on DBLP and IMDB data sets demonstrate that our
method outperforms state-of-the-art methods based on network structure and node
attribute information for link recommendation.
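The random-walk estimation of link relevance on an augmented graph can be
sketched as follows. The toy graph, restart probability and node names are
invented for illustration; the authors' augmented-graph construction and
attribute weighting are richer than this.

```python
# Random walk with restart on a graph augmented with attribute nodes, so
# relevance can flow between users through shared attributes.
def random_walk_with_restart(adj, source, alpha=0.15, iters=50):
    """adj: {node: [neighbor, ...]}; returns steady-state visit scores."""
    nodes = list(adj)
    score = {n: (1.0 if n == source else 0.0) for n in nodes}
    for _ in range(iters):
        nxt = {n: (alpha if n == source else 0.0) for n in nodes}
        for n in nodes:
            nbrs = adj[n]
            if not nbrs:
                nxt[source] += (1 - alpha) * score[n]  # dangling: restart
                continue
            share = (1 - alpha) * score[n] / len(nbrs)
            for m in nbrs:
                nxt[m] += share
        score = nxt
    return score

# Toy augmented graph: users u1..u3 plus an attribute node "attr:ML"
# shared by u1 and u3, which makes u3 a link candidate for u1.
adj = {
    "u1": ["u2", "attr:ML"],
    "u2": ["u1", "u3"],
    "u3": ["u2", "attr:ML"],
    "attr:ML": ["u1", "u3"],
}
scores = random_walk_with_restart(adj, "u1")
candidates = sorted((n for n in adj if n != "u1" and n not in adj["u1"]),
                    key=lambda n: -scores[n])
```

Ranking attribute nodes by the same scores is what allows the framework to
rank attributes as well as links.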
[19]
Adding GPS-Control to Traditional Thermostats: An Exploration of Potential
Energy Savings and Design Challenges
At Home with Pervasive Applications
/
Gupta, Manu
/
Intille, Stephen S.
/
Larson, Kent
Proceedings of Pervasive 2009: International Conference on Pervasive
Computing
2009-05-11
p.95-114
© Copyright 2009 Springer-Verlag
Summary: Although manual and programmable home thermostats can save energy when used
properly, studies have shown that over 40% of U.S. homes may not use
energy-saving temperature setbacks when homes are unoccupied. We propose a
system for augmenting these thermostats using just-in-time heating and cooling
based on travel-to-home distance obtained from location-aware mobile phones.
Analyzing GPS travel data from 8 participants (8-12 weeks each) and heating and
cooling characteristics from 5 homes, we report results of running computer
simulations estimating potential energy savings from such a device. Using a
GPS-enabled thermostat might lead to savings of as much as 7% for some
households that do not regularly use the temperature setback afforded by manual
and programmable thermostats. Significantly, these savings could be obtained
without requiring any change in occupant behavior or comfort level, and the
technology could be implemented affordably by exploiting the ubiquity of mobile
phones. Additional savings may be possible with modest context-sensitive
prompting. We report on design considerations identified during a pilot test of
a fully-functional implementation of the system.
[20]
Predicting click through rate for job listings
Posters Wednesday, April 22, 2009
/
Gupta, Manish
Proceedings of the 2009 International Conference on the World Wide Web
2009-04-20
p.1053-1054
Keywords: CPC, CTR, GBDT, click through rate, gradient boosted decision trees, jobs,
linear regression, prediction, treenet
© Copyright 2009 International World Wide Web Conference Committee (IW3C2)
Summary: Click-Through Rate (CTR) is an important metric for ad systems, job
portals and recommendation systems. CTR impacts publishers' revenue and
advertisers' bid amounts in "pay for performance" business models. We learn
regression models using features of the job, the job's click history when
available, and features of "related" jobs. We show that our models predict CTR
much better than simply predicting the average CTR over all job listings, even
in the absence of click history for the job listing.
[21]
Detecting image spam using visual features and near duplicate detection
Security I: misc
/
Mehta, Bhaskar
/
Nangia, Saurabh
/
Gupta, Manish
/
Nejdl, Wolfgang
Proceedings of the 2008 International Conference on the World Wide Web
2008-04-21
p.497-506
Keywords: email spam, image analysis, machine learning
© Copyright 2008 International World Wide Web Conference Committee (IW3C2)
Summary: Email spam is a much studied topic, but even though current email spam
detecting software has been gaining a competitive edge against text based email
spam, new advances in spam generation have posed a new challenge: image-based
spam. Image-based spam is email that embeds the spam message inside images,
in binary format. In this paper, we study the
characteristics of image spam to propose two solutions for detecting
image-based spam, while drawing a comparison with the existing techniques. The
first solution, which uses the visual features for classification, offers an
accuracy of about 98%, i.e., an improvement of at least 6% over existing
solutions. SVMs (Support Vector Machines) are used to train classifiers using
judiciously decided color, texture and shape features. The second solution
offers a novel approach for near-duplicate detection in images. It involves
clustering of image GMMs (Gaussian Mixture Models) based on the Agglomerative
Information Bottleneck (AIB) principle, using Jensen-Shannon divergence (JS) as
the distance measure.
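The Jensen-Shannon divergence used as the distance measure above has a simple
closed form for discrete distributions. For GMMs it must be approximated; this
sketch uses discretized distributions purely for illustration.

```python
# Jensen-Shannon divergence between two discrete distributions:
# JS(P, Q) = (KL(P || M) + KL(Q || M)) / 2 with M = (P + Q) / 2.
import math

def kl(p, q):
    # Kullback-Leibler divergence; terms with p_i = 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike raw KL, JS is symmetric and bounded by log 2, which makes it a
well-behaved distance for agglomerative clustering.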
[22]
Fast algorithms for top-k personalized pagerank queries
Posters
/
Gupta, Manish
/
Pathak, Amit
/
Chakrabarti, Soumen
Proceedings of the 2008 International Conference on the World Wide Web
2008-04-21
p.1225-1226
Keywords: hubrank, node-deletion, pagerank, personalized, top-k
© Copyright 2008 International World Wide Web Conference Committee (IW3C2)
Summary: In entity-relation (ER) graphs (V,E), nodes V represent typed entities and
edges E represent typed relations. For dynamic personalized PageRank queries,
nodes are ranked by their steady-state probabilities obtained using the
standard random surfer model. In this work, we propose a framework to answer
top-k graph conductance queries. Our top-k ranking technique leads to a 4X
speedup, and overall, our system executes queries 200-1600X faster than
whole-graph PageRank. Some queries may contain hard predicates, i.e.,
predicates that must be satisfied by the answer nodes; e.g., we may seek
authoritative papers on public-key cryptography, but only those written in
1997. We extend our system to handle hard predicates. Our system achieves
these
substantial query speedups while consuming only 10-20% of the space taken by a
regular text index.
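A minimal sketch of personalized PageRank with a hard-predicate filter
follows. This is a toy power iteration over a tiny graph, an assumption for
illustration only, not the paper's indexed and heavily optimized system.

```python
# Personalized PageRank over an entity-relation graph, followed by a hard
# predicate (e.g. year == 1997) that answer nodes must satisfy.
def personalized_pagerank(out_edges, teleport, alpha=0.2, iters=60):
    """out_edges: {node: [node, ...]}; teleport: set of query nodes."""
    p = {n: (1.0 / len(teleport) if n in teleport else 0.0)
         for n in out_edges}
    for _ in range(iters):
        nxt = {n: (alpha / len(teleport) if n in teleport else 0.0)
               for n in out_edges}
        for n, pr in p.items():
            outs = out_edges[n] or list(teleport)  # dangling -> teleport
            for m in outs:
                nxt[m] += (1 - alpha) * pr / len(outs)
        p = nxt
    return p

def top_k_with_predicate(out_edges, teleport, predicate, k):
    # Rank by steady-state probability, keep only predicate-satisfying nodes.
    p = personalized_pagerank(out_edges, teleport)
    hits = [(s, n) for n, s in p.items() if predicate(n)]
    return sorted(hits, reverse=True)[:k]
```

The real system avoids ranking the whole graph; the sketch only shows where a
hard predicate plugs into the ranking.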
[23]
Sonic Grid: an auditory interface for the visually impaired to navigate
GUI-based environments
Short papers
/
Jagdish, Deepak
/
Sawhney, Rahul
/
Gupta, Mohit
/
Nangia, Shreyas
Proceedings of the 2008 International Conference on Intelligent User
Interfaces
2008-01-13
p.337-340
© Copyright 2008 ACM
Summary: This paper explores the prototype design of an auditory interface
enhancement called the Sonic Grid that helps visually impaired users navigate
GUI-based environments. The Sonic Grid provides an auditory representation of
GUI elements embedded in a two-dimensional interface, giving a 'global' spatial
context for use of auditory icons, earcons and speech feedback. This paper
introduces the Sonic Grid, discusses insights gained through participatory
design with members of the visually impaired community, and suggests various
applications of the technique, including its use to ease the learning curve for
using computers by the visually impaired.