Foraging Among an Overabundance of Similar Variants
End-User Programming
/
Ragavan, Sruti Srinivasa
/
Kuttal, Sandeep Kaur
/
Hill, Charles
/
Sarma, Anita
/
Piorkowski, David
/
Burnett, Margaret
Proceedings of the ACM CHI'16 Conference on Human Factors in Computing
Systems
2016-05-07
v.1
p.3509-3521
© Copyright 2016 ACM
Summary: Foraging among too many variants of the same artifact can be problematic
when many of these variants are similar. This situation, which is largely
overlooked in the literature, is commonplace in several types of creative
tasks, one of which is exploratory programming. In this paper, we investigate
how novice programmers forage through similar variants. Based on our results,
we propose a refinement to Information Foraging Theory (IFT) to include
constructs about variation foraging behavior, and propose refinements to
computational models of IFT to better account for foraging among variants.
Perceptions of answer quality in an online technical question and answer
forum
Short Papers
/
Hart, Kerry
/
Sarma, Anita
Proceedings of the 2014 International Workshop on Cooperative and Human
Aspects of Software Engineering
2014-06-02
p.103-106
© Copyright 2014 ACM
Summary: Software developers are used to seeking information from authoritative
texts, such as technical manuals, or from experts with whom they are
familiar. Increasingly, developers seek information in online question and
answer forums, where the quality of the information is variable. To a novice,
it may be challenging to filter good information from bad. Stack Overflow is a
Q&A forum that introduces a social reputation element: users rate the
quality of posted answers, and answerers can accrue points and rewards for
writing answers that are rated highly by their peers. A user who consistently
authors good answers will develop a good 'reputation' as recorded by these
points. While this system was designed with the intent to incentivize
high-quality answers, it has been suggested that information seekers -- and
particularly technical novices -- may rely on the social reputation of the
answerer as a proxy for answer quality. In this paper, we investigate the role
that this social factor -- as well as other answer characteristics -- plays in
the information filtering process of technical novices in the context of Stack
Overflow. The results of our survey conducted on Amazon.com's Mechanical Turk
indicate that technical novices assess information quality based on the
intrinsic qualities of the answer, such as presentation and content, suggesting
that novices are wary of relying on social cues in the Q&A context.
What makes an image popular?
Content quality & popularity
/
Khosla, Aditya
/
Sarma, Atish Das
/
Hamid, Raffay
Proceedings of the 2014 International Conference on the World Wide Web
2014-04-07
v.1
p.867-876
© Copyright 2014 ACM
Summary: Hundreds of thousands of photographs are uploaded to the internet every
minute through various social networking and photo sharing platforms. While
some images get millions of views, others are completely ignored. Even from the
same users, different photographs receive different numbers of views. This raises
the question: What makes a photograph popular? Can we predict the number of
views a photograph will receive even before it is uploaded? These are some of
the questions we address in this work. We investigate two key components of an
image that affect its popularity, namely the image content and social context.
Using a dataset of about 2.3 million images from Flickr, we demonstrate that we
can reliably predict the normalized view count of images with a rank
correlation of 0.81 using both image content and social cues. In this paper, we
show the importance of image cues such as color, gradients, deep learning
features and the set of objects present, as well as the importance of various
social cues such as number of friends or number of photos uploaded that lead to
high or low popularity of images.
E-commerce product search: personalization, diversification, and beyond
WWW 2014 tutorials
/
Sarma, Atish Das
/
Parikh, Nish
/
Sundaresan, Neel
Companion Proceedings of the 2014 International Conference on the World Wide
Web
2014-04-07
v.2
p.189-190
© Copyright 2014 ACM
Summary: The focus of this tutorial is e-commerce product search. Several
challenges appear in this context, both from a research standpoint as well as
an application standpoint. We present various approaches adopted in the
industry, review well-known research techniques developed over the last decade,
draw parallels to traditional web search highlighting the new challenges in
this setting, and dig deep into some of the algorithmic and technical
approaches developed. A specific approach considered here, one that advances
theoretical techniques and illustrates practical impact, is identifying the most
suitable results quickly from a large database. Settings span cold-start users and
advanced users for whom personalization is possible. In this context, top-k
and skyline queries are discussed, as they form a key approach that spans the web, data
mining, and database communities. These present powerful tools for search
across multi-dimensional items with clear preferences within each attribute,
like product search as opposed to regular web search.
The "expression gap": do you like what you share?
WWW 2014 posters
/
Sarma, Atish Das
/
Si, Si
/
Churchill, Elizabeth F.
/
Sundaresan, Neel
Companion Proceedings of the 2014 International Conference on the World Wide
Web
2014-04-07
v.2
p.247-248
© Copyright 2014 ACM
Summary: While recommendation profiles increasingly leverage social actions such as
"shares", the predictive significance of such actions is unclear. To what
extent do public shares correlate with other online behaviors such as searches,
views and purchases? Based on an analysis of 950,000 users' behavioral,
transactional, and social sharing data on a global online commerce platform, we
show that social "shares", or publicly posted expressions of interest do not
correlate with non-public behaviors such as views and purchases. A key takeaway
is that there is a "gap" between public and non-public actions online,
suggesting that marketers and advertisers need to be cautious in their
estimation of the significance of social sharing.
Beyond modeling private actions: predicting social shares
WWW 2014 posters
/
Si, Si
/
Sarma, Atish Das
/
Churchill, Elizabeth F.
/
Sundaresan, Neel
Companion Proceedings of the 2014 International Conference on the World Wide
Web
2014-04-07
v.2
p.377-378
© Copyright 2014 ACM
Summary: We study the problem of predicting sharing behavior from e-commerce sites to
friends on social networks via share widgets. The contextual variation in an
action that is private (like rating a movie on Netflix), to one shared with
friends online (like sharing an item on Facebook), to one that is completely
public (like commenting on a YouTube video) introduces behavioral differences
that pose interesting challenges. In this paper, we show that users' interests
manifest in actions that spill across different types of channels such as
sharing, browsing, and purchasing. This motivates leveraging all such signals
available from the e-commerce platform. We show that carefully incorporating
signals from these interactions significantly improves share prediction
accuracy.
On the benefits of providing versioning support for end users: An empirical
study
/
Kuttal, Sandeep K.
/
Sarma, Anita
/
Rothermel, Gregg
ACM Transactions on Computer-Human Interaction
2014-02
v.21
n.2
p.9
© Copyright 2014 ACM
Summary: End users with little formal programming background are creating software in
many different forms, including spreadsheets, web macros, and web mashups. Web
mashups are particularly popular because they are relatively easy to create,
and because many programming environments that support their creation are
available. These programming environments, however, provide no support for
tracking versions or provenance of mashups. We believe that versioning support
can help end users create, understand, and debug mashups. To investigate this
belief, we have added versioning support to a popular wire-oriented mashup
environment, Yahoo! Pipes. Our enhanced environment, which we call "Pipes
Plumber," automatically retains versions of pipes and provides an interface
with which pipe programmers can browse histories of pipes and retrieve specific
versions. We have conducted two studies of this environment: an exploratory
study and a larger controlled experiment. Our results provide evidence that
versioning helps pipe programmers create and debug mashups. Subsequent
qualitative results provide further insights into the barriers faced by pipe
programmers, the support for reuse provided by our approach, and the support
for debugging that it provides.
Optimal hashing schemes for entity matching
Research papers
/
Dalvi, Nilesh
/
Rastogi, Vibhor
/
Dasgupta, Anirban
/
Sarma, Anish Das
/
Sarlos, Tamas
Proceedings of the 2013 International Conference on the World Wide Web
2013-05-13
v.1
p.295-306
© Copyright 2013 ACM
Summary: In this paper, we consider the problem of devising blocking schemes for
entity matching. There is a lot of work on blocking techniques for supporting
various kinds of predicates, e.g. exact matches, fuzzy string-similarity
matches, and spatial matches. However, given a complex entity matching function
in the form of a Boolean expression over several such predicates, we show that
it is an important and non-trivial problem to combine the individual blocking
techniques into an efficient blocking scheme for the entity matching function,
a problem that has not been studied previously.
In this paper, we make fundamental contributions to this problem. We
consider an abstraction for modeling complex entity matching functions as well
as blocking schemes. We present several results of theoretical and practical
interest for the problem. We show that in general, the problem of computing the
optimal blocking strategy is NP-hard in the size of the DNF formula describing
the matching function. We also present several algorithms for computing the
exact optimal strategies (with exponential complexity, but often feasible in
practice) as well as fast approximation algorithms. We experimentally
demonstrate over commercially used rule-based matching systems over real
datasets at Yahoo!, as well as synthetic datasets, that our blocking strategies
can be an order of magnitude faster than the baseline methods, and our
algorithms can efficiently find good blocking strategies.
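As background for the entry above, the basic blocking idea can be sketched in a few lines of Python (a toy illustration with hypothetical names, not the paper's optimal schemes): records sharing a blocking key land in the same block, and only within-block pairs are compared.

```python
from collections import defaultdict
from itertools import combinations

def block_and_match(records, block_key, match):
    # Group records into blocks by the blocking key.
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    # Compare only pairs that fall in the same block.
    matches = []
    for group in blocks.values():
        for a, b in combinations(group, 2):
            if match(a, b):
                matches.append((a, b))
    return matches

# Toy data: block by the first letter of the name; match on
# case-insensitive name equality.
records = [
    {"name": "Cafe Roma", "city": "SF"},
    {"name": "cafe roma", "city": "SF"},
    {"name": "Delfina", "city": "SF"},
]
pairs = block_and_match(
    records,
    block_key=lambda r: r["name"][0].lower(),
    match=lambda a, b: a["name"].lower() == b["name"].lower(),
)
# pairs holds the single duplicate pair (Cafe Roma / cafe roma)
```

The paper's contribution is choosing and combining such blocking keys well when the matching function is a Boolean combination of several predicates, which this sketch does not attempt.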
Debugging support for end user mashup programming
Papers: novel programming
/
Kuttal, Sandeep Kaur
/
Sarma, Anita
/
Rothermel, Gregg
Proceedings of ACM CHI 2013 Conference on Human Factors in Computing Systems
2013-04-27
v.1
p.1609-1618
© Copyright 2013 ACM
Summary: Programming for the web can be an intimidating task, particularly for
non-professional ("end-user") programmers. Mashup programming environments
attempt to remedy this by providing support for such programming. It is well
known, however, that mashup programmers create applications that contain bugs.
Furthermore, mashup programmers learn from examples and reuse other mashups,
which causes bugs to propagate to other mashups. In this paper we classify the
bugs that occur in a large corpus of Yahoo! Pipes mashups. We describe support
we have implemented in the Yahoo! Pipes environment to provide automatic error
detection techniques that help mashup programmers localize and correct these
bugs. We present the results of a think-aloud study comparing the experiences
of end-user mashup programmers using and not using our support. Our results
show that our debugging enhancements do help these programmers localize and
correct bugs more effectively and efficiently.
Dynamic covering for recommendation systems
KM track: recommender systems
/
Antonellis, Ioannis
/
Sarma, Anish Das
/
Dughmi, Shaddin
Proceedings of the 2012 ACM Conference on Information and Knowledge
Management
2012-10-29
p.26-34
© Copyright 2012 ACM
Summary: In this paper, we identify a fundamental algorithmic problem that we term
succinct dynamic covering (SDC), arising in many modern-day web applications,
including ad-serving and online recommendation systems such as in eBay,
Netflix, and Amazon. Roughly speaking, SDC applies two restrictions to the
well-studied Max-Coverage problem [14]: given an integer k, a universe
X = {1, 2, ..., n}, and a collection I = {S_1, ..., S_m} with each S_i ⊆ X,
find J ⊆ I such that |J| < k and ∪_{S in J} S is as large as possible. The two
restrictions applied by SDC are: (1) Dynamic: at query time, we are given a
query Q ⊆ X, and our goal is to find J such that Q ∩ (∪_{S in J} S) is as large
as possible; (2) Space-constrained: we do not have enough space to store (and
process) the entire input; specifically, we have o(mn), and maybe as little as O((m+n) polylog(mn))
space. A solution to SDC maintains a small data structure, and uses this data
structure to answer most dynamic queries with high accuracy. We call such a
scheme a Coverage Oracle.
We present algorithms and complexity results for coverage oracles. We
present deterministic and probabilistic near-tight upper and lower bounds on
the approximation ratio of SDC as a function of the amount of space available
to the oracle. Our lower bound results show that to obtain constant-factor
approximations we need Omega(mn) space. Fortunately, our upper bounds present
an explicit tradeoff between space and approximation ratio, allowing us to
determine the amount of space needed to guarantee certain accuracy.
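For context, the classic greedy heuristic for the underlying Max-Coverage problem (a standard textbook sketch, not the paper's coverage-oracle construction) looks like this:

```python
def greedy_max_coverage(sets, k):
    # Repeatedly pick the set covering the most still-uncovered
    # elements; this gives the classic (1 - 1/e) approximation.
    covered, chosen = set(), []
    remaining = [set(s) for s in sets]
    for _ in range(k):
        best = max(remaining, key=lambda s: len(s - covered), default=None)
        if best is None or not (best - covered):
            break
        chosen.append(best)
        covered |= best
        remaining.remove(best)
    return chosen, covered

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}]
chosen, covered = greedy_max_coverage(sets, 2)
# greedy picks {4,5,6,7} then {1,2,3}, covering all 7 elements
```

The SDC setting restricts this offline picture in the two ways the abstract describes: the query Q arrives only at query time, and the oracle must answer from a data structure far smaller than the full input.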
An automatic blocking mechanism for large-scale de-duplication tasks
DB track: web data management
/
Sarma, Anish Das
/
Jain, Ankur
/
Machanavajjhala, Ashwin
/
Bohannon, Philip
Proceedings of the 2012 ACM Conference on Information and Knowledge
Management
2012-10-29
p.1055-1064
© Copyright 2012 ACM
Summary: De-duplication -- identification of distinct records referring to the same
real-world entity -- is a well-known challenge in data integration. Since very
large datasets prohibit the comparison of every pair of records, blocking has
been identified as a technique of dividing the dataset for pairwise
comparisons, thereby trading off recall of identified duplicates for
efficiency. Traditional de-duplication tasks, while challenging, typically
involved a fixed schema such as Census data or medical records. However, with
the presence of large, diverse sets of structured data on the web and the need
to organize it effectively on content portals, de-duplication systems need to
scale in a new dimension to handle a large number of schemas, tasks and data
sets, while handling ever larger problem sizes. In addition, when working in a
map-reduce framework it is important that canopy formation be implemented as a
hash function, making the canopy design problem more challenging. We present
CBLOCK, a system that addresses these challenges.
CBLOCK learns hash functions automatically from attribute domains and a
labeled dataset consisting of duplicates. Subsequently, CBLOCK expresses
blocking functions using a hierarchical tree structure composed of atomic hash
functions. The application may guide the automated blocking process based on
architectural constraints, such as by specifying a maximum size of each block
(based on memory requirements), imposing disjointness of blocks (in a grid
environment), or specifying a particular objective function trading off recall for
efficiency. As a post-processing step to automatically generated blocks, CBLOCK
rolls-up smaller blocks to increase recall. We present experimental results on
two large-scale de-duplication datasets from a commercial search engine --
consisting of over 140K movies and 40K restaurants respectively -- and
demonstrate the utility of CBLOCK.
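The hierarchical idea described above can be illustrated with a small sketch (hypothetical names and toy hash functions, not CBLOCK itself): apply hash functions in sequence, splitting any block that exceeds the size cap.

```python
def hierarchical_block(records, hash_fns, max_size):
    # Split a group with the next hash function whenever it exceeds
    # max_size and hash functions remain; otherwise emit it as a block.
    def split(group, depth):
        if len(group) <= max_size or depth >= len(hash_fns):
            return [group]
        buckets = {}
        for r in group:
            buckets.setdefault(hash_fns[depth](r), []).append(r)
        out = []
        for g in buckets.values():
            out.extend(split(g, depth + 1))
        return out
    return split(list(records), 0)

# Toy movie records: split first by release decade, then by the
# first letter of the title, with at most two records per block.
movies = [("Alien", 1979), ("Aliens", 1986), ("Amadeus", 1984), ("Brazil", 1985)]
blocks = hierarchical_block(
    movies,
    hash_fns=[lambda m: m[1] // 10, lambda m: m[0][0]],
    max_size=2,
)
# three blocks: [Alien], [Aliens, Amadeus], [Brazil]
```

CBLOCK additionally learns which hash functions to apply at each level and rolls up undersized blocks to recover recall, neither of which this toy version does.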
Your two weeks of fame and your grandmother's
Web mining
/
Cook, James
/
Sarma, Atish Das
/
Fabrikant, Alex
/
Tomkins, Andrew
Proceedings of the 2012 International Conference on the World Wide Web
2012-04-16
v.1
p.919-928
© Copyright 2012 ACM
Summary: Did celebrity last longer in 1929, 1992 or 2009? We investigate the
phenomenon of fame by mining a collection of news articles that spans the
twentieth century, and also perform a side study on a collection of blog posts
from the last 10 years. By analyzing mentions of personal names, we measure
each person's time in the spotlight, and watch the distribution change from a
century ago to a year ago. We expected to find a trend of decreasing durations
of fame as news cycles accelerated and attention spans became shorter. Instead,
we find a remarkable consistency through most of the period we study. Through a
century of rapid technological and societal change, through the appearance of
Twitter, communication satellites and the Internet, we do not observe a
significant change in typical duration of celebrity. We also study the most
famous of the famous, and find different results depending on our method for
measuring duration of fame. With a method that may be thought of as measuring a
spike of attention around a single narrow news story, we see the same result as
before: stories last as long now as they did in 1930. A second method, which
may be thought of as measuring the duration of public interest in a person,
indicates that famous people's presence in the news is becoming longer rather
than shorter, an effect most likely driven by the wider distribution and higher
volume of media in modern times. Similar studies have been done with much
shorter timescales specifically in the context of information spreading on
Twitter and similar social networking sites. However, to the best of our
knowledge, this is the first massive scale study of this nature that spans over
a century of archived data, thereby allowing us to track changes across
decades.
EDITED BOOK
Search Computing: Broadening Web Search
Lecture Notes in Computer Science 7538
/
Ceri, Stefano
/
Brambilla, Marco
2012
n.16
p.254
Springer Berlin Heidelberg
DOI: 10.1007/978-3-642-34213-4
== Extraction and Integration ==
Web Data Reconciliation: Models and Experiences (1-15)
+ Blanco, Lorenzo
+ Crescenzi, Valter
+ Merialdo, Paolo
+ Papotti, Paolo
A Domain Independent Framework for Extracting Linked Semantic Data from Tables (16-33)
+ Mulwad, Varish
+ Finin, Tim
+ Joshi, Anupam
Knowledge Extraction from Structured Sources (34-52)
+ Unbehauen, Jörg
+ Hellmann, Sebastian
+ Auer, Sören
+ Stadler, Claus
Extracting Information from Google Fusion Tables (53-67)
+ Brambilla, Marco
+ Ceri, Stefano
+ Cinefra, Nicola
+ Sarma, Anish Das
+ Forghieri, Fabio
+ et al
Materialization of Web Data Sources (68-81)
+ Bozzon, Alessandro
+ Ceri, Stefano
+ Zagorac, Srdan
== Query and Visualization Paradigms ==
Natural Language Interfaces to Data Services (82-97)
+ Guerrisi, Vincenzo
+ La Torre, Pietro
+ Quarteroni, Silvia
Mobile Multi-domain Search over Structured Web Data (98-110)
+ Aral, Atakan
+ Akin, Ilker Zafer
+ Brambilla, Marco
Clustering and Labeling of Multi-dimensional Mixed Structured Data (111-126)
+ Brambilla, Marco
+ Zanoni, Massimiliano
Visualizing Search Results: Engineering Visual Patterns Development for the Web (127-142)
+ Morales-Chaparro, Rober
+ Preciado, Juan Carlos
+ Sánchez-Figueroa, Fernando
== Exploring Linked Data ==
Extending SPARQL Algebra to Support Efficient Evaluation of Top-K SPARQL Queries (143-156)
+ Bozzon, Alessandro
+ Valle, Emanuele Della
+ Magliacane, Sara
Thematic Clustering and Exploration of Linked Data (157-175)
+ Castano, Silvana
+ Ferrara, Alfio
+ Montanelli, Stefano
Support for Reusable Explorations of Linked Data in the Semantic Web (176-190)
+ Cohen, Marcelo
+ Schwabe, Daniel
== Games, Social Search and Economics ==
A Survey on Proximity Measures for Social Networks (191-206)
+ Cohen, Sara
+ Kimelfeld, Benny
+ Koutrika, Georgia
Extending Search to Crowds: A Model-Driven Approach (207-222)
+ Bozzon, Alessandro
+ Brambilla, Marco
+ Ceri, Stefano
+ Mauri, Andrea
BetterRelations: Collecting Association Strengths for Linked Data Triples with a Game (223-239)
+ Hees, Jörn
+ Roth-Berghofer, Thomas
+ Biedert, Ralf
+ Adrian, Benjamin
+ Dengel, Andreas
An Incentive-Compatible Revenue-Sharing Mechanism for the Economic Sustainability of Multi-domain Search Based on Advertising (240-254)
+ Brambilla, Marco
+ Ceppi, Sofia
+ Gatti, Nicola
+ Gerding, Enrico H.
Building a generic debugger for information extraction pipelines
Poster session: knowledge management
/
Sarma, Anish Das
/
Jain, Alpa
/
Bohannon, Philip
Proceedings of the 2011 ACM Conference on Information and Knowledge
Management
2011-10-24
p.2229-2232
© Copyright 2011 ACM
Summary: Complex information extraction (IE) pipelines are becoming an integral
component of most text processing frameworks. We introduce a first system to
help IE users analyze extraction pipeline semantics and operator
transformations interactively while debugging. This allows the effort to be
proportional to the need, and to focus on the portions of the pipeline under
the greatest suspicion. We present a generic debugger for running
post-execution analysis of any IE pipeline consisting of arbitrary types of
operators. For this, we propose an effective provenance model for IE pipelines
which captures a variety of operator types, ranging from those for which full
to no specifications are available. We have evaluated our proposed algorithms
and provenance model on large-scale real-world extraction pipelines.
STCML: an extensible XML-based language for socio-technical modeling
Short papers
/
Georgas, John C.
/
Sarma, Anita
Proceedings of the 2011 International Workshop on Cooperative and Human
Aspects of Software Engineering
2011-05-21
p.61-64
© Copyright 2011 ACM
Summary: Understanding the complex dependencies between the technical artifacts of
software engineering and the social processes involved in their development has
the potential to improve the processes we use to engineer software as well as
the eventual quality of the systems we produce. A foundational capability in
grounding this study of socio-technical concerns is the ability to explicitly
model technical and social artifacts as well as the dependencies between them.
This paper presents the STCML language, intended to support the modeling of
core socio-technical aspects in software development in a highly extensible
fashion. We present the basic structure of the language, discuss important
language design principles, and offer an example of its application.
Which bug should I fix: helping new developers onboard a new project
Short papers
/
Wang, Jianguo
/
Sarma, Anita
Proceedings of the 2011 International Workshop on Cooperative and Human
Aspects of Software Engineering
2011-05-21
p.76-79
© Copyright 2011 ACM
Summary: A typical entry point for new developers in an open source project is to
contribute a bug fix. However, finding an appropriate bug and an appropriate
fix for that bug requires a good understanding of the project, which is
nontrivial. Here, we extend Tesseract -- an interactive project exploration
environment -- to allow new developers to search over bug descriptions in a
project to quickly identify and explore bugs of interest and their related
resources. More specifically, we extended Tesseract with search capabilities
that enable synonyms and similar-bugs search over bug descriptions in a bug
repository. The goal is to enable users to identify bugs of interest, resources
related to that bug, (e.g., related files, contributing developers,
communication records), and visually explore the appropriate socio-technical
dependencies for the selected bug in an interactive manner. Here we present our
search extension to Tesseract.
Coordination in innovative design and engineering: observations from a lunar
robotics project
Designing for collaboration II
/
Dabbish, Laura A.
/
Wagstrom, Patrick
/
Sarma, Anita
/
Herbsleb, James D.
GROUP'10: International Conference on Supporting Group Work
2010-11-06
p.225-234
© Copyright 2010 ACM
Summary: Coordinating activities across groups in systems engineering or product
development projects is critical to project success, but substantially more
difficult when the work is innovative and dynamic. It is not clear how
technology should best support cross-group collaboration on these types of
projects. Recent work on coordination in dynamic settings has identified
cross-boundary knowledge exchange as a critical mechanism for aligning
activities. In order to inform the design of collaboration technology for
creative work settings, we examined the nature of cross-group knowledge
exchange in an innovative engineering research project developing a lunar rover
robot as part of the Google Lunar X-Prize competition. Our study extends the
understanding of communication and coordination in creative design work, and
contributes to theory on coordination. We introduce four types of cross-team
knowledge exchange mechanisms we observed on this project and discuss
challenges associated with each. We consider implications for the design of
collaboration technology to support cross-team knowledge exchange in dynamic,
creative work environments.
Continuous coordination within the context of cooperative and human aspects
of software engineering
/
Al-Ani, Ban
/
Trainer, Erik
/
Ripley, Roger
/
Sarma, Anita
/
van der Hoek, André
/
Redmiles, David
Proceedings of the 2008 International Workshop on Cooperative and Human
Aspects of Software Engineering
2008-05-13
p.1-4
© Copyright 2008 ACM
Summary: We have developed software tools that aim to support cooperative
software engineering tasks and promote an awareness of the social dependencies that
is essential to successful coordination. The tools share common characteristics
that can be traced back to the principles of the Continuous Coordination (CC)
paradigm. However, the development of each sprang from carrying out a different
set of activities during its development process. In this paper, we outline the
principles of the CC paradigm, the tools that implement these principles and
focus on the social aspects of software engineering. Finally, we discuss the
socio-technical and human-centered processes we adopted to develop these tools.
Our conclusion is that the cooperative dimension of our tools represents the
cooperation between researchers, subjects, and field sites; this suggests that
the development processes adopted to develop similar tools need to
reflect this cooperative dimension.
Detecting near-duplicates for web crawling
Similarity search
/
Manku, Gurmeet Singh
/
Jain, Arvind
/
Sarma, Anish Das
Proceedings of the 2007 International Conference on the World Wide Web
2007-05-08
p.141-150
© Copyright 2007 International World Wide Web Conference Committee (IW3C2)
Summary: Near-duplicate web documents are abundant. Two such documents differ from
each other in a very small portion that displays advertisements, for example.
Such differences are irrelevant for web search. So the quality of a web crawler
increases if it can assess whether a newly crawled web page is a near-duplicate
of a previously crawled web page or not. In the course of developing a
near-duplicate detection system for a multi-billion page repository, we make
two research contributions. First, we demonstrate that Charikar's
fingerprinting technique is appropriate for this goal. Second, we present an
algorithmic technique for identifying existing f-bit fingerprints that differ
from a given fingerprint in at most k bit-positions, for small k. Our technique
is useful for both online queries (single fingerprints) and batch queries
(multiple fingerprints). Experimental evaluation over real data confirms the
practicality of our design.
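The near-duplicate criterion described above is simply a Hamming-distance test on fingerprints. A naive pairwise check is easy to state (the paper's contribution is making this search fast over billions of stored fingerprints; the helper names here are illustrative):

```python
def hamming_distance(a: int, b: int) -> int:
    # Number of bit positions in which two fingerprints differ.
    return bin(a ^ b).count("1")

def is_near_duplicate(fp_a: int, fp_b: int, k: int = 3) -> bool:
    # Two fingerprints are near-duplicates if they differ in at
    # most k bit positions.
    return hamming_distance(fp_a, fp_b) <= k

# Two fingerprints differing in exactly two bit positions:
a = 0b10110000
b = 0b10110011
# near-duplicates for k = 3, but not for k = 1
```

Scanning all stored fingerprints this way is linear per query; the paper's algorithmic technique avoids that scan by organizing permuted copies of the fingerprint table.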