Items where author is affiliated with Microsoft Research Asia
Number of items: 17.
and Hu, Jian and Zhu, Yunzhang and Li, Hua and Chen, Zheng: Competitive Analysis from Click-Through Log.
Existing keyword suggestion tools from various search engine companies can automatically suggest keywords related to an advertiser's products or services, drawing on simple statistics of the keywords such as search volume and cost per click (CPC). However, the nature of the generalized second-price auction suggests that understanding competitors' keyword selection and bidding strategies helps win the auction more than relying on general search statistics alone. In this paper, we propose a novel keyword suggestion strategy, called Competitive Analysis, to explore the keyword-based competition relationships among advertisers and ultimately help advertisers build campaigns with better performance. The experimental results demonstrate that the proposed Competitive Analysis can both help advertisers promote their product sales and generate more revenue for the search engine companies.
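The generalized second-price auction referenced above allocates ad slots by bid rank and charges each winner the bid of the advertiser ranked immediately below. A minimal sketch with invented advertiser names and bids (real sponsored-search auctions also involve quality scores and reserve prices, omitted here):

```python
# Minimal generalized second-price (GSP) auction sketch.
# Each slot winner pays the next-highest bid; names and bids are
# illustrative only, not data from the paper.

def gsp_allocate(bids, num_slots):
    """bids: dict advertiser -> bid; returns list of (advertiser, price)."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winners = []
    for i in range(min(num_slots, len(ranked))):
        advertiser, _ = ranked[i]
        # price = the bid of the advertiser one rank below, or 0 if none
        price = ranked[i + 1][1] if i + 1 < len(ranked) else 0.0
        winners.append((advertiser, price))
    return winners

print(gsp_allocate({"a": 3.0, "b": 1.5, "c": 2.0}, num_slots=2))
# a wins slot 1 paying c's bid (2.0); c wins slot 2 paying b's bid (1.5)
```

Because the price paid depends on the bid directly below yours, knowing which competitors bid on a keyword matters as much as the keyword's own statistics, which is the motivation the abstract gives.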
and Xie, Xing and Duan, Manni and Hara, Takahiro and Nishio, Shojiro: A Game Based Approach to Assign Geographical Relevance to Web Images.
Geographical context is very important for images. Millions of images on the Web have already been assigned latitude and longitude information. Despite the rapid proliferation of such images with geographical context, it is still difficult to effectively search and browse them, since we have no way to decide their relevance. In this paper, we focus on the geographical relevance of images, defined as the extent to which the main objects in an image match landmarks at the location where the image was taken. Recently, researchers have proposed game-based approaches to label large-scale data such as Web images. However, previous works have not shown in detail the quality of the collected game logs or how the logs can improve existing applications. To answer these questions, we design and implement a Web-based multi-player game to collect human knowledge while people are enjoying the game. We then thoroughly analyze the game logs obtained during a three-week study with 147 participants and propose methods to determine image geographical relevance. In addition, we conduct an experiment to compare our methods with a commercial search engine. Experimental results show that our methods dramatically improve image search relevance. Furthermore, we show that we can derive geographically relevant objects and their salient portions in images, which is valuable for a number of applications such as image location recognition.
and Mei, Tao and Liu, Chris and Hua, Xian-Sheng: GameSense.
This paper presents a novel game-like advertising system called GameSense, which is driven by the compelling content of online images. Given a Web page, which typically contains images, GameSense is able to select suitable images to create online in-image games for advertising. Contextually relevant ads (i.e., product logos) are embedded at appropriate positions within the online games. The ads are selected based not only on textual relevance but also on visual content similarity. The game provides viewers with a rich experience and thus promotes the embedded ads, making the advertising more effective.
and Liu, Ning and Wang, Gang and Zhang, Wen and Jiang, Yun and Chen, Zheng: How Much Can Behavioral Targeting Help Online Advertising?
Behavioral Targeting (BT) is a technique used by online advertisers to increase the effectiveness of their campaigns, and it is playing an increasingly important role in the online advertising market. However, how much BT can truly help online advertising in search engines remains underexplored in academia. In this paper we provide an empirical study on the click-through log of advertisements collected from a commercial search engine. From the experimental results over a period of seven days, we draw three important conclusions: (1) users who clicked the same ad do indeed have similar behaviors on the Web; (2) the click-through rate (CTR) of an ad can be improved by as much as 670% on average by properly segmenting users for behaviorally targeted advertising in sponsored search; and (3) using short-term user behaviors to represent users is more effective than using long-term user behaviors for BT. We conducted statistical t-tests, which verified that all conclusions drawn in the paper are statistically significant. To the best of our knowledge, this work is the first empirical study of BT on the click-through log of real-world ads.
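The CTR-lift measurement behind conclusion (2) can be illustrated with a toy computation. The segment names and click/impression counts below are invented, and the paper's actual user-segmentation method is not reproduced here; this only shows how a lift figure is derived once segments exist:

```python
# Toy illustration of measuring CTR lift from user segmentation.
# Counts are invented; a real study would use click-through logs.

def ctr(clicks, impressions):
    return clicks / impressions if impressions else 0.0

# Hypothetical per-segment (clicks, impressions) for one ad after
# grouping users by their browsing behavior.
segments = {
    "seg_a": (70, 1000),   # behaviorally similar users, high affinity
    "seg_b": (10, 1000),
    "seg_c": (5, 1000),
}

overall_clicks = sum(c for c, _ in segments.values())
overall_imps = sum(i for _, i in segments.values())
baseline = ctr(overall_clicks, overall_imps)          # untargeted CTR

best_seg, (c, i) = max(segments.items(), key=lambda kv: ctr(*kv[1]))
lift = (ctr(c, i) - baseline) / baseline * 100        # percent improvement
print(f"{best_seg}: CTR lift of {lift:.0f}% over no targeting")
```

Serving the ad only to the best-responding segment raises CTR relative to the untargeted baseline; the paper's 670% figure is an average of this kind of lift, validated with t-tests.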
and Yan, Jun and Fan, Weiguo and Yang, Qiang and Chen, Zheng: Identifying Vertical Search Intention of Query through Social Tagging Propagation.
A pressing task in unifying generic Web search with vertical search engines (VSEs) is to identify a user's vertical search intention from the user's query. In this paper, we propose a novel method to propagate social annotation, which includes user-supplied tag data, to both queries and VSEs in order to semantically bridge them. Our proposed algorithm consists of three key steps: query annotation, vertical annotation, and query intention identification. The algorithm, referred to as TagQV, verifies that social tagging can be propagated to represent Web objects such as queries and VSEs, in addition to Web pages. Experiments on real Web search queries demonstrate the effectiveness of TagQV in query intention identification.
and Cai, Rui and Wang, Yida and Zhu, Jun and Zhang, Lei and Ma, Wei-Ying: Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums.
Web forums have become an important data resource for many Web applications, but extracting structured data from unstructured Web forum pages is still a challenging task due to both complex page layout designs and unrestricted user-created posts. In this paper, we study the problem of structured data extraction from various Web forum sites. Our target is to find a solution as general as possible for extracting structured data, such as post title, post author, post time, and post content, from any forum site. In contrast to most existing information extraction methods, which leverage only the knowledge inside an individual page, we incorporate both page-level and site-level knowledge and employ Markov logic networks (MLNs) to effectively integrate all useful evidence by learning their importance automatically. Site-level knowledge includes (1) the linkages among different object pages, such as list pages and post pages, and (2) the interrelationships of pages belonging to the same object. The experimental results on 20 forums show very encouraging information extraction performance and demonstrate the ability of the proposed approach on various forums. We also show that performance is limited if only page-level knowledge is used, whereas incorporating site-level knowledge significantly improves both precision and recall.
and Yang, Linjun and Yu, Nenghai and Hua, Xian-Sheng: Learning to Tag.
Social tagging provides valuable and crucial information for large-scale Web image retrieval. It is ontology-free and easy to obtain; however, irrelevant tags frequently appear, and users typically will not tag all semantic objects in an image, a problem known as semantic loss. To reduce noise and compensate for the semantic loss, tag recommendation has been proposed in the literature. However, current recommendation methods simply rank related tags based on the single modality of tag co-occurrence over the whole dataset, ignoring other modalities such as visual correlation. This paper proposes a multi-modality recommendation based on both tag and visual correlation, and formulates tag recommendation as a learning problem. Each modality is used to generate a ranking feature, and the RankBoost algorithm is applied to learn an optimal combination of these ranking features from different modalities. Experiments on Flickr data demonstrate the effectiveness of this learning-based multi-modality recommendation strategy.
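The multi-modality idea above, where each modality contributes a ranking feature and a learned weight combines them, can be sketched as follows. The feature scores and weights are invented, and RankBoost's actual training from preference pairs is omitted; this only shows the combination step:

```python
# Combine per-modality ranking features into one tag score.
# Scores and weights are illustrative; in the paper the weights
# would be learned by RankBoost from labeled preference data.

def combined_rank(candidates, features, weights):
    """candidates: list of tags; features: dict modality -> {tag: score}."""
    def score(tag):
        return sum(w * features[m].get(tag, 0.0) for m, w in weights.items())
    return sorted(candidates, key=score, reverse=True)

features = {
    "tag_cooccurrence": {"beach": 0.9, "sky": 0.7, "party": 0.6},
    "visual_similarity": {"beach": 0.8, "sky": 0.9, "party": 0.1},
}
weights = {"tag_cooccurrence": 0.4, "visual_similarity": 0.6}

print(combined_rank(["beach", "sky", "party"], features, weights))
```

Note how "party" ranks high on co-occurrence alone but drops once visual similarity is weighted in, which is exactly the failure of single-modality recommendation the abstract describes.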
and Zhang, Lizhu and Xie, Xing and Ma, Wei-Ying: Mining Interesting Locations and Travel Sequences from GPS Trajectories.
The increasing availability of GPS-enabled devices is changing the way people interact with the Web and brings us large amounts of GPS trajectories representing people's location histories. In this paper, based on multiple users' GPS trajectories, we aim to mine interesting locations and classical travel sequences in a given geospatial region. Here, interesting locations mean culturally important places, such as Tiananmen Square in Beijing, and frequented public areas, such as shopping malls and restaurants. Such information can help users understand surrounding locations and would enable travel recommendation. In this work, we first model multiple individuals' location histories with a tree-based hierarchical graph (TBHG). Second, based on the TBHG, we propose a HITS (Hypertext Induced Topic Search)-based inference model, which regards an individual's visit to a location as a directed link from the user to that location. This model infers the interest of a location by taking into account three factors: (1) the interest of a location depends not only on the number of users visiting it but also on those users' travel experiences; (2) users' travel experiences and location interests have a mutually reinforcing relationship; and (3) the interest of a location and the travel experience of a user are relative values and are region-related. Third, we mine the classical travel sequences among locations, considering both the interests of these locations and users' travel experiences. We evaluated our system using a large GPS dataset collected by 107 users over a period of one year in the real world. Our HITS-based inference model outperformed baseline approaches such as rank-by-count and rank-by-frequency, and taking users' travel experiences and location interests into account yielded further gains over baselines such as rank-by-count and rank-by-interest.
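The mutual reinforcement in factors (1) and (2) follows the classic HITS pattern: locations play the authority role and users the hub role. A minimal sketch, with an invented visit matrix standing in for real GPS-derived location histories (the TBHG and region handling are omitted):

```python
# HITS-style mutual reinforcement between users' travel experience
# (hub score) and locations' interest (authority score).
# The visit matrix is invented for illustration.

def mine_interests(visits, iterations=50):
    """visits[u][l] = number of times user u visited location l."""
    users = list(visits)
    locations = sorted({l for v in visits.values() for l in v})
    exp = {u: 1.0 for u in users}
    interest = {l: 1.0 for l in locations}
    for _ in range(iterations):
        # a location is interesting if experienced users visit it
        for l in locations:
            interest[l] = sum(visits[u].get(l, 0) * exp[u] for u in users)
        # a user is experienced if they visit interesting locations
        for u in users:
            exp[u] = sum(n * interest[l] for l, n in visits[u].items())
        # normalize so the scores stay bounded across iterations
        zi, ze = sum(interest.values()), sum(exp.values())
        interest = {l: s / zi for l, s in interest.items()}
        exp = {u: s / ze for u, s in exp.items()}
    return interest, exp

visits = {"u1": {"square": 3, "mall": 1},
          "u2": {"square": 2},
          "u3": {"mall": 1, "cafe": 1}}
interest, exp = mine_interests(visits)
print(max(interest, key=interest.get))  # the most interesting location
```

A location visited often by well-travelled users (here "square") ends up with the highest interest score, rather than simply the one with the highest raw visit count, which is the advantage over rank-by-count that the abstract reports.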
and Sun, Jian-Tao and Hu, Jian and Chen, Zheng: Mining Multilingual Topics from Wikipedia.
In this paper, we try to leverage a large-scale multilingual knowledge base, Wikipedia, to help effectively analyze and organize Web information written in different languages. Based on the observation that one Wikipedia concept may be described by articles in different languages, we adapt an existing topic modeling algorithm to mine multilingual topics from this knowledge base. The extracted "universal" topics have multiple types of representations, with each type corresponding to one language. Accordingly, new documents in different languages can be represented in a space spanned by a group of universal topics, which makes various multilingual Web applications feasible.
and Yan, Jun and Chen, Zheng: A Probabilistic Model Based Approach for Blended Search.
In this paper, we propose to model the blended search problem by assuming conditional dependencies among queries, vertical search engines (VSEs), and search results. The probability distributions of this model are learned from search engine query logs through a unigram language model. Our experimental exploration shows that (1) a large number of queries in generic Web search have vertical search intentions, and (2) our proposed algorithm can effectively blend vertical search results into generic Web search, improving the Mean Average Precision (MAP) by as much as 16% compared to traditional Web search without blending. In the classical meta-search problem's configuration, however, the query logs of the component search engines are not available for study. In this extended abstract, we model the blended search problem based on the conditional dependencies among queries, VSEs, and all the search results. We utilize the usage information, i.e., the query logs of all the VSEs, which is not available to traditional meta-search engines, to learn the model parameters with a smoothed unigram language model. Finally, given a user query, the search results from both generic Web search and the different VSEs are ranked together by inferring their probabilities of relevance to the given query. The main contributions of this work are: (1) by studying the vertical search engines' query logs of a commercial search engine, we show the importance of the blended search problem; (2) we propose a novel probabilistic-model-based approach to the blended search problem; and (3) we experimentally verify that our proposed algorithm can effectively blend vertical search results into generic Web search, improving MAP by as much as 16% compared to traditional Web search without vertical search blending and by 10% compared to another ranking baseline.
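A smoothed unigram language model over each VSE's query log, as described above, scores how well a vertical "explains" a new query. The toy logs, vocabulary, and additive smoothing below are invented stand-ins (the paper does not specify its smoothing scheme here):

```python
# Smoothed unigram language model over a vertical search engine's
# query log: score log P(query | VSE) by summing smoothed log term
# probabilities. Logs and vocabulary are invented for illustration.
import math
from collections import Counter

def lm_score(query, log_terms, vocab_size, mu=1.0):
    """Additive (Laplace-style) smoothing with pseudo-count mu."""
    counts = Counter(log_terms)
    total = len(log_terms)
    score = 0.0
    for term in query.split():
        p = (counts[term] + mu) / (total + mu * vocab_size)
        score += math.log(p)      # log-space to avoid underflow
    return score

image_log = "sunset photo beach photo wallpaper".split()
news_log = "election results stock market election".split()
vocab = len(set(image_log + news_log))

q = "beach photo"
best = max([("images", image_log), ("news", news_log)],
           key=lambda kv: lm_score(q, kv[1], vocab))
print(best[0])  # the vertical whose log best explains the query
```

Ranking each vertical's results by such a relevance probability, alongside generic Web results, is the blending step the abstract describes.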
and Wang, Xin-Jing and Feng, Dan and Zhang, Lei: Ranking Community Answers via Analogical Reasoning.
Due to the lexical gap between questions and answers, automatically detecting the right answers is very challenging for community question-answering sites. In this paper, we propose an analogical-reasoning-based method. It treats questions and answers as relational data and ranks an answer by measuring the analogy of its link to a query against the links embedded in previous relevant knowledge; the answer that links in the most analogous way to the new question is assumed to be the best answer. We base our experiments on 29.8 million Yahoo! Answers question-answer threads and show the effectiveness of the approach.
and Liu, Ning and Qing Chang, Elaine and Ji, Lei and Chen, Zheng: Search Result Re-ranking Based on Gap between Search Queries and Social Tags.
Both search engine click-through logs and social annotations have been utilized as user feedback for search result re-ranking. However, to the best of our knowledge, no previous study has explored the correlation between these two factors for the task of search result re-ranking. In this paper, we show that the gap between the search queries and the social tags of the same Web page can well reflect its user preference score. Motivated by this observation, we propose a novel algorithm, called Query-Tag-Gap (QTG), to re-rank search results for better user satisfaction. Intuitively, on one hand, search users' intentions are generally described by their queries before they read the search results. On the other hand, Web annotators semantically tag Web pages after they read the content of the pages. The difference between users' recognition of the same page before and after they read it is a good reflection of user satisfaction. In this extended abstract, we formally define the query set and the tag set of the same page as users' pre- and post-knowledge, respectively. We empirically show the strong correlation between user satisfaction and the user's knowledge gap before and after reading the page. Based on this gap, experiments have shown the outstanding performance of our proposed QTG algorithm in search result re-ranking.
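One simple way to quantify a pre-/post-knowledge gap of the kind described above is to compare the term distribution of a page's queries with that of its tags. This is NOT the paper's QTG formula, only a hypothetical illustration of the idea, with invented queries and tags:

```python
# Hypothetical query-tag gap: cosine distance between the terms users
# search with (pre-read) and the tags they assign (post-read).
# Illustration only; the paper's exact QTG score is not reproduced.
import math
from collections import Counter

def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def gap(query_terms, tags):
    """Large gap = queries and tags disagree about the page."""
    return 1.0 - cosine(query_terms, tags)

queries = ["python tutorial", "learn python"]
tags = ["python", "tutorial", "programming"]
qterms = [t for q in queries for t in q.split()]
print(f"gap = {gap(qterms, tags):.2f}")
```

A page whose post-read tags diverge strongly from the queries that led users to it would receive a large gap score and be demoted in the re-ranking.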
and Nie, Zaiqing and Liu, Xiaojiang and Zhang, Bo and Wen, Ji-Rong: StatSnowball: a Statistical Approach to Extracting Entity Relationships.
Traditional relation extraction methods require pre-specified relations and relation-specific human-tagged examples. Bootstrapping systems significantly reduce the number of training examples, but they usually apply heuristic-based methods to combine a set of strict hard rules, which limits their ability to generalize and thus yields low recall. Furthermore, existing bootstrapping methods do not perform open information extraction (Open IE), which can identify various types of relations without requiring pre-specifications. In this paper, we propose a statistical extraction framework called Statistical Snowball (StatSnowball), a bootstrapping system that can perform both traditional relation extraction and Open IE. StatSnowball uses discriminative Markov logic networks (MLNs) and softens hard rules by learning their weights in a maximum likelihood estimation sense. The MLN is a general model and can be configured to perform different levels of relation extraction. In StatSnowball, pattern selection is performed by solving an l1-norm penalized maximum likelihood estimation, which enjoys well-founded theory and efficient solvers. We extensively evaluate the performance of StatSnowball in different configurations on both a small but fully labeled dataset and large-scale Web data. Empirical results show that StatSnowball can achieve significantly higher recall without sacrificing high precision during iterations with a small number of seeds, and that the joint inference of the MLN can improve performance. Finally, StatSnowball is efficient, and we have developed a working entity relation search engine called Renlifang based on it.
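The l1-norm penalty mentioned above is what makes pattern selection automatic: it drives the weights of uninformative patterns exactly to zero. As a stand-in illustration (not the paper's MLN likelihood), here is a minimal proximal-gradient (ISTA) solver for l1-penalized least squares showing that sparsifying effect:

```python
# ISTA (proximal gradient) for l1-penalized least squares, a toy
# stand-in for the l1-penalized maximum likelihood used in StatSnowball.

def soft_threshold(x, lam):
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def ista(X, y, lam=0.5, step=0.01, iters=2000):
    """Minimize 0.5*||Xw - y||^2 + lam*||w||_1 by proximal gradient."""
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(iters):
        # gradient of the smooth squared-error part
        resid = [sum(xi[j] * w[j] for j in range(n_features)) - yi
                 for xi, yi in zip(X, y)]
        grad = [sum(r * xi[j] for r, xi in zip(resid, X))
                for j in range(n_features)]
        # gradient step followed by the l1 proximal (shrinkage) step
        w = [soft_threshold(w[j] - step * grad[j], step * lam)
             for j in range(n_features)]
    return w

# Feature 0 is informative, feature 1 is noise: its weight goes to 0.
X = [[1.0, 0.1], [2.0, -0.1], [3.0, 0.05], [4.0, 0.0]]
y = [2.0, 4.0, 6.0, 8.0]
w = ista(X, y)
print([round(v, 2) for v in w])
```

In StatSnowball the "features" are extraction patterns, so a zero weight means the pattern is discarded, which is how the system selects patterns without hand-written hard rules.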
and Hua, Xian-Sheng and Yang, Linjun and Wang, Meng and Zhang, Hong-Jiang: Tag Ranking.
Social media sharing Web sites like Flickr allow users to annotate images with free tags, which significantly facilitates Web image search and organization. However, the tags associated with an image are generally in a random order, without any importance or relevance information, which limits the effectiveness of these tags in search and other applications. In this paper, we propose a tag ranking scheme that aims to automatically rank the tags associated with a given image according to their relevance to the image content. We first estimate initial relevance scores for the tags based on probability density estimation, and then perform a random walk over a tag similarity graph to refine the relevance scores. Experimental results on a 50,000-photo Flickr collection show that the proposed tag ranking method is both effective and efficient. We also apply tag ranking to three applications: (1) tag-based image search, (2) tag recommendation, and (3) group recommendation, demonstrating that the proposed tag ranking approach substantially boosts the performance of social-tagging-related applications.
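The random-walk refinement step described above can be sketched with a restart-style walk over a tag similarity graph. The similarity values, initial scores, and restart weight below are invented, and the probability-density estimation of the initial scores is omitted:

```python
# Random-walk refinement of initial tag relevance scores over a tag
# similarity graph, in the spirit of the tag ranking scheme above.
# Similarities and initial scores are invented for illustration.

def random_walk_rerank(sim, init, alpha=0.85, iters=100):
    """sim[a][b]: similarity between tags; init: initial relevance."""
    tags = list(init)
    # row-normalize similarities into transition probabilities
    trans = {}
    for a in tags:
        z = sum(sim[a].values())
        trans[a] = {b: s / z for b, s in sim[a].items()}
    score = dict(init)
    for _ in range(iters):
        # propagate score along similarity edges, with restart to init
        score = {a: alpha * sum(score[b] * trans[b].get(a, 0.0)
                                for b in tags)
                    + (1 - alpha) * init[a]
                 for a in tags}
    return sorted(tags, key=score.get, reverse=True)

sim = {"beach": {"sea": 0.9, "car": 0.1},
       "sea":   {"beach": 0.9, "car": 0.1},
       "car":   {"beach": 0.1, "sea": 0.1}}
init = {"beach": 0.5, "sea": 0.4, "car": 0.45}
print(random_walk_rerank(sim, init))
```

Mutually similar tags ("beach", "sea") reinforce each other during the walk, while a tag that is weakly connected to the rest ("car") sinks even if its initial score was competitive.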
and Jiang, Daxin and Pei, Jian and Chen, Enhong and Li, Hang: Towards Context-Aware Search by Learning a Very Large Variable Length Hidden Markov Model from Search Logs.
Capturing the context of a user’s query from the previous queries and clicks in the same session may help understand the user’s information need. A context-aware approach to document re-ranking, query suggestion, and URL recommendation may improve users’ search experience substantially. In this paper, we propose a general approach to context-aware search. To capture contexts of queries, we learn a variable length Hidden Markov Model (vlHMM) from search sessions extracted from log data. Although the mathematical model is intuitive, how to learn a large vlHMM with millions of states from hundreds of millions of search sessions poses a grand challenge. We develop a strategy for parameter initialization in vlHMM learning which can greatly reduce the number of parameters to be estimated in practice. We also devise a method for distributed vlHMM learning under the map-reduce model. We test our approach on a real data set consisting of 1.8 billion queries, 2.6 billion clicks, and 840 million search sessions, and evaluate the effectiveness of the vlHMM learned from the real data on three search applications: document re-ranking, query suggestion, and URL recommendation. The experimental results show that our approach is both effective and efficient.
and Wang, Gang and Lochovsky, Fred and Sun, Jian-Tao and Chen, Zheng: Understanding User's Query Intent with Wikipedia.
Understanding the intent behind a user's query can help a search engine automatically route the query to corresponding vertical search engines to obtain particularly relevant content, thus greatly improving user satisfaction. There are three major challenges to the query intent classification problem: (1) intent representation, (2) domain coverage, and (3) semantic interpretation. Current approaches to predicting a user's intent mainly utilize machine learning techniques. However, it is difficult, and often requires much human effort, to meet all of these challenges with statistical machine learning approaches. In this paper, we propose a general methodology for query intent classification. With very little human effort, our method can discover large quantities of intent concepts by leveraging Wikipedia, one of the best human knowledge bases. The Wikipedia concepts are used as the intent representation space; thus, each intent domain is represented as a set of Wikipedia articles and categories. The intent of any input query is identified by mapping the query into the Wikipedia representation space. Compared with previous approaches, our proposed method achieves much better coverage when classifying queries in an intent domain, even when the number of seed intent examples is very small. Moreover, the method is very general and can easily be applied to various intent domains. We demonstrate the effectiveness of this method in three different applications: travel, job, and person name. In each of the three cases, only a couple of seed intent queries are provided. We perform quantitative evaluations in comparison with two baseline methods, and the experimental results show that our method significantly outperforms the other approaches in each intent domain.
and Wang, Lu and Guo, Xiaolin and Pan, Aimin and Zhu, Bin B.: WPBench: A Benchmark for Evaluating the Client-side Performance of Web 2.0 Applications.
In this paper, a benchmark called WPBench is reported that evaluates the responsiveness of Web browsers for modern Web 2.0 applications. In WPBench, variations in servers and networks are removed, so the benchmark result is as close as possible to what Web users would perceive. To achieve this, WPBench records users' interactions with typical Web 2.0 applications and then replays the Web navigations when benchmarking browsers. The replay mechanism can emulate the actual user interactions and the characteristics of the servers and networks in a consistent, browser-independent way, so that any standards-compliant browser can be benchmarked fairly. In addition to describing the design and generation of WPBench, we also report WPBench comparison results on responsiveness for three popular Web browsers: Internet Explorer, Firefox, and Chrome.
About this site
This website has been set up for WWW2009 by Christopher Gutteridge of the University of Southampton, using our EPrints software.
We (the Southampton EPrints Project) intend to preserve the files and HTML pages of this site for many years; however, we will turn it into flat files for long-term preservation. This means that at some point in the months after the conference the search, metadata export, JSON interface, OAI, etc. will be disabled as we "fossilize" the site. Please plan accordingly. Feel free to ask nicely for us to keep the dynamic site online longer if there's a really good (or cool) use for it... [this has now happened; this site is now static]