Talk abstract: In this talk, we will discuss our effort to overcome these problems. The talk has three parts. In the first part, we will present k-FreqItems, a clustering algorithm for large, sparse categorical sets. k-FreqItems is built on a novel sparse center representation called FreqItem and uses the popular Jaccard distance to compare sets. We will describe the techniques we used to scale k-FreqItems so that it can partition datasets with millions of columns and billions of rows into tens of thousands of clusters. In the second part, we will discuss how the generality and scalability of k-FreqItems allow us to cluster complex data such as graphs and mixed-attribute tables, while helping us identify good settings for the weight and scaling of features. Finally, we will look at how k-FreqItems can be used as a pre-processing tool for other data analytics operations such as active learning, classification, data indexing, and other, more expensive clustering methods.
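For readers unfamiliar with the Jaccard distance mentioned in the abstract, the following is a minimal sketch of its standard definition on two sets: one minus the size of the intersection divided by the size of the union. It is not the speaker's implementation and does not cover the FreqItem representation; the function name and the example sets are illustrative assumptions.

```python
def jaccard_distance(a, b):
    """Standard Jaccard distance between two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    # Convention assumed here: two empty sets are at distance 0.
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical example: two rows of categorical items.
row_1 = {"a", "b", "c"}
row_2 = {"b", "c", "d"}
print(jaccard_distance(row_1, row_2))  # 2 shared items out of 4 distinct -> 0.5
```

Because the distance depends only on which items are present, it suits sparse categorical rows where most columns are empty.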
Speaker bio: Anthony K. H. Tung is currently a Professor in the Department of Computer Science, National University of Singapore (NUS). He received his B.Sc. and M.Sc. in computer science from the National University of Singapore in 1997 and 1998, respectively. In 2001, he received his Ph.D. in computer science from Simon Fraser University (SFU).
Dr. Anthony Tung's main research areas are searching, mining, and visualizing complex data. More recently, he has also been looking into creating innovative big data applications built on the data processing techniques that he has developed over the past 22 years. Anthony is also the deputy director of the NUS N-CRiPT research center (https://ncript.comp.nus.edu.sg/).