NIPS Conference Book 2011
TUTORIALS December 12, 2011 Granada Congress and Exhibition Centre, Granada, Spain CONFERENCE SESSIONS December 12-15, 2011 Granada Congress and Exhibition Centre, Granada, Spain WORKSHOPS December 16-17, 2011 Melia Sierra Nevada & Melia Sol y Nieve, Sierra Nevada, Spain
Sponsored by the Neural Information Processing Systems Foundation, Inc. The technical program includes six invited talks and 306 accepted papers, selected from a total of 1,400 submissions considered by the program committee. Because the conference stresses interdisciplinary interactions, there are no parallel sessions. Papers presented at the conference will appear in Advances in Neural Information Processing Systems 24, edited by Rich Zemel, John Shawe-Taylor, Peter Bartlett, Fernando Pereira and Kilian Weinberger.
Table of Contents

Organizing Committee
Program Committee
NIPS Foundation Offices and Board Members
Core Logistics Team
Awards
Sponsors
Tuesday: Oral Sessions 1 - 5 and Abstracts; Spotlight Sessions 2 - 4 and Abstracts; Poster Sessions T1 - T103 and Abstracts; Location of Presentations; Demonstrations
Wednesday: Oral Sessions 6 - 10 and Abstracts; Spotlight Sessions 5 - 7 and Abstracts; Poster Sessions W1 - W101
Thursday: Oral Sessions 11 - 14 and Abstracts
Organizing Committee
General Chairs: John Shawe-Taylor, University College London; Richard Zemel, University of Toronto
Program Chairs: Peter Bartlett, Queensland Univ. of Technology & UC Berkeley; Fernando Pereira, Google Research
Spanish Ambassador: Jesus Cortes, University of Granada, Spain
Tutorials Chair: Max Welling, University of California, Irvine
Workshop Chairs: Fernando Perez-Cruz, University Carlos III in Madrid, Spain; Jeff Bilmes, University of Washington
Demonstration Chair: Samy Bengio, Google Research
Publications Chair & Electronic Proceedings Chair: Kilian Weinberger, Washington University in St. Louis
Program Manager: David Hall, University of California, Berkeley
Program Committee
Cedric Archambeau (Xerox Research Centre Europe), Andreas Argyriou (Toyota Technological Institute at Chicago), Peter Auer (Montanuniversität Leoben), Mikhail Belkin (Ohio State University), Chiru Bhattacharyya (Indian Institute of Science), Charles Cadieu (University of California, Berkeley), Michael Collins (Columbia University), Ronan Collobert (IDIAP Research Institute), Hal Daume III (University of Maryland), Fei Fei Li (Stanford University), Rob Fergus (New York University), Maria Florina Balcan (Georgia Tech), Kenji Fukumizu (Institute of Statistical Mathematics), Amir Globerson (The Hebrew University of Jerusalem), Sally Goldman (Google), Noah Goodman (Stanford University), Alexander Gray (Georgia Tech), Katherine Heller (MIT), Guy Lebanon (Georgia Tech), Mate Lengyel (University of Cambridge), Roger Levy (University of California, San Diego), Hang Li (Microsoft), Chih-Jen Lin (National Taiwan University), Phil Long (Google), Yi Ma (University of Illinois at Urbana-Champaign), Remi Munos (INRIA, Lille), Jan Peters (Max Planck Institute of Intelligent Systems, Tübingen), Jon Pillow (University of Texas, Austin), Joelle Pineau (McGill University), Ali Rahimi (San Francisco, CA), Sasha Rakhlin (University of Pennsylvania), Pradeep Ravikumar (University of Texas, Austin), Ruslan Salakhutdinov (MIT), Sunita Sarawagi (IIT Bombay), Thomas Serre (Brown University), Shai Shalev-Shwartz (The Hebrew University of Jerusalem), Ingo Steinwart (Universität Stuttgart), Amar Subramanya (Google), Masashi Sugiyama (Tokyo Institute of Technology), Koji Tsuda (National Institute of Advanced Industrial Science and Technology), Raquel Urtasun (Toyota Technological Institute at Chicago), Manik Varma (Microsoft), Nicolas Vayatis (École Normale Supérieure de Cachan), Jean-Philippe Vert (Mines ParisTech), Hanna Wallach (University of Massachusetts Amherst), Frank Wood (Columbia University), Eric Xing (Carnegie Mellon University), Yuan Yao (Peking University), Kai Yu (NEC Labs), Tong Zhang (Rutgers University), Jerry Zhu (University of Wisconsin-Madison)
NIPS would like to especially thank Microsoft Research for their donation of Conference Management Toolkit (CMT) software and server space.
Advisory Board
Emeritus Members
Awards
Outstanding Student Paper Awards
Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials, Philipp Krähenbühl* and Vladlen Koltun
Priors over Recurrent Continuous Time Processes, Ardavan Saeedi* and Alexandre Bouchard-Côté
Fast and Accurate k-means for Large Datasets, Michael Shindler*, Alex Wong, and Adam Meyerson
* Winner
Sponsors
NIPS gratefully acknowledges the generosity of those individuals and organizations who have provided financial support for the NIPS 2011 conference. The financial support enabled us to sponsor student travel and participation, the outstanding student paper awards, the demonstration track and the opening buffet.
Program Highlights
[Floor Zero plan: Reception Desk at the Front Entrance]
12:00 - 12:40 pm
Oral Session 3: Modelling Genetic Variations using Fragmentation-Coagulation Processes, Y. Teh, C. Blundell, L. Elliott; Priors over Recurrent Continuous Time Processes, A. Saeedi, A. Bouchard-Côté
12:40 - 1:10 pm
MONDAY DEC 12TH
7:30 am - 6:30 pm
Registration Desk Open Floor Zero
Spotlights Session 3; Oral Session 4: Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss, D. McAllester, J. Keshet; Spotlights Session 4
1:10 - 1:30 pm
TUESDAY DEC 13TH
1:30 - 2 pm
THURSDAY DEC 15TH
Spotlight Session 1 Poster Session Floor One
7:00 - 11:59 pm
MONDAY DEC 12TH
TUESDAY DEC 13TH
8 am - 5:30 pm
9:30 - 10:40 am
THURSDAY DEC 15TH
Oral Session 1: Posner Lecture: Learning About Sensorimotor Data, Richard Sutton; Oral Session 1: A Non-Parametric Approach to Dynamic Programming, O. Kroemer, J. Peters
5:45 - 11:59 pm
8 am - 5:30 pm
9:30 - 10:40 am
5:45 - 11:59 pm
Demonstrations, Andalucia II; Poster Session, Floor One; Registration Desk Open, Floor Zero
5:45 - 11:59 pm
Oral Session 6: Posner Lecture: From Kernels to Causal Inference, Bernhard Schölkopf; High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions, A. Anandkumar, V. Tan, A. Willsky
THURSDAY DEC 15TH
8 am - 12:00 pm
Registration Desk Open, Floor Zero; Oral Session 11: Invited Talk: The Neuronal Replicator Hypothesis: Novel Mechanisms for Information Transfer and Search in the Brain, Eörs Szathmáry, Chrisantha Fernando; Continuous-Time Regression Models for Longitudinal Networks, D. Vu, A. Asuncion, D. Hunter, P. Smyth
Spotlights Session 5; Oral Session 7: Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials, P. Krähenbühl, V. Koltun
9:30 - 10:40 am
10:40 - 11:20 am
Oral Session 12: Scalable Training of Mixture Models via Coresets, D. Feldman, M. Faulkner, A. Krause; Fast and Accurate k-means for Large Datasets, M. Shindler, A. Wong, A. Meyerson
Spotlights Session 6; Oral Session 9: k-NN Regression Adapts to Local Intrinsic Dimension, S. Kpotufe; Spotlights Session 7
1:30 - 2:00 pm
1:10 - 1:50 pm
Oral Session 14: The Manifold Tangent Classifier, S. Rifai, Y. Dauphin, P. Vincent, Y. Bengio, X. Muller; Reconstructing Patterns of Information Diffusion from Incomplete Observations, F. Chierichetti, J. Kleinberg, D. Liben-Nowell
Bus to Sierra Nevada Alhambra Tour Bus Boarding Alhambra Tour Bus Boarding Bus to Sierra Nevada
Monday Tutorials
9:30 am - 11:30 am - Tutorial Session 1
Flexible, Multivariate Point Process Models for Unlocking the Neural Code. Jonathan Pillow. Location: Andalucia II & III
Linear Programming Relaxations for Graphical Models. Amir Globerson, Tommi Jaakkola. Location: Manuel de Falla
12:00 - 2:00 PM - Tutorial Session 2
Lagrangian Relaxation Algorithms for Inference in Natural Language Processing. Alexander Rush, Michael Collins. Location: Andalucia II & III
Modern Bayesian Nonparametrics. Peter Orbanz, Yee Whye Teh. Location: Manuel de Falla
4:00 PM - 6:00 PM - Tutorial Session 3
Graphical Models for the Internet. Amr Ahmed, Alexander Smola. Location: Andalucia II & III
Information Theory in Learning and Control. Naftali Tishby. Location: Manuel de Falla
Abstracts of Tutorials
noise models, functional connectivity, advanced regularization methods, and model-based (Bayesian) techniques for decoding multi-neuron spike trains.
Jonathan Pillow is an assistant professor in Psychology and Neurobiology at the University of Texas at Austin. He graduated from the University of Arizona in 1997 with a degree in mathematics and philosophy, and was a U.S. Fulbright fellow in Morocco in 1998. He received his Ph.D. in neuroscience from NYU in 2005, and was a Royal Society postdoctoral reserach fellow at the Gatsby Computational Neuroscience Unit, UCL from 2005 to 2008. His recent work involves statistical methods for understanding the neural code in single neurons and neural populations, and his lab conducts psychophysical experiments designed to test Bayesian models of human sensory perception.
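For readers new to this model class, the conditional intensity of a generic Poisson GLM (generalized linear model) for multi-neuron spike trains can be written as follows; the notation (stimulus filter k_i, coupling filters h_ij, baseline b_i, bin width Delta) is standard textbook notation and is an assumption here, not taken from the tutorial itself.

\lambda_i(t) = \exp\Big( \mathbf{k}_i \cdot \mathbf{x}(t) + \sum_j \mathbf{h}_{ij} \cdot \mathbf{y}_j^{\mathrm{hist}}(t) + b_i \Big), \qquad
\log p(\mathbf{y} \mid \mathbf{x}) = \sum_{i,t} y_i(t) \log \lambda_i(t) \;-\; \Delta \sum_{i,t} \lambda_i(t) + \mathrm{const}.

Here x(t) is the stimulus and y_j^hist(t) the recent spike history of neuron j; the log-likelihood is concave in the filters, which is what makes maximum-likelihood fitting and model-based (Bayesian) decoding of population spike trains tractable.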
Tutorial Session 2, 12:00 - 2:00 pm
Tutorial: Lagrangian Relaxation Algorithms for Inference in Natural Language Processing
Alexander Rush, Michael Collins There has been a long history in combinatorial optimization of methods that exploit structure in complex problems, using methods such as dual decomposition or Lagrangian relaxation. These methods leverage the observation that complex inference problems can often be decomposed into efficiently solvable subproblems. Recent work within the machine learning community has explored algorithms for MAP inference in graphical models using these methods, as an alternative for example to max-product loopy belief propagation. In this tutorial, we give an overview of Lagrangian relaxation for inference problems in natural language processing. The goals of the tutorial will be two-fold: 1) to give an introduction to key inference problems in NLP: for example problems arising in machine translation, sequence modeling, parsing, and information extraction. 2) to give a formal and practical overview of Lagrangian relaxation as a method for deriving inference algorithms for NLP problems. In general, the algorithms we describe combine combinatorial optimization methods (for example dynamic programming, exact belief propagation, minimum spanning tree, all-pairs shortest path) with subgradient methods from the optimization community. Formal guarantees for the algorithms come from a close relationship to linear programming relaxations. For many of the problems that we consider, the resulting algorithms produce exact solutions, with certificates of optimality, on the vast majority of examples. In practice the algorithms are efficient for problems that are either NP-hard (as is the case for non-projective parsing, or for phrase-based translation), or for problems that are solvable in polynomial time using dynamic programming, but where the traditional exact algorithms are far too expensive to be practical.
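As a concrete illustration of the subgradient scheme sketched above (an illustrative sketch, not code from the tutorial), the loop below coordinates two hypothetical subproblem solvers, solve_f and solve_g, that must agree on a vector of 0/1 variables; each solver is assumed to be an efficient combinatorial routine such as a dynamic program.

import numpy as np

def dual_decomposition(solve_f, solve_g, n_vars, n_iters=100, step=1.0):
    # Lagrangian relaxation of: min f(y) + g(z) subject to y = z.
    # solve_f(u) returns argmin_y f(y) + u . y  (a 0/1 vector)
    # solve_g(u) returns argmin_z g(z) - u . z  (a 0/1 vector)
    u = np.zeros(n_vars)                      # Lagrange multipliers
    y = z = np.zeros(n_vars)
    for k in range(1, n_iters + 1):
        y = solve_f(u)                        # decode subproblem 1
        z = solve_g(u)                        # decode subproblem 2
        if np.array_equal(y, z):              # agreement gives a certificate of optimality
            return y, u
        u = u + (step / k) * (y - z)          # subgradient step on the dual
    return y, u                               # no agreement: fall back to a heuristic

When the two decoders agree, the shared solution is provably optimal for the joint problem, which is the source of the certificates of optimality mentioned above.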
Alexander Rush is a Ph.D. candidate in computer science at MIT. His research explores novel decoding methods for problems in natural language processing with applications to parsing and statistical machine translation. His paper "Dual Decomposition for Parsing with Non-Projective Head Automata" received the best paper award at EMNLP 2010. Michael Collins is the Vikram S. Pandit Professor of computer science at Columbia University. His research is focused on topics including statistical parsing, structured prediction problems in machine learning, and applications including machine translation, dialog systems, and speech recognition. His awards include a Sloan fellowship, an NSF career award, and best paper awards at EMNLP (2002, 2004, and 2010), UAI (2004 and 2005), and CoNLL 2008.
Tutorial Session 3, 4:00 - 6:00 pm
Tutorial: Graphical Models for the Internet
Amr Ahmed, Alexander Smola In this tutorial we give an overview of applications and scalable inference in graphical models for the Internet. Structured data analysis has become a key enabling technique to process significant amounts of data, ranging from entity extraction on webpages to sentiment and topic analysis for news articles and comments. Our tutorial covers large-scale sampling and optimization methods for nonparametric Bayesian models such as Latent Dirichlet Allocation, both from a statistics and a systems perspective. Subsequently we give an overview of a range of generative models to elicit sentiment, ideology, time dependence, hierarchical structure, and multilingual similarity from data. We conclude with an overview of recent advances in (semi)supervised information extraction methods based on conditional random fields and related undirected graphical models.
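For orientation, the single-machine collapsed Gibbs sampler for LDA that the scalable methods covered in the tutorial parallelize looks roughly as follows; this is a minimal sketch, and the hyperparameters alpha and beta and the corpus encoding are illustrative assumptions.

import numpy as np

def lda_collapsed_gibbs(docs, V, K, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    # docs: list of documents, each a list of word ids in [0, V)
    rng = np.random.default_rng(seed)
    z = [rng.integers(K, size=len(d)) for d in docs]   # topic assignment per token
    ndk = np.zeros((len(docs), K))                     # document-topic counts
    nkw = np.zeros((K, V))                             # topic-word counts
    nk = np.zeros(K)                                   # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1
            nkw[k, w] += 1
            nk[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                            # remove the current assignment
                ndk[d, k] -= 1
                nkw[k, w] -= 1
                nk[k] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())       # resample the token's topic
                z[d][i] = k
                ndk[d, k] += 1
                nkw[k, w] += 1
                nk[k] += 1
    return ndk, nkw                                    # doc-topic and topic-word counts

The per-token conditional only depends on count tables, which is exactly what the distributed samplers discussed in the tutorial shard and synchronize across machines.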
Amr Ahmed is a Research Scientist at Yahoo! Research. He got his M.Sc and PhD from the School of Computer Science at Carnegie Mellon University in 2009 and 2011 respectively. He is interested in graphical models and Bayesian non-parametric statistics with an eye towards building efficient inference algorithms for such models that scale to the size of the data on the internet. On the application side, he is interested in information retrieval over structured sources, social media ( blogs, news stream, twitter), user modeling and personalization. Alex Smola is Principal Research Scientist at Yahoo! Research and adjunct Professor at the Australian National University. Prior to joining Yahoo! in September 2008 he was group leader of the machine learning program at NICTA, a federally funded research center in Canberra, Australia. His role involved leading a team of up to 35 researchers, programmers, PhD students, visitors, and interns. Prior to that he held positions at the Australian National University, GMD FIRST in Berlin and AT&T Research. He has written and edited 4 books, published over 160 papers, and organized several Summer Schools.
Monday Conference
Monday, December 12th
6:30 - 6:40 PM - Opening Remarks, Awards & Reception (Tapas Reception)
M6 Analysis and Improvement of Policy Gradient Estimation, T. Zhao, H. Hachiya, G. Niu, M. Sugiyama
M7 Efficient Offline Communication Policies for factored multiagent PomdPs, J. Messias, M. Spaan, P. Lima M8 speedy Q-learning, M. Gheshlaghi Azar, R. Munos, M. Ghavamzadeh, H. Kappen M9 optimal reinforcement learning for gaussian systems, P. Hennig M10 Clustering via dirichlet Process mixture models for Portable skill discovery, S. Niekum, A. Barto M11 nonlinear inverse reinforcement learning with gaussian Processes, S. Levine, Z. Popovic, V. Koltun M12 from stochastic nonlinear integrate-and-fire to generalized linear models, S. Mensi, R. Naud, W. Gerstner M13 a brain-machine interface operating with a realtime spiking neural network Control algorithm, J. Dethier, P. Nuyujukian, C. Eliasmith, T. Stewart, S. Elasaad, K. Shenoy, K. Boahen M14 energetically optimal action Potentials, M. Stemmler, B. Sengupta, S. Laughlin, J. Niven M15 active dendrites: adaptation to spike-based communication, B. Ujfalussy, M. Lengyel M16 inferring spike-timing-dependent plasticity from spike train data, I. Stevenson, K. Koerding M17 dynamical segmentation of single trials from population neural data, B. Petreska, B. Yu, J. Cunningham, G. Santhanam, S. Ryu, K. Shenoy, M. Sahani M18 emergence of multiplication in a biophysical model of a Wide-field Visual neuron for Computing object approaches: dynamics, Peaks, & fits, M. Keil M19 Why the brain separates face recognition from object recognition, J. Leibo, J. Mutch, T. Poggio M20 estimating time-varying input signals and ion channel states from a single voltage trace of a neuron, R. Kobayashi, Y. Tsubo, P. Lansky, S. Shinomoto M21 How biased are maximum entropy models?, J. Macke, I. Murray, P. Latham M22 Joint 3d estimation of objects and scene layout, A. Geiger, C. Wojek, R. Urtasun M23 Pylon model for semantic segmentation, V. Lempitsky, A. Vedaldi, A. Zisserman M24 Higher-order Correlation Clustering for image segmentation, S. Kim, S. Nowozin, P. Kohli, C. Yoo
SPOTLIGHT SESSION
Session 1 - 6:40 - 7:00 PM
Session Chair: Rob Fergus
Structural Equations and Divisive Normalization for Energy-Dependent Component Analysis. Jun-ichiro Hirayama, Aapo Hyvärinen, Kyoto University. Subject area: ICA, PCA, CCA & Other Linear Models. See abstract, page 29 (M57)
Uniqueness of Belief Propagation on Signed Graphs. Yusuke Watanabe, The Institute of Statistical Mathematics. Subject area: Approximate Inference. See abstract, page 35 (M87)
Probabilistic Amplitude and Frequency Demodulation. Richard Turner, Maneesh Sahani, Gatsby Unit, UCL. Subject area: Probabilistic Models and Methods. See abstract, page 34 (M79)
On the Accuracy of l1-Filtering of Signals with Block-Sparse Structure. A. Iouditski, UJF; F. Kilinc Karzan, Carnegie Mellon University; A. Nemirovski, Georgia Institute of Technology; B. Polyak, Institute for Control Sciences, RAS Moscow. Subject area: Statistical Learning Theory. See abstract, page 31 (M66)
Active Dendrites: Adaptation to Spike-Based Communication. Balazs Ujfalussy, Mate Lengyel, University of Cambridge. Subject area: Computational Neural Models. See abstract, page 19 (M15)
POSTER SESSION AND RECEPTION - 7:00 - 11:59 PM
M1 A Reinterpretation of the Policy Oscillation Phenomenon in Approximate Policy Iteration, P. Wagner
M2 MAP Inference for Bayesian Inverse Reinforcement Learning, J. Choi, K. Kim
M3 Monte Carlo Value Iteration with Macro-Actions, Z. Lim, D. Hsu, L. Sun
M4 Reinforcement Learning using Kernel-Based Stochastic Factorization, A. Barreto, D. Precup, J. Pineau
M5 Policy Gradient Coagent Networks, P. Thomas
M25 unsupervised learning models of primary cortical receptive fields and receptive field plasticity, A. Saxe, M. Bhand, R. Mudur, B. Suresh, A. Ng M26 transfer learning by borrowing examples, J. Lim, R. Salakhutdinov, A. Torralba M27 large-scale Category structure aware image Categorization, B. Zhao, F. Li, E. Xing M44 Hierarchical multitask structured output learning for large-scale sequence segmentation, N. Goernitz, C. Widmer, G. Zeller, A. Kahles, S. Sonnenburg, G. Raetsch M45 submodular multi-label learning, J. Petterson, T. Caetano M46 algorithms for Hyper-Parameter optimization, J. Bergstra, R. Bardenet, Y. Bengio, B. Kgl M47 non-parametric group orthogonal matching Pursuit for sparse learning with multiple Kernels, V. Sindhwani, A. Lozano M48 manifold Precis: an annealing technique for diverse sampling of manifolds, N. Shroff, P. Turaga, R. Chellappa M49 group anomaly detection using flexible genre models, L. Xiong, B. Poczos, J. Schneider M50 Matrix Completion for Image Classification, R. Cabral, F. De la Torre, J. Costeira, A. Bernardino M51 selecting receptive fields in deep networks, A. Coates, A. Ng M52 Co-regularized multi-view spectral Clustering, A. Kumar, P. Rai, H. Daume III M53 learning with the weighted trace-norm under arbitrary sampling distributions, R. Foygel, R. Salakhutdinov, O. Shamir, N. Srebro M54 active learning with a drifting distribution, L. Yang M55 bayesian Partitioning of large-scale distance data, D. Adametz, V. Roth M56 beyond spectral Clustering - tight relaxations of balanced graph Cuts, M. Hein, S. Setzer M57 structural equations and divisive normalization for energy-dependent component analysis, J. Hirayama, A. Hyvarinen M58 generalised Coupled tensor factorisation, K. Ylmaz, A. Cemgil, U. Simsekli M59 similarity-based learning via data driven embeddings, P. Kar, P. Jain M60 metric learning with multiple Kernels, J. Wang, H. Do, A. Woznica, A. Kalousis M61 regularized laplacian estimation and fast eigenvector approximation, P. Perry, M. Mahoney M62 Hierarchical topic modeling for analysis of timeevolving Personal Choices, X. Zhang, D. Dunson, L. Carin M63 Efficient Learning of Generalized Linear and Single index models with isotonic regression, S. Kakade, A. Kalai, V. Kanade, O. Shamir
M28 Hierarchical Matching Pursuit for Recognition: Architecture and Fast Algorithms, L. Bo, X. Ren, D. Fox
M29 Portmanteau Vocabularies for Multi-Cue Image Representation, F. Khan, J. van de Weijer, A. Bagdanov, M. Vanrell
M30 PiCoDes: Learning a Compact Code for Novel-Category Recognition, A. Bergamo, L. Torresani, A. Fitzgibbon
M31 Orthogonal Matching Pursuit with Replacement, P. Jain, A. Tewari, I. Dhillon
M32 SpaRCS: Recovering Low-Rank and Sparse Matrices from Compressive Measurements, A. Waters, A. Sankaranarayanan, R. Baraniuk
M33 Signal Estimation under Random Time-Warpings and Nonlinear Signal Alignment, S. Kurtek, A. Srivastava, W. Wu
M34 Inverting Grice's Maxims to Learn Rules from Natural Language Extractions, M. Sorower, T. Dietterich, J. Doppa, W. Orr, P. Tadepalli, X. Fern
M35 Multi-View Learning of Word Embeddings via CCA, P. Dhillon, D. Foster, L. Ungar
M36 Active Ranking using Pairwise Comparisons, K. Jamieson, R. Nowak
M37 Co-training for Domain Adaptation, M. Chen, K. Weinberger, J. Blitzer
M38 Efficient Anomaly Detection using Bipartite k-NN Graphs, K. Sricharan, A. Hero
M39 A Maximum Margin Multi-Instance Learning Framework for Image Categorization, H. Wang, H. Huang, F. Kamangar, F. Nie, C. Ding
M40 Advice Refinement in Knowledge-Based SVMs, G. Kunapuli, R. Maclin, J. Shavlik
M41 Multiclass Boosting: Theory and Algorithms, M. Saberian, N. Vasconcelos
M42 Boosting with Maximum Adaptive Sampling, C. Dubout, F. Fleuret
M43 Kernel Embeddings of Latent Tree Graphical Models, L. Song, A. Parikh, E. Xing
M64 M65 unifying non-maximum likelihood learning objectives with minimum Kl Contraction, S. Lyu statistical Performance of Convex tensor decomposition, R. Tomioka, T. Suzuki, K. Hayashi, H. Kashima On the accuracy of l1-filtering of signals with blocksparse structure, A. Iouditski, F. Kilinc Karzan, A. Nemirovski, B. Polyak Committing bandits, L. Bui, R. Johari, S. Mannor Newtron: an Efficient Bandit algorithm for Online multiclass Prediction, E. Hazan, S. Kale learning eigenvectors for free, W. Koolen, W. Kotlowski, M. Warmuth online learning: stochastic, Constrained, and smoothed adversaries, A. Rakhlin, K. Sridharan, A. Tewari optimistic optimization of deterministic functions, R. Munos the impact of unlabeled Patterns in rademacher Complexity Theory for Kernel Classifiers, L. Oneto, D. Anguita, A. Ghio, S. Ridella unifying framework for fast learning rate of nonsparse multiple Kernel learning, T. Suzuki nearest neighbor based greedy Coordinate descent, I. Dhillon, P. Ravikumar, A. Tewari Agnostic Selective Classification, Y. Wiener, R. ElYaniv greedy model averaging, D. Dai, T. Zhang Confidence Sets for Network Structure, D. Choi, P. Wolfe, E. Airoldi on tracking the Partition function, G. Desjardins, A. Courville, Y. Bengio Probabilistic amplitude and frequency demodulation, R. Turner, M. Sahani structure learning for optimization, . Yang, A. Rahimi spike and slab Variational inference for multi-task and multiple Kernel learning, M. Titsias, M. LzaroGredilla thinning measurement models and Questionnaire design, R. Silva an application of tree-structured expectation Propagation for Channel decoding, P. Olmos, L. Salamanca, J. Murillo Fuentes, F. Perez-Cruz M84 global solution of fully-observed Variational bayesian matrix factorization is Column-Wise independent, S. Nakajima, M. Sugiyama, S. Babacan M85 Quasi-newton methods for markov Chain, Monte Carlo Y. Zhang, C. Sutton M86 nonstandard interpretations of Probabilistic Programs for Efficient Inference, D. Wingate, N. Goodman, A. Stuhlmueller, J. Siskind M87 uniqueness of belief Propagation on signed graphs, Y. Watanabe M88 non-conjugate Variational message Passing for multinomial and binary regression, D. Knowles, T. Minka M89 ranking annotators for crowdsourced labeling tasks, V. Raykar, S. Yu M90 gaussian process modulated renewal processes, V. Rao, Y. Teh M91 Infinite Latent SVM for Classification and Multi-task learning, J. Zhu, N. Chen, E. Xing M92 Spatial distance dependent Chinese restaurant Process for image segmentation, S. Ghosh, A. Ungureanu, E. Sudderth, D. Blei M93 analytical results for the error in filtering of gaussian Processes, A. Susemihl, R. Meir, M. Opper M94 Robust Multi-Class Gaussian Process Classification, D. Hernndez-lobato, J. Hernndez-Lobato, P. Dupont M95 additive gaussian Processes, D. Duvenaud, H. Nickisch, C. Rasmussen M96 a global structural em algorithm for a model of Cancer Progression, A. Tofigh, E. Sjlund, M. Hglund, J. Lagergren M97 Collective graphical models, D. Sheldon, T. Dietterich M98 simultaneous sampling and multi-structure fitting with adaptive reversible Jump mCmC, T. Pham, T. Chin, J. Yu, D. Suter M99 Causal discovery with Cyclic additive noise models, J. Mooij, D. Janzing, T. Heskes, B. Schlkopf M100 a model for temporal dependencies in event streams, A. Gunawardana, C. Meek, P. Xu M101 facial expression transfer with input-output temporal restricted boltzmann machines, M. Zeiler, G. Taylor, L. Sigal, I. Matthews, R. 
Fergus M102 learning auto-regressive models from sequence and non-sequence data, T. Huang, J. Schneider
[Floor One plan: poster boards M01 - M102 arranged around the hall, speaker podium, internet area, Andalucia 2 and 3, cafeteria, front entrance]
Monday Abstracts
M1 A Reinterpretation of the Policy Oscillation Phenomenon in Approximate Policy Iteration
P. Wagner pwagner@cis.hut.fi
Aalto University School of Science
A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. Subject Area: Control and Reinforcement Learning

M3 Monte Carlo Value Iteration with Macro-Actions
Z. Lim, D. Hsu, L. Sun
... provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. Subject Area: Control and Reinforcement Learning
M2 MAP Inference for Bayesian Inverse Reinforcement Learning
J. Choi, K. Kim
The difficulty in inverse reinforcement learning (IRL) arises in choosing the best reward function since there are typically an infinite number of reward functions that yield the given behaviour data as optimal. Using a Bayesian framework, we address this challenge by using the maximum a posteriori (MAP) estimation for the reward function, and show that most of the previous IRL algorithms can be modeled into our framework. We also present a gradient method for the MAP estimation based on the (sub)differentiability of the posterior distribution. We show the effectiveness of our approach by comparing the performance of the proposed method to those of the previous algorithms. Subject Area: Control and Reinforcement Learning
M4 Reinforcement Learning using Kernel-Based Stochastic Factorization
A. Barreto, D. Precup, J. Pineau
Kernel-based reinforcement learning (KBRL) is a method for learning a decision policy from a set of sample transitions which stands out for its strong theoretical guarantees. However, the size of the approximator grows with the number of transitions, which makes the approach impractical for large problems. In this paper we introduce a novel algorithm to improve the scalability of KBRL. We resort to a special decomposition of a transition matrix, called stochastic factorization, to fix the size of the approximator while at the same time incorporating all the information contained in the data. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster but still converges to a unique solution. We derive a theoretical upper bound for the distance between the value functions computed by KBRL and KBSF. The effectiveness of our method is illustrated with computational experiments on four reinforcement-learning problems, including a difficult task in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures in epileptic rat brains. We empirically demonstrate that the proposed approach is able to compress the information contained in KBRL's model. Also, on the tasks studied, KBSF outperforms two of the most prominent reinforcement-learning algorithms, namely least-squares policy iteration and fitted Q-iteration. Subject Area: Control and Reinforcement Learning
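A schematic illustration of the stochastic-factorization idea referred to in the abstract (this is not the authors' KBSF code, and it shows only single-action policy evaluation): if the n x n transition matrix is approximated as D K with row-stochastic D (n x m) and K (m x n), planning can be carried out in an m-dimensional model and then lifted back.

import numpy as np

def reduced_policy_evaluation(D, K, r, gamma=0.95, n_iters=500):
    # D: (n, m) row-stochastic, K: (m, n) row-stochastic, r: (n,) rewards.
    # The approximated chain P_hat = D @ K induces a reduced chain K @ D of size m.
    P_bar = K @ D                                  # (m, m) reduced transition matrix
    r_bar = K @ r                                  # (m,) reduced rewards
    v_bar = np.zeros(K.shape[0])
    for _ in range(n_iters):
        v_bar = r_bar + gamma * P_bar @ v_bar      # fixed-point iteration in R^m
    return r + gamma * D @ v_bar                   # value function on the n original states

The point is that the expensive n-dimensional fixed point is replaced by an m-dimensional one whose size does not grow with the number of sample transitions.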
M5 Policy Gradient Coagent Networks
P. Thomas
M6 Analysis and Improvement of Policy Gradient Estimation
T. Zhao tingting@sg.cs.titech.ac.jp
H. Hachiya hachiya@sg.cs.titech.ac.jp
G. Niu gang@sg.cs.titech.ac.jp
M. Sugiyama sugi@cs.titech.ac.jp
Tokyo Institute of Technology
Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is more preferable than REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments. Subject Area: Control and Reinforcement Learning
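For reference, the role of a baseline in a parameter-based (likelihood-ratio) gradient estimator can be summarized as follows; this is the standard textbook form, not a transcription of the paper's derivation:

\nabla_\rho J \approx \frac{1}{N} \sum_{n=1}^{N} (R_n - b)\, \nabla_\rho \log p(\theta_n \mid \rho),
\qquad
b^{*} = \frac{\mathbb{E}\big[ R\, \|\nabla_\rho \log p(\theta \mid \rho)\|^{2} \big]}{\mathbb{E}\big[ \|\nabla_\rho \log p(\theta \mid \rho)\|^{2} \big]},

where p(theta | rho) is the distribution over policy parameters, R_n the return of the n-th rollout, and b* the variance-minimizing scalar baseline; the paper derives and analyzes the analogous optimal baseline for PGPE and compares it with REINFORCE.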
M9 Optimal Reinforcement Learning for Gaussian Systems
P. Hennig
M7 Efficient Offline Communication Policies for Factored Multiagent POMDPs
J. Messias, M. Spaan, P. Lima
M10 Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery
S. Niekum sniekum@cs.umass.edu A. Barto barto@cs.umass.edu University of Massachusetts Skill discovery algorithms in reinforcement learning typically identify single states or regions in state space that correspond to task-specific subgoals. However, such methods do not directly address the question of how many distinct skills are appropriate for solving the tasks that the agent faces. This can be highly inefficient when many identified subgoals correspond to the same underlying skill, but are all used individually as skill goals. Furthermore, skills created in this manner are often only transferable to tasks that share identical state spaces, since corresponding subgoals across tasks are not merged into a single skill goal. We show that these problems can be overcome by clustering subgoal data defined in an agent-space and using the resulting clusters as templates for skill termination conditions. Clustering via a Dirichlet process mixture model is used to discover a minimal, sufficient collection of portable skills. Subject Area: Control and Reinforcement Learning
M8 Speedy Q-Learning
M. Gheshlaghi Azar m.azar@science.ru.nl
H. Kappen b.kappen@science.ru.nl
Radboud University of Nijmegen
R. Munos remi.munos@inria.fr
M. Ghavamzadeh mohammad.ghavamzadeh@inria.fr
INRIA Lille - Nord Europe
We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and discount factor γ, only T = O(log(n) / (ε^2 (1-γ)^4)) steps are required for the SQL algorithm to converge to an ε-optimal action-value function with high probability. This bound has a better dependency on 1/ε and 1/(1-γ), and thus is tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both model-free and model-based instances of batch Q-value iteration that are considered to be more efficient than incremental methods like Q-learning. Subject Area: Control and Reinforcement Learning
M11 Nonlinear Inverse Reinforcement Learning with Gaussian Processes
S. Levine svlevine@cs.stanford.edu V. Koltun vladlen@stanford.edu Stanford University Z. Popovic zoran@washington.edu University of Washington We present a probabilistic algorithm for nonlinear inverse reinforcement learning. The goal of inverse reinforcement learning is to learn the reward function in a Markov decision process from expert demonstrations. While most prior inverse reinforcement learning algorithms represent the reward as a linear combination of a set of features, we use Gaussian processes to learn the reward as a nonlinear function, while also determining the relevance of each feature to the experts policy. Our probabilistic algorithm allows complex behaviors to be captured from suboptimal stochastic demonstrations, while automatically balancing the simplicity of the learned reward structure against its consistency with the observed actions. Subject Area: Control and Reinforcement Learning
M13 A Brain-Machine Interface Operating with a Real-Time Spiking Neural Network Control Algorithm
J. Dethier P. Nuyujukian S. Elasaad K. Shenoy K. Boahen Stanford University C. Eliasmith T. Stewart University of Waterloo jdethier@stanford.edu nips.npl.stanford@herag.com shauki@stanford.edu shenoy@stanford.edu boahen@stanford.edu celiasmith@uwaterloo.ca tcstewar@uwaterloo.ca
Motor prostheses aim to restore function to disabled patients. Despite compelling proof of concept systems, barriers to clinical translation remain. One challenge is to develop a low-power, fully-implantable system that dissipates only minimal power so as not to damage tissue. To this end, we implemented a Kalman-filter based decoder via a spiking neural network (SNN) and tested it in brain-machine interface (BMI) experiments with a rhesus monkey. The Kalman filter was trained to predict the arm's velocity and mapped onto the SNN using the Neural Engineering Framework (NEF). A 2,000-neuron embedded Matlab SNN implementation runs in real-time and its closed-loop performance is quite comparable to that of the standard Kalman filter. The success of this closed-loop decoder holds promise for hardware SNN implementations of statistical signal processing algorithms on neuromorphic chips, which may offer power savings necessary to overcome a major obstacle to the successful clinical translation of neural motor prostheses. Subject Area: Neuroscience\Brain-computer Interfaces
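For context, one predict/update step of a generic Kalman-filter decoder of the kind the SNN emulates is sketched below; the fitted matrices A, W, C, Q (state dynamics, process noise, observation model, observation noise) are placeholders, not values from the study.

import numpy as np

def kalman_decode_step(x, P, y, A, W, C, Q):
    # x: decoded state (e.g., cursor/arm velocity), P: state covariance,
    # y: vector of binned spike counts observed in the current time step.
    x_pred = A @ x                                   # predict under x_t = A x_{t-1} + w
    P_pred = A @ P @ A.T + W
    S = C @ P_pred @ C.T + Q                         # innovation covariance for y_t = C x_t + q
    K = P_pred @ C.T @ np.linalg.inv(S)              # Kalman gain
    x_new = x_pred + K @ (y - C @ x_pred)            # correct the prediction with the data
    P_new = (np.eye(len(x)) - K @ C) @ P_pred
    return x_new, P_new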
M14 Energetically Optimal Action Potentials
M. Stemmler, B. Sengupta, S. Laughlin, J. Niven
Most action potentials in the nervous system take on the form of strong, rapid, and brief voltage deflections known as spikes, in stark contrast to other action potentials, such as in the heart, that are characterized by broad voltage plateaus. We derive the shape of the neuronal action potential from first principles, by postulating that action potential generation is strongly constrained by the brain's need to minimize energy expenditure. For a given height of an action potential, the least energy is consumed when the underlying currents obey the bang-bang principle: the currents giving rise to the spike should be intense, yet short-lived, yielding spikes with sharp onsets and offsets. Energy optimality predicts features in the biophysics that are not per se required for producing the characteristic neuronal action potential: sodium currents should be extraordinarily powerful and inactivate with voltage; both potassium and sodium currents should have kinetics that have a bell-shaped voltage-dependence; and the cooperative action of multiple `gates' should start the flow of current. Subject Area: Neuroscience\Computational Neural Models
M15 Active Dendrites: Adaptation to Spike-Based Communication
B. Ujfalussy M. Lengyel University of Cambridge bbu20@cam.ac.uk m.lengyel@eng.cam.ac.uk
Computational analyses of dendritic computations often assume stationary inputs to neurons, ignoring the pulsatile nature of spike-based communication between neurons and the moment-to-moment fluctuations caused by such spiking inputs. Conversely, circuit computations with spiking neurons are usually formalized without regard to the rich nonlinear nature of dendritic processing. Here we address the computational challenge faced by neurons that compute and represent analogue quantities but communicate with digital spikes, and show that reliable computation of even purely linear functions of inputs can require the interplay of strongly nonlinear subunits within the postsynaptic dendritic tree. Our theory predicts a matching of dendritic nonlinearities and synaptic weight distributions to the joint statistics of presynaptic inputs. This approach suggests normative roles for some puzzling forms of nonlinear dendritic dynamics and plasticity. Subject Area: Neuroscience\Computational Neural Models
M18 Emergence of Multiplication in a Biophysical Model of a Wide-Field Visual Neuron for Computing Object Approaches: Dynamics, Peaks, & Fits
M. Keil University of Barcelona matskeil@ub.edu
Many species show avoidance reactions in response to looming object approaches. In locusts, the corresponding escape behavior correlates with the activity of the lobula giant movement detector (LGMD) neuron. During an object approach, its firing rate was reported to gradually increase until a peak is reached, and then it declines quickly. The η-function predicts that the LGMD activity is a product between an exponential function of angular size and angular velocity, and that peak activity is reached before time-to-contact (ttc). The η-function has become the prevailing LGMD model because it reproduces many experimental observations, and even experimental evidence for the multiplicative operation was reported. Several inconsistencies remain unresolved, though. Here we address these issues with a new model, which explicitly connects angular size and angular velocity to biophysical quantities. The new model avoids biophysical problems associated with implementing the exponential function, implements the multiplicative operation via divisive inhibition, and explains why activity peaks could occur after ttc. It consistently predicts response
features of the LGMD, and provides excellent fits to published experimental data, with goodness-of-fit measures comparable to corresponding fits with the η-function. Subject Area: Neuroscience
M19 Why the Brain Separates Face Recognition from Object Recognition
J. Leibo jzleibo@mit.edu J. Mutch jmutch@mit.edu T. Poggio tp@csail.mit.edu Massachusetts Institute of Technology Many studies have uncovered evidence that visual cortex contains specialized regions involved in processing faces but not other object classes. Recent electrophysiology studies of cells in several of these specialized regions revealed that at least some of these regions are organized in a hierarchical manner with viewpoint-specific cells projecting to downstream viewpoint-invariant identityspecific cells (Freiwald and Tsao 2010). A separate computational line of reasoning leads to the claim that some transformations of visual inputs that preserve viewed object identity are class-specific. In particular, the 2D images evoked by a face undergoing a 3D rotation are not produced by the same image transformation (2D) that would produce the images evoked by an object of another class undergoing the same 3D rotation. However, within the class of faces, knowledge of the image transformation evoked by 3D rotation can be reliably transferred from previously viewed faces to help identify a novel face at a new viewpoint. We show, through computational simulations, that an architecture which applies this method of gaining invariance to class-specific transformations is effective when restricted to faces and fails spectacularly when applied across object classes. We argue here that in order to accomplish viewpoint-invariant face identification from a single example view, visual cortex must separate the circuitry involved in discounting 3D rotations of faces from the generic circuitry involved in processing other objects. The resulting model of the ventral stream of visual cortex is consistent with the recent physiology results showing the hierarchical organization of the face processing network. Subject Area: Neuroscience\Computational Neural Models
M20 Estimating Time-Varying Input Signals and Ion Channel States from a Single Voltage Trace of a Neuron
R. Kobayashi Ritsumeikan University Y. Tsubo RIKEN P. Lansky Academy of Sciences S. Shinomoto Kyoto University kobayashi@cns.ci.ritsumei.ac.jp yasuhirotsubo@riken.jp lansky@biomed.cas.cz shinomoto@scphys.kyoto-u.ac.jp
State-of-the-art statistical methods in neuroscience have enabled us to fit mathematical models to experimental data and subsequently to infer the dynamics of hidden parameters underlying the observable phenomena. Here, we develop a Bayesian method for inferring the time-varying mean and variance of the synaptic input, along with the dynamics of each ion channel from a single voltage trace of a neuron. An estimation problem may be formulated on the basis of the state-space model with prior distributions that penalize large fluctuations in these parameters. After optimizing the hyperparameters by maximizing the marginal likelihood, the state-space model provides the time-varying parameters of the input signals and the ion channel states. The proposed method is tested not only on the simulated data from the Hodgkin-Huxley type models but also on experimental data obtained from a cortical slice in vitro. Subject Area: Neuroscience\Neural Coding
M22 Joint 3D Estimation of Objects and Scene Layout
A. Geiger, C. Wojek, R. Urtasun
We propose a novel generative model that is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene. In particular, we infer the scene topology, geometry as well as traffic activities from a short video sequence acquired with a single camera mounted on a moving car. Our generative model takes advantage of dynamic information in the form of vehicle tracklets as well as static information coming from semantic labels and geometry (i.e., vanishing points). Experiments show that our approach outperforms a discriminative baseline based on multiple kernel learning (MKL) which has access to the same image information. Furthermore, as we reason about objects in 3D, we are able to significantly increase the performance of state-of-the-art object detectors in their ability to estimate object orientation. Subject Area: Vision
M23 Pylon Model for Semantic Segmentation
V. Lempitsky victorlempitsky@gmail.com
Yandex
A. Vedaldi vedaldi@robots.ox.ac.uk
A. Zisserman az@robots.ox.ac.uk
University of Oxford
Graph cut optimization is one of the standard workhorses of image segmentation since for binary random field representations of the image, it gives globally optimal results and there are efficient polynomial time implementations. Often, the random field is applied over a flat partitioning of the image into non-intersecting elements, such as pixels or super-pixels. In the paper we show that if, instead of a flat partitioning, the image is represented by a hierarchical segmentation tree, then the resulting energy combining unary and boundary terms can still be optimized using graph cut (with all the corresponding benefits of global optimality and efficiency). As a result of such inference, the image gets partitioned into a set of segments that may come from different layers of the tree. We apply this formulation, which we call the pylon model, to the task of semantic segmentation where the goal is to separate an image into areas belonging to different semantic classes. The experiments highlight the advantage of inference on a segmentation tree (over a flat partitioning) and demonstrate that the optimization in the pylon model is able to flexibly choose the level of segmentation across the image. Overall, the proposed system has superior segmentation accuracy on several datasets (Graz-02, Stanford background) compared to previously suggested approaches. Subject Area: Vision\Image Segmentation
M25 Unsupervised Learning Models of Primary Cortical Receptive Fields and Receptive Field Plasticity
A. Saxe M. Bhand R. Mudur B. Suresh A. Ng Stanford University asaxe@stanford.edu mbhand@cs.stanford.edu rmudur@stanford.edu bipins@cs.stanford.edu ang@cs.stanford.edu
The efficient coding hypothesis holds that neural receptive fields are adapted to the statistics of the environment, but is agnostic to the timescale of this adaptation, which occurs on both evolutionary and developmental timescales. In this work we focus on that component of adaptation which occurs during an organisms lifetime, and show that a number of unsupervised feature learning algorithms can account for features of normal receptive field properties across multiple primary sensory cortices. Furthermore, we show that the same algorithms account for altered receptive field properties in response to experimentally altered environmental statistics. Based on these modeling results we propose these models as phenomenological models of receptive field plasticity during an organisms lifetime. Finally, due to the success of the same models in multiple sensory areas, we suggest that these algorithms may provide a constructive realization of the theory, first proposed by Mountcastle (1978), that a qualitatively similar learning algorithm acts throughout primary sensory cortices. Subject Area: Vision\Natural Scene Statistics
M24 Higher-Order Correlation Clustering for Image Segmentation
S. Kim, S. Nowozin, P. Kohli, C. Yoo
For many of the state-of-the-art computer vision algorithms, image segmentation is an important preprocessing step. As such, several image segmentation algorithms have been proposed, however, with certain reservation due to high computational load and many hand-tuning parameters. Correlation clustering, a graph-partitioning algorithm often used in natural language processing and document clustering, has the potential to perform better than previously proposed image segmentation algorithms. We improve the basic correlation clustering formulation by taking into account higher-order cluster relationships. This improves clustering in the presence of local boundary ambiguities. We first apply the pairwise correlation clustering to image segmentation over a pairwise superpixel graph and then develop higher-order correlation clustering over a hypergraph that considers higher-order relations among superpixels. Fast inference is possible by linear programming relaxation, and an effective parameter learning framework via structured support vector machines is also possible. Experimental results on various datasets show that the proposed higher-order correlation clustering outperforms other state-of-the-art image segmentation algorithms. Subject Area: Vision\Image Segmentation
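For orientation, the standard pairwise correlation-clustering program that the higher-order formulation above generalizes can be written as follows (generic notation, not the paper's):

\min_{x} \sum_{(u,v) \in E} \Big( w^{+}_{uv}\, x_{uv} + w^{-}_{uv}\,(1 - x_{uv}) \Big)
\quad \text{s.t.} \quad x_{uw} \le x_{uv} + x_{vw} \;\; \forall\, u, v, w, \qquad x_{uv} \in \{0, 1\},

where x_{uv} = 1 means superpixels u and v are assigned to different clusters, w+ penalizes cutting similar pairs and w- penalizes merging dissimilar ones; relaxing x_{uv} to [0, 1] yields the linear program used for fast inference, and hyperedge terms add analogous variables over sets of superpixels.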
M27 Large-Scale Category Structure Aware Image Categorization
B. Zhao zhaobinhere@hotmail.com E. Xing epxing@cs.cmu.edu Carnegie Mellon University F. Li feifeili@cs.stanford.edu Stanford University Most previous research on image categorization has focused on medium-scale data sets, while large-scale image categorization with millions of images from thousands of categories remains a challenge. With the emergence of structured large-scale dataset such as the ImageNet, rich information about the conceptual relationships between images, such as a tree hierarchy among various image categories, become available. As human cognition of complex visual world benefits from underlying semantic relationships between object classes, we believe a machine learning system can and should leverage such information as well for better performance. In this paper, we employ such semantic relatedness among image categories for large-scale image categorization. Specifically, a category hierarchy is utilized to properly define loss function and select common set of features for related categories. An efficient optimization method based on proximal approximation and accelerated parallel gradient method is introduced. Experimental results on a subset of ImageNet containing 1.2 million images from 1000 categories demonstrate the effectiveness and promise of our proposed approach. Subject Area: Vision\Object Recognition
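A generic accelerated proximal-gradient loop of the kind referred to above is sketched below; this is an illustrative sketch, and grad_f, prox_g and the Lipschitz constant L are assumptions supplied by the particular loss and hierarchical regularizer.

import numpy as np

def accelerated_proximal_gradient(grad_f, prox_g, x0, L, n_iters=200):
    # Minimizes f(x) + g(x) where f is smooth with gradient grad_f (Lipschitz constant L)
    # and g is handled through its proximal operator prox_g(v, step).
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iters):
        x_new = prox_g(y - grad_f(y) / L, 1.0 / L)       # proximal gradient step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)    # Nesterov momentum extrapolation
        x, t = x_new, t_new
    return x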
M29 Portmanteau Vocabularies for Multi-Cue Image Representation
F. Khan, J. van de Weijer, A. Bagdanov, M. Vanrell
We describe a novel technique for feature combination in the bag-of-words model of image classification. Our approach builds discriminative compound words from primitive cues learned independently from training images. Our main observation is that modeling joint-cue distributions independently is more statistically robust for typical classification problems than attempting to empirically estimate the dependent, joint-cue distribution directly. We use information-theoretic vocabulary compression to find discriminative combinations of cues and the resulting vocabulary of portmanteau words is compact, has the cue binding property, and supports individual weighting of cues in the final image representation. State-of-the-art results on both the Oxford Flower-102 and Caltech-UCSD Bird-200 datasets demonstrate the effectiveness of our technique compared to other, significantly more complex approaches to multi-cue image representation. Subject Area: Vision\Object Recognition
M28 Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
L. Bo lfb@cs.washington.edu D. Fox fox@cs.washington.edu University of Washington X. Ren xiaofeng.ren@intel.com Intel Labs Seattle Extracting good representations from images is essential for many computer vision tasks. In this paper, we propose hierarchical matching pursuit (HMP), which builds a feature hierarchy layer-by-layer using an efficient matching pursuit encoder. It includes three modules: batch (tree) orthogonal matching pursuit, spatial pyramid max pooling, and contrast normalization. We investigate the architecture of HMP, and show that all three components are critical for good performance. To speed up the orthogonal matching pursuit, we propose a batch tree orthogonal matching pursuit that is particularly suitable to encode a large number of observations that share the same large dictionary. HMP is scalable and can efficiently handle full-size images. In addition, HMP enables linear support vector machines (SVMs) to match the performance of nonlinear SVMs while being scalable to large datasets. We compare HMP with many state-of-the-art algorithms including convolutional deep belief networks, SIFT based single layer sparse coding, and kernel based feature learning. HMP consistently yields superior accuracy on three types of visual recognition problems: object recognition (Caltech-101), scene recognition (MIT-Scene), and static event recognition (UIUC-Sports). Subject Area: Vision\Object Recognition
M30 PiCoDes: Learning a Compact Code for Novel-Category Recognition
A. Bergamo, L. Torresani, A. Fitzgibbon
We introduce PiCoDes: a very compact image descriptor which nevertheless allows high performance on object category recognition. In particular, we address novel-category recognition: the task of defining indexing structures and image representations which enable a large collection of images to be searched for an object category that was not known when the index was built. Instead, the training images defining the category are supplied at query time. We explicitly learn descriptors of a given length (from as small as 16 bytes per image) which have good object-recognition performance. In contrast to previous work in the domain of object recognition, we do not choose an arbitrary intermediate representation, but explicitly learn short codes. In contrast to previous approaches to learn compact codes, we optimize explicitly for (an upper bound on) classification performance. Optimization directly for binary features is difficult and nonconvex, but we present an alternation scheme and convex upper bound which demonstrate excellent performance in practice. PiCoDes of 256 bytes match the accuracy of the current best known classifier for the Caltech-256 benchmark, but they decrease the database storage size by a factor of 100 and speed up the training and testing of novel classes by orders of magnitude. Subject Area: Vision\Visual Features
M31 Orthogonal Matching Pursuit with Replacement
P. Jain prajain@microsoft.com A. Tewari ambujtewari@gmail.com I. Dhillon inderjit@cs.utexas.edu University of Texas at Austin In this paper, we consider the problem of compressed sensing where the goal is to recover almost all the sparse vectors using a small number of fixed linear measurements. For this problem, we propose a novel partial hard-thresholding operator leading to a general family of iterative algorithms. While one extreme of the family yields well known hard thresholding algorithms like ITI and HTP[17, 10], the other end of the spectrum leads to a novel algorithm that we call Orthogonal Matching Pursuit with Replacement (OMPR). OMPR, like the classic greedy algorithm OMP, adds exactly one coordinate to the support at each iteration, based on the correlation with the current residual. However, unlike OMP, OMPR also removes one coordinate from the support. This simple change allows us to prove the best known guarantees for OMPR in terms of the Restricted Isometry Property (a condition on the measurement matrix). In contrast, OMP is known to have very weak performance guarantees under RIP. We also extend OMPR using locality sensitive hashing to get OMPRHash, the first provably sub-linear (in dimensionality) algorithm for sparse recovery. Our proof techniques are novel and flexible enough to also permit the tightest known analysis of popular iterative algorithms such as CoSaMP and Subspace Pursuit. We provide experimental results on large problems providing recovery for vectors of size up to million dimensions. We demonstrate that for largescale problems our proposed methods are more robust and faster than the existing methods. Subject Area: Speech and Signal Processing
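A compact sketch of the add-one/replace-one scheme described in the abstract is given below; this is a paraphrase of the description above, not the authors' reference implementation, and the least-squares re-fit per iteration is one simple way to realize the orthogonal projection.

import numpy as np

def ompr(A, y, k, n_iters=50):
    # Recover an (approximately) k-sparse x with y ~ A x.
    n = A.shape[1]
    support = np.argsort(np.abs(A.T @ y))[-k:]               # initial support of size k
    x = np.zeros(n)
    x[support] = np.linalg.lstsq(A[:, support], y, rcond=None)[0]
    for _ in range(n_iters):
        corr = np.abs(A.T @ (y - A @ x))                     # correlation with the residual
        corr[support] = -np.inf                              # only consider new coordinates
        j = int(np.argmax(corr))                             # coordinate to add
        S = np.union1d(support, [j])
        xs = np.linalg.lstsq(A[:, S], y, rcond=None)[0]      # least squares on the enlarged support
        support = S[np.argsort(np.abs(xs))[-k:]]             # hard-threshold back to k coordinates
        x = np.zeros(n)
        x[support] = np.linalg.lstsq(A[:, support], y, rcond=None)[0]
    return x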
M33 Signal Estimation under Random Time-Warpings and Nonlinear Signal Alignment
S. Kurtek A. Srivastava W. Wu Florida State University skurtek@stat.fsu.edu anuj@stat.fsu.edu wwu@stat.fsu.edu
While signal estimation under random amplitudes, phase shifts, and additive noise is studied frequently, the problem of estimating a deterministic signal under random timewarpings has been relatively unexplored. We present a novel framework for estimating the unknown signal that utilizes the action of the warping group to form an equivalence relation between signals. First, we derive an estimator for the equivalence class of the unknown signal using the notion of Karcher mean on the quotient space of equivalence classes. This step requires the use of FisherRao Riemannian metric and a square-root representation of signals to enable computations of distances and means under this metric. Then, we define a notion of the center of a class and show that the center of the estimated class is a consistent estimator of the underlying unknown signal. This estimation algorithm has many applications: (1) registration/alignment of functional data, (2) separation of phase/amplitude components of functional data, (3) joint demodulation and carrier estimation, and (4) sparse modeling of functional data. Here we demonstrate only (1) and (2): Given signals are temporally aligned using nonlinear warpings and, thus, separated into their phase and amplitude components. The proposed method for signal alignment is shown to have state of the art performance using Berkeley growth, handwritten signatures, and neuroscience spike train data. Subject Area: Speech and Signal Processing
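For readers new to this framework, the square-root representation and warping action referred to above are, in the standard notation of elastic functional data analysis:

q(t) = \operatorname{sign}\big(\dot f(t)\big)\, \sqrt{|\dot f(t)|},
\qquad
(q, \gamma) \mapsto (q \circ \gamma)\, \sqrt{\dot\gamma},

under which the Fisher-Rao Riemannian metric reduces to the ordinary L2 metric and time warpings act by isometries, so distances between equivalence classes, Karcher means, and the centering step described above can all be computed with standard L2 machinery.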
M32 SpaRCS: Recovering Low-Rank and Sparse Matrices from Compressive Measurements
A. Waters A. Sankaranarayanan R. Baraniuk Rice University andrew.e.waters@rice.edu saswin@rice.edu richb@rice.edu
M34 Inverting Grice's Maxims to Learn Rules from Natural Language Extractions
M. Sorower T. Dietterich J. Doppa W. Orr P. Tadepalli X. Fern Oregon State University ssorower@gmail.com tgd@cs.orst.edu doppa@eecs.oregonstate.edu orr@eecs.oregonstate.edu tadepall@eecs.oregonstate.edu xfern@eecs.oregonstate.edu
We consider the problem of recovering a matrix M that is the sum of a low-rank matrix L and a sparse matrix S from a small set of linear measurements of the form y = A(M) = A(L + S). This model subsumes three important classes of signal recovery problems: compressive sensing, affine rank minimization, and robust principal component analysis. We propose a natural optimization problem for signal recovery under this model and develop a new greedy algorithm called SpaRCS to solve it. SpaRCS inherits a number of desirable properties from the state-of-the-art CoSaMP and ADMiRA algorithms, including exponential convergence and efficient implementation. Simulation results with video compressive sensing, hyperspectral imaging, and robust matrix completion data sets demonstrate both the accuracy and efficacy of the algorithm. Subject Area: Speech and Signal Processing
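For illustration, a simplified iterative-hard-thresholding-style sketch of recovering a low-rank plus sparse matrix from linear measurements; this is not the CoSaMP/ADMiRA-style support-merging procedure of SpaRCS itself, and the dense measurement matrix Phi, step size and names are assumptions made for the example:

```python
import numpy as np

def lowrank_plus_sparse_iht(Phi, y, shape, rank, sparsity, n_iter=50, eta=1.0):
    """Recover M = L + S from y = Phi @ vec(M) by alternating a truncated-SVD
    update for the low-rank part and hard thresholding for the sparse part."""
    L = np.zeros(shape)
    S = np.zeros(shape)
    for _ in range(n_iter):
        r = y - Phi @ (L + S).ravel()            # measurement residual
        G = eta * (Phi.T @ r).reshape(shape)     # gradient-like proxy
        # low-rank update: truncated SVD of the updated estimate
        U, s, Vt = np.linalg.svd(L + G, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # sparse update: keep the largest-magnitude entries
        T = S + G
        thresh = np.partition(np.abs(T).ravel(), -sparsity)[-sparsity]
        S = np.where(np.abs(T) >= thresh, T, 0.0)
    return L, S
```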
We consider the problem of learning rules from natural language text sources. These sources, such as news articles and web texts, are created by a writer to communicate information to a reader, where the writer and reader share substantial domain knowledge. Consequently, the texts tend to be concise and mention the minimum information necessary for the reader to draw the correct conclusions. We study the problem of learning domain knowledge from such concise texts, which is an instance of the general problem of learning in the presence of missing data. However, unlike standard approaches to missing data, in this setting we know that facts are more likely to be missing from the text in cases where the reader can infer them from the facts that are mentioned combined with the domain knowledge. Hence, we can explicitly model this missingness process and invert it via probabilistic inference to learn the underlying domain knowledge. This paper introduces a mention model that models the probability of facts being mentioned in the text based on
what other facts have already been mentioned and domain knowledge in the form of Horn clause rules. Learning must simultaneously search the space of rules and learn the parameters of the mention model. We accomplish this via an application of Expectation Maximization within a Markov Logic framework. An experimental evaluation on synthetic and natural text data shows that the method can learn accurate rules and apply them to new texts to make correct inferences. Experiments also show that the method out-performs the standard EM approach that assumes mentions are missing at random. Subject Area: Applications\Natural Language Processing
Domain adaptation algorithms seek to generalize a model trained in a source domain to a new target domain. In many practical cases, the source and target distributions can differ substantially, and in some cases crucial target features may not have support in the source domain. In this paper we introduce an algorithm that bridges the gap between source and target domains by slowly adding both the target features and instances in which the current algorithm is the most confident. Our algorithm is a variant of co-training, and we name it CODA (Co-training for domain adaptation). Unlike the original co-training work, we do not assume a particular feature split. Instead, for each iteration of co-training, we add target features and formulate a single optimization problem which simultaneously learns a target predictor, a split of the feature space into views, and a shared subset of source and target features to include in the predictor. CODA significantly out-performs the state-of-the-art on the 12-domain benchmark data set of Blitzer et al. Indeed, over a wide range (65 of 84 comparisons) of target supervision, ranging from no labeled target data to a relatively large number of target labels, CODA achieves the best performance. Subject Area: Supervised Learning
Learning minimum volume sets of an underlying nominal distribution is a very effective approach to anomaly detection. Several approaches to learning minimum volume sets have been proposed in the literature, including the K-point nearest neighbor graph (K-kNNG) algorithm based on the geometric entropy minimization (GEM) principle [4]. The K-kNNG detector, while possessing several desirable characteristics, suffers from high computation complexity, and in [4] a simpler heuristic approximation, the leave-one-out kNNG (L1O-kNNG) was proposed. In this paper, we propose a novel bipartite k-nearest neighbor graph (BP-kNNG) anomaly detection scheme for estimating minimum volume sets. Our bipartite estimator retains all the desirable theoretical properties of the K-kNNG, while being computationally simpler than the K-kNNG and the surrogate L1O-kNNG detectors. We show that BP-kNNG is asymptotically consistent in recovering the p-value of each test point. Experimental results are given that illustrate the superior performance of BP-kNNG as compared to the L1O-kNNG and other state of the art anomaly detection schemes. Subject Area: Supervised Learning
m39 maximum margin multi-instance learning
H. Wang H. Huang F. Kamangar F. Nie C. Ding UTA huawangcs@gmail.com heng@uta.edu kamangar@uta.edu feipingnie@gmail.com chqding@uta.edu
Multi-instance learning (MIL) considers input as bags of instances, in which labels are assigned to the bags. MIL is useful in many real-world applications. For example, in image categorization semantic meanings (labels) of an image mostly arise from its regions (instances) instead of the entire image (bag). Existing MIL methods typically build their models using the Bag-to-Bag (B2B) distance, which are often computationally expensive and may not truly reflect the semantic similarities. To tackle this, in this paper we approach MIL problems from a new perspective using the Class-to-Bag (C2B) distance, which directly assesses the relationships between the classes and the bags. Taking into account the two major challenges in MIL, high heterogeneity on data and weak label association, we propose a novel Maximum Margin Multi-Instance Learning (M3I) approach to parameterize the C2B distance by introducing the class specific distance metrics and the locally adaptive significance coefficients. We apply our new approach to the automatic image categorization tasks on three (one single-label and two multi-label) benchmark data sets. Extensive experiments have demonstrated promising results that validate the proposed method. Subject Area: Supervised Learning\Classification
The problem of multiclass boosting is considered. A new framework, based on multi-dimensional codewords and predictors, is introduced. The optimal set of codewords is derived, and a margin-enforcing loss is proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time; 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. Subject Area: Supervised Learning
Classical Boosting algorithms, such as AdaBoost, build a strong classifier without concern about the computational cost. Some applications, in particular in computer vision, may involve up to millions of training examples and features. In such contexts, the training time may become prohibitive. Several methods exist to accelerate training, typically either by sampling the features, or the examples, used to train the weak learners. Even if those methods can precisely quantify the speed improvement they deliver, they offer no guarantee of being more efficient than any other, given the same amount of time. This paper aims at shedding some light on this problem, i.e. given a fixed amount of time, for a particular problem, which strategy is optimal in order to reduce the training loss the most. We apply this analysis to the design of new algorithms which estimate on the fly at every iteration the optimal trade-off between the number of samples and the number of features to look at in order to maximize the expected loss reduction. Experiments in object recognition with two standard computer vision data-sets show that the adaptive methods we propose outperform basic sampling and state-of-the-art bandit methods. Subject Area: Supervised Learning
m43 Kernel embeddings of latent tree graphical models
L. Song lesong@cs.cmu.edu A. Parikh apparikh@cs.cmu.edu E. Xing epxing@cs.cmu.edu Carnegie Mellon University Latent tree graphical models are natural tools for expressing long range and hierarchical dependencies among many variables which are common in computer vision, bioinformatics and natural language processing problems. However, existing models are largely restricted to discrete and Gaussian variables due to computational constraints; furthermore, algorithms for estimating the latent tree structure and learning the model parameters are largely restricted to heuristic local search. We present a method based on kernel embeddings of distributions for latent tree graphical models with continuous and non-Gaussian variables. Our method can recover the latent tree structures with provable guarantees and perform local-minimum free parameter learning and efficient inference. Experiments on simulated and real data show the advantage of our proposed approach. Subject Area: Supervised Learning\Kernel Methods
m44 Hierarchical multitask structured output learning for large-scale sequence segmentation
N. Goernitz nico.goernitz@tu-berlin.de Technical University Berlin C. Widmer cwidmer@tuebingen.mpg.de G. Raetsch Gunnar.Raetsch@tuebingen.mpg.de A. Kahles andre.kahles@tuebingen.mpg.de Max Planck Society G. Zeller georg.zeller@gmail.com EMBL S. Sonnenburg soeren@sonnenburgs.de TomTom We present a novel regularization-based Multitask Learning (MTL) formulation for Structured Output (SO) prediction for the case of hierarchical task relations. Structured output learning often results in difficult inference problems and requires large amounts of training data to obtain accurate models. We propose to use MTL to exploit information available for related structured output learning tasks by means of hierarchical regularization. Due to the combination of example sets, the cost of training models for structured output prediction can easily become infeasible for real world applications. We thus propose an efficient algorithm based on bundle methods to solve the optimization problems resulting from MTL structured output learning. We demonstrate the performance of our approach on gene finding problems from the application domain of computational biology. We show that 1) our proposed solver achieves much faster convergence than previous methods and 2) that the Hierarchical SO-MTL approach clearly outperforms considered non-MTL methods. Subject Area: Supervised Learning
m47 non-parametric group orthogonal matching Pursuit for sparse learning with multiple Kernels
V. Sindhwani vikas.sindhwani@gmail.com A. Lozano aclozano@us.ibm.com IBM T.J. Watson Research Center We consider regularized risk minimization in a large dictionary of Reproducing kernel Hilbert Spaces (RKHSs) over which the target function has a sparse representation. This setting, commonly referred to as Sparse Multiple Kernel Learning (MKL), may be viewed as the nonparametric extension of group sparsity in linear models. While the two dominant algorithmic strands of sparse learning, namely convex relaxations using l1 norm (e.g., Lasso) and greedy methods (e.g., OMP), have both been rigorously extended for group sparsity, the sparse MKL literature has so far mainly adopted the former with mild empirical success. In this paper, we close this gap by proposing a Group-OMP based framework for sparse multiple kernel learning. Unlike l1-MKL, our approach decouples the sparsity regularizer (via a direct l0 constraint) from the smoothness regularizer (via RKHS norms) which leads to better empirical performance as well as a simpler optimization procedure that only requires a blackbox single-kernel solver. The algorithmic development and empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds and sparse recovery conditions analogous to those for OMP [27] and Group-OMP [16]. Subject Area: Supervised Learning
m51 selecting receptive fields in deep networks
A. Coates A. Ng Stanford University acoates@cs.stanford.edu ang@cs.stanford.edu
m53 learning with the weighted trace-norm under arbitrary sampling distributions
R. Foygel University of Chicago R. Salakhutdinov University of Toronto O. Shamir Microsoft Research N. Srebro TTI-Chicago rina@uchicago.edu rsalakhu@utstat.toronto.edu ohadsh@microsoft.com nati@ttic.edu
Recent deep learning and unsupervised feature learning systems that learn from unlabeled data have achieved high performance in benchmarks by using extremely large architectures with many features (hidden units) at each layer. Unfortunately, for such large architectures the number of parameters usually grows quadratically in the width of the network, thus necessitating hand-coded local receptive fields that limit the number of connections from lower level features to higher ones (e.g., based on spatial locality). In this paper we propose a fast method to choose these connections that may be incorporated into a wide variety of unsupervised training methods. Specifically, we choose local receptive fields that group together those low-level features that are most similar to each other according to a pairwise similarity metric. This approach allows us to harness the advantages of local receptive fields (such as improved scalability, and reduced data requirements) when we do not know how to specify such receptive fields by hand or where our unsupervised training algorithm has no obvious generalization to a topographic setting. We produce results showing how this method allows us to use even simple unsupervised training algorithms to train successful multi-layered networks that achieve state-of-the-art results on CIFAR and STL datasets: 82.0% and 60.1% accuracy, respectively. Subject Area: Unsupervised & Semi-supervised Learning
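For illustration, a hedged sketch of grouping low-level features into receptive fields by pairwise similarity, as described above; the squared-correlation similarity, the random seeding and the function name are illustrative choices, not necessarily those used in the paper:

```python
import numpy as np

def choose_receptive_fields(H, n_groups, group_size, rng=None):
    """Group features with their nearest neighbours under a pairwise
    similarity metric. H is an (n_examples, n_features) matrix of
    low-level feature responses."""
    rng = np.random.default_rng(rng)
    Z = (H - H.mean(0)) / (H.std(0) + 1e-8)      # standardize responses
    sim = (Z.T @ Z / len(H)) ** 2                # squared-correlation similarity
    fields = []
    for _ in range(n_groups):
        seed = rng.integers(sim.shape[0])        # random seed feature
        nearest = np.argsort(-sim[seed])[:group_size]
        fields.append(nearest)                   # indices feeding one higher-level unit
    return fields
```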
We provide rigorous guarantees on learning with the weighted trace-norm under arbitrary sampling distributions. We show that the standard weighted-trace norm might fail when the sampling distribution is not a product distribution (i.e. when row and column indexes are not selected independently), present a corrected variant for which we establish strong learning guarantees, and demonstrate that it works better in practice. We provide guarantees when weighting by either the true or empirical sampling distribution, and suggest that even if the true distribution is known (or is uniform), weighting by the empirical distribution may be beneficial. Subject Area: Unsupervised & Semi-supervised Learning
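For reference, a small sketch of the weighted trace-norm referred to above, computed as the nuclear norm of the row/column-rescaled matrix; weighting by the empirical marginals of the observed entries is the choice the abstract argues for, and the function name is illustrative:

```python
import numpy as np

def weighted_trace_norm(X, row_probs, col_probs):
    """||diag(sqrt(p)) X diag(sqrt(q))||_* with row/column weight vectors
    p and q (e.g. empirical sampling marginals)."""
    W = np.sqrt(row_probs)[:, None] * X * np.sqrt(col_probs)[None, :]
    return np.linalg.norm(W, "nuc")
```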
In many clustering problems, we have access to multiple views of the data each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. Since the true clustering would assign a point to the same cluster irrespective of the view, we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches. Subject Area: Unsupervised & Semi-supervised Learning
A Bayesian approach to partitioning distance matrices is presented. It is inspired by the Translation-Invariant Wishart-Dirichlet process (TIWD) in (Vogt et al., 2010) and shares a number of advantageous properties like the fully probabilistic nature of the inference model, automatic selection of the number of clusters and applicability in semi-supervised settings. In addition, our method (which we call fastTIWD) overcomes the main shortcoming of the original TIWD, namely its high computational costs. The fastTIWD reduces the workload in each iteration of a Gibbs sampler from O(n^3) in the TIWD to O(n^2). Our experiments show that this cost reduction does not compromise the quality of the inferred partitions. With this new method it is now possible to mine large relational datasets with a probabilistic model, thereby automatically detecting new and potentially interesting clusters. Subject Area: Unsupervised & Semi-supervised Learning
m56 beyond spectral Clustering - tight relaxations of balanced graph Cuts
M. Hein S. Setzer Saarland University hein@cs.uni-saarland.de setzer@mia.uni-saarland.de
Spectral clustering is based on the spectral relaxation of the normalized/ratio graph cut criterion. While the spectral relaxation is known to be loose, it has been shown recently that a non-linear eigenproblem yields a tight relaxation of the Cheeger cut. In this paper, we extend this result considerably by providing a characterization of all balanced graph cuts which allow for a tight relaxation. Although the resulting optimization problems are nonconvex and non-smooth, we provide an efficient first-order scheme which scales to large graphs. Moreover, our approach comes with the quality guarantee that given any partition as initialization the algorithm either outputs a better partition or it stops immediately. Subject Area: Unsupervised & Semi-supervised Learning
m57 structural equations and divisive normalization for energy-dependent component analysis
J. Hirayama Kyoto University A. Hyvarinen hirayama@robot.kuass.kyoto-u.ac.jp aapo.hyvarinen@helsinki.fi
Components estimated by independent component analysis and related methods are typically not independent in real data. A very common form of nonlinear dependency between the components is correlations in their variances or energies. Here, we propose a principled probabilistic model to model the energy-correlations between the latent variables. Our two-stage model includes a linear mixing of latent signals into the observed ones like in ICA. The main new feature is a model of the energy-correlations based on the structural equation model (SEM), in particular, a Linear Non-Gaussian SEM. The SEM is closely related to divisive normalization which effectively reduces energy correlation. Our new two-stage model enables estimation of both the linear mixing and the interactions related to energy-correlations, without resorting to approximations of the likelihood function or other non-principled approaches. We demonstrate the applicability of our method with synthetic dataset, natural images and brain signals. Subject Area: Unsupervised and Semi-supervised Learning\ICA, PCA, CCA & Other Linear Models
We derive algorithms for generalised tensor factorisation (GTF) by building upon the well-established theory of Generalised Linear Models. Our algorithms are general in the sense that we can compute arbitrary factorisations in a message passing framework, derived for a broad class of exponential family distributions including special cases such as Tweedie's distributions corresponding to β-divergences. By bounding the step size of the Fisher Scoring iteration of the GLM, we obtain general updates for real data and multiplicative updates for non-negative data. The GTF framework is then easily extended to address the problems when multiple observed tensors are factorised simultaneously. We illustrate our coupled factorisation approach on synthetic data as well as on a musical audio restoration problem. Subject Area: Unsupervised & Semi-supervised Learning
Metric learning has become a very active research field. The most popular representative--Mahalanobis metric learning--can be seen as learning a linear transformation and then computing the Euclidean metric in the transformed space. Since a linear transformation might not always be appropriate for a given learning problem, kernelized versions of various metric learning algorithms exist. However, the problem then becomes finding the appropriate kernel function. Multiple kernel learning addresses this limitation by learning a linear combination of a number of predefined kernels; this approach can be also readily used in the context of multiple-source learning to fuse different data sources. Surprisingly, and despite the extensive work on multiple kernel learning for SVMs, there has been no work in the area of metric learning with multiple kernel learning. In this paper we fill this gap
and present a general approach for metric learning with multiple kernel learning. Our approach can be instantiated with different metric learning algorithms provided that they satisfy some constraints. Experimental evidence suggests that our approach outperforms metric learning with an unweighted kernel combination and metric learning with cross-validation based kernel selection. Subject Area: Unsupervised & Semi-supervised Learning
M63 Efficient Learning of Generalized Linear and single index models with isotonic regression
S. Kakade Microsoft Research A. Kalai O. Shamir Microsoft Research V. Kanade Harvard University sham@tti-c.org adum@microsoft.com ohadsh@microsoft.com vkanade@fas.harvard.edu
Recently, Mahoney and Orecchia demonstrated that popular diffusion-based procedures to compute a quick approximation to the first nontrivial eigenvector of a data graph Laplacian exactly solve certain regularized SemiDefinite Programs (SDPs). In this paper, we extend that result by providing a statistical interpretation of their approximation procedure. Our interpretation will be analogous to the manner in which ℓ2-regularized or ℓ1-regularized ℓ2 regression (often called Ridge regression and Lasso regression, respectively) can be interpreted in terms of a Gaussian prior or a Laplace prior, respectively, on the coefficient vector of the regression problem. Our framework will imply that the solutions to the Mahoney-Orecchia regularized SDP can be interpreted as regularized estimates of the pseudoinverse of the graph Laplacian. Conversely, it will imply that the solution to this regularized estimation problem can be computed very quickly by running, e.g., the fast diffusion-based PageRank procedure for computing an approximation to the first nontrivial eigenvector of the graph Laplacian. Empirical results are also provided to illustrate the manner in which approximate eigenvector computation implicitly performs statistical regularization, relative to running the corresponding exact algorithm. Subject Area: Unsupervised & Semi-supervised Learning
Generalized Linear Models (GLMs) and Single Index Models (SIMs) provide powerful generalizations of linear regression, where the target variable is assumed to be a (possibly unknown) 1-dimensional function of a linear predictor. In general, these problems entail nonconvex estimation procedures, and, in practice, iterative local search heuristics are often used. Kalai and Sastry (2009) provided the first provably efficient method, the Isotron algorithm, for learning SIMs and GLMs, under the assumption that the data is in fact generated under a GLM and under certain monotonicity and Lipschitz (bounded slope) constraints. The Isotron algorithm interleaves steps of perceptron-like updates with isotonic regression (fitting a one-dimensional non-decreasing function). However, to obtain provable performance, the method requires a fresh sample every iteration. In this paper, we provide algorithms for learning GLMs and SIMs, which are both computationally and statistically efficient. We modify the isotonic regression step in Isotron to fit a Lipschitz monotonic function, and also provide an efficient O(n log n) algorithm for this step, improving upon the previous O(n^2) algorithm. We provide a brief empirical study, demonstrating the feasibility of our algorithms in practice. Subject Area: Learning Theory
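For illustration, a hedged sketch of an Isotron-style update that alternates perceptron-like weight updates with a one-dimensional isotonic regression fit; the paper's algorithm additionally constrains the fitted link to be Lipschitz, which this simple sketch omits:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def isotron(X, y, n_iter=100):
    """Alternate fitting a monotone 1-d link on the current scores with a
    perceptron-like update of the linear weights."""
    n, d = X.shape
    w = np.zeros(d)
    iso = IsotonicRegression(out_of_bounds="clip")
    for _ in range(n_iter):
        scores = X @ w
        u_hat = iso.fit(scores, y).predict(scores)   # monotone link fit
        w = w + (X.T @ (y - u_hat)) / n              # perceptron-like update
    return w, iso
```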
The nested Chinese restaurant process is extended to design a nonparametric topic-model tree for representation of human choices. Each tree branch corresponds to a type of person, and each node (topic) has a corresponding probability vector over items that may be selected. The observed data are assumed to have associated temporal covariates (corresponding to the time at which choices are made), and we wish to impose that with increasing time it is more probable that topics deeper in the tree are utilized. This structure is imposed by developing a new "change point" stick-breaking model that is coupled with a Poisson and product-of-gammas construction. To share topics across the tree nodes, topic distributions are drawn from a Dirichlet process. As a demonstration of this concept, we analyze real data on course selections of undergraduate students at Duke University, with the goal of uncovering and concisely representing structure in the curriculum and in the characteristics of the student body. Subject Area: Unsupervised & Semi-supervised Learning
show that the objective functions of several important or recently developed non-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be unified under the minimum KL contraction framework with different choices of the KL contraction operators. Subject Area: Learning Theory
We consider a multi-armed bandit problem where there are two phases. The first phase is an experimentation phase where the decision maker is free to explore multiple options. In the second phase the decision maker has to commit to one of the arms and stick with it. Cost is incurred during both phases with a higher cost during the experimentation phase. We analyze the regret in this setup, and both propose algorithms and provide upper and lower bounds that depend on the ratio of the duration of the experimentation phase to the duration of the commitment phase. Our analysis reveals that if given the choice, it is optimal to experiment Θ(ln T) steps and then commit, where T is the time horizon. Subject Area: Theory\Online Learning
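For illustration, a minimal sketch of the explore-then-commit strategy discussed above: experiment for on the order of ln T rounds, then commit to the empirically best arm for the rest of the horizon; the constant c and the interface for the arms are illustrative assumptions:

```python
import numpy as np

def explore_then_commit(arms, T, c=1.0):
    """`arms` is a list of zero-argument reward samplers; play round-robin for
    roughly c * ln(T) rounds, then commit to the empirically best arm."""
    k = len(arms)
    n_explore = max(k, int(np.ceil(c * np.log(T))))
    counts, sums = np.zeros(k), np.zeros(k)
    rewards = []
    for t in range(n_explore):                 # experimentation phase
        a = t % k
        r = arms[a]()
        counts[a] += 1
        sums[a] += r
        rewards.append(r)
    best = int(np.argmax(sums / np.maximum(counts, 1)))
    for t in range(n_explore, T):              # commitment phase
        rewards.append(arms[best]())
    return best, np.array(rewards)
```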
We present an efficient algorithm for the problem of online multiclass prediction with bandit feedback in the fully adversarial setting. We measure its regret with respect to the log-loss defined in [AR09], which is parameterized by a scalar α. We prove that the regret of NEWTRON is O(log T) when α is a constant that does not vary with horizon T, and at most O(T^{2/3}) if α is allowed to increase to infinity with T. For α = O(log T), the regret is bounded by O(√T), thus solving the open problem of [KSST08, AR09]. Our algorithm is based on a novel application of the online Newton method [HAK07]. We test our algorithm and show it to perform well in experiments, even when α is a small constant. Subject Area: Theory\Online Learning
online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the n^2 parameters of a density matrix is much harder than learning the n parameters of a multinomial distribution. Completely surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequence of dyads share a common eigensystem, i.e. the worst case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret. Subject Area: Theory\Online Learning
m72 the impact of unlabeled Patterns in rademacher Complexity Theory for Kernel Classifiers
L. Oneto D. Anguita A. Ghio S. Ridella University of Genoa, Italy luca.oneto@unige.it davide.anguita@unige.it Alessandro.Ghio@unige.it sandro.ridella@unige.it
We derive here new generalization bounds, based on Rademacher Complexity theory, for model selection and error estimation of linear (kernel) classifiers, which exploit the availability of unlabeled samples. In particular, two results are obtained: the first one shows that, using the unlabeled samples, the confidence term of the conventional bound can be reduced by a factor of three; the second one shows that the unlabeled samples can be used to obtain much tighter bounds, by building localized versions of the hypothesis class containing the optimal classifier. Subject Area: Theory\Statistical Learning Theory
m73 unifying framework for fast learning rate of non-sparse multiple Kernel learning
T. Suzuki University of Tokyo s-taiji@stat.t.u-tokyo.ac.jp
m71 optimistic optimization of a deterministic function without the Knowledge of its smoothness
R. Munos remi.munos@inria.fr INRIA Lille - Nord Europe We consider a global optimization problem of a deterministic function f in a semi-metric space, given a finite budget of n evaluations. The function f is assumed to be locally smooth (around one of its global maxima) with respect to a semi-metric ℓ. We describe two algorithms based on optimistic exploration that use a hierarchical partitioning of the space at all scales. A first contribution is an algorithm, DOO, that requires the knowledge of ℓ. We report a finite-sample performance bound in terms of a measure of the quantity of near-optimal states. We then define a second algorithm, SOO, which does not require the knowledge of the semi-metric ℓ under which f is smooth, and whose performance is almost as good as DOO optimally-fitted. Subject Area: Theory\Online Learning
In this paper, we give a new generalization error bound of Multiple Kernel Learning (MKL) for a general class of regularizations. Our main target in this paper is dense type regularizations including ℓp-MKL that imposes ℓp-mixed-norm regularization instead of ℓ1-mixed-norm regularization. According to the recent numerical experiments, the sparse regularization does not necessarily show a good performance compared with dense type regularizations. Motivated by this fact, this paper gives a general theoretical tool to derive fast learning rates that is applicable to arbitrary monotone norm-type regularizations in a unifying manner. As a byproduct of our general result, we show a fast learning rate of ℓp-MKL that is tightest among existing bounds. We also show that our general learning rate achieves the minimax lower bound. Finally, we show that, when the complexities of candidate reproducing kernel Hilbert spaces are inhomogeneous, dense type regularization shows better learning rate compared with sparse ℓ1 regularization. Subject Area: Theory\Statistical Learning Theory
coordinate descent algorithm, and note that performing the greedy step efficiently weakens the costly dependence on the problem size provided the solution is sparse. We then propose a suite of methods that perform these greedy steps efficiently by a reduction to nearest neighbor search. We also devise a more amenable form of greedy descent for composite non-smooth objectives; as well as several approximate variants of such greedy descent. We develop a practical implementation of our algorithm that combines greedy coordinate descent with locality sensitive hashing. Without tuning the latter data structure, we are not only able to significantly speed up the vanilla greedy method, but also outperform cyclic descent when the problem size becomes large. Our results indicate the effectiveness of our nearest neighbor strategies, and also point to many open questions regarding the development of computational geometric techniques tailored towards first-order optimization methods. Subject Area: Theory\Statistical Learning Theory
For a learning problem whose associated excess loss class is (β, B)-Bernstein, we show that it is theoretically possible to track the same classification performance of the best (unknown) hypothesis in our class, provided that we are free to abstain from prediction in some region of our choice. The (probabilistic) volume of this rejected region of the domain is shown to be diminishing at rate O(Bθ(1/m)^β), where θ is Hanneke's disagreement coefficient. The strategy achieving this performance has computational barriers because it requires empirical error minimization in an agnostic setting. Nevertheless, we heuristically approximate this strategy and develop a novel selective classification algorithm using constrained SVMs. We show empirically that the resulting algorithm consistently outperforms the traditional rejection mechanism based on distance from decision boundary. Subject Area: Theory\Statistical Learning Theory
Latent variable models are frequently used to identify structure in dichotomous network data, in part because they give rise to a Bernoulli product likelihood that is both well understood and consistent with the notion of exchangeable random graphs. In this article we propose conservative confidence sets that hold with respect to these underlying Bernoulli parameters as a function of any given partition of network nodes, enabling us to assess estimates of residual network structure, that is, structure that cannot be explained by known covariates and thus cannot be easily verified by manual inspection. We demonstrate the proposed methodology by analyzing student friendship networks from the National Longitudinal Survey of Adolescent Health that include race, gender, and school year as covariates. We employ a stochastic expectation-maximization algorithm to fit a logistic regression model that includes these explanatory variables as well as a latent stochastic blockmodel component and additional node-specific effects. Although maximum-likelihood estimates do not appear consistent in this context, we are able to evaluate confidence sets as a function of different blockmodel partitions, which enables us to qualitatively assess the significance of estimated residual network structure relative to a baseline, which models covariates but lacks block structure. Subject Area: Probabilistic Models and Methods
This paper considers the problem of combining multiple models to achieve a prediction accuracy not much worse than that of the best single model for least squares regression. It is known that if the models are mis-specified, model averaging is superior to model selection. Specifically, let n be the sample size, then the worst case regret of the former decays at the rate of O(1/n) while the worst case regret of the latter decays at the rate of O(1/√n). In the literature, the most important and widely studied model averaging method that achieves the optimal O(1/n) average regret is the exponential weighted model averaging (EWMA) algorithm. However this method suffers from several limitations. The purpose of this paper is to present a new greedy model averaging procedure that improves EWMA. We prove strong theoretical guarantees for the new procedure and illustrate our theoretical results with empirical examples. Subject Area: Theory\Statistical Learning Theory
Markov Random Fields (MRFs) have proven very powerful both as density estimators and feature extractors for classification. However, their use is often limited by an inability to estimate the partition function Z. In this paper, we exploit the gradient descent training procedure of restricted Boltzmann machines (a type of MRF) to track the log partition function during learning. Our method relies on two distinct sources of information: (1) estimating the change in Z incurred by each gradient update, (2) estimating the difference in Z over a small set of tempered distributions using bridge sampling. The two sources of information are then combined using an inference procedure similar to Kalman filtering. Learning MRFs through Tempered Stochastic Maximum Likelihood, we can estimate Z using no more temperatures than are required for learning. Comparing to both exact values and estimates using annealed importance sampling (AIS), we show on several datasets that our method is able to accurately track the log partition function. In contrast to AIS, our method provides this estimate at each time-step, at a computational cost similar to that required for training alone. Subject Area: Probabilistic Models and Methods
m79 Probabilistic amplitude and frequency demodulation
R. Turner M. Sahani Gatsby Unit, UCL ret26@cam.ac.uk maneesh@gatsby.ucl.ac.uk
A number of recent scientific and engineering problems require signals to be decomposed into a product of a slowly varying positive envelope and a quickly varying carrier whose instantaneous frequency also varies slowly over time. Although signal processing provides algorithms for so-called amplitude- and frequency-demodulation (AFD), there are well known problems with all of the existing methods. Motivated by the fact that AFD is ill-posed, we approach the problem using probabilistic inference. The new approach, called probabilistic amplitude and frequency demodulation (PAFD), models instantaneous frequency using an autoregressive generalization of the von Mises distribution, and the envelopes using Gaussian auto-regressive dynamics with a positivity constraint. A novel form of expectation propagation is used for inference. We demonstrate that although PAFD is computationally demanding, it outperforms previous approaches on synthetic and real signals in clean, noisy and missing data settings. Subject Area: Probabilistic Models and Methods
m81 spike and slab Variational inference for multitask and multiple Kernel learning
M. Titsias mtitsias@cs.man.ac.uk University of Manchester M. Lázaro-Gredilla lazarox@gmail.com Universidad Carlos III de Madrid We introduce a variational Bayesian inference algorithm which can be widely applied to sparse linear models. The algorithm is based on the spike and slab prior which, from a Bayesian perspective, is the golden standard for sparse inference. We apply the method to a general multi-task and multiple kernel learning model in which a common set of Gaussian process functions is linearly combined with task-specific sparse weights, thus inducing relation between tasks. This model unifies several sparse linear models, such as generalized linear models, sparse factor analysis and matrix factorization with missing values, so that the variational algorithm can be applied to all these cases. We demonstrate our approach in multi-output Gaussian process regression, multi-class classification, image processing applications and collaborative filtering. Subject Area: Probabilistic Models and Methods
for finite-length practical sparse graphs, the tree structure approximation to the code graph provides accurate estimates for the marginal of each variable. Subject Area: Probabilistic Models and Methods
m84 global solution of fully-observed Variational bayesian matrix factorization is Column-Wise independent
S. Nakajima shinnkj23@gmail.com Nikon Corporation M. Sugiyama sugi@cs.titech.ac.jp Tokyo Institute of Technology S. Babacan dbabacan@illinois.edu University of Illinois at Urbana-Champaign Variational Bayesian matrix factorization (VBMF) efficiently approximates the posterior distribution of factorized matrices by assuming matrix-wise independence of the two factors. A recent study on fully-observed VBMF showed that, under a stronger assumption that the two factorized matrices are column-wise independent, the global optimal solution can be analytically computed. However, it was not clear how restrictive the column-wise independence assumption is. In this paper, we prove that the global solution under matrix-wise independence is actually column-wise independent, implying that the column-wise independence assumption is harmless. A practical consequence of our theoretical finding is that the global solution under matrix-wise independence (which is a standard setup) can be obtained analytically in a computationally very efficient way without any iterative algorithms. We experimentally illustrate advantages of using our analytic solution in probabilistic principal component analysis. Subject Area: Probabilistic Models and Methods
The performance of Markov chain Monte Carlo methods is often sensitive to the scaling and correlations between the random variables of interest. An important source of information about the local correlation and scale is given by the Hessian matrix of the target distribution, but this is often either computationally expensive or infeasible. In this paper we propose MCMC samplers that make use of quasi-Newton approximations from the optimization literature, that approximate the Hessian of the target distribution from previous samples and gradients generated by the sampler. A key issue is that MCMC samplers that depend on the history of previous states are in general not valid. We address this problem by using limited memory quasi-Newton methods, which depend only on a fixed window of previous samples. On several real world datasets, we show that the quasi-Newton sampler is a more effective sampler than standard Hamiltonian Monte Carlo at a fraction of the cost of MCMC methods that require higher-order derivatives. Subject Area: Probabilistic Models and Methods
m88 non-conjugate Variational message Passing for multinomial and binary regression
D. Knowles University of Cambridge T. Minka Microsoft Research Ltd dak33@cam.ac.uk minka@microsoft.com
Variational Message Passing (VMP) is an algorithmic implementation of the Variational Bayes (VB) method which applies only in the special case of conjugate exponential family models. We propose an extension to VMP, which we refer to as Non-conjugate Variational Message Passing (NCVMP) which aims to alleviate this restriction while maintaining modularity, allowing choice in how expectations are calculated, and integrating into an existing message-passing framework: Infer.NET. We demonstrate NCVMP on logistic binary and multinomial regression. In the multinomial case we introduce a novel variational bound for the softmax factor which is tighter than other commonly used bounds whilst maintaining computational tractability. Subject Area: Probabilistic Models and Methods
m92 spatial distance dependent Chinese restaurant processes for image segmentation
S. Ghosh E. Sudderth Brown University A. Ungureanu Morgan Stanley D. Blei Princeton University sghosh@cs.brown.edu sudderth@cs.brown.edu andrei.b.ungureanu@gmail.com blei@cs.princeton.edu
The distance dependent Chinese restaurant process (ddCRP) was recently introduced to accommodate random partitions of non-exchangeable data. The ddCRP clusters data in a biased way: each data point is more likely to be clustered with other data that are near it in an external sense. This paper examines the ddCRP in a spatial setting with the goal of natural image segmentation. We explore the biases of the spatial ddCRP model and propose a novel hierarchical extension better suited for producing human-like segmentations. We then study the sensitivity of the models to various distance and appearance hyperparameters, and provide the first rigorous comparison of nonparametric Bayesian models in the image segmentation domain. On unsupervised image segmentation, we demonstrate that similar performance to existing nonparametric Bayesian models is possible with substantially simpler models and algorithms. Subject Area: Probabilistic Models and Methods
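For reference, a hedged sketch of drawing a partition from a distance dependent CRP prior in the sense of Blei and Frazier: each data point links to another with probability proportional to a decaying function of their distance, or to itself with probability proportional to alpha, and clusters are the connected components of the link graph; the exponential decay and the names are illustrative choices:

```python
import numpy as np

def ddcrp_sample_links(D, alpha, decay=lambda d: np.exp(-d), rng=None):
    """Draw customer links from a ddCRP prior given a distance matrix D,
    then read off the partition as connected components of the link graph."""
    rng = np.random.default_rng(rng)
    n = D.shape[0]
    links = np.empty(n, dtype=int)
    for i in range(n):
        weights = decay(D[i]).astype(float)
        weights[i] = alpha                      # self-link starts a new cluster
        links[i] = rng.choice(n, p=weights / weights.sum())
    labels = -np.ones(n, dtype=int)             # cluster label per data point
    c = 0
    for i in range(n):
        if labels[i] == -1:
            stack = [i]
            while stack:                        # flood-fill one component
                j = stack.pop()
                if labels[j] == -1:
                    labels[j] = c
                    stack.extend([links[j]] + [k for k in range(n) if links[k] == j])
            c += 1
    return links, labels
```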
m93 analytical results for the error in filtering of gaussian Processes
A. Susemihl alex.susemihl@bccn-berlin.de M. Opper manfred.opper@tu-berlin.de Berlin Institute of Technology R. Meir rmeir@ee.technion.ac.il Technion Bayesian filtering of stochastic stimuli has received a great deal of attention recently. It has been applied to describe the way in which biological systems dynamically represent and make decisions about the environment. There have been no exact results for the error in the biologically plausible setting of inference on point process, however. We present an exact analysis of the evolution of the mean-squared error in a state estimation task using Gaussian-tuned point processes as sensors. This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. This is done for Markovian and a class of non-Markovian Gaussian processes. We find that there is an optimal tuning width for which the error is minimized. This leads to a characterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing. Subject Area: Probabilistic Models and Methods
Multi-class Gaussian Process Classifiers (MGPCs) are often affected by over-fitting problems when labeling errors occur far from the decision boundaries. To prevent this, we investigate a robust MGPC (RMGPC) which considers labeling errors independently of their distance to the decision boundaries. Expectation propagation is used for approximate inference. Experiments with several datasets in which noise is injected in the class labels illustrate the benefits of RMGPC. This method performs better than other Gaussian process alternatives based on considering latent Gaussian noise or heavy-tailed processes. When no noise is injected in the labels, RMGPC still performs equal or better than the other methods. Finally, we show how RMGPC can be used for successfully identifying data instances which are difficult to classify accurately in practice. Subject Area: Probabilistic Models and Methods
Cancer has complex patterns of progression that include converging as well as diverging progressional pathways. Vogelstein's path model of colon cancer was a pioneering contribution to cancer research. Since then, several attempts have been made at obtaining mathematical models of cancer progression, devising learning algorithms, and applying these to cross-sectional data. Beerenwinkel et al. provided, what they coined, EM-like algorithms for Oncogenetic Trees (OTs) and mixtures of such. Given the small size of current and future data sets, it is important to minimize the number of parameters of a model. For this reason, we too focus on tree-based models and introduce Hidden-variable Oncogenetic Trees (HOTs). In contrast to OTs, HOTs allow for errors in the data and thereby provide more realistic modeling. We also design global structural EM algorithms for learning HOTs and mixtures of HOTs (HOT-mixtures). The algorithms are global in the sense that, during the M-step, they find a structure that yields a global maximum of the expected complete log-likelihood rather than merely one that improves it. The algorithm for single HOTs performs very well on reasonable-sized data sets, while that for HOT-mixtures requires data sets of sizes obtainable only with tomorrow's more cost-efficient technologies. Subject Area: Probabilistic Models and Methods
We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks. Subject Area: Probabilistic Models and Methods
There are many settings in which we wish to fit a model of the behavior of individuals but where our data consist only of aggregate information (counts or low-dimensional contingency tables). This paper introduces Collective Graphical Models---a framework for modeling and probabilistic inference that operates directly on the sufficient statistics of the individual model. We derive a highly efficient Gibbs sampling algorithm for sampling from the posterior distribution of the sufficient statistics conditioned on noisy aggregate observations, prove its correctness, and demonstrate its effectiveness experimentally. Subject Area: Probabilistic Models and Methods
m98 simultaneous sampling and multi-structure fitting with adaptive reversible Jump mCmC
T. Pham T. Chin J. Yu D. Suter The University of Adelaide trung@cs.adelaide.edu.au tjchin@cs.adelaide.edu.au jin.yu@adelaide.edu.au dsuter@cs.adelaide.edu.au approach to learning these models, and describe an importance sampling algorithm for forecasting future events using these models, using a proposal distribution based on Poisson superposition. We then use synthetic data, supercomputer event logs, and web search query logs to illustrate that our learning algorithm can efficiently learn nonlinear temporal dependencies, and that our importance sampling algorithm can effectively forecast future events. Subject Area: Probabilistic Models and Methods
Multi-structure model fitting has traditionally taken a two-stage approach: First, sample a (large) number of model hypotheses, then select the subset of hypotheses that optimise a joint fitting and model selection criterion. This disjoint two-stage approach is arguably suboptimal and inefficient - if the random sampling did not retrieve a good set of hypotheses, the optimised outcome will not represent a good fit. To overcome this weakness we propose a new multi-structure fitting approach based on Reversible Jump MCMC. Instrumental in raising the effectiveness of our method is an adaptive hypothesis generator, whose proposal distribution is learned incrementally and online. We prove that this adaptive proposal satisfies the diminishing adaptation property crucial for ensuring ergodicity in MCMC. Our method effectively conducts hypothesis sampling and optimisation simultaneously, and gives superior computational efficiency over other methods. Subject Area: Probabilistic Models and Methods
m101 facial expression transfer with input-output temporal restricted boltzmann machines
M. Zeiler zeiler@cs.nyu.edu G. Taylor gwtaylor@cs.nyu.edu R. Fergus fergus@cs.nyu.edu New York University L. Sigal lsigal@disneyresearch.com I. Matthews iainm@disneyresearch.com Disney Research Pittsburgh We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our model to a challenging realworld graphics problem: facial expression transfer. Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data. Subject Area: Probabilistic Models and Methods
tuesday ConferenCe
ORAL SESSION
session 1 - 9:30 10:40 am
Session Chair: Remi Munos Posner leCture: learning about sensorimotor data Richard Sutton University of Alberta sutton@cs.ualberta.ca
a non-Parametric approach to dynamic Programming Oliver Kroemer oliverkro@googlemail.com Jan Peters mail@jan-peters.net Technische Universitaet Darmstadt In this paper, we consider the problem of policy evaluation for continuous-state systems. We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. The true form of the value function for this model can be determined, and can be computed using Galerkin's method. Furthermore, we also present a unified view of several well-known policy evaluation methods. In particular, we show that the same Galerkin method can be used to derive Least-Squares Temporal Difference learning, Kernelized Temporal Difference learning, and a discrete-state Dynamic Programming solution, as well as our proposed method. In a numerical evaluation of these algorithms, the proposed approach performed better than the other methods. Subject Area: Control and Reinforcement Learning
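For reference, a minimal sketch of Least-Squares Temporal Difference (LSTD) learning, one of the policy evaluation methods the abstract says can be derived from the same Galerkin view; the transition format, the feature map phi, and the small ridge term are illustrative assumptions:

```python
import numpy as np

def lstd(transitions, phi, gamma=0.95):
    """LSTD value estimation: solve A w = b with A = sum phi(s)(phi(s)-gamma phi(s'))^T
    and b = sum r phi(s). `transitions` is a list of (s, r, s_next) tuples
    sampled under the evaluated policy; phi maps a state to a feature vector."""
    d = len(phi(transitions[0][0]))
    A = np.zeros((d, d))
    b = np.zeros(d)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    w = np.linalg.solve(A + 1e-8 * np.eye(d), b)   # small ridge for stability
    return w   # value estimate: V(s) is approximately phi(s) @ w
```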
Temporal-difference (TD) learning of reward predictions underlies both reinforcement-learning algorithms and the standard dopamine model of reward-based learning in the brain. This confluence of computational and neuroscientific ideas is perhaps the most successful since the Hebb synapse. Can it be extended beyond reward? The brain certainly predicts many things other than reward---such as in a forward model of the consequences of various ways of behaving---and TD methods can be used to make these predictions. The idea and advantages of using TD methods to learn large numbers of predictions about many states and stimuli, in parallel, have been apparent since the 1990s, but technical issues have prevented this vision from being practically implemented...until now. A key breakthrough was the development of a new family of gradient-TD methods, introduced at NIPS in 2008 (by Maei, Szepesvari, and myself). Using these methods, and other ideas, we are now able to learn thousands of non-reward predictions in real-time at 10Hz from a single sensorimotor data stream from a physical robot. These predictions are temporally extended (ranging up to tens of seconds of anticipation), goal oriented, and policy contingent. The new algorithms enable learning to be off-policy and in parallel, resulting in dramatic increases in the amount that can be learned in a given amount of time. Our effective learning rate scales linearly with computational resources. On a consumer laptop we can learn thousands of predictions in real-time. On a larger computer, or on a comparable laptop in a few years, the same methods could learn millions of meaningful predictions about different alternate ways of behaving. These predictions in aggregate constitute a rich detailed model of the world that can support planning methods such as approximate dynamic programming.
Richard S. Sutton is a professor and iCORE chair in the department of computing science at the University of Alberta. He is a fellow of the Association for the Advancement of Artificial Intelligence and co-author of the textbook Reinforcement Learning: An Introduction from MIT Press. Before joining the University of Alberta in 2003, he worked in industry at AT&T and GTE Labs, and in academia at the University of Massachusetts. He received a PhD in computer science from the University of Massachusetts in 1984 and a BA in psychology from Stanford University in 1978. Rich's research interests center on the learning problems facing a decision-maker interacting with its environment, which he sees as central to artificial intelligence. He is also interested in animal learning psychology, in connectionist networks, and generally in systems that continually improve their representations and models of the world.
SPOTLIGHT SESSION
session 2 - 10:40 11:10 am
Session Chair: Remi Munos Action-Gap Phenomenon in Reinforcement Learning A. Farahmand, McGill University Subject Area: Control and Reinforcement Learning See abstract, page 48 (T5) The Fixed Points of Off-Policy TD J. Kolter, MIT Subject Area: Control and Reinforcement Learning See abstract, page 48 (T6) Inductive reasoning about chimeric creatures C. Kemp, Carnegie Mellon University Subject Area: Cognitive Science See abstract, page 50 (T14) Evaluating computational models of preference learning A. Jern, C. Lucas, C. Kemp, Carnegie Mellon University Subject Area: Cognitive Science See abstract, page 50 (T13) Identifying Alzheimer's Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis S. Huang, J. Li, J. Ye, T. Wu, Arizona State University K. Chen, A. Fleisher, E. Reiman, Banner Alzheimer's Institute. Subject Area: Brain Imaging See abstract, page 51 (T16)
Decoding of Finger Flexion from Electrocorticographic Signals Using Switching NonParametric Dynamic Systems Z. Wang, Q. Ji, Rensselaer Polytechnic Institute; G. Schalk, Wadsworth Center Subject Area: Brain-computer Interfaces & Neural Prostheses See abstract, page 51 (T17) Active learning of neural response functions with Gaussian processes Mijung Park, Greg Horwitz, Jonathan Pillow, UT Austin Subject Area: Neural Coding See abstract, page 52 (T20)
ORAL SESSION
Session 2 - 11:10 - 11:30 am
Session Chair: Michael Collins

On the Completeness of First-Order Knowledge Compilation for Lifted Probabilistic Inference
Guy Van den Broeck guy.vandenbroeck@cs.kuleuven.be
Katholieke Universiteit Leuven

Probabilistic logics are receiving a lot of attention today because of their expressive power for knowledge representation and learning. However, this expressivity is detrimental to the tractability of inference, when done at the propositional level. To solve this problem, various lifted inference algorithms have been proposed that reason at the first-order level, about groups of objects as a whole. Despite the existence of various lifted inference approaches, there are currently no completeness results about these algorithms. The key contribution of this paper is that we introduce a formal definition of lifted inference that allows us to reason about the completeness of lifted inference algorithms relative to a particular class of probabilistic models. We then show how to obtain a completeness result using a first-order knowledge compilation approach for theories of formulae containing up to two logical variables. Subject Area: Structured and Relational Data
ORAL SESSION
Session 3 - 12:00 - 12:40 pm
Session Chair: Amir Globerson

Modelling Genetic Variations using Fragmentation-Coagulation Processes
Yee Whye Teh ywteh@gatsby.ucl.ac.uk
Charles Blundell c.blundell@gatsby.ucl.ac.uk
Gatsby Computational Neuroscience Unit, UCL
Lloyd Elliott elliott@gatsby.ucl.ac.uk
University College London

We propose a novel class of Bayesian nonparametric models for sequential data called fragmentation-coagulation processes (FCPs). FCPs model a set of sequences using a partition-valued Markov process which evolves by splitting and merging clusters. An FCP is exchangeable, projective, stationary and reversible, and its equilibrium distributions are given by the Chinese restaurant process. As opposed to hidden Markov models, FCPs allow for flexible modelling of the number of clusters, and they avoid label switching non-identifiability problems. We develop an efficient Gibbs sampler for FCPs which uses uniformization and the forward-backward algorithm. Our development of FCPs is motivated by applications in population genetics, and we demonstrate the utility of FCPs on problems of genotype imputation with phased and unphased SNP data. Subject Area: Bayesian Nonparametrics

Priors over Recurrent Continuous Time Processes
Ardavan Saeedi ardavan.s@stat.ubc.ca
Alexandre Bouchard-Côté bouchard@stat.ubc.ca
University of British Columbia

We introduce the Gamma-Exponential Process (GEP), a prior over a large family of continuous time processes. A hierarchical version of this prior (HGEP; the Hierarchical GEP) yields a useful model for analyzing complex time series. Models based on HGEPs display many attractive properties: conjugacy, exchangeability and closed-form predictive distribution for the waiting times, and exact Gibbs updates for the time scale parameters. After establishing these properties, we show how posterior inference can be carried out efficiently using Particle MCMC methods. This yields an MCMC algorithm that can resample entire sequences atomically while avoiding the complications of introducing slice and stick auxiliary variables. We applied our model to the problem of estimating the disease progression in Multiple Sclerosis, and to RNA evolutionary modeling. In both domains, we found that our model outperformed the standard rate matrix estimation approach. Subject Area: Bayesian Nonparametrics
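As a small illustration of the equilibrium distribution mentioned in the FCP abstract above, the sketch below draws a single partition from a Chinese restaurant process with concentration alpha. It only illustrates that clustering prior, not the FCP Gibbs sampler itself; the function name and parameter values are invented for illustration.

import random

def sample_crp(n_items, alpha, seed=0):
    """Draw one partition of n_items from a Chinese restaurant process:
    item i joins an existing cluster with probability proportional to its
    size, or opens a new cluster with probability proportional to alpha."""
    rng = random.Random(seed)
    clusters = []      # current cluster sizes
    assignment = []    # cluster index of each item
    for _ in range(n_items):
        weights = clusters + [alpha]
        r = rng.random() * sum(weights)
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(clusters):
            clusters.append(1)   # new cluster
        else:
            clusters[k] += 1     # join cluster k
        assignment.append(k)
    return assignment, clusters

if __name__ == "__main__":
    labels, sizes = sample_crp(20, alpha=1.5)
    print("cluster labels:", labels)
    print("cluster sizes :", sizes)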
SPOTLIGHT SESSION

Session 3 - 12:40 - 1:10 pm

Session Chair: Amir Globerson

Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability
D. Reichert, P. Series, A. Storkey, Univ. of Edinburgh
Subject Area: Computational Neural Models. See abstract, page 51 (T18)

Sequence Learning with Hidden Units in Spiking Neural Networks
J. Brea, Bern University; W. Senn & J. Pfister, Cambridge University
Subject Area: Computational Neural Models. See abstract, page 52 (T19)

Information Rates and Optimal Decoding in Large Neural Populations
K. Rahnama Rad, L. Paninski, Columbia University
Subject Area: Probabilistic Models and Methods. See abstract, page 66 (T85)

A Blind Sparse Deconvolution Method for Neural Spike Identification
C. Ekanadham, D. Tranchina, E. Simoncelli, Courant Institute, New York University
Subject Area: Approximate Inference. See abstract, page 67 (T88)

Accelerated Adaptive Markov Chain for Partition Function Computation
S. Ermon, C. Gomes, A. Sabharwal, IBM Watson Research Center; B. Selman, Cornell University
Subject Area: Approximate Inference. See abstract, page 67 (T89)

The Kernel Beta Process
L. Ren, Y. Wang, D. Dunson, L. Carin, Duke University
Subject Area: Bayesian Nonparametrics. See abstract, page 69 (T95)

Solving Decision Problems with Limited Information
D. Maua, C. de Campos, Dalle Molle Institute for Artificial Intelligence
Subject Area: Graphical Models. See abstract, page 70 (T99)

Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
F. Bach, INRIA - Ecole Normale Superieure; E. Moulines, Telecom Paristech
Subject Area: Stochastic Methods. See abstract, page 62 (T64)

Online Submodular Set Cover, Ranking, and Repeated Active Learning
A. Guillory, J. Bilmes, University of Washington
Subject Area: Online Learning. See abstract, page 65 (T80)

Sparse Estimation with Structured Dictionaries
D. Wipf, Microsoft Research Asia
Subject Area: Sparsity and Feature Selection. See abstract, page 59 (T52)

Universal Low-Rank Matrix Recovery from Pauli Measurements
Y. Liu, National Institute of Standards and Technology
Subject Area: Theory. See abstract, page 63 (T71)

See the Tree Through the Lines: The Shazoo Algorithm
F. Vitale, N. Cesa-Bianchi and G. Zappella, Università degli Studi di Milano; C. Gentile, Università dell'Insubria
Subject Area: Online Learning. See abstract, page 66 (T82)

On U-processes and Clustering Performance
S. Clémençon, Telecom ParisTech
Subject Area: Statistical Learning Theory. See abstract, page 66 (T84)
ORAL SESSION
Session 4 - 1:10 - 1:30 pm
Session Chair: Shai Shalev-Shwartz

Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
David Mcallester mcallester@ttic.edu
Joseph Keshet jkeshet@ttic.edu
TTI-Chicago

We consider latent structural versions of probit loss and ramp loss. We show that these surrogate loss functions are consistent in the strong sense that for any feature map (finite or infinite dimensional) they yield predictors approaching the infimum task loss achievable by any linear predictor over the given features. We also give finite sample generalization bounds (convergence rates) for these loss functions. These bounds suggest that probit loss converges more rapidly. However, ramp loss is more easily optimized and may ultimately be more practical. Subject Area: Learning with Structured Data
SPOTLIGHT SESSION
Session 4 - 1:30 - 2:00 pm
Session Chair: Shai Shalev-Shwartz

Algorithms and Hardness Results for Parallel Large Margin Learning
R. Servedio, Columbia University; P. Long, Google
Subject Area: Learning Theory. See abstract, page 63 (T68)
ORAL SESSION

Session 5 - 4:00 - 5:30 pm

Session Chair: Pradeep Ravikumar

The last 15 years have seen an explosion in the role of sparsity in mathematical signal and image processing, signal and image acquisition and reconstruction algorithms, and myriad applications. It is also central to machine learning. I will present an overview of the mathematical theory and several fundamental algorithmic results, including a fun application to solving Sudoku puzzles.
Anna Gilbert received an S.B. degree from the University of Chicago and a Ph.D. from Princeton University, both in mathematics. In 1997, she was a postdoctoral fellow at Yale University and AT&T Labs-Research. From 1998 to 2004, she was a member of technical staff at AT&T Labs-Research in Florham Park, NJ. Since then she has been with the Department of Mathematics at the University of Michigan, where she is now a Professor. She has received several awards, including a Sloan Research Fellowship (2006), an NSF CAREER award (2006), the National Academy of Sciences Award for Initiatives in Research (2008), the Association for Computing Machinery (ACM) Douglas Engelbart
Best Paper award (2008), and the EURASIP Signal Processing Best Paper award (2010). Her research interests include analysis, probability, networking, and algorithms. She is especially interested in randomized algorithms with applications to harmonic analysis, signal and image processing, networking, and massive datasets.
POSTER SESSION
AND RECEPTION - 5:45 pm - 11:59 pm

T1 A Non-Parametric Approach to Dynamic Programming, O. Kroemer, J. Peters
T2 Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning, J. Pajarinen, J. Peltonen
T3 Transfer from Multiple MDPs, A. Lazaric, M. Restelli
T4 Variance Reduction in Monte-Carlo Tree Search, J. Veness, M. Lanctot, M. Bowling
T5 Action-Gap Phenomenon in Reinforcement Learning, A. Farahmand
T6 The Fixed Points of Off-Policy TD, J. Kolter
T7 Convergent Fitted Value Iteration with Linear Function Approximation, D. Lizotte
T8 Blending Autonomous Exploration and Apprenticeship Learning, T. Walsh, D. Hewlett, C. Morrison
T9 Selecting the State-Representation in Reinforcement Learning, O. Maillard, R. Munos, D. Ryabko
T10 A Reinforcement Learning Theory for Homeostatic Regulation, M. Keramati, B. Gutkin
T11 Environmental Statistics and the Trade-Off Between Model-Based and TD Learning in Humans, D. Simon, N. Daw
T12 TDγ: Re-evaluating Complex Backups in Temporal Difference Learning, G. Konidaris, S. Niekum, P. Thomas
T13 Evaluating Computational Models of Preference Learning, A. Jern, C. Lucas, C. Kemp
T14 Inductive Reasoning about Chimeric Creatures, C. Kemp
T15 Predicting Response Time and Error Rates in Visual Search, B. Chen, V. Navalpakkam, P. Perona
T16 Identifying Alzheimer's Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis, S. Huang, J. Li, J. Ye, T. Wu, K. Chen, A. Fleisher, E. Reiman
T17 Decoding of Finger Flexion from Electrocorticographic Signals Using Switching Non-Parametric Dynamic Systems, Z. Wang, G. Schalk, Q. Ji
T18 Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability, D. Reichert, P. Series, A. Storkey
T19 Sequence Learning with Hidden Units in Spiking Neural Networks, J. Brea, W. Senn, J. Pfister
T20 Active Learning of Neural Response Functions with Gaussian Processes, M. Park, G. Horwitz, J. Pillow
T21 Recovering Intrinsic Images with a Global Sparsity Prior on Reflectance, P. Gehler, C. Rother, M. Kiefel, L. Zhang, B. Schölkopf
T22 Semi-Supervised Regression via Parallel Field Regularization, B. Lin, C. Zhang, X. He
T23 Video Annotation and Tracking with Active Learning, C. Vondrick, D. Ramanan
T24 Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities, A. Yao, J. Gall, L. Gool, R. Urtasun
T25 Image Parsing with Stochastic Scene Grammar, Y. Zhao, S. Zhu
T26 Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning, X. Wang, X. Bai, X. Yang, W. Liu, L. Latecki
T27 Generalized Lasso based Approximation of Sparse Coding for Visual Recognition, N. Morioka, S. Satoh
T28 Semantic Labeling of 3D Point Clouds for Indoor Scenes, H. Koppula, A. Anand, T. Joachims, A. Saxena
T29 An Unsupervised Decontamination Procedure for Improving the Reliability of Human Judgments, M. Mozer, B. Link, H. Pashler
T30 Learning to Search Efficiently in High Dimensions, Z. Li, H. Ning, L. Cao, T. Zhang, Y. Gong, T. Huang
T31 Inferring Interaction Networks using the IBP Applied to microRNA Target Prediction, H. Le, Z. Bar-Joseph
T32 Learning Patient-Specific Cancer Survival Distributions as a Sequence of Dependent Regressors, C. Yu, R. Greiner, H. Lin, V. Baracos
T33 History Distribution Matching Method for Predicting Effectiveness of HIV Combination Therapies, J. Bogojeska
T34 An Empirical Evaluation of Thompson Sampling, O. Chapelle, L. Li
T35 Hashing Algorithms for Large-Scale Learning, P. Li, A. Shrivastava, J. Moore, A. König
T36 Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries, Z. Xiang, H. Xu, P. Ramadge
T37 Relative Density-Ratio Estimation for Robust Distribution Comparison, M. Yamada, T. Suzuki, T. Kanamori, H. Hachiya, M. Sugiyama
T38 Sparse Bayesian Multi-Task Learning, C. Archambeau, S. Guo, O. Zoeter
T39 High-Dimensional Regression with Noisy and Missing Data: Provable Guarantees with Nonconvexity, P. Loh, M. Wainwright
T40 Learning Anchor Planes for Classification, Z. Zhang, L. Ladicky, P. Torr, A. Saffari
T41 ShareBoost: Efficient Multiclass Learning with Feature Sharing, S. Shalev-Shwartz, Y. Wexler, A. Shashua
T42 A Two-Stage Weighting Framework for Multi-Source Domain Adaptation, Q. Sun, R. Chattopadhyay, S. Panchanathan, J. Ye
T43 The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning, M. Kloft, G. Blanchard
T44 Maximum Margin Multi-Label Structured Prediction, C. Lampert
T45 Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss, D. Mcallester, J. Keshet
T46 Sparse Recovery with Brownian Sensing, A. Carpentier, O. Maillard, R. Munos
T47 Sparse Features for PCA-Like Linear Regression, C. Boutsidis, P. Drineas, M. Magdon-Ismail
T48 Shaping Level Sets with Submodular Functions, F. Bach
T49 Greedy Algorithms for Structurally Constrained High Dimensional Problems, A. Tewari, P. Ravikumar, I. Dhillon
T50 Trace Lasso: A Trace Norm Regularization for Correlated Designs, E. Grave, G. Obozinski, F. Bach
T51 Robust Lasso with Missing and Grossly Corrupted Observations, N. Nguyen, N. Nasrabadi, T. Tran
T52 Sparse Estimation with Structured Dictionaries, D. Wipf
T53 Learning a Distance Metric from a Network, B. Shaw, B. Huang, T. Jebara
T54 A Denoising View of Matrix Completion, W. Wang, M. Carreira-Perpinan, Z. Lu
T55 Crowdclustering, R. Gomes, P. Welinder, A. Krause, P. Perona
T56 Demixed Principal Component Analysis, W. Brendel, R. Romo, C. Machens
T57 Nonnegative Dictionary Learning in the Exponential Noise Model for Adaptive Music Signal Representation, O. Dikmen, C. Févotte
T58 Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification, I. Takeuchi, M. Sugiyama
T59 Penalty Decomposition Methods for Rank Minimization, Y. Zhang, Z. Lu
T60 Statistical Tests for Optimization Efficiency, L. Boyles, A. Korattikara, D. Ramanan, M. Welling
T61 Prismatic Algorithm for Discrete D.C. Programming Problem, Y. Kawahara, T. Washio
T62 Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation, C. Hsieh, M. Sustik, I. Dhillon, P. Ravikumar
T63 A Convergence Analysis of Log-Linear Training, S. Wiesler, H. Ney
T64 Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning, F. Bach, E. Moulines
T65 Better Mini-Batch Algorithms via Accelerated Gradient Methods, A. Cotter, O. Shamir, N. Srebro, K. Sridharan
T66 PAC-Bayesian Analysis of Contextual Bandits, Y. Seldin, P. Auer, F. Laviolette, J. Shawe-Taylor, R. Ortner
T67 Spectral Methods for Learning Multivariate Latent Tree Structure, A. Anandkumar, K. Chaudhuri, D. Hsu, S. Kakade, L. Song, T. Zhang
T68 Algorithms and Hardness Results for Parallel Large Margin Learning, R. Servedio, P. Long
T69 Composite Multiclass Losses, E. Vernet, R. Williamson, M. Reid
T70 Autonomous Learning of Action Models for Planning, N. Mehta, P. Tadepalli, A. Fern
T71 Universal Low-Rank Matrix Recovery from Pauli Measurements, Y. Liu
T72 A More Powerful Two-Sample Test in High Dimensions Using Random Projection, M. Lopes, L. Jacob, M. Wainwright
T73 Prediction Strategies Without Loss, M. Kapralov, R. Panigrahy
T74 Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint, B. Sriperumbudur, K. Fukumizu, G. Lanckriet
T75 On Strategy Stitching in Large Extensive Form Multiplayer Games, R. Gibson, D. Szafron
T76 Multi-Bandit Best Arm Identification, V. Gabillon, M. Ghavamzadeh, A. Lazaric, S. Bubeck
T77 Linear Submodular Bandits and Their Application to Diversified Retrieval, Y. Yue, C. Guestrin
T78 Adaptive Hedge, T. van Erven, P. Grunwald, W. Koolen, S. Rooij
T79 On the Universality of Online Mirror Descent, N. Srebro, K. Sridharan, A. Tewari
T80 Online Submodular Set Cover, Ranking, and Repeated Active Learning, A. Guillory, J. Bilmes
T81 Finite Time Analysis of Stratified Sampling for Monte Carlo, A. Carpentier, R. Munos
T82 See the Tree Through the Lines: The Shazoo Algorithm, F. Vitale, N. Cesa-Bianchi, C. Gentile, G. Zappella
T83 Generalizing from Several Related Classification Tasks to a New Unlabeled Sample, G. Blanchard, G. Lee, C. Scott
T84 On U-processes and Clustering Performance, S. Clémençon
T85 Information Rates and Optimal Decoding in Large Neural Populations, K. Rahnama Rad, L. Paninski
T86 EigenNet: A Bayesian Hybrid of Generative and Conditional Models for Sparse Learning, Y. Qi, F. Yan
T87 Learning Unbelievable Marginal Probabilities, X. Pitkow, Y. Ahmadian, K. Miller
T88 A Blind Sparse Deconvolution Method for Neural Spike Identification, C. Ekanadham, D. Tranchina, E. Simoncelli
T89 Accelerated Adaptive Markov Chain for Partition Function Computation, S. Ermon, C. Gomes, A. Sabharwal, B. Selman
T90 Message-Passing for Approximate MAP Inference with Latent Variables, J. Jiang, P. Rai, H. Daume III
T91 Priors over Recurrent Continuous Time Processes, A. Saeedi, A. Bouchard-Côté
T92 Modelling Genetic Variations using Fragmentation-Coagulation Processes, Y. Teh, C. Blundell, L. Elliott
T93 Variational Gaussian Process Dynamical Systems, A. Damianou, M. Titsias, N. Lawrence
T94 The Doubly Correlated Nonparametric Topic Model, D. Kim, E. Sudderth
T95 The Kernel Beta Process, L. Ren, Y. Wang, D. Dunson, L. Carin
T96 An Exact Algorithm for F-Measure Maximization, K. Dembczynski, W. Waegeman, W. Cheng, E. Hullermeier
T97 Contextual Gaussian Process Bandit Optimization, A. Krause, C. Ong
T98 Automated Refinement of Bayes Networks Parameters based on Test Ordering Constraints, O. Khan, P. Poupart, J. Agosta
T99 Solving Decision Problems with Limited Information, D. Maua, C. de Campos
T100 Learning Higher-Order Graph Structure with Features by Structure Penalty, S. Ding, G. Wahba, X. Zhu
T101 On the Completeness of First-Order Knowledge Compilation for Lifted Probabilistic Inference, G. Van den Broeck
T102 Inference in Continuous Time Change-Point Models, F. Stimberg, M. Opper, G. Sanguinetti, A. Ruttor
T103 Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs, A. Allahverdyan, A. Galstyan
DEMONSTRATIONS
5:45 pm - 11:59 pm
Reproducing Biologically Realistic Firing Patterns on a Highly-Accelerated Neuromorphic Hardware System, M. Schwartz
A Smartphone 3D Functional Brain Scanner, C. Stahlhut, A. Stopczynski, J. Larsen, M. Petersen, L. Hansen
SENNA Natural Language Processing Demo, R. Collobert
Haptic Belt with Pedestrian Detection, J. Feng, M. Rasi, A. Ng, Q. Le, M. Quigley, J. Chen, T. Low, W. Zou
[Floor plan: Location of Presentations, Floor One. Poster boards T1-T103 are arranged near the Front Entrance; demonstration stations 1A-4A are in the Andalucia 2 and Andalucia 3 rooms; the Internet Area and Cafeteria are adjacent.]
TUESDAY ABSTRACTS
T1 A Non-Parametric Approach to Dynamic Programming
O. Kroemer oliverkro@googlemail.com
J. Peters mail@jan-peters.net
Technische Universitaet Darmstadt

In this paper, we consider the problem of policy evaluation for continuous-state systems. We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. The true form of the value function for this model can be determined, and can be computed using Galerkin's method. Furthermore, we also present a unified view of several well-known policy evaluation methods. In particular, we show that the same Galerkin method can be used to derive Least-Squares Temporal Difference learning, Kernelized Temporal Difference learning, and a discrete-state Dynamic Programming solution, as well as our proposed method. In a numerical evaluation of these algorithms, the proposed approach performed better than the other methods. Subject Area: Control and Reinforcement Learning

T3 Transfer from Multiple MDPs
A. Lazaric, M. Restelli

...source and target tasks. Finally, we report illustrative experimental results in a continuous chain problem. Subject Area: Control and Reinforcement Learning
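The T1 abstract notes that the same Galerkin construction recovers Least-Squares Temporal Difference learning. The following is a minimal LSTD(0) sketch on a toy chain with one-hot features; it is an illustrative baseline under our own assumptions (toy environment, made-up sizes), not the kernel-density method proposed in the paper.

import numpy as np

def lstd(features, rewards, next_features, gamma=0.9, reg=1e-6):
    """LSTD(0): solve A w = b with A = sum phi (phi - gamma*phi')^T, b = sum phi*r."""
    d = features.shape[1]
    A = np.zeros((d, d))
    b = np.zeros(d)
    for phi, r, phi_next in zip(features, rewards, next_features):
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
    return np.linalg.solve(A + reg * np.eye(d), b)

if __name__ == "__main__":
    # Toy 5-state random walk with tabular features and a reward for reaching the last state.
    rng = np.random.default_rng(0)
    n_states, n_samples = 5, 2000
    phi = np.eye(n_states)
    states = rng.integers(0, n_states, size=n_samples)
    next_states = np.clip(states + rng.choice([-1, 1], size=n_samples), 0, n_states - 1)
    rewards = (next_states == n_states - 1).astype(float)
    w = lstd(phi[states], rewards, phi[next_states])
    print("estimated state values:", np.round(w, 3))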
T2 Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
J. Pajarinen J. Peltonen Aalto University Joni.Pajarinen@tkk.fi jaakko.peltonen@tkk.fi
Applications such as robot control and wireless communication require planning under uncertainty. Partially observable Markov decision processes (POMDPs) plan policies for single agents under uncertainty and their decentralized versions (DEC-POMDPs) find a policy for multiple agents. The policy in infinite-horizon POMDP and DEC-POMDP problems has been represented as finite state controllers (FSCs). We introduce a novel class of periodic FSCs, composed of layers connected only to the previous and next layer. Our periodic FSC method finds a deterministic finite-horizon policy and converts it to an initial periodic infinite-horizon policy. This policy is optimized by a new infinite-horizon algorithm to yield deterministic periodic policies, and by a new expectation maximization algorithm to yield stochastic periodic policies. Our method yields better results than earlier planning methods and can compute larger solutions than with regular FSCs. Subject Area: Control and Reinforcement Learning

T4 Variance Reduction in Monte-Carlo Tree Search
J. Veness, M. Lanctot, M. Bowling

Monte-Carlo Tree Search (MCTS) has proven to be a powerful, generic planning technique for decision-making in single-agent and adversarial environments. The stochastic nature of the Monte-Carlo simulations introduces errors in the value estimates, both in terms of bias and variance. Whilst reducing bias (typically through the addition of domain knowledge) has been studied in the MCTS literature, comparatively little effort has focused on reducing variance. This is somewhat surprising, since variance reduction techniques are a well-studied area in classical statistics. In this paper, we examine the application of some standard techniques for variance reduction in MCTS, including common random numbers, antithetic variates and control variates. We demonstrate how these techniques can be applied to MCTS and explore their efficacy on three different stochastic, single-agent settings: Pig, Can't Stop and Dominion. Subject Area: Control and Reinforcement Learning
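Control variates, one of the classical variance-reduction techniques that T4 carries over to MCTS, can be illustrated on a plain Monte Carlo estimate. The sketch below is a generic textbook example (estimating E[exp(X)] for X uniform on [0,1]), not the MCTS-specific construction studied in the paper; all quantities are illustrative.

import numpy as np

def control_variate_estimate(samples_f, samples_h, h_mean):
    """Estimate E[f] using h as a control variate: f_cv = f - c*(h - E[h]),
    with c = Cov(f, h) / Var(h) chosen to minimise the variance."""
    cov = np.cov(samples_f, samples_h)
    c = cov[0, 1] / cov[1, 1]
    return np.mean(samples_f - c * (samples_h - h_mean)), c

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.uniform(0, 1, 100_000)
    f = np.exp(x)      # target: E[exp(X)] = e - 1
    h = x              # control variate with known mean 0.5
    plain = f.mean()
    cv, c = control_variate_estimate(f, h, 0.5)
    print(f"plain MC: {plain:.5f}  with control variate: {cv:.5f}  (true {np.e - 1:.5f})")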
T5 Action-Gap Phenomenon in Reinforcement Learning
A. Farahmand, McGill University
Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an agent following the greedy policy π̂ with respect to an action-value function Q̂, the performance loss E[V*(X) - V^π̂(X)] is upper bounded by O(||Q̂ - Q*||^{1+ζ}), in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. For ζ > 0, our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms. Subject Area: Control and Reinforcement Learning
T6 The Fixed Points of Off-Policy TD
J. Kolter, MIT
function approximation. In general the answer is no: for arbitrary off-policy sampling the error of the TD solution can be unboundedly large, even when the approximator can represent the true value function well. In this paper we propose a novel approach to address this problem: we show that by considering a certain convex subset of off-policy distributions we can indeed provide guarantees as to the solution quality similar to the on-policy case. Furthermore, we show that we can efficiently project on to this convex set using only samples generated from the system. The end result is a novel TD algorithm that has approximation guarantees even in the case of off-policy sampling and which empirically outperforms existing TD methods. Subject Area: Control and Reinforcement Learning
T11 Environmental Statistics and the Trade-Off Between Model-Based and TD Learning in Humans
D. Simon dylex@nyu.edu
N. Daw nathaniel.daw@nyu.edu
New York University
There is much evidence that humans and other animals utilize a combination of model-based and model-free RL methods. Although it has been proposed that these systems may dominate according to their relative statistical efficiency in different circumstances, there is little specific evidence -- especially in humans -- as to the details of this trade-off. Accordingly, we examine the relative performance of different RL approaches under situations in which the statistics of reward are differentially noisy and volatile. Using theory and simulation, we show that model-free TD learning is relatively most disadvantaged in cases of high volatility and low noise. We present data from a decision-making experiment manipulating these parameters, showing that humans shift learning strategies in accord with these predictions. The statistical circumstances favoring model-based RL are also those that promote a high learning rate, which helps explain why, in psychology, the distinction between these strategies is traditionally conceived in terms of rule-based vs. incremental learning. Subject Area: Cognitive Science
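As a rough illustration of the noise/volatility trade-off discussed in T11, the sketch below tracks a drifting reward with a constant-step-size TD/delta rule and compares two learning rates. The environment and parameter values are invented for illustration and are not the experimental design used in the paper.

import numpy as np

def td_tracker(rewards, alpha):
    """Model-free TD/delta-rule estimate: V <- V + alpha * (r - V)."""
    v, estimates = 0.0, []
    for r in rewards:
        v += alpha * (r - v)
        estimates.append(v)
    return np.array(estimates)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    T = 400
    true_value = np.where(np.arange(T) < T // 2, 1.0, -1.0)   # volatile: the mean jumps halfway
    rewards = true_value + rng.normal(0, 0.5, size=T)         # noisy observations of that mean
    for alpha in (0.05, 0.3):
        est = td_tracker(rewards, alpha)
        mse = np.mean((est - true_value) ** 2)
        print(f"alpha={alpha:.2f}  tracking MSE={mse:.3f}")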
T13 Evaluating Computational Models of Preference Learning
A. Jern, C. Lucas, C. Kemp, Carnegie Mellon University

Psychologists have recently begun to develop computational accounts of how people infer others' preferences from their behavior. The inverse decision-making approach proposes that people infer preferences by inverting a generative model of decision-making. Existing data sets, however, do not provide sufficient resolution to thoroughly evaluate this approach. We introduce a new preference learning task that provides a benchmark for evaluating computational accounts and use it to compare the inverse decision-making approach to a feature-based approach, which relies on a discriminative combination of decision features. Our data support the inverse decision-making approach to preference learning. Subject Area: Cognitive Science

T15 Predicting Response Time and Error Rates in Visual Search
B. Chen, V. Navalpakkam, P. Perona

A model of human visual search is proposed. It predicts both response time (RT) and error rates (ER) as a function of image parameters such as target contrast and clutter. The model is an ideal observer, in that it optimizes the Bayes ratio of target present vs target absent. The ratio is computed on the firing pattern of V1/V2 neurons, modeled by Poisson distributions. The optimal mechanism for integrating information over time is shown to be a 'soft max' of diffusions, computed over the visual field by 'hypercolumns' of neurons that share the same receptive field and have different response properties to image features. An approximation of the optimal Bayesian observer, based on integrating local decisions, rather than diffusions, is also derived; it is shown experimentally to produce very similar predictions. A psychophysics experiment is proposed that may discriminate between which mechanism is used in the human brain. Subject Area: Neuroscience
T16 Identifying Alzheimer's Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis
S. Huang shuang31@asu.edu
J. Li jing.li.8@asu.edu
J. Ye jieping.ye@asu.edu
T. Wu teresa.wu@asu.edu
Arizona State University
K. Chen kewei.chen@bannerhealth.com
A. Fleisher adam.fleisher@bannerhealth.com
E. Reiman eric.reiman@bannerhealth.com
Banner Alzheimer's Institute

Diagnosis of Alzheimer's disease (AD) at the early stage of the disease development is of great clinical importance. Current clinical assessment, which relies primarily on cognitive measures, has low sensitivity and specificity. The fast growing neuroimaging techniques hold great promise. Research so far has focused on single neuroimaging modalities. However, as different modalities provide complementary measures for the same disease pathology, fusion of multi-modality data may increase the statistical power in identification of disease-related brain regions. This is especially true for early AD, at which stage the disease-related regions are most likely to be weak-effect regions that are difficult to detect from a single modality alone. We propose a sparse composite linear discriminant analysis model (SCLDA) for identification of disease-related brain regions of early AD from multi-modality data. SCLDA uses a novel formulation that decomposes each LDA parameter into a product of a common parameter shared by all the modalities and a parameter specific to each modality, which enables joint analysis of all the modalities and borrowing strength from one another. We prove that this formulation is equivalent to a penalized likelihood with non-convex regularization, which can be solved by DC (difference of convex functions) programming. We show that in using the DC programming, the property of the non-convex regularization in terms of preserving weak-effect features can be nicely revealed. We perform extensive simulations to show that SCLDA outperforms existing competing algorithms on feature selection, especially on the ability for identifying weak-effect features. We apply SCLDA to the Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET) images of 49 AD patients and 67 normal controls (NC). Our study identifies disease-related brain regions consistent with findings in the AD literature. Subject Area: Neuroscience\Brain Imaging
T17 Decoding of Finger Flexion from Electrocorticographic Signals Using Switching Non-Parametric Dynamic Systems
Z. Wang wangz6@rpi.edu
Q. Ji qji@ecse.rpi.edu
Rensselaer Polytechnic Institute
G. Schalk schalk@wadsworth.org
Wadsworth Center

Brain-computer interfaces (BCIs) use brain signals to convey a user's intent. Some BCI approaches begin by decoding kinematic parameters of movements from brain signals, and then proceed to using these signals, in the absence of movements, to allow a user to control an output. Recent results have shown that electrocorticographic (ECoG) recordings from the surface of the brain in humans can give information about kinematic parameters (e.g., hand velocity or finger flexion). The decoding approaches in these demonstrations usually employed classical classification/regression algorithms that derive a linear mapping between brain signals and outputs. However, they typically incorporate only little prior information about the target kinematic parameter. In this paper, we show that different types of anatomical constraints that govern finger flexion can be exploited in this context. Specifically, we incorporate these constraints in the construction, structure, and the probabilistic functions of a switched non-parametric dynamic system (SNDS) model. We then apply the resulting SNDS decoder to infer the flexion of individual fingers from the same ECoG dataset used in a recent study. Our results show that the application of the proposed model, which incorporates anatomical constraints, improves decoding performance compared to the results in the previous work. Thus, the results presented in this paper may ultimately lead to neurally controlled hand prostheses with full fine-grained finger articulation. Subject Area: Neuroscience
T19 Sequence Learning with Hidden Units in Spiking Neural Networks
J. Brea brea@pyl.unibe.ch
Universität Bern
W. Senn senn@pyl.unibe.ch
J. Pfister jean-pascal.pfister@eng.cam.ac.uk
Cambridge University

We consider a statistical framework in which recurrent networks of spiking neurons learn to generate spatiotemporal spike patterns. Given biologically realistic stochastic neuronal dynamics we derive a tractable learning rule for the synaptic weights towards hidden and visible neurons that leads to optimal recall of the training sequences. We show that learning synaptic weights towards hidden neurons significantly improves the storing capacity of the network. Furthermore, we derive an approximate online learning rule and show that our learning rule is consistent with Spike-Timing Dependent Plasticity in that if a presynaptic spike shortly precedes a postsynaptic spike, potentiation is induced and otherwise depression is elicited. Subject Area: Neuroscience
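The T19 abstract relates the derived rule to Spike-Timing Dependent Plasticity. The sketch below implements the standard pair-based STDP window (potentiation when the presynaptic spike precedes the postsynaptic one, depression otherwise). It is a generic textbook form with made-up amplitudes and time constant, not the learning rule derived in the paper.

import numpy as np

def stdp_update(delta_t, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Pair-based STDP window. delta_t = t_post - t_pre in milliseconds.
    Pre before post (delta_t > 0) gives potentiation; the reverse gives depression."""
    if delta_t > 0:
        return a_plus * np.exp(-delta_t / tau)
    return -a_minus * np.exp(delta_t / tau)

if __name__ == "__main__":
    for dt in (-40, -10, -1, 1, 10, 40):
        print(f"t_post - t_pre = {dt:+4d} ms  ->  dw = {stdp_update(dt):+.5f}")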
T22 Semi-Supervised Regression via Parallel Field Regularization
B. Lin, C. Zhang, X. He

This paper studies the problem of semi-supervised learning from the vector field perspective. Much of the existing work uses the graph Laplacian to ensure the smoothness of the prediction function on the data manifold. However, beyond smoothness, it is suggested by recent theoretical work that we should ensure second order smoothness for achieving faster rates of convergence for semi-supervised regression problems. To achieve this goal, we show that the second order smoothness measures the linearity of the function, and the gradient field of a linear function has to be a parallel vector field. Consequently, we propose to find a function which minimizes the empirical error, and simultaneously requires its gradient field to be as parallel as possible. We give a continuous objective function on the manifold and discuss how to discretize it by using random points. The discretized optimization problem turns out to be a sparse linear system which can be solved very efficiently. The experimental results have demonstrated the effectiveness of our proposed approach. Subject Area: Vision
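For context, the graph-Laplacian baseline that T22 argues should be taken further can be written down in a few lines: Laplacian-regularized least squares on a kNN graph. The sketch below implements only that standard baseline under our own illustrative parameter choices; it is not the parallel-field regularizer proposed in the paper.

import numpy as np

def laplacian_rls(X, y_labeled, labeled_idx, n_neighbors=5, lam=1e-2, gamma=1.0):
    """Laplacian-regularized least squares on a kNN graph:
    minimize sum over labeled i of (f_i - y_i)^2 + gamma * f^T L f + lam * ||f||^2."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:n_neighbors + 1]        # skip the point itself
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)                                 # symmetrize the graph
    L = np.diag(W.sum(1)) - W                              # unnormalized graph Laplacian
    J = np.zeros((n, n))
    J[labeled_idx, labeled_idx] = 1.0                      # indicator of labeled points
    y = np.zeros(n)
    y[labeled_idx] = y_labeled
    return np.linalg.solve(J + gamma * L + lam * np.eye(n), y)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = rng.uniform(-1, 1, size=(200, 2))
    y_true = np.sin(3 * X[:, 0]) + X[:, 1]
    labeled = rng.choice(200, size=20, replace=False)
    f = laplacian_rls(X, y_true[labeled], labeled)
    print("RMSE over all points:", np.sqrt(np.mean((f - y_true) ** 2)))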
T23 Video Annotation and Tracking with Active Learning
C. Vondrick cvondric@ics.uci.edu
D. Ramanan dramanan@ics.uci.edu
We introduce a novel active learning framework for video annotation. By judiciously choosing which frames a user should annotate, we can obtain highly accurate tracks with minimal user effort. We cast this problem as one of active learning, and show that we can obtain excellent performance by querying frames that, if annotated, would produce a large expected change in the estimated object track. We implement a constrained tracker and compute the expected change for putative annotations with efficient dynamic programming algorithms. We demonstrate our framework on four datasets, including two benchmark datasets constructed with key frame annotations obtained by Amazon Mechanical Turk. Our results indicate that we could obtain equivalent labels for a small fraction of the original cost. Subject Area: Vision\Motion and Tracking
T24 Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities
A. Yao J. Gall ETH Zurich L. Gool R. Urtasun TTI-Chicago yaoa@vision.ee.ethz.ch gall@vision.ee.ethz.ch vangool@vision.ee.ethz.ch rurtasun@ttic.edu
A common approach for handling the complexity and inherent ambiguities of 3D human pose estimation is to use pose priors learned from training data. Existing approaches however, are either too simplistic (linear), too complex to learn, or can only learn latent spaces from simple data, i.e., single activities such as walking or running. In this paper, we present an efficient stochastic gradient descent algorithm that is able to learn probabilistic non-linear latent spaces composed of multiple activities. Furthermore, we derive an incremental algorithm for the online setting which can update the latent space without extensive relearning. We demonstrate the effectiveness of our approach on the task of monocular and multi-view tracking and show that our approach outperforms the state-of-the-art. Subject Area: Vision\Motion and Tracking

T25 Image Parsing with Stochastic Scene Grammar
Y. Zhao, S. Zhu

...foreground/background, 2D faces, to 1D lines. The grammar includes three types of production rules and two types of contextual relations. Production rules: (i) AND rules represent the decomposition of an entity into subparts; (ii) OR rules represent the switching among subtypes of an entity; (iii) SET rules represent an ensemble of visual entities. Contextual relations: (i) Cooperative + relations represent positive links between binding entities, such as hinged faces of an object or aligned boxes; (ii) Competitive - relations represent negative links between competing entities, such as mutually exclusive boxes. We design an efficient MCMC inference algorithm, namely Hierarchical cluster sampling, to search in the large solution space of scene configurations. The algorithm has two stages: (i) Clustering: It forms all possible higher-level structures (clusters) from lower-level entities by production rules and contextual relations. (ii) Sampling: It jumps between alternative structures (clusters) in each layer of the hierarchy to find the most probable configuration (represented by a parse tree). In our experiment, we demonstrate the superiority of our algorithm over existing methods on a public dataset. In addition, our approach achieves richer structures in the parse tree. Subject Area: Vision\Natural Scene Statistics

T26 Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
X. Wang wxghust@gmail.com
X. Bai xiang.bai@gmail.com
X. Yang xingwei@temple.edu
W. Liu liuwy@hust.edu.cn
Huazhong University of Science and Technology
L. Latecki latecki@temple.edu
Temple University

We propose a novel inference framework for finding maximal cliques in a weighted graph that satisfy hard constraints. The constraints specify the graph nodes that must belong to the solution as well as mutual exclusions of graph nodes, i.e., sets of nodes that cannot belong to the same solution. The proposed inference is based on a novel particle filter algorithm with state permeations. We apply the inference framework to a challenging problem of learning part-based, deformable object models. Two core problems in the learning framework, matching of image patches and finding salient parts, are formulated as two instances of the problem of finding maximal cliques with hard constraints. Our learning framework yields discriminative part based object models that achieve very good detection rate, and outperform other methods on object classes with large deformation. Subject Area: Vision\Object Recognition
T27 Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
N. Morioka nmorioka@gmail.com
University of New South Wales
S. Satoh satoh@nii.ac.jp
NII

Sparse coding, a method of explaining sensory data with as few dictionary bases as possible, has attracted much attention in computer vision. For visual object category recognition, l1-regularized sparse coding is combined with spatial pyramid representation to obtain state-of-the-art performance. However, because of its iterative optimization, applying sparse coding onto every local feature descriptor extracted from an image database can become a major bottleneck. To overcome this computational challenge, this paper presents Generalized Lasso based Approximation of Sparse coding (GLAS). By representing the distribution of sparse coefficients with slice transform, we fit a piece-wise linear mapping function with generalized lasso. We also propose an efficient post-refinement procedure to perform mutual inhibition between bases which is essential for an overcomplete setting. The experiments show that GLAS obtains comparable performance to l1-regularized sparse coding, yet achieves significant speed up demonstrating its effectiveness for large-scale visual recognition problems. Subject Area: Vision\Object Recognition
T29 An Unsupervised Decontamination Procedure for Improving the Reliability of Human Judgments
M. Mozer mozer@colorado.edu
B. Link link@colorado.edu
University of Colorado
H. Pashler hpashler@ucsd.edu
University of California, San Diego

Psychologists have long been struck by individuals' limitations in expressing their internal sensations, impressions, and evaluations via rating scales. Instead of using an absolute scale, individuals rely on reference points from recent experience. This relativity of judgment limits the informativeness of responses on surveys, questionnaires, and evaluation forms. Fortunately, the cognitive processes that map stimuli to responses are not simply noisy, but rather are influenced by recent experience in a lawful manner. We explore techniques to remove sequential dependencies, and thereby decontaminate a series of ratings to obtain more meaningful human judgments. In our formulation, the problem is to infer latent (subjective) impressions from a sequence of stimulus labels (e.g., movie names) and responses. We describe an unsupervised approach that simultaneously recovers the impressions and parameters of a contamination model that predicts how recent judgments affect the current response. We test our iterated impression inference, or I3, algorithm in three domains: rating the gap between dots, the desirability of a movie based on an advertisement, and the morality of an action. We demonstrate significant objective improvements in the quality of the recovered impressions. Subject Area: Applications
T28 Semantic Labeling of 3D Point Clouds for Indoor Scenes
H. Koppula, A. Anand, T. Joachims, A. Saxena

Inexpensive RGB-D cameras that give an RGB image together with depth data have become widely available. In this paper, we use this data to build 3D point clouds of full indoor scenes such as an office and address the task of semantic labeling of these 3D point clouds. We propose a graphical model that captures various features and contextual relations, including the local visual appearance and shape cues, object co-occurrence relationships and geometric relationships. With a large number of object classes and relations, the model's parsimony becomes important and we address that by using multiple types of edge potentials. The model admits efficient approximate inference, and we train it using a maximum-margin learning approach. In our experiments over a total of 52 3D scenes of homes and offices (composed from about 550 views, having 2495 segments labeled with 27 object classes), we get a performance of 84.06% in labeling 17 object classes for offices, and 73.38% in labeling 17 object classes for home scenes. Finally, we applied these algorithms successfully on a mobile robot for the task of finding objects in large cluttered rooms. Subject Area: Vision\Visual Perception
T30 Learning to Search Efficiently in High Dimensions
Z. Li, H. Ning, L. Cao, T. Zhang, Y. Gong, T. Huang

...directly optimize a data structure that supports efficient large scale search. Our approach takes both search quality and computational cost into consideration. Specifically, we learn a boosted search forest that is optimized using pairwise similarity labeled examples. The output of this search forest can be efficiently converted into an inverted indexing data structure, which can leverage modern text search infrastructure to achieve both scalability and efficiency. Experimental results show that our approach significantly outperforms the state-of-the-art learning to hash methods (such as spectral hashing), as well as state-of-the-art high dimensional search algorithms (such as LSH and k-means trees). Subject Area: Applications
T33 History Distribution Matching Method for Predicting Effectiveness of HIV Combination Therapies
J. Bogojeska jasmina@mpi-inf.mpg.de Max-Planck Institute for Informatics This paper presents an approach that predicts the effectiveness of HIV combination therapies by simultaneously addressing several problems affecting the available HIV clinical data sets: the different treatment backgrounds of the samples, the uneven representation of the levels of therapy experience, the missing treatment history information, the uneven therapy representation and the unbalanced therapy outcome representation. The computational validation on clinical data shows that, compared to the most commonly used approach that does not account for the issues mentioned above, our model has significantly higher predictive power. This is especially true for samples stemming from patients with longer treatment history and samples associated with rare therapies. Furthermore, our approach is at least as powerful for the remaining samples. Subject Area: Applications
T31 Inferring Interaction Networks using the IBP Applied to microRNA Target Prediction
H. Le hple@cs.cmu.edu Z. Bar-Joseph zivbj@cs.cmu.edu Carnegie Mellon University Determining interactions between entities and the overall organization and clustering of nodes in networks is a major challenge when analyzing biological and social network data. Here we extend the Indian Buffet Process (IBP), a nonparametric Bayesian model, to integrate noisy interaction scores with properties of individual entities for inferring interaction networks and clustering nodes within these networks. We present an application of this method to study how microRNAs regulate mRNAs in cells. Analysis of synthetic and real data indicates that the method improves upon prior methods, correctly recovers interactions and clusters, and provides accurate biological predictions. Subject Area: Applications
T34 An Empirical Evaluation of Thompson Sampling
O. Chapelle, L. Li

Thompson sampling is one of the oldest heuristics to address the exploration / exploitation trade-off, but it is surprisingly not very popular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against. Subject Area: Applications
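A minimal sketch of Thompson sampling for a Bernoulli bandit with Beta(1,1) priors, the basic setting evaluated empirically in T34. The arm means, horizon, and prior are illustrative choices, not the paper's experimental setup.

import numpy as np

def thompson_bernoulli(true_means, horizon=5000, seed=0):
    """Thompson sampling: draw a mean for each arm from its Beta posterior,
    play the arm with the largest draw, then update that arm's posterior."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    successes = np.ones(k)   # Beta alpha parameters
    failures = np.ones(k)    # Beta beta parameters
    total_reward = 0.0
    for _ in range(horizon):
        theta = rng.beta(successes, failures)
        arm = int(np.argmax(theta))
        reward = rng.random() < true_means[arm]
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward, successes + failures - 2   # total reward, pulls per arm

if __name__ == "__main__":
    reward, pulls = thompson_bernoulli([0.1, 0.5, 0.55])
    print("total reward:", reward)
    print("pulls per arm:", pulls)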
T35 Hashing Algorithms for Large-Scale Learning
P. Li, A. Shrivastava, J. Moore, A. König

Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare b-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data. Subject Area: Dimensionality Reduction
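The minwise-hashing primitive behind T35 can be sketched in a few lines: random hash functions, one signature per set, and a Jaccard estimate from the fraction of matching minima. This shows plain minwise hashing only, not the b-bit variant or its integration with linear learners described in the abstract; the hash family and sizes are illustrative.

import random

def minhash_signature(items, num_hashes=128, seed=0):
    """One minwise hash per (a, b) pair: h(x) = (a*hash(x) + b) mod p, keep the minimum."""
    p = (1 << 61) - 1   # a large prime modulus
    rng = random.Random(seed)
    params = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % p for x in items) for a, b in params]

def estimated_jaccard(sig1, sig2):
    """The fraction of matching minhash values estimates the Jaccard similarity."""
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

if __name__ == "__main__":
    A = set(range(0, 1000))
    B = set(range(500, 1500))
    true_j = len(A & B) / len(A | B)
    est_j = estimated_jaccard(minhash_signature(A), minhash_signature(B))
    print(f"true Jaccard {true_j:.3f}  minhash estimate {est_j:.3f}")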
T36 Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
Z. Xiang H. Xu P. Ramadge Princeton University zxiang@princeton.edu haoxu@princeton.edu ramadge@princeton.edu
Learning sparse representations on data adaptive dictionaries is a state-of-the-art method for modeling data. But when the dictionary is large and the data dimension is high, it is a computationally challenging problem. We explore three aspects of the problem. First, we derive new, greatly improved screening tests that quickly identify codewords that are guaranteed to have zero weights. Second, we study the properties of random projections in the context of learning sparse representations. Finally, we develop a hierarchical framework that uses incremental random projections and screening to learn, in small stages, a hierarchically structured dictionary for sparse representations. Empirical results show that our framework can learn informative hierarchical sparse representations more efficiently. Subject Area: None of the above
T39 High-Dimensional Regression with Noisy and Missing Data: Provable Guarantees with Nonconvexity
P. Loh M. Wainwright UC Berkeley ploh@berkeley.edu wainwrig@eecs.berkeley.edu
Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependencies. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing, and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently non-convex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing non-convex programs, we are able to both analyze the statistical error associated with any global optimum, and prove that a simple projected gradient descent algorithm will converge in polynomial time to a small neighborhood of the set of global minimizers. On the statistical side, we provide non-asymptotic bounds that hold with high probability for the cases of noisy, missing, and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm will converge at geometric rates to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing agreement with the predicted scalings. Subject Area: Supervised Learning
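T39 analyzes a simple projected gradient scheme. As a generic illustration (not the paper's corrected estimator for noisy or missing covariates), the sketch below runs projected gradient descent for least squares over an l1-ball, using the standard sorting-based projection; problem sizes and noise level are invented.

import numpy as np

def project_l1(v, radius):
    """Euclidean projection of v onto the l1-ball of the given radius."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - radius) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (css[rho] - radius) / (rho + 1)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def projected_gradient(X, y, radius, steps=500):
    """Projected gradient descent for min 0.5*||y - X b||^2 s.t. ||b||_1 <= radius."""
    lr = 1.0 / np.linalg.norm(X, 2) ** 2   # step size 1 / Lipschitz constant
    b = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ b - y)
        b = project_l1(b - lr * grad, radius)
    return b

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    n, d, k = 200, 50, 5
    X = rng.normal(size=(n, d))
    b_true = np.zeros(d); b_true[:k] = 1.0
    y = X @ b_true + 0.1 * rng.normal(size=n)
    b_hat = projected_gradient(X, y, radius=np.abs(b_true).sum())
    print("estimation error:", np.linalg.norm(b_hat - b_true))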
T40 Learning Anchor Planes for Classification
Z. Zhang ziming.zhang@brookes.ac.uk
P. Torr philiptorr@brookes.ac.uk
Oxford Brookes
L. Ladicky lubor@robots.ox.ac.uk
University of Oxford
A. Saffari amir@ymer.org
Sony Computer Entertainment Europe

Local Coordinate Coding (LCC) [18] is a method for modeling functions of data lying on non-linear manifolds. It provides a set of anchor points which form a local coordinate system, such that each data point on the manifold can be approximated by a linear combination of its anchor points, and the linear weights become the local coordinate coding. In this paper we propose encoding data using orthogonal anchor planes, rather than anchor points. Our method needs only a few orthogonal anchor planes for coding, and it can linearize any (α, β, p)-Lipschitz smooth nonlinear function with a fixed expected value of the upper-bound approximation error on any high dimensional data. In practice, the orthogonal coordinate system can be easily learned by minimizing this upper bound using singular value decomposition (SVD). We apply our method to model the coordinates locally in linear SVMs for classification tasks, and our experiment on MNIST shows that using only 50 anchor planes our method achieves 1.72% error rate, while LCC achieves 1.90% error rate using 4096 anchor points. Subject Area: Supervised Learning\Classification
T44 Maximum Margin Multi-Label Structured Prediction
C. Lampert chl@ist.ac.at
IST Austria
We study multi-label prediction for structured output spaces, a problem that occurs, for example, in object detection in images, secondary structure prediction in computational biology, and graph matching with symmetries. Conventional multi-label classification techniques are typically not applicable in this situation, because they require explicit enumeration of the label space, which is infeasible in case of structured outputs. Relying on techniques originally designed for single-label structured prediction, in particular structured support vector machines, results in reduced prediction accuracy, or leads to infeasible optimization problems. In this work we derive a maximum-margin training formulation for multi-label structured prediction that remains computationally tractable while achieving high prediction accuracy. It also shares most beneficial properties with single-label maximum-margin approaches, in particular a formulation as a convex optimization problem, efficient working set training, and PAC-Bayesian generalization bounds. Subject Area: Supervised Learning
T45 Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
D. Mcallester J. Keshet TTI-Chicago mcallester@ttic.edu jkeshet@ttic.edu
We consider latent structural versions of probit loss and ramp loss. We show that these surrogate loss functions are consistent in the strong sense that for any feature map (finite or infinite dimensional) they yield predictors approaching the infimum task loss achievable by any linear predictor over the given features. We also give finite sample generalization bounds (convergence rates) for these loss functions. These bounds suggest that probit loss converges more rapidly. However, ramp loss is more easily optimized and may ultimately be more practical. Subject Area: Supervised Learning
T46 Sparse Recovery with Brownian Sensing
A. Carpentier, O. Maillard, R. Munos

...when the features are arbitrarily non-orthogonal. Under the assumption that f is Hölder continuous with exponent at least 1/2, we provide an estimate α̂ of the parameter α such that ||α - α̂||_2 = O(||η||_2 / √N), where η is the observation noise. The method uses a set of sampling points uniformly distributed along a one-dimensional curve selected according to the features. We report numerical experiments illustrating our method. Subject Area: Supervised Learning

T47 Sparse Features for PCA-Like Linear Regression
C. Boutsidis, P. Drineas, M. Magdon-Ismail

Principal Components Analysis (PCA) is often used as a feature extraction procedure. Given a matrix X in R^{n x d}, whose rows represent n data points with respect to d features, the top k right singular vectors of X (the so-called eigenfeatures) are arbitrary linear combinations of all available features. The eigenfeatures are very useful in data analysis, including the regularization of linear regression. Enforcing sparsity on the eigenfeatures, i.e., forcing them to be linear combinations of only a small number of actual features (as opposed to all available features), can promote better generalization error and improve the interpretability of the eigenfeatures. We present deterministic and randomized algorithms that construct such sparse eigenfeatures while provably achieving in-sample performance comparable to regularized linear regression. Our algorithms are relatively simple and practically efficient, and we demonstrate their performance on several data sets. Subject Area: Supervised Learning
T49 Greedy Algorithms for Structurally Constrained High Dimensional Problems
A. Tewari ambujtewari@gmail.com
P. Ravikumar pradeepr@cs.utexas.edu
I. Dhillon inderjit@cs.utexas.edu
University of Texas at Austin

A hallmark of modern machine learning is its ability to deal with high dimensional problems by exploiting structural assumptions that limit the degrees of freedom in the underlying model. A deep understanding of the capabilities and limits of high dimensional learning methods under specific assumptions such as sparsity, group sparsity, and low rank has been attained. Efforts (Negahban et al., 2010; Chandrasekaran et al., 2010) are now underway to distill this valuable experience by proposing general unified frameworks that can achieve the twin goals of summarizing previous analyses and enabling their application to notions of structure hitherto unexplored. Inspired by these developments, we propose and analyze a general computational scheme based on a greedy strategy to solve convex optimization problems that arise when dealing with structurally constrained high-dimensional problems. Our framework not only unifies existing greedy algorithms by recovering them as special cases but also yields novel ones. Finally, we extend our results to infinite dimensional problems by using interesting connections between smoothness of norms and behavior of martingales in Banach spaces. Subject Area: Supervised Learning

T51 Robust Lasso with Missing and Grossly Corrupted Observations
N. Nguyen, N. Nasrabadi, T. Tran

This paper studies the problem of accurately recovering a sparse vector β from highly corrupted linear measurements y = Xβ + e + w, where e is a sparse error vector whose nonzero entries may be unbounded and w is a bounded noise. We propose a so-called extended Lasso optimization which takes into consideration sparse prior information of both β and e. Our first result shows that the extended Lasso can faithfully recover both the regression and the corruption vectors. Our analysis relies on a notion of extended restricted eigenvalue for the design matrix X. Our second set of results applies to a general class of Gaussian design matrices X with i.i.d. rows N(0, Σ), for which we provide a surprising phenomenon: the extended Lasso can recover exact signed supports of both β and e from only Ω(k log p log n) observations, even when the fraction of corruption is arbitrarily close to one. Our analysis also shows that this amount of observations required to achieve exact signed support is optimal. Subject Area: Supervised Learning
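As a concrete instance of the greedy strategy analyzed in T49, here is a minimal Orthogonal Matching Pursuit sketch for plain sparsity. It is a classical special case written under our own assumptions (toy data, made-up sizes), not the unified framework of the paper nor the extended Lasso of T51.

import numpy as np

def omp(X, y, n_nonzero):
    """Orthogonal Matching Pursuit: greedily add the column most correlated
    with the residual, then refit by least squares on the selected support."""
    residual = y.copy()
    support = []
    coef = np.zeros(X.shape[1])
    for _ in range(n_nonzero):
        j = int(np.argmax(np.abs(X.T @ residual)))
        if j not in support:
            support.append(j)
        sol, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ sol
    coef[support] = sol
    return coef

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    n, d, k = 100, 40, 4
    X = rng.normal(size=(n, d))
    beta = np.zeros(d); beta[rng.choice(d, k, replace=False)] = rng.normal(size=k)
    y = X @ beta + 0.01 * rng.normal(size=n)
    beta_hat = omp(X, y, k)
    print("recovered support:", np.nonzero(beta_hat)[0], " true:", np.nonzero(beta)[0])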
Many real-world networks are described by both connectivity information and features for every node. To better model and understand these networks, we present structure preserving metric learning (SPML), an algorithm for learning a Mahalanobis distance metric from a network such that the learned distances are tied to the inherent connectivity structure of the network. Like the graph embedding algorithm structure preserving embedding, SPML learns a metric which is structure preserving, meaning a connectivity algorithm such as k-nearest neighbors will yield the correct connectivity when applied using the distances from the learned metric. We show a variety of synthetic and real-world experiments where SPML predicts link patterns from node features more accurately than standard techniques. We further demonstrate a method for optimizing SPML based on stochastic gradient descent which removes the running-time dependency on the size of the network and allows the method to easily scale to networks of thousands of nodes and millions of edges. Subject Area: Unsupervised & Semi-supervised Learning
t55 Crowdclustering

R. Gomes P. Welinder Caltech A. Krause ETH Zurich P. Perona Caltech gomes@caltech.edu welinder@caltech.edu krausea@ethz.ch andrea@vision.caltech.edu

Is it possible to crowdsource categorization? Amongst the challenges: (a) each annotator has only a partial view of the data, (b) different annotators may have different clustering criteria and may produce different numbers of categories, (c) the underlying category structure may be hierarchical. We propose a Bayesian model of how annotators may approach clustering and show how one may infer clusters/categories, as well as annotator parameters, using this model. Our experiments, carried out on large collections of images, suggest that Bayesian crowdclustering works well and may be superior to single-expert annotations. Subject Area: Unsupervised & Semi-supervised Learning

t57 nonnegative dictionary learning in the exponential noise model for adaptive music signal representation

O. Dikmen C. Févotte Telecom ParisTech dikmen@telecom-paristech.fr fevotte@telecom-paristech.fr

In this paper we describe a maximum likelihood approach for dictionary learning in the multiplicative exponential noise model. This model is prevalent in audio signal processing where it underlies a generative composite model of the power spectrogram. Maximum joint likelihood estimation of the dictionary and expansion coefficients leads to a nonnegative matrix factorization problem where the Itakura-Saito divergence is used. The optimality of this approach is in question because the number of parameters (which include the expansion coefficients) grows with the number of observations. In this paper we describe a variational procedure for optimization of the marginal likelihood, i.e., the likelihood of the dictionary where the activation coefficients have been integrated out (given a specific prior). We compare the output of both maximum joint likelihood estimation (i.e., standard Itakura-Saito NMF) and maximum marginal likelihood estimation (MMLE) on real and synthetic datasets. The MMLE approach is shown to embed automatic model order selection, akin to automatic relevance determination. Subject Area: Unsupervised & Semi-supervised Learning

Learning problems such as logistic regression are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch size has become equal to the size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter --- the probability for an update to be in the wrong direction --- which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1-regularized coordinate descent algorithms: L1-regularized L2-loss SVMs, L1-regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. Subject Area: Optimization
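The batch-size test described in the last abstract can be caricatured with a one-sample t-test on per-example gradients: if the averaged gradient's sign is not statistically reliable, the batch is grown. The sketch below only conveys that idea; the function names, the particular t-test, and the doubling schedule are assumptions of this illustration, not the authors' exact procedure or thresholds.

```python
import numpy as np
from scipy import stats

def reliable_direction(grad_samples, alpha=0.05):
    """Test whether a coordinate's mini-batch gradient sign is statistically reliable.

    grad_samples: per-example gradient values for one coordinate (1-D array).
    Returns True when a one-sample t-test rejects "mean gradient = 0" at level alpha,
    i.e. the chance of the update pointing in the wrong direction is small.
    """
    t_stat, p_value = stats.ttest_1samp(grad_samples, popmean=0.0)
    return p_value < alpha

def choose_batch_size(per_example_grads, batch=100, alpha=0.05):
    """Grow the batch until the averaged gradient direction passes the test,
    or the full dataset is reached -- a schematic version of the idea above."""
    per_example_grads = np.asarray(per_example_grads)
    n = len(per_example_grads)
    while batch < n:
        sample = np.random.choice(per_example_grads, size=batch, replace=False)
        if reliable_direction(sample, alpha):
            return batch
        batch = min(2 * batch, n)
    return n
```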
t58 target neighbor Consistent feature Weighting for Nearest Neighbor Classification
I. Takeuchi takeuchi.ichiro@nitech.ac.jp Nagoya Institute of Technology M. Sugiyama sugi@cs.titech.ac.jp Tokyo Institute of Technology We consider feature selection and weighting for nearest neighbor classifiers. A technical challenge in this scenario is how to cope with the discrete update of nearest neighbors when the feature space metric is changed during the learning process. This issue, called the target neighbor change, was not properly addressed in the existing feature weighting and metric learning literature. In this paper, we propose a novel feature weighting algorithm that can exactly and efficiently keep track of the correct target neighbors via sequential quadratic programming. To the best of our knowledge, this is the first algorithm that guarantees the consistency between target neighbors and the feature space metric. We further show that the proposed algorithm can be naturally combined with regularization path tracking, allowing computationally efficient selection of the regularization parameter. We demonstrate the effectiveness of the proposed algorithm through experiments. Subject Area: Unsupervised & Semi-supervised Learning
In this paper, we propose the first exact algorithm for minimizing the difference of two submodular functions (D.S.), i.e., the discrete version of the D.C. programming problem. The developed algorithm is a branch-and-bound-based algorithm which responds to the structure of this problem through the relationship between submodularity and convexity. The D.S. programming problem covers a broad range of applications in machine learning because it generalizes the optimization of a wide class of set functions. We empirically investigate the performance of our algorithm, and illustrate the difference between exact and approximate solutions respectively obtained by the proposed and existing algorithms in feature selection and discriminative structure learning. Subject Area: Optimization
determinant program. In contrast to other state-of-the-art methods that largely use first-order gradient information, our algorithm is based on Newton's method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and also present experimental results using synthetic and real application data that demonstrate the considerable improvements in performance of our method when compared to other state-of-the-art methods. Subject Area: Optimization
Log-linear models are widely used probability models for statistical pattern recognition. Typically, log-linear models are trained according to a convex criterion. In recent years, the interest in log-linear models has greatly increased. The optimization of log-linear model parameters is costly and therefore an important topic, in particular for large-scale applications. Different optimization algorithms have been evaluated empirically in many papers. In this work, we analyze the optimization problem analytically and show that the training of log-linear models can be highly ill-conditioned. We verify our findings on two handwriting tasks. By making use of our convergence analysis, we obtain good results on a large-scale continuous handwriting recognition task with a simple and generic approach. Subject Area: Optimization
t67 spectral methods for learning multivariate latent tree structure
A. Anandkumar a.anandkumar@uci.edu UC Irvine K. Chaudhuri kamalika@cs.ucsd.edu UC San Diego D. Hsu danielhsu@gmail.com S. Kakade sham@tti-c.org Microsoft Research L. Song lesong@cs.cmu.edu Carnegie Mellon University T. Zhang tzhang@stat.rutgers.edu Rutgers University This work considers the problem of learning the structure of multivariate linear tree models, which include a variety of directed tree graphical models with continuous, discrete, and mixed latent variables such as linear-Gaussian models, hidden Markov models, Gaussian mixture models, and Markov evolutionary trees. The setting is one where we only have samples from certain observed variables in the tree, and our goal is to estimate the tree structure (i.e., the graph of how the underlying hidden variables are connected to each other and to the observed variables). We propose the Spectral Recursive Grouping algorithm, an efficient and simple bottom-up procedure for recovering the tree structure from independent samples of the observed variables. Our finite sample size bounds for exact recovery of the tree structure reveal certain natural dependencies on underlying statistical and structural properties of the joint distribution. Furthermore, our sample complexity guarantees have no explicit dependence on the dimensionality of the observed variables, making the algorithm applicable to many high-dimensional settings. At the heart of our algorithm is a spectral quartet test for determining the relative topology of a quartet of variables from second-order statistics. Subject Area: Learning Theory
t68 algorithms and hardness results for parallel large margin learning
R. Servedio Columbia University P. Long Google rocco@cs.columbia.edu plong@google.com
We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. Subject Area: Learning Theory
This paper introduces two new frameworks for learning action models for planning. In the mistake-bounded planning framework, the learner has access to a planner for the given model representation, a simulator, and a planning problem generator, and aims to learn a model with at most a polynomial number of faulty plans. In the planned exploration framework, the learner does not have access to a problem generator and must instead design its own problems, plan for them, and converge with at most a polynomial number of planning attempts. The paper reduces learning in these frameworks to concept learning with one-sided error and provides algorithms for successful learning in both frameworks. A specific family of hypothesis spaces is shown to be efficiently learnable in both frameworks. Subject Area: Learning Theory
property (RIP). This implies that M can be recovered from a fixed (universal) set of Pauli measurements, using nuclear-norm minimization (e.g., the matrix Lasso), with nearly-optimal bounds on the error. A similar result holds for any class of measurements that use an orthonormal operator basis whose elements have small operator norm. Our proof uses Dudley's inequality for Gaussian processes, together with bounds on covering numbers obtained via entropy duality. Subject Area: Theory
t72 a more Powerful two-sample test in High dimensions using random Projection
M. Lopes L. Jacob M. Wainwright UC Berkeley mlopes@stat.berkeley.edu laurent@stat.berkeley.edu wainwrig@eecs.berkeley.edu
We consider the hypothesis testing problem of detecting a shift between the means of two multivariate normal distributions in the high-dimensional setting, allowing for the data dimension p to exceed the sample size n. Our contribution is a new test statistic for the two-sample test of means that integrates a random projection with the classical Hotelling T² statistic. Working within a high-dimensional framework that allows (p, n) → ∞, we first derive an asymptotic power function for our test, and then provide sufficient conditions for it to achieve greater power than other state-of-the-art tests. Using ROC curves generated from simulated data, we demonstrate superior performance against competing tests in the parameter regimes anticipated by our theoretical results. Lastly, we illustrate an advantage of our procedure with comparisons on a high-dimensional gene expression dataset involving the discrimination of different types of cancer. Subject Area: Theory
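The statistic has a simple structure: project both samples with one shared random matrix, then apply the classical two-sample Hotelling T² in the projected space. The sketch below mimics that recipe; the projection dimension k = 5 and the Gaussian projection are illustrative choices, not the paper's prescriptions.

```python
import numpy as np
from scipy import stats

def projected_hotelling_test(X, Y, k=5, seed=0):
    """Two-sample mean test: shared random projection to k dims + Hotelling T^2.

    X: (n1, p) sample, Y: (n2, p) sample, with p possibly much larger than n1 + n2.
    Returns the T^2 statistic and an F-distribution p-value.
    """
    n1, p = X.shape
    n2 = Y.shape[0]
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((p, k)) / np.sqrt(p)      # shared random projection
    Xk, Yk = X @ P, Y @ P
    diff = Xk.mean(axis=0) - Yk.mean(axis=0)
    Sp = ((n1 - 1) * np.cov(Xk, rowvar=False) +
          (n2 - 1) * np.cov(Yk, rowvar=False)) / (n1 + n2 - 2)   # pooled covariance
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(Sp, diff)
    f_stat = t2 * (n1 + n2 - k - 1) / (k * (n1 + n2 - 2))
    p_value = stats.f.sf(f_stat, k, n1 + n2 - k - 1)
    return t2, p_value

# toy example with p much larger than n
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 1000))
Y = rng.standard_normal((30, 1000)) + 0.2             # shifted mean
print(projected_hotelling_test(X, Y))
```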
Consider a sequence of bits where we are trying to predict the next bit from the previous bits. Assume we are allowed to say "predict 0" or "predict 1", and our payoff is +1 if the prediction is correct and -1 otherwise. We will say that at each point in time the loss of an algorithm is the number of wrong predictions minus the number of right predictions so far. In this paper we are interested in algorithms that have essentially zero (expected) loss over any string at any point in time and yet have small regret with respect to always predicting 0 or always predicting 1. For a sequence of length T our algorithm has regret 14√T and loss 2√T·e^(−c₂√T) in expectation for all strings. We show that the tradeoff between loss and regret is optimal up to constant factors. Our techniques extend to the general setting of N experts, where the related problem of trading off regret to the best expert for regret to the special expert has been studied by Even-Dar et al. (COLT07). We obtain essentially zero loss with respect to the special expert and optimal loss/regret tradeoff, improving upon the results of Even-Dar et al. (COLT07) and settling the main question left open in their paper. The strong loss bounds of the algorithm have some surprising consequences. First, we obtain a parameter-free algorithm for the experts problem that has optimal regret bounds with respect to k-shifting optima, i.e. bounds with respect to the optimum that is allowed to change arms multiple times. Moreover, for any window of size n the regret of our algorithm to any expert never exceeds O(√(n(log N + log T))), where N is the number of experts and T is the time horizon, while maintaining the essentially zero loss property. Subject Area: Theory
Computing a good strategy in a large extensive form game often demands an extraordinary amount of computer memory, necessitating the use of abstraction to reduce the game size. Typically, strategies from abstract games perform better in the real game as the granularity of abstraction is increased. This paper investigates two techniques for stitching a base strategy in a coarse abstraction of the full game tree to expert strategies in fine abstractions of smaller subtrees. We provide a general framework for creating static experts, an approach that generalizes some previous strategy stitching efforts. In addition, we show that static experts can create strong agents for both 2-player and 3-player Leduc and Limit Texas Hold'em poker, and that a specific class of static experts can be preferred among a number of alternatives. Furthermore, we describe a poker agent that used static experts and won the 3-player events of the 2010 Annual Computer Poker Competition. Subject Area: Theory
T76 Multi-Bandit Best Arm Identification
V. Gabillon victor.gabillon@inria.fr M. Ghavamzadeh mohammad.ghavamzadeh@inria.fr A. Lazaric alessandro.lazaric@inria.fr INRIA Lille-Nord Europe S. Bubeck sbubeck@princeton.edu Princeton University We study the problem of identifying the best arm in each of the bandits in a multi-bandit multi-armed setting. We first propose an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i.e., small gap). We then introduce an algorithm, called GapE-V, which takes into account the variance of the arms in addition to their gap. We prove an upper bound on the probability of error for both algorithms. Since GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is often unknown in advance, we also introduce variations of these algorithms that estimate this complexity online. Finally, we evaluate the performance of these algorithms and compare them to other allocation strategies on a number of synthetic problems. Subject Area: Theory\Online Learning

case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods. Subject Area: Theory
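A gap-based allocation of the kind GapE uses can be mocked up in a few lines: pull the arm maximizing an index that combines the negative estimated gap with an exploration bonus. The index form, the constant a, and the budget below are simplifications for illustration and should not be read as the paper's exact algorithm or tuning.

```python
import numpy as np

def gap_based_allocation(pull, n_arms, budget, a=2.0):
    """Schematic gap-based exploration (in the spirit of GapE) for a single bandit.

    At each step the arm maximizing  -estimated_gap + sqrt(a / n_pulls)  is pulled;
    after the budget is spent, the empirically best arm is returned.
    pull(k) must return a stochastic reward for arm k.
    """
    counts = np.ones(n_arms)
    sums = np.array([pull(k) for k in range(n_arms)], dtype=float)   # pull each arm once
    for _ in range(budget - n_arms):
        means = sums / counts
        best = int(np.argmax(means))
        second = np.partition(means, -2)[-2]
        gaps = means[best] - means          # gap to the best arm ...
        gaps[best] = means[best] - second   # ... and, for the best arm, to the runner-up
        index = -gaps + np.sqrt(a / counts)
        k = int(np.argmax(index))
        sums[k] += pull(k)
        counts[k] += 1
    return int(np.argmax(sums / counts))

# toy usage with Bernoulli arms
arm_means = [0.3, 0.5, 0.55]
best_arm = gap_based_allocation(lambda k: np.random.binomial(1, arm_means[k]), 3, budget=3000)
```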
t80 online submodular set Cover, ranking, and repeated active learning
A. Guillory guillory@cs.washington.edu J. Bilmes bilmes@ee.washington.edu University of Washington We propose an online prediction version of submodular set cover with connections to ranking and repeated active learning. In each round, the learning algorithm chooses a sequence of items. The algorithm then receives a monotone submodular function and suffers loss equal to the cover time of the function: the number of items needed, when items are selected in order of the chosen sequence, to achieve a coverage constraint. We develop an online learning algorithm whose loss converges to approximately that of the best sequence in hindsight. Our proposed algorithm is readily extended to a setting where multiple functions are revealed at each round and to bandit and contextual bandit settings. Subject Area: Theory
the arms. We provide two regret analyses: a distribution-dependent bound Õ(n^{-3/2}) that depends on a measure of the disparity of the arms, and a distribution-free bound Õ(n^{-4/3}) that does not. To the best of our knowledge, such a finite-time analysis is new for this problem. Subject Area: Theory
t82 see the tree through the lines: the shazoo algorithm
F. Vitale fabio.vitale@unimi.it N. Cesa-Bianchi nicolo.cesa-bianchi@unimi.it G. Zappella giovanni.zappella@unimi.it Università degli Studi di Milano C. Gentile claudio.gentile@uninsubria.it Università dell'Insubria Predicting the nodes of a given graph is a fascinating theoretical problem with applications in several domains. Since graph sparsification via spanning trees retains enough information while making the task much easier, trees are an important special case of this problem. Although it is known how to predict the nodes of an unweighted tree in a nearly optimal way, in the weighted case a fully satisfactory algorithm is not available yet. We fill this hole and introduce an efficient node predictor, Shazoo, which is nearly optimal on any weighted tree. Moreover, we show that Shazoo can be viewed as a common nontrivial generalization of both previous approaches for unweighted trees and weighted lines. Experiments on real-world datasets confirm that Shazoo performs well in that it fully exploits the structure of the input tree, and gets very close to (and sometimes better than) less scalable energy minimization methods. Subject Area: Theory\Online Learning
T83 Generalizing from Several Related Classification tasks to a new unlabeled sample
G. Blanchard gilles.blanchard@math.uni-potsdam.de Universität Potsdam (DE) G. Lee gyemin@umich.edu C. Scott cscott@eecs.umich.edu University of Michigan We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distribution-free, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented. Subject Area: Theory
Many fundamental questions in theoretical neuroscience involve optimal decoding and the computation of Shannon information rates in populations of spiking neurons. In this paper, we apply methods from the asymptotic theory of statistical inference to obtain a clearer analytical understanding of these quantities. We find that for large neural populations carrying a finite total amount of information, the full spiking population response is asymptotically as informative as a single observation from a Gaussian process whose mean and covariance can be characterized explicitly in terms of network and single neuron properties. The Gaussian form of this asymptotic sufficient statistic allows us in certain cases to perform optimal Bayesian decoding by simple linear transformations, and to obtain closed-form expressions of the Shannon information carried by the network. One technical advantage of the theory is that it may be applied easily even to non-Poisson point process network models; for example, we find that under some conditions, neural populations with strong history-dependent (non-Poisson) effects carry exactly the same information as do simpler equivalent populations of non-interacting Poisson neurons with matched firing rates. We argue that our findings help to clarify some results from the recent literature on neural decoding and neuroprosthetic design. Subject Area: Probabilistic Models and Methods
t86 eigennet: a bayesian hybrid of generative and conditional models for sparse learning
Y. Qi F. Yan Purdue University alanqi@cs.purdue.edu fengyan0@gmail.com
For many real-world applications, we often need to select correlated variables, such as genetic variations and imaging features associated with Alzheimer's disease, in a high dimensional space. The correlation between variables presents a challenge to classical variable selection methods. To address this challenge, the elastic net has been developed and successfully applied to many applications. Despite its great success, the elastic net does not exploit the correlation information embedded in the data to select correlated variables. To overcome this limitation, we present a novel hybrid model, EigenNet, that uses the eigenstructures of data to guide variable selection. Specifically, it integrates a sparse conditional classification model with a generative model capturing variable correlations in a principled Bayesian framework. We develop an efficient active-set algorithm to estimate the model via evidence maximization. Experiments on synthetic data and imaging genetics data demonstrated the superior predictive performance of the EigenNet over the lasso, the elastic net, and automatic relevance determination. Subject Area: Probabilistic Models and Methods
We consider the problem of estimating neural spikes from extracellular voltage recordings. Most current methods are based on clustering, which requires substantial human supervision and produces systematic errors by failing to properly handle temporally overlapping spikes. We formulate the problem as one of statistical inference, in which the recorded voltage is a noisy sum of the spike trains of each neuron convolved with its associated spike waveform. Joint maximum-a-posteriori (MAP) estimation of the waveforms and spikes is then a blind deconvolution problem in which the coefficients are sparse. We develop a block-coordinate descent method for approximating the MAP solution. We validate our method on data simulated according to the generative model, as well as on real data for which ground truth is available via simultaneous intracellular recordings. In both cases, our method substantially reduces the number of missed spikes and false positives when compared to a standard clustering algorithm, primarily by recovering temporally overlapping spikes. The method offers a fully automated alternative to clustering methods that is less susceptible to systematic errors. Subject Area: Probabilistic Models and Methods
t90 message-Passing for approximate maP inference with latent Variables
J. Jiang jiarong@umiacs.umd.edu H. Daume III me@hal3.name University of Maryland P. Rai piyush@cs.utah.edu University of Utah We consider a general inference setting for discrete probabilistic graphical models where we seek maximum a posteriori (MAP) estimates for a subset of the random variables (max nodes), marginalizing over the rest (sum nodes). We present a hybrid message-passing algorithm to accomplish this. The hybrid algorithm passes a mix of sum and max messages depending on the type of source node (sum or max). We derive our algorithm by showing that it falls out as the solution of a particular relaxation of a variational framework. We further show that the Expectation Maximization algorithm can be seen as an approximation to our algorithm. Experimental results on synthetic and real-world datasets, against several baselines, demonstrate the efficacy of our proposed algorithm. Subject Area: Probabilistic Models and Methods

which evolves by splitting and merging clusters. An FCP is exchangeable, projective, stationary and reversible, and its equilibrium distributions are given by the Chinese restaurant process. As opposed to hidden Markov models, FCPs allow for flexible modelling of the number of clusters, and they avoid label switching non-identifiability problems. We develop an efficient Gibbs sampler for FCPs which uses uniformization and the forward-backward algorithm. Our development of FCPs is motivated by applications in population genetics, and we demonstrate the utility of FCPs on problems of genotype imputation with phased and unphased SNP data. Subject Area: Probabilistic Models and Methods
Topic models are learned via a statistical model of variation within document collections, but designed to extract meaningful semantic structure. Desirable traits include the ability to incorporate annotations or metadata associated with documents; the discovery of correlated patterns of topic usage; and the avoidance of parametric assumptions, such as manual specification of the number of topics. We propose a doubly correlated nonparametric topic (DCNT) model, the first model to simultaneously capture all three of these properties. The DCNT models metadata via a flexible Gaussian regression on arbitrary input features; correlations via a scalable square-root covariance representation; and nonparametric selection from an unbounded series of potential topics via a stick-breaking construction. We validate the semantic structure and predictive performance of the DCNT using a corpus of NIPS documents annotated by various metadata. Subject Area: Probabilistic Models and Methods
t95 the Kernel beta Process
L. Ren Y. Wang D. Dunson L. Carin Duke University lren@yahoo-inc.com yw65@duke.edu dunson@stat.duke.edu lcarin@duke.edu
A new Lévy process prior is proposed for an uncountable collection of covariate-dependent feature-learning measures; the model is called the kernel beta process (KBP). Available covariates are handled efficiently via the kernel construction, with covariates assumed observed with each data sample ("customer"), and latent covariates learned for each feature ("dish"). Each customer selects dishes from an infinite buffet, in a manner analogous to the beta process, with the added constraint that a customer first decides probabilistically whether to "consider" a dish, based on the distance in covariate space between the customer and dish. If a customer does consider a particular dish, that dish is then selected probabilistically as in the beta process. The beta process is recovered as a limiting case of the KBP. An efficient Gibbs sampler is developed for computations, and state-of-the-art results are presented for image processing and music analysis tasks. Subject Area: Probabilistic Models and Methods
How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that context-sensitive optimization outperforms no or naive use of context. Subject Area: Probabilistic Models and Methods
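An upper-confidence rule on a Gaussian process over joint context-action inputs is easy to prototype with scikit-learn. The sketch below uses a single RBF kernel on the concatenated (context, action) vector, which factorizes into a product of a context kernel and an action kernel, and a fixed heuristic β rather than the theoretically justified schedule referred to above; all names and constants are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def ucb_choose(gp, context, actions, beta=2.0):
    """Pick the action maximizing mu + sqrt(beta) * sigma of a GP on (context, action)."""
    Z = np.array([np.concatenate([context, a]) for a in actions])
    mu, sigma = gp.predict(Z, return_std=True)
    return int(np.argmax(mu + np.sqrt(beta) * sigma))

# toy loop: 1-D context, 1-D action, unknown payoff f
f = lambda c, a: -(a - 0.7 * c) ** 2
actions = [np.array([a]) for a in np.linspace(-1, 1, 21)]
X_hist, y_hist = [], []
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-2)
for t in range(30):
    context = np.random.uniform(-1, 1, size=1)
    if X_hist:
        gp.fit(np.array(X_hist), np.array(y_hist))
        idx = ucb_choose(gp, context, actions)
    else:
        idx = np.random.randint(len(actions))       # no data yet: act randomly
    a = actions[idx]
    X_hist.append(np.concatenate([context, a]))
    y_hist.append(f(context[0], a[0]) + 0.05 * np.random.randn())
```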
T98 Automated Refinement of Bayes Networks' Parameters based on test ordering Constraints
O. Khan ozkhan@cs.uwaterloo.ca P. Poupart ppoupart@cs.uwaterloo.ca University of Waterloo J. Agosta johnmark.agosta@gmail.com Intel Corporation In this paper, we derive a method to refine a Bayes network diagnostic model by exploiting constraints implied by expert decisions on test ordering. At each step, the expert executes an evidence gathering test, which suggests the test's relative diagnostic value. We demonstrate that consistency with an expert's test selection leads to non-convex constraints on the model parameters. We incorporate these constraints by augmenting the network with nodes that represent the constraint likelihoods. Gibbs sampling, stochastic hill climbing and greedy search algorithms are proposed to find a MAP estimate that takes into account test ordering constraints and any data available. We demonstrate our approach on diagnostic sessions from a manufacturing scenario. Subject Area: Probabilistic Models and Methods
t99 solving decision Problems with limited information
D. Mauá denis@idsia.ch C. de Campos cassio@idsia.ch Dalle Molle Institute for Artificial Intelligence We present a new algorithm for exactly solving decision-making problems represented as an influence diagram. We do not require the usual assumptions of no-forgetting and regularity, which allows us to solve problems with limited information. The algorithm, which implements a sophisticated variable elimination procedure, is empirically shown to outperform a state-of-the-art algorithm on randomly generated problems of up to 150 variables and 10^64 strategies. Subject Area: Probabilistic Models and Methods
t103 Comparative analysis of Viterbi training and maximum likelihood estimation for Hmms
A. Allahverdyan armen.allahverdyan@gmail.com Yerevan Physics Institute A. Galstyan galstyan@isi.edu University of Southern California We present an asymptotic analysis of Viterbi Training (VT) and contrast it with a more conventional Maximum Likelihood (ML) approach to parameter estimation in Hidden Markov Models. While the ML estimator works by (locally) maximizing the likelihood of the observed data, VT seeks to maximize the probability of the most likely hidden state sequence. We develop an analytical framework based on a generating function formalism and illustrate it on an exactly solvable model of an HMM with one unambiguous symbol. For this particular model the ML objective function is continuously degenerate. The VT objective, in contrast, is shown to have only finite degeneracy. Furthermore, VT converges faster and results in sparser (simpler) models, thus realizing an automatic Occam's razor for HMM learning. For more general scenarios VT can be worse than ML but is still capable of correctly recovering most of the parameters. Subject Area: Probabilistic Models and Methods
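Viterbi Training is hard EM: decode the most likely state path, then re-estimate parameters from counts along that single path. The sketch below implements this for a small discrete HMM from scratch, with pseudocounts added for numerical safety; it is meant only to make the VT/ML contrast concrete, not to reproduce the analytical framework of the paper.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state path for a discrete HMM (log domain)."""
    T, K = len(obs), len(pi)
    logd = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)            # scores[i, j]: best path ending i -> j
        back[t] = np.argmax(scores, axis=0)
        logd = scores[back[t], np.arange(K)] + np.log(B[:, obs[t]])
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(logd)
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

def viterbi_training(obs, K, V, n_iter=20, eps=1e-3, seed=0):
    """Hard EM: alternate Viterbi decoding with count-based parameter updates."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    A = rng.dirichlet(np.ones(K), size=K)
    B = rng.dirichlet(np.ones(V), size=K)
    for _ in range(n_iter):
        z = viterbi(obs, pi, A, B)
        A_cnt = np.full((K, K), eps)
        B_cnt = np.full((K, V), eps)
        for t in range(len(obs) - 1):
            A_cnt[z[t], z[t + 1]] += 1
        for t, o in enumerate(obs):
            B_cnt[z[t], o] += 1
        A = A_cnt / A_cnt.sum(axis=1, keepdims=True)
        B = B_cnt / B_cnt.sum(axis=1, keepdims=True)
        pi = (np.bincount(z[:1], minlength=K) + eps)
        pi = pi / pi.sum()
    return pi, A, B

obs = np.array([0, 1, 0, 0, 2, 2, 1, 0, 1, 2] * 10)
pi, A, B = viterbi_training(obs, K=2, V=3)
```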
t101 on the Completeness of first-order Knowledge Compilation for lifted Probabilistic inference
G. Van den Broeck guy.vandenbroeck@cs.kuleuven.be Katholieke Universiteit Leuven Probabilistic logics are receiving a lot of attention today because of their expressive power for knowledge representation and learning. However, this expressivity is detrimental to the tractability of inference, when done at the propositional level. To solve this problem, various lifted inference algorithms have been proposed that reason at the first-order level, about groups of objects as a whole. Despite the existence of various lifted inference approaches, there are currently no completeness results about these algorithms. The key contribution of this paper is that we introduce a formal definition of lifted inference that allows us to reason about the completeness of lifted inference algorithms relative to a particular class of probabilistic models. We then show how to obtain a completeness result using a first-order knowledge compilation approach for theories of formulae containing up to two logical variables. Subject Area: Probabilistic Models and Methods
demonstrations abstraCts
5:45 - 11:59 Pm 1A Reproducing biologically realistic firing patterns on a highly-accelerated neuromorphic hardware system
Marc-Olivier Schwartz Kirchhoff-Institut für Physik This demonstration will feature a neuromorphic chip in operation. The chip shown during the demonstration features 512 neuron circuits and 115,000 plastic synapses. Users will be able to see live membrane voltage recordings from neurons on the neuromorphic chip, which will reproduce typical firing patterns as seen in biology. Users will be able to modify the parameters of the neuron circuits with a Graphical User Interface (GUI) and observe the result on a screen.
Wednesday ConferenCe
ORAL SESSION
session 6 - 9:30 - 10:40 am Session Chair: Raquel Urtasun POSNER LECTURE: From Kernels to Causal Inference, Bernhard Schölkopf, Max Planck Institute for Intelligent Systems Kernel methods in machine learning have expanded from tricks to construct nonlinear algorithms to general tools to assay higher order statistics and properties of distributions. They find applications also in causal inference, an intriguing field that examines causal structures by testing their probabilistic footprints. However, the links between causal inference and modern machine learning go beyond this, and the talk will outline some initial thoughts on how problems like covariate shift adaptation and semi-supervised learning can benefit from the causal methodology.
Bernhard Schölkopf received degrees in mathematics (London) and physics (Tübingen), and a doctorate in computer science from the Technical University Berlin. He has researched at AT&T Bell Labs, at GMD FIRST, Berlin, at the Australian National University, Canberra, and at Microsoft Research Cambridge (UK). In 2001, he was appointed scientific member of the Max Planck Society and director at the MPI for Biological Cybernetics; in 2010 he founded the Max Planck Institute for Intelligent Systems. For further information, see www.kyb.tuebingen.mpg.de/~bs.
SPOTLIGHT SESSION
session 5 - 10:40 - 11:10 am Session Chair: Raquel Urtasun Minimax Localization of Structural Information in Large Noisy Matrices Mladen Kolar, Sivaraman Balakrishnan, Alessandro Rinaldo, Aarti Singh, Carnegie Mellon University Subject Area: Clustering See abstract, page 93 (W55) On Learning Discrete Graphical Models using Greedy Methods Ali Jalali, Christopher Johnson, Pradeep Ravikumar, University of Texas at Austin Subject Area: Model Selection & Structure Learning See abstract, page 102 (W100) Learning to Learn with Compound HD Models Ruslan Salakhutdinov, University of Toronto; Josh Tenenbaum & Antonio Torralba, MIT Subject Area: Probabilistic Models and Methods See abstract, page 100 (W89) Probabilistic Joint Image Segmentation and Labeling Adrian Ion, TU Wien & IST Austria; Joao Carreira & Cristian Sminchisescu, University of Bonn Subject Area: Vision See abstract, page 83 (W16) Object Detection with Grammar Models Ross Girshick, University of Chicago, Pedro Felzenszwalb, Brown University, and David McAllester, TTI-Chicago Subject Area: Object Recognition See abstract, page 85 (W24) Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound Iasonas Kokkinos, Ecole Centrale Paris / INRIA Saclay Subject Area: Object Recognition See abstract, page 85 (W23) Im2Text: Describing Images Using 1 Million Captioned Photographs Vicente Ordonez, Girish Kulkarni and Tamara Berg, Stony Brook University Subject Area: Object Recognition See abstract, page 85 (W21)
High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions, Animashree Anandkumar, UC Irvine; Vincent Tan, University of Wisconsin-Madison; Alan Willsky, MIT We consider the problem of Ising and Gaussian graphical model selection given n i.i.d. samples from the model. We propose an efficient threshold-based algorithm for structure estimation known as the conditional mutual information test. This simple local algorithm requires only low-order statistics of the data and decides whether two nodes are neighbors in the unknown graph. Under some transparent assumptions, we establish that the proposed algorithm is structurally consistent (or sparsistent) when the number of samples scales as n = Ω(J_min^{-4} log p), where p is the number of nodes and J_min is the minimum edge potential. We also prove novel non-asymptotic necessary conditions for graphical model selection. Subject Area: Probabilistic Models and Methods
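A thresholded conditional-mutual-information test of this kind is straightforward to prototype for binary data: declare j a neighbor of i when the empirical CMI stays above a threshold ξ for every small conditioning set. The estimator, the threshold value, and the conditioning-set size below are illustrative assumptions, not the paper's exact test or constants.

```python
import numpy as np
from itertools import combinations

def cond_mutual_info(data, i, j, S):
    """Empirical conditional mutual information I(X_i; X_j | X_S) for binary data."""
    n = data.shape[0]
    cols = list(S)
    keys = data[:, cols] if cols else np.zeros((n, 0), dtype=int)
    cmi = 0.0
    for s_val in {tuple(row) for row in keys}:
        mask = np.all(keys == np.array(s_val), axis=1) if cols else np.ones(n, bool)
        sub, p_s = data[mask], mask.mean()
        for a in (0, 1):
            for b in (0, 1):
                p_ab = np.mean((sub[:, i] == a) & (sub[:, j] == b))
                p_a, p_b = np.mean(sub[:, i] == a), np.mean(sub[:, j] == b)
                if p_ab > 0:
                    cmi += p_s * p_ab * np.log(p_ab / (p_a * p_b))
    return cmi

def cmi_neighbors(data, i, eta=1, xi=0.05):
    """Declare j a neighbor of i if the CMI exceeds xi for every small conditioning set."""
    p = data.shape[1]
    others = [k for k in range(p) if k != i]
    nbrs = []
    for j in others:
        rest = [k for k in others if k != j]
        sets = [()] + [S for r in range(1, eta + 1) for S in combinations(rest, r)]
        if min(cond_mutual_info(data, i, j, S) for S in sets) > xi:
            nbrs.append(j)
    return nbrs

# toy usage: a 3-node chain 0 - 1 - 2 of correlated binary variables
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 2000)
x1 = np.where(rng.random(2000) < 0.9, x0, 1 - x0)
x2 = np.where(rng.random(2000) < 0.9, x1, 1 - x1)
data = np.column_stack([x0, x1, x2])
print(cmi_neighbors(data, i=0))        # node 1 should be detected as the only neighbor of 0
```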
ORAL SESSION
session 7 - 11:10 - 11:30 am Session Chair: Katherine Heller Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials Philipp Krähenbühl and Vladlen Koltun, Stanford University Most state-of-the-art techniques for multi-class image segmentation and labeling use conditional random fields defined over pixels or image regions. While region-level models often feature dense pairwise connectivity, pixel-level models are considerably larger and have only permitted sparse graph structures. In this paper, we consider fully connected CRF models defined on the complete set of pixels in an image. The resulting graphs have billions of edges, making traditional inference algorithms impractical. Our main contribution is a highly efficient approximate inference algorithm for fully connected CRF models in which the pairwise edge potentials are defined by linear combinations of Gaussian kernels. Our algorithm can approximately minimize fully connected models on tens of thousands of variables in a fraction of a second. Quantitative and qualitative results on the MSRC-21 and PASCAL VOC 2010 datasets demonstrate that full pairwise connectivity at the pixel level produces significantly more accurate segmentations and pixel-level label assignments. Subject Area: Vision
SPOTLIGHT SESSION
session 6 - 12:40 - 1:10 Pm Session Chair: Phil Long Improved Algorithms for Linear Stochastic Bandits Y. Abbasi-yadkori and C. Szepesvari, University of Alberta; D. Pal, Google Subject Area: Online Learning See abstract, page 98 (W82) From Bandits to Experts: On the Value of Side-Observations S. Mannor, O. Shamir, Microsoft Research Subject Area: Online Learning See abstract, page 98 (W81) Lower Bounds for Passive and Active Learning M. Raginsky, University of Illinois at Urbana-Champaign; A. Rakhlin, University of Pennsylvania Subject Area: Statistical Learning Theory See abstract, page 99 (W85) Active Classification based on Value of Classifier T. Gao, D. Koller, Stanford University Subject Area: Classification See abstract, page 88 (W33) Budgeted Optimization with Concurrent Stochastic-Duration Experiments J. Azimi, A. Fern, X. Fern, Oregon State University Subject Area: Active Learning See abstract, page 92 (W52) Projection onto A Nonnegative Max-Heap J. Liu, Siemens Corporate Research; L. Sun and J. Ye, Arizona State University Subject Area: Learning with Structured Data See abstract, page dd Phase transition in the family of p-resistances M. Alamgir, U. von Luxburg, Max Planck Institute for Intelligent Systems Subject Area: Statistical Learning Theory See abstract, page 99 (W87)
ORAL SESSION
session 8 - 12:00 - 12:40 Pm Session Chair: Phil Long Efficient Online Learning via Randomized Rounding Nicolò Cesa-Bianchi, Università degli Studi di Milano; Ohad Shamir, Microsoft Research Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines "random playout" and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning. Subject Area: Online Learning Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization Mark Schmidt, INRIA; Nicolas Le Roux, INRIA; Francis Bach, INRIA - Ecole Normale Superieure We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the second term. We show that the basic proximal-gradient method, the basic proximal-gradient method with a strong convexity assumption, and the accelerated proximal-gradient method achieve the same convergence rates as in the error-free case, provided the errors decrease at an appropriate rate. Our experimental results on a structured sparsity problem indicate that sequences of errors with these appealing theoretical properties can lead to practical performance improvements. Subject Area: Convex Optimization
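The basic (error-free) proximal-gradient iteration for a smooth loss plus an L1 term takes only a few lines of NumPy; the sketch below also lets a decaying error be injected into each gradient, loosely mimicking the inexact setting described above. All constants and the noise schedule are illustrative assumptions, not values from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def inexact_ista(X, y, lam, n_iter=300, grad_noise=0.0, seed=0):
    """Proximal gradient on 0.5*||Xw - y||^2 + lam*||w||_1, with an optional
    decaying error added to each gradient to mimic the inexact setting."""
    rng = np.random.default_rng(seed)
    L = np.linalg.norm(X, 2) ** 2            # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for k in range(1, n_iter + 1):
        g = X.T @ (X @ w - y)
        g += (grad_noise / k) * rng.standard_normal(g.shape)   # error shrinking as 1/k
        w = soft_threshold(w - g / L, lam / L)
    return w
```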
ORAL SESSION
session 9 - 1:10 - 1:30 Pm Session Chair: Cédric Archambeau k-NN Regression Adapts to Local Intrinsic Dimension Samory Kpotufe, Max Planck Institute Many nonparametric regressors were recently shown to converge at rates that depend only on the intrinsic dimension of data. These regressors thus escape the curse of dimension when high-dimensional data has low intrinsic dimension (e.g. a manifold). We show that k-NN regression is also adaptive to intrinsic dimension. In particular our rates are local to a query x and depend only on the way masses of balls centered at x vary with radius. Furthermore, we show a simple way to choose k = k(x) locally at any x so as to nearly achieve the minimax rate at x in terms of the unknown intrinsic dimension in the vicinity of x. We also establish that the minimax rate does not depend on a particular choice of metric space or distribution, but rather that this minimax rate holds for any metric space and doubling measure. Subject Area: Learning Theory
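The local choice k = k(x) can be imitated by a simple heuristic that balances a variance proxy 1/k against a bias proxy given by the distance to the k-th neighbor. The rule below is only a stand-in to make the idea concrete; it is not the adaptive rule analyzed in the talk, and the noise-level constant is an assumption.

```python
import numpy as np

def knn_regress_local_k(X, y, x, k_max=50, sigma2=1.0):
    """k-NN regression at query x with a locally chosen k.

    k is picked to balance a variance proxy sigma2/k against a bias proxy
    r_k(x)^2, the squared distance to the k-th neighbor -- a heuristic
    stand-in for adapting k to the local intrinsic dimension.
    """
    d = np.linalg.norm(X - x, axis=1)
    order = np.argsort(d)
    ks = np.arange(1, min(k_max, len(X)) + 1)
    scores = sigma2 / ks + d[order[ks - 1]] ** 2
    k = ks[int(np.argmin(scores))]
    return y[order[:k]].mean()

# toy: data lying on a 1-D manifold embedded in 10 dimensions
rng = np.random.default_rng(0)
t = rng.uniform(0, 1, 500)
X = np.outer(t, rng.standard_normal(10))
y = np.sin(4 * t) + 0.1 * rng.standard_normal(500)
print(knn_regress_local_k(X, y, X[0]))
```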
Noise Thresholds for Spectral Clustering S. Balakrishnan, M. Xu, A. Krishnamurthy, A. Singh, Carnegie Mellon University Subject Area: Clustering See abstract, page 93 (W56)
ORAL SESSION
session 10 - 4:00 - 5:30 Pm Session Chair: Joelle Pineau INVITED TALK: Natural Algorithms Bernard Chazelle, Princeton University I will discuss the merits of an algorithmic approach to the analysis of complex self-organizing systems. I will argue that computer science, and algorithms in particular, offer a fruitful perspective on the complex dynamics of multiagent systems: for example, opinion dynamics, bird flocking, and firefly synchronization. I will give many examples and try to touch on some of the theory behind them, with an emphasis on their algorithmic nature and the particular challenges to machine learning that an algorithmic approach to dynamical systems raises.
Bernard Chazelle is Eugene Higgins professor of computer science at Princeton University, where he has been on the faculty since 1986. He has held research and faculty positions at Carnegie Mellon University, Brown University, Ecole Polytechnique, Ecole Normale Superieure, the University of Paris, and INRIA. He did extensive consulting for Xerox PARC, DEC SRC, and NEC Research, where he was President of the Board of Fellows for several years. He received his Ph.D. in computer science from Yale University in 1980. He is the author of the book The Discrepancy Method. He is a fellow of the American Academy of Arts and Sciences, the European Academy of Sciences, and the World Innovation Foundation. He is an ACM Fellow and a former Guggenheim fellow.
SPOTLIGHT SESSION
session 7 - 1:30 - 2:00 Pm Session Chair: Cédric Archambeau Complexity of Inference in Latent Dirichlet Allocation D. Sontag, New York Univ.; D. Roy, Univ. of Cambridge Subject Area: Topic Models See abstract, page 95 (W66) Practical Variational Inference for Neural Networks A. Graves, University of Toronto Subject Area: Neural Networks See abstract, page 90 (W42) A Multilinear Subspace Regression Method Using Orthogonal Tensors Decompositions Q. Zhao and A. Cichocki, RIKEN Brain Science Institute; C. Caiafa, D. Mandic, L. Zhang, Shanghai Jiao Tong Univ.; T. Ball and A. Schulze-bonhage, Albert-Ludwigs-Univ. Subject Area: Regression See abstract, page 90 (W43) Sparse Filtering J. Ngiam, P. Koh, Z. Chen, S. Bhaskar, A. Ng, Stanford Subject Area: Unsupervised & Semi-supervised Learning See abstract, page 91 (W50) Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators D. Perrault-Joncas, M. Meila, University of Washington Subject Area: Spectral Methods See abstract, page 94 (W63)
Iterative Learning for Reliable Crowdsourcing Systems David Karger, Sewoong Oh, Devavrat Shah, MIT Crowdsourcing systems, in which tasks are electronically distributed to numerous "information piece-workers", have emerged as an effective paradigm for human-powered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give new algorithms for deciding which tasks to assign to which workers and for inferring correct answers from the workers' answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker. Subject Area: Graphical Models
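The style of algorithm described, iteratively re-weighting workers' answers via messages on the task-worker graph rather than taking a straight majority vote, can be sketched as follows. The initialization, update form, and stopping rule here are simplified assumptions made for illustration and are not claimed to match the paper's exact algorithm or guarantees.

```python
import numpy as np

def iterative_crowd_labels(A, n_iter=20, seed=0):
    """Schematic iterative estimation of binary task labels from worker answers.

    A: (tasks x workers) matrix with entries in {+1, -1, 0}; 0 means unassigned.
    Task->worker and worker->task messages are alternately updated so that answers
    from workers estimated to be reliable get more weight than a plain majority vote.
    """
    rng = np.random.default_rng(seed)
    y = rng.normal(1.0, 1.0, size=A.shape)            # worker -> task messages
    for _ in range(n_iter):
        # task -> worker: each task aggregates messages from its *other* workers
        x = (A * y).sum(axis=1, keepdims=True) - A * y
        # worker -> task: each worker aggregates messages from its *other* tasks
        y = (A * x).sum(axis=0, keepdims=True) - A * x
    return np.sign((A * y).sum(axis=1))               # weighted vote per task

# toy usage: 50 tasks, 20 workers of varying reliability, 5 assignments per task
rng = np.random.default_rng(1)
truth = rng.choice([-1, 1], size=50)
reliability = rng.uniform(0.55, 0.95, size=20)
A = np.zeros((50, 20))
for i in range(50):
    for j in rng.choice(20, size=5, replace=False):
        A[i, j] = truth[i] if rng.random() < reliability[j] else -truth[i]
print((iterative_crowd_labels(A) == truth).mean())
```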
A Collaborative Mechanism for Crowdsourcing Prediction Problems Jacob Abernethy, Rafael Frongillo, UC Berkeley Machine Learning competitions such as the Netflix Prize have proven reasonably successful as a method of crowdsourcing prediction tasks. But these competitions have a number of weaknesses, particularly in the incentive structure they create for the participants. We propose a new approach, called a Crowdsourced Learning Mechanism, in which participants collaboratively learn a hypothesis for a given prediction task. The approach draws heavily from the concept of a prediction market, where traders bet on the likelihood of a future event. In our framework, the mechanism continues to publish the current hypothesis, and participants can modify this hypothesis by wagering on an update. The critical incentive property is that a participant will profit an amount that scales according to how much her update improves performance on a released test set. Subject Area: Game Theory & Computational Economics
POSTER SESSION
and reCePtion - 5:45 - 11:59 Pm
W1 a rational model of causal inference with continuous causes, M. Pacer, T. Griffiths
W2 testing a bayesian measure of representativeness using a large image database, J. Abbott, K. Heller, Z. Ghahramani, T. Griffiths
W3 How do Humans teach: on Curriculum learning and teaching dimension, F. Khan, X. Zhu, B. Mutlu
W4 Probabilistic modeling of dependencies among Visual short-term memory representations, E. Orhan, R. Jacobs
W5 understanding the intrinsic memorability of images, P. Isola, D. Parikh, A. Torralba, A. Oliva
W6 an ideal observer model for identifying the reference frame of objects, J. Austerweil, A. Friesen, T. Griffiths
W7 on the analysis of multi-Channel neural spike data, B. Chen, D. Carlson, L. Carin
W8 Select and Sample: A Model of Efficient Neural inference and learning, J. Shelton, J. Bornschein, A. Sheikh, P. Berkes, J. Lucke
W9 neural reconstruction with approximate message Passing (neuramP), A. Fletcher, S. Rangan, L. Varshney, A. Bhargava
W10 two is better than one: distinct roles for familiarity and recollection in retrieving palimpsest memories, C. Savin, P. Dayan, M. Lengyel
W11 Efficient coding with a population of Linear-nonlinear neurons, Y. Karklin, E. Simoncelli
W12 Variational learning for recurrent spiking networks, D. Rezende, D. Wierstra, W. Gerstner
W13 empirical models of spiking in neural populations, J. Macke, L. Buesing, J. Cunningham, B. Yu, K. Shenoy, M. Sahani
W14 Efficient Inference in Fully Connected CRFs with gaussian edge Potentials, P. Krähenbühl, V. Koltun
W15 Fast and Balanced: Efficient Label Tree Learning for large scale object recognition, J. Deng, S. Satheesh, A. Berg, F. Li
W16 Probabilistic Joint image segmentation and labeling, A. Ion, J. Carreira, C. Sminchisescu
W17 Heavy-tailed distances for gradient based image descriptors, Y. Jia, T. Darrell
W18 θ-mrf: Capturing spatial and semantic structure in the Parameters for scene understanding, C. Li, A. Saxena, T. Chen
W19 learning to agglomerate superpixel Hierarchies, V. Jain, S. Turaga, K. Briggman, M. Helmstaedter, W. Denk, H. Seung
W20 multiple instance filtering, K. Wnuk, S. Soatto
W21 automatic Captioning using billions of Photographs, V. Ordonez, G. Kulkarni, T. Berg
W22 Exploiting spatial overlap to efficiently compute appearance distances between image windows, B. Alexe, V. Petrescu, V. Ferrari
W23 rapid deformable object detection using dual-tree branch-and-bound, I. Kokkinos
W24 object detection with grammar models, R. Girshick, P. Felzenszwalb, D. Mcallester
W25 learning a tree of metrics with disjoint Visual features, S. Hwang, K. Grauman, F. Sha
W26 learning person-object interactions for action recognition in still images, V. Delaitre, J. Sivic, I. Laptev
W27 Extracting Speaker-Specific Information with a regularized siamese deep network, K. Chen, A. Salman
W28 a machine learning approach to Predict Chemical reactions, M. Kayala, P. Baldi
W29 rtrmC: a riemannian trust-region method for low-rank matrix completion, N. Boumal, P. Absil
W30 randomized algorithms for Comparison-based search, D. Tschopp, S. Diggavi, P. Delgosha, S. Mohajer
W31 Active learning ranking from Pairwise Preferences with almost optimal Query Complexity, N. Ailon
W32 selective Prediction of financial trends with Hidden markov models, D. Pidan, R. El-Yaniv
W33 Active Classification based on Value of Classifier, T. Gao, D. Koller
W34 the fast Convergence of boosting, M. Telgarsky
W35 Variance Penalizing adaboost, P. Shivaswamy, T. Jebara
W36 Kernel bayes rule, K. Fukumizu, L. Song, A. Gretton
W37 multiple instance learning on structured data, D. Zhang, Y. Liu, L. Si, J. Zhang, R. Lawrence
W38 structured learning for Cell tracking, X. Lou, F. Hamprecht
W39 Projection onto a nonnegative max-Heap, J. Liu, L. Sun, J. Ye
W40 shallow vs. deep sum-Product networks, O. Delalleau, Y. Bengio
W41 unfolding recursive autoencoders for Paraphrase detection, R. Socher, E. Huang, J. Pennin, A. Ng, C. Manning
W42 Practical Variational inference for neural networks, A. Graves
W43 a multilinear subspace regression method using orthogonal tensors decompositions, Q. Zhao, C. Caiafa, D. Mandic, L. Zhang, T. Ball, A. Schulze-bonhage, A. Cichocki
W44 sparse recovery by thresholded non-negative least squares, M. Slawski, M. Hein
W45 structured sparse coding via lateral inhibition, A. Szlam, K. Gregor, Y. LeCun
W46 Efficient Methods for Overlapping Group Lasso, L. Yuan, J. Liu, J. Ye
W47 generalized beta mixtures of gaussians, A. Armagan, D. Dunson, M. Clyde
W48 sparse manifold Clustering and embedding, E. Elhamifar, R. Vidal
W49 data skeletonization via reeb graphs, X. Ge, I. Safa, M. Belkin, Y. Wang
W50 sparse filtering, J. Ngiam, P. Koh, Z. Chen, S. Bhaskar, A. Ng
W51 ICA with Reconstruction Cost for Efficient overcomplete feature learning, Q. Le, A. Karpenko, J. Ngiam, A. Ng
W52 budgeted optimization with Concurrent stochastic-duration experiments, J. Azimi, A. Fern, X. Fern
W53 fast and accurate k-means for large datasets, M. Shindler, A. Wong, A. Meyerson
W54 scalable training of mixture models via Coresets, D. Feldman, M. Faulkner, A. Krause
W55 minimax localization of structural information in large noisy matrices, M. Kolar, S. Balakrishnan, A. Rinaldo, A. Singh
W56 noise thresholds for spectral Clustering, S. Balakrishnan, M. Xu, A. Krishnamurthy, A. Singh
W57 maximum Covariance unfolding: manifold learning for bimodal data, V. Mahadevan, C. Wong, J. Costa Pereira, T. Liu, N. Vasconcelos, L. Saul
W58 The Manifold Tangent Classifier, S. Rifai, Y. Dauphin, P. Vincent, Y. Bengio, X. Muller
W59 dimensionality reduction using the sparse linear model, I. Gkioulekas, T. Zickler
W60 large-scale sparse Principal Component analysis with application to text data, Y. Zhang, L. Ghaoui
W61 divide-and-Conquer matrix factorization, L. Mackey, A. Talwalkar, M. Jordan
W62 Convergent bounds on the euclidean distance, Y. Hwang, H. Ahn
W63 directed graph embedding: an algorithm based on Continuous limits of laplacian-type operators, D. Perrault-Joncas, M. Meila
W64 improving topic Coherence with regularized topic models, D. Newman, E. Bonilla, W. Buntine
W65 expressive Power and approximation errors of restricted boltzmann machines, G. Montufar, J. Rauh, N. Ay
W66 Complexity of inference in latent dirichlet allocation, D. Sontag, D. Roy
W67 Hierarchically supervised latent dirichlet allocation, A. Perotte, F. Wood, N. Elhadad, N. Bartlett
W68 distributed delayed stochastic optimization, A. Agarwal, J. Duchi
W69 a concave regularization technique for sparse mixture models, M. Larsson, J. Ugander
W70 fast approximate submodular minimization, S. Jegelka, H. Lin, J. Bilmes
W71 Convergence rates of inexact Proximal-gradient methods for Convex optimization, M. Schmidt, N. Le Roux, F. Bach
W72 linearized alternating direction method with adaptive Penalty for low-rank representation, Z. Lin, R. Liu, Z. Su
W73 Approximating Semidefinite Programs in Sublinear time, D. Garber, E. Hazan
W74 Hogwild: a lock-free approach to Parallelizing stochastic gradient descent, B. Recht, C. Re, S. Wright, F. Niu
W75 beating sgd: learning sVms in sublinear time, E. Hazan, T. Koren, N. Srebro
W76 learning large-margin halfspaces with more malicious noise, P. Long, R. Servedio
W77 k-nn regression adapts to local intrinsic dimension, S. Kpotufe
W78 A Collaborative mechanism for Crowdsourcing Prediction Problems, J. Abernethy, R. Frongillo
W79 multi-armed bandits on implicit metric spaces, A. Slivkins
W80 Efficient Online Learning via Randomized rounding, N. Cesa-Bianchi, O. Shamir
W81 from bandits to experts: on the Value of side-observations, S. Mannor, O. Shamir
W82 improved algorithms for linear stochastic bandits, Y. Abbasi-yadkori, D. Pal, C. Szepesvari
W83 stochastic convex optimization with bandit feedback, A. Agarwal, D. Foster, D. Hsu, S. Kakade, A. Rakhlin
W84 Predicting Dynamic Difficulty, O. Missura, T. Gaertner
W85 lower bounds for Passive and active learning, M. Raginsky, A. Rakhlin
W86 optimal learning rates for least squares sVms using gaussian kernels, M. Eberts, I. Steinwart
W87 Phase transition in the family of p-resistances, M. Alamgir, U. von Luxburg
W88 bayesian spike-triggered Covariance analysis, I. Park, J. Pillow
W89 learning to learn with Compound Hd models, R. Salakhutdinov, J. Tenenbaum, A. Torralba
W90 reconstructing Patterns of information diffusion from incomplete observations, F. Chierichetti, J. Kleinberg, D. Liben-Nowell
W91 Continuous-time regression models for longitudinal networks, D. Vu, A. Asuncion, D. Hunter, P. Smyth
W92 differentially Private m-estimators, J. Lei
W93 bayesian bias mitigation for Crowdsourcing, F. Wauthier, M. Jordan
W94 t-divergence based approximate inference, N. Ding, S. Vishwanathan, Y. Qi
W95 Query-aware mCmC, M. Wick, A. McCallum
W96 learning sparse inverse covariance matrices in the presence of confounders, O. Stegle, C. Lippert, J. Mooij, N. Lawrence, K. Borgwardt
W97 gaussian Process training with input noise, A. McHutchon, C. Rasmussen
W98 iterative learning for reliable Crowdsourcing systems, D. Karger, S. Oh, D. Shah
W99 High-dimensional graphical model selection: tractable graph families and necessary Conditions, A. Anandkumar, V. Tan, A. Willsky
W100 on learning discrete graphical models using greedy methods, A. Jalali, C. Johnson, P. Ravikumar
W101 Clustered multi-task learning Via alternating structure optimization, J. Zhou, J. Chen, J. Ye
W72
W73 W74
W100 W101
DEMONSTRATIONS
5:45 - 11:59 PM
1B Real-Time Social Media Analysis with TWIMPACT, Mikio Braun, Matthias Jugel, Klaus-Robert Müller, TU Berlin
2B AISoy1, a Robot that Perceives, Feels and Makes Decisions, Diego García Sánchez, AISoy Robotics S.L.; David Rios Insua, Rey Juan Carlos University
3B Real-Time Multi-Class Segmentation Using Depth Cues, Clement Farabet, Nathan Silberman, New York University
4B Contour-Based Large Scale Image Retrieval Platform, Rong Zhou, Liqing Zhang, Shanghai Jiao Tong University
Floor plan (Floor One): poster boards W01-W102 are arranged in rows facing the front entrance; demonstrations 1B and 3B are in Andalucia 3, demonstrations 2B and 4B in Andalucia 2, adjacent to the Internet Area and the Cafeteria.
Wednesday - Abstracts
W1 A Rational Model of Causal Inference with Continuous Causes
M. Pacer mpacer@berkeley.edu T. Griffiths tom_griffiths@berkeley.edu University of California, Berkeley
Rational models of causal induction have been successful in accounting for people's judgments about the existence of causal relationships. However, these models have focused on explaining inferences from discrete data of the kind that can be summarized in a 2 × 2 contingency table. This severely limits the scope of these models, since the world often provides non-binary data. We develop a new rational model of causal induction using continuous dimensions, which aims to diminish the gap between empirical and theoretical approaches and real-world causal induction. This model successfully predicts human judgments from previous studies better than models of discrete causal inference, and outperforms several other plausible models of causal induction with continuous causes in accounting for people's inferences in a new experiment. Subject Area: Cognitive Science
W5 Understanding the Intrinsic Memorability of Images
P. Isola phillipi@mit.edu A. Torralba torralba@csail.mit.edu A. Oliva oliva@mit.edu Massachusetts Institute of Technology D. Parikh dparikh@ttic.edu TTIC
Artists, advertisers, and photographers are routinely presented with the task of creating an image that a viewer will remember. While it may seem like image memorability is purely subjective, recent work shows that it is not an inexplicable phenomenon: variation in memorability of images is consistent across subjects, suggesting that some images are intrinsically more memorable than others, independent of a subject's contexts and biases. In this paper, we used the publicly available memorability dataset of Isola et al., and augmented the object and scene annotations with interpretable spatial, content, and aesthetic image properties. We used a feature-selection scheme with desirable explaining-away properties to determine a compact set of attributes that characterizes the memorability of any individual image. We find that images of enclosed spaces containing people with visible faces are memorable, while images of vistas and peaceful scenes are not. Contrary to popular belief, unusual or aesthetically pleasing scenes do not tend to be highly memorable. This work represents one of the first attempts at understanding intrinsic image memorability, and opens a new domain of investigation at the interface between human cognition and computer vision. Subject Area: Cognitive Science
W7
Nonparametric Bayesian methods are developed for analysis of multi-channel spike-train data, with the feature learning and spike sorting performed jointly. The feature learning and sorting are performed simultaneously across all channels. Dictionary learning is implemented via the beta-Bernoulli process, with spike sorting performed via the dynamic hierarchical Dirichlet process (dHDP), with these two models coupled. The dHDP is augmented to eliminate refractory-period violations, it allows the "appearance" and "disappearance" of neurons over time, and it models smooth variation in the spike statistics. Subject Area: Neuroscience
W9 Neural Reconstruction with Approximate Message Passing (NeuRAMP)
A. Fletcher alyson@eecs.berkeley.edu University of California, Berkeley S. Rangan srangan@poly.edu Polytechnic Institute of New York University L. Varshney varshney@alum.mit IBM A. Bhargava aniruddha@wisc.edu University of Wisconsin-Madison
Many functional descriptions of spiking neurons assume a cascade structure where inputs are passed through an initial linear filtering stage that produces a low-dimensional signal that drives subsequent nonlinear stages. This paper presents a novel and systematic parameter estimation procedure for such models and applies the method to two neural estimation problems: (i) compressed-sensing based neural mapping from multi-neuron excitation, and (ii) estimation of neural receptive fields in sensory neurons. The proposed estimation algorithm models the neurons via a graphical model and then estimates the parameters in the model using a recently-developed generalized approximate message passing (GAMP) method. The GAMP method is based on Gaussian approximations of loopy belief propagation. In the neural connectivity problem, the GAMP-based method is shown to be computationally efficient, provides a more exact modeling of the sparsity, can incorporate nonlinearities in the output and significantly outperforms previous compressed-sensing methods. For the receptive field estimation, the GAMP method can also exploit inherent structured sparsity in the linear weights. The method is validated on estimation of linear nonlinear Poisson (LNP) cascade models for receptive fields of salamander retinal ganglion cells. Subject Area: Neuroscience
W11 Efficient Coding of Natural Images with a Population of Noisy Linear-Nonlinear Neurons
Y. karklin yan.karklin@nyu.edu E. Simoncelli eero.simoncelli@nyu.edu HHMI / New York University Efficient coding provides a powerful principle for explaining early sensory coding. Most attempts to test this principle have been limited to linear, noiseless models, and when applied to natural images, have yielded oriented filters consistent with responses in primary visual cortex. Here we show that an efficient coding model that incorporates biologically realistic ingredients -- input and output noise, nonlinear response functions, and a metabolic cost on the firing rate -- predicts receptive fields and response nonlinearities similar to those observed in the retina. Specifically, we develop numerical methods for simultaneously learning the linear filters and response nonlinearities of a population of model neurons, so as to maximize information transmission subject to metabolic costs. When applied to an ensemble of natural images, the method yields filters that are center-surround and nonlinearities that are rectifying. The filters are organized into two populations, with On- and Off-centers, which independently tile the visual space. As observed in the primate retina, the Off-center neurons are more numerous and have filters with smaller spatial extent. In the absence of noise, our method reduces to a generalized version of independent components analysis, with an adapted nonlinear ``contrast function; in this case, the optimal filters are localized and oriented. Subject Area: Neuroscience
W10 Two Is Better than One: Distinct Roles for Familiarity and Recollection in Retrieving Palimpsest Memories
C. Savin cs664@cam.ac.uk M. Lengyel m.lengyel@eng.cam.ac.uk University of Cambridge P. Dayan dayan@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit Storing a new pattern in a palimpsest memory system comes at the cost of interfering with the memory traces of previously stored items. Knowing the age of a pattern thus becomes critical for recalling it faithfully. This implies that there should be a tight coupling between estimates of age, as a form of familiarity, and the neural dynamics of recollection, something which current theories omit. Using a normative model of autoassociative memory, we show that a dual memory system, consisting of two interacting modules for familiarity and recollection, has best performance for both recollection and recognition. This finding provides a new window onto actively contentious psychological and neural aspects of recognition memory. Subject Area: Neuroscience
W13 Empirical Models of Spiking in Neural Populations
J. Macke jakob@gatsby.ucl.ac.uk L. Buesing lars@gatsby.ucl.ac.uk M. Sahani maneesh@gatsby.ucl.ac.uk University College London J. Cunningham jpc74@cam.ac.uk University of Cambridge B. Yu byronyu@cmu.edu Carnegie Mellon University K. Shenoy shenoy@stanford.edu Stanford University
Neurons in the neocortex code and compute as part of a locally interconnected population. Large-scale multielectrode recording makes it possible to access these population processes empirically by fitting statistical models to unaveraged data. What statistical structure best describes the concurrent spiking of cells within a local network? We argue that in the cortex, where firing exhibits extensive correlations in both time and space and where a typical sample of neurons still reflects only a very small fraction of the local population, the most appropriate model captures shared variability by a low-dimensional latent process evolving with smooth dynamics, rather than by putative direct coupling. We test this claim by comparing a latent dynamical model with realistic spiking observations to coupled generalised linear spike-response models (GLMs) using cortical recordings. We find that the latent dynamical approach outperforms the GLM in terms of goodness-of-fit, and reproduces the temporal correlations in the data more accurately. We also compare models whose observation models are derived from either Gaussian or point-process models, finding that the non-Gaussian model provides slightly better goodness-of-fit and more realistic population spike counts. Subject Area: Neuroscience
W15 Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
J. Deng jiadeng@cs.princeton.edu Princeton University
S. Satheesh ssanjeev@stanford.edu F. Li feifeili@cs.stanford.edu Stanford University
A. Berg aberg@cs.stonybrook.edu Stony Brook
We present a novel approach to efficiently learn a label tree for large scale classification with many classes. The key contribution of the approach is a technique to simultaneously determine the structure of the tree and learn the classifiers for each node in the tree. This approach also allows fine grained control over the efficiency vs accuracy trade-off in designing a label tree, leading to more balanced trees. Experiments are performed on large scale image classification with 10184 classes and 9 million images. We demonstrate significant improvements in test accuracy and efficiency with less training time and more balanced trees compared to the previous state of the art by Bengio et al. Subject Area: Vision
W14 Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials
P. Krähenbühl philkra@gmail.com V. Koltun vladlen@stanford.edu Stanford University
Most state-of-the-art techniques for multi-class image segmentation and labeling use conditional random fields defined over pixels or image regions. While region-level models often feature dense pairwise connectivity, pixel-level models are considerably larger and have only permitted sparse graph structures. In this paper, we consider fully connected CRF models defined on the complete set of pixels in an image. The resulting graphs have billions of edges, making traditional inference algorithms impractical. Our main contribution is a highly efficient approximate inference algorithm for fully connected CRF models in which the pairwise edge potentials are defined by linear combinations of Gaussian kernels. Our algorithm can approximately minimize fully connected models on tens of thousands of variables in a fraction of a second. Quantitative and qualitative results on the MSRC-21 and PASCAL VOC 2010 datasets demonstrate that full pairwise connectivity at the pixel level produces significantly more accurate segmentations and pixel-level label assignments. Subject Area: Vision
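To make the shape of such a mean-field scheme concrete, here is a minimal Python sketch (illustrative only: it uses a plain spatial Gaussian blur in place of the paper's efficient high-dimensional filtering over joint position/colour features, and the function and parameter names are invented for this example):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def meanfield_potts(unary, n_iters=5, sigma=3.0, w=1.0):
    """Toy mean-field inference for a grid CRF with a Potts model and a purely
    spatial Gaussian pairwise kernel. `unary` is an (H, W, L) array of negative
    log unary potentials; the return value is an (H, W) label map."""
    q = np.exp(-unary)
    q /= q.sum(axis=-1, keepdims=True)              # initialize marginals from the unaries
    for _ in range(n_iters):
        # "message passing": smooth each label's marginal over the image plane
        smoothed = np.stack([gaussian_filter(q[..., l], sigma)
                             for l in range(q.shape[-1])], axis=-1)
        logits = -unary + w * smoothed              # Potts compatibility: agree with neighbours
        logits -= logits.max(axis=-1, keepdims=True)
        q = np.exp(logits)
        q /= q.sum(axis=-1, keepdims=True)          # renormalize per pixel
    return q.argmax(axis=-1)
```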
W17 Heavy-Tailed Distances for Gradient Based Image Descriptors
Y. Jia T. Darrell UC Berkeley jiayq@eecs.berkeley.edu trevor@eecs.berkeley.edu
Many applications in computer vision measure the similarity between images or image patches based on some statistics such as oriented gradients. These are often modeled implicitly or explicitly with a Gaussian noise assumption, leading to the use of the Euclidean distance when comparing image descriptors. In this paper, we show that the statistics of gradient based image descriptors often follow a heavy-tailed distribution, which undermines any principled motivation for the use of Euclidean distances. We advocate for the use of a distance measure based on the likelihood ratio test with appropriate probabilistic models that fit the empirical data distribution. We instantiate this similarity measure with the Gamma-compound-Laplace distribution, and show significant improvement over existing distance measures in the application of SIFT feature matching, at relatively low computational cost. Subject Area: Vision
W18 θ-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
C. Li A. Saxena T. Chen Cornell University cl758@cornell.edu asaxena@cs.cornell.edu tsuhan@ece.cornell.edu
For most scene understanding tasks (such as object detection or depth estimation), the classifiers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by defining a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multiclass object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks. Subject Area: Vision
We propose a robust filtering approach based on semisupervised and multiple instance learning (MIL). We assume that the posterior density would be unimodal if not for the effect of outliers that we do not wish to explicitly model. Therefore, we seek for a point estimate at the outset, rather than a generic approximation of the entire posterior. Our approach can be thought of as a combination of standard finite-dimensional filtering (Extended Kalman Filter, or Unscented Filter) with multiple instance learning, whereby the initial condition comes with a putative set of inlier measurements. We show how both the state (regression) and the inlier set (classification) can be estimated iteratively and causally by processing only the current measurement. We illustrate our approach on visual tracking problems whereby the object of interest (target) moves and evolves as a result of occlusions and deformations, and partial knowledge of the target is given in the form of a bounding box (training set). Subject Area: Vision\Motion and Tracking
W21 Im2Text: Describing Images Using 1 Million Captioned Photographs
V. Ordonez vicente.ordonez@gmail.com G. Kulkarni girish86@gmail.com T. Berg tlberg@cs.sunysb.edu Stony Brook University We develop and demonstrate automatic image description methods using a large captioned photo collection. One contribution is our technique for the automatic collection of this new dataset -- performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions. Such a collection allows us to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results. We also develop methods incorporating many state of the art, but fairly noisy, estimates of image content to produce even more pleasing results. Finally we introduce a new objective performance measure for image captioning. Subject Area: Vision\Object Recognition
W22 Exploiting Spatial Overlap to Efficiently Compute Appearance Distances Between Image Windows
B. Alexe bogdan@vision.ee.ethz.ch ETH Zurich V. Petrescu petrescu.viviana@gmail.com V. Ferrari ferrari@vision.ee.ethz.ch University of Edinburgh
We present a computationally efficient technique to compute the distance of high-dimensional appearance descriptor vectors between image windows. The method exploits the relation between appearance distance and spatial overlap. We derive an upper bound on appearance distance given the spatial overlap of two windows in an image, and use it to bound the distances of many pairs between two images. We propose algorithms that build on these basic operations to efficiently solve tasks relevant to many computer vision applications, such as finding all pairs of windows between two images with distance smaller than a threshold, or finding the single pair with the smallest distance. In experiments on the PASCAL VOC 07 dataset, our algorithms accurately solve these problems while greatly reducing the number of appearance distances computed, and achieve larger speedups than approximate nearest neighbour algorithms based on trees [18] and on hashing [21]. For example, our algorithm finds the most similar pair of windows between two images while computing only 1% of all distances on average. Subject Area: Vision\Object Recognition
W25 Learning a Tree of Metrics with Disjoint Visual Features
S. Hwang sjhwang@cs.utexas.edu K. Grauman grauman@cs.utexas.edu University of Texas at Austin F. Sha feisha@usc.edu University of Southern California
We introduce an approach to learn discriminative visual representations while exploiting external semantic knowledge about object category relationships. Given a hierarchical taxonomy that captures semantic similarity between the objects, we learn a corresponding tree of metrics (ToM). In this tree, we have one metric for each non-leaf node of the object hierarchy, and each metric is responsible for discriminating among its immediate subcategory children. Specifically, a Mahalanobis metric learned for a given node must satisfy the appropriate (dis)similarity constraints generated only among its subtree members' training instances. To further exploit the semantics, we introduce a novel regularizer coupling the metrics that prefers a sparse disjoint set of features to be selected for each metric relative to its ancestor supercategory nodes' metrics. Intuitively, this reflects that visual cues most useful to distinguish the generic classes (e.g., feline vs. canine) should be different than those cues most useful to distinguish their component fine-grained classes (e.g., Persian cat vs. Siamese cat). We validate our approach with multiple image datasets using the WordNet taxonomy, show its advantages over alternative metric learning approaches, and analyze the meaning of attribute features selected by our algorithm. Subject Area: Vision\Object Recognition
W29 RTRMC: A Riemannian Trust-Region Method for Low-Rank Matrix Completion
N. Boumal P. Absil U.C.Louvain nicolas.boumal@uclouvain.be absil@inma.ucl.ac.be
W31 Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity
N. Ailon Technion nailon@cs.technion.ac.il
We consider large matrices of low rank. We address the problem of recovering such matrices when most of the entries are unknown. Matrix completion finds applications in recommender systems. In this setting, the rows of the matrix may correspond to items and the columns may correspond to users. The known entries are the ratings given by users to some items. The aim is to predict the unobserved ratings. This problem is commonly stated in a constrained optimization framework. We follow an approach that exploits the geometry of the low-rank constraint to recast the problem as an unconstrained optimization problem on the Grassmann manifold. We then apply first- and second-order Riemannian trust-region methods to solve it. The cost of each iteration is linear in the number of known entries. Our methods, RTRMC 1 and 2, outperform state-of-the-art algorithms on a wide range of problem instances. Subject Area: Applications
Given a set V of n elements we wish to linearly order them using pairwise preference labels which may be non-transitive (due to irrationality or arbitrary noise). The goal is to linearly order the elements while disagreeing with as few pairwise preference labels as possible. Our performance is measured by two parameters: the number of disagreements (loss) and the query complexity (number of pairwise preference labels). Our algorithm adaptively queries at most O(n poly(log n, 1/ε)) preference labels for a regret of ε times the optimal loss. This is strictly better, and often significantly better than what non-adaptive sampling could achieve. Our main result helps settle an open problem posed by learning-to-rank (from pairwise information) theoreticians and practitioners: What is a provably correct way to sample preference labels? Subject Area: Applications
W32 Selective Prediction of Financial Trends with Hidden Markov Models
D. Pidan, R. El-Yaniv
Focusing on short term trend prediction in a financial context, we consider the problem of selective prediction whereby the predictor can abstain from prediction in order to improve performance. We examine two types of selective mechanisms for HMM predictors. The first is a rejection in the spirit of Chow's well-known ambiguity principle. The second is a specialized mechanism for HMMs that identifies low quality HMM states and abstains from prediction in those states. We call this model selective HMM (sHMM). In both approaches we can trade-off prediction coverage to gain better accuracy in a controlled manner. We compare performance of the ambiguity-based rejection technique with that of the sHMM approach. Our results indicate that both methods are effective, and that the sHMM model is superior. Subject Area: Applications
W33 Active Classification Based on Value of Classifier
T. Gao D. Koller Stanford University tianshig@stanford.edu koller@cs.stanford.edu
Modern classification tasks usually involve many class labels and can be informed by a broad range of features. Many of these tasks are tackled by constructing a set of classifiers, which are then applied at test time and then pieced together in a fixed procedure determined in advance or at training time. We present an active classification process at test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation as well as its computational cost. The expected classification gain is computed using a probabilistic model that uses the outcome from previous observations. This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods. Subject Area: Supervised Learning\Classification
W35 Variance Penalizing AdaBoost
P. Shivaswamy, T. Jebara
This paper proposes a novel boosting algorithm called VadaBoost which is motivated by recent empirical Bernstein bounds. VadaBoost iteratively minimizes a cost function that balances the sample mean and the sample variance of the exponential loss. Each step of the proposed algorithm minimizes the cost efficiently by providing weighted data to a weak learner rather than requiring a brute force evaluation of all possible weak learners. Thus, the proposed algorithm solves a key limitation of previous empirical Bernstein boosting methods which required brute force enumeration of all possible weak learners. Experimental results confirm that the new algorithm achieves the performance improvements of EBBoost yet goes beyond decision stumps to handle any weak learner. Significant performance gains are obtained over AdaBoost for arbitrary weak learners including decision trees (CART). Subject Area: Supervised Learning
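For readers who want a feel for the kind of objective this describes, the following hedged sketch shows a mean-plus-variance exponential-loss cost; it is not the authors' exact weighting or update rule, and `lam` is an invented trade-off parameter:

```python
import numpy as np

def variance_penalized_cost(margins, lam=0.1):
    """Cost balancing the sample mean and sample variance of the exponential
    loss, in the spirit of variance-penalizing boosting. `margins` holds
    y_i * f(x_i) for the current ensemble on the training sample."""
    losses = np.exp(-margins)
    return losses.mean() + lam * losses.var()

# Toy usage: score two candidate ensembles' margins on the same sample.
print(variance_penalized_cost(np.array([2.0, 1.5, -0.5, 1.8])))
print(variance_penalized_cost(np.array([1.0, 1.1, 0.9, 1.0])))
```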
W37 Multiple Instance Learning on Structured Data, D. Zhang, Y. Liu, L. Si, J. Zhang, R. Lawrence (continued): ...instances/bags within many applications of MIL. Ignoring this structure information limits the performance of existing MIL algorithms. This paper explores the research problem as multiple instance learning on structured data (MILSD) and formulates a novel framework that considers additional structure information. In particular, an effective and efficient optimization algorithm has been proposed to solve the original non-convex optimization problem by using a combination of the Concave-Convex Constraint Programming (CCCP) method and an adapted Cutting Plane method, which deals with two sets of constraints caused by learning on instances within individual bags and learning on structured data. Our method has a nice convergence property, with specified precision on each set of constraints. Experimental results on three different applications, i.e., webpage classification, market targeting, and protein fold identification, clearly demonstrate the advantages of the proposed method over state-of-the-art methods. Subject Area: Supervised Learning

W39 Projection onto a Nonnegative Max-Heap, J. Liu, L. Sun, J. Ye (continued): ...however, does not scale well. We reveal several important properties of the maximal root-tree, based on which we design a bottom-up algorithm with merge for efficiently finding the maximal root-tree. The proposed algorithm has a (worst-case) linear time complexity for a sequential list, and O(p²) for a general tree. We report simulation results showing the effectiveness of the max-heap for regression with an ordered tree structure. Empirical results show that the proposed algorithm has an expected linear time complexity for many special cases including a sequential list, a full binary tree, and a tree with depth 1. Subject Area: Supervised Learning
W41 Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection
R. Socher E. Huang J. Pennin A. Ng C. Manning Stanford University richard@socher.org ehhuang@stanford.edu jpennin@stanford.edu ang@cs.stanford.edu manning@stanford.edu
Paraphrase detection is the task of examining two sentences and determining whether they have the same meaning. In order to obtain high accuracy on this task, thorough syntactic and semantic analysis of the two statements is needed. We introduce a method for paraphrase detection based on recursive autoencoders (RAE). Our unsupervised RAEs are based on a novel unfolding objective and learn feature vectors for phrases in syntactic trees. These features are used to measure the word- and phrase-wise similarity between two sentences. Since sentences may be of arbitrary length, the resulting matrix of similarity measures is of variable size. We introduce a novel dynamic pooling layer which computes a fixed-sized representation from the variable-sized matrices. The pooled representation is then used as input to a classifier. Our method outperforms other state-of-the-art approaches on the challenging MSRP paraphrase corpus. Subject Area: Supervised Learning
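As a rough, self-contained illustration of pooling a variable-sized matrix into a fixed-size grid (the paper's exact pooling layer and similarity matrix construction are not reproduced here; min-pooling and the 8 x 8 output size are example choices):

```python
import numpy as np

def dynamic_min_pool(sim, out_size=8):
    """Pool a variable-sized similarity/distance matrix `sim` (n x m, assuming
    n, m >= out_size) into a fixed out_size x out_size grid by taking the
    minimum over each cell, so a downstream classifier always receives an
    input of the same dimensionality."""
    rows = np.array_split(np.arange(sim.shape[0]), out_size)
    cols = np.array_split(np.arange(sim.shape[1]), out_size)
    pooled = np.empty((out_size, out_size))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            pooled[i, j] = sim[np.ix_(r, c)].min()
    return pooled

# e.g. two sentences with 17 and 23 phrases give a 17 x 23 distance matrix,
# which is pooled down to 8 x 8 before being fed to the classifier.
pooled = dynamic_min_pool(np.random.rand(17, 23))
```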
W42 Practical Variational Inference for Neural Networks
A. Graves University of Toronto alex.graves@gmail.com
Variational methods have been previously explored as a tractable approximation to Bayesian inference for neural networks. However the approaches proposed so far have only been applicable to a few simple network architectures. This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks. Along the way it revisits several common regularisers from a variational perspective. It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and lead to improved generalisation. Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus. Subject Area: Supervised Learning
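To give a sense of what a stochastic variational treatment of network weights involves, here is a sketch only, not Graves' estimator: it assumes a diagonal Gaussian posterior, a unit Gaussian prior, and the reparameterization w = mu + sigma * eps, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(mu, log_sigma, prior_sigma=1.0):
    """KL( N(mu, sigma^2) || N(0, prior_sigma^2) ), summed over all weights.
    This is the complexity / description-length part of the objective."""
    sigma2 = np.exp(2.0 * log_sigma)
    return np.sum(np.log(prior_sigma) - log_sigma
                  + (sigma2 + mu ** 2) / (2.0 * prior_sigma ** 2) - 0.5)

def sample_weights(mu, log_sigma):
    """Reparameterized weight sample w = mu + sigma * eps, so that gradients
    with respect to mu and log_sigma can flow through an autodiff framework."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

# A training step would then minimize, per minibatch,
#     E_q[ -log p(data | w) ] + KL(q || prior),
# estimating the expectation with one or a few sampled weight vectors.
```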
W44 Sparse Recovery by Thresholded Non-negative Least Squares
M. Slawski, M. Hein
Non-negative data are commonly encountered in numerous fields, making non-negative least squares regression (NNLS) a frequently used tool. At least relative to its simplicity, it often performs rather well in practice. Serious doubts about its usefulness arise for modern high-dimensional linear models. Even in this setting - contrary to what first intuition may suggest - we show that for a broad class of designs, NNLS is resistant to overfitting and works excellently for sparse recovery when combined with thresholding, experimentally even outperforming L1 regularization. Since NNLS also circumvents the delicate choice of a regularization parameter, our findings suggest that NNLS may be the method of choice. Subject Area: Supervised Learning
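A minimal sketch of the recipe this abstract describes (non-negative least squares followed by hard thresholding), using SciPy's NNLS solver; the threshold `tau` and the toy data are illustrative, and the paper's analysis of how to choose the threshold is not reproduced:

```python
import numpy as np
from scipy.optimize import nnls

def thresholded_nnls(A, y, tau):
    """Fit non-negative least squares, then zero out coefficients below tau."""
    x, _ = nnls(A, y)
    return np.where(x >= tau, x, 0.0)

# Toy sparse recovery example.
rng = np.random.default_rng(0)
A = np.abs(rng.standard_normal((100, 50)))
x_true = np.zeros(50)
x_true[[3, 17, 42]] = [1.0, 2.0, 0.5]
y = A @ x_true + 0.01 * rng.standard_normal(100)
print(np.flatnonzero(thresholded_nnls(A, y, tau=0.1)))   # recovered support
```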
W45 Structured Sparse Coding via Lateral Inhibition
A. Szlam, K. Gregor, Y. LeCun
This work describes a conceptually simple method for structured sparse coding and dictionary design. Supposing a dictionary with K atoms, we introduce a structure as a set of penalties or interactions between every pair of atoms. We describe modifications of standard sparse coding algorithms for inference in this setting, and describe experiments showing that these algorithms are efficient. We show that interesting dictionaries can be learned for interactions that encode tree structures or locally connected structures. Finally, we show that our framework allows us to learn the values of the interactions from the data, rather than having them pre-specified. Subject Area: Supervised Learning
W46 Efficient Methods for Overlapping Group Lasso, L. Yuan, J. Liu, J. Ye (continued): ...convex dual problem, which allows the use of the gradient descent type of algorithms for the optimization. We have performed empirical evaluations using both synthetic and the breast cancer gene expression data set, which consists of 8,141 genes organized into (overlapping) gene sets. Experimental results show that the proposed algorithm is more efficient than existing state-of-the-art algorithms. Subject Area: Supervised Learning
W47 Generalized Beta Mixtures of Gaussians
A. Armagan, D. Dunson, M. Clyde
In recent years, a rich variety of shrinkage priors have been proposed that have great promise in addressing massive regression problems. In general, these new priors can be expressed as scale mixtures of normals, but have more complex forms and better properties than traditional Cauchy and double exponential priors. We first propose a new class of normal scale mixtures through a novel generalized beta distribution that encompasses many interesting priors as special cases. This encompassing framework should prove useful in comparing competing priors, considering properties and revealing close connections. We then develop a class of variational Bayes approximations through the new hierarchy presented that will scale more efficiently to the types of truly massive data sets that are now encountered routinely. Subject Area: Supervised Learning
W48 Sparse Manifold Clustering and Embedding
E. Elhamifar, R. Vidal
We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds. Subject Area: Manifold Learning
W50 Sparse Filtering
J. Ngiam, P. Koh, Z. Chen, S. Bhaskar, A. Ng
Unsupervised feature learning has been shown to be effective at learning representations that perform well on image, video and audio classification. However, many existing feature learning algorithms are hard to use and require extensive hyperparameter tuning. In this work, we present sparse filtering, a simple new algorithm which is efficient and only has one hyperparameter, the number of features to learn. In contrast to most other feature learning methods, sparse filtering does not explicitly attempt to construct a model of the data distribution. Instead, it optimizes a simple cost function -- the sparsity of ℓ2-normalized features -- which can easily be implemented in a few lines of MATLAB code. Sparse filtering scales gracefully to handle high-dimensional inputs, and can also be used to learn meaningful features in additional layers with greedy layer-wise stacking. We evaluate sparse filtering on natural images, object classification (STL-10), and phone classification (TIMIT), and show that our method works well on a range of different modalities. Subject Area: Unsupervised & Semi-supervised Learning
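Since the abstract notes that the cost fits in a few lines of MATLAB, an equivalent few-line Python sketch of that objective (soft absolute value, row- then column-wise l2 normalization, and an l1 sum) is shown below; optimizer details and the exact published formulation should be checked against the paper:

```python
import numpy as np

def sparse_filtering_cost(W, X, eps=1e-8):
    """Sparse filtering objective. W: (n_features, n_inputs), X: (n_inputs, n_examples)."""
    F = np.sqrt((W @ X) ** 2 + eps)                        # soft absolute value of responses
    F = F / np.sqrt((F ** 2).sum(axis=1, keepdims=True))   # normalize each feature across examples
    F = F / np.sqrt((F ** 2).sum(axis=0, keepdims=True))   # normalize each example across features
    return F.sum()                                         # l1 penalty on the normalized features

# In practice W is obtained by handing this cost (and its gradient, e.g. from an
# autodiff library) to an off-the-shelf unconstrained optimizer such as L-BFGS.
```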
W51 ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning
Q. Le A. Karpenko J. Ngiam A. Ng Stanford University quocle@stanford.edu akarpenko@stanford.edu jngiam@cs.stanford.edu ang@cs.stanford.edu
Independent Components Analysis (ICA) and its variants have been successfully used for unsupervised feature learning. However, standard ICA requires an orthonormality constraint to be enforced, which makes it difficult to learn overcomplete features. In addition, ICA is sensitive to whitening. These properties make it challenging to scale ICA to high dimensional data. In this paper, we propose a robust soft reconstruction cost for ICA that allows us to learn highly overcomplete sparse features even on unwhitened data. Our formulation reveals formal connections between ICA and sparse autoencoders, which have previously been observed only empirically. Our algorithm can be used in conjunction with off-the-shelf fast unconstrained optimizers. We show that the soft reconstruction cost can also be used to prevent replicated features in tiled convolutional neural networks. Using our method to learn highly overcomplete sparse features and tiled convolutional neural networks, we obtain competitive performances on a wide variety of object recognition tasks. We achieve state-of-the-art test accuracies on the STL-10 and Hollywood2 datasets. Subject Area: Unsupervised & Semi-supervised Learning
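A compact sketch of the kind of soft-reconstruction objective described above (the weighting `lam` is an illustrative choice, and the exact published cost should be taken from the paper):

```python
import numpy as np

def rica_cost(W, X, lam=0.5):
    """L1 sparsity on the responses plus a reconstruction penalty that replaces
    the hard orthonormality constraint of standard ICA.
    W: (n_features, n_inputs), X: (n_inputs, n_examples)."""
    n_examples = X.shape[1]
    responses = W @ X                                  # feature responses W x
    reconstruction = W.T @ responses                   # W^T W x, the linear reconstruction
    sparsity = np.abs(responses).sum() / n_examples
    recon_err = ((reconstruction - X) ** 2).sum() / n_examples
    return lam * sparsity + recon_err
```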
W54 Scalable Training of Mixture Models via Coresets
D. Feldman, M. Faulkner, A. Krause
How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of O(dk³/ε²) data points suffices for computing a (1 + ε)-approximation for the optimal model on the original n data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones. Subject Area: Unsupervised & Semi-supervised Learning
W55 Minimax Localization of Structural Information in Large Noisy Matrices
M. Kolar mladenk@cs.cmu.edu S. Balakrishnan sbalakri@cs.cmu.edu A. Rinaldo arinaldo@stat.cmu.edu A. Singh aartisingh@cmu.edu Carnegie Mellon University We consider the problem of identifying a sparse set of relevant columns and rows in a large data matrix with highly corrupted entries. This problem of identifying groups from a collection of bipartite variables such as proteins and drugs, biological species and gene sequences, malware and signatures, etc is commonly referred to as biclustering or co-clustering. Despite its great practical relevance, and although several ad-hoc methods are available for biclustering, theoretical analysis of the problem is largely non-existent. The problem we consider is also closely related to structured multiple hypothesis testing, an area of statistics that has recently witnessed a flurry of activity. We make the following contributions: i) We prove lower bounds on the minimum signal strength needed for successful recovery of a bicluster as a function of the noise variance, size of the matrix and bicluster of interest. ii) We show that a combinatorial procedure based on the scan statistic achieves this optimal limit. iii) We characterize the SNR required by several computationally tractable procedures for biclustering including elementwise thresholding, column/row average thresholding and a convex relaxation approach to sparse singular vector decomposition. Subject Area: Unsupervised & Semi-supervised Learning
W58 The Manifold Tangent Classifier
S. Rifai, Y. Dauphin, P. Vincent, Y. Bengio, X. Muller
We combine three important ideas present in previous work for building classifiers: the semi-supervised hypothesis (the input distribution contains information about the classifier), the unsupervised manifold hypothesis (data density concentrates near low-dimensional manifolds), and the manifold hypothesis for classification (different classes correspond to disjoint manifolds separated by low density). We exploit a new algorithm for capturing manifold structure (high-order contractive autoencoders) and we show how it builds a topological atlas of charts, each chart being characterized by the principal singular vectors of the Jacobian of a representation mapping. This representation learning algorithm can be stacked to yield a deep architecture, and we combine it with a domain knowledge-free version of the TangentProp algorithm to encourage the classifier to be insensitive to local direction changes along the manifold. Record-breaking results are obtained and we find that the learned tangent directions are very meaningful. Subject Area: Unsupervised & Semi-supervised Learning
W59 Dimensionality Reduction Using the Sparse Linear Model
I. Gkioulekas igkiou@seas.harvard.edu T. Zickler zickler@eecs.harvard.edu Harvard University
We propose an approach for linear unsupervised dimensionality reduction, based on the sparse linear model that has been used to probabilistically interpret sparse coding. We formulate an optimization problem for learning a linear projection from the original signal domain to a lower-dimensional one in a way that approximately preserves, in expectation, pairwise inner products in the sparse domain. We derive solutions to the problem, present nonlinear extensions, and discuss relations to compressed sensing. Our experiments using facial images, texture patches, and images of object categories suggest that the approach can improve our ability to recover meaningful structure in many classes of signals. Subject Area: Unsupervised and Semi-supervised Learning\ICA, PCA, CCA & Other Linear Models

W61 Divide-and-Conquer Matrix Factorization, L. Mackey, A. Talwalkar, M. Jordan (continued): ...algorithm, and combines the subproblem solutions using techniques from randomized matrix approximation. Our experiments with collaborative filtering, video background modeling, and simulated data demonstrate the near-linear to super-linear speed-ups attainable with this approach. Moreover, our analysis shows that DFC enjoys high-probability recovery guarantees comparable to those of its base algorithm. Subject Area: Unsupervised & Semi-supervised Learning
W60 Large-Scale Sparse Principal Component Analysis with Application to Text Data
Y. Zhang L. Ghaoui UC Berkeley zyw@eecs.berkeley.edu elghaoui@berkeley.edu
Sparse PCA provides a linear combination of a small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. Subject Area: Unsupervised & Semi-supervised Learning
W62 Convergent Bounds on the Euclidean Distance
Y. Hwang, H. Ahn
Given a set V of n vectors in d-dimensional space, we provide an efficient method for computing quality upper and lower bounds of the Euclidean distances between a pair of the vectors in V. For this purpose, we define a distance measure, called the MS-distance, by using the mean and the standard deviation values of vectors in V. Once we compute the mean and the standard deviation values of vectors in V in O(dn) time, the MS-distance between them provides upper and lower bounds of Euclidean distance between a pair of vectors in V in constant time. Furthermore, these bounds can be refined further such that they converge monotonically to the exact Euclidean distance within d refinement steps. We also provide an analysis on a random sequence of refinement steps which can justify why MS-distance should be refined to provide very tight bounds in a few steps of a typical sequence. The MS-distance can be used in various problems where the Euclidean distance is used to measure the proximity or similarity between objects. We provide experimental results on the nearest and the farthest neighbor searches. Subject Area: Unsupervised & Semi-supervised Learning
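The abstract does not spell the bounds out, but one simple way to obtain constant-time bounds of this flavour from means and standard deviations follows from the Cauchy-Schwarz inequality; the sketch below implements that elementary derivation and should not be read as the paper's exact MS-distance:

```python
import numpy as np

def mean_std_bounds(x, y):
    """Bounds on ||x - y||^2 using only means and (population) standard deviations:
    d * ((mx - my)^2 + (sx - sy)^2)  <=  ||x - y||^2  <=  d * ((mx - my)^2 + (sx + sy)^2),
    since the inner product of the two centered vectors is at most d * sx * sy in magnitude."""
    d = x.size
    mx, my, sx, sy = x.mean(), y.mean(), x.std(), y.std()
    base = (mx - my) ** 2
    return d * (base + (sx - sy) ** 2), d * (base + (sx + sy) ** 2)

x = np.random.default_rng(1).standard_normal(1000)
y = np.random.default_rng(2).standard_normal(1000) + 0.3
lo, hi = mean_std_bounds(x, y)
assert lo <= np.sum((x - y) ** 2) <= hi
```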
W63 Directed Graph Embedding: An Algorithm Based on Continuous Limits of Laplacian-Type Operators
D. Perrault-Joncas dcpj@stat.washington.edu M. Meila mmp@stat.washington.edu University of Washington
This paper considers the problem of embedding directed graphs in Euclidean space while retaining directional information. We model the observed graph as a sample from a manifold endowed with a vector field, and we design an algorithm that separates and recovers the features of this process: the geometry of the manifold, the data density and the vector field. The algorithm is motivated by our analysis of Laplacian-type operators and their continuous limit as generators of diffusions on a manifold. We illustrate the recovery algorithm on both artificially constructed and real data. Subject Area: Unsupervised & Semi-supervised Learning
W64 Improving Topic Coherence with Regularized Topic Models
D. Newman newman@uci.edu University of California, Irvine E. Bonilla edwin.bonilla@nicta.com.au W. Buntine wray.buntine@nicta.com.au NICTA
Topic models have the potential to improve search and browsing by extracting useful semantic themes from web pages and other text documents. When learned topics are coherent and interpretable, they can be valuable for faceted browsing, results set diversity analysis, and document retrieval. However, when dealing with small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful. To overcome this, we propose two methods to regularize the learning of topic models. Our regularizers work by creating a structured prior over words that reflect broad patterns in the external data. Using thirteen datasets we show that both regularizers improve topic coherence and interpretability while learning a faithful representation of the collection of interest. Overall, this work makes topic models more useful across a broader range of text data. Subject Area: Unsupervised & Semi-supervised Learning

W66 Complexity of Inference in Latent Dirichlet Allocation, D. Sontag, D. Roy (continued): ...we show that, when a document has a large number of topics, finding the MAP assignment of topics to words in LDA is NP-hard. Next, we consider the problem of finding the MAP topic distribution for a document, where the topic-word assignments are integrated out. We show that this problem is also NP-hard. Finally, we briefly discuss the problem of sampling from the posterior, showing that this is NP-hard in one restricted setting, but leaving open the general question. Subject Area: Unsupervised & Semi-supervised Learning
W67 Hierarchically Supervised Latent Dirichlet Allocation
A. Perotte, F. Wood, N. Elhadad, N. Bartlett
We introduce hierarchically supervised latent Dirichlet allocation (HSLDA), a model for hierarchically and multiply labeled bag-of-word data. Examples of such data include web pages and their placement in directories, product descriptions and associated categories from product hierarchies, and free-text clinical records and their assigned diagnosis codes. Out-of-sample label prediction is the primary goal of this work, but improved lower-dimensional representations of the bag-of-word data are also of interest. We demonstrate HSLDA on large-scale data from clinical document labeling and retail product categorization tasks. We show that leveraging the structure from hierarchical labels improves out-of-sample label prediction substantially when compared to models that do not. Subject Area: Unsupervised & Semi-supervised Learning
W69 A Concave Regularization Technique for Sparse Mixture Models
M. Larsson mol23@cornell.edu J. Ugander jhu5@cornell.edu Cornell University
Latent variable mixture models are a powerful tool for exploring the structure in large datasets. A common challenge for interpreting such models is a desire to impose sparsity, the natural assumption that each data point only contains few latent features. Since mixture distributions are constrained in their L1 norm, typical sparsity techniques based on L1 regularization become toothless, and concave regularization becomes necessary. Unfortunately concave regularization typically results in EM algorithms that must perform problematic non-concave M-step maximizations. In this work, we introduce a technique for circumventing this difficulty, using the so-called Mountain Pass Theorem to provide easily verifiable conditions under which the M-step is well-behaved despite the lacking concavity. We also develop a correspondence between logarithmic regularization and what we term the pseudo-Dirichlet distribution, a generalization of the ordinary Dirichlet distribution well-suited for inducing sparsity. We demonstrate our approach on a text corpus, inferring a sparse topic mixture model for 2,406 weblogs. Subject Area: Optimization

W71 Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization, M. Schmidt, N. Le Roux, F. Bach (continued): ...second term. We show that the basic proximal-gradient method, the basic proximal-gradient method with a strong convexity assumption, and the accelerated proximal-gradient method achieve the same convergence rates as in the error-free case, provided the errors decrease at an appropriate rate. Our experimental results on a structured sparsity problem indicate that sequences of errors with these appealing theoretical properties can lead to practical performance improvements. Subject Area: Optimization
W72 Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
Z. Lin zhoulin@microsoft.com Microsoft Research Asia R. Liu rsliu0705@gmail.com Z. Su zxsu@dlut.edu.cn Dalian University of Technology
Many machine learning and signal processing problems can be formulated as linearly constrained convex programs, which could be efficiently solved by the alternating direction method (ADM). However, usually the subproblems in ADM are easily solvable only when the linear mappings in the constraints are identities. To address this issue, we propose a linearized ADM (LADM) method by linearizing the quadratic penalty term and adding a proximal term when solving the subproblems. For fast convergence, we also allow the penalty to change adaptively according to a novel update rule. We prove the global convergence of LADM with adaptive penalty (LADMAP). As an example, we apply LADMAP to solve low-rank representation (LRR), which is an important subspace clustering technique yet suffers from high computation cost. By combining LADMAP with a skinny SVD representation technique, we are able to reduce the complexity O(n³) of the original ADM based method to O(rn²), where r and n are the rank and size of the representation matrix, respectively, hence making LRR possible for large scale applications. Numerical experiments verify that for LRR our LADMAP based methods are much faster than state-of-the-art algorithms. Subject Area: Optimization
W73 Approximating Semidefinite Programs in Sublinear Time
D. Garber, E. Hazan
In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric. Subject Area: Optimization\Convex Optimization
W74 Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
B. Recht brecht@cs.wisc.edu C. Re chrisre@cs.wisc.edu S. Wright swright@cs.wisc.edu F. Niu leonn@cs.wisc.edu University of Wisconsin-Madison
Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented *without any locking*. We present an update scheme called Hogwild which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then Hogwild achieves a nearly optimal rate of convergence. We demonstrate experimentally that Hogwild outperforms alternative schemes that use locking by an order of magnitude. Subject Area: Optimization\Stochastic Methods
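The following toy sketch conveys the lock-free idea on sparse least-squares examples (Python threads are used purely for illustration; the paper's setting is true shared-memory multicore, and the data layout here is an assumption of this example):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def hogwild_style_sgd(examples, dim, n_threads=4, lr=0.1, epochs=5):
    """Lock-free SGD for sparse least squares. Each example is a tuple
    (idx, vals, target), where `idx` lists the nonzero coordinates. Worker
    threads update the shared weight vector in place without any locking;
    with sparse examples, updates rarely touch the same coordinates."""
    w = np.zeros(dim)                       # shared parameter vector

    def worker(chunk):
        for idx, vals, target in chunk:
            pred = vals @ w[idx]
            w[idx] -= lr * (pred - target) * vals   # unsynchronized in-place update

    for _ in range(epochs):
        chunks = [examples[i::n_threads] for i in range(n_threads)]
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            list(pool.map(worker, chunks))
    return w
```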
W77 k-NN Regression Adapts to Local Intrinsic Dimension
S. Kpotufe
Many nonparametric regressors were recently shown to converge at rates that depend only on the intrinsic dimension of data. These regressors thus escape the curse of dimension when high-dimensional data has low intrinsic dimension (e.g. a manifold). We show that k-NN regression is also adaptive to intrinsic dimension. In particular our rates are local to a query x and depend only on the way masses of balls centered at x vary with radius. Furthermore, we show a simple way to choose k = k(x) locally at any x so as to nearly achieve the minimax rate at x in terms of the unknown intrinsic dimension in the vicinity of x. We also establish that the minimax rate does not depend on a particular choice of metric space or distribution, but rather that this minimax rate holds for any metric space and doubling measure. Subject Area: Learning Theory
W75 Beating SGD: Learning SVMs in Sublinear Time
E. Hazan, T. Koren, N. Srebro
We present an optimization approach for linear SVMs based on a stochastic primal-dual approach, where the primal step is akin to an importance-weighted SGD, and the dual step is a stochastic update on the importance weights. This yields an optimization method with a sublinear dependence on the training set size, and the first method for learning linear SVMs with runtime less than the size of the training set required for learning! Subject Area: Optimization\Stochastic Methods
W76 Learning Large-Margin Halfspaces with More Malicious Noise
P. Long, R. Servedio
We describe a simple algorithm that runs in time poly(n, 1/γ, 1/ε) and learns an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε in the presence of malicious noise, when the noise rate is allowed to be as high as Θ(εγ √log(1/γ)). Previous efficient algorithms could only learn to accuracy ε in the presence of malicious noise of rate at most Θ(εγ). Our algorithm does not work by optimizing a convex loss function. We show that no algorithm for learning γ-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than Θ(εγ); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm. Subject Area: Learning Theory
W78 A Collaborative Mechanism for Crowdsourcing Prediction Problems
J. Abernethy, R. Frongillo
Machine Learning competitions such as the Netflix Prize have proven reasonably successful as a method of "crowdsourcing" prediction tasks. But these competitions have a number of weaknesses, particularly in the incentive structure they create for the participants. We propose a new approach, called a Crowdsourced Learning Mechanism, in which participants collaboratively "learn" a hypothesis for a given prediction task. The approach draws heavily from the concept of a prediction market, where traders bet on the likelihood of a future event. In our framework, the mechanism continues to publish the current hypothesis, and participants can modify this hypothesis by wagering on an update. The critical incentive property is that a participant will profit an amount that scales according to how much her update improves performance on a released test set. Subject Area: Theory
W79 Multi-Armed Bandits on Implicit Metric Spaces
A. Slivkins Microsoft Research slivkins@microsoft.com action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known ``experts setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds. Subject Area: Theory\Online Learning
The multi-armed bandit (MAB) setting is a useful abstraction of many online learning tasks which focuses on the tradeoff between exploration and exploitation. In this setting, an online algorithm has a fixed set of alternatives (arms), and in each round it selects one arm and then observes the corresponding reward. While the case of small number of arms is by now well-understood, a lot of recent work has focused on multi-armed bandits with (infinitely) many arms, where one needs to assume extra structure in order to make the problem tractable. In particular, in the Lipschitz MAB problem there is an underlying similarity metric space, known to the algorithm, such that any two arms that are close in this metric space have similar payoffs. In this paper we consider the more realistic scenario in which the metric space is *implicit* -- it is defined by the available structure but not revealed to the algorithm directly. Specifically, we assume that an algorithm is given a tree-based classification of arms. For any given problem instance such a classification implicitly defines a similarity metric space, but the numerical similarity information is not available to the algorithm. We provide an algorithm for this setting, whose performance guarantees (almost) match the best known guarantees for the corresponding instance of the Lipschitz MAB problem. Subject Area: Theory\Online Learning
We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer's UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), and Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. Subject Area: Theory\Online Learning
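The confidence-bound idea underlying UCB-style algorithms can be illustrated with a minimal sketch of the classic UCB1 rule; this is a generic illustration, not the modified algorithm or the tighter confidence sets described above, and the Bernoulli arm means are assumed toy values.

```python
import numpy as np

def ucb1(pull, n_arms, horizon):
    """Generic UCB1: play each arm once, then always play the arm with the
    largest empirical mean plus an exploration bonus."""
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:                               # initialisation: one pull per arm
            arm = t - 1
        else:
            bonus = np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(sums / counts + bonus))
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total / horizon

# Toy Bernoulli bandit with assumed arm means; the best arm should dominate over time.
rng = np.random.default_rng(1)
means = [0.2, 0.5, 0.7]
print("average reward:", ucb1(lambda a: float(rng.random() < means[a]), len(means), 5000))
```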
We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known "experts" setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds. Subject Area: Theory\Online Learning
W84 Predicting Dynamic Difficulty
O. Missura T. Gaertner University of Bonn olanochka@gmail.com thomas.gaertner@iais.fraunhofer.de
Motivated by applications in electronic games as well as teaching systems, we investigate the problem of dynamic difficulty adjustment. The task here is to repeatedly find a game difficulty setting that is neither 'too easy', boring the player, nor 'too difficult', overburdening the player. The contributions of this paper are (i) a formulation of difficulty adjustment as an online learning problem on partially ordered sets, (ii) an exponential update algorithm for dynamic difficulty adjustment, (iii) a bound on the number of wrong difficulty settings relative to the best static setting chosen in hindsight, and (iv) an empirical investigation of the algorithm when playing against adversaries. Subject Area: Theory\Online Learning
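As a rough illustration of the exponential-update idea, here is a generic multiplicative-weights sketch over a small set of candidate difficulty levels; it is not the partially-ordered-set algorithm of the paper, and the loss signal and learning rate below are assumptions made for the example.

```python
import numpy as np

def exp_weights_difficulty(levels, feedback, rounds, eta=0.5, seed=0):
    """Generic multiplicative-weights update over candidate difficulty settings.
    `feedback(level)` returns a loss in [0, 1], e.g. 1 if the chosen setting was
    wrong (too easy / too hard) and 0 otherwise."""
    rng = np.random.default_rng(seed)
    w = np.ones(len(levels))
    history = []
    for _ in range(rounds):
        p = w / w.sum()
        i = rng.choice(len(levels), p=p)          # sample a difficulty setting
        loss = feedback(levels[i])
        w[i] *= np.exp(-eta * loss)               # downweight settings that failed
        history.append((levels[i], loss))
    return history

# Hypothetical player who is only satisfied by the "medium" setting.
hist = exp_weights_difficulty(
    ["easy", "medium", "hard"],
    lambda lvl: 0.0 if lvl == "medium" else 1.0,
    rounds=200)
print("last 20 choices:", [lvl for lvl, _ in hist[-20:]])
```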
W86 Optimal Learning Rates for Least Squares SVMs Using Gaussian Kernels
M. Eberts Mona.Eberts@mathematik.uni-stuttgart.de I. Steinwart ingo.steinwart@mathematik.uni-stuttgart.de University of Stuttgart We prove a new oracle inequality for support vector machines with Gaussian RBF kernels solving the regularized least squares regression problem. To this end, we apply the modulus of smoothness. With the help of the new oracle inequality we then derive learning rates that can also be achieved by a simple data-dependent parameter selection method. Finally, it turns out that our learning rates are asymptotically optimal for regression functions satisfying certain standard smoothness conditions. Subject Area: Theory\Statistical Learning Theory
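A minimal numpy sketch of the regularized least squares estimator with a Gaussian RBF kernel may help fix ideas; the bandwidth and regularization values below are arbitrary assumptions, and the oracle-inequality analysis itself is of course not reproduced here.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def fit_ls_svm(X, y, gamma=1.0, lam=1e-2):
    """Regularized least squares in the RKHS: alpha = (K + n*lam*I)^{-1} y."""
    K = gaussian_kernel(X, X, gamma)
    n = len(y)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return lambda Xnew: gaussian_kernel(Xnew, X, gamma) @ alpha

# Toy 1-d regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
predict = fit_ls_svm(X, y, gamma=0.5, lam=1e-3)
print("train RMSE:", np.sqrt(np.mean((predict(X) - y) ** 2)))
```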
W89 Learning to Learn with Compound HD Models
R. Salakhutdinov rsalakhu@utstat.toronto.edu University of Toronto J. Tenenbaum jbt@mit.edu A. Torralba torralba@csail.mit.edu Massachusetts Institute of Technology We introduce HD (or "Hierarchical-Deep") models, a new compositional learning architecture that integrates deep learning models with structured hierarchical Bayesian models. Specifically we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a Deep Boltzmann Machine (DBM). This compound HDP-DBM model learns to learn novel concepts from very few training examples, by learning low-level generic features, high-level features that capture correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets. Subject Area: Probabilistic Models and Methods
Motivated by the spread of on-line information in general and on-line petitions in particular, recent research has raised the following combinatorial estimation problem. There is a tree T that we cannot observe directly (representing the structure along which the information has spread), and certain nodes randomly decide to make their copy of the information public. In the case of a petition, the list of names on each public copy of the petition also reveals a path leading back to the root of the tree. What can we conclude about the properties of the tree we observe from these revealed paths, and can we use the structure of the observed tree to estimate the size of the full unobserved tree T? Here we provide the first algorithm for this size estimation task, together with provable guarantees on its performance. We also establish structural properties of the observed tree, providing the first rigorous explanation for some of the unusual structural phenomena present in the spread of real chain-letter petitions on the Internet. Subject Area: Probabilistic Models and Methods
Biased labelers are a systemic problem in crowdsourcing, and a comprehensive toolbox for handling their responses is still being developed. A typical crowdsourcing application can be divided into three steps: data collection, data curation, and learning. At present these steps are often treated separately. We present Bayesian Bias Mitigation for Crowdsourcing (BBMC), a Bayesian model to unify all three. Most data curation methods account for the effects of labeler bias by modeling all labels as coming from a single latent truth. Our model captures the sources of bias by describing labelers as influenced by shared random effects. This approach can account for more complex bias patterns that arise in ambiguous or hard labeling tasks and allows us to merge data curation and learning into a single computation. Active learning
integrates data collection with learning, but is commonly considered infeasible with Gibbs sampling inference. We propose a general approximation strategy for Markov chains to efficiently quantify the effect of a perturbation on the stationary distribution and specialize this approach to active learning. Experiments show BBMC to outperform many common heuristics. Subject Area: Probabilistic Models and Methods
W96 Efficient Inference in Matrix-variate Gaussian Models with i.i.d. Observation Noise
O. Stegle oliver.stegle@tuebingen.mpg.de K. Borgwardt karsten.borgwardt@tuebingen.mpg.de C. Lippert christoph.lippert@tuebingen.mpg.de Max Planck Institute Biological Cybernetics J. Mooij j.mooij@cs.ru.nl Radboud University Nijmegen N. Lawrence N.Lawrence@shef.ac.uk University of Sheffield Inference in matrix-variate Gaussian models has major applications for multi-output prediction and joint learning of row and column covariances from matrix-variate data. Here, we discuss an approach for efficient inference in such models that explicitly accounts for i.i.d. observation noise. Computational tractability can be retained by exploiting the Kronecker product between row and column covariance matrices. Using this framework, we show how to generalize the Graphical Lasso in order to learn a sparse inverse covariance between features while accounting for a low-rank confounding covariance between samples. We show practical utility on applications to biology, where we model covariances with more than 100,000 dimensions. We find greater accuracy in recovering biological network structures and are able to better reconstruct the confounders. Subject Area: Probabilistic Models and Methods
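The computational saving from the Kronecker structure can be illustrated with a small sketch: for row covariance R, column covariance C and i.i.d. noise variance sigma^2, solving (C ⊗ R + σ²I) vec(X) = vec(Y) reduces to two eigendecompositions and elementwise operations. This is a standard linear-algebra identity shown under the assumption that R and C are symmetric positive semi-definite; it is not the full inference procedure of the paper.

```python
import numpy as np

def kron_gaussian_solve(R, C, Y, sigma2):
    """Solve (C kron R + sigma2*I) vec(X) = vec(Y) without forming the
    Kronecker product, using eigendecompositions of R and C."""
    dr, Ur = np.linalg.eigh(R)      # R = Ur diag(dr) Ur'
    dc, Uc = np.linalg.eigh(C)      # C = Uc diag(dc) Uc'
    Z = Ur.T @ Y @ Uc               # rotate into the joint eigenbasis
    Z /= dr[:, None] * dc[None, :] + sigma2
    return Ur @ Z @ Uc.T

# Check against the naive dense solve on a tiny example.
rng = np.random.default_rng(0)
A, B = rng.standard_normal((6, 6)), rng.standard_normal((4, 4))
R, C = A @ A.T, B @ B.T
Y = rng.standard_normal((6, 4))
X = kron_gaussian_solve(R, C, Y, sigma2=0.3)
dense = np.kron(C, R) + 0.3 * np.eye(24)
X_naive = np.linalg.solve(dense, Y.flatten(order="F")).reshape((6, 4), order="F")
print("max abs difference:", np.abs(X - X_naive).max())
```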
Approximate inference is an important technique for dealing with large, intractable graphical models based on the exponential family of distributions. We extend the idea of approximate inference to the t-exponential family by defining a new t-divergence. This divergence measure is obtained via convex duality between the log-partition function of the t-exponential family and a new t-entropy. We illustrate our approach on the Bayes Point Machine with a Student's t-prior. Subject Area: Probabilistic Models and Methods
Traditional approaches to probabilistic inference such as loopy belief propagation and Gibbs sampling typically compute marginals for all the unobserved variables in a graphical model. However, in many real-world applications the user's interests are focused on a subset of the variables, specified by a query. In this case it would be wasteful to uniformly sample, say, one million variables when the query concerns only ten. In this paper we propose a query-specific approach to MCMC that accounts for the query variables and their generalized mutual information with neighboring variables in order to achieve higher computational efficiency. Surprisingly there has been almost no previous work on query-aware MCMC. We demonstrate the success of our approach with positive experimental results on a wide range of graphical models. Subject Area: Probabilistic Models and Methods
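To make the notion of computing a marginal for a small query set concrete, the sketch below estimates the marginal of a single query variable in a binary chain MRF with a plain Gibbs sampler; this is only a baseline illustration, not the query-aware sampling scheme of the paper, and the coupling strength is an assumed toy value.

```python
import numpy as np

def gibbs_query_marginal(n_nodes, J, query, n_sweeps=5000, burn_in=500, seed=0):
    """Estimate P(x_query = +1) in a chain Ising model p(x) ~ exp(J * sum_i x_i x_{i+1})
    with x_i in {-1, +1}, using plain Gibbs sampling."""
    rng = np.random.default_rng(seed)
    x = rng.choice([-1, 1], size=n_nodes)
    hits, total = 0, 0
    for sweep in range(n_sweeps):
        for i in range(n_nodes):
            h = 0.0
            if i > 0:
                h += x[i - 1]
            if i < n_nodes - 1:
                h += x[i + 1]
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * J * h))   # conditional of x_i given neighbors
            x[i] = 1 if rng.random() < p_plus else -1
        if sweep >= burn_in:
            hits += (x[query] == 1)
            total += 1
    return hits / total

# With no external field the true marginal is 0.5 by symmetry; the estimate should be close.
print("P(x_2 = +1) ~", gibbs_query_marginal(n_nodes=5, J=0.8, query=2))
```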
W98 Iterative Learning for Reliable Crowdsourcing Systems
D. Karger karger@mit.edu S. Oh sewoong79@gmail.com D. Shah devavrat@mit.edu Massachusetts Institute of Technology Crowdsourcing systems, in which tasks are electronically distributed to numerous "information piece-workers", have emerged as an effective paradigm for human-powered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give new algorithms for deciding which tasks to assign to which workers and for inferring correct answers from the workers' answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker. Subject Area: Probabilistic Models and Methods
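To make the contrast with plain majority voting concrete, here is a small sketch of a generic iterative scheme that alternates between estimating answers by weighted vote and estimating worker reliabilities from agreement with those answers; this only illustrates the general idea, it is not the message-passing algorithm analyzed in the paper, and the simulated worker reliabilities are assumptions.

```python
import numpy as np

def iterative_vote(labels, n_iter=10):
    """labels: (n_workers, n_tasks) array of +/-1 answers (0 = task not assigned).
    Alternate between weighted-majority answer estimates and per-worker
    reliability weights derived from agreement with those estimates."""
    weights = np.ones(labels.shape[0])
    for _ in range(n_iter):
        answers = np.sign(weights @ labels)           # weighted vote per task
        answers[answers == 0] = 1
        assigned = labels != 0
        agree = ((labels == answers) & assigned).sum(1) / np.maximum(assigned.sum(1), 1)
        weights = np.clip(2.0 * agree - 1.0, 0.0, None)   # crude reliability estimate
    return answers

# Simulated workers with mixed reliabilities answering 200 binary tasks.
rng = np.random.default_rng(0)
truth = rng.choice([-1, 1], size=200)
reliab = np.array([0.95, 0.9, 0.6, 0.55, 0.5])
labels = np.array([np.where(rng.random(200) < p, truth, -truth) for p in reliab])
est = iterative_vote(labels)
print("accuracy of weighted vote:", (est == truth).mean())
```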
W100 On Learning Discrete Graphical Models Using Greedy Methods
A. Jalali alij@mail.utexas.edu C. Johnson cjohnson@cs.utexas.edu P. Ravikumar pradeepr@cs.utexas.edu University of Texas, Austin In this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum node-degree d and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. Our result guarantees graph selection for samples scaling as n = Ω(d^2 log p), in contrast to existing convex-optimization based algorithms that require a sample complexity of n = Ω(d^3 log p). Further, the greedy algorithm only requires a restricted strong convexity condition which is typically milder than irrepresentability assumptions. We corroborate these results using numerical simulations at the end. Subject Area: Probabilistic Models and Methods
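The forward-backward greedy idea can be sketched for the simpler case of sparse least-squares regression (the paper applies it to general statistical models and to neighborhood estimation in discrete graphical models); the stopping threshold, backward fraction, and squared-error loss below are assumptions made for this illustration.

```python
import numpy as np

def fit(X, y, S):
    """Least-squares loss restricted to support S."""
    if not S:
        return (y ** 2).mean()
    beta, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
    return ((y - X[:, S] @ beta) ** 2).mean()

def forward_backward_greedy(X, y, eps=1e-3, nu=0.5):
    S, loss = [], fit(X, y, [])
    for _ in range(2 * X.shape[1]):                  # cap iterations for safety
        # Forward step: add the coordinate giving the largest loss decrease.
        gains = [(loss - fit(X, y, S + [j]), j) for j in range(X.shape[1]) if j not in S]
        if not gains:
            break
        gain, j = max(gains)
        if gain < eps:
            break
        S.append(j)
        loss -= gain
        # Backward steps: drop coordinates whose removal costs only a fraction of the gain.
        while len(S) > 1:
            cost, r = min((fit(X, y, [k for k in S if k != c]) - loss, c) for c in S)
            if cost >= nu * gain:
                break
            S.remove(r)
            loss += cost
    return sorted(S)

# Toy sparse regression: only features 0, 3, 7 are active.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 12))
y = 2 * X[:, 0] - 3 * X[:, 3] + 1.5 * X[:, 7] + 0.1 * rng.standard_normal(300)
print("recovered support:", forward_backward_greedy(X, y))
```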
W99 High-dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions
A. Anandkumar animakumar@gmail.com UC Irvine V. Tan vtan@wisc.edu University of Wisconsin-Madison A. Willsky willsky@mit.edu Massachusetts Institute of Technology We consider the problem of Ising and Gaussian graphical model selection given n i.i.d. samples from the model. We propose an efficient threshold-based algorithm for structure estimation based on a conditional mutual information test. This simple local algorithm requires only low-order statistics of the data and decides whether two nodes are neighbors in the unknown graph. Under some transparent assumptions, we establish that the proposed algorithm is structurally consistent (or sparsistent) when the number of samples scales as n = Ω(Jmin^{-4} log p), where p is the number of nodes and Jmin is the minimum edge potential. We also prove novel non-asymptotic necessary conditions for graphical model selection. Subject Area: Probabilistic Models and Methods
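As a rough illustration of threshold-based structure estimation, the sketch below estimates pairwise (unconditional) mutual information from binary samples and keeps edges above a threshold; the actual algorithm uses a conditional mutual information test minimized over small conditioning sets, which is omitted here, and the threshold and chain model are assumptions for the example.

```python
import numpy as np
from itertools import combinations

def empirical_mi(x, y):
    """Empirical mutual information (in nats) between two +/-1 samples."""
    mi = 0.0
    for a in (-1, 1):
        for b in (-1, 1):
            pxy = np.mean((x == a) & (y == b))
            px, py = np.mean(x == a), np.mean(y == b)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def threshold_graph(samples, thresh=0.1):
    """Return the edge set {(i, j) : estimated MI(X_i, X_j) > thresh}."""
    p = samples.shape[1]
    return [(i, j) for i, j in combinations(range(p), 2)
            if empirical_mi(samples[:, i], samples[:, j]) > thresh]

# Samples from a simple Markov chain X0 -> X1 -> X2 -> X3 (true edges along the chain).
rng = np.random.default_rng(0)
n, p, flip = 5000, 4, 0.2
X = np.empty((n, p), dtype=int)
X[:, 0] = rng.choice([-1, 1], size=n)
for t in range(1, p):
    noise = rng.random(n) < flip
    X[:, t] = np.where(noise, -X[:, t - 1], X[:, t - 1])
print("estimated edges:", threshold_graph(X))
```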
W101 Clustered Multi-Task Learning via Alternating Structure Optimization
J. Zhou jiayu.zhou@asu.edu J. Chen Jianhui.Chen@asu.edu J. Ye jieping.ye@asu.edu Arizona State University Multi-task learning (MTL) learns multiple related tasks simultaneously to improve generalization performance. Alternating structure optimization (ASO) is a popular MTL method that learns a shared low-dimensional predictive structure on hypothesis spaces from multiple related tasks. It has been applied successfully in many real world applications. As an alternative MTL approach, clustered multi-task learning (CMTL) assumes that multiple tasks follow a clustered structure, i.e., tasks are partitioned into a set of groups where tasks in the same group are similar to each other, and that such a clustered structure is unknown a priori. The objectives in ASO and CMTL differ in how multiple tasks are related. Interestingly, we show in this paper the equivalence relationship between ASO and CMTL, providing significant new insights into ASO and CMTL as well as their inherent relationship. The CMTL formulation is non-convex, and we adopt a convex relaxation to the CMTL formulation. We further establish the equivalence relationship between the proposed convex relaxation of CMTL and an existing convex relaxation of ASO, and show that the proposed convex CMTL formulation is significantly more efficient especially for high-dimensional data. In addition, we present three algorithms for solving the convex CMTL formulation. We report experimental results on benchmark datasets to demonstrate the efficiency of the proposed algorithms. Subject Area: Probabilistic Models and Methods
Demonstrations Abstracts
1B Real-Time Social Media Analysis with TWIMPACT
Mikio Braun Matthias Jugel Klaus-Robert Müller TU Berlin Social media analysis has attracted quite some interest recently. Typical questions are identifying current trends, rating the impact or influence of users, summarizing what people are saying on some topic or brand, and so on. One challenge is the enormous amount of data you need to process. Current approaches often need to filter the data first and then do the actual analysis after the event. However, ideally you would want to be able to look into the stream of social media updates in real-time as the event unfolds, and not a few days or weeks later when you've processed your data. We will showcase the real-time social media analysis engine developed by TWIMPACT in cooperation with TU Berlin. You will be able to look at the current trends on Twitter in real-time, and also to explore the past year, which we will bring in pre-analyzed form, so that you can zoom in on the historic events of this year, such as the revolutions in Egypt and Libya.
Thursday Conference
ORAL SESSION
Session 11 - 9:30 - 10:40 AM, Session Chair: Mate Lengyel
ORAL SESSION
Session 12 - 10:40 - 11:20 AM, Session Chair: Sasha Rakhlin
INVITED TALK: The Neuronal Replicator Hypothesis: Novel Mechanisms for Information Transfer and Search in the Brain
Eörs Szathmáry szathmary.eors@gmail.com Chrisantha Fernando ctf20@sussex.ac.uk Parmenides Foundation Francis Crick called Gerald Edelman's Neural Darwinism "Neural Edelmanism" because he could not identify any units of evolution, i.e. entities that multiply and have hereditary variation. Whilst a sufficient condition for the production of adaptation is to satisfy George Price's equation, a condition that most hill-climbers, competitive learning and reinforcement learning algorithms, and Edelman's neuronal groups do satisfy, a full Darwinian dynamics of populations in which there is information transfer between individuals, i.e. true units of evolution, has additional algorithmic properties, notably the capacity for the evolution of evolvability to structure exploration (proposal) distributions. This capacity of Darwinian populations to learn has inspired us to search for true units of evolution in the brain. To this end we have identified several candidate units with varying degrees of biological plausibility, and have shown how these units can replicate within neuronal tissue at timescales from milliseconds to minutes. Thus we present a theory that is legitimately called Darwinian Neurodynamics. This is joint work with Chrisantha Fernando.
ORAL SESSION
Session 13 - 12:00 - 1:10 PM, Session Chair: Jonathan Pillow
ORAL SESSION
Session 14 - 1:10 - 1:50 PM, Session Chair: Ronan Collobert
We combine three important ideas present in previous work for building classifiers: the semi-supervised hypothesis (the input distribution contains information about the classifier), the unsupervised manifold hypothesis (data density concentrates near low-dimensional manifolds), and the manifold hypothesis for classification (different classes correspond to disjoint manifolds separated by low density). We exploit a new algorithm for capturing manifold structure (high-order contractive autoencoders) and we show how it builds a topological atlas of charts, each chart being characterized by the principal singular vectors of the Jacobian of a representation mapping. This representation learning algorithm can be stacked to yield a deep architecture, and we combine it with a domain knowledge-free version of the TangentProp algorithm to encourage the classifier to be insensitive to local direction changes along the manifold. Record-breaking results are obtained and we find that the learned tangent directions are very meaningful.
Neurons in the neocortex code and compute as part of a locally interconnected population. Large-scale multielectrode recording makes it possible to access these population processes empirically by fitting statistical models to unaveraged data. What statistical structure best describes the concurrent spiking of cells within a local network? We argue that in the cortex, where firing exhibits extensive correlations in both time and space and where a typical sample of neurons still reflects only a very small fraction of the local population, the most appropriate model captures shared variability by a low-dimensional latent process evolving with smooth dynamics, rather than by putative direct coupling. We test this claim by comparing a latent dynamical model with realistic spiking observations to coupled generalised linear spike-response models (GLMs) using cortical recordings. We find that the latent dynamical approach outperforms the GLM in terms of goodness-of-fit, and reproduces the temporal correlations in the data more accurately. We also compare models whose observation models are either Gaussian or point-process based, finding that the non-Gaussian model provides slightly better goodness-of-fit and more realistic population spike counts.
Motivated by the spread of on-line information in general and on-line petitions in particular, recent research has raised the following combinatorial estimation problem. There is a tree T that we cannot observe directly (representing the structure along which the information has spread), and certain nodes randomly decide to make their copy of the information public. In the case of a petition, the list of names on each public copy of the petition also reveals a path leading back to the root of the tree. What can we conclude about the properties of the tree we observe from these revealed paths, and can we use the structure of the observed tree to estimate the size of the full unobserved tree T? Here we provide the first algorithm for this size estimation task, together with provable guarantees on its performance. We also establish structural properties of the observed tree, providing the first rigorous explanation for some of the unusual structural phenomena present in the spread of real chain-letter petitions on the Internet.
Reviewers
Ryan Adams Alekh Agarwal Arvind Agarwal Shivani Agarwal Deepak Agrawal Amr Ahmed Misha Ahrens Nir Ailon Edoardo Airoldi Yasemin Altun Mauricio Alvarez David Andrzejewski Andras Antos Alfred Anwander Pablo Arbelaez Cedric Archambeau Andreas Argyriou Raman Arora Hiroki Asari Hideki Asoh Arthur Asuncion Chris Atkeson Peter Auer Joseph Austerweil Pranjal Awasthi Francis Bach Drew Bagnell Bing Bai Doru Balcan Nina Balcan Christopher Baldassano Luca Baldassarre Pierre Baldi David Balduzzi Dana Ballard Aharon Bar Hillel Nick Barnes Jonathan T. Barron Peter Bartlett Curzio Basso Sugato Basu Dhruv Batra Peter Battaglia Tim Behrens Mikhail Belkin Shai Ben-David Yoshua Bengio Philipp Berens Alexander C. Berg Tamara Berg Matthias Bethge Alina Beygelzimer Shalabh Bhatnagar Indrajit Bhattacharya Sourangshu Bhattacharya Chiru Bhattarcharyya Jinbo Bi Misha Bilenko Aharon Birnbaum Gilles Blanchard Matthew Blaschko David Blei John Blitzer Avrim Blum Charles Blundell Phil Blunsom Liefeng Bo Botond Bocsi Danushka Bollegala Byron Boots Antoine Bordes Ali Borji Joerg Bornschein Leon Bottou Alexandre Bouchard-Ct Guillaume Bouchard Stephane Boucheron Abdeslam Boularias Michael Bowling Jordan Boyd-Graber Ulf Brefeld Marcus Brubaker Emma Brunskill Sebastien Bubeck Daphna Buchsbaum Michael Buice Razvan Bunescu Wray Buntine Chris J. C. Burges David Burkett Keith Bush Lucian Busoniu Charles Cadieu Kevin Canini Stephane Canu Andrea Caponnetto Barbara Caputo Constantine Caramanis Lawrence Carin Francois Caron Gert Cauwenberghs Gavin Cawley Lawrence Cayton Asli Celikyilmaz Alain Celisse Soumen Chakrabarti Tanmoy Chakraborty America Chambers Manmohan Chandraker Jonathan Chang Kai-Wei Chang Olivier Chapelle Gal Chechik Jianhui Chen Xi Chen Silvia Chiappa Sharat Chikkerur Arthur Choi Seungjin Choi Wongun Choi Mario Christoudias Stephan Clemencon Adam Coates Shay Cohen Michael Collins Ronan Collobert Greg Corrado Timothee Cour Aaron Courville David Cox Koby Crammer Daniel Cremers John Cunningham Marco Cuturi Florence DAlche-Buc Arnak Dalalyan Amit Daniely Dipanjan Das Sanjoy Dasgupta Hal Daume Nathaniel Daw Peter Dayan Nando De Freitas Fernando De la Torre Marc Deisenroth Ofer Dekel Olivier Delalleau Jia Deng Li Deng Misha Denil Inderjit Dhillon Tom Diethe Thomas Dietterich Laura Dietz Chris Ding Francesco Dinuzzo Chuong Do Eizaburo Doi Vincent J. Dorie Finale Doshi Arnaud Doucet Kenji Doya Mark Dredze Gideon Dror Mathias Drton Jan Drugowitsch John Duchi Miroslav Dudik Delbert Dueck Kevin Duh Jennifer Dy Michael Elad Jacob Eisenstein Jason Eisner Tal El Hay Tina Eliassi-rad Gal Elidan Charles Elkan Lloyd Elliott Dominik M. Endres Barbara Engelhardt Damien Ernst Fang Fang Clement Farabet Naomi Feldman Gerhard J. Felipe Rob Fergus Vittorio Ferrari Sanja Fidler Jenny Finkel Jozsef Fiser Andrew Fitzgibbon David Fleet Francois Fleuret David Forsyth Dean Foster Emily Fox Vojtech Franc Jordan Frank Michael Frank Alexander Fraser Bill Freeman Yoav Freund Hironori Fujisawa Kenji Fukumizu Jack Gallant Kuzman Ganchev Arvind Ganesh Surya Ganguli Ravi Ganti Phil Garner Gilles Gasso Jan Gasthaus Michael Gastpar Eric Gaussier Peter Gehler Andreas Geiger Andrew Gelman Claudio Gentile Sean Gerrish Sam Gershman Mohammad Ghavamzadeh Martin A. Giese Ran Gilad-Bachrach Jennifer Gillenwater Kevin Gimpel Inmar Givoni Amir Globerson Vibhav Gogate Andrew Goldberg Anna Goldenberg Mark Goldman Sally Goldman Pollina Gollant Ben Golub Alon Gonen Noah Goodman Geoff Gordon Dilan Gorur Stephen Gould Joao V. 
Graca Thore Graepel David Grangier Alexander Gray Russ Greiner Arthur Gretton Remi Gribonval Moritz Grosse-Wentrup Liu Guangcan Shengbo Guo Yuhong Guo Rahul Gupta Michael Gutmann Steven C. HOI Hirotaka Hachiya Ralf Haefner
Gholamreza Haffari Patrick Haffner Aria Haghighi Ulrike Hahn David Hall Keith Hall Lauren Hannah Zaid Harchaoui Bharath Hariharan Stefan Harmeling Kohei Hatano James Hays Elad Hazan Tamir Hazan Martial Hebert Mohamed Hebiri Matthias Hein Katherine A. Heller Philipp Hennig Ralf Herbrich Mark Herbster Tom Heskes Shohei Hido Matt Hoffman Thomas Hofmann Derek Hoiem Antti Honkela Cho-Jui Hsieh Daniel J. Hsu Gang Hua Chang Huang Junzhou Huang Jonathan Huggins Zakria Hussain Ferenc Huszar Seth Hutchinson Tuyen N. Huynh Tsuyoshi Ide Christian Igel Alex Ihler Shiro Ikeda Charles Isbell Vladimir Itskov Laurent Itti Laurent Jacob Nathan Jacobs Florian Jaeger Jagadeesh Jagarlamudi SakethaNath Jagarlapudi Prateek Jain Shaili Jain Vidit Jain Viren Jain Ali Jalali Dominik Janzing Rodolphe Jenatton Hueihan Jhuang Shuiwang Ji Jinzhu Jia Xiaoye Jiang Rong Jin Mark Johnson Kresimir Josic Anatoli Juditsky
Ata Kaban Hachem Kadri Adam Kalai Satyen Kale Hetunandan Kamisetty Takafumi Kanamori Anitha Kannan Ashish Kapoor Bert Kappen Yan Karklin Matthias Kaschube Hisashi Kashima Samuel Kaski Koray Kavukcuoglu Yoshinobu Kawahara Motoaki Kawanabe Sathiya Keerthi Balazs Kegl Charles Kemp Kristian Kersting Paul Kidwell Seyoung Kim Akisato Kimura Arto Klami Robert Kleinberg Marius Kloft David A. Knowles Kei Kobayashi Jens Kober Kilian Koepsell Pushmeet Kohli Mladen Kolar Vladimir Kolmogorov J. Zico Kolter George Konidaris Terry Koo Anna Koop Samory Kpotufe Andreas Krause Balaji Krishnapuram Oliver Kroemer Rui Kuang Brian Kulis M. Pawan Kumar Sanjiv Kumar Takio Kurita Branislav Kveton James Kwok Simon Lacoste-Julien John Lafferty Brenden Lake Christoph H. Lampert Ni Lao Hugo Larochelle Edith Law Alessandro Lazaric Svetlana Lazebnik Nevena Lazic Yann LeCun Quoc Le Guy Lebanon Guillaume Lecu Honglak Lee Yuh-Jye Lee
Victor Lempitsky Mate Lengyel Christina Leslie Guy Lever Roger Levy Fei Fei Li Fuxin Li Hang Li Lihong Li Ping Li Xiaodong Li Percy Liang Xuejun Liao Lek-Heng Lim Chih-Jen Lin Hsuan-Tien Lin Yuanqing Lin Zhouchen Lin Jenn Listgarten Ce Liu Han Liu Tie-Yan Liu Bo Long Phil Long Manuel C. Lopes Yonatan Lowenstein Christopher G. Lucas Jorg Lucke Yi Ma Christian Machens Hamid R. Maei Sridhar Mahadevan Odalric-Ambrym Maillard Julien Mairal Subhransu Maji Hiroshi Mamitsuka Gideon Mann Shie Mannor Oded Margalit Ben Marlin Andre Martins Winter Mason Jiri Matas Luke Maurits David McAllester Jon McAuliffe Andrew McCallum John McCoy Brian McFee Chris Meek Nishant A. Mehta Marina Meila Bartlett Mel Francisco S. Melo Roland Memisevic Aditya Menon Ethan Meyers Tomer Michaeli Lily Mihalkova Krystian Mikolajczyk Brian Milch Kurt Miller David Mimno Einat Minkov
Andriy Mnih Shakir Mohamed Claire Monteleoni Joris M. Mooij Taesup Moon Tetsuro Morimura Quaid Morris Indraneel Mukherjee Sayan Mukherjee Remi Munos Noburu Murata Iain Murray Hiroshi Nakagawa Hiroyuki Nakahara Shinichi Nakajima Sahand Negahban Blaine Nelson Bernhard Nessler Gerhard Neumann Hani Neuvirth-Telem Andrew Ng Yizhao Ni Hannes Nickisch Alex Niculescu-Mizil Juan Carlos Niebles Scott Niekum Mahesan Niranjan William Stafford Noble Sebastian Nowozin Timothy J. ODonnell Shigeyuki Oba Guillaume Obozinski Alice Oh Daisuke Okanohara Aude Oliva Bruno Olshausen Sylvie Ong Hans Op de Beeck Manfred Opper Gergo Orban Noga Oron Luis Ortiz Sarah Osentoski Hua Ouyang Jean-Francois Paiement John Paisley Liam Paninski Bernardino R. Paredes Il M. Park Ronald Parr Adam Pauls Kristiaan Pelckmans Vianney Perchet Florent Perronnin Jan Peters Jonas Peters Slav Petrov David Pfau Jean-Pascal Pfister Jonathan Pillow Joelle Pineau Nicolas Pinto Gordon Pipa John Platt
Robert Pless Russ Poldrack Massi Pontil Hoifung Poon Doina Precup Philippe Preux Yanjun Qi Yuan Qi Ariadna Quattoni Maxim Raginsky Ali Rahimi Piyush Rai Sasha Rakhlin Alain Rakotomamonjy Liva Ralaivola Parikshit Ram Deva Ramanan MarcAurelio Ranzato Garvesh Raskutti Frederic Ratle Magnus Rattray Pradeep Ravikumar Mark Reid Joseph Reisinger Xiaofeng Ren Lev Reyzin Sebastian Riedel Philippe Rigollet Abel Rodriguez Karl Rohe Lorenzo Rosasco Saharon Rosset Afshin Rostamizadeh Dan Roth Stefan Roth Volker Roth Constantin Rothkopf Juho Rousu Daniel M. Roy Ran Rubin Benjamin Rubinstein Cynthia Rudin Sasha Rush Daniil Ryabko Sivan Sabato Ankan Saha Maneesh Sahani Hiroto Saigo Jun Sakuma Ruslan Salakhutdinov Mathieu Salzmann Sujay Sanghavi Guido Sanguinetti Guillermo Sapiro Ben Sapp Sunita Sarawagi Suchi Saria Simo Sarkka Issei Sato Kengo Sato Cristina Savin Ashutosh Saxena Stefan Schaal Tom Schaul Katya Scheinberg Warren Schudy Dale Schuurmans Odelia Schwartz Alexander G. Schwing Clayton Scott Matthias Seeger Nicola Segata Yevgeny Seldin Thomas Serre Jun Sese Fei Sha Amir H. Shabani Patrick Shafto Greg Shakhnarovich Shai Shalev-Shwartz Tatyana Sharpee Xiaotong Shen Shirish Shevade Tao Shi Shohei Shimizu Lavi Shpigelman Ilya Shpitser Marco Signoretto Michael Silver Ozgur Simsek Vikas Sindhwani Yoram Singer Sameer Singh Kaushik Sinha Noam Slonim Cristian Sminchisescu David Smith Alexander J. Smola Ben Snyder Richard Socher Imri Sofer Peter Sollich Fritz Sommer Le Song Stam Sotiropoulos Matthijs Spaan Peter Spirtes Karthik Sridharan Bharath Sriperumbudur Isabelle Stanton Oliver Stegle Ingo Steinwart Ian H. Stevenson Alan A. Stocker Peter Stone Hao Su Amar Subramanya Masashi Sugiyama Jian Sun Min Sun Siddharth Suri Ilya Sutskever Charles Sutton Taiji Suzuki Kevin Swersky Umar Syed Csaba Szepesvari Prasad Tadepalli Akiko Takeda Takashi Takenouchi Ichiro Takeuchi Eiji Takimoto Partha Talukdar Toshiyuki Tanaka Matthew Tayler Graham Taylor Yee Whye Teh Josh Tenebaum Timothy Teravainen Ambuj Tewari Olivier Teytaud Daniel Ting Jo-Anne Ting Ivan Titov Michalis Titsias Gasper Tkacik Sinisa Todorovic Ryota Tomioka Antonio Torralba Alexander Toshev Long Tran Volker Tresp Bill Triggs Ivor Tsang Ioannis Tsochantaridis Koji Tsuda Srinivas Turaga Joseph Turian Richard Turner Naonori Ueda Tomer D. Ullman Lyle Ungar Raquel Urtasun Nicolas Usunier Benjamin Van Roy Manik Varma Nuno Vasconfelos Nicolas Vayatis Andrea Vedaldi Jean-Philippe Vert Rene Vidal Sethu Vijayakumar S.V.N. Vishwanathan Ed Vul Hanna Wallach Chong Wang Gang Wang Liwei Wang Yizhou Wang Zhikun Wang Ziyu Wang Larry Wasserman Kazuho Watanabe Yusuke Watanabe Chu Wei Kilian Q. Weinberger Yair Weiss Tomas Werner Jason Weston Daan Wierstra Chris Williams Robert C. Williamson Ross Williamson Sinead Williamson Andrew Wilson David Wingate Ole Winther Frank Wood John Wright Steve Wright Qiang Wu Wei Wu Lin Xiao Eric Xing Huan Xu Makoto Yamada Yoshihiro Yamanishi Shuicheng Yan Keiji Yanai Allen Yang Jianchao Yang Ming Yang Ming-Hsuan Yang Bangpeng Yao Yuan Yao Jieping Ye Sun Yi Yiming Ying Byron Yu Kai Yu Shipeng Yu Ming Yuan Yisong Yue Lihi Zelnik-Manor Haizhang Zhang Kun Zhang Tong Zhang Dengyong Zhou Xueyuan Zhou Yan Zhou Zhi-Hua Zhou Jerry Zhu Jun Zhu Shenghuo Zhu Andrew Zisserman Larry Zitnick Onno Zoeter Daniel Zoran Alon Zweig
Author Index
Abbasi-yadkori, Yasin, 98 Abbott, Joshua, 80 Abernethy, Jacob, 76,97 Absil, Pierre-Antoine, 87 Adametz, David, 28 Agarwal, Alekh, 95,98 Agosta, John-Mark, 69 Ahmadian, Yashar, 67 Ahmed, Amr, 10 Ahn, Hee-Kap, 94 Ailon, Nir, 87 Airoldi, Edoardo, 33 Alamgir, Morteza, 99 Alexe, Bogdan, 85 Allahverdyan, Armen, 70 Anand, Abhishek, 54 Anandkumar, Animashree, 63,73,102 Anguita, Davide, 32 Archambeau, Cedric, 56 Armagan, Artin, 91 Asuncion, Arthur, 100 Auer, Peter, 62 Austerweil, Joseph, 81 Ay, Nihat, 95 Azimi, Javad, 92 Babacan, S. Derin, 35 Bach, Francis, 58,59,62,96 Bagdanov, Andrew, 22 Bai, Xiang, 53 Balakrishnan, Sivaraman, 93 Baldi, Pierre, 86 Ball, Tonio, 90 Bar-Joseph, Ziv, 55 Baracos, Vickie, 55 Baraniuk, Richard, 23 Bardenet, Rmi, 26 Barreto, Andre, 16 Bartlett, Nicholas, 95 Barto, Andrew, 17 Belkin, Mikhail, 91 Bengio, Yoshua, 26,33,89,93,106 Berg, Alexander, Berg, Tamara, 85 Bergamo, Alessandro, 22 Bergstra, James, 26 Berkes, Pietro, 81 Bernardino, Alexandre, 27 Bhand, Maneesh, 21 Bhargava, Aniruddha, 82 Bhaskar, Sonia, 91 Bilmes, Jeff, 65,96 Blanchard, Gilles, 57,66 Blei, David, 36 Blitzer, John, 24 Blundell, Charles, 41,68 Bo, Liefeng, 22 Boahen, Kwabena, 18 Bogojeska, Jasmina, 55 Bonilla, Edwin, 95 Borgwardt, Karsten, 101 Bornschein, Jorg, 81 Bouchard-Ct, Alexandre, 41,68 Boumal, Nicolas, 87 Boutsidis, Christos, 58 Bowling, Michael, 48 Boyles, Levi, 61 Braun, Mikio, 103 Brea, Johanni, 52 Brendel, Wieland, 60 Briggman, K, 84
Bubeck, Sebastien, 65 Buesing, Lars, 83,106 Bui, Loc, 31 Buntine, Wray, 95 Cabral, Ricardo, 27 Caetano, Tiberio, 26 Caiafa, Cesar, 90 Cao, Liangliang, 54 Carin, Lawrence, 30,69,81 Carlson, David, 81 Carpentier, Alexandra, 58,65 Carreira-Perpinan, Miguel, 60 Carreira, Joao, 83 Cemgil, Ali Taylan, 29 Cesa-Bianchi, Nicol, 66,74,98 Chapelle, Olivier, 55 Chattopadhyay, Rita, 57 Chaudhuri, Kamalika, 63 Chellappa, Rama, 27 Chen, Bo, 50,81 Chen, Jianhui, 102 Chen, Justin, 71 Chen, Ke, 86 Chen, Kewei, 51 Chen, Minmin, 24 Chen, Ning, 36 Chen, Tsuhan, 84 Chen, Zhenghao, 91 Cheng, Weiwei, 69 Chierichetti, Flavio, 100,106 Chin, Tat-jun, 38 Choi, David, 33 Choi, Jaedeug, 16 Cichocki, Andrzej,90 Clyde, Merlise, 91 Clmenon, Stphan, 66 Coates, Adam, 28 Collins, Michael, 9 Collobert, Ronan, 71 Costa Pereira, Jose, 93 Costeira, Joao, 27 Cotter, Andrew, 62 Courville, Aaron, 33 Cunningham, John, 19,83,106 Dai, Dong, 33 Damianou, Andreas, 68 Darrell, Trevor, 84 Daume III, Hal, 28,68 Dauphin, Yann, 93,106 Daw, Nathaniel, 50 Dayan, Peter, 82 De Campos, Cassio, De la Torre, Fernando, 27 Delaitre, Vincent, 86 Delalleau, Olivier, 89 Delgosha, Payam, 87 Dembczynski, Krzysztof, 69 Deng, Jia, 83 Denk, Winfried, 84 Desjardins, Guillaume, 33 Dethier, Julie, 18 Dhillon, Inderjit, 23,32,59,61 Dhillon, Paramveer, 24 Dietterich, Thomas, 23,37 Diggavi, Suhas, 87 Dikmen, Onur, 60 Ding, Chris, 25 Ding, Nan, 101 Ding, Shilin, 70 Do, Huyen, 29 Doppa, Janardhan Rao, 23 Drineas, Petros, 58 Dubout, Charles, 25
Duchi, John, 95 Dunson, David, 30,69,91 Dupont, Pierre, 37 Duvenaud, David, 37 Eberts, Mona, 99 Ekanadham, Chaitanya, 67 El-Yaniv, Ran, 33,87 Elasaad, Shauki, 18 Elhadad, Noemie, 95 Elhamifar, Ehsan, 91 Eliasmith, Chris, 18 Elliott, Lloyd, 41,68 Ermon, Stefano, 67 Farabet, Clement, 103 Farahmand, Amir-massoud, 48 Faulkner, Matthew, 92,105 Feldman, Dan, 92,105 Felzenszwalb, Pedro, 85 Feng, Jean, 71 Fergus, Rob, 38 Fern, Alan, 63,92 Fern, Xiaoli, 23,92 Fernando, Chrisantha,105 Ferrari, Vittorio, 85 Fvotte, Cdric, 60 Fitzgibbon, Andrew, 22 Fleisher, Adam, 51 Fletcher, Alyson, 82 Fleuret, Francois, 25 Foster, Dean, 24,98 Fox, Dieter, 22 Foygel, Rina, 28 Friesen, Abram, 81 Frongillo, Rafael, 75,97 Fukumizu, Kenji, 64,88 Gabillon, Victor, 65 Gaertner, Thomas, 82,99 Gall, Juergen, 53 Galstyan, Aram, 70 Gao, Tianshi, 88 Garber, Dan, 96 Ge, Xiaoyin, 91 Gehler, Peter, 52 Geiger, Andreas, 20 Gentile, Claudio, 66 Gerstner, Wulfram, 18 Ghahramani, Zoubin, 65,80 Ghaoui, Laurent, 94 Ghavamzadeh, Mohammad, 17 Gheshlaghi Azar, Mohammad, 17 Ghio, Alessandro, 32 Ghosh, Soumya, 36 Gibson, Richard, 64 Gilbert, Anna, 42 Girshick, Ross, 85 Gkioulekas, Ioannis, 94 Globerson, Amir, 8 Goernitz, Nico, 26 Gomes, Carla P., 67 Gomes, Ryan, 60 Gong, Yihong, 54 Goodman, Noah, 35 Gool, Luc, 53 Grauman, Kristen, 86 Grave, Edouard, 59 Graves, Alex, 90 Gregor, Karol, 90 Greiner, Russ, 55 Gretton, Arthur, 88 Griffiths, Tom, 80,81 Grunwald, Peter, 65 Guestrin, Carlos, 65
Guillory, Andrew, 65 Gunawardana, Asela, 38 Guo, Shengbo, 56,88 Gutkin, Boris, 49 Hachiya, Hirotaka, 17,56 Hamprecht, Fred, 89 Hansen, Lars,71 Hayashi, Kohei, 31 Hazan, Elad, 31,96,97 He, Xiaofei, 52 Hein, Matthias, 29,90 Heller, Katherine, 80 Helmstaedter, Moritz, 84 Hennig, Philipp, 17 Hernndez-Lobato, Jose Miguel, 37 Hernndez-lobato, Daniel, 37 Hero, Alfred, 24 Heskes, Tom, 38 Hewlett, Daniel, 49 Hglund, Mattias,37 Hirayama, Jun-ichiro, 29 Horwitz, Greg, 52 Hsieh, Cho-Jui, 61 Hsu, Daniel, 63,98 Hsu, David, 16 Huang, Bert, 59 Huang, Eric, 89 Huang, Heng, 25 Huang, Shuai, 51 Huang, Thomas, 38,54 Huang, Tzu-Kuo, 38 Hullermeier, Eyke, 69 Hunter, David, 100,105 Hwang, Sung Ju, 86 Hwang, Yoonho, 94 Hyvarinen, Aapo, 29 Insua, David Rios,103 Ion, Adrian, 83 Iouditski, Anatoli, 31 Isola, Phillip, 81 Jaakkola, Tommi,8 Jacob, Laurent, 64 Jacobs, Robert, 80 Jain, Prateek, 23 Jain, Viren, 29,84 Jalali, Ali, 102 Jamieson, Kevin, 24 Janzing, Dominik, 38 Jebara, Tony, 59,88 Jegelka, Stefanie, 96 Jern, Alan, 50 Ji, Qiang, 51 Jia, Yangqing, 84 Jiang, Jiarong, 68 Joachims, Thorsten, 54 Johari, Ramesh, 31 Johnson, Christopher, 102 Jordan, Michael, 94,100 Jugel, Matthias, 103 Kahles, Andre, 26 Kakade, Sham, 30,63,98 Kalai, Adam, 30 Kale, Satyen, 31 Kalousis, Alexandros, 29 Kamangar, Farhad, 25 Kanade, Varun, 30 Kanamori, Takafumi, 56 Kappen, Hilbert, 17 Kapralov, Michael, 64 Kar, Purushottam, 29 Karger, David, 75,102 Karklin, Yan, 82
Karpenko, Alexandre, 92 Kashima, Hisashi, 31 Kawahara, Yoshinobu, 61 Kayala, Matthew, 86 Kgl, Balzs, 26 Keil, Matthias, 19 Kemp, Charles, 50 Keramati, Mehdi, 49 Keshet, Joseph, 42,58 Khan, Fahad, 22 Khan, Faisal, 80 Khan, Omar, 69 Kiefel, Martin, 52 Kilinc Karzan, Fatma, 31 Kim, Dae Il, 68 Kim, Kee-Eung, 16 Kim, Sungwoong, 21 Kleinberg, Jon, 100,106 Kloft, Marius, 57 Knowles, David, 36 Kobayashi, Ryota, 20 Koerding, Konrad, 19 Koh, Pang Wei, 91 Kohli, Pushmeet, 21 Kokkinos, Iasonas, 85 Kolar, Mladen, 93 Koller, Daphne, 88 Kolter, J. Zico, 48 Koltun, Vladlen, 18,74,83 Konidaris, George, 50 Knig, Arnd, 55 Koolen, Wouter, 31,65 Koppula, Hema, 54 Korattikara, Anoop, 61 Koren, Tomer, 97 Kotlowski, Wojciech, 31 Kpotufe, Samory, 75,97 Krause, Andreas, 60, 69,92,105 Krishnamurthy, Akshay, 93 Kroemer, Oliver, 40,48 Krhenbhl, Philipp, 74,83 Kulkarni, Girish, 85 Kumar, Abhishek, 28 Kunapuli, Gautam, 25 Kurtek, Sebastian, 23 Ladicky, Lubor, 57 Lagergren, Jens, 37 Lampert, Christoph, 58 Lanckriet, Gert, 64 Lanctot, Marc, 48 Lansky, Petr, 20 Laptev, Ivan, 53,86 Larsen, Jakob Eg, 71 Larsson, Martin, 96 Latecki, Longin Jan, 53 Latham, Peter, 20 Laughlin, Simon, 18 Laurent, Gilles, 106 Laviolette, Francois, 62 Lawrence, Neil, 68,101 Lawrence, Richard, 88 Lazaric, Alessandro, 48,65 Lzaro-Gredilla, Miguel, 34 Le Roux, Nicolas, 96 LeCun, Yann, 90 Le, Hai-Son, 55 Le, Quoc V., 71,92 Lee, Gyemin, 66 Lei, Jing, 100 Leibo, Joel, 20 Lempitsky, Victor, 21 Lengyel, Mate, 19,20,82 Levine, Sergey, 18 Li, Congcong, 84 Li, Fei Fei, 22,83 Li, Jing, 51 Li, Lihong, 55 Li, Ping, 55 Li, Zhen, 54 Liben-Nowell, David, 100,106 Lim, Joseph, 21 Lim, Zhan Wei, 16 Lima, Pedro, 17 Lin, Binbin, 52 Lin, Hsiu-Chin, 55 Lin, Hui, 55,96 Lin, Zhouchen, 96 Link, Benjamin, 54 Lippert, Christoph, 101 Liu, Jun, 89,90 Liu, Risheng, 96 Liu, Tom, 93 Liu, Wenyu, 53 Liu, Yan, 88 Liu, Yi-Kai, 63 Lizotte, Dan, 49 Loh, Po-Ling, 43,65 Long, Phil, 63,97 Lopes, Miles, 64 Lou, Xinghua, 89 Low, Tiffany, 71 Lozano, Aurelie, 27 Lu, Zhaosong, 61 Lu, Zhengdong, 60 Lucas, Christopher, 50 Lucke, Jorg, 81 Lyu, Siwei, 30 Machens, Christian, 60 Macke, Jakob, 20,83,106 Mackey, Lester, 94 Maclin, Richard, 25 Magdon-Ismail, Malik, 58 Mahadevan, Vijay, 93 Mahoney, Michael, 30 Maillard, Odalric-Ambrym, 49,58 Mandic, Danilo, 90 Manning, Christopher, 89 Mannor, Shie, 31,98 Matthews, Iain, 38 Maua, Denis, 70 McCallum, Andrew, 101 McHutchon, Andrew, 101 Mcallester, David, 42,58,85 Meek, Christopher, 38 Mehta, Neville, 63 Meila, Marina, 94 Meir, Ron, 37 Mensi, Skander, 18 Messias, Joo, 17,99 Meyerson, Adam, 92,105 Miller, Ken, 67 Minka, Tom, 36 Missura, Olana, 99 Mohajer, Soheil, 87 Montufar, Guido, 95 Mooij, Joris, 38,101 Moore, Joshua, 55 Morioka, Nobuyuki, 54 Morrison, Clayton, 49 Moulines, Eric, 62 Mozer, Michael, 54 Mudur, Ritvik, 21 Muller, Klaus-Robert, 103 Muller, Xavier, 93 Munos, Remi, 17,32,40,49,58,65 Murillo Fuentes, Juan Jos, 34 Murray, Iain, 20 Mutch, Jim, 20 Mutlu, Bilge, 80 Nakajima, Shinichi, 35 Nasrabadi, Nasser, 59 Naud, Richard, 18 Navalpakkam, Vidhya, 50 Nemirovski, Arkadi, 31 Newman, David, 95 Ney, Hermann, 62 Ng, Andrew, 21,28,71,89,91,92 Ngiam, Jiquan, 91,92 Nguyen, Nam, 59 Nickisch, Hannes, 37 Nie, Feiping, 25 Niekum, Scott, 17,50 Ning, Huazhong, 54 Niu, Feng, 97 Niu, Gang, 17 Niven, Jeremy, 18 Nowak, Rob, 24 Nowozin, Sebastian, 21 Nuyujukian, Paul, 18 Obozinski, Guillaume, 59 Oh, Sewoong, 102 Oliva, Aude, 81 Olmos, Pablo, 34 Oneto, Luca, 32 Ong, Cheng Soon, 69 Opper, Manfred, 37,70 Orbanz, Peter, 9 Ordonez, Vicente, 85 Orhan, Emin, 80 Orr, Walker, 
23 Ortner, Ronald, 62 Pacer, Michael, 80 Pajarinen, Joni, 48 Pal, David, 98 Panchanathan, Sethuraman, 57 Panigrahy, Rina, 64 Paninski, Liam, 66 Parikh, Ankur, 26 Parikh, Devi, 81 Park, Il Memming, 99 Park, Mijung, 52 Pashler, Harold, 54 Peltonen, Jaakko, 48 Pennin, Jeffrey, 89 Perez-Cruz, Fernando, 34 Perona, Pietro, 50,60 Perotte, Adler, 95 Perrault-Joncas, Dominique, 94 Perry, Patrick, 30 Peters, Jan, 48 Petersen, Michael, 71 Petrescu, Viviana, 85 Petreska, Biljana, 19 Petterson, James, 26 Pfister, Jean-Pascal, 52 Pham, Trung, 38 Pidan, Dmitry, 87 Pillow, Jonathan, 8,52,99 Pineau, Joelle, 16 Pitkow, Xaq, 67 Poczos, Barnabas, 27 Poggio, Tomaso, 20 Polyak, Boris, 31 Popovic, Zoran, 18 Poupart, Pascal, 69 Precup, Doina, 16 Qi, Yuan (Alan), 67,100 Quigley, Morgan, 71 Raetsch, Gunnar, 26 Raginsky, Maxim, 99 Rahimi, Ali, 34 Rahnama Rad, Kamiar, 66 Rai, Piyush, 28,68 Rakhlin, Alexander, 32,98,99 Ramadge, Peter, 56 Ramanan, Deva, 53,61 Rangan, Sundeep, 82 Rao, Vinayak, 36 Rasi, Marc, 71 Rasmussen, Carl Edward, 37,101 Rauh, Johannes, 95 Ravikumar, Pradeep, 32,59,61,102 Raykar, Vikas, 36 Re, Christopher, 97 Recht, Benjamin, 97 Reichert, David, 51 Reid, Mark, 63 Reiman, Eric, 51 Ren, Lu, 69 Ren, Xiaofeng, 22 Restelli, Marcello, 48 Rezende, Danilo, 82 Ridella, Sandro, 32 Rifai, Salah, 93,106 Rinaldo, Alessandro, 93 Romo, Ranulfo, 60 Rooij, Steven, 65 Roth, Volker, 28 Rother, Carsten, 52 Roy, Dan, 95 Ruttor, Andreas, Rush, Alexander,9 Ryabko, Daniil, 49 Ryu, Stephen, 19 Saberian, Mohammad, 25 Sabharwal, Ashish, 67 Saeedi, Ardavan, 68 Safa, Issam, 91 Saffari, Amir, 57 Sahani, Maneesh, 19,34,83,106 Salakhutdinov, Ruslan, 21,28,100 Salamanca, Luis, 34 Salman, Ahmad, 86 Sanchez, Diego Garcia, 103 Sanguinetti, Guido, 70 Sankaranarayanan, Aswin, 23 Santhanam, Gopal, 19 Satheesh, Sanjeev, 83 Satoh, Shinichi, 54 Saul, Lawrence, 93 Savin, Cristina, 82 Saxe, Andrew, 21 Saxena, Ashutosh, 54,84 Schalk, Gerwin, 51 Schmidt, Mark, 96 Schneider, Jeff, 27,38 Schulze-bonhage, Andreas, 90 Schlkopf, Bernhard, 38,52,73 Schwartz, Marc-Olivier,71 Scott, Clay, 66 Seldin, Yevgeny, 62 Selman, Bart, 67 Sengupta, Biswa, 18 Senn, Walter, 52 Series, Peggy, 51 Servedio, Rocco, 63,97
Setzer, Simon, 29 Seung, H. Sebastian, 84 Sha, Fei, 86 Shah, Devavrat, 102 Shalev-Shwartz, Shai, 57 Shamir, Ohad, 28,30,62,98 Shashua, Amnon, 57 Shavlik, Jude, 25 Shaw, Blake, 59 Shawe-Taylor, John, 62 Sheikh, Abdul Saboor, 81 Sheldon, Daniel, 37 Shelton, Jacquelyn, 81 Shenoy, Krishna, 18,19,83,106 Shindler, Michael, 92,105 Shinomoto, Shigeru, 20 Shivaswamy, Pannagadatta, 88 Shrivastava, Anshumali, 23,55 Shroff, Nitesh, 27 Si, Luo, 88 Silbert, Nathan, 103 Sigal, Leonid, 38 Silva, Ricardo, 34 Simon, Dylan, 50 Simoncelli, Eero, 67,82 Simsekli, Umut, 29 Sindhwani, Vikas, 27 Singh, Aarti, 93 Siskind, Jeffrey, 35 Sivic, Josef, 86 Sjlund, Erik37 Slawski, Martin, 90 Slivkins, Aleksandrs, 98 Sminchisescu, Cristian, 83 Smola, Alexander, 10 Smyth, Padhraic, 100,105 Soatto, Stefano, 84 Socher, Richard, 89 Song, Le, 26,63,88 Sonnenburg, Soeren, 26 Sontag, David, 95 Sorower, Mohammad Shahed, 23 Spaan, Matthijs, 17 Srebro, Nati, 28,62,65,97 Sricharan, Kumar, 24 Sridharan, Karthik, 32,62,65 Sriperumbudur, Bharath, 64 Srivastava, Anuj, 23 Stegle, Oliver, 101 Steinwart, Ingo, 99 Stemmler, Martin, 18 Stevenson, Ian, 19 Stewart, Terrence, 18 Stimberg, Florian, 70 Stopczynski, Arkadiusz,71 Storkey, Amos, 51 Stahlhut, Carsten71 Stuhlmueller, Andreas, 35 Su, Zhixun, 96 Sudderth, Erik, 36,68 Sugiyama, Masashi, 17,35,56,61 Sun, Lee, 16 Sun, Liang, 89 Sun, Qian, 57 Suresh, Bipin, 21 Susemihl, Alex, 37 Sustik, Matyas, 61 Suter, David, 38 Sutton, Charles, 35 Sutton, Richard,40 Suzuki, Taiji, 31,32,56
Szafron, Duane, 64 Szathmary, Eors, 105 Szepesvari, Csaba, 98 Szlam, Arthur, 90 Tadepalli, Prasad, 23,63 Takeuchi, Ichiro, 61 Talwalkar, Ameet, 94 Tan, Vincent, 102 Taylor, Graham, 38 Teh, Yee Whye, 9,36,68 Telgarsky, Matus, 88 Tenenbaum, Josh, 100 Tewari, Ambuj, 23,32,59,65 Thomas, Philip, 16,50 Tishby, Naftali, 10 Titsias, Michalis, 34,68 Tofigh, Ali, 37 Tomioka, Ryota, 31 Torr, Philip, 57 Torralba, Antonio, 21,81,100 Torresani, Lorenzo, 22 Tran, Trac, 59 Tranchina, Daniel, 67 Tschopp, Dominique, 87 Tsubo, Yasuhiro, 20 Turaga, Pavan, 27 Turaga, Srinivas, 84 Turner, Richard, 34 Ugander, Johan, 96 Ujfalussy, Balazs, 19 Ungar, Lyle, 24 Ungureanu, Andrei, 36 Urtasun, Raquel, 20,53 Van den Broeck, Guy, 70 Van de Weijer, Joost22 Van Erven, Tim, 65 Vanrell, Maria, 22 Varshney, Lav, 82 Vasconcelos, Nuno, 25,93 Vedaldi, Andrea, 21 Veness, Joel, 48 Vernet, Elodie, 63 Vidal, Rene, 91 Vincent, Pascal, 93,106 Vishwanathan, S.V.N., 101 Vitale, Fabio, 66 Von Luxburg, Ulrike, 99 Vondrick, Carl, 53 Vu, Duy, 100,105 Waegeman, Willem, 69 Wagner, Paul, 16 Wahba, Grace, 70 Wainwright, Martin, 56,64 Walsh, Thomas, 49 Wang, Hua, 25 Wang, Jun, 29 Wang, Weiran, 60 Wang, Xinggang, 53 Wang, Yingjian, 69 Wang, Yusu, 91 Wang, Zuoguan, 51 Warmuth, Manfred, 31 Washio, Takashi, 61 Watanabe, Yusuke, 35 Waters, Andrew, 23 Wauthier, Fabian, 100 Weinberger, Kilian, 24 Welinder, Peter, 60 Welling, Max, 61 Wexler, Yonatan, 57 Wick, Michael, 101 Widmer, Christian, 26 Wiener, Yair, 33
Wierstra, Daan, 82 Wiesler, Simon, 62 Williamson, Robert, 63 Willsky, Alan, 102 Wingate, David, 35 Wipf, David, 59 Wnuk, Kamil, 84 Wojek, Christian, 20 Wolfe, Patrick, 33 Wong, Alex, 92,102 Wong, Chi Wah, 93 Wood, Frank, 95 Woznica, Adam, 29 Wright, Stephen, 97 Wu, Teresa, 51 Wu, Wei, 23 Xiang, Zhen James, 56 Xing, Eric, 22,26,36 Xiong, Liang, 27 Xu, Hao, 56 Xu, Min, 93 Xu, Puyang, 38 Yamada, Makoto, 56 Yan, Feng, 67 Yang, Liu, 28 Yang, Shulin, 34 Yang, Xingwei, 53 Yao, Angela, 53 Ye, Jieping, 51,57,89,90,102 Ylmaz, Kenan, 29 Yoo, Chang D., 21 Yu, Byron, 19,83,106
Yu, Chun-Nam, 55 Yu, Jin, 38 Yu, Shipeng, 36 Yuan, Lei, 90 Yue, Yisong, 65 Zappella, Giovanni, 66 Zeiler, Matthew, 38 Zeller, Georg, 26 Zhang, Chiyuan, 52 Zhang, Dan, 88 Zhang, Jian, 88 Zhang, Liqing, 90 Zhang, Lumin, 52 Zhang, Tong, 33,54,63 Zhang, Xianxing, 30 Zhang, Yichuan, 35 Zhang, Yong, 61 Zhang, Youwei, 94 Zhang, Ziming, 57 Zhao, Bin, 22 Zhao, Qibin, 90 Zhao, Tingting, 17 Zhao, Yibiao, 53 Zhou, Jiayu, 102 Zhou, Rong, 103 Zhu, Jun, 36,53 Zhu, Song-chun, 53 Zhu, Xiaojin (Jerry), 70,80 Zickler, Todd, 94 Zisserman, Andrew, 21 Zoeter, Onno, 56 Zou, Will, 71
Next Conference
2012 - 2014
Lake Tahoe, Nevada