## Reinforcement Learning

### prof. Piotr Miłoś, dr Łukasz Kuciński

Monday, 10.00 - 12.00 room 106, ul. Sniadeckich 8, Warsaw

Our seminars can be attended via Hangouts Meet. Contact us (Łukasz Kuciński or Piotr Miłoś) if you want to be added to the seminar's mailing list.

01.03.2021, **Stefan Bauer, Max Planck Institute for Intelligent Systems**, CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning.

The live streaming of the seminar will be available on Hangouts Meet.

Abstract: Despite recent successes of reinforcement learning (RL), it remains a common problem that agents fail to transfer their learned skills to related environments. To facilitate research addressing this challenge, we propose CausalWorld, a benchmark for causal structure and transfer learning in a robotic manipulation environment. This environment is a simulation of an open-source robotic platform, hence offering the possibility of sim-to-real transfer. Tasks consist of constructing 3D shapes from a given set of blocks - inspired by how children learn to build complex structures. A key strength of CausalWorld is that it provides a combinatorial family of such tasks with a common causal structure and underlying factors (including e.g. robot and object masses, colors, sizes). The user (or even the agent) may intervene on all causal variables, which allows for fine-grained control over how similar different tasks (or task distributions) are. Hence, one can easily define training and evaluation distributions of a desired difficulty level, targeting a desired form of generalization (e.g. only changes in appearance or object mass). Further, this common parametrization facilitates defining curricula by interpolating between an initial and a target task. While users may define their own task distributions, we present eight meaningful distributions as concrete benchmarks, ranging from simple to extremely challenging, all of which require long-horizon planning and precise low-level motor control at the same time. Finally, we provide baseline results for a subset of these tasks on distinct training curricula and corresponding evaluation protocols, verifying the feasibility of the tasks in this benchmark.

14.12.2020, **Konrad Czechowski, Piotr Kozakowski, Piotr Januszewski, Piotr Miłoś, Tomasz Odrzygóźdź, Michał Zawalski,** News from NeurIPS 2020.

The live streaming of the seminar will be available on Hangouts Meet.

Papers discussed:

Language as a Cognitive Tool to Imagine Goals in Curiosity-Driven Exploration

Continuous Latent Search for Combinatorial Optimization

On Efficiency in Hierarchical Reinforcement Learning

Goal-directed Generation of Discrete Structures with Conditional Generative Models

Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement

Tutorial: Designing Learning Dynamics

Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design

Simple and Principled Uncertainty Estimation with Deterministic Deep Learning via Distance Awareness

Amortized Variational Deep Q Network

Action and Perception as Divergence Minimization

Value-driven Hindsight Modelling

Gradient Surgery for Multi-Task Learning

30.11.2020, **Michal Valko, DeepMind**, Bootstrap your own latent: A new approach to self-supervised learning.

The live streaming of the seminar will be available on Hangouts Meet.

Abstract: We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network. While state-of-the-art methods intrinsically rely on negative pairs, BYOL achieves a new state of the art without them. BYOL reaches 74.3 per cent top-1 classification accuracy on ImageNet using the standard linear evaluation protocol with a ResNet-50 architecture and 79.6 per cent with a larger ResNet. We show that BYOL performs on par with or better than the current state of the art on both transfer and semi-supervised benchmarks.
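The "slow-moving average" update of the target network mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `ema_update`, the decay parameter `tau`, and the representation of network weights as flat Python lists are all assumptions made for illustration.

```python
def ema_update(target, online, tau=0.99):
    """Exponential moving average of weights (illustrative, not BYOL's code).

    Each target weight moves a fraction (1 - tau) of the way towards the
    corresponding online weight. `tau` is a hypothetical decay rate.
    """
    return [tau * t + (1.0 - tau) * o for t, o in zip(target, online)]


# Toy example: the target slowly tracks fixed online weights.
target_weights = [0.0, 0.0]
online_weights = [1.0, -1.0]
for _ in range(3):
    target_weights = ema_update(target_weights, online_weights, tau=0.5)
print(target_weights)
```

With `tau` close to 1 the target changes slowly, so the online network is trained against a relatively stable prediction target.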

09.11.2020, **Jacek Tabor, Jagiellonian University**, Introduction to multi-label classification.

The live streaming of the seminar will be available on Hangouts Meet.

Abstract: We will talk about what multi-label classification is and how we can modify the loss function so that the model can use additional information about the number of classes.

19.10.2020, **Jacek Cyranka, University of Warsaw**, State Planning Policy Reinforcement Learning

The live streaming of the seminar will be available on Hangouts Meet.

Abstract: The question we address: How to develop physics-informed reinforcement learning algorithms that guarantee safety and interpretability? It is widely known that policies trained using reinforcement learning (RL) to solve simulated robotics problems (MuJoCo) are extremely brittle and unstable, i.e. your solution will most likely break down after perturbing a bit (e.g. poking the robot) or transferring it to a similar task. It is often impossible to provide any safety guarantees for constraint satisfaction or an interpretation of how the trained policies work. To address these issues we created State Planning Policy Reinforcement Learning.

05.10.2020, **Aleksander Mądry, MIT**, What Do Our Models Learn?

The live streaming of the seminar will be available on Hangouts Meet.

Abstract: Large-scale vision benchmarks have driven, and often even defined, progress in machine learning. However, these benchmarks are merely proxies for the real-world tasks we actually care about. How well do our benchmarks capture such tasks? In this talk, I will discuss the alignment between our benchmark-driven ML paradigm and the real-world use cases that motivate it. First, we will explore examples of biases in the ImageNet dataset, and how state-of-the-art models exploit them. We will then demonstrate how these biases arise as a result of design choices in the data collection and curation processes. Throughout, we illustrate how one can leverage relatively standard tools (e.g., crowdsourcing, image processing) to quantify the biases that we observe. Based on joint works with Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Jacob Steinhardt, Dimitris Tsipras and Kai Xiao.

====== End of academic year 2019/2020 ======

**Yuhuai Wu, University of Toronto**, Towards Symbolic Generalizations in Neural Reasoning. Exceptionally at 16:00 CET.

The live streaming of the seminar will be available on Hangouts Meet.

Abstract: In this talk, I'll discuss a few recent works on generalization issues in symbolic reasoning. I'll first talk about my work on a novel architectural design for compositional generation in Raven's Progressive Matrices (RPM), a visual analogical reasoning task that is often used to test human Intelligence Quotient (IQ). We propose the Scattering Compositional Learner, which achieves state-of-the-art results on two RPM datasets, with a relative improvement of 48% over the previous state of the art. The learned agent is capable of generalizing to unseen analogies without further training. Next, I'll talk about the second work, a study on the generalization ability of neural network models in theorem proving. We design a theorem generator that synthesizes a dataset of inequality theorems, which helps us diagnose the generalization ability across six dimensions. We perform a comprehensive evaluation of the existing neural network models, and further show that performing planning can greatly improve generalization. Lastly, I'll show some exciting results on applying neural network models to Sharp SAT (the model counting problem), where the learned neural network agents are able to generalize to problems of much larger size and so achieve orders-of-magnitude wall-clock time improvements. I'll conclude by showing some future directions for understanding generalization in symbolic reasoning, as well as further opportunities in symbolic reasoning, such as an upcoming dataset for human-oriented theorem proving.

08.06.2020, **Beat Schaad, Sky TV**, Big Data in media: Content evaluation and opportunities for RL in customer management.

The live streaming of the seminar will be available on Hangouts Meet.

25.05.2020, **Ofir Nachum, Google Brain Mountain View**, Convex Duality for RL.

Abstract: We review basic concepts of convex duality and summarize how this duality may be applied to a variety of reinforcement learning (RL) settings, including policy evaluation or optimization, online or offline learning, and discounted or undiscounted rewards. The derivations yield a number of intriguing results, including the DICE family of works we've recently published. Thus, through the lens of convex duality, we provide a unified treatment and perspective on these works, which we hope will enable researchers to better use and apply the tools of convex duality to make further progress in RL.

14.04.2020, **Stanisław Jastrzębski, NYU, Molecule.one**, Towards Reverse Engineering Stochastic Gradient Descent.

Abstract: First-order methods such as Stochastic Gradient Descent are unreasonably effective at training well-generalizing deep neural networks. Why is this the case? And can we improve the training of DNNs in more challenging settings such as multi-task learning? In the main part of the talk, I will discuss our recent work on the importance of the early phase of training. We argue for the existence of the "break-even" point on the optimization trajectory, beyond which key properties of the loss surface are implicitly regularized by SGD. I will also discuss our early results on trying to "reverse engineer" SGD. Inspired by the discovered implicit regularization effects of SGD, we designed a new explicit regularizer, which shows promising results for multi-task learning.

16.03.2020, **Jakub Świątkowski, MIM UW, NoMagic.AI**, The k-tied Normal Distribution: A Compact Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks.

Abstract: The Bayesian paradigm in machine learning promises a wide range of beneficial properties. In particular, Bayesian neural networks (BNNs) are less prone to overfitting and provide better calibration of predictive uncertainty, which makes them well suited for Reinforcement Learning tasks. The most popular method for training BNNs is Variational Inference (VI). In this talk, I will start with an introduction to VI training of BNNs. I will then present my recent work on developing a more efficient VI method that exploits the sparsity in the neural networks. The proposed method drastically reduces the number of model parameters, increases the signal-to-noise ratio of gradient estimates, and thus speeds up convergence. Finally, I will conclude with an open-ended discussion on the material benefits of the Bayesian paradigm in deep learning.

09.03.2020, **Maciej Wołczyk, Jagiellonian University**, Continual Learning - Setting, Methods and Connections to Reinforcement Learning.

Abstract: Deep neural networks suffer from catastrophic forgetting - i.e. training on a new task often causes a drastic drop in performance on previously trained tasks. It is a serious problem in real-life applications where tasks are learned in a sequence and the i.i.d. assumption is violated. Because of that, in recent years there has been a surge of interest in continual learning - the problem of creating models which are able to learn tasks in sequence, without forgetting the patterns they have encountered before. I will describe the continual learning setting (along with some of its variants) and some of the most popular approaches for solving this task. I will also emphasize the connections between this field and reinforcement learning, such as the usage of a replay buffer for learning.

24.02.2020, **Razvan Pascanu, DeepMind London**, Optimization, learning and data-efficiency.

Abstract: The question I'd like to ask is how we can improve training efficiency for neural networks. In particular, I'm interested in the role played by the optimizer. The lecture will start with a brief overview of learning techniques, going from stochastic gradient descent to second-order methods and natural gradient. We will follow with a discussion on a more recent paradigm for efficient learning, namely meta-learning. The topic has seen extensive growth in the last few years, so we will not cover all different forms of meta-learning, but rather focus on those that explicitly learn or parameterize an optimizer. In particular, we will discuss a newly proposed algorithm called Warp Gradient Descent as well as other previous related work. For simplicity, part of the discussion will be carried out in the context of supervised learning. However, we will also discuss the RL setting, particularly for meta-learning. We will describe the unique challenges imposed by RL and how these could affect the presented algorithms. Finally, we will draw some conclusions regarding the role of the optimizer in improving efficiency of learning and discuss potential directions for moving forward.

20.12.2019, **Michał Garmulewicz, MIM UW and Brain Corp**, Structural Priors for Reinforcement Learning.

Abstract: In the most recent reinforcement learning literature we are observing a shift towards approaches with structural priors, inductive biases, auxiliary assumptions, innate machinery etc. These take many forms, among others: model-based, object-based, planning-based, relational, temporal, self-attentive etc. But are these necessary or harmful in the long term? In this talk we will investigate whether such structural biases are "asymptotically necessary" in order to realize the promise of learning to solve any task that a human can do in under 2 seconds. We will present arguments (and concrete data points) for and against this point from the literature, and briefly zoom into some particularly interesting approaches utilizing structural priors.

25.11.2019, **Karol Hausman, Google Brain Mountain View**, Multi-Task Reinforcement Learning - a Curse or a Blessing?

Abstract: Multi-task reinforcement learning has emerged as a promising approach for sharing structure across multiple tasks to enable more efficient learning. In addition, it has the potential to address many of the reinforcement learning challenges such as the abundance of resets and rewards, long-horizon tasks or compositional task representations. Nevertheless, due to its challenges, which range from optimization to task design, multi-task reinforcement learning is yet to deliver on its promises.

In this talk, I'll present various advancements and applications of multi-task reinforcement learning, including reward-free learning and learning of long-horizon tasks. I'll also talk about different ways to characterize and evaluate multi-task reinforcement learning challenges. Finally, I'll present a benchmark that aims to systematize the advancements in this field.

18.11.2019, **Marcin Andrychowicz, Google Brain Zurich,** Reinforcement Learning for Robotics.

Abstract: The talk will describe how we can use Reinforcement Learning to train control policies for physical robots. The first part of the talk is going to be devoted to efficient learning from sparse and binary reward signals with the technique called Hindsight Experience Replay (https://arxiv.org/pdf/1707.01495.pdf). In the second part of the talk, I'll discuss the issue of transferring control policies from a simulator to the real world and present the technique of Automatic Domain Randomization, which relies on randomizing the appearance as well as the dynamics of the simulated environment. In particular, I'll focus on the problem of dexterous in-hand manipulation with a humanoid hand (https://openai.com/blog/solving-rubiks-cube/).
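The idea of learning from sparse, binary rewards by relabeling goals in hindsight can be sketched roughly as follows. This is only an illustration under stated assumptions, not the speaker's code: the episode format `(state, action, achieved_goal)`, the helper names, the choice of sampling substitute goals from later steps of the same episode, and the reward convention (0 on success, -1 otherwise) are all assumptions of this sketch.

```python
import random


def sparse_reward(achieved, goal):
    # Assumed binary reward: 0.0 when the goal is reached, -1.0 otherwise.
    return 0.0 if achieved == goal else -1.0


def hindsight_relabel(episode, k=2, rng=None):
    """Create extra transitions by pretending later outcomes were the goal.

    `episode` is a list of (state, action, achieved_goal) tuples, where
    achieved_goal is what the agent actually reached after the action.
    For each step, k substitute goals are drawn from outcomes achieved at
    the same or a later step of the episode.
    """
    rng = rng or random.Random(0)
    relabeled = []
    for t, (state, action, achieved) in enumerate(episode):
        for _ in range(k):
            future = rng.randrange(t, len(episode))
            new_goal = episode[future][2]
            reward = sparse_reward(achieved, new_goal)
            relabeled.append((state, action, new_goal, reward))
    return relabeled


# Toy episode on a grid: the agent never reached its original goal,
# but the relabeled transitions still carry some successful rewards.
episode = [((0, 0), "right", (1, 0)), ((1, 0), "up", (1, 1))]
extra_transitions = hindsight_relabel(episode, k=2)
```

The relabeled transitions would be added to the replay buffer alongside the original ones, so that even a failed episode provides examples of reaching some goal.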

28.10.2019, **Piotr Kozakowski, MIM UW and Google Brain Mountain View,** Forecasting Deep Learning Dynamics for Hyperparameter Tuning

Abstract: Hyperparameter tuning for deep neural networks is an important and challenging problem. Many person-hours are spent on tuning new architectures on new problems, hence the need for automated systems. Furthermore, some hyperparameters can and should be varied during model training, e.g. the learning rate. I will present an approach based on model-based reinforcement learning, developed during my internship at Google Brain. First, I will frame the problem as a partially observable Markov decision process and present a naive model-free approach to solving it. Then I will introduce SimPLe, a model-based approach based on learning a predictive model of the environment and using it to optimize a policy. I will explain the design choices and technical details of modeling this specific environment using a Transformer language model. Next, I will present the results comparing the model-free and model-based approaches, both in terms of the final performance and computational requirements. I will end with a qualitative analysis of learned hyperparameter schedules.

14.10.2019, **Christian Szegedy, Google Brain Mountain View**, An AI Driven Approach to Mathematical Reasoning.

Abstract: Deep learning has made inroads into machine perception and recently also into natural language processing. Today, most state-of-the-art approaches in NLP, computer vision and speech recognition rely heavily on deep neural network models trained by stochastic gradient descent. Although the same techniques serve as the basis for super-human AI systems for some logical games like chess and Go, high-level mathematical reasoning is still beyond the frontiers of current AI systems. This has several reasons: an infinite, expanding action space, the need for coping with large knowledge bases, the heterogeneity of the problem domain and the lack of training data available in structured form. Here we present an ambitious programme towards a strong AI system that would be capable of arguing about mathematics at a human level or higher and give informal evidence of the feasibility of this direction. Autoformalization is the combination of natural language understanding and formal reasoning where the task is to transcribe some formal content (for example, mathematical text) into structured, computer digestible and verifiable form. I will argue that there is mounting evidence that strong auto-formalization will become possible in the coming years and give an outline of a rough path towards it that is based on recent advances in deep learning.

03.10.2019, **Karol Kurach, Google Brain Zurich**, Google Research Football: Learning to Play Football with Deep RL.

Abstract: Recent progress in the field of reinforcement learning has been accelerated by virtual learning environments such as video games, where novel algorithms and ideas can be quickly tested in a safe and reproducible manner. We introduce the Google Research Football Environment, a new reinforcement learning environment where agents are trained to play football in an advanced, physics-based 3D simulator. The resulting environment is challenging, easy to use and customize, and it is available under a permissive open-source license. In addition, it provides support for multiplayer and multi-agent experiments. We propose three full-game scenarios of varying difficulty with the Football Benchmarks and report baseline results for three commonly used reinforcement learning algorithms (IMPALA, PPO, and Ape-X DQN). We also provide a diverse set of simpler scenarios with the Football Academy and showcase several promising research directions.

====== End of academic year 2018/2019 ======

01.04.2019, **Michał Zawalski**, Visual Hindsight Experience Replay.

Abstract: Reinforcement learning algorithms usually require millions of interactions with the environment to learn a successful policy. Hindsight Experience Replay was introduced as a technique to learn from unsuccessful episodes and thus improve sample efficiency. However, it cannot be directly applied to visual domains. I will show a modification of this approach called Visual Hindsight Experience Replay, which aims to solve this issue. The key part of this approach is a method of fooling the agent into thinking that it has actually reached the goal in a sampled unsuccessful episode.

25.03.2019, **Andrzej Nagórko**, Parallelized Nested Rollout Policy Adaptation.

Abstract: Nested Rollout Policy Adaptation (NRPA) is a Monte Carlo tree search algorithm. It beats more general Monte Carlo tree search algorithms in the domain of single agent optimization problems. I'll show how to parallelize NRPA and discuss performance of the parallel version in the Morpion Solitaire benchmark.

11.03.2019, **Piotr Kozakowski**, Discrete Autoencoders: Gumbel-Softmax vs Improved Semantic Hashing.

Abstract: Gumbel-softmax (Jang et al., Categorical Reparameterization with Gumbel-Softmax, 2016) and improved semantic hashing (Kaiser et al., Discrete Autoencoders for Sequence Models, 2018) are two approaches to relaxation of discrete random variables that can be used to train autoencoders with discrete latent representations. They have not yet been rigorously compared in domains other than language modeling. I will start by describing the two methods and the original results. Then I will analyze their performance, both qualitatively and quantitatively, in an image generation task. I will end by sharing some practical considerations learned while implementing those methods.
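For readers unfamiliar with the first of the two relaxations, a Gumbel-softmax sample can be sketched in a few lines of plain Python. This is a minimal illustration, not the speaker's implementation: the function name, the `temperature` parameter name, and the use of the standard library instead of a deep learning framework are assumptions of this sketch.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=1.0, rng=None):
    """Draw one relaxed (continuous) sample from a categorical distribution.

    Gumbel(0, 1) noise is added to each logit, and a softmax with the given
    temperature is applied. Lower temperatures give samples closer to one-hot.
    """
    rng = rng or random.Random()
    # Clamp the uniform draw away from 0 so that log() is always defined.
    gumbels = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    scores = [(l + g) / temperature for l, g in zip(logits, gumbels)]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


sample = gumbel_softmax_sample([1.0, 0.5, -0.5], temperature=0.5, rng=random.Random(0))
print(sample)
```

The output is a probability vector rather than a hard one-hot choice, which is what allows gradients to flow through the sampling step when the same computation is written in an automatic differentiation framework.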

04.03.2019, **Jakub Świątkowski**, Deep Reinforcement Learning based on Zambaldi et al. "Deep reinforcement learning with relational inductive biases".

Abstract: We will talk about relational deep reinforcement learning, which was applied to train AlphaStar, as described in Zambaldi et al. "Deep reinforcement learning with relational inductive biases".

25.02.2019, **Łukasz Kuciński**, Neural Expectation Maximization, based on Greff et al. “Neural Expectation Maximization”.

Abstract: We will talk about the classical Expectation Maximization algorithm and its differentiable counterpart, as described in Greff et al. “Neural Expectation Maximization”.

18.02.2019, **Konrad Czechowski**, Universal Planning Networks, based on Srinivas et al. "Universal Planning Networks".

Abstract: As the authors write, "A key challenge in complex visuomotor control is learning abstract representations that are effective for specifying goals, planning, and generalization". I'll present how the proposed method, Universal Planning Networks, provides promising results in these directions.

11.02.2019, **Błażej Osiński**, Goal-conditioned hierarchical reinforcement learning (based on “Data-Efficient Hierarchical Reinforcement Learning”, Nachum et al., and “Near-Optimal Representation Learning for Hierarchical Reinforcement Learning”, Nachum et al.).

Abstract: Humans naturally plan and execute actions in a hierarchical fashion - when one plans to go somewhere, they don’t think about every footstep on the way. This hints at using hierarchical methods also in the context of reinforcement learning. Though the idea seems to be obvious, these methods were rarely successfully applied to complex environments. In the presentation, I’ll focus on goal-conditioned methods, which seem to convincingly apply hierarchical RL methods to learn highly complex behaviours.

04.02.2019, **Krzysztof Galias, Adam Jakubowski,** RL for autonomous driving: A case study.

Abstract: We will go over a Reinforcement Learning project for a big automotive company where the goal is to train a car driving policy in a simulator and transfer it to the real world. We will discuss techniques used, lessons learned and share progress on the task.

28.01.2019, **Karol Strzałkowski**, Abstract representation learning (based on 'Decoupling Dynamics and Reward for Transfer Learning', Zhang et al., and 'Combined Reinforcement Learning via Abstract Representations', Francois-Lavet et al.).

Abstract: There are several reasons to try to mix model-based and model-free approaches in reinforcement learning. While in many cases model-free approaches perform better than planning using a model of the environment, a good state space representation might lead to better sample efficiency and easier transfer learning. The authors of the first paper propose a method of learning an abstract environment representation in a modular way, which supports transferability in many ways. The authors of the latter improve this setting and obtain even better sample efficiency and interpretability of the learned representation.

14.01.2019, **Piotr Miłoś**, Dr Uncertainty or: How I Learned to Stop Worrying and Love the Ensembles.

Abstract: Though measuring uncertainty is a fundamental idea in statistics, it has been somewhat absent in deep learning. One of the major obstacles has been the lack of efficient Bayesian learning. While this is still not fully resolved, promising works have emerged recently. In my talk I will give a non-exhaustive overview starting with the papers:

- Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

- Randomized Prior Functions for Deep Reinforcement Learning

07.01.2019, **Mikołaj Błaż**, Policy-guided tree search, based on Orseau et al. "Single-Agent Policy Tree Search With Guarantees".

Abstract: Tree search is a standard and continuously investigated task of Artificial Intelligence. In the first part of the talk I will briefly discuss some common tree search algorithms. The second part will focus on the "Single-Agent Policy Tree Search With Guarantees" paper. Its authors propose two novel policy-guided tree search algorithms with a provable upper bound on the number (or expected number) of tree nodes expanded before reaching a goal state. The algorithms are then analyzed and evaluated on the Sokoban environment.

17.12.2018, **Maciej Klimek, Konrad Czechowski, Maciej Jaśkowski, Łukasz Kuciński**, NeurIPS 2018 summary.

10.12.2018, **Michał Zawalski**, Learning to navigate.

26.11.2018, **Piotr Kozakowski**, Exploration by Random Network Distillation, based on Burda et al. "Exploration by Random Network Distillation".

Abstract: Exploration by Random Network Distillation is a new method that achieved noteworthy results on the Atari game Montezuma's Revenge. I will start by describing the game and the difficulties associated with it, in particular those related to exploration. I will also introduce the exploration problem and some general methods of dealing with it. Then I will describe Random Network Distillation as an exploration mechanism. I will give some basic intuitions and try to justify the method using arguments from Bayesian Deep Learning. Finally, I will give the technical details of the authors' experiments with the method and conclude with a description of the results.

19.11.2018, **Łukasz Krystoń**, Oh et al. "Self-Imitation Learning" and "Contingency-Aware Exploration in Reinforcement Learning".

05.11.2018, **Łukasz Kuciński, Piotr Miłoś**, Seminar programme