AMA with Richard Liaw & Kai Fricke from Ray Tune

Richard Liaw (Team Lead) and Kai Fricke (Software Engineer), both at Anyscale, joined us to talk about their work on Ray Tune, how to choose the right search method, and the use cases for the various subparts of the Ray library.
Angelica Pan
Richard Liaw is currently a PhD student in the Computer Science Department at UC Berkeley. He is also a member of the RISELab, the AUTOLAB, and the Berkeley AI Research Lab. At Anyscale, he leads a team that develops and maintains open-source integrations for the Ray project.
For more AMAs with industry leaders, join the W&B slack community.

Questions

Q: Adrian
I have a question about managing GPU clusters hosted in cloud providers like AWS/GCP. Smaller teams often have cost concerns when running large GPU clusters. Since GPUs are expensive and bill by the hour, these teams want to be able to acquire GPU resources when it comes time to train a model, and then release those resources upon training completion. Do ray clusters support or plan to support any sort of on-demand autoscaling for GPU workloads? If so, are there semantics available that allow an individual job to specify what resources it needs to be added to the cluster (which ideally are then released when the job is finished)?
A: Richard Liaw
Ray clusters already support on-demand autoscaling - there are even experimental features for supporting multiple node types (which I believe addresses your second question): https://github.com/ray-project/ray/pull/9096. This will be productized as part of our Ray 1.0 release in September.
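To make the resource-request side concrete, here's a minimal sketch of how a workload declares the resources it needs; the autoscaler then provisions nodes to meet pending demand and releases them once idle. The training function is a hypothetical stand-in:

```python
import ray

ray.init(address="auto")  # connect to an autoscaling Ray cluster

# Each task declares the resources it needs. The autoscaler adds GPU
# nodes to satisfy pending demand and tears nodes down once they idle.
@ray.remote(num_gpus=1)
def train_model(config):
    # ... run GPU training here (hypothetical placeholder) ...
    return {"status": "done", "config": config}

futures = [train_model.remote({"lr": lr}) for lr in [1e-4, 1e-3, 1e-2]]
print(ray.get(futures))
```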

Q: Sairam
When doing hyperparameter optimization, is there a checklist you would suggest for choosing value ranges? And secondly, would you suggest grid search or Bayesian optimization for a first pass?
A: Kai Fricke
To generate ideas for good hyperparameter ranges, it is often helpful to look at the original authors' papers, or just to see what other people did. Often it takes some intuition, but depending on the problem it might be worth going wild and trying some more extreme parameters. Whether to use grid search or Bayesian optimization depends a bit on the problem. If you have a small problem that doesn't need much time or many resources, a grid search is often useful. It is also great if you have a large number of hyperparameters. For larger problems where you don't want to run many training jobs, you might be better off with Bayesian optimization. Note that BO doesn't work well with a large number of parameters. TL;DR: for estimating ranges, look at what others have done and follow your intuition. Use BO for large problems with a small number of hyperparameters; use grid search for small problems with large numbers of parameters.
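To make the two options concrete, here's a hedged sketch of both in Tune. The toy trainable and metric name are illustrative, and the searcher/config wiring follows the Tune API around Ray 1.0 (older versions passed the space to the searcher directly):

```python
from ray import tune
from ray.tune.suggest.bayesopt import BayesOptSearch  # needs `bayesian-optimization`

def trainable(config):
    # Toy objective standing in for a real training run.
    tune.report(mean_loss=(config["lr"] - 0.01) ** 2)

# Grid search: exhaustive, fine for small/cheap problems.
tune.run(trainable, config={"lr": tune.grid_search([1e-4, 1e-3, 1e-2])})

# Bayesian optimization: better for expensive problems with a small
# number of (ideally continuous) hyperparameters.
tune.run(
    trainable,
    config={"lr": tune.uniform(1e-4, 1e-1)},
    search_alg=BayesOptSearch(metric="mean_loss", mode="min"),
    num_samples=20,
)
```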

Q: Axel Setyanto
Is there a way to use a successive halving algorithm (SHA) and asynchronous SHA (ASHA) to perform early stopping with Ray Tune? Can I combine these with Bayesian optimization?
A: Kai Fricke
Absolutely - we have both synchronous SHA and ASHA schedulers in Tune: https://docs.ray.io/en/master/tune/api_docs/schedulers.html#tune-scheduler-hyperband . I think what you're looking for is BOHB - Bayesian Optimization HyperBand. HyperBand is an extension of ASHA, so it performs early stopping, while the BO part does the parameter suggestions. We have an implementation of BOHB ready to use in Tune: https://docs.ray.io/en/master/tune/api_docs/suggestion.html#suggest-tunebohb .
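A hedged sketch of wiring the two pieces together, based on the linked API docs of the time (TuneBOHB additionally requires the ConfigSpace and hpbandster packages); the toy trainable and ranges are illustrative:

```python
import ConfigSpace as CS
from ray import tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.suggest.bohb import TuneBOHB

def trainable(config):
    for step in range(1, 101):
        # Toy metric standing in for a real training loop.
        tune.report(mean_loss=(config["lr"] - 0.01) ** 2 / step)

# The BO part suggests configurations from a ConfigSpace...
config_space = CS.ConfigurationSpace()
config_space.add_hyperparameter(
    CS.UniformFloatHyperparameter("lr", lower=1e-4, upper=1e-1))
algo = TuneBOHB(config_space, metric="mean_loss", mode="min")

# ...while the HyperBand-style scheduler does the early stopping.
scheduler = HyperBandForBOHB(
    time_attr="training_iteration", metric="mean_loss", mode="min", max_t=100)

tune.run(trainable, search_alg=algo, scheduler=scheduler, num_samples=32)
```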

Q: Boris Dayma
Hi Ray Tune team, I'd be interested in any advice you have on which search method to use, and which parameter range values to pick, when compute is limited.
A: Kai Fricke
Choosing a search method depends on some properties of your task. Is the problem small (i.e. doesn't take long to train) or large? Does it require expensive hardware or not? Do you need to tune a few parameters or many? For expensive problems (those that take longer or require expensive resources) you'll want to limit the number of training runs. So here Bayesian Optimization methods like BOHB can help limit your expenses. Population-Based Training is also a great way to not waste many resources if a parameter schedule is acceptable (or even preferred). For smaller problems, my go-to recommendation would be to use a Random search with early stopping (e.g. ASHA). Usually, this results in good configurations and you quickly see which parameter ranges make sense and lead to good results.
A: Richard Liaw
Boris Dayma, great question! (I had started typing up an answer but hadn't submitted it.) 1. Something that does early stopping (i.e., ASHA or HyperBand) is generally a good go-to. 2. A small random search across a wide parameter range is always a good start (with early stopping), because then you can identify parameter importances. 3. Those parameter importances can then be used to inform a subsequent tuning run.
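As a sketch of that go-to recipe (random search over wide ranges, with ASHA doing the early stopping); the toy trainable, ranges, and metric name are illustrative:

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def trainable(config):
    for step in range(1, 101):
        # Toy loss standing in for a real validation metric.
        tune.report(mean_loss=(config["lr"] - 0.01) ** 2 / step)

analysis = tune.run(
    trainable,
    config={
        # Wide ranges for a first pass; narrow them in a follow-up run.
        "lr": tune.loguniform(1e-5, 1e-1),
        "batch_size": tune.choice([16, 32, 64, 128]),
    },
    num_samples=50,  # plain random search over the space above
    scheduler=ASHAScheduler(metric="mean_loss", mode="min", max_t=100),
)
print(analysis.get_best_config(metric="mean_loss", mode="min"))
```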

Q: Charles
The last time I checked in on the automated hyperparameter optimization literature (mid/late 2010s), there was a lot of excitement around HyperBand and smart random search, rather than Bayesian methods. Is that still the feeling of the community? Have clear winners emerged for specific algorithms (e.g., random forests vs. DNNs) or for specific domains (e.g., computer vision vs. reinforcement learning)? And what are the prospects for fully replacing "graduate student descent"?
A: Richard Liaw
Great question! Here's the view from where I sit (which is obviously biased): 1. HyperBand/ASHA is something of a standard; many if not all AutoML or HPO offerings provide it. 2. In RL, DeepMind leads the research here, and they seem to make extensive use of Population-Based Training. 3. In NLP, everyone seems to be using grid search. 4. Users of classical ML (SVMs, logistic regression, random forests) seem happy with grid search, Bayesian optimization, and the tools scikit-learn offers. 5. There is no particular winner for vision. Graduate student descent is interesting: done properly, the graduate student should be learning and evaluating different hypotheses at each descent step. But perhaps thanks to these tools, life is at least a bit better for those who do graduate student descent purely to optimize the model.

Q: Krisha Mehta
Hi Ray/Tune, what was the biggest challenge the team faced while developing a library that supports computation at such scale?
A: Kai Fricke
Hi Krisha Mehta! One of the things that stands out is that we have to make sure the library actually runs on large clusters. We do most of our development on local machines, but once we run tests in a cloud environment, debugging becomes much harder. Often there are subtle bugs that add up in overhead (for instance, memory leaks), and finding and fixing those can be quite hard. And testing these things is not something you do only once, but periodically.
A: Richard Liaw
Yeah, over the couple of years I've worked on Tune, the hardest problems have come up in making it work seamlessly on a preemptible cluster (where machines can be killed at any point in time). It's hard to think of all the edge cases ahead of time, and reproducing the issues is usually a slow and grueling process (often at the expense of another user).

Q: Yash Kotadia
Ray/Tune: Can you briefly explain how Tune terminates bad hyperparameter evaluations?
A: Kai Fricke
Hi Yash Kotadia, great question - it's always good to look behind the curtain. Tune's scheduling algorithms follow their respective papers; the ASHA paper is one example. In ASHA, you define a reduction factor. If it is 2, for instance, you stop half of the trials after a certain number of steps: you look at the performance of the trials after they have trained for the same amount of time, stop the bottom half, and continue training the top half.
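In Tune, the reduction factor is just a scheduler argument; a minimal sketch (the metric name is illustrative, parameter names per the Tune docs of the time):

```python
from ray.tune.schedulers import ASHAScheduler

# With reduction_factor=2, only the best-performing half of the trials
# at each rung is promoted to keep training; the rest are stopped.
scheduler = ASHAScheduler(
    metric="mean_loss",
    mode="min",
    grace_period=1,      # minimum iterations before a trial may be stopped
    reduction_factor=2,  # keep the top 1/2 at each rung
    max_t=100,           # maximum iterations any trial may run
)
```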

Q: Siddhi Viayak Tripathi
Seems like Ray has been designed to quickly iterate ML experiments into production, a lot of which consists of MLOps tasks. Moving forward, does Ray aim to be a one-stop solution for most MLOps tasks (from training and tuning to serving at scale)?
A: Richard Liaw
Great question! I think MLOps workloads would be a great fit to run on Ray. Broadly speaking, Ray aims to simplify the execution and orchestration of distributed workloads, so a lot of focus will also be placed on other workloads like streaming and data processing.

Q: Aritra RG
Hi Ray/Tune, I would really like you to talk about the complexity and time efficiency of the high-end algorithms you provide. Given how well these scale, what is the thinking of the team responsible for building such scalable and time-efficient algorithms?
A: Richard Liaw
Aritra RG, thanks for the praise! Ray Tune leverages Ray core under the hood. The great thing about Ray core is that it solves all of the scalable-execution problems for you; Ray Tune is simply a wrapper around Ray that provides some extra goodies. Ray Tune is also a product of a lot of user feedback, so many of the reasons we have these great algorithms is that our users and friends pointed out to us where we were lacking.

Q: Illya Kaynov
Hi Ray/Tune. I marvel at the complexity of software engineering required to build such an efficient and easy-to-use library. My question is: what kind of team collaboration framework/software are you currently using that, in your opinion, has added the most value?
A: Richard Liaw
Ah, we use: 1. a lot of Google Docs/Drawings (great for communication and organizing thoughts), 2. Airtable (great for prioritizing and planning), 3. Slack, and 4. GitHub. All four are super useful.

Q: Ajay Arasanipalai
Hyperparameter optimization methods seem great when you can afford to run them, but it seems like we're also starting to see new techniques for getting better baselines/defaults. For example, the fastai learning rate finder and all the newer learning rate schedules + optimizers seem to alleviate, to some degree, the need to perform an extensive sweep over learning rate values. Where do you see this idea of better defaults or automated hyperparameter selection going in the future? Or do you think that training multiple models and focusing on the algorithms that pick the next hyperparameters given previous results is the superior approach?
A: Richard Liaw
Ajay Arasanipalai, this is an excellent question. A learning rate finder is a great thing to use. However, as you know, a learning rate finder gives you a static learning rate. In contrast, we've seen in many recent models that learning rate schedules require quite a bit of parameterization (how fast should you decay? how fast should you ramp up?). Generally, better defaults are a good thing for everyone (imagine if you had to search for the right ResNet model every time!). However, models and training procedures just seem to get more and more complex, so I think the answer is that the two are not mutually exclusive. Rather, think of it this way: better defaults allow you to focus on tuning the other parameters.
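To see why schedules add tuning burden even with a good default base rate, consider this hypothetical warmup-plus-decay parameterization (the helper and parameter names are made up for illustration, not a Tune API): the schedule's own knobs remain search dimensions.

```python
from ray import tune

config = {
    "base_lr": 1e-3,  # e.g. fixed via a learning rate finder
    # The schedule itself still has knobs worth searching over:
    "warmup_steps": tune.choice([100, 500, 1000]),  # how fast to ramp up
    "decay_rate": tune.loguniform(0.9, 0.999),      # how fast to decay
}

def lr_at_step(cfg, step):
    """Hypothetical warmup-then-exponential-decay schedule."""
    if step < cfg["warmup_steps"]:
        return cfg["base_lr"] * step / cfg["warmup_steps"]
    return cfg["base_lr"] * cfg["decay_rate"] ** (step - cfg["warmup_steps"])
```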

Q: Armand McQueen
Have you found using HPO to find a good learning rate schedule to be effective?
A:
Armand McQueen, yeah - it definitely depends on your task and your baseline. For example, we've actually been able to use Population-Based Training (which automatically changes your hyperparameters over time) to beat the original paper baselines when fine-tuning BERT models on certain tasks, using the same compute budget as the original paper's hyperparameter search.
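A hedged sketch of a PBT setup in Tune; the trainable, metric name, and mutation ranges are illustrative, and checkpoint save/restore (which PBT uses to clone strong trials) is omitted for brevity:

```python
import random
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

def trainable(config):
    for step in range(100):
        # Toy metric; a real trainable would also save/restore checkpoints
        # so PBT can clone the best performers mid-training.
        tune.report(eval_accuracy=1 - (config["lr"] - 1e-4) ** 2)

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="eval_accuracy",
    mode="max",
    perturbation_interval=4,  # exploit/explore every 4 iterations
    hyperparam_mutations={
        # Each trial's values change over time, yielding a schedule
        # rather than a single static hyperparameter value.
        "lr": lambda: random.uniform(1e-6, 1e-3),
        "weight_decay": [0.0, 0.01, 0.1],
    },
)

tune.run(trainable, config={"lr": 1e-4, "weight_decay": 0.01},
         scheduler=pbt, num_samples=8)
```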

Q: Elena Khachatryan
In the docs for Ray SGD, it is mentioned that training can be scaled up and down across multiple GPUs and CPUs with two lines of code. Working with Ray, is it advised to let SGD handle all the multi-GPU processing rather than manually allocating and freeing GPU memory?
A: Richard Liaw
Yeah, the idea is that you shouldn't have to think about multi-GPU programming or processing the way you normally would in Torch (torch.device(0), torch.device(1), etc.). Let me know if you have any other questions about this (I may not have fully answered the question).
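For reference, a hedged sketch using the RaySGD creator-function style from around the time of this AMA (exact signatures changed across Ray versions); note there is no manual device placement anywhere:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from ray.util.sgd import TorchTrainer

def model_creator(config):
    return nn.Linear(10, 1)

def optimizer_creator(model, config):
    return torch.optim.SGD(model.parameters(), lr=config.get("lr", 1e-2))

def data_creator(config):
    dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
    return DataLoader(dataset, batch_size=32)

# No torch.device(...) calls: RaySGD places replicas on workers/GPUs
# for you. Scaling up or down is just these two arguments.
trainer = TorchTrainer(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=lambda config: nn.MSELoss(),
    num_workers=2,
    use_gpu=True,
)
trainer.train()
```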

Q: Carlo Lepelaars
Question for Ray Tune: I noticed there are many subparts of the Ray library, and there are multiple ways of performing a particular task (for example, distributed training). Ideally, how would you divide the use cases for these libraries depending on the task (Ray core, SGD, Tune, Serve, Cluster, etc.)?
A: Richard Liaw
Yeah, this is a great question, and I think some of this definitely needs to go into the documentation. Here are some one- or two-line summaries for each. Note that for every library, the same idea of "write your code for your laptop and scale it to a hundred machines without changing your code" applies:
1. Core/cluster: like multiprocessing. Use it when you want to run things in parallel (a minimal sketch follows below).
2. RLlib: reinforcement learning (with many SOTA implementations), for when you want to scale or go to production.
3. Tune: a wrapper for executing your hyperparameter search or multiple-model training. Use it when you don't want to write your own hyperparameter search tool.
4. Serve: for when you want to deploy your model.
5. Cluster Launcher: a nice CLI tool for interacting with your cloud provider.
6. SGD: currently more nascent, but the idea is to be a library for scaling your training. It will come with many features specifically for simplifying distributed training (in the near future).
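To illustrate point 1, a minimal sketch of Ray core as distributed multiprocessing; the per-shard work is a stand-in:

```python
import ray

ray.init()  # on a cluster: ray.init(address="auto")

@ray.remote
def preprocess(shard):
    # Stand-in for real per-shard work.
    return [x * 2 for x in shard]

# Fan out tasks in parallel, like a multiprocessing pool that can
# span a hundred machines without code changes.
shards = [list(range(i, i + 10)) for i in range(0, 100, 10)]
results = ray.get([preprocess.remote(s) for s in shards])
```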

Q: Kevin Shen
What are the benefits/motivations behind using Ray Serve, other than avoiding vendor lock-in? How does Ray Serve improve the scaling process?
A: Kai Fricke
Hi Kevin Shen, it's great to see a question about other parts of the Ray ecosystem! There are a couple of benefits of Ray Serve that I'd like to highlight. First, it is really easy to use: you can deploy a model in only a couple of lines of code, and you can easily do this in one go after tuning your parameters with Tune and training a model. We recently published a tutorial on using Ray Tune and Ray Serve in one end-to-end workflow. As for the benefits of Ray Serve itself, one thing that stands out is that it leverages Ray's distributed computing model and easily scales your backends to many nodes. This is as simple as adding nodes to your cluster and calling ray start.
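A hedged sketch following the Serve API style from around the time of this AMA (the Serve API has changed substantially since); the backend class and the load_trained_model helper are hypothetical:

```python
from ray import serve

class ModelBackend:
    def __init__(self):
        self.model = load_trained_model()  # hypothetical helper

    def __call__(self, request):
        return {"prediction": self.model(request)}

serve.init()
# num_replicas controls horizontal scaling; replicas spread across
# whatever nodes have joined the cluster via `ray start`.
serve.create_backend("my-model", ModelBackend, config={"num_replicas": 4})
serve.create_endpoint("my-model-endpoint", backend="my-model", route="/predict")
```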

Q: Devjyoti Chakraborty
Hello, thank you in advance. In the report, it's written that Ray Tune fires up a server on a localhost port if you are using your local system. What about using Ray Tune with a remote server? How will it assign the port?
A: Richard Liaw
Yep! Ray Tune offers a server for interacting with the ongoing experiment. You can set the port with tune.run(server_port=…).
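A minimal sketch; the port value and trainable are illustrative, and the TuneClient shown for polling is from Tune's web-server module per the docs of the time. On a remote machine you'd reach the server at that node's IP and the chosen port:

```python
from ray import tune
from ray.tune.web_server import TuneClient

def trainable(config):
    tune.report(mean_loss=config["lr"] ** 2)

# Pin the experiment server to a known port instead of a random one.
tune.run(trainable, config={"lr": tune.uniform(0, 1)}, server_port=4321)

# From another process/machine, while the experiment is running:
client = TuneClient(tune_address="127.0.0.1", port_forward=4321)
print(client.get_all_trials())
```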

Q: Kyle Goyette
Hi there, thanks for doing the AMA! I was wondering whether there are any tasks or problem domains where Bayesian hyperparameter search is known to fail to find a good set of hyperparameters?
A: Kai Fricke
One of the main constraints of BO is that it doesn't work well with a large number of hyperparameters. The Gaussian process becomes expensive to train and the acquisition function becomes inaccurate. There are extensions of BO that let us evaluate discrete parameters, but at its core, it's better suited to continuous parameter spaces.

Q: Shawn Lewis
Can you give some examples of large compute jobs using Ray that push its limits? Where are the bottlenecks today?
A: Richard Liaw
I think we sometimes see scaling issues for users who are running Ray clusters with more than 2,000 cores (and somehow exhausting the resource limits of a single machine, since we save logs on the head node). For Tune specifically, scheduling/optimization can become a bottleneck at larger scales. For example, with certain Bayesian optimization methods, the optimization process may become much slower as more data points are added.
