Moving Beyond Databricks Notebooks

This article explains how to migrate away from Databricks notebooks to a more developer-friendly approach to software development. If your team has been using Databricks notebooks and you have code running in production, this article may be exactly what you’re looking for…


At Machine Alchemy we think notebooks are a great tool for data exploration and visualisation. However, for teams working within the Databricks platform, it is common to find codebases composed entirely of Databricks notebooks: Notebooks are used to run other notebooks, and all code is developed inside the Databricks user interface. While this setup works well for one-off tasks, it becomes an overwhelming headache for developing production code. Fortunately it is possible to migrate to a more developer-friendly world while still using Databricks, a world of unit tests, fully-featured IDEs, efficient debugging, and effective collaboration. This article explains the process of migration. While the examples use Python, the migration process is language-agnostic and consists of the following steps:

  1. Move code to version-controlled repositories and use the Repos feature in Databricks.

  2. Develop reusable code using a modern IDE such as VSCode/PyCharm by enabling Arbitrary File Support in Databricks.

  3. Optional: Develop reusable code as installable packages.

It might be the case that your team is already midway through this process; for example, maybe you already use notebooks in version-controlled repositories. If so, I'd encourage you to explore the next step in the process. You may find your team becomes happier and more productive as a result.

Why migrate away from notebooks?

There are a few reasons why extensive use of notebooks in production software often creates a quagmire of technical debt and confusion:

  • Notebooks are difficult to unit-test. Testing is typically a core component of software development because it reduces the likelihood of bugs and eases collaboration.

  • Notebooks are great for data exploration, but lack many of the basic user-friendly features found in modern IDEs such as VSCode and PyCharm:

    • Code linting

    • Quick navigation to definitions

    • Tools for automatically creating docstrings/typehints

    • Integration with testing frameworks

    • Ease of debugging

    • Integration with virtual environment tools to isolate dependencies

  • Notebooks do not naturally encourage software engineering best practice. There is no automatic code formatting, it is not possible to import specific functions or classes from other notebooks, and data scientists often end up copying code from one notebook to another.

The ultimate goal of any development team is to create useful, reliable and maintainable software. This often starts by using version control…

Step 1: Move code to version-controlled repositories and set up Databricks Repos


The following describes a step-by-step procedure for migrating existing code into Repos:

  • Configure your Databricks workspace to enable the Repos feature. The docs for each cloud provider explain how to do this.

  • Create an empty repository using whichever version-control tool your team uses. If you're new to version control, I'd recommend starting with GitHub. However, if your team uses Databricks on Azure, Azure DevOps may be your best choice since it integrates well with other Azure services.

  • Inside Databricks Repos, create a folder for your user and clone your repository into it.

  • In the Databricks user interface, create a branch in the repository.

  • Move the code (notebooks and all) from your Workspace, or wherever it currently lives, into the repository.

  • git commit your changes and git push them to the remote repo (which lives on GitHub/GitLab/Azure DevOps, etc.); example commands are shown after this list.

  • Merge your branch into the main branch of your repository. You should now be able to see all of your notebooks on the main branch of your repository.
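If you'd rather do the commit-and-push step from a local clone of the repository instead of the Databricks UI, the commands look roughly like this (the branch name is just a placeholder):

    git checkout -b migrate-notebooks
    git add .
    git commit -m "Move notebooks into version control"
    git push -u origin migrate-notebooks

The Databricks Repos UI offers equivalent commit and push functionality if you'd prefer to stay inside the workspace.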

Now your Databricks notebooks live inside a version-controlled repository and can be developed by your team using a version-control workflow. For example, you could use this popular modern workflow. This is a big step forward and will help your team collaborate effectively!

Step 2: Allow for non-notebook development by enabling Arbitrary File Support

The aim of this section is to enable developers to begin writing code in .py/.scala/.r files using their preferred IDE, instead of in notebooks. Among a host of other benefits, this will let developers import functionality they've written using ordinary import statements, lint their code, debug efficiently, and write unit tests.

Allowing developers to write code using an IDE such as VSCode/PyCharm is likely to result in the largest improvement in developer-happiness and productivity of anything in this guide.

To allow this, Databricks recently released 'Arbitrary File Support' for Repos. Prior to this, only notebooks could live inside Repos; it is now possible (and advisable) to allow arbitrary file types such as .py, .yaml, and .scala.

The process of enabling developers to work in their preferred IDEs is as follows:

  1. Enable Arbitrary File Support for the repository in Databricks by following this short guide in the official docs.

  2. Developers should now clone remote repositories to their local laptops, and begin writing reusable code as .py/.scala/.r files in their preferred IDE instead of in notebooks.

  3. Developers can then push their code to the remote repository (preferably on a new branch). With ‘Arbitrary File Support’ enabled, this code can be pulled into Databricks using the Repos feature outlined in the previous stage.

  4. There are then two options for running code on a Databricks cluster: 

    • Developers open Databricks, pull their branch using the Repos feature, and run their code in a Databricks notebook.

    • Developers use databricks-connect to run their code from their local laptops on a Databricks cluster.

In general, if developers are writing code for data scientists to use in their notebooks, then the first option is preferable. If developers themselves are submitting jobs to Databricks clusters, then the second option makes more sense because it allows them to both write code and submit jobs from their laptops.
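As a rough sketch of the second option: once databricks-connect is installed and configured (see our blogpost on configuring databricks-connect), ordinary PySpark code run from your laptop executes against the remote cluster. The snippet below is a minimal sketch under that assumption:

    from pyspark.sql import SparkSession

    # With databricks-connect configured, this SparkSession is backed by your remote Databricks cluster.
    spark = SparkSession.builder.getOrCreate()

    # The computation below executes on the cluster, not on your laptop.
    df = spark.range(1000)
    print(df.count())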

Once reusable code has been migrated out of notebooks and into .py/.scala/.r files, you may find you’d still like to use notebooks to execute that code. Arbitrary File Support allows you to import the code you’ve written into Databricks notebooks.

For example, imagine the function my_function used to live in notebook_z and now lives in my_module.py. Previously, the only way to reuse my_function in another notebook was to pull the whole of notebook_z into scope, typically with the %run magic, which runs the entire notebook and dumps everything it defines into the current namespace:
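    %run ./notebook_z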

…can now be replaced with this:
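    from my_module import my_function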

This is a significantly cleaner solution and encourages explicit coding.

Your developers are now free to work in their preferred IDEs, and code is no longer tied to notebooks. This represents a big step towards a developer-friendly environment and is a legitimate basis on which to begin developing software on Databricks.

A note on adding unit tests

If your codebase now consists of some non-notebook code, it is a good idea to start thinking about unit-testing. There are a few different ways of organising unit tests, but an approach which works well here is to store tests in a tests folder next to the code they’re testing, like this:
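    your_repo/
    ├── data_cleaning/
    │   ├── cleaning.py
    │   └── tests/
    │       └── test_cleaning.py
    └── feature_engineering/
        ├── features.py
        └── tests/
            └── test_features.py

(The folder and file names above are purely illustrative.)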

This structure keeps your tests close to the code they’re testing, and makes finding the tests straightforward. For more details on writing unit tests in Python, see this post from RealPython. Our team has had a lot of success using PyTest in particular.
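If you're new to PyTest, a test is simply a function whose name starts with test_ and which uses plain assert statements. Here's a trivially small, self-contained sketch (in practice the function under test would be imported from your source code rather than defined in the test file):

    # tests/test_example.py -- run with the `pytest` command from the repository root
    def add_one(x: int) -> int:
        return x + 1

    def test_add_one():
        assert add_one(1) == 2
        assert add_one(-1) == 0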

(Optional) Step 3: Organise Reusable Code into Installable Packages

If one of the following scenarios applies to you, you may want to consider developing reusable code as installable packages, which can be installed onto your Databricks cluster:

  • Your engineering/development team creates functionality that your data scientists (or any other group of users) use in their notebooks.

  • You feel you do not need to use notebooks at all (apart from maybe to call the software you've written).

Developing code as installable packages facilitates easier development and testing, and broadens your options for deployment. Python, R and Scala each have their own method for package creation. In the case of Python, we recommend using a tool called pyscaffold to create packages.

Create a Python package

In our experience it’s usually easiest to create a new repository to hold your package, rather than wrestling an existing codebase into a package structure. The following steps run through how to create a new Python package using pyscaffold:

  1. On your local machine, install pyscaffold using pip install pyscaffold

  2. Choose a name for your package (e.g. my_awesome_product).

  3. Run putup my_awesome_product. This will create a directory containing all the metadata and basic structure for your package.

  4. The src folder is designed to hold all of your source code (i.e. all the reusable code you develop).

  5. Set up a new empty repository using your version control system (e.g. GitHub).

  6. Push the package from your local machine to this repository. Instructions for how to do this are typically provided by your version control system when you create an empty repository.

  7. Now begin the process of migrating code from your existing codebase into your new package.

  8. Your package can now be installed onto your local machine. We advise installing it into a new virtual environment (see this useful tutorial on virtual environments). Simply navigate to the root of your package and run pip install -e . (note the trailing dot). This installs your package in 'editable' mode, meaning any changes you make to the code are automatically reflected in the installed package!
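Putting those command-line steps together, the local workflow looks roughly like this (the package name and remote URL are placeholders):

    # generate the package skeleton (pyscaffold also initialises a git repository for you)
    pip install pyscaffold
    putup my_awesome_product
    cd my_awesome_product

    # point the generated repo at your empty remote repository and push
    git remote add origin https://<my-repository-url>
    git push -u origin main   # or master, depending on your git configuration

    # install the package locally in editable mode, ideally inside a fresh virtual environment
    pip install -e .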

Install the package onto a Databricks cluster

Now that you’re developing an installable package, you can install it onto your Databricks cluster for use by other users or as part of a production pipeline. Common options for doing this include:

  • Install the package onto a cluster using databricks-connect (explained below)

  • Install the package onto a cluster by copying the package to the Databricks filesystem (this process can be integrated into a CI/CD pipeline in Azure DevOps, as explained in this article by Meziness)

  • Install the package into specific notebooks by using %pip and pointing at the remote repository, like so:

    %pip install git+https://<my-repository-url>

The official documentation explains how to set this up: https://docs.databricks.com/libraries/notebooks-python-libraries.html#install-a-library-from-a-version-control-system-with-pip

If you’re developing in Python, the option I’d recommend starting with is the first one: to use databricks-connect to install your package onto a Databricks cluster. If you haven’t configured databricks-connect, instructions for how to do so can be found in our blogpost.

Once you’ve configured databricks-connect, install the package onto your cluster using the following commands:

  1. Create a wheel file containing your package. If you used pyscaffold to set up your package, this can be done using tox (you may need to pip install tox):

    tox -e build

    The above command should create a wheel (.whl) file inside the `dist` directory of your package.

  2. Copy your package onto the Databricks filesystem:

    databricks fs cp dist/<wheel_filename> dbfs:/libraries/<wheel_filename> --overwrite

  3. Install the package onto a specific cluster using:

    databricks libraries install --whl dbfs:/libraries/<wheel_filename> --cluster-id <cluster_id>

    Tip: to obtain the ID of the cluster on which you’d like to install, you can use:

    databricks clusters list

Now your package is available to anyone using the cluster!
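Once the library is installed, any notebook attached to that cluster can import the package directly. With the pyscaffold layout described above, the import name matches the package name:

    # in a notebook attached to the cluster
    import my_awesome_product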

The above is a manual workflow, and you should consider integrating it into an automated CI/CD pipeline. This blog by Meziness explains an option for automatically installing Python packages onto Databricks clusters using Azure DevOps pipelines.

This article aimed to help your team adopt a more software-oriented approach to developing code in Databricks. I hope you found it useful! If you have any questions please get in touch at lachlan@machinealchemy.com.

Lachlan McLachlan

Lachlan is a Machine Learning Engineer and data-nerd. He’s one of the co-founders of Machine Alchemy and loves nothing more than receiving no comments on his merge requests.
