Writing reusable code in Python: the tools you need
The ability to write code that’s reusable often differentiates the best data scientists and engineers from the rest. If you’re a data scientist, machine learning engineer, data engineer, or data analyst, and you want to start writing reusable code in Python instead of throwaway scripts, this article aims to equip you with the tools and processes you need to get started. At Machine Alchemy we regularly work with businesses that benefit from implementing these practices (and beyond) to boost their teams’ happiness and productivity.
The most common steps involved in taking your code to the next level are listed below. If you’ve already ticked a few off the list, then use this article to fill in the blanks!
- Using version control
- Using a modern IDE such as VSCode/PyCharm
- Developing code as installable Python packages
- Using virtual environments to isolate dependencies
Using Version Control
Any code that you write for your work or your personal projects should be version controlled using Git. Git allows you to track changes made to your code, and easily collaborate with others on a shared codebase (note: your future self is also a collaborator and will thank you for using Git).
The barrier to entry for using git is low, but the exact process of using it can often be confusing. The hidden truth of git is that most developers do not understand its intricacies; they just use it. Using git is like driving a car: you don’t need to understand how it works under the hood to use it effectively. With that said, here’s how to use it:
Using git locally
- Check that git is installed on your machine by opening up a terminal and running git --help. If this complains that git isn’t installed, then follow these instructions.
Open a terminal and navigate to the parent directory of your code. If your code isn’t organised into a directory structure, then consider grouping it into a single folder, with as many subfolders as you need to structure it. For example:
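One hypothetical layout (all of the folder and file names below are purely illustrative):

```
my_project_codebase/
├── etl/
│   ├── extract.py
│   └── transform.py
├── analysis/
│   └── plots.py
└── README.md
```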
In the terminal, navigate to the root folder of your codebase (in the example above this would be my_project_codebase) and run git init. This tells git to start tracking changes made to the code inside this folder. You should be able to see a .git folder if you list all of the folder’s contents (including hidden files).

Now you can start using git to track your codebase. The way to use git is as follows: each time you make a change to the codebase (for example you write a new function, add some documentation, or delete some old code), you take a snapshot of your codebase. This means you can code safe in the knowledge that if everything goes wrong, you can undo your changes. The workflow is as follows:
- Make a change to the codebase.
- From the root of your codebase, run git add -A. This tells git that you want to add the changes you’ve just made to your next snapshot of the codebase (the -A means 'everything that has changed').
- Run git commit -m '<commit message>', replacing <commit message> with a short description of what you’ve changed. For example: git commit -m 'remove unused load_raw_data function from ETL pipeline'
- Rinse and repeat.
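Putting those steps together, a single iteration of the local workflow might look like this (the commit message is just an example):

```
git init      # run once, from the root of the codebase
# ...edit some files...
git add -A    # stage everything that has changed
git commit -m 'remove unused load_raw_data function from ETL pipeline'
# ...edit, add, commit, repeat...
```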
Pushing to a remote repository
In order to ensure your codebase doesn’t die with your laptop, and to enable collaboration, it’s worth pushing your code to a remote repository. A remote repository is a copy of your local repository. It gets updated every time you git push your changes to it.
To set up a remote repository:
- Create an account with a provider such as GitHub, or with whichever version control platform your company uses (it might be GitLab, Azure DevOps, Bitbucket, etc.).
- Create a new repo using the platform. Give it the same name as the folder on your local computer that contains your code; in the above example, I’d call this repo 'my_project_codebase'. In GitHub, for example, navigate to Repositories -> New. Importantly, do NOT initialise this repo with any files inside it (e.g. do not initialise it with a README, as is sometimes offered as an option). We need an empty repository.
- Follow the instructions provided by your platform to push an existing codebase to the remote repo. In GitHub, for example, these are shown immediately after the repo is created.
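As a rough guide, the commands your platform gives you will usually look something like the following (the URL is a placeholder for whatever your platform generates, and the default branch name may differ):

```
git remote add origin https://github.com/<your-username>/my_project_codebase.git
git branch -M main
git push -u origin main
```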
Git workflow
Without branching and merging (see next section), the basic git workflow from your local laptop’s command line is:
- git pull (ensures you have the latest changes from the remote repo)
- …write a small amount of code…
- git add -A (includes all changes in the upcoming snapshot)
- git commit -m '<commit message>' (takes a snapshot of your code)
- git push (pushes your snapshot to the remote repo)
- repeat…
Branching and merging
Branching involves creating isolated versions of a codebase on which you can make changes. It’s extremely useful when working with other people: each person works on a separate branch and then merges their changes into a common ‘main’ or ‘master’ branch.
A general rule for branching is to create one branch for each small new piece of functionality you want to add to the codebase. Branches should live for a few days at a maximum and should not involve large sweeping changes. If big changes are required, the best thing to do is to split them up into smaller stages and create a branch for each small stage.
The branching and merging workflow is the same as the git workflow described above, except that it now starts with a branch and ends with a merge.
The git workflow now looks like this:
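A minimal sketch of one iteration, with a made-up branch name; in practice the final merge is often done via a pull request on your hosting platform rather than on the command line:

```
git pull                              # make sure your local main branch is up to date
git checkout -b add-data-validation   # create and switch to a new branch
# ...write a small amount of code...
git add -A
git commit -m 'add validation for input data'
git push -u origin add-data-validation
# then open a pull request and merge the branch into main on your platform
```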
Scenarios you may encounter
It’s important to remember that this is a basic workflow and you’ll encounter scenarios in which you’ll need to alter it to fit the situation. Here are a few scenarios you may encounter and suggestions for how to approach them:
- Committing a subset of your changes: change the git add command to point to specific files instead of using the -A option.
- Resolving merge conflicts: both VSCode and PyCharm have extremely good interfaces for resolving merge conflicts.
- Undoing changes made in a previous commit: for this you’ll need to learn about revert commits.
- Discarding all your current changes since the most recent commit: for this you may want to use the reset command.
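To illustrate a few of these (the file path and commit hash are placeholders, and note that a hard reset permanently throws away uncommitted work):

```
git add src/etl/transform.py   # stage only a specific file
git revert <commit-hash>       # create a new commit that undoes a previous one
git reset --hard               # discard all changes since the most recent commit
```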
Using a Modern IDE
If you’re looking to write reusable code, I would recommend using either PyCharm or VSCode. These are (as of Jan 2022) the most widely used integrated development environments (IDEs). IDEs boost productivity and preserve sanity by providing:
An easy-to-use, customisable interface.
Syntax highlighting and error detection.
Efficient debugging tools.
…and a ton of other functionality (integrated version control, support for multiple languages, user-defined code snippets, the list goes on)
VSCode
Developed by Microsoft, VSCode is a free-to-use, lightweight, modular and sleek IDE with all the functionality required to develop reusable software in pretty much any language. It can be downloaded from here.
PyCharm
A JetBrains product, PyCharm is a fully-featured IDE with an exceptionally good debugger and an easy-to-use interface. It has a slightly steeper learning curve than VSCode, and much of the initial learning involves figuring out which features not to use. PyCharm makes dealing with merge conflicts easy via a well-designed interface. It can be downloaded from here.
Overall I would recommend using the same tool as your colleagues (if they use one). Using the same tool as your colleagues means you’ll have a great support network and pair-programming will be easier. I used to use PyCharm, but then switched to VSCode because my colleagues did. Ultimately these are both great IDEs but if you’re looking to get started quickly and your team doesn’t already use VSCode or PyCharm, then I’d recommend VSCode.
Developing code as an installable Python package
This really takes your skills to the next level. If you’re looking to develop reusable code, packaging your code into an installable library is often a sensible choice. This enables you to import your functions and use them like you would any other Python library. For example:
from my_package.module import awesome_function
This is useful because it enables you to:
Organise your code in a clear structure
Install specific versions of your package into certain environments (e.g. installing the master-branch into your production environment)
Easily write unit tests for your classes/functions.
List your package’s dependencies and automatically install them when your package is installed
Not have to worry about relative imports
Creating a package
In order to start writing code as an installable package, I would recommend using a tool called PyScaffold. To create a Python package, install PyScaffold and run:
putup <name_of_package>
This will create a directory containing a basic package structure. Do not feel daunted by all the different files in this folder. The most important is setup.cfg. This file contains metadata about the package, such as the description, where to find the source code (spoiler: in the src folder), and a list of the package’s dependencies.
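A trimmed, illustrative setup.cfg is sketched below; the exact contents generated by PyScaffold vary by version, and the package name and dependencies here are placeholders:

```
[metadata]
name = my_package
description = Reusable data science utilities
author = Your Name

[options]
package_dir =
    =src
packages = find_namespace:
install_requires =
    numpy
    pandas>=1.3

[options.packages.find]
where = src
```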
Developing code
Now it’s time to start writing code! For this you’ll need to install your package into your local environment. You may want to consider using a virtual environment for this, which is explained in the next section. However for now just open up a terminal, navigate to your package’s root directory (the one with setup.cfg inside), and run the following command to install your package into your current environment:
pip install -e .
This command installs your package just like any other package. The -e flag tells pip to install in 'editable' mode: whenever you change the code in your package, the installation automatically reflects those changes without you having to re-install it!
Now you’re free to develop code inside the src folder of your package. Here’s an example of one module in a package importing and using a function from another module in the same package:
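A minimal sketch, with made-up module and function names (the package is assumed to be called my_package, as in the import example earlier):

```
# src/my_package/cleaning.py
import pandas as pd


def drop_empty_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows that are entirely empty."""
    return df.dropna(how="all")


# src/my_package/pipeline.py
import pandas as pd

from my_package.cleaning import drop_empty_rows


def run_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    """Clean the raw data before any further processing."""
    return drop_empty_rows(df)
```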
Now you can treat your functions/classes like any other installable Python library!
Testing your package
Now that you’re working on an installable Python package, you may want to start writing unit tests for your functions/classes. In general it is worth unit-testing reusable functionality because it reduces the chances of bugs both when writing the code in the first place, and when changing it in the future.
A common way to organise tests is to mirror the structure of the src folder in a folder called tests, such that every <module>.py file in src has a corresponding test_<module>.py file in tests. Here’s an example showing how this works:
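A sketch of that layout, reusing the hypothetical modules from the previous example:

```
my_package/
├── setup.cfg
├── src/
│   └── my_package/
│       ├── cleaning.py
│       └── pipeline.py
└── tests/
    ├── test_cleaning.py
    └── test_pipeline.py
```

And an illustrative test for the cleaning module:

```
# tests/test_cleaning.py
import pandas as pd

from my_package.cleaning import drop_empty_rows


def test_drop_empty_rows_removes_fully_empty_rows():
    df = pd.DataFrame({"a": [1, None], "b": [2, None]})
    result = drop_empty_rows(df)
    assert len(result) == 1
```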
There are a few testing frameworks in Python, but a popular and easy-to-use framework is pytest. Once you’ve written a test, navigate to the root folder of your package and run:
pytest
This will search through the package for modules and functions beginning with the word ‘test’. Notice how in the example above the test module is prefixed with ‘test’, and so is the test function itself. This is required for pytest to find the tests.
Using virtual environments to manage dependencies
I remember being intimidated by anything with the word ‘virtual’ in front of it. Virtual environments sounded daunting but in actuality they are a simple concept.
Whenever you run code in Python, there is a Python executable somewhere on your machine that gets used. On a Mac, this is often /usr/bin/python. If you’re on a Mac or a Linux system you can run which python to find out where Python lives. This Python executable is called the ‘system’ Python, because it’s the one installed onto your computer and called by default. This ‘system’ Python executable is also linked to the Python packages installed on your machine (packages like numpy, pandas, sklearn, etc.).
A virtual environment is just another Python executable and set of Python packages, stored somewhere else on your machine. Crucially, it’s isolated from your system Python executable and packages, meaning you can install any Python packages into it without affecting the system Python environment or any other virtual environments. This makes virtual environments perfect for developing code, because you can install your packages into virtual environments and not worry about conflicting dependencies. For example, if package A you’re developing requires numpy 1.1.2 and package B requires numpy 2.1.0, you’d just create two separate virtual environments with different versions of numpy installed. A good rule of thumb is to create one virtual environment for every package you develop.
Tools for managing virtual environments
A good tool to start with is virtualenv. It’s a relatively simple-to-use tool for creating virtual environments:
- Install virtualenv by running pip install virtualenv.
- Create a virtual environment (I would personally recommend doing this outside your package/repo) by running virtualenv <name of environment>. Give the environment a name which matches the project it’s going to be used for. This will create a folder containing a Python executable, a set of Python packages, and a few other files.
- Activate the virtual environment by running source <name of environment>/bin/activate. Now you can install packages and develop Python code in an isolated environment!
- The virtual environment can be deactivated by running deactivate.
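Putting this together with the editable install from earlier, a typical session might look like the following (the environment name is illustrative, and the activate command shown is for macOS/Linux shells):

```
pip install virtualenv
virtualenv my_project_env               # create the environment
source my_project_env/bin/activate      # activate it
pip install -e <path to your package>   # editable install of your package
# ...develop, run tests, commit...
deactivate                              # leave the environment when you are done
```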
This article has aimed to introduce some of the core tools and processes used when developing reusable code in Python. I hope you’ve found it useful! Any questions please get in touch at lachlan@machinealchemy.com. At Machine Alchemy we’re happy to help businesses implement these tools and beyond to increase the productivity and happiness of their teams!