It doesn't just matter what you have to say: how you say it matters too. This holds true of information and mathematics, but also of coding. In this post, I try to give a brief introduction to best practices that help make sure you are understood when you write Python code.

Overview

This post was designed specifically with Python in mind, with the aim to help coders follow best practices while being minimally constraining.

When coding in other languages, such as Java and C++, very different standards are sometimes used. Conversely, the wealth of guides written for those (typically lower-level) languages is not necessarily a good match for Python. One example is design patterns, best exemplified by and developed for Java, which don't always translate well to Python. Similarly, OOP (Object-Oriented Programming) is now considered a little outdated. Encapsulation, for instance, makes little sense in Python, where nothing is truly private.

Here, I aim to give you broad guidance and direction on writing better code while restricting you as little as possible.

Motivation: the why

Let's say you just created an amazing piece of code and contributed it to your favourite open-source project, or even an internal project. You wrote the code, so you should understand it, but others may not.

Fortunately, it should be straightforward to make it clear for others. If you truly understand your code, you should be able to explain it clearly through a clean flow, clear documentation, examples and comments.

I can already hear some of you asking why you would have to follow best practices and change anything - "After all, existing conventions are wrong, my own ones are better, I'm just faster, better, smarter". Let's assume that is the case and leave the Dunning–Kruger effect aside for a moment.

Instead of arguing with you, I'll quote Richard Feynman, who wrote about this very topic in his autobiography "Surely You're Joking, Mr. Feynman!". Speaking of trigonometry, and of how he originally came up with his own notation before realizing the importance of established, widely understood notation, he said:

“I thought my symbols were just as good, if not better, than the regular symbols — it doesn’t make any difference what symbols you use — but I discovered later that it does make a difference. Once when I was explaining something to another kid in high school, without thinking I started to make these symbols, and he said, “What the hell are those?” I realized then that if I’m going to talk to anybody else, I’ll have to use the standard symbols, so I eventually gave up my own symbols.”

And herein lies the core of it: (at least in most cases) you're not writing in a cave, and you're not writing code for yourself. You're writing code to be read by others, who may not have your knowledge of the problem, nor be familiar with the context or the way you typically write code.

The goal of good code writing is to write correct code that can be easily used, read, understood and maintained. The rest of this document aims to give guidelines for this, in Python.

What do you have to say?

This should be an obvious one, but the first step is to get intimately familiar with what you are trying to accomplish. What will your code do, how and why?

Ce que l’on conçoit bien s’énonce clairement, Et les mots pour le dire arrivent aisément

This was said by the famous French poet Nicolas Boileau, who also happened to be passionate about style and wrote a book on how to write well (L'Art poétique). It translates roughly to:

What we conceive well can be clearly expressed, And the words to express it come easily

"But Jean, you're talking about literature, this is science, it doesn't work the same", some might say. I'd respond that in literature as in science, it is all about rigour, and I'd adapt Boileau's saying for Machine Learning to:

What we conceive well can be clearly coded, And the API to express it comes easily

And if you don't trust me, ask ChatGPT.

Guiding principles

So, we agree on the need to write good code, and you are clear on what you want to achieve: now what?

A trap many beginners fall into is to focus on trying to write the most impressive, cleverest piece of code, just as some academics try to impress with hard-to-read math. Both are equally bad.

Write code to be read and maintained by someone who doesn't know the work or the code (this may be you in a few months/years!). As always, take all of these with a pinch of salt, and use your best judgment.

Good code design can sometimes be a personal thing, but this is about finding a common ground and set of common rules. When in doubt, discuss so you can converge to a shared compromise.

Writing clear code

Your code should be self-explanatory. This means following existing syntactic conventions so that anyone reads the codebase in the same way, choosing good variable names, and writing documentation and unit-tests.

This means:

  • Use clear variable names
  • Follow the PEP8 style
  • Document and comment your code
  • Verify correctness with self-contained unit-tests

A rose by any other name

Use clear, descriptive variable names, including (and especially) internal variables not exposed to the user.

For instance, avoid names such as aa, NcD, etc. unless there is a specific reason. There is no need to be extra verbose either (e.g. number_of_dimensions is overkill, D is ambiguous, n_dim or n_dimensions are good).

Reserve CamelCase (a name without spaces where each word starts with a capital letter) for class names. For variables, use underscore_separated_words. Typically, all caps is reserved for global variables, which are almost never used in Python: DO_NOT_USE_THIS_UNLESS_YOU_KNOW_WHAT_YOU_ARE_DOING.

A common pitfall I often see, especially from researchers and students coding a paper, is to follow verbatim the notation of the paper being implemented. While variables such as the ubiquitous x are common in math, we try to avoid them in code. Instead, give your variables descriptive and helpful names such as n_dim, ground_truth, prediction, etc.
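As a small illustration, here is a sketch of the same computation written twice: once with paper-style notation and once with descriptive names. The function and its names are hypothetical, chosen only for this example.

```python
def f(x, y):
    # Paper-style notation: what are x and y? What does f compute?
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def mean_squared_error(ground_truth, prediction):
    """Average squared difference between targets and predictions."""
    n_samples = len(ground_truth)
    squared_errors = [
        (target - predicted) ** 2
        for target, predicted in zip(ground_truth, prediction)
    ]
    return sum(squared_errors) / n_samples
```

Both functions return the same value, but only the second one can be read without having the paper open next to it.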

Syntax: PEP8

Before you submit your changes, you should also make sure your code adheres to the style guide of the project you are contributing to, as well as [PEP8]. Note that PEP stands for Python Enhancement Proposal. PEP8 is an early one that established what has become the standard style for the language. The easiest way to take care of the formatting side of compliance is with black:

pip install black
black .

As a side note, when in doubt, follow the established standards of the project you are contributing to, rather than rigidly following a given rule.

Documentation

A crucial part of good code is its documentation, in the same way that you'd have a notation table in a paper and prose to explain an equation. It consists of comments (where needed to explain a particularly complex or unintuitive line of code) and docstrings, which detail a function's (or class's) role, its input parameters, what it returns, and any notes and examples of use.

I tend to use the NumPy style. For instance, for a function, we expect a docstring with the following structure:

def function(arg):
    """One line description

    Longer description,
    possibly in several lines

    Parameters
    ----------
    arg : type
        description
        **Notice the space before and after the colon!**

    Returns
    -------
    variable : type
        description

    Examples
    --------
    text
    >>> code
    expected result

    Notes
    -----
    Detailed explanation
    """
    pass

Notes and Examples are optional, but always describe the function/class' role and input parameters.
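To make the template concrete, here is a sketch of a filled-in docstring. The function rescale and its parameters are hypothetical and exist only for this illustration.

```python
def rescale(values, lower=0.0, upper=1.0):
    """Linearly rescale values to a given range.

    The smallest value is mapped to `lower` and the largest to `upper`.

    Parameters
    ----------
    values : list of float
        Values to rescale. Must contain at least two distinct values.
    lower : float, default=0.0
        Lower bound of the output range.
    upper : float, default=1.0
        Upper bound of the output range.

    Returns
    -------
    rescaled : list of float
        Rescaled values, in the same order as `values`.

    Examples
    --------
    >>> rescale([1.0, 2.0, 3.0])
    [0.0, 0.5, 1.0]
    """
    smallest, largest = min(values), max(values)
    scale = (upper - lower) / (largest - smallest)
    return [lower + (value - smallest) * scale for value in values]
```

A nice side effect of the Examples section is that it can be checked automatically, for instance with python -m doctest on the file, so the documentation is less likely to drift away from the code.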

In the docstring, use single backticks for variable names: `variable`.

Double backticks are used for inline code: ``inline code``.

For blocks of code, use a double colon, leave a blank line and indent the lines of code:

::

   block of code
   on
   several
   lines...

Unit-tests: if it's not tested, it's broken

Each new function should be accompanied by a small function that checks its behaviour for correctness. Ideally, you want to test at least for mathematical/algorithmic correctness and API correctness. I would go further and suggest you write the unit-test before you write the actual function. And every time you find or fix a bug, ideally add a test case to make sure it does not happen again in the future!

For a function foo contained in module.py, you should add a file test_module.py which contains one (or more) test function(s) for each of the functions in that module (in our case, foo). This allows continuous integration tests to be run automatically every time a change is made in the repository, ensuring that the changes are not breaking any functionality.

I like to use pytest for this, as I find it very intuitive and it lets me easily run the tests without getting in my way.

The test for foo should live in the file test_module.py and look like this:

def test_foo():
    """Unit-test for foo

    Notice that the function does not take any argument
    """

You can simply run pytest in your folder to run all these tests.
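Here is a sketch of what this could look like in practice. The body of foo is a made-up stand-in (a simple mean), and both files are shown in a single block so the example is self-contained; in a real project they would be two separate files.

```python
# --- module.py (hypothetical) ---
def foo(values):
    """Return the mean of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("foo() requires at least one value")
    return sum(values) / len(values)


# --- test_module.py (hypothetical) ---
# In a real project this file would start with: from module import foo
def test_foo_correctness():
    """Algorithmic correctness on a case worked out by hand."""
    assert foo([1, 2, 3, 6]) == 3.0


def test_foo_api():
    """API correctness: accepts lists and tuples, returns a float."""
    assert isinstance(foo((1, 2)), float)


def test_foo_empty_input():
    """Regression test: empty input raises a clear error."""
    try:
        foo([])
    except ValueError:
        pass
    else:
        raise AssertionError("foo([]) should raise ValueError")
```

pytest collects every function whose name starts with test_ and reports each failing assert. With pytest imported, the last test is more commonly written using the pytest.raises(ValueError) context manager; the plain try/except above only keeps the sketch free of dependencies.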

We typically don't use the syntax:

if __name__ == '__main__':
    ...  # do something

This induces a bunch of unintuitive behaviours that lead to hard-to-debug issues (e.g. local imports failing, a local module being used instead of the one on the global path, etc.).

If you remember one thing, let it be this quote from Bruce Eckel:

If it is not tested, it is broken!

You might know Bruce from his excellent book Thinking in C++ [Eckel2000]. Why am I using Python then? And why am I quoting the author of a famous C++ book? Well, first of all, I started with C and C++ long before I made the jump to Python. It is a jump I do not regret: Python is much better suited to my research, but that is a story for another post.

For now, I will leave you with these other quotes by Bruce Eckel:

Life is short
(You need Python)

and

Life's better without braces.

Do not abuse classes and inheritance

You might be tempted to adopt a Java or C++ approach to coding, but this is Python. Favour composition over inheritance. In general, try to keep the number of classes to a minimum, and do not create virtual (abstract) classes unless really justified.

As Zed Shaw puts it, in Learn Python the HARD WAY [LearnPythonShaw]: "Most of the uses of inheritance can be simplified or replaced with composition, and multiple inheritance should be avoided at all costs". He also gives the following three rules of thumb:

  1. Avoid multiple inheritance at all costs, as it’s too complex to be useful reliably. If you’re stuck with it, then be prepared to know the class hierarchy and spend time finding where everything is coming from.
  2. Use composition to package up code into modules that are used in many different unrelated places and situations.
  3. Use inheritance only when there are clearly related reusable pieces of code that fit under a single common concept or if you have to because of something you’re using

Once again, in summary, the goal is to make the code easy to read and maintain, and to spare readers and maintainers from having to browse recursively through the files to understand some odd behaviour.
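As a minimal sketch of what composition looks like, assume two hypothetical classes: a smoother and a loss tracker. Rather than making the tracker inherit from the smoother, the tracker simply holds one.

```python
class MovingAverage:
    """Smooth a stream of values with a fixed-size window."""

    def __init__(self, window_size=3):
        self.window_size = window_size
        self._window = []

    def update(self, value):
        # Keep only the most recent window_size values.
        self._window.append(value)
        self._window = self._window[-self.window_size:]
        return sum(self._window) / len(self._window)


class LossTracker:
    """Track a training loss; it *has* a smoother rather than *being* one."""

    def __init__(self, smoother=None):
        # Composition: the behaviour is passed in and is easy to swap.
        self.smoother = smoother if smoother is not None else MovingAverage()
        self.history = []

    def record(self, loss):
        smoothed = self.smoother.update(loss)
        self.history.append(smoothed)
        return smoothed
```

Anything with an update method can be passed in as the smoother, and a reader only needs to open one class to see where the behaviour comes from.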

Writing code efficiently

Premature Optimization

You have probably heard this quote before:

“The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.”

This wording comes from Donald Knuth's 1974 Turing Award lecture, "Computer Programming as an Art"; the maxim is also often attributed to Sir Tony Hoare.

In practice: first write code that works and is well tested. Then, benchmark to find the bottlenecks. Lastly, optimize.
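For the benchmarking step, the standard library's cProfile is one possible starting point. The sketch below profiles a made-up pipeline; slow_pairwise_sum, pipeline and profile are hypothetical names used only for this illustration.

```python
import cProfile
import io
import pstats


def slow_pairwise_sum(values):
    """Deliberately naive O(n^2) function, the hot spot in this sketch."""
    total = 0
    for first in values:
        for second in values:
            total += first * second
    return total


def pipeline(n_values=200):
    """Hypothetical pipeline: cheap set-up followed by one expensive step."""
    values = list(range(n_values))
    return slow_pairwise_sum(values)


def profile(function):
    """Run `function` under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(function)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the five entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


result, report = profile(pipeline)
print(report)
```

The report lists where the time is actually spent, so the optimization effort goes to the measured bottleneck rather than to the line you suspected.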

The best way to end up with bad or wrong code is to start optimizing too early. You may end up going down the rabbit hole and missing the important part. One obvious example is parallelism. You typically want to parallelize at the highest level. Enabling parallelization at a lower level would probably result in race conditions, or in inefficient code where the threads/processes are waiting for each other.
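A minimal sketch of parallelizing at the highest level, using the standard library's concurrent.futures. The functions run_experiment and run_all are hypothetical, and the loop inside run_experiment is only a toy stand-in for real work.

```python
from concurrent.futures import ThreadPoolExecutor


def run_experiment(seed):
    """One independent unit of work; deliberately sequential inside."""
    state = seed
    for _ in range(1000):
        # Toy linear congruential update standing in for real work.
        state = (1103515245 * state + 12345) % 2**31
    return state


def run_all(seeds, n_workers=4):
    """Parallelize over whole experiments, i.e. the outermost loop."""
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        # map preserves input order; the tasks share no mutable state.
        return list(executor.map(run_experiment, seeds))
```

Because each task is a whole experiment that shares nothing with the others, there is nothing to lock and nothing to wait for. For CPU-bound pure-Python work, ProcessPoolExecutor exposes the same map interface and may be the better fit; threads are used here only to keep the sketch simple.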

As always, apply good judgment: we are not saying that you should not think about optimization or give thought to efficiency during the design/planning phase [PrematureOptimizationFallacy]!

Communicate!

You might be surprised to hear that coding is actually a small portion of writing software. Most of the time should be spent designing, planning and, most importantly, communicating.

Communication is core to any good collaborative effort, especially software. Communicate efficiently and often. Don't go off coding what you think is needed in the way you think is best without talking to others: it is the best way to add work for everyone. In that way, failure to communicate could be seen as a special case of premature optimization.

Finally, remember that everyone is here to contribute to the same project and make it successful: be kind and assume best intentions. A lot of subtlety is lost when communicating through messages or emails. If you are having a bad day, it's ok, it happens, but take time off and do not take it out on others.

Good User-Interface design

We can now write elegant code that is tested and can be easily read and maintained. However, it also needs to be easy for end users to use, which means having a nice user interface. For a library, that interface is its API.

A nice API needs to:

  • Be easy to understand: the user should not have to read pages of documentation to understand it
  • Be simple to use: it should be intuitive
  • Enable the desired behaviours: the functions should do what they say they do
  • Be easily extensible

In this section, I will take a more Machine-Learning oriented focus, but the principles should be widely applicable. I recommend that anyone interested in Python API design read the [API2013Builtink] paper by the [Scikit-Learn] authors, and look at Scikit-Learn in general, as it is incredibly well designed.

The Scikit-Learn project follows these guidelines (from [API2013Builtink]):

  • Consistency: consistent interface with a limited set of objects
  • Inspection: parameters and their values should be easy to access and find
  • Non-proliferation of classes: this also relates to the consistency point and keeps the library manageable
  • Composition: make it easy to chain operations
  • Sensible defaults: unless the parameter needs to always be set, make it optional with a good default value
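To make these guidelines a little more tangible, here is a toy estimator written in the spirit of that interface. It is a sketch in plain Python, not Scikit-Learn code: MeanRegressor and its shrinkage parameter are hypothetical, and get_params is a simplified stand-in for the real method.

```python
class MeanRegressor:
    """Toy estimator that always predicts the mean of the training targets."""

    def __init__(self, shrinkage=0.0):
        # Sensible default: no shrinkage unless the user asks for it.
        self.shrinkage = shrinkage

    def get_params(self):
        # Inspection: hyper-parameters are plain public attributes.
        return {"shrinkage": self.shrinkage}

    def fit(self, X, y):
        # Consistency: every estimator learns from data through fit(X, y).
        mean = sum(y) / len(y)
        # Learned attributes get a trailing underscore, as in Scikit-Learn.
        self.mean_ = (1.0 - self.shrinkage) * mean
        return self  # Composition: returning self allows chaining.

    def predict(self, X):
        return [self.mean_ for _ in X]


X_train, y_train = [[0.0], [1.0], [2.0]], [1.0, 2.0, 3.0]
predictions = MeanRegressor().fit(X_train, y_train).predict([[5.0], [6.0]])
```

A user who has seen one estimator with fit and predict already knows how to use the next one, which is most of what consistency buys you.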

Contributing on GitHub

An important part of any codebase is version control, which allows several people to work on the same codebase over time without (totally) losing their sanity.

Here, I give the basics of a Git-based workflow. If you've never heard of Git, I encourage you to read about it; a good start is the Git parable [GitParable].

I use a single branch, main, rather than the old dev/main paradigm.

When you want to add a feature, you first fork the repository (create your own copy of the repo) which you then clone on your machine.

First, clone your fork:

git clone my_local_fork

You can then add a remote (a link) to the original repository, conventionally called upstream:

cd original_repo
git remote add upstream original_repository

You then create a local branch for your changes (you could also work directly on the main branch):

git checkout -b new_feature

You make your changes, commit them and push them to your own fork.

First, make sure your fork is up to date with the main repository:

git fetch upstream
git merge upstream/main

Then add your changes:

git commit -a -m 'My awesome fix'
git push origin new_feature

GitHub then provides a convenient UI to create a Pull-Request, which will allow upstream maintainers (from the original repository) to review your code, potentially ask for changes and eventually merge the PR into the main repository.

Conclusion

I hope you found this little guide helpful. Whenever you write code, remember these principles, but also use your intuition and never stick rigidly to a given rule: take the specific context into account! When in doubt, use the standards used by your community: they are the ones you primarily need to communicate with.

Please let me know in email/messages/comments if you have any feedback, suggestions or if you spot any error!

References

[1] Writing Code for Science and Data (Keynote), Gael Varoquaux, 2017, https://www.youtube.com/watch?v=AaqsGRKdoQ0
[PrematureOptimizationFallacy] The Fallacy of Premature Optimization, Randall Hyde, 2009, https://ubiquity.acm.org/article.cfm?id=1513451
[GitParable] The Git parable, Tom Preston-Werner, 2009, https://tom.preston-werner.com/2009/05/19/the-git-parable.html
[PEP8] PEP8, Guido van Rossum, Barry Warsaw, Nick Coghlan, 2001-2013, https://peps.python.org/pep-0008/
[ZenOfPython] The Zen of Python, Tim Peters, 2004
[Eckel2000] Thinking in C++, Vol. 1: Introduction to Standard C++, Bruce Eckel, 2000
[API2013Builtink] API design for machine learning software: experiences from the scikit-learn project, Buitinck et al., 2013
[Scikit-Learn] Scikit-learn: Machine Learning in Python, Pedregosa et al., 2011, https://jmlr.org/papers/v12/pedregosa11a.html
[LearnPythonShaw] Learn Python the HARD WAY, A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code (Zed Shaw's Hard Way Series), Zed A. Shaw, 2013, https://shop.learncodethehardway.org/
