10 Python libraries I consider essential for work

This article covers ten Python libraries that are my daily bread as a quant, data scientist, and algo-trader. By daily bread, I mean packages that I use often. I’ll explain their main benefits and the kinds of tasks I use them for.


Definitely Pandas. Pandas (Python Data Analysis Library) is the most useful package when working with structured datasets.

I use it daily for basically any task. Compare it with Microsoft Excel, which everyone knows and uses for working with small datasets and applying row/column operations: Excel, compared to Pandas, is like a Trabant compared to a Tesla. Almost anything you want to do with data you can do with Pandas and its DataFrames. The only limitation is your RAM capacity, because Pandas is not built for Big Data (real Big Data is in terabytes). To work with data of that size you need a distributed system like Hadoop, and in Python you can use the really fast PySpark or Nvidia GPU computing tools like RAPIDS. I have used Pandas every day since 2016, and it can still surprise me with useful functions I didn’t know about.
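
To make the Excel comparison concrete, here is a minimal sketch of typical row/column work in Pandas. The trade log, its column names, and its values are made up for illustration; they are not taken from any real account.

```python
import pandas as pd

# Hypothetical trade log; the column names and values are invented for this example.
trades = pd.DataFrame(
    {
        "symbol": ["AAPL", "MSFT", "AAPL", "MSFT", "AAPL"],
        "quantity": [10, 5, -4, 8, 6],
        "price": [150.0, 300.0, 155.0, 310.0, 148.0],
    }
)

# Column operation: signed notional value of every trade.
trades["notional"] = trades["quantity"] * trades["price"]

# Row filter: keep only the buys (positive quantity).
buys = trades[trades["quantity"] > 0]

# Pivot-table-style summary: net quantity and net notional per symbol.
summary = trades.groupby("symbol")[["quantity", "notional"]].sum()
print(summary)
```

Each step corresponds to something you would do by hand in a spreadsheet (a formula column, a filter, a pivot table), but here it is a single line that works the same way on five rows or five million.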


Basically all of my math and vector calculations are done with NumPy. Pandas is built on this package, so the two work perfectly together. NumPy is natural and super easy to use. For example, you can calculate the drawdowns of your equity with just one line of code:

x - numpy.maximum.accumulate(x) 
where x is the vector (a NumPy array or a Pandas Series) of the cumulative sum of returns.
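
The one-liner above can be expanded into a small, self-contained sketch. The return values are made-up numbers, and the variable names other than x are my own choice for illustration.

```python
import numpy as np

# Hypothetical per-period returns (invented numbers, not real market data).
returns = np.array([0.02, -0.01, 0.03, -0.04, -0.01, 0.05])

# Equity curve as the cumulative sum of returns, as described in the text.
x = np.cumsum(returns)

# Running peak of the equity curve so far.
running_max = np.maximum.accumulate(x)

# Drawdown: how far the equity sits below its running peak (always <= 0).
drawdown = x - running_max

# The deepest drawdown over the whole period.
max_drawdown = drawdown.min()
print(drawdown)
print(max_drawdown)
```

np.maximum.accumulate keeps the highest value seen so far at each position, so subtracting it from x gives zero at every new high and a negative number whenever the equity is below its previous peak.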


This section depends on which brokerage you are using or planning to use. I have used Interactive Brokers [IB] since 2015. They provide professional services, low costs, almost all markets and instruments, and APIs for many programming languages. ib_insync is a lightweight API built on the original TWS Python API created by IB. (TWS, in this case, is Trader Workstation.)


For better insights into data and understanding what is happening, we need many visualizations. 

For quick visualizations, I use Matplotlib, with which you can plot almost anything. I also use Seaborn a lot; it is built on Matplotlib and contains many plots for statistical analysis. When I want to present something in a nicer format, or in a more interactive way, I use Plotly.
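
As an illustration of a quick Matplotlib plot, here is a minimal sketch that draws an equity curve with its drawdown underneath. The returns are random numbers generated for the example, and the output file name equity_curve.png is an arbitrary choice.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical daily returns: random numbers, not real market data.
rng = np.random.default_rng(seed=0)
returns = rng.normal(loc=0.0005, scale=0.01, size=250)

equity = np.cumsum(returns)
drawdown = equity - np.maximum.accumulate(equity)

# Two stacked panels sharing the same x-axis.
fig, (ax_equity, ax_dd) = plt.subplots(2, 1, sharex=True, figsize=(8, 5))

ax_equity.plot(equity)
ax_equity.set_ylabel("Cumulative return")

ax_dd.fill_between(np.arange(len(drawdown)), drawdown, 0.0, alpha=0.4)
ax_dd.set_ylabel("Drawdown")
ax_dd.set_xlabel("Trading day")

fig.tight_layout()
fig.savefig("equity_curve.png")  # arbitrary file name chosen for this example
```

In an interactive session you would drop the matplotlib.use("Agg") line and call plt.show() instead of saving to a file.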


Frequently you have to do some statistical tests or use some special functions. The best libraries are SciPy & Statsmodels, which are also built on NumPy. 


I don’t use these packages as often as the previous ones, but I reach for them when I need to run statistical tests of any type or build some statistical models. These are not models that go to production (for that I use other packages), but models for deep analysis of model results. The tasks I use these packages for: Statsmodels for statistical tests, regression, time series models, and linear models; SciPy for special functions, optimization, Fourier transforms, and some basic signal processing.


For financial statistics and some newer methodologies in algo-trading, I use the mlfinlab library. It is based on what is, at the time of writing, one of the best books on modern algo-trading: Advances in Financial Machine Learning by Marcos Lopez de Prado. This library is for advanced users.


I believe that every data scientist who has ever built a machine learning model has heard of scikit-learn (sklearn). This package contains almost every classical machine learning model you can imagine. For any ML task, it is the first thing you reach for. You can process your data, prepare features, and do a lot of deep analysis on them. After that, you can apply models for regression, classification, and clustering, and analyze the results. Like Pandas or NumPy, this library feels almost limitless in its functionality.
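
The workflow described above (process the data, fit a model, analyze the result) can be sketched in a few lines. This is a minimal example on a synthetic dataset produced by scikit-learn itself; the choice of logistic regression and all parameter values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (generated, not real features).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Hold out a quarter of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Pipeline: scale the features, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Accuracy on data the model has not seen during training.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Because every estimator shares the same fit/predict/score interface, swapping LogisticRegression for another classifier usually means changing one line.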


For creating Artificial Neural Networks [ANN] I use PyTorch (on GPU or CPU). You can build some ANNs in scikit-learn, but only basic ones. When you really want to go into deep learning, you need TensorFlow or PyTorch (or Keras, a more user-friendly interface on top of TensorFlow). I personally prefer PyTorch, and I have created a lot of impressive ANNs for trading with it, for example convolutional autoencoders, recurrent neural networks, and so on. This topic is very advanced, and if you are a beginner, it does not make sense to go into it (for now).
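
To give a feel for what PyTorch code looks like, here is a minimal sketch of a tiny autoencoder trained on the CPU. It is a plain fully connected autoencoder, not one of the convolutional or recurrent models mentioned above, and the data is random noise, so the layer sizes and training settings are assumptions chosen only to keep the example short.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical data: 256 samples with 20 features each (random numbers).
data = torch.randn(256, 20)

# Encoder compresses 20 features into 4; decoder reconstructs the 20 features.
autoencoder = nn.Sequential(
    nn.Linear(20, 4),
    nn.ReLU(),
    nn.Linear(4, 20),
)

optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

losses = []
for epoch in range(200):
    optimizer.zero_grad()                 # reset gradients from the previous step
    reconstruction = autoencoder(data)    # forward pass
    loss = loss_fn(reconstruction, data)  # reconstruction error
    loss.backward()                       # backpropagation
    optimizer.step()                      # parameter update
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The same training loop structure (zero the gradients, forward pass, loss, backward pass, optimizer step) carries over to much larger models; mostly the network definition and the data loading change.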


Gradient boosting is a part of machine learning, but one implementation deserves its own section: XGBoost. I must say, of all ML models for regression and classification tasks, this is my favorite. XGBoost is gradient boosting on steroids and can also be used with a GPU (much faster training times). Each ML model is better for some specific tasks, but in general, most of the time, XGBoost gives much better results on structured data (not images, text, or sound). However, it is easy to overfit with this model. As with other models, never use a model if you don’t understand at least the idea behind it. A lot of ML results are kind of a black box for us, so if the ML model itself is a black box for you, it is better not to use it for algo-trading.


Because of the growth of alternative and big data, GPU computing needs its own section here. I will finish this list with Nvidia, which created a handy set of packages that work fully on GPUs: Nvidia RAPIDS.

It contains cuDF for DataFrame calculations fully on the GPU (something like Pandas on GPU) and cuML, a library for machine learning on the GPU (basically scikit-learn on GPU). Honestly, I don’t have much experience with these packages for now because our trading server does not support GPUs (yet).


There are many other packages, but in my experience, those in the list above are the most useful ones. Depending on the specific task, you might need other packages, such as:

beautifulsoup4 – for web scraping, 

empyrical – for calculating financial metrics, 

feather-format – for quick loading/saving bigger data files, or 

arch – for autoregressive conditional heteroscedasticity (ARCH) models.
