Plotting Data in Python: matplotlib vs plotly

Data visualization provides a powerful tool to explore, understand, and communicate the valuable insights and relationships that may be hidden within data. Whether it’s an initial exploratory analysis or a presentation to non-technical colleagues, proper visualization lies at the heart of data science. When choosing how to visualize your data, the best tool for the job depends on the type of data, the purpose of the visualization, and the aesthetics you hope to achieve. In this article, I will compare and demonstrate two common visualization tools used in Python: matplotlib and plotly.

 

Installing Python

If you want to follow along with this tutorial, you’ll need to have Python installed with the required packages. If you don’t have a recent version of Python, I recommend doing one of the following: download and install the pre-built “Data Plotting” runtime environment for Windows 10 or CentOS 7, or, if you’re on a different OS, build your own custom Python runtime with just the packages you’ll need for this project.

To build a custom runtime, create a free ActiveState Platform account and then:

  • Click the Get Started button and choose Python and the OS you’re working in. Choose the packages you’ll need for this tutorial, including matplotlib, plotly and pandas.
  • Once the runtime builds, you can either download it directly, or else download the State Tool CLI and use it to install your runtime.

And that’s it! You now have Python installed in a virtual environment.
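
Before continuing, it can help to confirm that the packages are actually importable from your environment. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` is my own invention for illustration and is not part of any library discussed here.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]


# Top-level packages used in this tutorial.
required = ["pandas", "matplotlib", "plotly"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

If the list is not empty, install the missing packages into the same environment before running the examples that follow.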

 

Plotting Data with Matplotlib

Matplotlib is quite possibly the simplest way to plot data in Python. It is similar to plotting in MATLAB, allowing users full control over fonts, line styles, colors, and axes properties. This allows for complete customization and fine control over the aesthetics of each plot, albeit with a lot of additional lines of code. There are many third-party packages that extend the functionality of matplotlib, such as Basemap and Cartopy, which are ideal for plotting geospatial and map-like data. Seaborn and Holoviews provide higher-level interfaces, which result in a more intuitive experience. Matplotlib is also integrated into the pandas package, which provides a quick and efficient tool for exploratory analysis.

I’ll be using pandas in addition to Basemap, which doesn’t come with the standard installation of matplotlib. You can install Basemap by following the instructions here.

To demonstrate the versatility of matplotlib, let’s import a few different datasets:

import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

# Column names for the wine dataset (the file has no header row)
wine_names = ['Class', 'Alcohol', 'MalicAcid', 'Ash', 'Alc.Ash', 'Magnesium', 'TotalPhenols',
              'Flavanoids', 'Nonflav.Phenols', 'Proanthocyanins', 'ColorIntensity', 'Hue', 'OD280/OD315',
              'Proline']
wine_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', names=wine_names)
# Shift the class labels from 1-3 to 0-2
wine_df.Class = wine_df.Class - 1
wine_df.Class = wine_df.Class.astype('object')

# Column names for the El Nino dataset (whitespace-separated, '.' marks missing values)
nino_names = ['bouy','day', 'latitude', 'longitude', 'zon.winds', 'mer.winds', 'humidity', 'air.temp', 's.s.temp']
nino_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/el_nino-mld/elnino.gz',
                      header=None, na_values='.', sep=r'\s+', names=nino_names)

# Keep only the rows where day equals 3, and drop rows with missing values
nino_df = nino_df.loc[nino_df['day'] == 3, ['bouy','latitude', 'longitude', 's.s.temp']].dropna()

I pulled two different datasets from the UCI Machine Learning Repository. The first is the wine dataset, which provides 178 clean observations of wine grown in the same region in Italy. Each observation consists of 13 features that are the result of a chemical analysis. The second is the El Nino dataset, which contains spatiotemporal data from a series of buoys in the Pacific Ocean taken during the El Nino cycle of 1982-1983.

When dealing with data for the first time, an exploratory analysis is typically the first step, and plotting is an extremely useful way to gain an initial understanding of the data. In this case, we can plot wines based on their alcohol content (the x axis) and degree of dilution (the OD280/OD315 value, shown along the y axis) to see how they fall into the three classes, labeled 0 to 2.

Using pandas, different types of plots can be generated in a single line of code:

ax = wine_df.plot(kind = 'scatter', x = 'Alcohol', y = 'OD280/OD315', c= 'Class', figsize=(12,8), colormap='jet')
[Figure: scatter plot]
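
To illustrate how the same pandas interface produces other plot types, here is a small sketch that draws a histogram and a box plot side by side. The values in `demo_df` are hypothetical stand-ins so the snippet runs without a download; in the tutorial you would use `wine_df` instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in values; replace demo_df with wine_df in practice.
demo_df = pd.DataFrame({
    "Alcohol": [14.2, 13.2, 12.4, 12.3, 13.7, 12.9],
    "Hue": [1.04, 1.05, 1.05, 1.25, 0.61, 0.58],
})

# Two axes side by side so each plot type gets its own panel.
fig, (hist_ax, box_ax) = plt.subplots(1, 2, figsize=(12, 4))

# Only the `kind` argument changes between plot types.
demo_df["Alcohol"].plot(kind="hist", bins=5, ax=hist_ax, title="Alcohol")
demo_df[["Alcohol", "Hue"]].plot(kind="box", ax=box_ax, title="Alcohol and Hue")

fig.tight_layout()
```

Call plt.show() afterwards to display the figure when running this as a script.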

For further customization, a similar plot can be made using just matplotlib:

fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(x = wine_df['Alcohol'], y = wine_df['OD280/OD315'], c = wine_df['Class'])

ax.set_xlabel('Alcohol', fontsize=15)
ax.set_ylabel('OD280/OD315', fontsize=15)
ax.set_title('Wine Dataset')

ax.grid(True)
fig.tight_layout()

plt.show()
[Figure: Matplotlib customized plot]

Notice that each additional plot feature typically requires an additional line of code. This does not necessarily add complexity, because each line is simple enough for a Python novice to read and understand. Additional plotting features, such as circles, lines, text, or even indicator arrows can be added using matplotlib with little difficulty. Examples demonstrating this can be found here.
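
As a rough sketch of what such additions look like, the snippet below adds a reference line, a circle, and a text label with an arrow to a scatter plot. The data points and the threshold value are hypothetical and chosen only for illustration.

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Circle

# Hypothetical points standing in for the Alcohol and OD280/OD315 columns.
alcohol = [12.4, 12.9, 13.2, 13.7, 14.2]
od_ratio = [2.8, 1.6, 3.4, 1.7, 3.9]

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(alcohol, od_ratio)

# Horizontal reference line at an arbitrary, illustrative threshold.
ax.axhline(y=2.5, color="gray", linestyle="--")

# Circle in data coordinates around the last point; it appears as an
# ellipse unless the axes use an equal aspect ratio.
ax.add_patch(Circle((14.2, 3.9), radius=0.15, fill=False, color="red"))

# Text label with an arrow pointing at the same point.
ax.annotate("highest alcohol", xy=(14.2, 3.9), xytext=(13.0, 3.8),
            arrowprops=dict(arrowstyle="->"))
```

Each element is one call on the axes object, which is consistent with the one-line-per-feature pattern described above.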

The previous plots convey meaningful information: they show that two features alone can separate the observations into three different clusters. However, the visualizations are relatively simple. To illustrate a slightly more complex example, we can use Basemap to plot temperature data from each buoy in the El Nino dataset.

plt.figure(figsize=(14, 14))
m = Basemap(projection='ortho', resolution=None, lat_0=0, lon_0=-150)
m.bluemarble(scale=0.5)
m.scatter(nino_df['longitude'].values, nino_df['latitude'].values, latlon=True,
          c=nino_df['s.s.temp'].values, s = 100,
          cmap='coolwarm', alpha=0.5, marker = "o")
plt.colorbar(label='Temperature (C)')
plt.clim(25, 30)
plt.show()
[Figure: matplotlib buoy data]

Clearly, matplotlib and the extensive third-party packages built upon it provide powerful tools for plotting data of various types. Despite being syntactically tedious, it is an excellent way to produce quality static visualizations worthy of publications or professional presentations.
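
For publications or slides, a static figure usually needs to be written to a file. The sketch below saves one figure as both a high-resolution PNG and a vector PDF; the plotted values, file names, and the use of a temporary directory are my own assumptions for illustration.

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])
ax.set_xlabel("x")
ax.set_ylabel("x squared")

# Hypothetical output location; use any directory you like.
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "figure.png")
pdf_path = os.path.join(out_dir, "figure.pdf")

# dpi controls raster resolution; bbox_inches="tight" trims extra whitespace.
fig.savefig(png_path, dpi=300, bbox_inches="tight")
# The file extension selects the output format, here a vector PDF.
fig.savefig(pdf_path, bbox_inches="tight")
```

The same savefig call works for any of the matplotlib figures created earlier in this section.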

 

Plotting Data with Plotly

Plotly is another great Python visualization tool that’s capable of handling geographical, scientific, statistical, and financial data. The company behind Plotly, also known as Plotly, makes an entire suite of visualization tools for multiple programming languages, all of which create interactive web-based visualizations and even web applications. Plotly has several advantages over matplotlib. One of the main ones is that only a few lines of code are necessary to create aesthetically pleasing, interactive plots. The interactivity also offers a number of advantages over static matplotlib plots:

  • Saves time when initially exploring your dataset
  • Makes it easy to modify and export your plot
  • Offers a more ornate visualization, which is well-suited for conveying the important insights hidden within your dataset.

Just like matplotlib, Plotly also has a few tools that allow it to easily integrate with pandas to make plotting even more efficient.

Previous versions of plotly offered an offline mode and an online mode. By using plotly online, your data visualizations are automatically uploaded so that you can access them through the online interface, regardless of how you create them. This feature is still available through Chart Studio, but I will be using the offline version.

Plotly Express is a great option for exploring pandas dataframes. It is a high-level wrapper that has been included with plotly since version 4.0. To create a scatter plot similar to the one we created with matplotlib, run:

import plotly.express as px

fig = px.scatter(wine_df, x="Alcohol", y='OD280/OD315', color="Class", marginal_y="box",
           marginal_x="box")
fig.show()

Feel free to play around with the interactive features of the plot. Even in this short example, a single function call produces an interactive scatter plot with marginal box plots, which would take considerably more code in matplotlib.

It can also tackle our temperature data from the El Nino dataset without breaking a sweat. Try this:

fig = px.scatter_geo(nino_df, lat='latitude', lon='longitude',
                     color='s.s.temp', hover_name='bouy',
                     color_continuous_scale='bluered', projection='orthographic')
fig.show()
[Figure: Plotly buoy data]

Notice that when you move your cursor over each buoy, the buoy number, temperature value, and location information are displayed. This is all done by default with plotly, and this is barely scratching the surface of what it can do.

 

Conclusions

To summarize, matplotlib is a quick and straightforward tool for creating visualizations within Python. The verbosity required to make anything more than a basic plot makes it more suitable for an initial exploratory analysis or a minimalist design. Matplotlib is also a great place for new Python users to start their data visualization education, because each plot element is declared explicitly in a logical manner.

Plotly, on the other hand, is a more sophisticated data visualization tool that is better suited for creating elaborate plots more efficiently. The interactivity and elegant aesthetics that come with plotly are benefits that cannot be ignored.

Dante Sblendorio

Guest blogger: Dante is a physicist currently pursuing a PhD in Physics at École polytechnique fédérale de Lausanne. He has a Masters in Data Science, and continues to experiment with and find novel applications for machine learning algorithms. He lives in Lausanne, Switzerland.