Reporting with Pandas and Seals and Pythons, Oh My

I spend perhaps too much time generating and reviewing numbers and charts and reports, but the right combination of tools can make this enjoyable (or at least less tedious). For me, that generally means using the pandas data analysis library and the python programming language to analyze data stored in MariaDB or MySQL.
There are a number of BI and analytics tools available, all of which have tradeoffs depending on your needs and your circumstances. You can get by with spreadsheets, write so much SQL you begin to mumble it in your sleep, spend a hefty chunk of money investing in a commercial solution, or burn too much time configuring solutions whose purpose seems to be generating sales for the vendor’s consulting services rather than helping you get things done.
I’ve done all of these, some rather more than others (I may be bitter), but my “go to” analytics toolset today revolves around the pandas library. It has the perfect balance of capability, flexibility, and usability that I need in a tool that sees daily use.
Let’s walk through a few examples. I note that I use Python 3.4 under OS X for these. While this will not be a tutorial (for Pandas, Python, or MariaDB), I hope that seeing a few samples of what it can do will encourage you to try it. First, let’s set up a working environment:
pyvenv-3.4 ~/py34 cd ~/py34 source bin/activate pip install matplotlib pandas ipython sqlalchemy mysql-connector-python --allow-external mysql-connector-python
This creates a clean, virtual python environment in the py34 directory and installs a few dependencies, and takes less than a minute for me. Pandas uses matplotlib for charting, and in recent versions prefers SQLAlchemy for querying SQL databases; IPython is my preferred interactive shell when working with python; and I will use MySQL Connector/Python to query an existing MariaDB database.
The database, and thus my examples here, contains data regarding a hypothetical, growing technical support organization, purely because I happen to have that kind of thing handy. I will not paste that into this post.
Let’s put a few more details in place and analyze some data. We will start IPython, import the libraries we need, set an option to make charts look a bit prettier, and configure the database connection:
ipython --pylab import sqlalchemy import pandas pandas.set_option('display.mpl_style', 'default') e = sqlalchemy.create_engine('mysql+mysqlconnector://dellis@localhost/support')
Now let’s just load everything from a single table, in this case the “issues” table:
df = pandas.io.sql.read_sql_table('issues', e)
This creates a pandas DataFrame (somewhat akin to a spreadsheet) containing the table’s data. You can use the read_sql() method for populating a DataFrame with more realistic/complex queries. The columns correspond to basic information about support issues:
In [6]: df.columns Out[6]: Index(['issue', 'customer', 'product', 'severity', 'status', 'created', 'emails'], dtype='object')
Let’s answer a few very high level questions:
- How many total support issues have been opened?
- How many unique customers have opened issues?
In [7]: df.issue.count() Out[7]: 3511 In [8]: df.customer.nunique() Out[8]: 503
We can get a little more complex: which customers have opened the most issues? This is similar to SQL, in that we will group by the customer identity, count the size of each group, sort the result in descending order, and list the top 5:
x = df.groupby('customer').issue.count() x.sort(ascending=False) x[:5] # or: df.groupby('customer').issue.count().sort(inplace=False, ascending=False)[:5] Out[9]: customer Customer 111 20 Customer 37 19 Customer 45 19 Customer 3 18 Customer 19 18 Name: issue, dtype: int64
We could target more proactive use cases if we queried a live, production database. Ex: How many issues are open and waiting on a support engineer to take action? How many at each Severity level?
In [10]: df[df.status == 'Support'].issue.count() Out[10]: 6 In [11]: df[df.status == 'Support'].severity.value_counts() Out[11]: S4 3 S3 2 S1 1 dtype: int64
I could retrieve the issue numbers themselves by simply not executing aggregate functions. ex: df[df.status == 'Support'].issue
Answering questions in this fashion can be useful for a wide variety of tasks, but now let’s explore trends over time. In this case, I want to review my data based on when the issues were created.
There are several ways of doing this in pandas, and I use different methods depending on my needs. In this case, I will reindex the DataFrame I’ve been using, changing its index to the “created” column (it has been indexed via a simple “row identifier” so far). Then, I will create another index based on the year and month of the issue creation dates, and I will use these to examine new issue inflow:
df.set_index('created', drop=False, inplace=True) pi = df.index.to_period('M') In [14]: df['2013'].issue.count() Out[14]: 1256 In [15]: df['2014Q1'].issue.count() Out[15]: 476 In [16]: df.groupby(pi).issue.count()[-5:] Out[16]: created 2014-03 165 2014-04 156 2014-05 188 2014-06 167 2014-07 190 Freq: M, Name: issue, dtype: int64
Let’s review our inflow by quarters, instead:
In [17]: df.groupby(pi.asfreq('Q')).issue.count()[-5:] Out[17]: created 2013Q3 317 2013Q4 408 2014Q1 476 2014Q2 511 2014Q3 190 Freq: Q-DEC, Name: issue, dtype: int64
Perhaps we want to review our Q2 issues that have a larger than “normal” number of emails associated with them. Let’s arbitrarily define “larger than normal” as more than the mean plus the mean absolute deviation:
In [18]: df[ df.emails > (df.emails.mean() + df.emails.mad()) ]['2014Q2'].issue.count() Out[18]: 118
Where would we be without charts? Let’s see our monthly inflow in a graph, along with the rolling mean (or moving average) with a six month period:
In [19]: df.groupby(pi).issue.count().plot(legend=True, label="Inflow") In [20]: pandas.rolling_mean(df.groupby(pi).issue.count(), 6).plot(legend=True, label="Average")
We can review the evolution of our weekly inflow:
In [21]: pandas.rolling_mean(df.groupby(df.index.to_period('W')).issue.count(),6).plot()
Or, the number of unique customers reporting issues by quarter:
In [22]: df.groupby(pi.asfreq('Q')).customer.nunique().plot()
We can see at a glance our issue growth rate for different products:
In [23]: df.groupby(pi).product.value_counts().unstack().plot()
There are several other types of charts available. Here we have a bar chart:
In [24]: df.groupby(pi.asfreq('Q')).issue.count()[:-1].plot(kind='bar')
Pandas can load data from a variety of sources, whether a MariaDB database or CSV, Excel, HDF, JSON, and many others. It can also export that data in different formats, including writing back to a database.
If, for example, you are constructing an analytics web service with pandas on the back end, you can convert your results into various native python data types (lists, dicts, etc), or directly to JSON or other formats, to be consumed however your application needs it.
There are countless ways to use it, whether you are connecting directly to a production OLTP server for adhoc “queries” or automated alert reporting, or generating monthly reports against a denormalized data mart.
I could barely even scratch the surface in this post. If these couple dozen little examples piqued your interest, however, you can read much more about it at the project’s website: http://pandas.pydata.org/
Post a Comment
Log into your MariaDB ID account to post a comment.