Chris Moffitt: Automated Report Generation with Papermill: Part 2

Introduction

This guest post is the second article about using python tools to automate a report generation pipeline. If you have not read the first post, you should probably review it before continuing with this article.

Before we get started, I would like to re-introduce the author:

Hey there! My name is Duarte O. Carmo and I’m a digital consultant at Jabra. I absolutely love using python for automation, web development, data science and analysis, or pretty much any other subject. Learn more about me by visiting my website, and feel free to get in touch 🙂

Part 2 – Designing a solution

Welcome to part 2 of this two-part series post about automating report generation using python, jupyter, papermill, and a couple of other tools.

In the first part, we covered the four main processes that make up the automation pipeline. In this second and final part, we will bring everything together and build our report automation system.

Note: This code was written in python 3.7. You might have to adapt the code for older versions of python.

All of the code for this article is available on GitHub.

A workflow to automatically generate reports in a shared cloud folder

Let’s imagine you want to generate automatic reports for every similar excel file of sales reports, and you also want to share them with your colleagues. Your colleagues are interested in the reports, but not in learning how to program in python. How would you proceed?

There are a lot of options, and hardly any incorrect ones, but one I found particularly interesting was using what a lot of people and companies already use: a cloud folder (Google Drive, OneDrive, Dropbox).

Cloud folders (particularly shared ones) are very popular in companies. So a good idea would be to create a shared folder where everyone can upload sales excel reports, and automatically generate HTML reports from them, so everyone can read them!

Here is the basic architecture of the solution:

Process Flow

Let’s describe each one of the steps:

  • A user uploads a new excel sales report to a shared cloud folder.
  • We sync the cloud folder with a local folder and detect a new excel sales report.
  • We use papermill to generate a new notebook file from that new excel sales report.
  • We use nbconvert to generate an HTML file from that new notebook file.
  • We upload the HTML file to the cloud folder, so the user can read it.

Let’s start building this step by step:

1. Sync a cloud folder with a local folder and detect new files

To sync cloud directories with local directories, we will use a tool called Rclone. Of course, we will integrate it with python.

Start by installing rclone in the same machine as your local folder (your personal computer or a virtual private server for example).

To do so, on a Mac or Linux machine, you should:

 $ curl https://rclone.org/install.sh | sudo bash

On Windows, download the executable from the Rclone downloads page.

Once rclone is installed, we must configure it. Depending on your cloud provider (Dropbox, Google Drive, OneDrive), the instructions will vary, so make sure to follow the configuration instructions.
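Whatever the provider, the configuration itself is driven by an interactive prompt that you start from the command line:

 $ rclone config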

Once configured, let us do a first sync from the command line:

 $ rclone sync remote:REMOTE_FOLDER_NAME LOCAL_FOLDER_NAME

This will sync your local folder with your remote folder.

We can also launch this command from a python script using the standard library's subprocess module, which allows you to run command-line programs from python:

import subprocess

# define our variables
REMOTE_FOLDER_NAME = "shared folder"
LOCAL_FOLDER = "local folder"

# run the rclone sync command from python
subprocess.run(
    ["rclone", "sync", f"remote:{REMOTE_FOLDER_NAME}", LOCAL_FOLDER]
)

Now that we know how to sync a local and a cloud directory, how do we detect if a user has uploaded a new file to our cloud directory? Well, an option would be to navigate to our local directory and use the ls command and see what pops out.

Rclone also allows us to list files in our cloud directory. Combining the two, we can create a python function that detects any new files that have been uploaded to the cloud folder:

def get_new_files(remote_folder, local_folder):
    """
    A function that returns files that were uploaded to the cloud folder and
    do not exist in our local folder.
    """
    # list the files in our cloud folder
    list_cloud = subprocess.run(
        ["rclone", "lsf", f"remote:{remote_folder}"],
        capture_output=True,
        text=True,
    )

    # transform the command output into a list
    cloud_directories = list_cloud.stdout.split("\n")[0:-1]

    print(f"In the cloud we have: \n{cloud_directories}")

    # list the files in our local folder
    list_local = subprocess.run(
        ["ls", local_folder], capture_output=True, text=True
    )

    # transform the command output into a list
    local_directories = list_local.stdout.split("\n")[0:-1]

    print(f"In the local copy we have: \n{local_directories}")

    # create a list with the differences between the two lists above
    new_files = list(set(cloud_directories) - set(local_directories))

    return new_files

A couple of notes about the script above:

  • The capture_output flag in the subprocess.run function allows us to capture the output of the command.
  • The text flag allows us to treat everything as text, avoiding problems with spaces for example.
  • After running subprocess.run, we apply the .split function to parse the output, which is a string of different files separated by a line break (\n). This split function allows us to put all the elements into a nicely formatted python list.
  • The new_files list will contain only files that are in the cloud directory, but not in the local directory, or in other words: the excel file that users have uploaded to the cloud drive. In case there are no differences, the function will return an empty list.
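As a quick sanity check, you can call the function directly; the folder names below are just placeholders for your own shared and local folders:

# placeholder folder names - replace these with your own
new_files = get_new_files(
    remote_folder="shared folder", local_folder="local folder"
)

if new_files:
    print(f"Files to process: {new_files}")
else:
    print("Nothing new to process.")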

2. Using Papermill and Nbconvert to generate new reports

Once we have a reliable way of detecting new files uploaded to the cloud, we need to process each new file and generate an HTML report from it.

We will use two of the tools mentioned in the first article: papermill, and nbconvert.

We start by creating a function that produces a new notebook file from an excel report, using a notebook template (for example template.ipynb) as previously described in part 1.

import papermill as pm

def run_notebook(excel_report, notebook_template):
    # take only the name of the file, and ignore the .xlsx ending
    no_extension_name = excel_report.split(".")[0]
    # run with papermill
    pm.execute_notebook(
        notebook_template,
        f"{no_extension_name}.ipynb",
        parameters=dict(filename=excel_report),
    )
    return no_extension_name

Then, we must convert the notebook to an HTML file. To do this, we create another function that runs the nbconvert command from python.

import subprocess

def generate_html_report(notebook_file):
    generate = subprocess.run(
        [
            "jupyter",
            "nbconvert",
            notebook_file,
            "--to=html",
        ]
    )
    print("HTML Report was generated")
    return True

This function runs the nbconvert command described in the first article, but from a python script.
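For reference, it is equivalent to running nbconvert directly on the command line (the notebook name here is just an example):

 $ jupyter nbconvert sales_report.ipynb --to=html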

3. Uploading an HTML file back to the cloud folder

There is another Rclone command that is pretty handy. If you want to push a file from a local folder to a cloud folder, you can use the following from the command line:

 $ rclone copy FILENAME remote:REMOTE_FOLDER_NAME

We could do it from the command line, but why not do it from python? With the subprocess library, it’s pretty straightforward:

import subprocess

def push_to_cloud(remote_folder, html_report):
    push = subprocess.run(
        ["rclone", "copy", html_report, f"remote:{remote_folder}"]
    )
    print("Report Published!!!")

4. Bringing it all together

Finally, after giving you a rundown of all of the major tools and processes, here is the full script that scans the cloud folder for new excel sales reports, then generates and uploads an HTML analysis of them.

The script, cloud_reporter.py, follows:

import subprocess
import sys
import papermill as papermill


REMOTE_FOLDER = "your cloud folder name"
LOCAL_FOLDER = "your local folder name"
TEMPLATE_NOTEBOOK = "template_notebook.ipynb"


def get_new_files(remote_folder, local_folder):
    """
    A function that returns files that were uploaded to the cloud folder and
    do not exist in our local folder.
    """
    # list the files in our cloud folder
    list_cloud = subprocess.run(
        ["rclone", "lsf", f"remote:{remote_folder}"],
        capture_output=True,
        text=True,
    )

    # transform the command output into a list
    cloud_directories = list_cloud.stdout.split("\n")[0:-1]

    print(f"In the cloud we have: \n{cloud_directories}")

    # list the files in our local folder
    list_local = subprocess.run(
        ["ls", local_folder], capture_output=True, text=True
    )

    # transform the command output into a list
    local_directories = list_local.stdout.split("\n")[0:-1]

    print(f"In the local copy we have: \n{local_directories}")

    # create a list with the differences between the two lists above
    new_files = list(set(cloud_directories) - set(local_directories))

    return new_files


def sync_directories(remote_folder, local_folder):
    """
    A function that syncs a remote folder with a local folder
    with rclone.
    """
    sync = subprocess.run(
        ["rclone", "sync", f"remote:{remote_folder}", local_folder]
    )

    print("Syncing local directory with cloud....")
    return sync.returncode


def run_notebook(excel_report, template_notebook):
    """
    A function that runs a notebook against an excel report
    via papermill.
    """
    no_extension_name = excel_report.split(".")[0]
    papermill.execute_notebook(
        template_notebook,
        f"{no_extension_name}.ipynb",
        parameters=dict(filename=excel_report),
    )
    return no_extension_name


def generate_html_report(notebook_file):
    """
    A function that converts a notebook into an html
    file.
    """
    generate = subprocess.run(
        ["jupyter", "nbconvert", notebook_file, "--to=html"]
    )
    print("HTML Report was generated")
    return True


def push_to_cloud(remote_folder, filename):
    """
    A function that pushes a specific file
    to a remote cloud folder.
    """
    push = subprocess.run(
        ["rclone", "copy", filename, f"remote:{remote_folder}"]
    )
    print("Report Published!!!")


def main():
    print("Starting updater..")

    # detect if there are new files in the remote folder
    new_files = get_new_files(
        remote_folder=REMOTE_FOLDER, local_folder=LOCAL_FOLDER
    )

    # if there are none, exit
    if not new_files:
        print("Everything is synced.\nNo new files.")
        sys.exit()
    # else, continue
    else:
        print("There are files missing.")
        print(new_files)

    # sync directories to get new excel report
    sync_directories(remote_folder=REMOTE_FOLDER, local_folder=LOCAL_FOLDER)

    # generate new notebook and extract the name
    clean_name = run_notebook(new_files[0], TEMPLATE_NOTEBOOK)

    # the new notebook generated will have the following name
    notebook_name = f"{clean_name}.ipynb"

    # generate the html report from the notebook
    generate_html_report(notebook_name)

    # the html report name will be the following
    html_report_name = f"{clean_name}.html"

    # push the new html report to the cloud
    push_to_cloud(remote_folder=REMOTE_FOLDER, filename=html_report_name)

    # make sure everything is synced again
    sync_directories(remote_folder=REMOTE_FOLDER, local_folder=LOCAL_FOLDER)

    print("Updater finished.")

    return True


if __name__ == "__main__":
    main()

5. Running the updater regularly

Once you get the script running, one option is to copy it to a virtual private server (you can get one at digitalocean.com, for example) and have it run regularly via something like cron.

Security Warning
If you are going to sync sensitive company information to a virtual private server, please make sure that you have permission, and that you take necessary security measures to protect the server.

You should read more about cron before messing with it. It allows you to run scripts every X amount of time. A simple approach to our problem would be:

  1. Make sure the script is running successfully on your server by:

    • Installing and configuring rclone.
    • Installing jupyter and nbconvert.
    • Creating a local folder to serve as a remote copy.
    • Modifying the script above with your variables (base notebook, remote folder name, and local folder name).
    • Making sure the script runs.
  2. Editing your crontab by:

      $ crontab -e
  3. Adding a crontab job that navigates to a certain directory and runs the cloud_reporter.py file, every X minutes using python.

    Here is an example of it running every 4 minutes:

     */4 * * * * cd /path/to/your/folder && python cloud_reporter.py
  4. Upload a new excel file to your cloud folder, wait a minimum of 4 minutes, and a new HTML report should be generated and uploaded automatically!

  5. Give your colleagues access to the shared cloud folder (Dropbox, Google Drive) and let them upload any excel report.

Final thoughts

And just like that, we reach the end of this article series!

Hopefully, these tools and scripts will inspire you to go out and automate report generation or any other process around you, making it as simple as possible for your colleagues to generate reports.

I would like to thank Chris for allowing me to collaborate with him on these posts. I really had a blast building these tools and writing these “guides”. It was a team effort that started with a simple reach-out on Twitter:

Twitter reach out

All of the code for this article series is in this GitHub repo.


Podcast.__init__: Open Source Automated Machine Learning With MindsDB


Summary

Machine learning is growing in popularity and capability, but for a majority of people it is still a black box that we don’t fully understand. The team at MindsDB is working to change this state of affairs by creating an open source tool that is easy to use without a background in data science. By simplifying the training and use of neural networks, and making their logic explainable, they hope to bring AI capabilities to more people and organizations. In this interview George Hosu and Jorge Torres explain how MindsDB is built, how to use it for your own purposes, and how they view the current landscape of AI technologies. This is a great episode for anyone who is interested in experimenting with machine learning and artificial intelligence. Give it a listen and then try MindsDB for yourself.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Podcast.init listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing George Hosu and Jorge Torres about MindsDB, a framework for streamlining the use of neural networks

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what MindsDB is and the problem that it is trying to solve?
    • What was the motivation for creating the project?
  • Who is the target audience for MindsDB?
  • Before we go deep into MindsDB can you explain what a neural network is for anyone who isn’t familiar with the term?
  • For someone who is using MindsDB can you talk through their workflow?
    • What are the types of data that are supported for building predictions using MindsDB?
    • How much cleaning and preparation of the data is necessary before using it to generate a model?
    • What are the lower and upper bounds for volume and variety of data that can be used to build an effective model in MindsDB?
  • One of the interesting and useful features of MindsDB is the built in support for explaining the decisions reached by a model. How do you approach that challenge and what are the most difficult aspects?
  • Once a model is generated, what is the output format and can it be used separately from MindsDB for embedding the prediction capabilities into other scripts or services?
  • How is MindsDB implemented and how has the design changed since you first began working on it?
    • What are some of the assumptions that you made going into this project which have had to be modified or updated as it gained users and features?
  • What are the limitations of MindsDB and what are the cases where it is necessary to pass a task on to a data scientist?
  • In your experience, what are the common barriers for individuals and organizations adopting machine learning as a tool for addressing their needs?
  • What have been the most challenging, complex, or unexpected aspects of designing and building MindsDB?
  • What do you have planned for the future of MindsDB?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA



Reinout van Rees: Write drunk, test automated: documentation quality assurance – Sven Strack

This is my summary of the Write the Docs meetup in Amsterdam at the Adyen office, November 2018.

Sven’s experience is mostly in open source projects (mainly Plone, a python CMS). He’s also involved in https://testthedocs.org and https://rakpart.testthedocs.org, a collection of tools for documentation tests. He has some disclaimers beforehand:

  • There is no perfect setup.
  • Automated checks can only help up to a certain level.
  • Getting a CI (continuous integration) setup working is often tricky.

Before you start testing your documentation, you’ll need some insight. Start with getting an overview of your documentation. Who is committing to it? Which parts are there? Which parts of the documentation are updated most often? Are the committers native speakers yes/no? Which part of the documentation has the most bug reports? So: gather statistics.

Also: try to figure out who reads your documentation. Where do they come from? What are the search terms they use to find your documentation in google? You can use these statistics to focus your development effort.

Important: planning. If your documentation is in English, plan beforehand whether you want it to be in UK or US English. Define style guides. If you have automatic checks, define standards beforehand: do you want a check to fail on line length yes/no? Spelling errors? Etc. How long is the test allowed to take?

A tip: start your checks small, build them up step by step. If possible, start from the first commit. And try to be both strict and friendly:

  • Your checks should be strict.
  • Your error messages should be clear and friendly.

In companies it might be different, but in open source projects, you have to make sure developers are your friends. Adjust the documentation to their workflow. Use a Makefile, for instance. And provide good templates (with cookiecutter, for instance) and good examples. And especially for programmers, it is necessary to have short and informative error messages. Don’t make your output too chatty.

Standards for line lengths, paragraphs that are not too long: checks like that help a lot to keep your documentation readable and editable. Also a good idea: checks to weed out common English shortcuts ("we’re" instead of "we are") that make the text more difficult to read for non-native speakers.
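As an illustration of what such a check might look like (this is my own minimal sketch, not one of the tools Sven showed; the docs/ directory, the .rst extension, and the 79-character limit are assumptions), a tiny python script could flag both problems:

import pathlib
import sys

MAX_LINE_LENGTH = 79  # assumed limit - use whatever your style guide defines
CONTRACTIONS = ["we're", "don't", "can't", "it's"]  # example shortcuts to flag

errors = 0
for doc in pathlib.Path("docs").glob("**/*.rst"):  # assumed docs/ layout
    for number, line in enumerate(doc.read_text().splitlines(), start=1):
        if len(line) > MAX_LINE_LENGTH:
            print(f"{doc}:{number} line is longer than {MAX_LINE_LENGTH} characters")
            errors += 1
        for contraction in CONTRACTIONS:
            if contraction in line.lower():
                print(f"{doc}:{number} avoid '{contraction}', write it out in full")
                errors += 1

# fail the run when anything was flagged, so CI treats this like a failing test
sys.exit(1 if errors else 0)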

Programmers are used to keeping all of the code’s tests passing all the time. Merging code with failing tests is a big no-no. The same should be valid for the automatic documentation checks. They’re just as important.

Some further tips:

  • Protect your branches, for instance, to prevent broken builds from being merged.
  • Validate your test scripts, too.
  • Keep your test scripts simple and adjustable. Make it clear how they work.
  • Run your checks in a specific order, as some checks might depend on the correctness as checked by earlier tests.
  • You can also check the html output instead of "only" checking the source files. You can find broken external links this way, for instance.
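As a sketch of that last tip (again my own illustration, assuming Sphinx-style output in a build/html directory), a bare-bones check of external links in the generated html could look like this:

import pathlib
import re
import urllib.request

# assumed output directory for the generated html
html_files = pathlib.Path("build/html").glob("**/*.html")

# very rough pattern for external links in the rendered pages
link_pattern = re.compile(r'href="(https?://[^"]+)"')

for html_file in html_files:
    for url in sorted(set(link_pattern.findall(html_file.read_text()))):
        try:
            # open the url and look at the status code
            with urllib.request.urlopen(url, timeout=10) as response:
                if response.status >= 400:
                    print(f"{html_file}: {url} returned {response.status}")
        except Exception as error:
            print(f"{html_file}: {url} failed ({error})")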

Something to consider: look at containerization, Docker and the like. This can run on both your local OS and on the continuous integration server. Everyone can use the same codebase and the same test setup. Less duplication of effort and more consistent results. Most importantly, it is much easier to set up. Elaborate setups are now possible without scaring away everybody!

For screenshots, you could look at puppeteer to generate them automatically.

A comment from the audience: if you have automatic screenshots, there are tools to compare them and warn you if the images changed a lot. This could be useful for detecting unexpected errors in css or html.

Another comment from the audience: you can make a standard like US-or-UK-English more acceptable by presenting it as a choice. It is not that one of them is bad: we "just" had to pick one of them. "Your own preference is not wrong, it is just not the standard we randomly picked". 🙂


Davide Moro: High quality automated docker hub push using Github, TravisCI and pyup for Python tool distributions

Let’s say you want to distribute a Python tool with docker, using known good dependency versions, ready to be used by end users… In this article you will see how to continuously keep a Docker Hub container up to date with minimal management effort (because I’m a lazy guy) using github, TravisCI and pyup.

The goal was to reduce manual update work as much as possible, check that everything works fine before pushing, minimize build times, and keep the docker container secure and updated, with high confidence in the final quality.

As an example let’s see what happens under the hood behind every pytest-play Docker Hub update on the official container https://cloud.docker.com/u/davidemoro/repository/docker/davidemoro/pytest-play (by the way if you are a pytest-play user: did you know that you can use Docker for running pytest-play and that there is a docker container ready to be used on Docker Hub? See a complete and working example here https://davidemoro.blogspot.com/2019/02/api-rest-testing-pytest-play-yaml-chuck-norris.html)

Repositories

The docker build/publish stuff lives on another repository, so https://github.com/davidemoro/pytest-play-docker is the repository that implements the Docker releasing workflow for https://github.com/pytest-dev/pytest-play on Docker Hub (https://hub.docker.com/r/davidemoro/pytest-play).

Workflow

This is the highly automated workflow, at the time of writing, for publishing pytest-play on Docker Hub:

All test executions run against the docker build, so there is a guarantee that what is pushed to Docker Hub works fine (it doesn’t check only that the build was successful, it also runs integration tests against the docker build). That means no version incompatibilities, no integration issues between the integrated third-party pytest-play plugins, and no issues due to operating system integration (e.g., I recently experienced an issue on alpine linux with a pip install psycopg2-binary that apparently worked fine, but if you try to import psycopg2 inside your code you get an unexpected import error, due to a recent issue reported here https://github.com/psycopg/psycopg2/issues/684).

So now every time you run a command like the following one (see a complete and working example here https://davidemoro.blogspot.com/2019/02/api-rest-testing-pytest-play-yaml-chuck-norris.html):

docker run --rm -v $(pwd):/src davidemoro/pytest-play

you know the workflow behind every automated docker push for pytest-play.

Acknowledgements

Many thanks to Andrea Ratto for the 10-minute travis build speedup due to the Docker cache; going from ~11 minutes to ~1 minute is a huge improvement indeed! It was made possible by the docker pull davidemoro/pytest-play command, building with the --cache-from davidemoro/pytest-play option, and running the longest steps as separate, cacheable steps (e.g., the very long cassandra-driver compilation moved to requirements_cassandra.txt, which will be executed only if necessary).

Relevant technical details about pytest-play-docker follow (some minor optimizations are still possible in terms of final image size).

pytest-play-docker/.travis.yml

sudo: required
services:
- docker
- …

env:
  global:
  - IMAGE_NAME=davidemoro/pytest-play
  - secure: …
before_script:
- …

script:
- travis_wait docker pull python:3.7
- travis_wait docker pull "$IMAGE_NAME:latest"
- travis_wait 25 docker build --cache-from "$IMAGE_NAME:latest" -t "$IMAGE_NAME" .
- docker run -i --rm -v $(pwd)/tests:/src --network host -v /var/run/mysqld/mysqld.sock:/var/run/mysqld/mysqld.sock $IMAGE_NAME --splinter-webdriver=remote
  --splinter-remote-url=$REMOTE_URL
deploy:
  provider: script
  script: bash docker_push
  on:
    branch: master

pytest-play-docker/docker_push

#!/bin/bash
echo "$DOCKER_PASSWORD" | docker login -u "$DOCKER_USERNAME" --password-stdin
docker tag "$IMAGE_NAME" "$IMAGE_NAME:$TRAVIS_COMMIT"
docker tag "$IMAGE_NAME" "$IMAGE_NAME:latest"
docker push "$IMAGE_NAME":"$TRAVIS_COMMIT"
docker push "$IMAGE_NAME":latest

Feedback

Any feedback will be always appreciated.

Do you like the Docker Hub push process for pytest-play? Let me know by becoming a pytest-play stargazer!