Kushal Das: PyPI and gpg signed packages

Last night on the #pypa IRC channel, someone asked about uploading detached gpg signatures for packages. According to the initial report, twine did not upload the signature, even when passing -s as an argument. I tried the same on test.pypi.org and at first reached the same conclusion, as the package page was not showing anything. As I started reading the source of twine to figure out what was going on, I found that it does upload the signature, as part of the package metadata. The JSON API actually showed that the release is signed. Later, others in the channel explained that we just have to add .asc to the end of the package file's URL to download the detached signature.

During the conversation, it was mentioned that only about 4% of all packages are actually gpg signed. Also, gpg is written in C and is GPL licensed, so it cannot be bundled inside CPython (the way pip is bundled with CPython). The idea of a future PyPI where all packages must be signed (how exactly is still to be discussed) also came up in the channel. We also learned that we can delete any file or release from PyPI, but we cannot re-upload those files again; one has to make a new release. This matters for signatures too: if you want to upload a signature, you have to do it at the time you upload the package.
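As a rough sketch, and assuming a hypothetical package file name, signing at upload time with twine looks something like this:

$   twine upload -s dist/mypackage-1.0.tar.gz

The -s flag asks twine to create a detached gpg signature (mypackage-1.0.tar.gz.asc) and upload it along with the file; as noted above, the signature can later be fetched by appending .asc to the package file's download URL.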

The idea of signing packages was also written about a few years ago.


Semaphore Community: Generating Fake Data for Python Unit Tests with Faker

This article is brought to you with ❤ by Semaphore.

Introduction

When writing unit tests, you might come across a situation where you need to generate test data or use some dummy data in your tests. If you already have some data somewhere in a database, one solution you could employ is to generate a dump of that data and use that in your tests (i.e. fixtures).

However, you could also use a package like faker to generate fake data for you very easily when you need to. This tutorial will help you learn how to do so in your unit tests.

Prerequisites

For this tutorial, it is expected that you have Python 3.6 and Faker 0.7.11 installed.

Basic Examples in the Command Line

Let’s see how this works first by trying out a few things in the shell.

Before we start, go ahead and create a virtual environment and activate it:

$   python3 -m venv faker 
$   source faker/bin/activate 

Once in the environment, install faker.

$   pip install faker 

After that, enter the Python REPL by typing the command python in your terminal.

Once in the Python REPL, start by importing Faker from faker:

Python 3.6.0 (default, Jan  4 2017, 15:38:35)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from faker import Faker
>>>

Then, we are going to use the Faker class to create a myFactory object whose methods we will use to generate whatever fake data we need.

>>> myFactory = Faker() 

Let’s generate some fake text:

>>> myFactory.text()
'Commodi quidem ipsam occaecati. Porro veritatis numquam nisi corrupti.'

As you can see some random text was generated. Yours will probably look very different.

Let us try a few more examples:

>>> myFactory.words()
['libero', 'commodi', 'deleniti']
>>> myFactory.name()
'Joshua Wheeler'
>>> myFactory.month()
'04'
>>> myFactory.sentence()
'Iure expedita eaque at odit soluta repudiandae nam.'
>>> myFactory.state()
'Michigan'
>>> myFactory.random_number()
2950548

You can see how simple the Faker library is to use. Once you have created a factory object, it is very easy to call the provider methods defined on it. Keep in mind that because the output is random, what you generate on your end will probably differ from what you see in our examples.

If you would like to try out some more methods, you can see a list of the methods you can call on your myFactory object using dir.

>>> dir(myFactory) 

You can also find more things to play with in the official docs.

Integrating Faker with an Actual Unit Test

Let’s now use what we have learnt in an actual test.

If you are still in the Python REPL, exit by hitting CTRL+D. Do not deactivate the virtualenv we created and installed Faker into in the previous section, since we will keep using it going forward.

Now, create two files, example.py and test.py, in a folder of your choice. Our code will live in the example file and our tests in the test file.

Look at this code sample:

# example.py

class User:
    def __init__(self, first_name, last_name, job, address):
        self.first_name = first_name
        self.last_name = last_name
        self.job = job
        self.address = address

    @property
    def user_name(self):
        return self.first_name + ' ' + self.last_name

    @property
    def user_job(self):
        return self.user_name + " is a " + self.job

    @property
    def user_address(self):
        return self.user_name + " lives at " + self.address

This code defines a User class which has a constructor which sets attributes first_name, last_name, job and address upon object creation.

It also defines class properties user_name, user_job and user_address which we can use to get a particular user object’s properties.

In our test cases, we can easily use Faker to generate all the required data when creating test user objects.

# test.py

import unittest

from faker import Faker

from example import User


class TestUser(unittest.TestCase):
    def setUp(self):
        self.fake = Faker()
        self.user = User(
            first_name=self.fake.first_name(),
            last_name=self.fake.last_name(),
            job=self.fake.job(),
            address=self.fake.address()
        )

    def test_user_creation(self):
        self.assertIsInstance(self.user, User)

    def test_user_name(self):
        expected_username = self.user.first_name + " " + self.user.last_name
        self.assertEqual(expected_username, self.user.user_name)

You can see that we are creating a new User object in the setUp function. Python calls the setUp function before each test case is run so we can be sure that our user is available in each test case.

The user object is populated with values directly generated by Faker. We do not need to worry about coming up with data to create user objects. Faker automatically does that for us.

We can then go ahead and make assertions on our User object, without worrying about the data generated at all.

You can run the example test case with this command:

$   python -m unittest 

At the moment, we have two test cases, one testing that the user object created is actually an instance of the User class and one testing that the user object’s username was constructed properly. Try adding a few more assertions.
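For example, here is a sketch of one more test method that could be added to the TestUser class above, mirroring the user_job property from example.py:

    def test_user_job(self):
        expected_user_job = self.user.first_name + " " + self.user.last_name + " is a " + self.user.job
        self.assertEqual(expected_user_job, self.user.user_job)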

Localization

Faker comes with a way of returning localized fake data by using locale-specific providers. Some of the bundled locales include English (United States), Japanese, Italian, and Russian, to name a few.

Let’s change our locale to Russian (ru_RU) so that we can generate Russian names:

# example.py

from faker import Factory

myGenerator = Factory.create('ru_RU')

print(myGenerator.name())

In this case, running this code gives us the following output:

> python example.py
Мельникова Прасковья Андреевна

Providers

Providers are just classes which define the methods we call on Faker objects to generate fake data. In the localization example above, the name method we called on the myGenerator object is defined in a provider somewhere. You can see the default included providers here.

Let’s create our own provider to test this out.

# example.py

import random

from faker import Faker
from faker.providers import BaseProvider

fake = Faker()

# Our custom provider inherits from the BaseProvider
class TravelProvider(BaseProvider):
    def destination(self):
        destinations = ['NY', 'CO', 'CA', 'TX', 'RI']

        # We select a random destination from the list and return it
        return random.choice(destinations)

# Add the TravelProvider to our faker object
fake.add_provider(TravelProvider)

# We can now use the destination method:
print(fake.destination())

To define a provider, you need to create a class that inherits from the BaseProvider. That class can then define as many methods as you want. Our TravelProvider example only has one method but more can be added.

Once your provider is ready, add it to your Faker instance like we have done here:

fake.add_provider(TravelProvider) 

Here is what happens when we run the above example:

$   python example.py
CA

Of course, your output might differ. Try running the script a couple more times to see what happens.

Seeds

Sometimes, you may want to generate the same fake data output every time your code is run. In that case, you need to seed the fake generator.

You can use any random number as a seed.

Example:

# example.py

from faker import Faker

myGenerator = Faker()

myGenerator.random.seed(5467)

for i in range(10):
    print(myGenerator.name())

Running this code twice generates the same 10 random names:

> python example.py
Denise Reed
Megan Douglas
Philip Obrien
William Howell
Michael Williamson
Cheryl Jackson
Janet Bruce
Colton Martin
David Melton
Paula Ingram

> python example.py
Denise Reed
Megan Douglas
Philip Obrien
William Howell
Michael Williamson
Cheryl Jackson
Janet Bruce
Colton Martin
David Melton
Paula Ingram

If you want to change the output to a different set of random output, you can change the seed given to the generator.

Using Faker on Semaphore

To use Faker on Semaphore, make sure that your project has a requirements.txt file which has faker listed as a dependency. If you used pip to install Faker, you can easily generate the requirements.txt file by running the command pip freeze > requirements.txt. This will output a list of all the dependencies installed in your virtualenv and their respective version numbers into a requirements.txt file.
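As a sketch, a minimal requirements.txt for this tutorial might contain little more than the Faker version assumed at the top of this article (pip freeze will also list Faker's own dependencies and the exact versions you actually installed):

faker==0.7.11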

After pushing your code to git, you can add the project to Semaphore, and then configure your build settings to install Faker and any other dependencies by running pip install -r requirements.txt. That command simply tells Semaphore to read the requirements.txt file and add whatever dependencies it defines into the test environment.

After that, you can run your tests with python -m unittest discover.
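Put together, a sketch of the Semaphore build commands, using just the commands discussed above, could be:

$   pip install -r requirements.txt
$   python -m unittest discover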

Conclusion

In this tutorial, you have learnt how to use Faker’s built-in providers to generate fake data for your tests, how to use the included location providers to change your locale, and even how to write your own providers.

We also covered how to seed the generator to generate a particular fake data set every time your code is run.

Lastly, we covered how to use Semaphore’s platform for Continuous Integration.

Feel free to leave any comments or questions you might have in the comment section below.



Chris Moffitt: Building a Repeatable Data Analysis Process with Jupyter Notebooks

Introduction

Over the past couple of months, there has been an ongoing discussion about Jupyter Notebooks affectionately called the “Notebook Wars”. The genesis of the discussion is Joel Grus’ presentation I Don’t Like Notebooks and has been followed up with Tim Hopper’s response, aptly titled I Like Notebooks. There have been several follow-on posts on this topic including thoughtful analysis from Yihui Xie.

The purpose of this post is to use some of the points brought up in these discussions as a background for describing my personal best practices for the analysis I frequently perform with notebooks. In addition, this approach can be tailored for your unique situation. I think many new python users do not take the time to think through some of the items I discuss. My hope is that this article will spark some discussion and provide a framework that others can build on for making repeatable, easy-to-understand data analysis pipelines that fit their needs.

Specific Use Cases

My use case is much narrower than what Joel describes. As much as possible, I try to use a Jupyter Notebook as my go-to solution when I need to do moderately complex data analysis in a business setting. Instead of creating an Excel spreadsheet, I build a consistent set of notebook files to document my analysis journey. The key distinctions between my approach and the data science scenarios discussed in the presentations above are:

  • This analysis is only used by me. I do not share actual python code with anyone. All results are shared by other means (email, presentations, Excel, etc).
  • I do not build models that are put into production.
  • All analysis is internal, proprietary and not shared publicly.
  • If a solution needs to be used by others, I will build a standalone python script for them to use.
  • The vast majority of work I describe is data wrangling, EDA and simple statistical analysis. The work is the bread and butter work that Excel is used for in most organizations.

The rest of this article will outline the approach I use in the hopes it can be a framework for others and might help people develop their own repeatable and maintainable work flow.

Why Have Standards?

I imagine that most people that have used Jupyter Notebooks for any significant time have ended up with a directory structure that looks like this:

Messy notebook layout

At a quick glance, there are a lot of problems with this “structure:”

  • Inconsistent or absent naming scheme for notebooks
  • Mixture of notebooks, scripts, Excel, CSV, images, etc all in one directory
  • Vague directory names
  • Difficult to follow “flow” of the processing steps

On top of the non-intuitive structure, each notebook has its own unique structure for analyzing data. Some are documented but many are not. None of these issues is a flaw with notebooks per se but is an example of a sloppy approach to solving a problem. You could just as easily end up with this situation with Excel files or stand alone python scripts.

I’ve certainly done all of the things described above. It is incredibly frustrating when I know I did some really useful analysis but I can’t find it 6 months after the fact. If only I had a little more discipline up front, it would have saved a lot of time over the long run.

One of my biggest complaints about Excel is that it is really difficult to understand how the data was pulled together and how all the cells, formulas and VBA relate to each other. There are very limited options for documenting Excel data analysis flow. I believe that using a well-formed Jupyter Notebook structure can lead to a much more reusable set of data analysis artifacts.

Directory Structures

The first step in the process is creating a consistent directory structure. I have leveraged very heavily from the Cookiecutter Data Science project. If you are doing more complex modeling and sharing code with others, then I encourage you to use the above mentioned cookiecutter framework.

In general, I create a new directory for each analysis and take the time to give the directory a descriptive name. Then, I setup the following directory structure:

FY_18_Sales_Comp/
├── 1-Data_Prep.ipynb
├── 2-EDA.ipynb
├── data
│   ├── interim
│   ├── processed
│   └── raw
└── reports

I will cover the details of the notebooks in a bit but the important item to note is that I include a number followed by the stage in the analysis process. This convention helps me quickly figure out where I need to go to learn more. If I’m just interested in the final analysis, I look in the 2-EDA notebook. If I need to see where the data comes from, I can jump into 1-Data_Prep . I will often create multiple EDA files as I am working through the analysis and try to be as careful as possible about the naming structure so I can see how items are related.

The other key structural issue is that the input and output files are stored in different directories:

  • raw – Contains the unedited csv and Excel files used as the source for analysis.
  • interim – Used if there is a multi-step manipulation. This is a scratch location, not always needed, but helpful to have in place so directories do not get cluttered, or as a temp location for troubleshooting issues.
  • processed – In many cases, I read in multiple files, clean them up and save them to a new location in a binary format. This streamlined format makes it easier to read in larger files later in the processing pipeline.

Finally, any Excel, csv or image output files are stored in the reports directory.
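As a sketch, the raw-to-processed step usually boils down to something like this (the file names here are hypothetical):

import pandas as pd
from pathlib import Path

raw_file = Path.cwd() / "data" / "raw" / "Sales-History.csv"
processed_file = Path.cwd() / "data" / "processed" / "sales_clean.pkl"

# raw files are read-only inputs; the cleaned-up result goes to processed
df = pd.read_csv(raw_file)
# ... cleaning steps ...
df.to_pickle(processed_file)  # binary format that reads back quickly later in the pipeline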

Here is a simple diagram of how the data typically flows in these types of scenarios:

Data flow process

Notebook Structure

Once I create each notebook, I try to follow consistent processes for describing the notebooks. The key point to keep in mind is that this header is the first thing you will see when you are trying to figure out how the notebook was used. Trust me, future you will be eternally thankful if you take the time to put some of these comments in the notebook!

Here’s an image of the top of an example notebook:

Notebook header

There are a couple of points I always try to include (a sketch of such a header cell follows the list):

  • A good name for the notebook (as described above)
  • A summary header that describes the project
  • Free form description of the business reason for this notebook. I like to include names, dates and snippets of emails to make sure I remember the context.
  • A list of people/systems where the data originated.
  • I include a simple change log. I find it helpful to record when I started and any major changes along the way. I do not update it with every single change but having some date history is very beneficial.
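As a sketch, the first markdown cell of such a notebook might look something like this (the project, names and dates are made up):

# FY18 Sales Compensation Analysis

**Summary:** Prep and analyze FY18 sales results to support the annual compensation review.

**Background:** Requested by sales ops (email from 2018-09-12); results to be shared as an Excel summary.

**Data sources:** CRM sales history export, finance pipeline spreadsheet.

**Change log:**

- 2018-09-14 Started project
- 2018-10-02 Added pipeline data and reran summary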

I tend to include similar imports in most of my notebooks:

import pandas as pd
from pathlib import Path
from datetime import datetime

Then I define all my input and output file paths and directories. It is very useful to do this all in one place at the top of the file. The other key thing I try to do is make all of my file path references relative to the notebook directory. By using Path.cwd() I can move notebook directories around and it will still work.

I also like to include date and time stamps in the file names. The new f-strings plus pathlib make this simple:

today = datetime.today()
sales_file = Path.cwd() / "data" / "raw" / "Sales-History.csv"
pipeline_file = Path.cwd() / "data" / "raw" / "pipeline_data.xlsx"
summary_file = Path.cwd() / "data" / "processed" / f"summary_{today:%b-%d-%Y}.pkl"

If you are not familiar with the Path object, my previous article might be useful.

The other important item to keep in mind is that raw files should NEVER be modified.

The next section of most of my notebooks cleans up the column names. My most common steps, with a short sketch after the list, are:

  • Remove leading and trailing spaces in column names
  • Align on a naming convention (dunder, CamelCase, etc.) and stick with it
  • When renaming columns, do not include dashes or spaces in names
  • Use a rename dictionary to put all the renaming options in one place
  • Align on a name for the same value. Account Num, Num, Account ID might all be the same. Name them that way!
  • Abbreviations may be ok but make sure it is consistent (for example – always use num vs number)
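Here is a minimal sketch of that clean-up, assuming the imports and the sales_file path defined above; the column names themselves are hypothetical:

df = pd.read_csv(sales_file)

# remove leading/trailing spaces from the column names
df.columns = df.columns.str.strip()

# keep every rename in one dictionary so the naming decisions live in one place
column_renames = {
    "Account Num": "account_num",   # pick one name for the same value and stick with it
    "Sales Amount": "sales_amount",
    "Order Date": "order_date",
    "Region": "region",
}
df = df.rename(columns=column_renames)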

After cleaning up the columns, I make sure all the data is in the types I expect and need; my previous article on data types should be helpful here. The main checks, with a short sketch after the list, are:

  • If you have a need for a date column, make sure it is stored as one.
  • Numbers should be int or float and not object
  • Categorical types can be used based on your discretion
  • If it is a Yes/No, True/False or 1/0 field make sure it is a boolean
  • Some data like US zip codes or customer numbers might come in with a leading 0. If you need to preserve the leading 0, then use an object type.
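A minimal sketch of those conversions, continuing with the hypothetical columns from the sketch above:

df["order_date"] = pd.to_datetime(df["order_date"])
df["sales_amount"] = df["sales_amount"].astype("float64")
df["region"] = df["region"].astype("category")
df["is_active"] = df["is_active"].map({"Yes": True, "No": False})

# to preserve a leading 0 (zip codes, customer numbers), read the column as a
# string when the file is first loaded, e.g. pd.read_csv(sales_file, dtype={"zip_code": str})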

Once the column names are cleaned up and the data types are correct, I will do the manipulation of the data to get it in the format I need for further analysis.

Here are a few other guidelines to keep in mind:

  • If you find a particular tricky piece of code that you want to include, be sure to keep a link to where you found it in the notebook.

  • When saving files to Excel, I like to create an ExcelWriter object so I can easily save multiple sheets to the output file. Here is what it looks like:

    writer = pd.ExcelWriter(report_file, engine='xlsxwriter')
    df.to_excel(writer, sheet_name='Report')
    writer.save()

Operationalizing & Customizing This Approach

There are a lot of items highlighted here to keep in mind. I am hopeful that readers have thought of their own ideas as well. Fortunately, you can build a simple framework that is easy to replicate for your own analysis by using the cookiecutter project to build your own template. I have placed an example based on this project on github.

Once you install cookiecutter, you can replicate this structure for your own projects:

$   cookiecutter https://github.com/chris1610/pbp_cookiecutter
project_name [project_name]: Deep Dive On December Results
directory_name [deep_dive_on_december_results]:
description [More background on the project]: R&D is trying to understand what happened in December

After answering these questions, you will end up with the directory structure and a sample notebook that looks like this:

Final Notebook

The nice result of this approach is that you only need to answer a couple of simple questions to get the template started and to populate the notebook with some of the basic project description. My hope is that this lightweight approach will be easy to incorporate into your analysis. I feel it provides a framework for repeatable analysis without being so burdensome that the extra work of implementing it keeps you from using it.

In addition, if you find this approach useful, you could tailor it even more for your own needs by adding conditional logic to the process or capturing additional information to include in the notebooks. One idea I have played around with is including a snippets.py file in the cookiecutter template where I save some of my random/useful code that I use frequently.

I’ll be curious what others think about this approach and about any ideas you may have incorporated into your own workflow. Feel free to chime in with your input in the comments below.


gamingdirectional: Create the Overlap class for Pygame project

Hello there, sorry for being a little late today; I am busy setting up my old website, which is now ready for me to add more articles to. If you are interested in more programming articles, do visit that site, because I am going to create a brand new laptop application project with python there. If you are a python lover, go ahead and bookmark the site and visit it starting from…

Source
