Don't you love data?

Data Science Blog

Intoduction to Importing Data in Python. Part 2

March 15th, 2020

In my Importing Data in Python (part 1) post I talked about importing tabular data into Python. Coming from a Statistics/Econometrics background, I mainly used tabular data for analysis, so it definitely took some time getting used to working with non-tabular data such as JSON and XML.

Why is it important to know how to work with JSON or XML data? If you ever find yourself in a situation where you need to access data via an API (Application Programming Interface),your output will either be returned in a JSON or XML format, and knowing how to extract information from an API response is an absolutely useful skill.

In order to access data from an API, a few things are needed:

  • URL where the data resides
  • Authentication may be required and it comes in a variety of forms (Bearer Token, Oath, etc.)
  • Method will typically be GET or POST
  • Parameters or Request Body

JSON

JSON (JavaScript Object Notation) is a file format that stores information in key-value pairs where each key is enclosed into double quotes and value can be a string, integer, float, boolean, another object or even an array of objects, integers or strings. It's a convenient way to store information and requires mainly knowing how to extract certain attributes from a dictionary or object

Below I will show how to send a request to a REST API in Python, load a JSON response, parse it and convert it into a dataframe. I will use Open Air Quality API in this example. If you look at the documentation (which you should always review before working with APIs), there are several types of data you can get. I will use countries API, where the returned data looks like this:

screenshot of JSON file

To access the data from this API, we will use get methods from requests library. To view the returned string, we can use text method. We can't easily extract information from a string, so we'll need to convert it to a dictionary with load function from json package.

We can see that the response contains a key "meta", and its value is a dictionary itself. We can skip this dictionary as it doesn't contain any useful information. The second key after "meta" is "results", and its value is a list of dictionaries where each dictionary contains a 2-lettered country code represented by "code" key, country name represented by "name" key, number of measurements for the country ("count"), number of cities in the country ("cities), and number of locations ("locations").

Below are a few examples of how you can extract various information from the returned response. If you want to convert a list of dictionaries into a data frame, you will simply need to use DataFrame function from pandas library. It will set column names to key values.

XML

XML (Extensible Markup Language) isn't as common as JSON, but when you work with SOAP APIs (and even some REST APIs), their response would be returned as XML. If you are familiar with HTML, you might find XML more readable. While searching for an example of a publicly available API with XML response, I couldn't find one, so I took the following XML example from the Internet.

screenshot of XML file

There is an xml package in python that you could use, but I think there is a simpler way to work with xml output by utilizing xmltodict library. The name of the library probably gave it away. You can use its parse function to convert an xml response to ordered dictionary which you would parse exactly like a regular dictionary.

This is what an ordered dictionary of the above xml looks like. I formatted the output slightly so that the structure of the file is more intuitive. It has a main object called products and another object product nested in the main object which in turn contains a list of individual product names, quantities and categories.

screenshot of XML file