Data Science Blog

Introduction to Importing Data in Python. Part 1

March 5th, 2020

As a data scientist, you really need to get comfortable working with data. Before you can build a cool machine learning model, you need to spend a significant amount of time figuring out where the data resides and in which format, how to access it, how to join multiple tables together, and how to prepare it for further analysis. In this post, I will cover most of the tabular data formats I have had a chance to work with and how to import them in Python.

CSV Files

CSV files are probably the least common data format that I personally get to work with, but I've interviewed quite a few people who primarily work with CSV files, so it's important to cover them.

I downloaded a GDP CSV file from the World Bank. Fortunately, the downloaded file is quite messy. There are extra empty columns, empty top and bottom rows, columns with information we may not want, and column names that take up two cells. While you could clean the file up in Excel, let's have some fun and clean it in Python.

[Screenshot of the downloaded CSV file]

In this example, we will use the read_csv function from the pandas library, which comes with a variety of helpful options. One of pandas' disadvantages is that the library is big, and I've run into a few problems referencing it during deployment, but since we are simply trying to load a CSV file into Python, I opted to use it. Python's built-in csv module is also worth looking into.

How do we clean up the file? We can use the skiprows option to remove the empty top rows. We can also drop the 3rd and 6th columns, since they aren't needed, and rename the remaining columns in the data frame. Lastly, we will use the dropna function to remove rows with at least one missing value.
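The steps above can be sketched as follows. Since I can't bundle the actual World Bank download here, an inline sample stands in for it; the number of rows to skip, the column positions, and the new column names are assumptions you would adjust to match the real file (e.g. pd.read_csv("gdp.csv", skiprows=4)).

```python
import io
import pandas as pd

# Inline sample standing in for the downloaded World Bank file:
# empty top rows, a trailing empty column, and a missing value.
raw = io.StringIO(
    "\n"
    "\n"
    "\n"
    "\n"
    "Country Name,Country Code,Indicator Name,Indicator Code,2019,\n"
    "Aruba,ABW,GDP (current US$),NY.GDP.MKTP.CD,3310056000,\n"
    "Afghanistan,AFG,GDP (current US$),NY.GDP.MKTP.CD,,\n"
)

# skiprows drops the empty top rows before the header
df = pd.read_csv(raw, skiprows=4)

# drop the unneeded 3rd and 6th columns (0-based positions 2 and 5)
df = df.drop(df.columns[[2, 5]], axis=1)

# rename the remaining columns
df.columns = ["country", "code", "indicator_code", "gdp_2019"]

# dropna removes rows with at least one missing value
df = df.dropna()

print(df)
```

Only the clean rows survive: Afghanistan is dropped because its GDP cell is empty.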

Relational Databases

Most of the data I work with resides in relational databases, such as MySQL, Microsoft SQL Server or Oracle. I'll cover three Python libraries that I've used in the past.

  • ✓ Pypyodbc
  • ✓ PyMySQL
  • ✓ cx_Oracle

Before we access the data, we need to establish a connection by specifying the server name, database name, username, and password. We then need to create a cursor object. To get all the rows returned by a query, use the fetchall function; if you only want one row back, use fetchone. You can save the output of fetchall to pretty much any data structure, including a data frame, by using pandas' DataFrame constructor. Keep in mind that this won't keep the column names, so you'll need to list them separately.
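Here is a minimal sketch of that connect / cursor / fetch pattern. Since a live MySQL or Oracle server isn't available here, the standard library's sqlite3 stands in for pypyodbc, PyMySQL, or cx_Oracle; only the connect() arguments differ per library (server, database, username, password), and the table and column names are made up for illustration.

```python
import sqlite3
import pandas as pd

# sqlite3 stands in for pypyodbc / PyMySQL / cx_Oracle; with those
# libraries you would pass server, database, user, and password here.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# hypothetical table, just so the query has something to return
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 100.0), ("West", 250.0)],
)

# fetchall returns every row of the result; fetchone returns one row
cur.execute("SELECT region, amount FROM sales")
rows = cur.fetchall()

# fetchall does not carry column names, so list them explicitly
# when building the data frame
df = pd.DataFrame(rows, columns=["region", "amount"])

conn.close()
print(df)
```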

A few things to note. Pypyodbc might give you problems when deploying an application in a serverless environment on AWS. You can also try pandas' read_sql function to import data from MySQL or MS SQL Server. A more advanced way of transferring data from relational databases into Python is to use an object-relational mapper (ORM) such as SQLAlchemy or PonyORM; it requires a bit more coding but comes with a variety of advantages. Lastly, make sure you have the appropriate database drivers installed.
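For completeness, here is a sketch of the read_sql route. It skips the cursor entirely and returns a data frame with column names already attached. Again sqlite3 stands in for a real server; for MySQL you would instead pass a SQLAlchemy engine such as create_engine("mysql+pymysql://user:pass@host/db"), and the table contents below are invented for illustration.

```python
import sqlite3
import pandas as pd

# In-memory database standing in for a real MySQL / MS SQL Server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gdp (country TEXT, value REAL)")
conn.executemany(
    "INSERT INTO gdp VALUES (?, ?)",
    [("Aruba", 3.31e9), ("Afghanistan", 1.91e10)],
)

# read_sql runs the query and keeps the column names automatically
df = pd.read_sql("SELECT country, value FROM gdp", conn)
conn.close()
print(df)
```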