Don't you love data?

Data Science Blog

Intro to SQL

October 12th, 2020

I strongly believe that if you want a Data Analyst or Data Scientist job, you must be comfortable with extracting and manipulating data from SQL. This post should not be used as a comprehensive guide to SQL, but it will provide an overview of most common SQL commands.

What is SQL? SQL stands for Structured Query Language (you can pronounce it as 'sequel' or 'ess-cue-elle'). It allows you to retrieve data stored in relational databases, such as Microsoft SQL Server, MySQL or Oracle, and manipulate it in a variety of ways.

If a concept of a database is new to you, you can think of it as a collection of tables. For instance, you can draw a comparison to an Excel workbook with multiple worksheets where each worksheet is somehow related to another. Worksheets in this comparison are SQL tables and the entire workbook would be a database. The key is to have some sort of relationship between the tables that allows you to bring all that data together. Please keep in mind that this is just an analogy and Excel workbook is not a database.

Each table in SQL consists of rows and columns where columns represent different data attributes and rows represent observations or data records. Tables can contain different types of data but the main types are integer, decimal, character or string, date, and time.

In this tutorial, I won't go into details on how to setup or create a database but will focus on several useful SQL commands instead. You will learn how to select data from tables, create new variables using conditional logic, filter records, aggregate tables, sort records in a table, and join multiple tables together. These are a must know basic commands for anyone interested in learning SQL.

SELECT

SELECT is the most important command as it allows you to retrieve data from a table. SELECT on itself won't do anything, so you need to specify what data you want to retrieve and from where.

You can retrieve data from a table, so you need to specify a table name in the SELECT statement. SELECT statement refers to the columns part of a table and not the rows, so you also need to list columns. Here are a few things to remember:

  • To select a column, you will simply need to type its name
  • You can select all columns by typing their name or you can use * to select all columns without typing their names
  • You can select as many or as few columns from a table as you want
  • You can even create new columns in the SELECT statement

Let's assume that we have a table named "Provinces" that contains two columns -- "Province" and "ProvinceCode".

screenshot of aggregate table
You can use the following statement to select all rows and columns from this table.

SELECT Province, ProvinceCode FROM Provinces

The same can be achieved by running the following command.

SELECT * FROM Provinces

You can also select only Province from this table.

SELECT Province FROM Provinces

You can add a column to the output of your select statement. Let's say you want to indicate that all of this provinces are in Canada by adding a column "Country" with a value of "Canada". You'll need to specify first what the value should be followed by a key word as and the name you want this column to take. Please note that this statement does not modify "Provinces" table! Also, you can only use single quotes in SQL when working with strings.

SELECT *, 'Canada' as Country FROM Provinces

If you want to add a product of two numeric columns to the output of the select statement, this is what you can do

SELECT num_col1, num_col2, num_col1*num_col2 as num_col_product FROM table_name

Any other mathematical operator will work. There are also built-in functions that you can apply on columns to create new columns. Let's say you have a date column, and you want to add a year calculated from that column.

SELECT date_column, year(date_column) as yr FROM table_name

If you want to create a column that contains first two characters from one column and last three characters from another column separated by -, you can use LEFT, RIGHT functions and + to combine multiple strings together.

SELECT str_col1, str_col2, left(str_col1,2)+'-'+right(str_col2,3) as new_col FROM table_name

You can find an overview of various functions Here. I encourage you to read documentation and familiarize yourself with what functions you can use. You don't need to remember the exact syntax or arguments that a function takes. Should you ever need to use one of these functions, you can always Google it.

WHERE

The commands above would select all rows from table "Provinces". There will be cases where you don't want all of the observations to be returned; that's when you use the WHERE clause. It is typically specified after you specify a table name (unless you do a join, more on this to come). This is a typical syntax WHERE column_name operator value. Here is a list of operations that you can perform in the WHERE clause.

  • Mathematical comparisons with the following operators =, <,>, <=, >=, <> . Note that <> means not equal. =, <> can be used for strings and numbers
  • To search for a pattern in a string, you can use LIKE operator
  • You can use IN operator to specify multiple values in a column
  • You can have multiple filters in the WHERE clause separated by AND or OR. Don't forget to specify column name for every filter you pass in the WHERE clause
Don't worry if the above is a little confusing or feels too much. The examples below should help you understand the syntax better.

Let's say we want to select all columns from "Provinces" table where column "ProvinceCode" is ON.

SELECT * FROM Provinces WHERE ProvinceCode = 'ON'

Let's say we want to select all columns from "Provinces" table where column "ProvinceCode" is not ON.

SELECT * FROM Provinces WHERE ProvinceCode <> 'ON'

If we want to filter based on multiple values, we can do the following

SELECT * FROM Provinces WHERE ProvinceCode = 'ON' or ProvinceCode = 'BC'

You can see that you can specify multiple filters by using OR in this case. Keep in mind if you have two or more filters in the WHERE clause, you need to type variable name and operator again.

This will return an ERROR

INVALID SYNTAX!! SELECT * FROM Provinces WHERE ProvinceCode = 'ON' or 'BC'

The filtering example above is not very efficient. You would typically use multiple filtering criteria for different columns. If you want to select multiple values from a column, you can use IN operator.

SELECT * FROM Provinces WHERE ProvinceCode in ('ON', 'BC')

If you want all values but ON and BC, you can simply add not before IN

SELECT * FROM Provinces WHERE ProvinceCode not in ('ON', 'BC')

When you have strings in a column and want to select values based on a certain pattern you can use LIKE operator. There are two wildcards that can be used with it. % which represents zero, one or multiple characters or _ which represents single character.

Let's select all provinces that have an s in their name. Since the length of each Province name can vary and it can have an s anywhere in the name, you can do the following

SELECT * FROM Provinces WHERE Province like '%s%'

If you want to select a Province that has an i as second to last letter, you can use _

SELECT * FROM Provinces WHERE Province like '%i_'

CASE WHEN

You can use conditional logic to create variables by following this syntax.

CASE WHEN condition THEN result1 ELSE result2 END AS new_variable

You can also have multiple conditions.

CASE WHEN condition1 THEN result1 WHEN condition2 THEN result2 ELSE result3 END AS new_variable

Using "Province" table, you can create a new column "Region" that takes "Atlantic" value if "ProvinceCode" is NL, NB, PE or NS. For all other provinces, let's set the value to "Non-Atlantic".

SELECT *, CASE WHEN ProvinceCode in ('NL', 'NB', 'PE', 'NS') then 'Atlantic' else 'Non-Atlantic' END AS Region FROM Provinces

GROUP BY

GROUP BY can be used to summarize values in a table (aggregate, average, count, etc.). The following table can serve as an example for when you would want to use GROUP BY. The table contains 8 records, 4 observations for each year where each row represents a quarter for the corresponding year.

screenshot of aggregate table

What if you were asked to calculate average sales or total sales for 2019 and 2020? That's where you can use GROUP BY. In order to use it correctly, in the SELECT statement, you will first need to specify the columns for which you want to display aggregated data (year in our case), a column that needs to be transformed and indicate which transformation needs to be applied (summed, averaged, counted, take the max or min).

Please note that in order for this query to work correctly, you will need to exclude "Quarter" column. If you include it, since there is one record per Quart & Year, you will simply get the same output as the original table.

The rest of your query would follow as standard, i.e. you indicate FROM which table you need to retrieve the data. If there are any filters you need to apply, you specify them in the WHERE clause. Finally, you add GROUP BY which should contain one or more variables separated by commas for which you are grouping the data.

SELECT var1, var2, sum(var3) as sum_var3 FROM table GROUP BY var1, var2

Notice that you only specify the variable that don't need to be grouped in the GROUP BY statement.

Let's see how we can calculate total sales for each year.

SELECT Year, sum(Sales) as Sales FROM qrt_sales GROUP BY Year

This is how you can get average annual sales.

SELECT Year, avg(Sales) as avg_Sales FROM qrt_sales GROUP BY Year

You can get total and average sales all in one statement.

SELECT Year, avg(Sales) as avg_Sales, sum(Sales) as sum_Sales FROM qrt_sales GROUP BY Year

You can also count how many observations there are in this table for each year (it's obvious in this example, but there will be cases where you might need to get observation count).

SELECT Year, count(Sales) as count_Sales FROM qrt_sales GROUP BY Year

ORDER BY

If you want to sort the order of the SELECT statement, you can use ORDER BY. This doesn't change the order of records in a table, and only affects the output of the statement. You can sort in descending order by adding desc key word after the name of the variable you want to sort. The default order is ascending. You can also sort by multiple variables; you just need to separate them by commas.

SELECT * FROM Provinces ORDER BY ProvinceCode

JOIN

If you remember in the beginning of the post, I mentioned that a database is a collection of tables where each table has some sort of identifier or key that allows you to join the tables together. That's really what a JOIN clause does.

joins diagram
  • LEFT JOIN returns all records from the left table and records from the right table that are found in the left table.
  • RIGHT JOIN returns all records from the right table and records from the left table that are found in the right table.
  • INNER JOIN only returns records that are present in both tables.
  • FULL JOIN returns all records from both tables.
joins table

This is what the output of the four joins would look like using the tables in the example above. Notice that the unique key in Table 1 and Table 2 is City. Also, a JOIN returns a NULL for when a match isn't found; this is how SQL flags missing values.

joins output

Here is the syntax for four different joins shown above.

SELECT t1.City, t1.Airport, t2.Ocean FROM Table1 t1 LEFT JOIN Table2 t2 ON t1.City=t2.City SELECT t2.City, t2.Ocean, t1.Airport FROM Table1 t1 RIGHT JOIN Table2 t2 ON t1.City=t2.City SELECT t1.City, t1.Airport, t2.Ocean FROM Table1 t1 INNER JOIN Table2 t2 ON t1.City=t2.City SELECT t1.City, t1.Airport, t2.Ocean FROM Table1 t1 FULL JOIN Table2 t2 ON t1.City=t2.City

A few things to remember about the syntax.

  • You need to specify join type after the FROM command.
  • If certain columns exist in both tables, you need to specify from which table you want to select a column by adding table name before column name SELECT table_name1.variable_name1, table_name2.variable_name2
  • Instead of typing full table name before column name, you can assign an alias in the FROM command and JOIN clause. In the example above, you can see that in FROM Table1 t1 LEFT JOIN Table2 t2 t1 and t2 are aliases which allows me to use them before column names rather than full table names.
  • You need to specify a key or keys on which you want to join tables, like this ON t1.key=t2.key
  • You can join table on multiple keys ON t1.key1=t2.key1 and t1.key2=t2.key2. Please note that the names of the key columns you want to join on, don't need to match.
  • You can join more than two tables and can even perform different types of joins.

SELECT t1.Key, t2.Name1, t3.Name2 FROM table1 t1 LEFT JOIN table2 t2 ON t1.key = t2.key RIGHT JOIN table3 t3 ON t1.key = t3.key

UNION

UNION can also combine two or more tables but unlike a JOIN clause, it appends the tables on top of each other. In order for UNION to work correctly, you need to specify the same number of columns appearing in the same order from each table.

SELECT col1, col2 FROM table1 UNION SELECT col1, co2 FROM table2

When can UNION be helpful? If you have similar data stored in different tables. Perhaps, data from one year is stored in one table whereas data from another year is stored in a different table, and you want to combine it in one table, that's when you would use the UNION operator.

Putting all the code mentioned above together, this is how your syntax would look like.

SELECT t1.col1, t1.col2, sum(t2.col3) as col3 FROM table1 t1 LEFT JOIN table2 t2 on t1.id=t2.id WHERE t1.col1 in ('A', 'B') and t1.col2 < 50 GROUP BY t1.col1, t1.col2 ORDER BY t1.col1, t1.col2

Notice how each command starts on a separate line? If you write everything in one line, your code will still work, but it will be not very readable, so I'd say it's a good practice to write your code in a similar way.

I'm planning on adding a SQL playground to this post where you can write basic commands covered in this tutorial. I'm hoping it can be completed within a few weeks. In the meantime, here are a few useful links you can checkout.