October 15, 2019 Author: Matthew Renze

In data science, the most common data structure we encounter is a table of data. We refer to this as “tabular data”.

In order to extract information from tabular data, there are five main concepts that you need to understand: tables, observations, variables, relationships, and queries.

Tables

A table is a two-dimensional grid of data. However, unlike other two-dimensional data structures, a table follows a simple rule: all cells in each column must contain the same data type. This is different from a matrix, where every cell must contain the same data type. It is also different from a spreadsheet, where any cell can contain any data type.
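To make the column rule concrete, here is a minimal Python sketch using only the standard library. The sample readings and the helper names `column_types` and `is_table` are illustrative assumptions, not part of any library.

```python
# Hypothetical sample data: each tuple is one row of the grid.
header = ("time", "temperature_c", "sensor")
rows = [
    ("09:00", 18.5, "A"),
    ("10:00", 19.1, "A"),
    ("11:00", 20.4, "B"),
]


def column_types(header, rows):
    """Map each column name to the set of Python types found in that column."""
    types = {name: set() for name in header}
    for row in rows:
        for name, value in zip(header, row):
            types[name].add(type(value))
    return types


def is_table(header, rows):
    """Return True when every column holds exactly one data type."""
    return all(len(found) == 1 for found in column_types(header, rows).values())


print(is_table(header, rows))  # True

# Mixing types within a column, as a spreadsheet allows, breaks the rule.
mixed_rows = rows + [("12:00", "warm", "B")]
print(is_table(header, mixed_rows))  # False
```

Note that different columns may still hold different types (text for the time, a number for the temperature). That is what separates a table from a matrix.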

Observations

An observation is a recording of the qualities and quantities of an observable phenomenon in the natural world. This includes anything that we can see, hear, feel, or measure with sensors. For example, an hourly temperature reading from a thermometer is an observation.

In data science, we record observations on the rows of a table. The rows are the horizontal groups of data that are contained within the table.

Variables

A variable is a placeholder for a value that changes. For example, recording both the temperature and the time the temperature was recorded produces two separate variables. We call them “variables” because their values “vary” from one observation to the next.

In data science, we store variables on the columns of a table. Columns are the vertical groups of data that are contained within the table.
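As a rough illustration of rows versus columns, the Python sketch below stores a few made-up hourly readings. The field names `time` and `temperature_c` are assumptions chosen for this example.

```python
# Hypothetical hourly readings: each dictionary is one observation (a row).
observations = [
    {"time": "09:00", "temperature_c": 18.5},
    {"time": "10:00", "temperature_c": 19.1},
    {"time": "11:00", "temperature_c": 20.4},
]

# A row: everything recorded for a single observation.
first_observation = observations[0]

# A column: one variable's value across every observation.
temperatures = [obs["temperature_c"] for obs in observations]

print(first_observation)  # {'time': '09:00', 'temperature_c': 18.5}
print(temperatures)       # [18.5, 19.1, 20.4]
```

Reading across gives you one observation. Reading down gives you one variable.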

Relationships

In data science, we want each table to contain only a single type of observation or type of entity. A relationship is a way to express how a row of data in one table is related to a row of data in another table.

We create relationships between rows using “keys”. We use a primary key to uniquely identify an observation in its source table. Then we use a foreign key in observations in a second table to refer back to the original observation in the source table.
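Here is one way keys might look in practice, sketched with Python's built-in `sqlite3` module and an in-memory database. The `sensor` and `reading` tables and their columns are hypothetical examples, not taken from any particular dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Source table: sensor_id is the primary key that identifies each sensor.
conn.execute("""
    CREATE TABLE sensor (
        sensor_id INTEGER PRIMARY KEY,
        location  TEXT NOT NULL
    )
""")

# Second table: sensor_id is a foreign key that refers back to sensor.
conn.execute("""
    CREATE TABLE reading (
        reading_id    INTEGER PRIMARY KEY,
        sensor_id     INTEGER NOT NULL REFERENCES sensor (sensor_id),
        temperature_c REAL NOT NULL
    )
""")

conn.execute("INSERT INTO sensor VALUES (1, 'roof')")
conn.execute("INSERT INTO reading VALUES (1, 1, 18.5)")

# Follow the foreign key from a reading back to its sensor.
row = conn.execute("""
    SELECT reading.temperature_c, sensor.location
    FROM reading
    JOIN sensor ON sensor.sensor_id = reading.sensor_id
""").fetchone()
print(row)  # (18.5, 'roof')

# A reading that points at a sensor that does not exist is rejected.
try:
    conn.execute("INSERT INTO reading VALUES (2, 99, 21.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

Each table holds a single type of entity (sensors in one, readings in the other), and the key pair links a row in one to a row in the other.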

Queries

A query is a computer representation of a question we want to answer using a table of data. Queries allow us to ask questions about the data and return answers as results. Queries are typically created using a special type of programming language called a query language.

The most popular query language is Structured Query Language (or SQL for short). However, you can also perform queries using other programming languages like Python and R.
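To show the same question asked both ways, the sketch below runs a SQL query through Python's built-in `sqlite3` module and then answers it again in plain Python. The `reading` table and its values are made up for illustration.

```python
import sqlite3

# Hypothetical table of readings: (time, temperature in degrees Celsius).
readings = [
    ("09:00", 18.5),
    ("10:00", 19.1),
    ("11:00", 20.4),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reading (time TEXT, temperature_c REAL)")
conn.executemany("INSERT INTO reading VALUES (?, ?)", readings)

# Question: "At which times was it warmer than 19 degrees?"
sql_result = conn.execute(
    "SELECT time FROM reading WHERE temperature_c > 19 ORDER BY time"
).fetchall()
sql_times = [time for (time,) in sql_result]

# The same question expressed directly in Python.
python_times = sorted(time for time, temp in readings if temp > 19)

print(sql_times)     # ['10:00', '11:00']
print(python_times)  # ['10:00', '11:00']
```

Both versions filter the rows by a condition on one variable and return another variable as the answer.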

Understanding the basics of working with tabular data is one of the first steps in becoming proficient with data science. To learn more, please watch my free online course Intro to Data for Data Science.
