Pandas Tutorial: DataFrames in Python
Explore data analysis with Python. Pandas DataFrames make manipulating your data easy, from selecting or replacing columns and indices to reshaping your data.
Pandas is a popular Python package for data science, and with good reason: it offers powerful, expressive and flexible data structures that make data manipulation and analysis easy, among many other things. The DataFrame is one of these structures.
This tutorial covers Pandas DataFrames, from basic manipulations to advanced operations, by tackling 11 of the most popular questions so that you understand -and avoid- the doubts of the Pythonistas who have gone before you.
Content
- How To Create a Pandas DataFrame
- How To Select an Index or Column From a DataFrame
- How To Add an Index, Row or Column to a DataFrame
- How To Delete Indices, Rows or Columns From a DataFrame
- How To Rename the Columns or Indices of a DataFrame
- How To Format the Data in Your DataFrame
- How To Create an Empty DataFrame
- Does Pandas Recognize Dates When Importing Data?
- When, Why and How You Should Reshape Your DataFrame
- How To Iterate Over a DataFrame
- How To Write a DataFrame to a File
What Are Pandas Data Frames?
Before you start, let’s have a brief recap of what DataFrames are.
Those who are familiar with R know the data frame as a way to store data in rectangular grids that can easily be overviewed. Each row of these grids corresponds to measurements or values of an instance, while each column is a vector containing data for a specific variable. This means that a data frame’s rows do not need to contain, but can contain, the same type of values: they can be numeric, character, logical, etc.
Now, DataFrames in Python are very similar: they come with the Pandas library, and they are defined as two-dimensional labeled data structures with columns of potentially different types.
In general, you could say that the Pandas DataFrame consists of three main components: the data, the index, and the columns.
- Firstly, the DataFrame can contain data that is:
  - a Pandas DataFrame
  - a Pandas Series: a one-dimensional labeled array capable of holding any data type, with axis labels or an index. An example of a Series object is one column from a DataFrame.
  - a NumPy ndarray, which can be a record or structured array
  - a two-dimensional ndarray
  - dictionaries of one-dimensional ndarrays, lists, dictionaries or Series.
  Note the difference between np.ndarray and np.array(). The former is an actual data type, while the latter is a function to make arrays from other data structures.
  Structured arrays allow users to manipulate the data by named fields. In a structured array of three tuples, for instance, the first element of each tuple can be called foo and be of type int, while the second element is named bar and is a float.
  Record arrays, on the other hand, expand the properties of structured arrays. They allow users to access fields of structured arrays by attribute rather than by index, so the foo values of a record array r2 can be read as r2.foo.
- Besides data, you can also specify the index and column names for your DataFrame. The index, on the one hand, indicates the difference in rows, while the column names indicate the difference in columns. You will see later that these two components of the DataFrame will come in handy when you’re manipulating your data.
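The difference between structured arrays and record arrays can be sketched as follows. The names my_array and r2 and the use of np.ones() to fill the array are assumptions made for illustration; only the field names foo and bar and their types come from the description above.

```python
import numpy as np

# Structured array: three (foo, bar) tuples with named, typed fields
my_array = np.ones(3, dtype=[('foo', int), ('bar', float)])
foo_by_name = my_array['foo']      # fields are accessed by name

# Record array: a view on the same data whose fields are also attributes
r2 = my_array.view(np.recarray)
foo_by_attribute = r2.foo

print(foo_by_name, foo_by_attribute)
```

Either of these arrays can be passed to pd.DataFrame(), in which case the field names become the column names.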
Note that in this post, most of the time, the libraries that you need have already been loaded. The Pandas library is usually imported under the alias pd, while the NumPy library is loaded as np. Remember that when you code in your own data science environment, you shouldn’t forget this import step, which you write just like this:
import numpy as np
import pandas as pd
Now that there is no doubt in your mind about what DataFrames are, what they can do and how they differ from other structures, it’s time to tackle the most common questions that users have about working with them!
1. How To Create a Pandas DataFrame
Making your DataFrames is the first step in almost anything that you want to do when it comes to data munging in Python. Sometimes, you will want to start from scratch, but you can also convert other data structures, such as lists or NumPy arrays, to Pandas DataFrames. This section only covers the latter. If you want to read more on making empty DataFrames that you can fill up with data later, go to question 7.
A NumPy ndarray is one of the many things that can serve as input to make a DataFrame. To make a DataFrame from a NumPy array, you can just pass it to the DataFrame() function in the data argument.
Pay attention to how elements are selected from the NumPy array to construct the DataFrame: you first select the values that are contained in the lists that start with Row1 and Row2, then you select the index or row labels Row1 and Row2, and then the column names Col1 and Col2.
Selecting part of the data works the same as subsetting 2D NumPy arrays: you first indicate the row that you want to look in for your data, then the column. Don’t forget that the indices start at 0! For an array data laid out this way (column names in the first row, row labels in the first column), you look in the rows from index 1 to the end, and in each of them you select all elements from index 1 onward. As a result, you end up selecting 1, 2, 3 and 4.
This approach to making DataFrames is the same for all the structures that DataFrame() can take as input.
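The construction described above can be sketched like this. The exact contents of data are an assumption reconstructed from the description (labels Row1, Row2, Col1, Col2 and the values 1 to 4). Because the array mixes text and numbers, NumPy stores every element as a string.

```python
import numpy as np
import pandas as pd

# Assumed layout: first row holds column names, first column holds row labels
data = np.array([['', 'Col1', 'Col2'],
                 ['Row1', 1, 2],
                 ['Row2', 3, 4]])

df = pd.DataFrame(data=data[1:, 1:],     # the values 1, 2, 3 and 4
                  index=data[1:, 0],     # Row1 and Row2
                  columns=data[0, 1:])   # Col1 and Col2
print(df)
```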
When you build a Series from a dictionary, and a DataFrame from that Series, the index contains the keys of the original dictionary. Older Pandas versions sorted these keys, while recent versions keep the insertion order of the dictionary. With keys that are already in alphabetical order, Belgium ends up at position 0 and the United States at position 3 either way.
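As a small sketch of the dictionary case: the country and capital pairs below are illustrative assumptions, chosen so that the keys are already in alphabetical order.

```python
import pandas as pd

# Hypothetical data: dictionary keys become the index labels
capitals = pd.Series({"Belgium": "Brussels",
                      "India": "New Delhi",
                      "United Kingdom": "London",
                      "United States": "Washington"})

df = pd.DataFrame(capitals)
print(df)
```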
After you have created your DataFrame, you might want to know a little bit more about it. You can use the shape property or the len() function in combination with the .index property.
These two options give you slightly different information on your DataFrame: the shape property provides the dimensions of your DataFrame, that is, its height and its width (the number of rows, then the number of columns). The len() function, in combination with the index property, only gives you the height of your DataFrame. That is no surprise, as you explicitly pass in the index property.
You could also use df[0].count() to get to know more about the height of your DataFrame, but this will exclude the NaN values (if there are any). That is why calling .count() on your DataFrame is not always the better option.
If you want more information on your DataFrame columns, you can always execute list(my_dataframe.columns.values).
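The following sketch compares these options on a small DataFrame with one missing value. The data is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 DataFrame with one NaN in column 0
df = pd.DataFrame(np.array([[1.0, 2.0, 3.0],
                            [np.nan, 5.0, 6.0]]))

n_rows, n_cols = df.shape               # dimensions: rows first, then columns
height = len(df.index)                  # number of rows only
non_missing = df[0].count()             # rows in column 0 that are not NaN
column_labels = list(df.columns.values)

print(n_rows, n_cols, height, non_missing, column_labels)
```

Here count() reports one row fewer than len(df.index), because the NaN in column 0 is skipped.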
Fundamental DataFrame Operations
Now that you have put your data in a more convenient Pandas DataFrame structure, it’s time to get to the real work!
This first section will guide you through the first steps of working with DataFrames in Python. It will cover the basic operations that you can do on your newly created DataFrame: adding, selecting, deleting, renaming, … You name it!
2. How To Select an Index or Column From a Pandas DataFrame
Before you start with adding, deleting and renaming the components of your DataFrame, you first need to know how you can select these elements. So, how do you do this?
Even though you might still remember how to do it from the previous section: selecting an index, column or value from your DataFrame isn’t that hard, quite the contrary. It’s similar to what you see in other languages (or packages!) that are used for data analysis. If you aren’t convinced, consider the following:
In R, you use the [,] notation to access the data frame’s values.
Now, let’s say you have a DataFrame like this one:
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
And you want to access the value that is at index 0, in column ‘A’. Various options exist to get your value 1 back. The most important ones to remember are, without a doubt, .loc[] and .iloc[]. The subtle differences between these two will be discussed in the next sections.
Enough for now about selecting values from your DataFrame. What about selecting rows and columns? In that case, you would again use .loc[] and .iloc[].
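A sketch of these selections on the DataFrame shown above. The variable names are assumptions, and the single-bracket forms iloc[0, 0] and loc[0, 'A'] are used here as one common way to write the lookup.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})

# Four ways to read the value at index 0, column 'A'
by_position = df.iloc[0, 0]      # row position, column position
by_label = df.loc[0, 'A']        # row label, column label
by_at = df.at[0, 'A']            # label-based scalar access
by_iat = df.iat[0, 0]            # position-based scalar access

# Selecting a whole row or a whole column
first_row = df.iloc[0]           # row at position 0, returned as a Series
column_a = df.loc[:, 'A']        # all rows of column 'A'

print(by_position, by_label, by_at, by_iat)
```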
For now, it’s enough to know that you can access the values either by their label or by their position in the index or columns. If you don’t see this, look again at the slight differences in the commands: one time, you see [0][0], the other time, you see [0,'A'] to retrieve your value 1.
3. How To Add an Index, Row or Column to a Pandas DataFrame
Now that you have learned how to select a value from a DataFrame, it’s time to get to the real work and add an index, row or column to it!
Adding an Index to a DataFrame
When you create a DataFrame, you have the option to add input to the index argument to make sure that you have the index that you desire. When you don’t specify this, your DataFrame will have, by default, a numerically valued index that starts with 0 and continues until the last row of your DataFrame.
However, even when your index is specified for you automatically, you still have the power to re-use one of your columns and make it your index. You can easily do this by calling set_index() on your DataFrame.
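A minimal sketch of set_index(), assuming a small DataFrame with columns A, B and C in which column C is promoted to the index:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})

# Column 'C' becomes the index; set_index() returns a new DataFrame by default
indexed = df.set_index('C')
print(indexed)
```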
Adding Rows to a DataFrame
Before you can get to the solution, it’s a good idea to grasp the concept of loc and how it differs from other indexing attributes such as .iloc[] and .ix[]:
- .loc[] works on labels of your index. This means that if you give in loc[2], you look for the values of your DataFrame that have an index labeled 2.
- .iloc[] works on the positions in your index. This means that if you give in iloc[2], you look for the values of your DataFrame that are at position 2.
- .ix[] is a more complex case: when the index is integer-based, you pass a label to .ix[]. ix[2] then means that you’re looking in your DataFrame for values that have an index labeled 2. This is just like .loc[]! However, if your index is not solely integer-based, ix will work with positions, just like .iloc[]. Note that .ix[] was deprecated in Pandas 0.20 and removed in Pandas 1.0, so it only exists in older installations.
This all might seem very complicated. Let’s illustrate all of this with a small example:
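A minimal sketch of such an example. It assumes a DataFrame with the mixed index [2, 'A', 4] and the column labels 48, 49 and 50, with values chosen to match the results discussed next. .ix[] is left out because it no longer exists in current Pandas versions.

```python
import numpy as np
import pandas as pd

# Assumed example data: the index mixes integers and a string
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  index=[2, 'A', 4],
                  columns=[48, 49, 50])

by_label = df.loc[2]        # the row whose index label is 2 (the first row)
by_position = df.iloc[2]    # the row at position 2 (the third row)

print(by_label)
print(by_position)
```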
Note that this case uses a DataFrame whose index is not solely integer-based, to make it easier to understand the differences. You clearly see that passing 2 to .loc[] or to .iloc[]/.ix[] does not give back the same result!
- You know that .loc[] will go and look at the values that are at label 2. The result that you get back will be:
48    1
49    2
50    3
- .iloc[] will go and look at the positions in the index. When you pass 2, you will get back:
48    7
49    8
50    9
- .ix[] will have the same behavior as iloc and look at the positions in the index. You will get back the same result as .iloc[].
Now that the difference between .iloc[], .loc[] and .ix[] is clear, you are ready to give adding rows to your DataFrame a go!
Tip: as a consequence of what you have just read, you now also understand that the general recommendation is to use .loc to insert rows in your DataFrame. That is because if you used df.ix[], you might try to reference a numerically valued index with the index value and accidentally overwrite an existing row of your DataFrame. You had better avoid this!
Check out the difference once more in the DataFrame below:
You can see why all of this can be confusing, right?
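Adding a row with .loc[], as recommended in the tip, might look like the following sketch. The DataFrame and the row values are assumptions. The point is that a label that does not exist yet is appended, while an existing label is overwritten.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})

# Label 3 is not in the index yet, so this appends a new row
df.loc[3] = [10, 11, 12]

# Label 0 already exists, so this overwrites the existing row
df.loc[0] = [0, 0, 0]

print(df)
```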
Adding a Column to Your DataFrame
In some cases, you want to make your index part of your DataFrame. You can easily do this by taking a column from your DataFrame, or by referring to a column that you haven’t made yet, and assigning the .index property to it. In other words, you tell your DataFrame that this column should hold the values of its index.
However, if you want to append columns to your DataFrame, you could also follow the same approach as when you add a row to your DataFrame: you use .loc[] or .iloc[]. In this case, you add a Series to an existing DataFrame with the help of .loc[].
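Both ways of adding a column can be sketched as follows. The column names D and E and the Series values are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})

# Copy the index into a new column 'D'
df['D'] = df.index

# Add a Series as a new column 'E'; its index lines up with the DataFrame's index
df.loc[:, 'E'] = pd.Series([10, 20, 30], index=df.index)

print(df)
```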
Remember that a Series object is much like a column of a DataFrame. That explains why you can easily add a Series to an existing DataFrame. Note also that the observation that was made earlier about .loc[] still stays valid, even when you’re adding columns to your DataFrame!
Resetting the Index of Your DataFrame
When your index doesn’t look entirely the way you want it to, you can opt to reset it. You can easily do this with .reset_index(). However, you should still watch out, as you can pass several arguments that can make or break the success of your reset.
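The effect of the drop argument can be sketched on a DataFrame with a float index. The concrete index values and data are assumptions.

```python
import pandas as pd

# Hypothetical DataFrame with a float index
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=[2.5, 12.6, 4.8],
                  columns=[48, 49, 50])

dropped = df.reset_index(drop=True)   # the old float index is discarded
kept = df.reset_index()               # the old index is kept as a column named 'index'

print(dropped)
print(kept)
```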
Note how you use the drop argument to indicate that you want to get rid of the index that was there. If you had left drop out and used only inplace, the original index with floats would be added as an extra column to your DataFrame.
4. How to Delete Indices, Rows or Columns From a Pandas DataFrame
Now that you have seen how to select and add indices, rows, and columns to your DataFrame, it’s time to consider another use case: removing these three from your data structure.
Deleting an Index from Your DataFrame
If you want to remove the index from your DataFrame, you should reconsider because DataFrames and Series always have an index.
However, what you *can* do is, for example:
- reset the index of your DataFrame (go back to the previous section to see how it is done),
- remove the index name, if there is any, by executing del df.index.name (if your Pandas version does not support del here, df.index.name = None has the same effect),
- remove duplicate index values by resetting the index, dropping the duplicates of the index column that has been added to your DataFrame and reinstating that duplicate-free column as the index,
- and lastly, remove an index, and with it a row. This is elaborated further on in this tutorial.
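The duplicate-index recipe from the list above can be sketched as follows. The data, the duplicated float labels and the choice of keep='last' are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical DataFrame in which the labels 2.5 and 4.8 each occur twice
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [40, 50, 60], [23, 35, 37]],
                  index=[2.5, 12.6, 4.8, 4.8, 2.5],
                  columns=[48, 49, 50])

deduped = (df.reset_index()                                  # index becomes column 'index'
             .drop_duplicates(subset='index', keep='last')   # keep the last row per label
             .set_index('index'))                            # reinstate it as the index

# Remove the index name that reset_index() introduced
deduped.index.name = None
print(deduped)
```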
Now that you know how to remove an index from your DataFrame, you can go on to removing columns and rows!
Deleting a Column from Your DataFrame
To get rid of (a selection of) columns from your DataFrame, you can use the drop() method.
You might think now: well, this is not so straightforward; there are some extra arguments that are passed to the drop() method!
- The axis argument is 0 when it indicates rows and 1 when it is used to drop columns.
- You can set inplace to True to delete the column without having to reassign the DataFrame.
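A short sketch of both arguments, assuming a DataFrame with columns A, B and C:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})

# axis=1 targets columns; without inplace, a new DataFrame is returned
without_a = df.drop('A', axis=1)

# Drop the column at position 1 ('B') and modify df itself
df.drop(df.columns[[1]], axis=1, inplace=True)

print(without_a)
print(df)
```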
Removing a Row from Your DataFrame
You can remove duplicate rows from your DataFrame by executing df.drop_duplicates(). You can also remove rows from your DataFrame, taking into account only the duplicate values that exist in one column.
If there is no uniqueness criterion to the deletion that you want to perform, you can use the drop() method, where you use the index property to specify the index of the rows that you want to remove from your DataFrame. After this command, you might want to reset the index again.
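Both kinds of row removal can be sketched like this. The data and the choice of column A as the uniqueness criterion are assumptions.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 1, 7], 'B': [2, 5, 9, 8]})

# Rows with a repeated value in column 'A' are dropped; the last occurrence is kept
unique_a = df.drop_duplicates(subset=['A'], keep='last')

# Drop the row at position 1 by looking up its index label
without_second = df.drop(df.index[1])

# Renumber the remaining rows from 0
renumbered = without_second.reset_index(drop=True)

print(unique_a)
print(renumbered)
```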
Tip: try resetting the index of the resulting DataFrame for yourself! Don’t forget to use the drop argument if you deem it necessary.
5. How to Rename the Index or Columns of a Pandas DataFrame
To give the columns or the index values of your DataFrame a different value, it’s best to use the .rename() method.
Tip: try setting the inplace argument to False when you rename your columns and see what the result looks like. You will see that the DataFrame itself has not been changed by renaming the columns; a renamed copy is returned instead. As a result, a second rename on the same DataFrame, for example of an index label, takes the original DataFrame as input and not the one that you just got back from the first rename() operation.
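A sketch of renaming columns and an index label. The new names are assumptions. Because inplace is left at its default of False, each call returns a new DataFrame and df itself stays unchanged.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})

# Hypothetical mapping from old to new column names
newcols = {'A': 'new_column_1', 'B': 'new_column_2', 'C': 'new_column_3'}
renamed = df.rename(columns=newcols)

# Rename the index label 1 to 'a' on the renamed copy
relabeled = renamed.rename(index={1: 'a'})

print(relabeled)
```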
Beyond The Pandas DataFrame Basics
Now that you have gone through a first set of questions about Pandas’ DataFrames, it’s time to go beyond the basics and get your hands dirty for real because there is far more to DataFrames than what you have seen in the first section.
6. How To Format The Data in Your Pandas DataFrame
Most of the time, you will also want to be able to do some operations on the actual values that are contained within your DataFrame. In the following sections, you’ll cover several ways in which you can format your DataFrame’s values.
Replacing All Occurrences of a String in a DataFrame
To replace certain strings in your DataFrame, you can easily use replace(): pass the values that you would like to change, followed by the values you want to replace them by.
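As a sketch, replace() can be called with two lists, or with a pattern and regex=True. The column names and strings below are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical survey data
df = pd.DataFrame({'review': ['OK', 'Awful', 'Perfect'],
                   'note': ['line1\nline2', 'single', 'a\nb']})

# Replace each listed value by the value at the same position in the second list
relabeled = df.replace(['Awful', 'Perfect'], ['Bad', 'Great'])

# With regex=True the key is treated as a pattern and matched inside strings
no_newlines = df.replace({'\n': '<br>'}, regex=True)

print(relabeled)
print(no_newlines)
```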
Note that there is also a regex argument that can help you out tremendously when you’re faced with strange string combinations.
In short, replace() is mostly what you need when you want to replace values or strings in your DataFrame by others!
Removing Parts From Strings in the Cells of Your DataFrame
Removing unwanted parts of strings is cumbersome work. Luckily, there is an easy solution to this problem!
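A sketch of this solution, assuming a column named result whose strings carry a leading sign and a trailing letter. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: a sign on the left and a letter on the right of each number
df = pd.DataFrame({'class': ['test1', 'test2', 'test3'],
                   'result': ['+3b', '-4a', '+5C']})

# lstrip/rstrip remove any of the listed characters from the left/right end
df['result'] = df['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))

print(df)
```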
You use map() on the column result to apply the lambda function to each element of the column. The function takes the string value, strips any + or - that’s located on the left, and also strips away any of the six characters aAbBcC on the right.
Splitting Text in a Column into Multiple Rows in a DataFrame
This is a somewhat more difficult formatting task. However, it breaks down into a few steps. In short, what you do is:
- First, you inspect the DataFrame at hand. You see that the values in the last row and in the last column are a bit too long. It appears there are two tickets because a guest has taken a plus-one to the concert.
- You take the Ticket column from the DataFrame df and split its strings on a space. This will make sure that the two tickets will end up in two separate rows in the end. Next, you take these four values (the four ticket numbers) and put them into a Series object per row:
          0         1
0  23:44:55       NaN
1  66:77:88       NaN
2  43:68:05  56:34:12
- That still isn’t quite right: there are NaN values in there! You have to stack the result to make sure you don’t have any NaN values in the resulting Series:
0  0    23:44:55
1  0    66:77:88
2  0    43:68:05
   1    56:34:12
- Next, you drop the inner index level so that the Series lines up with the rows of the DataFrame:
0    23:44:55
1    66:77:88
2    43:68:05
2    56:34:12
dtype: object
- Finally, you delete the original Ticket column from df and join this Series back in, so that each ticket number sits in its own row of the Ticket column.
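The steps above can be sketched as follows. The Age and PlusOne columns are assumptions, while the ticket numbers are the ones shown in the intermediate results. The dropna() call is added so that the sketch behaves the same whether or not stack() drops missing values by itself.

```python
import pandas as pd

# Hypothetical guest list; the last guest holds two tickets
df = pd.DataFrame({'Age': [34, 22, 19],
                   'PlusOne': [0, 0, 1],
                   'Ticket': ['23:44:55', '66:77:88', '43:68:05 56:34:12']})

# Split on a space into separate columns, then stack them into one Series
tickets = df['Ticket'].str.split(' ', expand=True).stack().dropna()

# Drop the inner index level so the labels match the rows of df again
tickets.index = tickets.index.droplevel(-1)
tickets.name = 'Ticket'

# Replace the original column by joining the one-ticket-per-row Series
result = df.drop(columns='Ticket').join(tickets)
print(result)
```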
Applying A Function to Your Pandas DataFrame’s Columns or Rows
You might want to adjust the data in your DataFrame by applying a function to it. Let’s begin answering this question by making your own lambda function:
doubler = lambda x: x*2
Tip: if you want to know more about functions in Python, consider taking this Python functions tutorial.
You can apply doubler to a column such as A with apply(). Note that you can also select a row of your DataFrame and apply the doubler lambda function to it. Remember that you can easily select a row from your DataFrame by using .loc[] or .iloc[]. Then, you would execute something like this, depending on whether you want to select your index based on its position or based on its label:
df.loc[0].apply(doubler)
Note that the apply() function only applies the doubler function along an axis of your DataFrame. That means that you target either the index or the columns. Or, in other words, either a row or a column.
However, if you want to apply it element-wise, you can make use of the map() function. You can just replace the apply() function in the command above with map(). Don’t forget to still pass the doubler function to it to make sure you multiply the values by 2.
Let’s say you want to apply this doubling function not only to the A column of your DataFrame but to the whole of it. In this case, you can use applymap() to apply the doubler function to every single element in the entire DataFrame. Note that applymap() was deprecated in Pandas 2.1 in favor of DataFrame.map(), which does the same thing.
Note that in these cases, we have been working with lambda functions or anonymous functions that get created at runtime. However, you can also write your own function. For example:
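As a sketch, the following pulls the pieces of this section together: the doubler lambda applied to a column, to a row and to the whole DataFrame, followed by a named function with some flow control. The DataFrame and the function double_if_odd are assumptions made for illustration. The hasattr check only picks whichever element-wise method the installed Pandas version offers.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})
doubler = lambda x: x * 2

col_doubled = df['A'].apply(doubler)     # one column
row_doubled = df.loc[0].apply(doubler)   # one row, selected by label

# DataFrame.map() replaces applymap() from Pandas 2.1 on; use whichever exists
elementwise = df.map if hasattr(df, 'map') else df.applymap
all_doubled = elementwise(doubler)       # every element

def double_if_odd(x):
    """Leave even values alone and double odd ones."""
    if x % 2 == 0:
        return x
    return x * 2

odd_doubled = elementwise(double_if_odd)
print(all_doubled)
print(odd_doubled)
```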
If you want more information on the flow of control in Python, you can always read up on it here.
7. How To Create an Empty DataFrame
The function that you will use is the Pandas DataFrame() function: it takes the data that you want to put in, the indices and the columns.
Remember that the data that is contained within the data frame doesn’t have to be homogeneous. It can be of different data types!
There are several ways in which you can use this function to make an empty DataFrame. Firstly, you can use numpy.nan to initialize your data frame with NaNs. Note that numpy.nan has type float.
Right now, the data type of the data frame is inferred by default: because numpy.nan has type float, the data frame will also contain values of type float. You can, however, also force the DataFrame to be of a particular type by adding the argument dtype and filling in the desired type.
Note that if you don’t specify the axis labels or index, they will be constructed from the input data based on common sense rules.
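Both variants can be sketched as follows, assuming a single column A and four rows:

```python
import numpy as np
import pandas as pd

# Filled with NaN: the dtype is inferred as float because numpy.nan is a float
nan_df = pd.DataFrame(np.nan, index=[0, 1, 2, 3], columns=['A'])

# No data at all, but with the dtype forced explicitly
typed_df = pd.DataFrame(index=range(4), columns=['A'], dtype='float')

print(nan_df.dtypes)
print(typed_df.dtypes)
```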
8. Does Pandas Recognize Dates When Importing Data?
Pandas can recognize dates, but you need to help it a tiny bit: add the argument parse_dates when you’re reading in data from, let’s say, a comma-separated value (CSV) file:
import pandas as pd
pd.read_csv('yourFile', parse_dates=True)
# or this option:
pd.read_csv('yourFile', parse_dates=['columnName'])
There are, however, always weird date-time formats. No worries! In such cases, you can construct your own parser to deal with this. You could, for example, make a lambda function that takes your DateTime string and parses it with a format string. Note that the date_parser argument is deprecated since Pandas 2.0, where the date_format argument accepts the format string directly.
from datetime import datetime
import pandas as pd
# pd.datetime was removed in Pandas 2.0; use the standard library's datetime instead
dateparser = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
# Which makes your read command (infile is the path to your CSV file):
pd.read_csv(infile, parse_dates=['columnName'], date_parser=dateparser)
# Or combine two columns into a single DateTime column
pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparser)
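A runnable sketch that does not depend on a file on disk: the CSV content is an assumption and is read from an in-memory buffer. For the unusual format, the column is converted after reading with pd.to_datetime() and an explicit format string, which is one approach that works across Pandas versions.

```python
import io
import pandas as pd

# Standard format: parse_dates is enough
csv_text = "when,value\n2017-01-05 10:30:00,1\n2017-01-06 11:45:00,2\n"
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['when'])

# Day-first dates: read as text, then convert with an explicit format
odd_text = "when,value\n05/01/2017,1\n06/01/2017,2\n"
odd = pd.read_csv(io.StringIO(odd_text))
odd['when'] = pd.to_datetime(odd['when'], format='%d/%m/%Y')

print(df.dtypes)
print(odd.dtypes)
```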
9. When, Why And How You Should Reshape Your Pandas DataFrame
Reshaping your DataFrame is transforming it so that the resulting structure is more suitable for your data analysis. In other words, reshaping is not so much concerned with formatting the values that are contained within the DataFrame, but more with transforming its shape.
This answers the when and why. But how would you reshape your DataFrame?
There are three ways of reshaping that frequently raise questions with users: pivoting, stacking and unstacking, and melting.
Pivoting Your DataFrame
You can use the pivot() function to create a new derived table out of your original one. When you use the function, you can pass three arguments:
- values: this argument allows you to specify which values of your original DataFrame you want to see in your pivot table.
- columns: whatever you pass to this argument will become a column in your resulting table.
- index: whatever you pass to this argument will become an index in your resulting table.
When you don’t specifically fill in what values you expect to be present in your resulting table, you will pivot by multiple columns.
Note that your data cannot have rows with duplicate values for the columns that you specify. If it does, you will get an error message. If you can’t ensure the uniqueness of your data, you will want to use the pivot_table method instead.
Note the additional argument aggfunc that can be passed to the pivot_table method. This argument names the aggregation function that is used to combine multiple values, for example the mean function.
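Both methods can be sketched on a small table of products. The categories, stores and prices are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical products: each (category, store) pair occurs once
products = pd.DataFrame({
    'category': ['Cleaning', 'Cleaning', 'Tech', 'Tech'],
    'store': ['Walmart', 'Dia', 'Walmart', 'Dia'],
    'price': [10.0, 20.0, 100.0, 50.0],
})

# pivot(): one row per category, one column per store
pivoted = products.pivot(index='category', columns='store', values='price')

# Add a second (Cleaning, Walmart) row, so the pairs are no longer unique
extra = pd.DataFrame({'category': ['Cleaning'], 'store': ['Walmart'], 'price': [14.0]})
with_duplicates = pd.concat([products, extra], ignore_index=True)

# pivot_table() combines the duplicates with the aggregation function
averaged = with_duplicates.pivot_table(index='category', columns='store',
                                       values='price', aggfunc='mean')
print(pivoted)
print(averaged)
```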
Using stack() and unstack() to Reshape Your Pandas DataFrame
You have already seen an example of stacking in the answer to question 6! In essence, you might still remember that, when you stack a DataFrame, you make it taller. You move the innermost column index to become the innermost row index. The result has an index with a new innermost level of row labels.
Go back to the full walk-through of the answer to question 6 if you’re unsure of the workings of stack().
The inverse of stacking is called unstacking. Much like stack(), you use unstack() to move the innermost row index to become the innermost column index.
For a good explanation of pivoting, stacking and unstacking, go to this page.
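A minimal round trip through stack() and unstack(), assuming a small DataFrame with row labels x and y:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])

# stack(): column labels become the inner level of the row index (taller)
stacked = df.stack()

# unstack(): the inner row level moves back to the columns (wider)
unstacked = stacked.unstack()

print(stacked)
print(unstacked)
```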
Reshape Your DataFrame With melt()
Melting is considered useful in cases where you have data that has one or more columns that are identifier variables, while all other columns are considered measured variables.
These measured variables are all “unpivoted” to the row axis. That is, while the measured variables were spread out over the width of the DataFrame, the melt makes sure that they are placed along its height. Or, in yet other words, your DataFrame will now become longer instead of wider.
As a result, you have two non-identifier columns, namely, ‘variable’ and ‘value’.
Let’s illustrate this with an example:
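A sketch with a hypothetical people table, in which FirstName and LastName are the identifier variables and the remaining columns are the measured variables:

```python
import pandas as pd

# Hypothetical data
people = pd.DataFrame({'FirstName': ['John', 'Jane'],
                       'LastName': ['Doe', 'Austen'],
                       'BloodType': ['A-', 'B+'],
                       'Weight': [90, 64]})

# Every measured column is unpivoted into (variable, value) rows
melted = pd.melt(people, id_vars=['FirstName', 'LastName'])
print(melted)
```

The two-row, four-column table becomes a four-row table: one row per person and measured variable.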
If you’re looking for more ways to reshape your data, check out the documentation.
10. How To Iterate Over a Pandas DataFrame
You can iterate over the rows of your DataFrame with the help of a for loop in combination with an iterrows() call on your DataFrame.
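A minimal sketch of such a loop, assuming a DataFrame with columns A, B and C. Each iteration yields the index label and the row as a Series.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})

pairs = []
for index, row in df.iterrows():
    # row is a Series, so its values are read by column label
    pairs.append((index, row['A'], row['B']))

print(pairs)
```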
This method allows you to conveniently loop over your DataFrame rows as (index, Series) pairs. In other words, it gives you (index, row) tuples as a result.
11. How To Write a Pandas DataFrame to a File
When you have done your data munging and manipulation with Pandas, you might want to export the DataFrame to another format. This section will cover two ways of outputting your DataFrame: to a CSV or to an Excel file.
Output a DataFrame to CSV
To write a DataFrame as a CSV file, you can use to_csv():
import pandas as pd
df.to_csv('myDataFrame.csv')
That piece of code seems quite simple, but this is just where the difficulties begin for most people because you will have specific requirements for the output of your data. Maybe you don’t want a comma as a delimiter, or you want to specify a specific encoding, …
Don’t worry! You can pass some additional arguments to to_csv() to make sure that your data is outputted the way you want it to be!
- To delimit by a tab, use the sep argument:
import pandas as pd
df.to_csv('myDataFrame.csv', sep='\t')
- To use a specific character encoding, use the encoding argument:
import pandas as pd
df.to_csv('myDataFrame.csv', sep='\t', encoding='utf-8')
- Furthermore, you can specify how you want NaN or missing values to be represented, whether or not you want to output the header, whether or not you want to write out the row names, whether you want compression, … Read up on the options here.
Writing a DataFrame to Excel
Similarly to what you did to output your DataFrame to CSV, you can use to_excel() to write your table to Excel. However, it is a bit more complicated:
import pandas as pd
writer = pd.ExcelWriter('myDataFrame.xlsx')
df.to_excel(writer, sheet_name='DataFrame')
# writer.save() was removed in Pandas 2.0; close() writes the file to disk
writer.close()
Note, however, that, just like with to_csv(), you have a lot of extra arguments such as startcol, startrow, and so on, to make sure you output your data correctly. Go to this page to read up on them.
If, however, you want more information on IO tools in Pandas, you can check out this page.