Data Handling with Pandas
Disclaimer: This post has been translated to English using a machine translation model. Please, let me know if you find any mistakes.
1. Summary
Let's take a short introduction to Pandas, the data manipulation and analysis library. With it we will be able to handle and process tabular data, operate on it, and extract valuable information from it.
2. What is Pandas?
Pandas is a Python library designed to make working with relational or labeled data easy and intuitive. Pandas is well suited to many different types of data:
* Tabular data with columns of heterogeneous types, such as in an SQL table or an Excel spreadsheet
* Ordered and unordered time series data (not necessarily of fixed frequency)
* Arbitrary matrix data (homogeneous or heterogeneous) with row and column labels
* Any other form of observational/statistical datasets. The data does not need to be labeled at all to be placed in a pandas data structure
The two main data structures in Pandas are Series (one-dimensional) and DataFrame (two-dimensional). Pandas is built on top of NumPy and is designed to integrate well within a scientific computing environment with many other third-party libraries.
For data scientists, working with data generally involves several stages: collecting and cleaning data, analyzing/modeling it, and then organizing the analysis results into a suitable form for plotting or displaying in tabular form. pandas is the ideal tool for all these tasks.
Another feature is that pandas is fast: many of its low-level algorithms are implemented in C.
2.1. Pandas as pd
When importing pandas, it is usually given the alias pd.
import pandas as pd

print(pd.__version__)
1.0.1
3. Pandas Data Structures
In Pandas there are two types of data structures: Series and DataFrames.
3.1. Series
The Series data type is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). Its axis labels are collectively referred to as the index.
The most common way to create a Series is

```
series = pd.Series(data, index=index)
```

Where `data` can be:
* A dictionary
* A list or tuple
* A NumPy ndarray
* A scalar value
As one of the data types can be a NumPy ndarray, we import NumPy in order to use it
import numpy as np
3.1.1. Series from a Dictionary
diccionario = {"b": 1, "a": 0, "c": 2}
serie = pd.Series(diccionario)
serie
b    1
a    0
c    2
dtype: int64
If an index is passed, the values of the data corresponding to the index labels will be extracted. If a label does not exist in the data, its value is set to NaN (not a number).
diccionario = {"b": 1, "a": 0, "c": 2}
serie = pd.Series(diccionario, index=["b", "c", "d", "a"])
serie
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
3.1.2. Series from a list or tuple
If the data comes from a list or tuple and no index is passed, one will be created with values [0, ..., len(data)-1]
serie = pd.Series([1, 2, 3, 4])
serie
0    1
1    2
2    3
3    4
dtype: int64
If an index is passed, it must have the same length as the data
serie = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
serie
a    1
b    2
c    3
d    4
dtype: int64
3.1.3. Series from an ndarray
If the data comes from an ndarray and no index is passed, one will be created with values [0, ..., len(data)-1]
serie = pd.Series(np.random.randn(5))
serie
0    1.267865
1   -0.877857
2   -0.138556
3   -0.132987
4   -0.827295
dtype: float64
If an index is passed, it must have the same length as the data
serie = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
serie
a   -1.091828
b   -0.584243
c    0.220398
d    1.248923
e    1.652351
dtype: float64
3.1.4. Series from a Scalar
If the series is created from a scalar, it will be created with a single item
serie = pd.Series(5.0)
serie
0    5.0
dtype: float64
If you want to create more items in the series, you need to pass the index with the number of items you want, so that all items will have the value of the scalar
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
serie
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64
3.1.5. Operations with Series
Just like with Numpy, we can perform operations on all the elements of a series without having to iterate over each one of them.
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(f"serie:\n{serie}")
print(f"\nserie + serie =\n{serie + serie}")
serie:
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

serie + serie =
a    10.0
b    10.0
c    10.0
d    10.0
e    10.0
dtype: float64
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(f"serie:\n{serie}")
print(f"\nexp(serie) =\n{np.exp(serie)}")
serie:
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

exp(serie) =
a    148.413159
b    148.413159
c    148.413159
d    148.413159
e    148.413159
dtype: float64
One difference between Series
and ndarrays is that operations between Series
automatically align the data according to their labels. Therefore, calculations can be written without considering whether the Series
involved have the same labels. If a label is not found in one Series
or another, the result will be marked as missing (NaN).
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(f"serie:\n{serie}")
print(f"\nserie[1:] + serie[:-1] =\n{serie[1:] + serie[:-1]}")
serie:
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

serie[1:] + serie[:-1] =
a     NaN
b    10.0
c    10.0
d    10.0
e     NaN
dtype: float64
3.1.6. Series Name Attribute
One of the attributes of Series
is name
, which corresponds to the name they will have when added to a DataFrame. Conversely, when a series is obtained from a DataFrame, this series will have the name it had in the DataFrame.
serie = pd.Series(np.random.randn(5), name="aleatorio")
serie
0   -0.191009
1   -0.793151
2   -0.907747
3   -1.440508
4   -0.676419
Name: aleatorio, dtype: float64
You can change the name of a series using the rename()
method
serie = serie.rename("random")
serie
0   -0.191009
1   -0.793151
2   -0.907747
3   -1.440508
4   -0.676419
Name: random, dtype: float64
3.2. DataFrames
A DataFrame
is a labeled, two-dimensional data structure with columns of potentially different types, meaning that one column can have integer data, another column string data, etc. You can think of it as a spreadsheet or an SQL table, or a dictionary of Series
objects.
It is the most commonly used pandas object. Like Series, DataFrames accept many different types of input, among them:
* A dictionary of Series
* A dictionary of ndarrays or lists
* A two-dimensional ndarray (matrix)
* A list of dictionaries
* A dictionary of tuples
* A Series
Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and/or columns, you guarantee the index and/or columns of the resulting DataFrame. Therefore, a dictionary of Series plus a specific index will discard all data that does not match the passed index.
If the axis labels are not provided, they will be constructed from the input data based on common sense rules.
3.2.1. DataFrames from a dictionary of Series
If a dictionary with Series
is passed, the DataFrame
will be created with as many columns as there are Series
in the dictionary.
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe
If each of the Series
has defined indexes, the resulting DataFrame
will be the union of these indexes
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0], index=["a", "b", "c", "d"])
}
dataframe = pd.DataFrame(diccionario)
dataframe
dataframe = pd.DataFrame(diccionario, index=["d", "b", "a"])
dataframe
If the columns are passed, they will appear in the given order; any column not present in the data is created filled with NaN
dataframe = pd.DataFrame(diccionario, columns=["dos", "tres"])
dataframe
3.2.2. DataFrames from a dictionary of ndarrays or lists
All ndarrays or lists must have the same length. If an index is passed, it must also have the same length as the ndarrays or lists.
diccionario = {
"uno": [1.0, 2.0, 3.0, 4.0],
"dos": [4.0, 3.0, 2.0, 1.0]
}
dataframe = pd.DataFrame(diccionario)
dataframe
3.2.3. DataFrames from a matrix
If an index is passed, it must have the same length as the number of rows in the matrix, and if columns are passed, they must have the same length as the columns in the matrix.
matriz = np.array([[1, 3], [2, 2], [3, 1]])
dataframe = pd.DataFrame(matriz, index=["a", "b", "c"], columns=["columna1", "columna2"])
dataframe
3.2.4. DataFrames from a list of dictionaries
lista = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
dataframe = pd.DataFrame(lista)
dataframe
3.2.5. DataFrames from a dictionary of tuples
diccionario = {
("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
}
dataframe = pd.DataFrame(diccionario)
dataframe
3.2.6. DataFrames from a Series
The result will be a DataFrame
with the same index as the input Series, and with a column whose name is the original name of the Series (only if no other column name is provided).
diccionario = {"b": 1, "a": 0, "c": 2}
serie = pd.Series(diccionario)
dataframe = pd.DataFrame(serie)
dataframe
4. Exploring a DataFrame
When a DataFrame
is very large it cannot be fully displayed
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train
So it is very useful to have methods for exploring it and obtaining information quickly.
4.1. Head of the DataFrame
To view the first few rows and get an idea of what the DataFrame looks like, there is the head() method, which by default shows the first 5 rows of the DataFrame. If you want to see a different number of rows, specify it with the n parameter.
california_housing_train.head(n=10)
4.2. DataFrame Tail
If you want to see the last rows, you can use the tail() method, using the n parameter to choose how many rows to display.
california_housing_train.tail()
4.3. DataFrame Information
Another very useful method is info()
which gives us information about the DataFrame
california_housing_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           17000 non-null  float64
 1   latitude            17000 non-null  float64
 2   housing_median_age  17000 non-null  float64
 3   total_rooms         17000 non-null  float64
 4   total_bedrooms      17000 non-null  float64
 5   population          17000 non-null  float64
 6   households          17000 non-null  float64
 7   median_income       17000 non-null  float64
 8   median_house_value  17000 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB
4.4. DataFrame Rows and Columns
The indices and columns of a DataFrame can be obtained through its index and columns attributes.
diccionario = {
    "uno": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "dos": pd.Series([4.0, 5.0, 6.0, 7.0], index=["a", "b", "c", "d"])
}
dataframe = pd.DataFrame(diccionario)

indices = dataframe.index
columnas = dataframe.columns

print(f"El DataFrame tiene los índices {indices}\n")
print(f"El DataFrame tiene las columnas {columnas}")
El DataFrame tiene los índices Index(['a', 'b', 'c', 'd'], dtype='object')

El DataFrame tiene las columnas Index(['uno', 'dos'], dtype='object')
4.5. DataFrame Description
The describe()
method shows a quick statistical summary of the DataFrame
data.
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.describe()
4.6. Sorting the DataFrame
Rows of a DataFrame
can be sorted alphabetically using the sort_index()
method.
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.sort_index().head()
As in this case the rows were already sorted, we set ascending=False
so that the order is reversed
california_housing_train.sort_index(ascending=False).head()
If what you want is to sort the columns, you need to pass axis=1, since by default it is 0
california_housing_train.sort_index(axis=1).head()
If what we want is to sort the DataFrame
by a specific column, we need to use the sort_values()
method and specify the label of the column we want to sort by.
california_housing_train.sort_values('median_house_value')
4.7. DataFrame Statistics
You can obtain statistics from the DataFrame
, such as the mean, mode, standard deviation
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")

print(f"media:\n{california_housing_train.mean()}")
print(f"\ndesviación estandar:\n{california_housing_train.std()}")
media:
longitude               -119.562108
latitude                  35.625225
housing_median_age        28.589353
total_rooms             2643.664412
total_bedrooms           539.410824
population              1429.573941
households               501.221941
median_income              3.883578
median_house_value    207300.912353
dtype: float64

desviación estandar:
longitude                  2.005166
latitude                   2.137340
housing_median_age        12.586937
total_rooms             2179.947071
total_bedrooms           421.499452
population              1147.852959
households               384.520841
median_income              1.908157
median_house_value    115983.764387
dtype: float64
If you want to obtain statistics on the rows instead of the columns, you need to indicate it with axis=1
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")

print(f"media:\n{california_housing_train.mean(axis=1)}")
print(f"\ndesviación estandar:\n{california_housing_train.std(axis=1)}")
media:
0         8357.597067
1        10131.527778
2         9664.642322
3         8435.029078
4         7567.436111
             ...
16995    12806.408567
16996     9276.770878
16997    12049.507922
16998    10082.053300
16999    10863.022744
Length: 17000, dtype: float64

desviación estandar:
0        22026.612445
1        26352.939272
2        28514.316588
3        24366.754747
4        21730.014569
             ...
16995    36979.676899
16996    26158.006771
16997    34342.876792
16998    28408.152329
16999    31407.119788
Length: 17000, dtype: float64
Another useful thing that can be obtained from DataFrames
is, for example, the number of times each item in a column repeats.
california_housing_train["total_rooms"].value_counts()
1582.0     16
1527.0     15
1717.0     14
1471.0     14
1703.0     14
           ..
157.0       1
2760.0      1
458.0       1
10239.0     1
4068.0      1
Name: total_rooms, Length: 5533, dtype: int64
For example, we can see that there are a total of 16 houses with 1582 rooms.
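If proportions are more informative than raw counts, value_counts() also accepts a normalize parameter; a small sketch on the same column:

```python
# Relative frequency of each distinct total_rooms value instead of raw counts
california_housing_train["total_rooms"].value_counts(normalize=True).head()
```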
4.8. Used Memory
We can see the memory used by the dataframe
california_housing_train.memory_usage(deep=True)
Index                    128
longitude             136000
latitude              136000
housing_median_age    136000
total_rooms           136000
total_bedrooms        136000
population            136000
households            136000
median_income         136000
median_house_value    136000
dtype: int64
5. Adding Data
5.1. Adding Columns
Columns can be easily added as operations from other columns
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe["tres"] = dataframe["uno"] + dataframe["dos"]
dataframe["flag"] = dataframe["tres"] > 7.0
dataframe
Columns can also be added indicating what value all their items will have.
dataframe["constante"] = 8.0
dataframe
If a Series
is added that does not have the same number of indices as the DataFrame
, it will be adjusted to the number of indices of the DataFrame
dataframe["Menos indices"] = dataframe["uno"][:2]
dataframe
With the previous methods, the column was added at the end, but if you want to add the column at a specific position, you can use the insert()
method.
For example, if you want to add a column at position 3 (counting from position 0), named "columna insertada", with a value double that of the column "tres", it would be done as follows
dataframe.insert(loc=3, column="columna insertada", value=dataframe["tres"]*2)
dataframe
If you want to add more than one column per command, you can use the assign()
method.
dataframe = dataframe.assign(
columna_asignada1 = dataframe["uno"] * dataframe["tres"],
columna_asignada2 = dataframe["dos"] * dataframe["tres"],
)
dataframe
5.2. Adding Rows
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe.head()
We can add a row at the end with the concat
method (which we will see in more detail later)
diccionario = {
"uno": [10.0],
"dos": [20.0]
}
dataframe = pd.concat([dataframe, pd.DataFrame(diccionario)])
dataframe
We see that the row has been added at the end, but it has index zero, so we reorder the indices using the reset_index(drop=True) method.
dataframe = dataframe.reset_index(drop=True)
dataframe
6. Data Deletion
6.1. Removal of Columns
A specific column can be removed using the pop()
method
dataframe.pop("constante")
dataframe
Or using del
del dataframe["flag"]
dataframe
6.2. Row Removal
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]),
"dos": pd.Series([11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]),
"tres": pd.Series([21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe.head()
If we want to delete a row, we can use the drop
method, specifying its position. For example, if we want to delete the row at position 1
dataframe = dataframe.drop(1)
dataframe
If we want to delete the last row
dataframe = dataframe.drop(len(dataframe)-1)
dataframe
If what we want is to delete a range of rows
dataframe = dataframe.drop(range(2, 5))
dataframe
If what we want is to delete a specific set of rows
dataframe = dataframe.drop([5, 7, 9])
dataframe
Just like when we add rows, we see that some indices have been removed, so we reorder the indices using the reset_index(drop=True)
method.
dataframe = dataframe.reset_index(drop=True)
dataframe
7. Operations on DataFrames
You can perform operations on DataFrame
s just as you could with Numpy
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]),
"dos": pd.Series([11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]),
"tres": pd.Series([21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe.head()
dataframe[ ["uno", "dos", "tres"] ] * 2
np.exp(dataframe[ ["uno", "dos", "tres"] ])
If you want to perform more complex operations, you can use the apply()
method.
dataframe = dataframe.apply(lambda x: x.max() - x.min())
dataframe
uno     9.0
dos     9.0
tres    9.0
dtype: float64
A lambda
function has been used because it is a simple function, but in case we want to apply more complex functions, we can define and apply them.
def funcion(x):
    if x < 10:
        return np.exp(x) - np.log(5*x) + np.sqrt(x)
    elif x < 20:
        return np.sin(x) + np.cos(x) + np.tan(x)
    else:
        return np.log(x) + np.log10(x) + np.log2(x)

dataframe = dataframe.apply(funcion)
dataframe
uno     8102.277265
dos     8102.277265
tres    8102.277265
dtype: float64
Using the apply
method in a dataframe is much faster than using a for
loop for each of the rows and performing the operation
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.head()
Let's calculate the percentage of bedrooms out of the total number of rooms
california_housing_train["percent_bedrooms"] = None
%time california_housing_train["percent_bedrooms"] = california_housing_train.apply(lambda x: x["total_bedrooms"] / x["total_rooms"], axis=1)
california_housing_train.head()
california_housing_train["percent_bedrooms"] = None
%time for i in range(len(california_housing_train)): california_housing_train["percent_bedrooms"][i] = california_housing_train["total_bedrooms"][i] / california_housing_train["total_rooms"][i]
california_housing_train.head()
With the lambda
function it took about 300 ms, whereas with the for
loop it took over 1 second
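In fact, for simple element-wise arithmetic like this, operating on whole columns at once is usually even faster than apply(), since pandas vectorizes the computation; a minimal sketch with the same columns:

```python
# Vectorized column arithmetic: no apply() and no explicit loop
california_housing_train["percent_bedrooms"] = (
    california_housing_train["total_bedrooms"] / california_housing_train["total_rooms"]
)
california_housing_train.head()
```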
8. Transpose
The transpose of a DataFrame can be obtained with the T attribute
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0], index=["fila a", "fila b", "fila c"]),
"dos": pd.Series([4.0, 5.0, 6.0], index=["fila a", "fila b", "fila c"])
}
dataframe = pd.DataFrame(diccionario)
dataframe["tres"] = dataframe["uno"] + dataframe["dos"]
dataframe["flag"] = dataframe["tres"] > 7.0
dataframe.T
9. Conversion to Numpy
If you want to convert a Series
or DataFrame
to NumPy you can use the to_numpy()
method or use the np.asarray()
function.
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0], index=["fila a", "fila b", "fila c"]),
"dos": pd.Series([4.0, 5.0, 6.0], index=["fila a", "fila b", "fila c"])
}
dataframe = pd.DataFrame(diccionario)
dataframe["tres"] = dataframe["uno"] + dataframe["dos"]
dataframe["flag"] = dataframe["tres"] > 7.0
dataframe
matriz_np = dataframe.to_numpy()
matriz_np
array([[1.0, 4.0, 5.0, False],
       [2.0, 5.0, 7.0, False],
       [3.0, 6.0, 9.0, True]], dtype=object)
matriz_np = np.asarray(dataframe)
matriz_np
array([[1.0, 4.0, 5.0, False],
       [2.0, 5.0, 7.0, False],
       [3.0, 6.0, 9.0, True]], dtype=object)
This example is not the most suitable, as it mixes numbers with booleans, and as explained in the previous post Matrix Calculation with NumPy, all elements of an ndarray must be of the same type. Since numbers and booleans are mixed here, NumPy converts everything to the generic object type.
To solve this, we keep only the numbers and convert them to an ndarray
matriz_np = dataframe[ ["uno", "dos", "tres"] ].to_numpy()
matriz_np, matriz_np.dtype
(array([[1., 4., 5.],
        [2., 5., 7.],
        [3., 6., 9.]]),
 dtype('float64'))
Now it can be seen that an ndarray has been created in which all the data is of type float.
10. Reading Data from External Sources
One of the greatest strengths of Pandas is its ability to read data from files, so it is not necessary to create a DataFrame
with the data to be processed, but rather it can be read from a file
In the same way that DataFrame
s can be created from external files, DataFrame
s can also be saved to files, allowing you to create your own dataset, configure it as you wish, and save it to a file for later use.
The following table shows the functions for reading and writing files in different formats
|Format|File type|Read function|Write function|
|---|---|---|---|
|text|CSV|read_csv|to_csv|
|text|Fixed-Width Text File|read_fwf||
|text|JSON|read_json|to_json|
|text|HTML|read_html|to_html|
|text|Local clipboard|read_clipboard|to_clipboard|
|binary|MS Excel|read_excel|to_excel|
|binary|OpenDocument|read_excel||
|binary|HDF5 Format|read_hdf|to_hdf|
|binary|Feather Format|read_feather|to_feather|
|binary|Parquet Format|read_parquet|to_parquet|
|binary|ORC Format|read_orc||
|binary|Msgpack|read_msgpack|to_msgpack|
|binary|Stata|read_stata|to_stata|
|binary|SAS|read_sas||
|binary|SPSS|read_spss||
|binary|Python Pickle Format|read_pickle|to_pickle|
|SQL|SQL|read_sql|to_sql|
|SQL|Google BigQuery|read_gbq|to_gbq|
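As a minimal sketch of this round trip (the file name housing_copy.csv is just an example):

```python
# Save the DataFrame to a CSV file; index=False avoids writing the index as a column
california_housing_train.to_csv("housing_copy.csv", index=False)

# Read it back into a new DataFrame
housing = pd.read_csv("housing_copy.csv")
housing.head()
```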
11. Indexing in DataFrames
There are many ways to index DataFrames:
fechas = pd.date_range('1/1/2000', periods=8)
dataframe = pd.DataFrame(np.random.randn(8, 4), index=fechas, columns=['A', 'B', 'C', 'D'])
dataframe
11.1. Column Indexing
To select columns within a DataFrame
we can do it by selecting the column between brackets []
, or by indicating the column as if it were a method of the DataFrame
.
dataframe['A']
2000-01-01    0.813153
2000-01-02   -0.244584
2000-01-03    0.125729
2000-01-04    0.352275
2000-01-05   -2.050976
2000-01-06   -0.312296
2000-01-07    0.897837
2000-01-08    0.271403
Freq: D, Name: A, dtype: float64
dataframe.A
2000-01-01    0.813153
2000-01-02   -0.244584
2000-01-03    0.125729
2000-01-04    0.352275
2000-01-05   -2.050976
2000-01-06   -0.312296
2000-01-07    0.897837
2000-01-08    0.271403
Freq: D, Name: A, dtype: float64
If specific columns are desired, they are passed as a list
dataframe[ ['A', 'B'] ]
11.2. Row Indexing by Positions
You can select a range of rows from a DataFrame
as follows
dataframe[0:3]
If you only want to select a single row, you need to specify a range of rows that includes only that one, for example, if you want to select row number 1.
dataframe[1:2]
Another method to select a row by its position is the iloc[]
method
dataframe.iloc[0:3]
If certain rows are desired, a list with their positions is passed.
dataframe.iloc[ [0, 2, 4] ]
11.3. Row Indexing by Labels
To select a row by its labels, we can use the loc[]
method
dataframe.loc['2000-01-01']
A    0.813153
B   -0.869356
C    0.934293
D    0.338644
Name: 2000-01-01 00:00:00, dtype: float64
If you want to select a range of rows, you can index them using the colon :
dataframe.loc['2000-01-01':'2000-01-03']
If specific rows are desired, they are passed through a list
dataframe.loc[ ['2000-01-01', '2000-01-03', '2000-01-05'] ]
11.4. Selecting a Portion of the DataFrame by Positions
dataframe.iloc[0:3, 0:2]
If specific rows and columns are desired, lists with the desired positions are provided.
dataframe.iloc[ [0, 2, 4], [0, 2] ]
11.5. Selecting a portion of the DataFrame using labels
dataframe.loc['2000-01-01':'2000-01-03', 'A':'B']
If specific rows and columns are desired, lists with the desired labels are passed
dataframe.loc[ ['2000-01-01', '2000-01-03', '2000-01-05'], ['A', 'C'] ]
11.6. Indexing by Lambda Function
You can select data from a DataFrame
that meets a condition given by a lambda
function.
dataframe.loc[lambda dataframe: 2*dataframe['A'] + 5*np.exp(dataframe['B']) > 0.2]
As you can see, this form of indexing is very powerful
11.7. Conditional Indexing
If we don't need complex functions to index, but only conditionals, we can do
dataframe[dataframe['A']>0.2]
We can make multiple conditions
dataframe[(dataframe['A']>0.2) & (dataframe['B']>0.2)]
11.8. Random indexing
Using the sample()
method we will obtain a random row from the DataFrame
dataframe.sample()
If we want more than one sample, we indicate it with the attribute n
dataframe.sample(n=3)
If what you want are random columns, you need to indicate it with axis=1
dataframe.sample(axis=1)
If you want a single item from the DataFrame
, you need to call the sample()
method twice.
dataframe.sample(axis=1).sample()
12. Merging DataFrames
12.1. Concatenation of DataFrames
To concatenate multiple DataFrame
s we use the concat()
method, where it will be passed a list with the DataFrame
s that you want to join.
dataframe1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

dataframe2 = pd.DataFrame({
    "A": ["A4", "A5", "A6", "A7"],
    "B": ["B4", "B5", "B6", "B7"],
    "C": ["C4", "C5", "C6", "C7"],
    "D": ["D4", "D5", "D6", "D7"],
})

dataframe3 = pd.DataFrame({
    "A": ["A8", "A9", "A10", "A11"],
    "B": ["B8", "B9", "B10", "B11"],
    "C": ["C8", "C9", "C10", "C11"],
    "D": ["D8", "D9", "D10", "D11"],
})

dataframe = pd.concat([dataframe1, dataframe2, dataframe3])

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"dataframe3:\n{dataframe3}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
dataframe2:
    A   B   C   D
0  A4  B4  C4  D4
1  A5  B5  C5  D5
2  A6  B6  C6  D6
3  A7  B7  C7  D7
dataframe3:
     A    B    C    D
0   A8   B8   C8   D8
1   A9   B9   C9   D9
2  A10  B10  C10  D10
3  A11  B11  C11  D11

dataframe:
     A    B    C    D
0   A0   B0   C0   D0
1   A1   B1   C1   D1
2   A2   B2   C2   D2
3   A3   B3   C3   D3
0   A4   B4   C4   D4
1   A5   B5   C5   D5
2   A6   B6   C6   D6
3   A7   B7   C7   D7
0   A8   B8   C8   D8
1   A9   B9   C9   D9
2  A10  B10  C10  D10
3  A11  B11  C11  D11
As you can see, the indexes 0
, 1
, 2
, and 3
are repeated because each dataframe has those indexes. To prevent this from happening, you need to use the ignore_index=True
parameter.
dataframe = pd.concat([dataframe1, dataframe2, dataframe3], ignore_index=True)

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"dataframe3:\n{dataframe3}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
dataframe2:
    A   B   C   D
0  A4  B4  C4  D4
1  A5  B5  C5  D5
2  A6  B6  C6  D6
3  A7  B7  C7  D7
dataframe3:
     A    B    C    D
0   A8   B8   C8   D8
1   A9   B9   C9   D9
2  A10  B10  C10  D10
3  A11  B11  C11  D11

dataframe:
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11
If you wanted to concatenate along the columns, you would have to pass axis=1
dataframe = pd.concat([dataframe1, dataframe2, dataframe3], axis=1)

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"dataframe3:\n{dataframe3}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
dataframe2:
    A   B   C   D
0  A4  B4  C4  D4
1  A5  B5  C5  D5
2  A6  B6  C6  D6
3  A7  B7  C7  D7
dataframe3:
     A    B    C    D
0   A8   B8   C8   D8
1   A9   B9   C9   D9
2  A10  B10  C10  D10
3  A11  B11  C11  D11

dataframe:
    A   B   C   D   A   B   C   D    A    B    C    D
0  A0  B0  C0  D0  A4  B4  C4  D4   A8   B8   C8   D8
1  A1  B1  C1  D1  A5  B5  C5  D5   A9   B9   C9   D9
2  A2  B2  C2  D2  A6  B6  C6  D6  A10  B10  C10  D10
3  A3  B3  C3  D3  A7  B7  C7  D7  A11  B11  C11  D11
12.1.1. Intersection of Concatenation
There are two ways to perform concatenation: by taking all the indices of the DataFrame
s or by taking only those that match. This is determined by the join
variable, which accepts the values 'outer'
(default) (takes all indices) or 'inner'
(only those that match).
Let's see an example of 'outer'
dataframe1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
}, index=[0, 1, 2, 3])

dataframe4 = pd.DataFrame({
    "B": ["B2", "B3", "B6", "B7"],
    "D": ["D2", "D3", "D6", "D7"],
    "F": ["F2", "F3", "F6", "F7"],
}, index=[2, 3, 6, 7])

dataframe = pd.concat([dataframe1, dataframe4], axis=1)

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe4}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
dataframe2:
    B   D   F
2  B2  D2  F2
3  B3  D3  F3
6  B6  D6  F6
7  B7  D7  F7

dataframe:
     A    B    C    D    B    D    F
0   A0   B0   C0   D0  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN
2   A2   B2   C2   D2   B2   D2   F2
3   A3   B3   C3   D3   B3   D3   F3
6  NaN  NaN  NaN  NaN   B6   D6   F6
7  NaN  NaN  NaN  NaN   B7   D7   F7
Let's see an example of 'inner'
dataframe = pd.concat([dataframe1, dataframe4], axis=1, join="inner")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe4}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
dataframe2:
    B   D   F
2  B2  D2  F2
3  B3  D3  F3
6  B6  D6  F6
7  B7  D7  F7

dataframe:
    A   B   C   D   B   D   F
2  A2  B2  C2  D2  B2  D2  F2
3  A3  B3  C3  D3  B3  D3  F3
12.2. Merge
of DataFrames
Previously we created a new dataframe by merging several dataframes, now we can complete a dataframe with another one, for this we use merge
, passing the on
parameter, indicating on which column we want the merge
to be done.
dataframe1 = pd.DataFrame({
    "Key": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
})

dataframe2 = pd.DataFrame({
    "Key": ["K0", "K1", "K2", "K3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

dataframe = dataframe1.merge(dataframe2, on="Key")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
  Key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3
dataframe2:
  Key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K2  C2  D2
3  K3  C3  D3

dataframe:
  Key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3
In this case, the two dataframes had a key that was called the same (Key
), but in the case of having dataframes where their key is called differently, we can use the parameters left_on
and right_on
dataframe1 = pd.DataFrame({
    "Key1": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
})

dataframe2 = pd.DataFrame({
    "Key2": ["K0", "K1", "K2", "K3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

dataframe = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
  Key1   A   B
0   K0  A0  B0
1   K1  A1  B1
2   K2  A2  B2
3   K3  A3  B3
dataframe2:
  Key2   C   D
0   K0  C0  D0
1   K1  C1  D1
2   K2  C2  D2
3   K3  C3  D3

dataframe:
  Key1   A   B Key2   C   D
0   K0  A0  B0   K0  C0  D0
1   K1  A1  B1   K1  C1  D1
2   K2  A2  B2   K2  C2  D2
3   K3  A3  B3   K3  C3  D3
If one of the keys does not match, the merge will not include the row for that key.
dataframe1 = pd.DataFrame({
    "Key1": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
})

dataframe2 = pd.DataFrame({
    "Key2": ["K0", "K1", "K2", np.nan],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

dataframe = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
  Key1   A   B
0   K0  A0  B0
1   K1  A1  B1
2   K2  A2  B2
3   K3  A3  B3
dataframe2:
  Key2   C   D
0   K0  C0  D0
1   K1  C1  D1
2   K2  C2  D2
3  NaN  C3  D3

dataframe:
  Key1   A   B Key2   C   D
0   K0  A0  B0   K0  C0  D0
1   K1  A1  B1   K1  C1  D1
2   K2  A2  B2   K2  C2  D2
To change this behavior we can use the how
parameter, which defaults to inner
, but we can pass it the values left
, right
, and outer
dataframe1 = pd.DataFrame({
    "Key1": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
})

dataframe2 = pd.DataFrame({
    "Key2": ["K0", "K1", "K2", np.nan],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

dataframe_inner = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="inner")
dataframe_left = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="left")
dataframe_right = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="right")
dataframe_outer = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="outer")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe inner:\n{dataframe_inner}")
print(f"\ndataframe left:\n{dataframe_left}")
print(f"\ndataframe right:\n{dataframe_right}")
print(f"\ndataframe outer:\n{dataframe_outer}")
dataframe1:
  Key1   A   B
0   K0  A0  B0
1   K1  A1  B1
2   K2  A2  B2
3   K3  A3  B3
dataframe2:
  Key2   C   D
0   K0  C0  D0
1   K1  C1  D1
2   K2  C2  D2
3  NaN  C3  D3

dataframe inner:
  Key1   A   B Key2   C   D
0   K0  A0  B0   K0  C0  D0
1   K1  A1  B1   K1  C1  D1
2   K2  A2  B2   K2  C2  D2

dataframe left:
  Key1   A   B Key2    C    D
0   K0  A0  B0   K0   C0   D0
1   K1  A1  B1   K1   C1   D1
2   K2  A2  B2   K2   C2   D2
3   K3  A3  B3  NaN  NaN  NaN

dataframe right:
  Key1    A    B Key2   C   D
0   K0   A0   B0   K0  C0  D0
1   K1   A1   B1   K1  C1  D1
2   K2   A2   B2   K2  C2  D2
3  NaN  NaN  NaN  NaN  C3  D3

dataframe outer:
  Key1    A    B Key2    C    D
0   K0   A0   B0   K0   C0   D0
1   K1   A1   B1   K1   C1   D1
2   K2   A2   B2   K2   C2   D2
3   K3   A3   B3  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN   C3   D3
As you can see, when left
is chosen, only the values from the left dataframe are added, and when right
is chosen, the values from the right dataframe are added.
12.3. Join
of dataframes
The last DataFrame joining tool is join
. It is similar to merge
, except that instead of looking for similarities based on specified columns, it looks for them based on the indices.
dataframe1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
}, index=["K0", "K1", "K2", "K3"])

dataframe2 = pd.DataFrame({
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
}, index=["K0", "K1", "K2", "K3"])

dataframe = dataframe1.join(dataframe2)

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2
K3  A3  B3
dataframe2:
     C   D
K0  C0  D0
K1  C1  D1
K2  C2  D2
K3  C3  D3

dataframe:
     A   B   C   D
K0  A0  B0  C0  D0
K1  A1  B1  C1  D1
K2  A2  B2  C2  D2
K3  A3  B3  C3  D3
In this case, the indexes are the same, but when they are different, we can specify how to join the dataframes using the how
parameter, which defaults to inner
, but can also have the values left
, right
, and outer
.
dataframe1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
}, index=["K0", "K1", "K2", "K3"])

dataframe2 = pd.DataFrame({
    "C": ["C0", "C2", "C3", "C4"],
    "D": ["D0", "D2", "D3", "D4"],
}, index=["K0", "K2", "K3", "K4"])

dataframe_inner = dataframe1.join(dataframe2, how="inner")
dataframe_left = dataframe1.join(dataframe2, how="left")
dataframe_right = dataframe1.join(dataframe2, how="right")
dataframe_outer = dataframe1.join(dataframe2, how="outer")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe inner:\n{dataframe_inner}")
print(f"\ndataframe left:\n{dataframe_left}")
print(f"\ndataframe right:\n{dataframe_right}")
print(f"\ndataframe outer:\n{dataframe_outer}")
dataframe1:
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2
K3  A3  B3
dataframe2:
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3
K4  C4  D4

dataframe inner:
     A   B   C   D
K0  A0  B0  C0  D0
K2  A2  B2  C2  D2
K3  A3  B3  C3  D3

dataframe left:
     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2
K3  A3  B3   C3   D3

dataframe right:
      A    B   C   D
K0   A0   B0  C0  D0
K2   A2   B2  C2  D2
K3   A3   B3  C3  D3
K4  NaN  NaN  C4  D4

dataframe outer:
      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3   A3   B3   C3   D3
K4  NaN  NaN   C4   D4
13. Missing Data (NaN)
In a DataFrame
there may be some missing data, Pandas represents them as np.nan
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe
13.1. Removal of Rows with Missing Data
To avoid having rows with missing data, these can be eliminated
dataframe.dropna(how="any")
13.2. Removing Columns with Missing Data
dataframe.dropna(axis=1, how='any')
13.3. Boolean mask with missing positions
pd.isna(dataframe)
13.4. Filling Missing Data
dataframe.fillna(value=5.5, inplace=True)
dataframe
Tip: setting the parameter inplace=True modifies the DataFrame being operated on, so there is no need to write dataframe = dataframe.fillna(value=5.5)
14. Time Series
Pandas offers the ability to work with time series. For example, we create a Series
of 100 random data points every second starting from 01/01/2021
indices = pd.date_range("1/1/2021", periods=100, freq="S")
datos = np.random.randint(0, 500, len(indices))
serie_temporal = pd.Series(datos, index=indices)
serie_temporal
2021-01-01 00:00:00    241
2021-01-01 00:00:01     14
2021-01-01 00:00:02    190
2021-01-01 00:00:03    407
2021-01-01 00:00:04     94
                      ...
2021-01-01 00:01:35    275
2021-01-01 00:01:36     56
2021-01-01 00:01:37    448
2021-01-01 00:01:38    151
2021-01-01 00:01:39    316
Freq: S, Length: 100, dtype: int64
This functionality of Pandas is very powerful, for example, we can have a dataset at certain hours in one time zone and change them to another time zone.
horas = pd.date_range("3/6/2021 00:00", periods=10, freq="H")
datos = np.random.randn(len(horas))
serie_horaria = pd.Series(datos, horas)
serie_horaria
2021-03-06 00:00:00   -0.853524
2021-03-06 01:00:00   -1.355372
2021-03-06 02:00:00   -1.267503
2021-03-06 03:00:00   -1.155787
2021-03-06 04:00:00    0.730935
2021-03-06 05:00:00    1.435957
2021-03-06 06:00:00    0.460912
2021-03-06 07:00:00    0.723451
2021-03-06 08:00:00   -0.853337
2021-03-06 09:00:00    0.456359
Freq: H, dtype: float64
We locate the data in a time zone
serie_horaria_utc = serie_horaria.tz_localize("UTC")
serie_horaria_utc
2021-03-06 00:00:00+00:00   -0.853524
2021-03-06 01:00:00+00:00   -1.355372
2021-03-06 02:00:00+00:00   -1.267503
2021-03-06 03:00:00+00:00   -1.155787
2021-03-06 04:00:00+00:00    0.730935
2021-03-06 05:00:00+00:00    1.435957
2021-03-06 06:00:00+00:00    0.460912
2021-03-06 07:00:00+00:00    0.723451
2021-03-06 08:00:00+00:00   -0.853337
2021-03-06 09:00:00+00:00    0.456359
Freq: H, dtype: float64
And now we can convert them to another time zone
serie_horaria_US = serie_horaria_utc.tz_convert("US/Eastern")
serie_horaria_US
2021-03-05 19:00:00-05:00   -0.853524
2021-03-05 20:00:00-05:00   -1.355372
2021-03-05 21:00:00-05:00   -1.267503
2021-03-05 22:00:00-05:00   -1.155787
2021-03-05 23:00:00-05:00    0.730935
2021-03-06 00:00:00-05:00    1.435957
2021-03-06 01:00:00-05:00    0.460912
2021-03-06 02:00:00-05:00    0.723451
2021-03-06 03:00:00-05:00   -0.853337
2021-03-06 04:00:00-05:00    0.456359
Freq: H, dtype: float64
15. Categorical Data
Pandas offers the possibility of adding categorical data in a DataFrame
. Let's suppose the following DataFrame
dataframe = pd.DataFrame(
{"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
dataframe
We can convert the data in the raw_grade
column to categorical data using the astype()
method
dataframe['grade'] = dataframe["raw_grade"].astype("category")
dataframe
The columns raw_grade
and grade
appear to be the same, but if we look at the DataFrame
information we can see that this is not the case.
dataframe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   id         6 non-null      int64
 1   raw_grade  6 non-null      object
 2   grade      6 non-null      category
dtypes: category(1), int64(1), object(1)
memory usage: 334.0+ bytes
You can see that the grade
column is of categorical type
We can see the categories of a categorical column through the cat.categories attribute.
dataframe["grade"].cat.categories
Index(['a', 'b', 'e'], dtype='object')
We can also rename the categories by assigning a list with the new names to this same attribute.
dataframe["grade"].cat.categories = ["very good", "good", "very bad"]
dataframe
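Note that assigning directly to cat.categories works in the pandas version used here (1.0.x), but recent pandas versions remove this assignment in favor of the rename_categories() method, so a more portable sketch would be:

```python
# Equivalent renaming, preferred in recent pandas versions
dataframe["grade"] = dataframe["grade"].cat.rename_categories(
    ["very good", "good", "very bad"]
)
```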
Pandas gives us the possibility to numerically encode categorical data using the get_dummies
method
pd.get_dummies(dataframe["grade"])
16. Groupby
We can group the dataframes by values from one of the columns. Let's reload the dataframe with the value of houses in California.
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.head()
Now we can group the data by one of the columns; for example, let's group the houses by their median age (housing_median_age) and see how many houses there are of each age using count
california_housing_train.groupby("housing_median_age").count().head()
As we can see, every column shows the same value, namely the number of houses of a given age; but we can also obtain the average value of each column with mean
california_housing_train.groupby("housing_median_age").mean().head()
We can obtain multiple measures for each age using the agg
(aggregation) command, passing the measures we want with a list. For example, let's see the minimum, maximum, and mean of each column for each house age.
california_housing_train.groupby("housing_median_age").agg(['min', 'max', 'mean']).head()
We can specify which columns we want to perform certain calculations on by passing a dictionary, where the keys will be the columns on which we want to perform calculations and the values will be lists with the calculations.
california_housing_train.groupby("housing_median_age").agg({'total_rooms': ['min', 'max', 'mean'], 'total_bedrooms': ['min', 'max', 'mean', 'median']}).head()
We can group by more than one column, for this, the columns need to be passed in a list
california_housing_train.groupby(["housing_median_age", "total_bedrooms"]).mean()
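The result is indexed by a MultiIndex built from the two grouping columns; if you prefer to get them back as ordinary columns, you can chain reset_index(), as in this small sketch:

```python
# Turn the (housing_median_age, total_bedrooms) MultiIndex back into columns
california_housing_train.groupby(["housing_median_age", "total_bedrooms"]).mean().reset_index().head()
```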
17. Graphics
Pandas offers the possibility of representing the data of our DataFrame
s in charts to obtain a better representation of them. To do this, it makes use of the matplotlib
library, which we will see in the next post.
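Note that these charts are displayed automatically in a Jupyter notebook; in a plain Python script you would need to call matplotlib explicitly, something like this sketch:

```python
import matplotlib.pyplot as plt  # pandas delegates its plotting to matplotlib

serie = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
serie.cumsum().plot()
plt.show()  # open the figure window when not running in a notebook
```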
17.1. Basic Graph
To represent the data in a graph, the easiest way is to use the plot()
method.
serie = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
serie = serie.cumsum()
serie.plot()
In the case of having a DataFrame
, the plot()
method will represent each of the columns of the DataFrame
.
dataframe = pd.DataFrame(
    np.random.randn(1000, 4), index=serie.index, columns=["A", "B", "C", "D"]
)
dataframe = dataframe.cumsum()
dataframe.plot()
17.2. Vertical Bar Chart
There are more methods to create charts, such as the vertical bar chart using plot.bar()
dataframe = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])
dataframe.plot.bar()
If we want to stack the bars, we indicate it with the variable stacked=True
dataframe.plot.bar(stacked=True)
17.3. Horizontal Bar Chart
To create a horizontal bar chart we use plot.barh()
dataframe.plot.barh()
If we want to stack the bars, we indicate it using the variable stacked=True
dataframe.plot.barh(stacked=True)
17.4. Histogram
To create a histogram we use plot.hist()
dataframe = pd.DataFrame(
{
"a": np.random.randn(1000) + 1,
"b": np.random.randn(1000),
"c": np.random.randn(1000) - 1,
}
)
dataframe.plot.hist(alpha=0.5)
If we want to stack the bars, we indicate it with the variable stacked=True
.
dataframe.plot.hist(alpha=0.5, stacked=True)
If we want to add more columns, that is, if we want the histogram to be more informative or precise, we indicate it using the variable bins
dataframe.plot.hist(alpha=0.5, stacked=True, bins=20)
17.5. Box Plots
To create a box-and-whisker plot we use plot.box()
dataframe = pd.DataFrame(np.random.rand(10, 5), columns=["A", "B", "C", "D", "E"])
dataframe.plot.box()
17.6. Area Charts
To create an area chart we use plot.area()
dataframe.plot.area()
17.7. Scatter Plot
To create a scatter plot we use plot.scatter()
, where we need to specify the x
and y
variables of the plot
dataframe.plot.scatter(x='A', y='B')
17.8. Hexbin Plot
To create a hexagonal bin plot we use plot.hexbin()
, where we need to specify the x
and y
variables of the diagram and the mesh size using gridsize
.
dataframe = pd.DataFrame(np.random.randn(1000, 2), columns=["a", "b"])
dataframe["b"] = dataframe["b"] + np.arange(1000)
dataframe.plot.hexbin(x="a", y="b", gridsize=25)