Data handling with Pandas
Disclaimer: This post has been translated to English using a machine translation model. Please let me know if you find any mistakes.
1. Summary
Let's take a brief look at Pandas, the data manipulation and analysis library. With it we can handle and process tabular data, which helps us operate on the data and extract valuable information.
2. What is Pandas?
Pandas is a Python library designed to make working with relational or labeled data easy and intuitive. It is designed for many different types of data:
- Tabular data with heterogeneous column types, such as in an SQL table or an Excel spreadsheet
- Time series data, ordered and unordered (not necessarily of fixed frequency)
- Arbitrary matrix data (homogeneous or heterogeneous) with row and column labels
- Any other form of observational/statistical datasets. Data does not need to be labeled at all in order to place it into a Pandas data structure
The two main data structures of Pandas are Series
(one-dimensional) and DataFrames
(two-dimensional). Pandas is built on top of NumPy and is designed to integrate well within a scientific computing environment with many other third-party libraries.
For data scientists, working with data is generally divided into several stages: collecting and cleaning data, analyzing/modeling it, and then organizing the analysis results in a suitable format for plotting or displaying them in tabular form. pandas is the ideal tool for all these tasks.
Another feature is that pandas is fast: many of the low-level algorithms are written in C.
2.1. Pandas as pd
Pandas is generally imported with the alias pd
import pandas as pd
print(pd.__version__)
1.0.1
3. Data Structures in Pandas
In Pandas there are two types of data structures: Series
and DataFrames
3.1. Series
The Series
data type is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). Its axis labels are collectively referred to as the index.
To create a Series
data type, the most common way is
series = pd.Series(data, index=index)
Where `data` can be:
* A dictionary
* A list or tuple
* A `ndarray` from NumPy
* A scalar value
As one of the data types can be an ndarray
from NumPy, we import NumPy to be able to use it
import numpy as np
3.1.1. Series from a dictionary
diccionario = {"b": 1, "a": 0, "c": 2}serie = pd.Series(diccionario)serie
b    1
a    0
c    2
dtype: int64
If an index is passed, the values corresponding to the index labels will be extracted. If they do not exist, they are created as NaN
(not a number)
diccionario = {"b": 1, "a": 0, "c": 2}serie = pd.Series(diccionario, index=["b", "c", "d", "a"])serie
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
3.1.2. Series from a list or tuple
If the data comes from a list or tuple and no index is passed, one will be created with values [0, ..., len(data)-1]
serie = pd.Series([1, 2, 3, 4])
serie
0    1
1    2
2    3
3    4
dtype: int64
If an index is passed, it must have the same length as the data
serie = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
serie
a    1
b    2
c    3
d    4
dtype: int64
3.1.3. Series from an ndarray
If the data comes from an ndarray and no index is passed, one will be created with values [0, ..., len(data)-1]
serie = pd.Series(np.random.randn(5))
serie
0    1.267865
1   -0.877857
2   -0.138556
3   -0.132987
4   -0.827295
dtype: float64
If an index is passed, it must have the same length as the data
serie = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
serie
a   -1.091828
b   -0.584243
c    0.220398
d    1.248923
e    1.652351
dtype: float64
3.1.4. Series from a scalar
If the series is created from a scalar, it will be created with a single item
serie = pd.Series(5.0)
serie
0    5.0
dtype: float64
If you want to create more items in the series, you need to pass the index with the number of items you want, so all the items will have the value of the scalar
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
serie
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64
3.1.5. Operations with Series
Just like with Numpy, we can perform operations on all elements of a series without having to iterate through each one of them.
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(f"serie:\n{serie}")
print(f"\nserie + serie =\n{serie + serie}")
serie:
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

serie + serie =
a    10.0
b    10.0
c    10.0
d    10.0
e    10.0
dtype: float64
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(f"serie:\n{serie}")
print(f"\nexp(serie) =\n{np.exp(serie)}")
serie:
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

exp(serie) =
a    148.413159
b    148.413159
c    148.413159
d    148.413159
e    148.413159
dtype: float64
A difference between Series
and ndarrays is that operations between Series
automatically align the data according to their labels. Therefore, calculations can be written without considering whether the involved Series
have the same labels. If a label is not found in one Series
or the other, the result will be marked as missing (NaN).
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(f"serie:\n{serie}")
print(f"\nserie[1:] + serie[:-1] =\n{serie[1:] + serie[:-1]}")
serie:
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

serie[1:] + serie[:-1] =
a     NaN
b    10.0
c    10.0
d    10.0
e     NaN
dtype: float64
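For example, a minimal sketch (with illustrative values) of adding two Series whose labels only partially overlap:

serie_1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
serie_2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])
# "a" and "d" appear in only one of the two Series, so their result is NaN
serie_1 + serie_2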
3.1.6. The name attribute of Series
One of the attributes of Series
is name
, which corresponds to the name they will have when added to a DataFrame. Conversely, when a series is obtained from a DataFrame, this series will have the same name it had in the DataFrame.
serie = pd.Series(np.random.randn(5), name="aleatorio")
serie
0   -0.191009
1   -0.793151
2   -0.907747
3   -1.440508
4   -0.676419
Name: aleatorio, dtype: float64
The name of a series can be changed using the rename()
method.
serie = serie.rename("random")
serie
0   -0.191009
1   -0.793151
2   -0.907747
3   -1.440508
4   -0.676419
Name: random, dtype: float64
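A quick sketch of this round trip (the name here is illustrative):

serie = pd.Series([1.0, 2.0, 3.0], name="aleatorio")
dataframe = pd.DataFrame(serie)
# The column takes the name of the Series, and extracting the column
# back returns a Series with that same name
dataframe["aleatorio"].name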
3.2. DataFrames
A DataFrame
is a two-dimensional labeled data structure with potentially different column types, meaning that one column can contain integer data, another column string data, etc. You can think of it as a spreadsheet or an SQL table, or a dictionary of Series
objects.
It is the most used pandas object. Just like Series
, DataFrames
accept many different types of input, as the following subsections show.
Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and/or columns, you are fixing the index and/or columns of the resulting DataFrame. Therefore, a dictionary of Series plus a specific index will discard all data that does not match the passed index.
If axis labels are not provided, they will be constructed from the input data based on common sense rules.
3.2.1. DataFrames from a Dictionary of Series
If a dictionary of Series
is passed, the DataFrame
will be created with as many columns as Series
the dictionary has.
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe
If each of the Series
has defined indices, the resulting DataFrame
will be the union of these indices.
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0], index=["a", "b", "c", "d"])
}
dataframe = pd.DataFrame(diccionario)
dataframe
dataframe = pd.DataFrame(diccionario, index=["d", "b", "a"])
dataframe
If columns are passed, they will appear in the given order; columns that do not exist in the data are created filled with NaN
dataframe = pd.DataFrame(diccionario, columns=["dos", "tres"])
dataframe
3.2.2. DataFrames from a dictionary of ndarrays or lists
All ndarrays or lists must have the same length. If an index is passed, it must also have the same length as the ndarrays or lists.
diccionario = {
"uno": [1.0, 2.0, 3.0, 4.0],
"dos": [4.0, 3.0, 2.0, 1.0]
}
dataframe = pd.DataFrame(diccionario)
dataframe
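For example, a sketch reusing the same data with an explicit index of the same length:

diccionario = {
    "uno": [1.0, 2.0, 3.0, 4.0],
    "dos": [4.0, 3.0, 2.0, 1.0]
}
# The index must have the same length as the lists (4 in this case)
dataframe = pd.DataFrame(diccionario, index=["a", "b", "c", "d"])
dataframe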
3.2.3. DataFrames from a matrix
If an index is passed, it must have the same length as the number of rows in the matrix, and if columns are passed, they must match the number of columns in the matrix.
matriz = np.array([[1, 3], [2, 2], [3, 1]])
dataframe = pd.DataFrame(matriz, index=["a", "b", "c"], columns=["columna1", "columna2"])
dataframe
3.2.4. DataFrames from a list of dictionaries
lista = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
dataframe = pd.DataFrame(lista)
dataframe
3.2.5. DataFrames from a dictionary of tuples
diccionario = {
("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
}
dataframe = pd.DataFrame(diccionario)
dataframe
3.2.6. DataFrames from a Series
The result will be a DataFrame
with the same index as the input Series, and with a column whose name is the original name of the Series (only if another column name is not provided).
diccionario = {"b": 1, "a": 0, "c": 2}
serie = pd.Series(diccionario)
dataframe = pd.DataFrame(serie)
dataframe
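If the Series has a name, that name is used as the column name; a sketch with the hypothetical name valores:

serie = pd.Series(diccionario, name="valores")
# The resulting DataFrame has a single column called "valores"
dataframe = pd.DataFrame(serie)
dataframe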
4. Exploration of a DataFrame
When a DataFrame
is very large, it cannot be fully represented.
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train
So it's very useful to have methods for exploring it and getting information quickly.
4.1. DataFrame Header
To view the first rows and get an idea of what the DataFrame
looks like, there is the head()
method, which by default displays the first 5 rows of the DataFrame
. If you want to see a different number of rows, specify it using the n
parameter.
california_housing_train.head(n=10)
4.2. Tail of the DataFrame
If you want to see the last rows, you can use the tail()
method. By using the n
parameter, you can choose how many rows to display.
california_housing_train.tail()
4.3. DataFrame Information
Another very useful method is info()
which gives us information about the DataFrame
california_housing_train.info()
<class 'pandas.core.frame.DataFrame'>RangeIndex: 17000 entries, 0 to 16999Data columns (total 9 columns):# Column Non-Null Count Dtype--- ------ -------------- -----0 longitude 17000 non-null float641 latitude 17000 non-null float642 housing_median_age 17000 non-null float643 total_rooms 17000 non-null float644 total_bedrooms 17000 non-null float645 population 17000 non-null float646 households 17000 non-null float647 median_income 17000 non-null float648 median_house_value 17000 non-null float64dtypes: float64(9)memory usage: 1.2 MB
4.4. Rows and Columns of the DataFrame
The indices and columns of a DataFrame
can be obtained using the index
and columns
attributes.
diccionario = {"uno": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),"dos": pd.Series([4.0, 5.0, 6.0, 7.0], index=["a", "b", "c", "d"])}dataframe = pd.DataFrame(diccionario)indices = dataframe.indexcolumnas = dataframe.columnsprint(f"El DataFrame tiene los índices\n {indices}\n")print(f"El DataFrame tiene las columnas\n {columnas}")
The DataFrame has the indices
 Index(['a', 'b', 'c', 'd'], dtype='object')

The DataFrame has the columns
 Index(['uno', 'dos'], dtype='object')
4.5. Description of the DataFrame
The describe()
method displays a quick statistical summary of the data in the DataFrame
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.describe()
4.6. Sorting the DataFrame
Rows of a DataFrame
can be sorted by their index using the sort_index()
method.
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.sort_index().head()
Since the rows were already sorted in this case, we set ascending=False
to reverse the order.
california_housing_train.sort_index(ascending=False).head()
If you want to sort the columns, you need to set axis=1
since the default is 0
.
california_housing_train.sort_index(axis=1).head()
If we want to sort the DataFrame
by a specific column, we need to use the sort_values()
method and specify the label of the column to sort by.
california_housing_train.sort_values('median_house_value')
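sort_values() also accepts the ascending parameter and a list of columns to sort by several criteria at once; a quick sketch:

# Descending order by house value
california_housing_train.sort_values('median_house_value', ascending=False).head()

# Sort by age first and, within each age, by house value
california_housing_train.sort_values(['housing_median_age', 'median_house_value']).head()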
4.7. DataFrame Statistics
Statistics can be obtained from the DataFrame
, such as the mean, mode, and standard deviation.
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")

print(f"mean:\n{california_housing_train.mean()}")
print(f"\n\nstandard deviation:\n{california_housing_train.std()}")
mean:
longitude               -119.562108
latitude                  35.625225
housing_median_age        28.589353
total_rooms             2643.664412
total_bedrooms           539.410824
population              1429.573941
households               501.221941
median_income              3.883578
median_house_value    207300.912353
dtype: float64

standard deviation:
longitude                  2.005166
latitude                   2.137340
housing_median_age        12.586937
total_rooms             2179.947071
total_bedrooms           421.499452
population              1147.852959
households               384.520841
median_income              1.908157
median_house_value    115983.764387
dtype: float64
If you want to get statistics about the rows instead of the columns, you need to specify it using axis=1
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")

print(f"mean:\n{california_housing_train.mean(axis=1)}")
print(f"\n\nstandard deviation:\n{california_housing_train.std(axis=1)}")
mean:
0         8357.597067
1        10131.527778
2         9664.642322
3         8435.029078
4         7567.436111
             ...
16995    12806.408567
16996     9276.770878
16997    12049.507922
16998    10082.053300
16999    10863.022744
Length: 17000, dtype: float64

standard deviation:
0        22026.612445
1        26352.939272
2        28514.316588
3        24366.754747
4        21730.014569
             ...
16995    36979.676899
16996    26158.006771
16997    34342.876792
16998    28408.152329
16999    31407.119788
Length: 17000, dtype: float64
Another useful thing that can be obtained from DataFrame
s is, for example, the number of times each item in a column repeats.
california_housing_train["total_rooms"].value_counts()
1582.0     16
1527.0     15
1717.0     14
1471.0     14
1703.0     14
           ..
157.0       1
2760.0      1
458.0       1
10239.0     1
4068.0      1
Name: total_rooms, Length: 5533, dtype: int64
For example, we can see that there are a total of 16 houses with 1582 rooms.
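If relative frequencies are preferred over absolute counts, value_counts() accepts a normalize parameter; a quick sketch:

# Fraction of houses for each value of total_rooms instead of counts
california_housing_train["total_rooms"].value_counts(normalize=True).head()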
4.8. Used Memory
We can see the memory used by the DataFrame
california_housing_train.memory_usage(deep=True)
Index                    128
longitude             136000
latitude              136000
housing_median_age    136000
total_rooms           136000
total_bedrooms        136000
population            136000
households            136000
median_income         136000
median_house_value    136000
dtype: int64
5. Addition of Data
5.1. Addition of Columns
Columns can be easily added as operations of other columns
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe["tres"] = dataframe["uno"] + dataframe["dos"]
dataframe["flag"] = dataframe["tres"] > 7.0
dataframe
Columns can also be added indicating what value all their items will have
dataframe["constante"] = 8.0
dataframe
If a Series
is added that does not have the same number of indices as the DataFrame
, it will be conformed to the DataFrame's index and the missing positions will be filled with NaN
.
dataframe["Menos indices"] = dataframe["uno"][:2]
dataframe
With the previous methods, the column was added at the end, but if you want to add the column in a specific position, you can use the insert()
method.
For example, if you want to add a column at position 3 (keeping in mind that indexing starts from position 0), with the name of the column being inserted column and its value being double that of the column three, it would be done as follows:
dataframe.insert(loc=3, column="columna insertada", value=dataframe["tres"]*2)
dataframe
If you want to add more than one column per command, you can use the assign()
method.
dataframe = dataframe.assign(
columna_asignada1 = dataframe["uno"] * dataframe["tres"],
columna_asignada2 = dataframe["dos"] * dataframe["tres"],
)
dataframe
5.2. Addition of Rows
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe.head()
We can add a row at the end using the concat()
function (which we will cover in more detail later)
diccionario = {
"uno": [10.0],
"dos": [20.0]
}
dataframe = pd.concat([dataframe, pd.DataFrame(diccionario)])
dataframe
We see that the row has been added at the end, but it has index zero, so we renumber the indices using the method reset_index(drop=True)
dataframe = dataframe.reset_index(drop=True)
dataframe
6. Data Deletion
6.1. Column Deletion
A specific column can be deleted using the pop()
method.
dataframe.pop("constante")
dataframe
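Note that pop() also returns the removed column as a Series, so the cell above could have kept it in a variable; a sketch:

# Equivalent to the cell above, but keeping the removed column
columna = dataframe.pop("constante")
columna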
Or using del
del dataframe["flag"]
dataframe
6.2. Row Deletion
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]),
"dos": pd.Series([11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]),
"tres": pd.Series([21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe.head()
If we want to delete a row, we can use the drop
method, specifying its index label (with the default RangeIndex, the label coincides with the position). For example, to delete the row with label 1
dataframe = dataframe.drop(1)
dataframe
If we want to delete the last row
dataframe = dataframe.drop(len(dataframe)-1)
dataframe
If what we want is to delete a range of rows
dataframe = dataframe.drop(range(2, 5))
dataframe
If what we want is to remove a specific set of rows
dataframe = dataframe.drop([5, 7, 9])
dataframe
Just like when we added rows, we see that some indices have been removed, so we reorder the indices using the method reset_index(drop=True)
dataframe = dataframe.reset_index(drop=True)
dataframe
7. Operations on DataFrames
Operations can be performed on DataFrame
s just as they can with Numpy.
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]),
"dos": pd.Series([11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]),
"tres": pd.Series([21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe.head()
dataframe[ ["uno", "dos", "tres"] ] * 2
np.exp(dataframe[ ["uno", "dos", "tres"] ])
If more complex operations are desired, the apply()
method can be used.
dataframe = dataframe.apply(lambda x: x.max() - x.min())
dataframe
uno     9.0
dos     9.0
tres    9.0
dtype: float64
A lambda
function has been applied because it is a simple function, but if we want to apply more complex functions, we can define and apply them.
def funcion(x):
    if x < 10:
        return np.exp(x) - np.log(5*x) + np.sqrt(x)
    elif x < 20:
        return np.sin(x) + np.cos(x) + np.tan(x)
    else:
        return np.log(x) + np.log10(x) + np.log2(x)

dataframe = dataframe.apply(funcion)
dataframe
uno     8102.277265
dos     8102.277265
tres    8102.277265
dtype: float64
Using the apply
method on a DataFrame is much faster than looping through each row and performing the operation.
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.head()
Let's calculate the percentage of bedrooms out of the total number of rooms.
california_housing_train["percent_bedrooms"] = None
%time california_housing_train["percent_bedrooms"] = california_housing_train.apply(lambda x: x["total_bedrooms"] / x["total_rooms"], axis=1)
california_housing_train.head()
california_housing_train["percent_bedrooms"] = None
%time for i in range(len(california_housing_train)): california_housing_train["percent_bedrooms"][i] = california_housing_train["total_bedrooms"][i] / california_housing_train["total_rooms"][i]
california_housing_train.head()
The lambda
function took about 300 ms, while the for
loop took more than 1 second.
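Even faster than apply() is operating directly on whole columns, as in the examples at the beginning of this section; a sketch of the same calculation vectorized:

california_housing_train["percent_bedrooms"] = None
# Vectorized division of whole columns: no Python-level loop at all
%time california_housing_train["percent_bedrooms"] = california_housing_train["total_bedrooms"] / california_housing_train["total_rooms"]
california_housing_train.head()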
8. Transpose
The transpose of a DataFrame
can be obtained using the T
attribute.
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0], index=["fila a", "fila b", "fila c"]),
"dos": pd.Series([4.0, 5.0, 6.0], index=["fila a", "fila b", "fila c"])
}
dataframe = pd.DataFrame(diccionario)
dataframe["tres"] = dataframe["uno"] + dataframe["dos"]
dataframe["flag"] = dataframe["tres"] > 7.0
dataframe.T
9. Conversion to Numpy
If you want to convert a Series
or DataFrame
to NumPy, you can use the to_numpy()
method or the np.asarray()
function.
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0], index=["fila a", "fila b", "fila c"]),
"dos": pd.Series([4.0, 5.0, 6.0], index=["fila a", "fila b", "fila c"])
}
dataframe = pd.DataFrame(diccionario)
dataframe["tres"] = dataframe["uno"] + dataframe["dos"]
dataframe["flag"] = dataframe["tres"] > 7.0
dataframe
matriz_np = dataframe.to_numpy()
matriz_np
array([[1.0, 4.0, 5.0, False],
       [2.0, 5.0, 7.0, False],
       [3.0, 6.0, 9.0, True]], dtype=object)
matriz_np = np.asarray(dataframe)
matriz_np
array([[1.0, 4.0, 5.0, False],
       [2.0, 5.0, 7.0, False],
       [3.0, 6.0, 9.0, True]], dtype=object)
This example is not the most appropriate, as it mixes numbers with booleans, and as we explained in the previous post Matrix Calculus with NumPy, all elements of an ndarray
must be of the same type.
Because numbers and booleans are mixed here, NumPy falls back to the generic object dtype.
To avoid this, we keep only the numeric columns and convert them to an ndarray
matriz_np = dataframe[ ["uno", "dos", "tres"] ].to_numpy()
matriz_np, matriz_np.dtype
(array([[1., 4., 5.],
        [2., 5., 7.],
        [3., 6., 9.]]), dtype('float64'))
Now it can be seen that an ndarray
has been created where all the data are of type float
10. Reading data from external sources
One of the greatest strengths of Pandas is being able to read data from files, so it's not necessary to create a DataFrame
with the data you want to process; instead, it can be read from a file.
Just like you can create DataFrame
s from external files, you can also save DataFrame
s to files, so you can create your own dataset, configure it the way you want, and save it to a file to use it later.
The following table shows the functions for reading and writing files in different formats.
|Format|File type|Reading function|Writing function|
|---|---|---|---|
|text|CSV|read_csv|to_csv|
|text|Fixed-Width Text File|read_fwf||
|text|JSON|read_json|to_json|
|text|HTML|read_html|to_html|
|text|Local clipboard|read_clipboard|to_clipboard|
|binary|MS Excel|read_excel|to_excel|
|binary|OpenDocument|read_excel||
|binary|HDF5 Format|read_hdf|to_hdf|
|binary|Feather Format|read_feather|to_feather|
|binary|Parquet Format|read_parquet|to_parquet|
|binary|ORC Format|read_orc||
|binary|Msgpack|read_msgpack|to_msgpack|
|binary|Stata|read_stata|to_stata|
|binary|SAS|read_sas||
|binary|SPSS|read_spss||
|binary|Python Pickle Format|read_pickle|to_pickle|
|SQL|SQL|read_sql|to_sql|
|SQL|Google BigQuery|read_gbq|to_gbq|
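For example, a minimal sketch of a round trip to CSV (the file name is illustrative):

# Save the DataFrame to a CSV file without writing the index,
# then read it back into a new DataFrame
dataframe.to_csv("mi_dataset.csv", index=False)
dataframe_leido = pd.read_csv("mi_dataset.csv")
dataframe_leido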
11. Indexing in DataFrames
There are many ways to index into DataFrame
s.
fechas = pd.date_range('1/1/2000', periods=8)
dataframe = pd.DataFrame(np.random.randn(8, 4), index=fechas, columns=['A', 'B', 'C', 'D'])
dataframe
11.1. Column Indexing
To select columns within a DataFrame
, you can do so by selecting the column between brackets []
, or by indicating the column as if it were an attribute of the DataFrame
.
dataframe['A']
2000-01-01    0.813153
2000-01-02   -0.244584
2000-01-03    0.125729
2000-01-04    0.352275
2000-01-05   -2.050976
2000-01-06   -0.312296
2000-01-07    0.897837
2000-01-08    0.271403
Freq: D, Name: A, dtype: float64
dataframe.A
2000-01-01    0.813153
2000-01-02   -0.244584
2000-01-03    0.125729
2000-01-04    0.352275
2000-01-05   -2.050976
2000-01-06   -0.312296
2000-01-07    0.897837
2000-01-08    0.271403
Freq: D, Name: A, dtype: float64
If specific columns are desired, they are passed as a list
dataframe[ ['A', 'B'] ]
11.2. Row Indexing by Positions
A range of rows from a DataFrame
can be selected as follows
dataframe[0:3]
If you want to select a single row, you need to specify a range of rows that includes only that one. For example, if you want to select row number 1
dataframe[1:2]
Another method to select a row by its position is the iloc[]
method
dataframe.iloc[0:3]
If specific rows are desired, a list with their positions is passed.
dataframe.iloc[ [0, 2, 4] ]
11.3. Row Label Indexing
To select a row by its labels, we can use the loc[]
method.
dataframe.loc['2000-01-01']
A    0.813153
B   -0.869356
C    0.934293
D    0.338644
Name: 2000-01-01 00:00:00, dtype: float64
If you want to select a range of rows, you can index them using the colon :
dataframe.loc['2000-01-01':'2000-01-03']
If specific rows are desired, they are passed through a list
dataframe.loc[ ['2000-01-01', '2000-01-03', '2000-01-05'] ]
11.4. Selection of a Portion of the DataFrame by Positions
dataframe.iloc[0:3, 0:2]
If specific rows and columns are desired, lists with the desired positions are passed.
dataframe.iloc[ [0, 2, 4], [0, 2] ]
11.5. Selecting a Portion of the DataFrame Using Labels
dataframe.loc['2000-01-01':'2000-01-03', 'A':'B']
If you want certain rows and columns, you pass lists with the desired labels
dataframe.loc[ ['2000-01-01', '2000-01-03', '2000-01-05'], ['A', 'C'] ]
11.6. Lambda Function Indexing
Data can be selected from a DataFrame
that meets a condition given by a lambda
function.
dataframe.loc[lambda dataframe: 2*dataframe['A'] + 5*np.exp(dataframe['B']) > 0.2]
As can be seen, this form of indexing is very powerful
11.7. Conditional Indexing
If we don't need complex functions for indexing, but only conditionals, we can do
dataframe[dataframe['A']>0.2]
We can make multiple conditions
dataframe[(dataframe['A']>0.2) & (dataframe['B']>0.2)]
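Conditions can also be combined with | (or) and negated with ~, keeping the parentheses around each condition; a quick sketch:

# Rows where A is greater than 0.2 or B is greater than 0.2
dataframe[(dataframe['A'] > 0.2) | (dataframe['B'] > 0.2)]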
11.8. Random Indexing
By using the sample()
method, we will obtain a random row from the DataFrame
.
dataframe.sample()
If we want more than one sample, we indicate it with the parameter n
dataframe.sample(n=3)
If random columns are desired, this must be indicated using axis=1
dataframe.sample(axis=1)
If a single item from the DataFrame
is desired, the sample()
method has to be called twice.
dataframe.sample(axis=1).sample()
12. Merging DataFrames
12.1. Concatenation of DataFrames
To concatenate multiple DataFrame
s we use the concat()
function, where a list of the DataFrame
s to be joined is passed.
dataframe1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

dataframe2 = pd.DataFrame({
    "A": ["A4", "A5", "A6", "A7"],
    "B": ["B4", "B5", "B6", "B7"],
    "C": ["C4", "C5", "C6", "C7"],
    "D": ["D4", "D5", "D6", "D7"],
})

dataframe3 = pd.DataFrame({
    "A": ["A8", "A9", "A10", "A11"],
    "B": ["B8", "B9", "B10", "B11"],
    "C": ["C8", "C9", "C10", "C11"],
    "D": ["D8", "D9", "D10", "D11"],
})

dataframe = pd.concat([dataframe1, dataframe2, dataframe3])

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"dataframe3:\n{dataframe3}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
dataframe2:
    A   B   C   D
0  A4  B4  C4  D4
1  A5  B5  C5  D5
2  A6  B6  C6  D6
3  A7  B7  C7  D7
dataframe3:
     A    B    C    D
0   A8   B8   C8   D8
1   A9   B9   C9   D9
2  A10  B10  C10  D10
3  A11  B11  C11  D11

dataframe:
     A    B    C    D
0   A0   B0   C0   D0
1   A1   B1   C1   D1
2   A2   B2   C2   D2
3   A3   B3   C3   D3
0   A4   B4   C4   D4
1   A5   B5   C5   D5
2   A6   B6   C6   D6
3   A7   B7   C7   D7
0   A8   B8   C8   D8
1   A9   B9   C9   D9
2  A10  B10  C10  D10
3  A11  B11  C11  D11
As can be seen, the indices 0
, 1
, 2
, and 3
are repeated because each dataframe has those indices. To prevent this, you should use the parameter ignore_index=True
.
dataframe = pd.concat([dataframe1, dataframe2, dataframe3], ignore_index=True)

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"dataframe3:\n{dataframe3}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
dataframe2:
    A   B   C   D
0  A4  B4  C4  D4
1  A5  B5  C5  D5
2  A6  B6  C6  D6
3  A7  B7  C7  D7
dataframe3:
     A    B    C    D
0   A8   B8   C8   D8
1   A9   B9   C9   D9
2  A10  B10  C10  D10
3  A11  B11  C11  D11

dataframe:
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11
If the concatenation was intended to be performed along the columns, the parameter axis=1
should have been used.
dataframe = pd.concat([dataframe1, dataframe2, dataframe3], axis=1)

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"dataframe3:\n{dataframe3}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
dataframe2:
    A   B   C   D
0  A4  B4  C4  D4
1  A5  B5  C5  D5
2  A6  B6  C6  D6
3  A7  B7  C7  D7
dataframe3:
     A    B    C    D
0   A8   B8   C8   D8
1   A9   B9   C9   D9
2  A10  B10  C10  D10
3  A11  B11  C11  D11

dataframe:
    A   B   C   D   A   B   C   D    A    B    C    D
0  A0  B0  C0  D0  A4  B4  C4  D4   A8   B8   C8   D8
1  A1  B1  C1  D1  A5  B5  C5  D5   A9   B9   C9   D9
2  A2  B2  C2  D2  A6  B6  C6  D6  A10  B10  C10  D10
3  A3  B3  C3  D3  A7  B7  C7  D7  A11  B11  C11  D11
12.1.1. Intersection of Concatenation
There are two ways to perform the concatenation, either taking all the indices from the DataFrame
s or only those that match. This is determined by the join
parameter, which accepts the values 'outer'
(default) (takes all indices) or 'inner'
(only those that match).
Let's see an example of 'outer'
dataframe1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
}, index=[0, 1, 2, 3])

dataframe4 = pd.DataFrame({
    "B": ["B2", "B3", "B6", "B7"],
    "D": ["D2", "D3", "D6", "D7"],
    "F": ["F2", "F3", "F6", "F7"],
}, index=[2, 3, 6, 7])

dataframe = pd.concat([dataframe1, dataframe4], axis=1)

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe4:\n{dataframe4}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
dataframe4:
    B   D   F
2  B2  D2  F2
3  B3  D3  F3
6  B6  D6  F6
7  B7  D7  F7

dataframe:
     A    B    C    D    B    D    F
0   A0   B0   C0   D0  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN
2   A2   B2   C2   D2   B2   D2   F2
3   A3   B3   C3   D3   B3   D3   F3
6  NaN  NaN  NaN  NaN   B6   D6   F6
7  NaN  NaN  NaN  NaN   B7   D7   F7
Let's see an example of 'inner'
dataframe = pd.concat([dataframe1, dataframe4], axis=1, join="inner")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe4:\n{dataframe4}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
dataframe4:
    B   D   F
2  B2  D2  F2
3  B3  D3  F3
6  B6  D6  F6
7  B7  D7  F7

dataframe:
    A   B   C   D   B   D   F
2  A2  B2  C2  D2  B2  D2  F2
3  A3  B3  C3  D3  B3  D3  F3
12.2. Merge of DataFrames
We previously created a new dataframe by merging several dataframes. Now we can complete one dataframe with another using merge
, passing the parameter on
to specify which column should be used for the merge
.
dataframe1 = pd.DataFrame({
    "Key": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
})

dataframe2 = pd.DataFrame({
    "Key": ["K0", "K1", "K2", "K3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

dataframe = dataframe1.merge(dataframe2, on="Key")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
  Key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3
dataframe2:
  Key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K2  C2  D2
3  K3  C3  D3

dataframe:
  Key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3
In this case, both dataframes had a key with the same name (Key
), but if we have dataframes where their keys are named differently, we can use the left_on
and right_on
parameters.
dataframe1 = pd.DataFrame({
    "Key1": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
})

dataframe2 = pd.DataFrame({
    "Key2": ["K0", "K1", "K2", "K3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

dataframe = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
  Key1   A   B
0   K0  A0  B0
1   K1  A1  B1
2   K2  A2  B2
3   K3  A3  B3
dataframe2:
  Key2   C   D
0   K0  C0  D0
1   K1  C1  D1
2   K2  C2  D2
3   K3  C3  D3

dataframe:
  Key1   A   B Key2   C   D
0   K0  A0  B0   K0  C0  D0
1   K1  A1  B1   K1  C1  D1
2   K2  A2  B2   K2  C2  D2
3   K3  A3  B3   K3  C3  D3
If one of the keys does not match, the corresponding row will be left out of the merge
.
dataframe1 = pd.DataFrame({
    "Key1": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
})

dataframe2 = pd.DataFrame({
    "Key2": ["K0", "K1", "K2", np.nan],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

dataframe = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
  Key1   A   B
0   K0  A0  B0
1   K1  A1  B1
2   K2  A2  B2
3   K3  A3  B3
dataframe2:
  Key2   C   D
0   K0  C0  D0
1   K1  C1  D1
2   K2  C2  D2
3  NaN  C3  D3

dataframe:
  Key1   A   B Key2   C   D
0   K0  A0  B0   K0  C0  D0
1   K1  A1  B1   K1  C1  D1
2   K2  A2  B2   K2  C2  D2
To change this behavior, we can use the how
parameter, which by default has the value inner
, but we can pass it the values left
, right
, and outer
.
dataframe1 = pd.DataFrame({
    "Key1": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
})

dataframe2 = pd.DataFrame({
    "Key2": ["K0", "K1", "K2", np.nan],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

dataframe_inner = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="inner")
dataframe_left = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="left")
dataframe_right = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="right")
dataframe_outer = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="outer")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe inner:\n{dataframe_inner}")
print(f"\ndataframe left:\n{dataframe_left}")
print(f"\ndataframe right:\n{dataframe_right}")
print(f"\ndataframe outer:\n{dataframe_outer}")
dataframe1:
  Key1   A   B
0   K0  A0  B0
1   K1  A1  B1
2   K2  A2  B2
3   K3  A3  B3
dataframe2:
  Key2   C   D
0   K0  C0  D0
1   K1  C1  D1
2   K2  C2  D2
3  NaN  C3  D3

dataframe inner:
  Key1   A   B Key2   C   D
0   K0  A0  B0   K0  C0  D0
1   K1  A1  B1   K1  C1  D1
2   K2  A2  B2   K2  C2  D2

dataframe left:
  Key1   A   B Key2    C    D
0   K0  A0  B0   K0   C0   D0
1   K1  A1  B1   K1   C1   D1
2   K2  A2  B2   K2   C2   D2
3   K3  A3  B3  NaN  NaN  NaN

dataframe right:
  Key1    A    B Key2   C   D
0   K0   A0   B0   K0  C0  D0
1   K1   A1   B1   K1  C1  D1
2   K2   A2   B2   K2  C2  D2
3  NaN  NaN  NaN  NaN  C3  D3

dataframe outer:
  Key1    A    B Key2    C    D
0   K0   A0   B0   K0   C0   D0
1   K1   A1   B1   K1   C1   D1
2   K2   A2   B2   K2   C2   D2
3   K3   A3   B3  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN   C3   D3
As can be seen, when left
is chosen, only the values from the left dataframe are added, and when right
is chosen, the values from the right dataframe are added.
12.3. Join of dataframes
The last tool for joining dataframes is join
. It is similar to merge
, except that instead of looking for similarities based on specified columns, it looks for them based on the indices.
dataframe1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
}, index=["K0", "K1", "K2", "K3"])

dataframe2 = pd.DataFrame({
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
}, index=["K0", "K1", "K2", "K3"])

dataframe = dataframe1.join(dataframe2)

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2
K3  A3  B3
dataframe2:
     C   D
K0  C0  D0
K1  C1  D1
K2  C2  D2
K3  C3  D3

dataframe:
     A   B   C   D
K0  A0  B0  C0  D0
K1  A1  B1  C1  D1
K2  A2  B2  C2  D2
K3  A3  B3  C3  D3
In this case, the indices are the same, but when they are different we can specify how to join the dataframes using the how
parameter, which for join defaults to the value left, but can also have the values inner, right, or outer.
dataframe1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
}, index=["K0", "K1", "K2", "K3"])

dataframe2 = pd.DataFrame({
    "C": ["C0", "C2", "C3", "C4"],
    "D": ["D0", "D2", "D3", "D4"],
}, index=["K0", "K2", "K3", "K4"])

dataframe_inner = dataframe1.join(dataframe2, how="inner")
dataframe_left = dataframe1.join(dataframe2, how="left")
dataframe_right = dataframe1.join(dataframe2, how="right")
dataframe_outer = dataframe1.join(dataframe2, how="outer")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe inner:\n{dataframe_inner}")
print(f"\ndataframe left:\n{dataframe_left}")
print(f"\ndataframe right:\n{dataframe_right}")
print(f"\ndataframe outer:\n{dataframe_outer}")
dataframe1:
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2
K3  A3  B3
dataframe2:
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3
K4  C4  D4

dataframe inner:
     A   B   C   D
K0  A0  B0  C0  D0
K2  A2  B2  C2  D2
K3  A3  B3  C3  D3

dataframe left:
     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2
K3  A3  B3   C3   D3

dataframe right:
      A    B   C   D
K0   A0   B0  C0  D0
K2   A2   B2  C2  D2
K3   A3   B3  C3  D3
K4  NaN  NaN  C4  D4

dataframe outer:
      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3   A3   B3   C3   D3
K4  NaN  NaN   C4   D4
13. Missing data (NaN)
In a DataFrame
there can be missing data; Pandas represents it as np.nan
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe
13.1. Removal of Rows with Missing Data
To avoid having rows with missing data, these can be removed.
dataframe.dropna(how="any")
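With how="any" a row is dropped if it contains at least one missing value; with how="all" it is dropped only when every value in it is missing. A sketch:

# Only drops rows in which all the values are NaN
dataframe.dropna(how="all")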
13.2. Dropping Columns with Missing Data
dataframe.dropna(axis=1, how='any')
13.3. Boolean mask with missing positions
pd.isna(dataframe)
13.4. Filling Missing Data
dataframe.fillna(value=5.5, inplace=True)
dataframe
Tip: By setting the parameter inplace=True, the DataFrame being operated on is modified in place, so there is no need to write dataframe = dataframe.fillna(value=5.5)
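Instead of a fixed value, each column's missing data can be filled with a statistic of that column, for example its mean; a sketch:

# Passing a Series to fillna fills each column with its own value,
# here the mean of that column
dataframe.fillna(value=dataframe.mean())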
14. Time series
Pandas offers the possibility of working with time series. For example, we create a Series
of 100 random data points every second starting from 01/01/2021
indices = pd.date_range("1/1/2021", periods=100, freq="S")
datos = np.random.randint(0, 500, len(indices))
serie_temporal = pd.Series(datos, index=indices)
serie_temporal
2021-01-01 00:00:00    241
2021-01-01 00:00:01     14
2021-01-01 00:00:02    190
2021-01-01 00:00:03    407
2021-01-01 00:00:04     94
                      ...
2021-01-01 00:01:35    275
2021-01-01 00:01:36     56
2021-01-01 00:01:37    448
2021-01-01 00:01:38    151
2021-01-01 00:01:39    316
Freq: S, Length: 100, dtype: int64
This Pandas functionality is very powerful; for example, we can have a dataset with timestamps in one time zone and convert them to another.
horas = pd.date_range("3/6/2021 00:00", periods=10, freq="H")
datos = np.random.randn(len(horas))
serie_horaria = pd.Series(datos, horas)
serie_horaria
2021-03-06 00:00:00   -0.853524
2021-03-06 01:00:00   -1.355372
2021-03-06 02:00:00   -1.267503
2021-03-06 03:00:00   -1.155787
2021-03-06 04:00:00    0.730935
2021-03-06 05:00:00    1.435957
2021-03-06 06:00:00    0.460912
2021-03-06 07:00:00    0.723451
2021-03-06 08:00:00   -0.853337
2021-03-06 09:00:00    0.456359
Freq: H, dtype: float64
We localize the data to a time zone
serie_horaria_utc = serie_horaria.tz_localize("UTC")
serie_horaria_utc
2021-03-06 00:00:00+00:00   -0.853524
2021-03-06 01:00:00+00:00   -1.355372
2021-03-06 02:00:00+00:00   -1.267503
2021-03-06 03:00:00+00:00   -1.155787
2021-03-06 04:00:00+00:00    0.730935
2021-03-06 05:00:00+00:00    1.435957
2021-03-06 06:00:00+00:00    0.460912
2021-03-06 07:00:00+00:00    0.723451
2021-03-06 08:00:00+00:00   -0.853337
2021-03-06 09:00:00+00:00    0.456359
Freq: H, dtype: float64
And now we can convert it to another time zone
serie_horaria_US = serie_horaria_utc.tz_convert("US/Eastern")
serie_horaria_US
2021-03-05 19:00:00-05:00   -0.853524
2021-03-05 20:00:00-05:00   -1.355372
2021-03-05 21:00:00-05:00   -1.267503
2021-03-05 22:00:00-05:00   -1.155787
2021-03-05 23:00:00-05:00    0.730935
2021-03-06 00:00:00-05:00    1.435957
2021-03-06 01:00:00-05:00    0.460912
2021-03-06 02:00:00-05:00    0.723451
2021-03-06 03:00:00-05:00   -0.853337
2021-03-06 04:00:00-05:00    0.456359
Freq: H, dtype: float64
15. Categorical Data
Pandas offers the possibility of adding categorical data in a DataFrame
. Suppose the following DataFrame
dataframe = pd.DataFrame(
{"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
dataframe
We can convert the data in the raw_grade
column to categorical data using the astype()
method.
dataframe['grade'] = dataframe["raw_grade"].astype("category")
dataframe
The columns raw_grade
and grade
seem identical, but if we look at the information of the DataFrame
we can see that this is not the case.
dataframe.info()
<class 'pandas.core.frame.DataFrame'>RangeIndex: 6 entries, 0 to 5Data columns (total 3 columns):# Column Non-Null Count Dtype--- ------ -------------- -----0 id 6 non-null int641 raw_grade 6 non-null object2 grade 6 non-null categorydtypes: category(1), int64(1), object(1)memory usage: 334.0+ bytes
It can be seen that the column grade
is of categorical type
We can see the categories of a categorical column through the cat.categories attribute
dataframe["grade"].cat.categories
Index(['a', 'b', 'e'], dtype='object')
We can also rename the categories by assigning a list with the new categories to this attribute.
dataframe["grade"].cat.categories = ["very good", "good", "very bad"]
dataframe
Pandas gives us the possibility to numerically encode categorical data using the get_dummies
function.
pd.get_dummies(dataframe["grade"])
16. Groupby
We can group the dataframes by values from one of the columns. Let's reload the dataframe with the value of houses in California.
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.head()
Now we can group the data by one of the columns. For example, let's group the houses by their median age and see how many houses there are of each age with count
california_housing_train.groupby("housing_median_age").count().head()
As we can see, every column gives the same value, namely the number of houses of a given age; but we can also obtain the average value of each column with mean
california_housing_train.groupby("housing_median_age").mean().head()
We can obtain several measures for each age using the agg
(aggregation) command, passing it the measures we want through a list. For example, let's look at the minimum, maximum, and mean of each column for each age:
california_housing_train.groupby("housing_median_age").agg(['min', 'max', 'mean']).head()
We can specify on which columns we want to perform certain calculations by passing a dictionary, where the keys will be the columns on which we want to perform calculations and the values will be lists with the calculations.
california_housing_train.groupby("housing_median_age").agg({'total_rooms': ['min', 'max', 'mean'], 'total_bedrooms': ['min', 'max', 'mean', 'median']}).head()
We can group by more than one column, for this, we have to pass the columns in a list
california_housing_train.groupby(["housing_median_age", "total_bedrooms"]).mean()
17. Graphics
Pandas offers the possibility of representing the data in our DataFrame
s in charts to obtain a better representation of it. For this, it uses the matplotlib
library, which we will cover in the next post.
17.1. Basic Graph
To represent the data in a chart, the easiest way is to use the plot()
method.
serie = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
serie = serie.cumsum()
serie.plot()
In the case of having a DataFrame
, the plot()
method will represent each of the columns of the DataFrame
dataframe = pd.DataFrame(
    np.random.randn(1000, 4), index=serie.index, columns=["A", "B", "C", "D"]
)
dataframe = dataframe.cumsum()
dataframe.plot()
17.2. Vertical Bar Chart
There are more methods to create charts, such as the vertical bar chart using plot.bar()
dataframe = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])
dataframe.plot.bar()
If we want to stack the bars, we indicate this through the parameter stacked=True
dataframe.plot.bar(stacked=True)
17.3. Horizontal Bar Chart
To create a horizontal bar chart we use plot.barh()
dataframe.plot.barh()
If we want to stack the bars, we indicate this through the parameter stacked=True
dataframe.plot.barh(stacked=True)
17.4. Histogram
To create a histogram we use plot.hist()
dataframe = pd.DataFrame(
{
"a": np.random.randn(1000) + 1,
"b": np.random.randn(1000),
"c": np.random.randn(1000) - 1,
}
)
dataframe.plot.hist(alpha=0.5)
If we want to stack the bars, we indicate this through the parameter stacked=True
dataframe.plot.hist(alpha=0.5, stacked=True)
If we want more bins, that is, a finer-grained and more informative histogram, we indicate this through the bins
parameter.
dataframe.plot.hist(alpha=0.5, stacked=True, bins=20)
17.5. Box Plots
To create a box plot (box-and-whisker diagram) we use plot.box()
dataframe = pd.DataFrame(np.random.rand(10, 5), columns=["A", "B", "C", "D", "E"])
dataframe.plot.box()
17.6. Area Charts
To create an area chart we use plot.area()
dataframe.plot.area()
17.7. Scatter plot
To create a scatter plot we use plot.scatter()
, where you need to specify the x
and y
variables of the plot.
dataframe.plot.scatter(x='A', y='B')
17.8. Hexagonal Bin Plot
To create a hexagonal bin plot we use plot.hexbin()
, where you need to specify the x
and y
variables of the plot and the mesh size using gridsize
.
dataframe = pd.DataFrame(np.random.randn(1000, 2), columns=["a", "b"])
dataframe["b"] = dataframe["b"] + np.arange(1000)
dataframe.plot.hexbin(x="a", y="b", gridsize=25)