Data management with Pandas


Disclaimer: This post has been translated to English using a machine translation model. Please, let me know if you find any mistakes.

1. Summary

Let's take a look at a brief introduction to the data manipulation and analysis library Pandas. With it, we can handle and process tabular data, which will help us operate with the data and extract valuable information.


2. What is Pandas?

Pandas is a Python library designed to make working with relational or labeled data easy and intuitive. It is designed for many different types of data:

  • Tabular data with heterogeneous column types, such as in an SQL table or an Excel spreadsheet.
  • Time series data, ordered and unordered (not necessarily of fixed frequency).
  • Arbitrary matrix data (homogeneous or heterogeneous) with row and column labels.
  • Any other form of observational/statistical dataset. The data does not need to be labeled at all in order to be placed into a Pandas data structure.

The two main data structures in Pandas are Series (one-dimensional) and DataFrames (two-dimensional). Pandas is built on top of NumPy and is designed to integrate well in a scientific computing environment with many other third-party libraries. For data scientists, working with data is usually divided into several stages: collecting and cleaning the data, analyzing/modeling it, and then organizing the results of the analysis in a form suitable for plotting or for display as tables. Pandas is the ideal tool for all of these tasks. Pandas is also fast: many of its low-level algorithms are written in C.

2.1. Pandas as pd

By convention, pandas is imported with the alias pd:

	
```python
import pandas as pd
print(pd.__version__)
```
	
1.0.1

3. Data Structures in Pandas

In Pandas there are two types of data structures: Series and DataFrames

3.1. Series

The Series data type is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). Its axis labels are collectively referred to as the index.

The most common way to create a Series is:

```python
series = pd.Series(data, index=index)
```

Where `data` can be:

  • A dictionary
  • A list or tuple
  • A NumPy `ndarray`
  • A scalar value

Since one of the possible data types is a NumPy ndarray, we import NumPy in order to use it

	
```python
import numpy as np
```

3.1.1. Series from a dictionary

	
```python
diccionario = {"b": 1, "a": 0, "c": 2}
serie = pd.Series(diccionario)
serie
```
	
b 1
a 0
c 2
dtype: int64

If an index is passed, the values corresponding to the index labels will be extracted. If they do not exist, they are created as NaN (not a number)

	
```python
diccionario = {"b": 1, "a": 0, "c": 2}
serie = pd.Series(diccionario, index=["b", "c", "d", "a"])
serie
```
	
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64

3.1.2. Series from a list or tuple

If the data comes from a list or tuple and no index is passed, one will be created with values [0, ..., len(data)-1]

	
```python
serie = pd.Series([1, 2, 3, 4])
serie
```
	
0 1
1 2
2 3
3 4
dtype: int64

If an index is passed, it must have the same length as the data

	
```python
serie = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
serie
```
	
a 1
b 2
c 3
d 4
dtype: int64

3.1.3. Series from an ndarray

If the data comes from an ndarray and no index is passed, one will be created with values [0, ..., len(data)-1]

	
```python
serie = pd.Series(np.random.randn(5))
serie
```
	
0 1.267865
1 -0.877857
2 -0.138556
3 -0.132987
4 -0.827295
dtype: float64

If an index is passed, it must have the same length as the data

	
```python
serie = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
serie
```
	
a -1.091828
b -0.584243
c 0.220398
d 1.248923
e 1.652351
dtype: float64

3.1.4. Series from a scalar

If the series is created from a scalar, it will be created with a single item

	
```python
serie = pd.Series(5.0)
serie
```
	
0 5.0
dtype: float64

If you want the series to have more items, pass an index with as many labels as items you want; all of them will take the value of the scalar

	
```python
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
serie
```
	
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64

3.1.5. Operations with Series

Just like with Numpy, we can perform operations on all elements of a series without having to iterate through each one of them.

	
```python
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(f"serie:\n{serie}")
print(f"\nserie + serie =\n{serie + serie}")
```
	
serie:
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64
serie + serie =
a 10.0
b 10.0
c 10.0
d 10.0
e 10.0
dtype: float64
	
```python
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(f"serie:\n{serie}")
print(f"\nexp(serie) =\n{np.exp(serie)}")
```
	
serie:
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64
exp(serie) =
a 148.413159
b 148.413159
c 148.413159
d 148.413159
e 148.413159
dtype: float64

A difference between Series and ndarrays is that operations between Series automatically align the data according to their labels. Therefore, calculations can be written without considering whether the involved Series have the same labels. If a label is not found in one Series or the other, the result will be marked as missing (NaN).

	
```python
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(f"serie:\n{serie}")
print(f"\nserie[1:] + serie[:-1] =\n{serie[1:] + serie[:-1]}")
```
	
serie:
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64
serie[1:] + serie[:-1] =
a NaN
b 10.0
c 10.0
d 10.0
e NaN
dtype: float64

3.1.6. The name attribute of the Series

One of the attributes of Series is name, which corresponds to the name they will have when added to a DataFrame. Conversely, when a series is obtained from a DataFrame, this series will have the same name it had in the DataFrame.

	
```python
serie = pd.Series(np.random.randn(5), name="aleatorio")
serie
```
	
0 -0.191009
1 -0.793151
2 -0.907747
3 -1.440508
4 -0.676419
Name: aleatorio, dtype: float64

The name of a series can be changed using the rename() method.

	
```python
serie = serie.rename("random")
serie
```
	
0 -0.191009
1 -0.793151
2 -0.907747
3 -1.440508
4 -0.676419
Name: random, dtype: float64

3.2. DataFrames

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types, meaning that one column can contain integer data, another column string data, and so on. You can think of it as a spreadsheet, an SQL table, or a dictionary of Series objects. It is the most used Pandas object. Just like Series, DataFrames accept many different types of input, as the following subsections show.

Along with the data, you can optionally pass index arguments (row labels) and columns (column labels). If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. Therefore, a dictionary of Series plus a specific index will discard all data that does not match the passed index.

If axis labels are not provided, they will be constructed from the input data based on common sense rules.

3.2.1. DataFrames from a Dictionary of Series

If a dictionary of Series is passed, the DataFrame will be created with as many columns as Series the dictionary has.

```python
diccionario = {
    "uno": pd.Series([1.0, 2.0, 3.0]),
    "dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}

dataframe = pd.DataFrame(diccionario)
dataframe
```
Out[87]:
uno dos
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 NaN 7.0

If each of the Series has defined indices, the resulting DataFrame will be the union of these indices.

```python
diccionario = {
    "uno": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "dos": pd.Series([4.0, 5.0, 6.0, 7.0], index=["a", "b", "c", "d"])
}

dataframe = pd.DataFrame(diccionario)
dataframe
```
Out[88]:
uno dos
a 1.0 4.0
b 2.0 5.0
c 3.0 6.0
d NaN 7.0
```python
dataframe = pd.DataFrame(diccionario, index=["d", "b", "a"])
dataframe
```
Out[89]:
uno dos
d NaN 7.0
b 2.0 5.0
a 1.0 4.0

If the columns are passed, they will appear in the passed order

```python
dataframe = pd.DataFrame(diccionario, columns=["dos", "tres"])
dataframe
```
Out[90]:
dos tres
a 4.0 NaN
b 5.0 NaN
c 6.0 NaN
d 7.0 NaN

3.2.2. DataFrames from a dictionary of ndarrays or lists

All ndarrays or lists must have the same length. If an index is passed, it must also have the same length as the ndarrays or lists.

```python
diccionario = {
    "uno": [1.0, 2.0, 3.0, 4.0],
    "dos": [4.0, 3.0, 2.0, 1.0]
}

dataframe = pd.DataFrame(diccionario)
dataframe
```
Out[91]:
uno dos
0 1.0 4.0
1 2.0 3.0
2 3.0 2.0
3 4.0 1.0

3.2.3. DataFrames from a matrix

If an index is passed, it must have the same length as the number of rows of the matrix, and if columns are passed, there must be as many as the matrix has columns.

```python
matriz = np.array([[1, 3], [2, 2], [3, 1]])

dataframe = pd.DataFrame(matriz, index=["a", "b", "c"], columns=["columna1", "columna2"])
dataframe
```
Out[92]:
columna1 columna2
a 1 3
b 2 2
c 3 1

3.2.4. DataFrames from a list of dictionaries

```python
lista = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]

dataframe = pd.DataFrame(lista)
dataframe
```
Out[93]:
a b c
0 1 2 NaN
1 5 10 20.0

3.2.5. DataFrames from a dictionary of tuples

```python
diccionario = {
    ("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
    ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
    ("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
    ("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
    ("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
}

dataframe = pd.DataFrame(diccionario)
dataframe
```
Out[94]:
a b
b a c a b
A B 1.0 4.0 5.0 8.0 10.0
C 2.0 3.0 6.0 7.0 NaN
D NaN NaN NaN NaN 9.0

3.2.6. DataFrames from a Series

The result will be a DataFrame with the same index as the input Series, and with a column whose name is the original name of the Series (only if another column name is not provided).

```python
diccionario = {"b": 1, "a": 0, "c": 2}

serie = pd.Series(diccionario)

dataframe = pd.DataFrame(serie)
dataframe
```
Out[95]:
0
b 1
a 0
c 2

4. Exploration of a DataFrame

When a DataFrame is very large, it cannot be fully represented.

```python
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train
```
Out[102]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0
... ... ... ... ... ... ... ... ... ...
16995 -124.26 40.58 52.0 2217.0 394.0 907.0 369.0 2.3571 111400.0
16996 -124.27 40.69 36.0 2349.0 528.0 1194.0 465.0 2.5179 79000.0
16997 -124.30 41.84 17.0 2677.0 531.0 1244.0 456.0 3.0313 103600.0
16998 -124.30 41.80 19.0 2672.0 552.0 1298.0 478.0 1.9797 85800.0
16999 -124.35 40.54 52.0 1820.0 300.0 806.0 270.0 3.0147 94600.0

17000 rows × 9 columns

So it's very useful to have methods for exploring it and getting information quickly.

4.1. DataFrame Header

To view the first rows and get an idea of what the DataFrame looks like, there is the head() method, which by default displays the first 5 rows of the DataFrame. If you want to see a different number of rows, specify it using the n parameter.

```python
california_housing_train.head(n=10)
```
Out[103]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0
5 -114.58 33.63 29.0 1387.0 236.0 671.0 239.0 3.3438 74000.0
6 -114.58 33.61 25.0 2907.0 680.0 1841.0 633.0 2.6768 82400.0
7 -114.59 34.83 41.0 812.0 168.0 375.0 158.0 1.7083 48500.0
8 -114.59 33.61 34.0 4789.0 1175.0 3134.0 1056.0 2.1782 58400.0
9 -114.60 34.83 46.0 1497.0 309.0 787.0 271.0 2.1908 48100.0

4.2. Tail of the DataFrame

If you want to see the last rows, you can use the tail() method. By using the n parameter, you can choose how many rows to display.

```python
california_housing_train.tail()
```
Out[104]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
16995 -124.26 40.58 52.0 2217.0 394.0 907.0 369.0 2.3571 111400.0
16996 -124.27 40.69 36.0 2349.0 528.0 1194.0 465.0 2.5179 79000.0
16997 -124.30 41.84 17.0 2677.0 531.0 1244.0 456.0 3.0313 103600.0
16998 -124.30 41.80 19.0 2672.0 552.0 1298.0 478.0 1.9797 85800.0
16999 -124.35 40.54 52.0 1820.0 300.0 806.0 270.0 3.0147 94600.0

4.3. DataFrame Information

Another very useful method is info() which gives us information about the DataFrame

	
```python
california_housing_train.info()
```
	
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 17000 non-null float64
1 latitude 17000 non-null float64
2 housing_median_age 17000 non-null float64
3 total_rooms 17000 non-null float64
4 total_bedrooms 17000 non-null float64
5 population 17000 non-null float64
6 households 17000 non-null float64
7 median_income 17000 non-null float64
8 median_house_value 17000 non-null float64
dtypes: float64(9)
memory usage: 1.2 MB

4.4. Rows and Columns of the DataFrame

The indices and columns of a DataFrame can be obtained through its index and columns attributes.

	
```python
diccionario = {
    "uno": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "dos": pd.Series([4.0, 5.0, 6.0, 7.0], index=["a", "b", "c", "d"])
}
dataframe = pd.DataFrame(diccionario)

indices = dataframe.index
columnas = dataframe.columns

print(f"El DataFrame tiene los índices\n {indices}\n")
print(f"El DataFrame tiene las columnas\n {columnas}")
```
	
El DataFrame tiene los índices
Index(['a', 'b', 'c', 'd'], dtype='object')
El DataFrame tiene las columnas
Index(['uno', 'dos'], dtype='object')

4.5. Description of the DataFrame

The describe() method displays a quick statistical summary of the data in the DataFrame

```python
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.describe()
```
Out[107]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000
mean -119.562108 35.625225 28.589353 2643.664412 539.410824 1429.573941 501.221941 3.883578 207300.912353
std 2.005166 2.137340 12.586937 2179.947071 421.499452 1147.852959 384.520841 1.908157 115983.764387
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.790000 33.930000 18.000000 1462.000000 297.000000 790.000000 282.000000 2.566375 119400.000000
50% -118.490000 34.250000 29.000000 2127.000000 434.000000 1167.000000 409.000000 3.544600 180400.000000
75% -118.000000 37.720000 37.000000 3151.250000 648.250000 1721.000000 605.250000 4.767000 265000.000000
max -114.310000 41.950000 52.000000 37937.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000

4.6. Sorting the DataFrame

The rows of a DataFrame can be sorted by their index using the sort_index() method.

```python
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.sort_index().head()
```
Out[108]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0

Since the rows were already sorted in this case, we set ascending=False to reverse the order.

```python
california_housing_train.sort_index(ascending=False).head()
```
Out[109]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
16999 -124.35 40.54 52.0 1820.0 300.0 806.0 270.0 3.0147 94600.0
16998 -124.30 41.80 19.0 2672.0 552.0 1298.0 478.0 1.9797 85800.0
16997 -124.30 41.84 17.0 2677.0 531.0 1244.0 456.0 3.0313 103600.0
16996 -124.27 40.69 36.0 2349.0 528.0 1194.0 465.0 2.5179 79000.0
16995 -124.26 40.58 52.0 2217.0 394.0 907.0 369.0 2.3571 111400.0

If you want to sort the columns, you need to set axis=1 since the default is 0.

```python
california_housing_train.sort_index(axis=1).head()
```
Out[110]:
households housing_median_age latitude longitude median_house_value median_income population total_bedrooms total_rooms
0 472.0 15.0 34.19 -114.31 66900.0 1.4936 1015.0 1283.0 5612.0
1 463.0 19.0 34.40 -114.47 80100.0 1.8200 1129.0 1901.0 7650.0
2 117.0 17.0 33.69 -114.56 85700.0 1.6509 333.0 174.0 720.0
3 226.0 14.0 33.64 -114.57 73400.0 3.1917 515.0 337.0 1501.0
4 262.0 20.0 33.57 -114.57 65500.0 1.9250 624.0 326.0 1454.0

If we want to sort the DataFrame by a specific column, we need to use the sort_values() method and specify the label of the column to sort by.

```python
california_housing_train.sort_values('median_house_value')
```
Out[111]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
568 -117.02 36.40 19.0 619.0 239.0 490.0 164.0 2.1000 14999.0
16643 -122.74 39.71 16.0 255.0 73.0 85.0 38.0 1.6607 14999.0
16801 -123.17 40.31 36.0 98.0 28.0 18.0 8.0 0.5360 14999.0
3226 -117.86 34.24 52.0 803.0 267.0 628.0 225.0 4.1932 14999.0
7182 -118.33 34.15 39.0 493.0 168.0 259.0 138.0 2.3667 17500.0
... ... ... ... ... ... ... ... ... ...
15834 -122.42 37.81 52.0 1314.0 317.0 473.0 250.0 4.3472 500001.0
7927 -118.40 33.87 38.0 2398.0 431.0 911.0 392.0 5.2319 500001.0
3546 -117.90 33.63 28.0 2370.0 352.0 832.0 347.0 7.1148 500001.0
7924 -118.40 33.88 35.0 1060.0 191.0 444.0 196.0 8.0015 500001.0
14011 -122.04 37.26 24.0 4973.0 709.0 1692.0 696.0 7.8627 500001.0

17000 rows × 9 columns
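sort_values() also accepts a list of columns and one ascending flag per column, which is handy for tie-breaking; a minimal sketch on the same dataset:

```python
# Sort by age and, within each age, by house value in descending order
california_housing_train.sort_values(
    ["housing_median_age", "median_house_value"],
    ascending=[True, False],
).head()
```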

4.7. DataFrame Statistics

Statistics can be obtained from the DataFrame, such as the mean, mode, and standard deviation.

	
```python
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
print(f"media:\n{california_housing_train.mean()}")
print(f"\n\ndesviación estandar:\n{california_housing_train.std()}")
```
	
media:
longitude -119.562108
latitude 35.625225
housing_median_age 28.589353
total_rooms 2643.664412
total_bedrooms 539.410824
population 1429.573941
households 501.221941
median_income 3.883578
median_house_value 207300.912353
dtype: float64
desviación estandar:
longitude 2.005166
latitude 2.137340
housing_median_age 12.586937
total_rooms 2179.947071
total_bedrooms 421.499452
population 1147.852959
households 384.520841
median_income 1.908157
median_house_value 115983.764387
dtype: float64
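The mode mentioned above can be obtained in the same way with the mode() method; a quick sketch (mode() may return several rows when there are ties, so we keep the first one):

```python
# Most frequent value of each column; row 0 holds the first mode
california_housing_train.mode().head(1)
```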

If you want to get statistics about the rows instead of the columns, you need to specify it using axis=1

	
```python
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
print(f"media:\n{california_housing_train.mean(axis=1)}")
print(f"\n\ndesviación estandar:\n{california_housing_train.std(axis=1)}")
```
	
media:
0 8357.597067
1 10131.527778
2 9664.642322
3 8435.029078
4 7567.436111
...
16995 12806.408567
16996 9276.770878
16997 12049.507922
16998 10082.053300
16999 10863.022744
Length: 17000, dtype: float64
desviación estandar:
0 22026.612445
1 26352.939272
2 28514.316588
3 24366.754747
4 21730.014569
...
16995 36979.676899
16996 26158.006771
16997 34342.876792
16998 28408.152329
16999 31407.119788
Length: 17000, dtype: float64

Another useful thing that can be obtained from DataFrames is, for example, the number of times each item in a column repeats.

	
california_housing_train["total_rooms"].value_counts()
Copy
	
1582.0 16
1527.0 15
1717.0 14
1471.0 14
1703.0 14
..
157.0 1
2760.0 1
458.0 1
10239.0 1
4068.0 1
Name: total_rooms, Length: 5533, dtype: int64

For example, we can see that there are a total of 16 houses with 1582 rooms.
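If relative frequencies are more useful than raw counts, value_counts() also accepts a normalize parameter; a small sketch:

```python
# Fraction of houses with each total_rooms value, instead of counts
california_housing_train["total_rooms"].value_counts(normalize=True).head()
```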

4.8. Memory Usage

We can see the memory used by the DataFrame

	
```python
california_housing_train.memory_usage(deep=True)
```
	
Index 128
longitude 136000
latitude 136000
housing_median_age 136000
total_rooms 136000
total_bedrooms 136000
population 136000
households 136000
median_income 136000
median_house_value 136000
dtype: int64

5. Addition of Data

5.1. Addition of Columns

Columns can be easily added as operations of other columns

```python
diccionario = {
    "uno": pd.Series([1.0, 2.0, 3.0]),
    "dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}

dataframe = pd.DataFrame(diccionario)

dataframe["tres"] = dataframe["uno"] + dataframe["dos"]
dataframe["flag"] = dataframe["tres"] > 7.0

dataframe
```
Out[ ]:
uno dos tres flag
0 1.0 4.0 5.0 False
1 2.0 5.0 7.0 False
2 3.0 6.0 9.0 True
3 NaN 7.0 NaN False

Columns can also be added by indicating the value that all of their items will have

dataframe["constante"] = 8.0
      dataframe
      
Out[ ]:
uno dos tres flag constante
0 1.0 4.0 5.0 False 8.0
1 2.0 5.0 7.0 False 8.0
2 3.0 6.0 9.0 True 8.0
3 NaN 7.0 NaN False 8.0

If a Series with fewer indices than the DataFrame is added, it is aligned to the indices of the DataFrame and the missing positions are filled with NaN.

dataframe["Menos indices"] = dataframe["uno"][:2]
      dataframe
      
Out[ ]:
uno dos tres flag constante Menos indices
0 1.0 4.0 5.0 False 8.0 1.0
1 2.0 5.0 7.0 False 8.0 2.0
2 3.0 6.0 9.0 True 8.0 NaN
3 NaN 7.0 NaN False 8.0 NaN

With the previous methods, the column was added at the end, but if you want to add the column at a specific position, you can use the insert() method. For example, to add a column at position 3 (keeping in mind that counting starts at position 0), named columna insertada and holding double the value of the column tres, you would do the following:

```python
dataframe.insert(loc=3, column="columna insertada", value=dataframe["tres"]*2)
dataframe
```
Out[ ]:
uno dos tres columna insertada flag constante Menos indices
0 1.0 4.0 5.0 10.0 False 8.0 1.0
1 2.0 5.0 7.0 14.0 False 8.0 2.0
2 3.0 6.0 9.0 18.0 True 8.0 NaN
3 NaN 7.0 NaN NaN False 8.0 NaN

If you want to add more than one column per command, you can use the assign() method.

```python
dataframe = dataframe.assign(
    columna_asignada1 = dataframe["uno"] * dataframe["tres"],
    columna_asignada2 = dataframe["dos"] * dataframe["tres"],
)
dataframe
```
Out[ ]:
uno dos tres columna insertada flag constante Menos indices columna_asignada1 columna_asignada2
0 1.0 4.0 5.0 10.0 False 8.0 1.0 5.0 20.0
1 2.0 5.0 7.0 14.0 False 8.0 2.0 14.0 35.0
2 3.0 6.0 9.0 18.0 True 8.0 NaN 27.0 54.0
3 NaN 7.0 NaN NaN False 8.0 NaN NaN NaN

5.2. Addition of Rows

```python
diccionario = {
    "uno": pd.Series([1.0, 2.0, 3.0]),
    "dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}

dataframe = pd.DataFrame(diccionario)

dataframe.head()
```
Out[20]:
uno dos
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 NaN 7.0

We can add a row at the end using the concat method (which we will cover in more detail later)

```python
diccionario = {
    "uno": [10.0],
    "dos": [20.0]
}

dataframe = pd.concat([dataframe, pd.DataFrame(diccionario)])
dataframe
```
Out[21]:
uno dos
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 NaN 7.0
0 10.0 20.0

We see that the row has been added at the end, but it has index zero, so we reset the indices using the reset_index(drop=True) method

```python
dataframe = dataframe.reset_index(drop=True)
dataframe
```

Out[23]:
uno dos
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 NaN 7.0
4 10.0 20.0

6. Data Deletion

6.1. Column Deletion

A specific column can be deleted using the pop() method.

dataframe.pop("constante")
      dataframe
      
Out[ ]:
uno dos tres columna insertada flag Menos indices columna_asignada1 columna_asignada2
0 1.0 4.0 5.0 10.0 False 1.0 5.0 20.0
1 2.0 5.0 7.0 14.0 False 2.0 14.0 35.0
2 3.0 6.0 9.0 18.0 True NaN 27.0 54.0
3 NaN 7.0 NaN NaN False NaN NaN NaN

Or using del

del dataframe["flag"]
      dataframe
      
Out[ ]:
uno dos tres columna insertada Menos indices columna_asignada1 columna_asignada2
0 1.0 4.0 5.0 10.0 1.0 5.0 20.0
1 2.0 5.0 7.0 14.0 2.0 14.0 35.0
2 3.0 6.0 9.0 18.0 NaN 27.0 54.0
3 NaN 7.0 NaN NaN NaN NaN NaN

6.2. Row Deletion

```python
diccionario = {
    "uno": pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]),
    "dos": pd.Series([11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]),
    "tres": pd.Series([21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0])
}

dataframe = pd.DataFrame(diccionario)

dataframe.head()
```
Out[49]:
uno dos tres
0 1.0 11.0 21.0
1 2.0 12.0 22.0
2 3.0 13.0 23.0
3 4.0 14.0 24.0
4 5.0 15.0 25.0

If we want to delete a row, we can use the drop method, specifying its position. For example, if we want to delete the row at position 1

```python
dataframe = dataframe.drop(1)
dataframe
```
Out[50]:
uno dos tres
0 1.0 11.0 21.0
2 3.0 13.0 23.0
3 4.0 14.0 24.0
4 5.0 15.0 25.0
5 6.0 16.0 26.0
6 7.0 17.0 27.0
7 8.0 18.0 28.0
8 9.0 19.0 29.0
9 10.0 20.0 30.0

If we want to delete the last row

```python
dataframe = dataframe.drop(len(dataframe)-1)
dataframe
```
Out[51]:
uno dos tres
0 1.0 11.0 21.0
2 3.0 13.0 23.0
3 4.0 14.0 24.0
4 5.0 15.0 25.0
5 6.0 16.0 26.0
6 7.0 17.0 27.0
7 8.0 18.0 28.0
9 10.0 20.0 30.0

If what we want is to delete a range of rows

```python
dataframe = dataframe.drop(range(2, 5))
dataframe
```
Out[52]:
uno dos tres
0 1.0 11.0 21.0
5 6.0 16.0 26.0
6 7.0 17.0 27.0
7 8.0 18.0 28.0
9 10.0 20.0 30.0

If what we want is to remove a specific set of rows

```python
dataframe = dataframe.drop([5, 7, 9])
dataframe
```
Out[53]:
uno dos tres
0 1.0 11.0 21.0
6 7.0 17.0 27.0

Just like when we added rows, we see that some indices have disappeared, so we reset the indices using the reset_index(drop=True) method

```python
dataframe = dataframe.reset_index(drop=True)
dataframe
```
Out[54]:
uno dos tres
0 1.0 11.0 21.0
1 7.0 17.0 27.0

7. Operations on DataFrames

Operations can be performed on DataFrames just as they can with Numpy.

```python
diccionario = {
    "uno": pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]),
    "dos": pd.Series([11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]),
    "tres": pd.Series([21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0])
}

dataframe = pd.DataFrame(diccionario)

dataframe.head()
```
Out[58]:
uno dos tres
0 1.0 11.0 21.0
1 2.0 12.0 22.0
2 3.0 13.0 23.0
3 4.0 14.0 24.0
4 5.0 15.0 25.0
dataframe[ ["uno", "dos", "tres"] ] * 2
      
Out[59]:
uno dos tres
0 2.0 22.0 42.0
1 4.0 24.0 44.0
2 6.0 26.0 46.0
3 8.0 28.0 48.0
4 10.0 30.0 50.0
5 12.0 32.0 52.0
6 14.0 34.0 54.0
7 16.0 36.0 56.0
8 18.0 38.0 58.0
9 20.0 40.0 60.0
```python
np.exp(dataframe[ ["uno", "dos", "tres"] ])
```
Out[60]:
uno dos tres
0 2.718282 5.987414e+04 1.318816e+09
1 7.389056 1.627548e+05 3.584913e+09
2 20.085537 4.424134e+05 9.744803e+09
3 54.598150 1.202604e+06 2.648912e+10
4 148.413159 3.269017e+06 7.200490e+10
5 403.428793 8.886111e+06 1.957296e+11
6 1096.633158 2.415495e+07 5.320482e+11
7 2980.957987 6.565997e+07 1.446257e+12
8 8103.083928 1.784823e+08 3.931334e+12
9 22026.465795 4.851652e+08 1.068647e+13

If more complex operations are desired, the apply() method can be used.

	
```python
dataframe = dataframe.apply(lambda x: x.max() - x.min())
dataframe
```
	
uno 9.0
dos 9.0
tres 9.0
dtype: float64

A lambda function has been applied because it is a simple function, but if we want to apply more complex functions, we can define and apply them.

	
```python
def funcion(x):
    if x < 10:
        return np.exp(x) - np.log(5*x) + np.sqrt(x)
    elif x < 20:
        return np.sin(x) + np.cos(x) + np.tan(x)
    else:
        return np.log(x) + np.log10(x) + np.log2(x)

dataframe = dataframe.apply(funcion)
dataframe
```
	
uno 8102.277265
dos 8102.277265
tres 8102.277265
dtype: float64

Using the apply method on a DataFrame is much faster than looping through each row and performing the operation.

```python
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.head()
```
Out[64]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0

Let's calculate the percentage of bedrooms out of the total number of rooms.

california_housing_train["percent_bedrooms"] = None
      %time california_housing_train["percent_bedrooms"] = california_housing_train.apply(lambda x: x["total_bedrooms"] / x["total_rooms"], axis=1)
      california_housing_train.head()
      
CPU times: user 309 ms, sys: 86 µs, total: 309 ms
      Wall time: 309 ms
      
Out[74]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value percent_bedrooms
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0 0.228617
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0 0.248497
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0 0.241667
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0 0.224517
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0 0.224209
california_housing_train["percent_bedrooms"] = None
      %time for i in range(len(california_housing_train)): california_housing_train["percent_bedrooms"][i] = california_housing_train["total_bedrooms"][i] / california_housing_train["total_rooms"][i]
      california_housing_train.head()
      
/home/wallabot/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
      A value is trying to be set on a copy of a slice from a DataFrame
      
      See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
        """Entry point for launching an IPython kernel.
      
CPU times: user 1.72 s, sys: 12 ms, total: 1.73 s
      Wall time: 1.72 s
      
Out[77]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value percent_bedrooms
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0 0.228617
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0 0.248497
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0 0.241667
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0 0.224517
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0 0.224209

The lambda function took about 300 ms, while the for loop took more than 1 second.
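For simple element-wise arithmetic like this, a vectorized expression over whole columns is faster still, since the loop runs inside NumPy rather than in Python; a minimal sketch:

```python
# Vectorized division of whole columns: no per-row Python code at all
california_housing_train["percent_bedrooms"] = (
    california_housing_train["total_bedrooms"] / california_housing_train["total_rooms"]
)
california_housing_train.head()
```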

8. Transpose

The transpose of a DataFrame can be obtained through its T attribute.

```python
diccionario = {
    "uno": pd.Series([1.0, 2.0, 3.0], index=["fila a", "fila b", "fila c"]),
    "dos": pd.Series([4.0, 5.0, 6.0], index=["fila a", "fila b", "fila c"])
}

dataframe = pd.DataFrame(diccionario)

dataframe["tres"] = dataframe["uno"] + dataframe["dos"]
dataframe["flag"] = dataframe["tres"] > 7.0

dataframe.T
```
Out[ ]:
fila a fila b fila c
uno 1 2 3
dos 4 5 6
tres 5 7 9
flag False False True

9. Conversion to NumPy

If you want to convert a Series or DataFrame to NumPy, you can use the to_numpy() method or the np.asarray() function.

```python
diccionario = {
    "uno": pd.Series([1.0, 2.0, 3.0], index=["fila a", "fila b", "fila c"]),
    "dos": pd.Series([4.0, 5.0, 6.0], index=["fila a", "fila b", "fila c"])
}

dataframe = pd.DataFrame(diccionario)

dataframe["tres"] = dataframe["uno"] + dataframe["dos"]
dataframe["flag"] = dataframe["tres"] > 7.0

dataframe
```
Out[ ]:
uno dos tres flag
fila a 1.0 4.0 5.0 False
fila b 2.0 5.0 7.0 False
fila c 3.0 6.0 9.0 True
	
```python
matriz_np = dataframe.to_numpy()
matriz_np
```
	
array([[1.0, 4.0, 5.0, False],
[2.0, 5.0, 7.0, False],
[3.0, 6.0, 9.0, True]], dtype=object)
	
```python
matriz_np = np.asarray(dataframe)
matriz_np
```
	
array([[1.0, 4.0, 5.0, False],
[2.0, 5.0, 7.0, False],
[3.0, 6.0, 9.0, True]], dtype=object)

This example is not the most appropriate, since it mixes numbers with booleans. As explained in the previous post, Matrix Calculus with NumPy, all elements of an ndarray must be of the same type, so NumPy converts them all to objects. To avoid this, we keep only the numeric columns and convert those to an ndarray.

	
```python
matriz_np = dataframe[ ["uno", "dos", "tres"] ].to_numpy()
matriz_np, matriz_np.dtype
```
	
(array([[1., 4., 5.],
[2., 5., 7.],
[3., 6., 9.]]), dtype('float64'))

Now it can be seen that an ndarray has been created where all the data are of type float

10. Reading data from external sources

One of the greatest strengths of Pandas is reading data from files, so you don't need to build a DataFrame by hand with the data you want to process; it can be read from a file instead. Just as DataFrames can be created from external files, they can also be saved to files, so you can build your own dataset, configure it the way you want, and save it to a file for later use. The following table shows the functions for reading and writing files in different formats.

|Format|File type|Reading function|Writing function|
|---|---|---|---|
|text|CSV|read_csv|to_csv|
|text|Fixed-Width Text File|read_fwf||
|text|JSON|read_json|to_json|
|text|HTML|read_html|to_html|
|text|Local clipboard|read_clipboard|to_clipboard|
|binary|MS Excel|read_excel|to_excel|
|binary|OpenDocument|read_excel||
|binary|HDF5 Format|read_hdf|to_hdf|
|binary|Feather Format|read_feather|to_feather|
|binary|Parquet Format|read_parquet|to_parquet|
|binary|ORC Format|read_orc||
|binary|Msgpack|read_msgpack|to_msgpack|
|binary|Stata|read_stata|to_stata|
|binary|SAS|read_sas||
|binary|SPSS|read_spss||
|binary|Python Pickle Format|read_pickle|to_pickle|
|SQL|SQL|read_sql|to_sql|
|SQL|Google BigQuery|read_gbq|to_gbq|
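As a minimal sketch of this API (the file name housing_subset.csv is just an example), a DataFrame can be written to disk and read back; index=False avoids saving the index as an extra column:

```python
# Save part of the dataset to CSV and load it again
california_housing_train.head(100).to_csv("housing_subset.csv", index=False)
subset = pd.read_csv("housing_subset.csv")
subset.info()
```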

11. Indexing in DataFrames

There are many ways to index into DataFrames.

```python
fechas = pd.date_range('1/1/2000', periods=8)
dataframe = pd.DataFrame(np.random.randn(8, 4), index=fechas, columns=['A', 'B', 'C', 'D'])

dataframe
```
Out[56]:
A B C D
2000-01-01 -0.869298 -0.210502 0.477938 0.912121
2000-01-02 -0.502425 -1.897287 -0.084122 -1.203818
2000-01-03 -0.204297 0.711485 -1.271802 -0.138120
2000-01-04 1.512985 0.726718 0.960902 0.433124
2000-01-05 0.724005 -2.127668 0.674417 -0.297808
2000-01-06 -0.217175 -0.621172 0.668163 0.170576
2000-01-07 0.352484 0.260515 -1.576033 1.263213
2000-01-08 -0.032928 -0.648313 -0.622509 0.231683

11.1. Column Indexing

To select columns within a DataFrame, you can do so by selecting the column between brackets [], or by indicating the column as if it were an attribute of the DataFrame.

	
```python
dataframe['A']
```
	
2000-01-01 0.813153
2000-01-02 -0.244584
2000-01-03 0.125729
2000-01-04 0.352275
2000-01-05 -2.050976
2000-01-06 -0.312296
2000-01-07 0.897837
2000-01-08 0.271403
Freq: D, Name: A, dtype: float64
	
```python
dataframe.A
```
	
2000-01-01 0.813153
2000-01-02 -0.244584
2000-01-03 0.125729
2000-01-04 0.352275
2000-01-05 -2.050976
2000-01-06 -0.312296
2000-01-07 0.897837
2000-01-08 0.271403
Freq: D, Name: A, dtype: float64

If specific columns are desired, they are passed in a list

```python
dataframe[ ['A', 'B'] ]
```
Out[ ]:
A B
2000-01-01 -0.341962 0.639913
2000-01-02 0.507258 -0.942036
2000-01-03 -1.463557 1.041023
2000-01-04 1.781058 -1.849352
2000-01-05 0.318897 -0.229218
2000-01-06 0.362064 -0.193479
2000-01-07 -0.084632 -0.112474
2000-01-08 0.739424 0.253191

11.2. Row Indexing by Position

A range of rows from a DataFrame can be selected as follows

```python
dataframe[0:3]
```
Out[ ]:
A B C D
2000-01-01 0.813153 -0.869356 0.934293 0.338644
2000-01-02 -0.244584 0.536352 0.322248 0.238903
2000-01-03 0.125729 2.046910 -0.877466 -0.710034

If you want to select a single row, you need to specify a range of rows that includes only that one. For example, if you want to select row number 1

```python
dataframe[1:2]
```
Out[ ]:
A B C D
2000-01-02 -0.244584 0.536352 0.322248 0.238903

Another method to select a row by its position is the iloc[] method

```python
dataframe.iloc[0:3]
```
Out[ ]:
A B C D
2000-01-01 0.813153 -0.869356 0.934293 0.338644
2000-01-02 -0.244584 0.536352 0.322248 0.238903
2000-01-03 0.125729 2.046910 -0.877466 -0.710034

If specific rows are desired, a list with their positions is passed.

```python
dataframe.iloc[ [0, 2, 4] ]
```
Out[ ]:
A B C D
2000-01-01 -0.341962 0.639913 0.765817 0.056692
2000-01-03 -1.463557 1.041023 -1.321715 2.822735
2000-01-05 0.318897 -0.229218 -1.095593 -0.186248

11.3. Row Label Indexing

To select a row by its labels, we can use the loc[] method.

	
```python
dataframe.loc['2000-01-01']
```
	
A 0.813153
B -0.869356
C 0.934293
D 0.338644
Name: 2000-01-01 00:00:00, dtype: float64

If you want to select a range of rows, you can index them using the colon :

```python
dataframe.loc['2000-01-01':'2000-01-03']
```
Out[ ]:
A B C D
2000-01-01 0.813153 -0.869356 0.934293 0.338644
2000-01-02 -0.244584 0.536352 0.322248 0.238903
2000-01-03 0.125729 2.046910 -0.877466 -0.710034

If specific rows are desired, they are passed through a list

```python
dataframe.loc[ ['2000-01-01', '2000-01-03', '2000-01-05'] ]
```
Out[ ]:
A B C D
2000-01-01 -0.341962 0.639913 0.765817 0.056692
2000-01-03 -1.463557 1.041023 -1.321715 2.822735
2000-01-05 0.318897 -0.229218 -1.095593 -0.186248

11.4. Selection of a Portion of the DataFrame by Positions

```python
dataframe.iloc[0:3, 0:2]
```
Out[ ]:
A B
2000-01-01 0.813153 -0.869356
2000-01-02 -0.244584 0.536352
2000-01-03 0.125729 2.046910

If specific rows and columns are desired, lists with the desired positions are passed.

```python
dataframe.iloc[ [0, 2, 4], [0, 2] ]
```
Out[ ]:
A C
2000-01-01 -0.341962 0.765817
2000-01-03 -1.463557 -1.321715
2000-01-05 0.318897 -1.095593

11.5. Selecting a Portion of the DataFrame Using Labels

```python
dataframe.loc['2000-01-01':'2000-01-03', 'A':'B']
```
Out[ ]:
A B
2000-01-01 0.813153 -0.869356
2000-01-02 -0.244584 0.536352
2000-01-03 0.125729 2.046910

If you want certain rows and columns, you pass lists with the desired labels

```python
dataframe.loc[ ['2000-01-01', '2000-01-03', '2000-01-05'], ['A', 'C'] ]
```
Out[ ]:
A C
2000-01-01 -0.341962 0.765817
2000-01-03 -1.463557 -1.321715
2000-01-05 0.318897 -1.095593

11.6. Lambda Function Indexing

Data can be selected from a DataFrame that meets a condition given by a lambda function.

```python
dataframe.loc[lambda dataframe: 2*dataframe['A'] + 5*np.exp(dataframe['B']) > 0.2]
```
Out[58]:
A B C D
2000-01-01 -0.869298 -0.210502 0.477938 0.912121
2000-01-03 -0.204297 0.711485 -1.271802 -0.138120
2000-01-04 1.512985 0.726718 0.960902 0.433124
2000-01-05 0.724005 -2.127668 0.674417 -0.297808
2000-01-06 -0.217175 -0.621172 0.668163 0.170576
2000-01-07 0.352484 0.260515 -1.576033 1.263213
2000-01-08 -0.032928 -0.648313 -0.622509 0.231683

As can be seen, this form of indexing is very powerful

11.7. Conditional Indexing

If we don't need complex functions for indexing, but only conditionals, we can do

```python
dataframe[dataframe['A'] > 0.2]
```
Out[61]:
A B C D
2000-01-04 1.512985 0.726718 0.960902 0.433124
2000-01-05 0.724005 -2.127668 0.674417 -0.297808
2000-01-07 0.352484 0.260515 -1.576033 1.263213

We can also combine several conditions

```python
dataframe[(dataframe['A'] > 0.2) & (dataframe['B'] > 0.2)]
```
Out[69]:
A B C D
2000-01-04 1.512985 0.726718 0.960902 0.433124
2000-01-07 0.352484 0.260515 -1.576033 1.263213
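For simple conditions like these, the query() method is an equivalent and often more readable alternative, which takes the condition as a string; a sketch:

```python
# Same filter as above, written as a query string
dataframe.query("A > 0.2 and B > 0.2")
```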

11.8. Random Indexing

By using the sample() method, we will obtain a random row from the DataFrame.

```python
dataframe.sample()
```
Out[ ]:
A B C D
2000-01-06 -0.312296 0.129097 -0.991085 1.704535

If we want more than one sample, we indicate it with the n parameter

```python
dataframe.sample(n=3)
```
Out[ ]:
A B C D
2000-01-08 0.271403 1.527116 0.144970 1.175728
2000-01-01 0.813153 -0.869356 0.934293 0.338644
2000-01-03 0.125729 2.046910 -0.877466 -0.710034

If random columns are desired, this must be indicated using axis=1

```python
dataframe.sample(axis=1)
```
Out[ ]:
D
2000-01-01 0.338644
2000-01-02 0.238903
2000-01-03 -0.710034
2000-01-04 0.504410
2000-01-05 -1.601926
2000-01-06 1.704535
2000-01-07 -0.584860
2000-01-08 1.175728

If a single item from the DataFrame is desired, the sample() method has to be called twice.

```python
dataframe.sample(axis=1).sample()
```
Out[ ]:
D
2000-01-05 -1.601926
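Since sample() draws at random, each call returns different rows; if reproducibility matters, the random_state parameter fixes the seed. A small sketch:

```python
# The same "random" rows on every run, thanks to the fixed seed
dataframe.sample(n=3, random_state=42)
```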

12. Merging DataFrames

12.1. Concatenation of DataFrames

To concatenate multiple DataFrames we use the concat() method, where a list of the DataFrames to be joined is passed.

	
```python
dataframe1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    })

dataframe2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    })

dataframe3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    })

dataframe = pd.concat([dataframe1, dataframe2, dataframe3])

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"dataframe3:\n{dataframe3}")
print(f"\ndataframe:\n{dataframe}")
```
	
dataframe1:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
dataframe2:
A B C D
0 A4 B4 C4 D4
1 A5 B5 C5 D5
2 A6 B6 C6 D6
3 A7 B7 C7 D7
dataframe3:
A B C D
0 A8 B8 C8 D8
1 A9 B9 C9 D9
2 A10 B10 C10 D10
3 A11 B11 C11 D11
dataframe:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
0 A4 B4 C4 D4
1 A5 B5 C5 D5
2 A6 B6 C6 D6
3 A7 B7 C7 D7
0 A8 B8 C8 D8
1 A9 B9 C9 D9
2 A10 B10 C10 D10
3 A11 B11 C11 D11

As can be seen, the indices 0, 1, 2, and 3 are repeated because each dataframe has those indices. To prevent this, you should use the parameter ignore_index=True.

	
```python
dataframe = pd.concat([dataframe1, dataframe2, dataframe3], ignore_index=True)

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"dataframe3:\n{dataframe3}")
print(f"\ndataframe:\n{dataframe}")
```
	
dataframe1:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
dataframe2:
A B C D
0 A4 B4 C4 D4
1 A5 B5 C5 D5
2 A6 B6 C6 D6
3 A7 B7 C7 D7
dataframe3:
A B C D
0 A8 B8 C8 D8
1 A9 B9 C9 D9
2 A10 B10 C10 D10
3 A11 B11 C11 D11
dataframe:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11


If the concatenation were intended to be performed along the columns, the parameter axis=1 would have to be used.

	
```python
dataframe = pd.concat([dataframe1, dataframe2, dataframe3], axis=1)

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"dataframe3:\n{dataframe3}")
print(f"\ndataframe:\n{dataframe}")
```
	
dataframe1:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
dataframe2:
A B C D
0 A4 B4 C4 D4
1 A5 B5 C5 D5
2 A6 B6 C6 D6
3 A7 B7 C7 D7
dataframe3:
A B C D
0 A8 B8 C8 D8
1 A9 B9 C9 D9
2 A10 B10 C10 D10
3 A11 B11 C11 D11
dataframe:
A B C D A B C D A B C D
0 A0 B0 C0 D0 A4 B4 C4 D4 A8 B8 C8 D8
1 A1 B1 C1 D1 A5 B5 C5 D5 A9 B9 C9 D9
2 A2 B2 C2 D2 A6 B6 C6 D6 A10 B10 C10 D10
3 A3 B3 C3 D3 A7 B7 C7 D7 A11 B11 C11 D11

12.1.1. Intersection of Concatenation

There are two ways to perform the concatenation: taking all the indices of the DataFrames or only those that match. This is controlled by the join parameter, which accepts the value 'outer' (the default, which takes all the indices) or 'inner' (only the ones that match).

Let's see an example of 'outer'

	
```python
dataframe1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3])

dataframe4 = pd.DataFrame(
    {
        "B": ["B2", "B3", "B6", "B7"],
        "D": ["D2", "D3", "D6", "D7"],
        "F": ["F2", "F3", "F6", "F7"],
    },
    index=[2, 3, 6, 7])

dataframe = pd.concat([dataframe1, dataframe4], axis=1)

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe4}")
print(f"\ndataframe:\n{dataframe}")
```
	
dataframe1:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
dataframe2:
B D F
2 B2 D2 F2
3 B3 D3 F3
6 B6 D6 F6
7 B7 D7 F7
dataframe:
A B C D B D F
0 A0 B0 C0 D0 NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN
2 A2 B2 C2 D2 B2 D2 F2
3 A3 B3 C3 D3 B3 D3 F3
6 NaN NaN NaN NaN B6 D6 F6
7 NaN NaN NaN NaN B7 D7 F7


Let's see an example of 'inner'

	
```python
dataframe = pd.concat([dataframe1, dataframe4], axis=1, join="inner")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe4}")
print(f"\ndataframe:\n{dataframe}")
```
	
dataframe1:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
dataframe2:
B D F
2 B2 D2 F2
3 B3 D3 F3
6 B6 D6 F6
7 B7 D7 F7
dataframe:
A B C D B D F
2 A2 B2 C2 D2 B2 D2 F2
3 A3 B3 C3 D3 B3 D3 F3


12.2. Merge of DataFrames

We previously created a new dataframe by concatenating several dataframes. Now we can complete one dataframe with another using merge, passing the on parameter to indicate the column on which to merge.

	
```python
dataframe1 = pd.DataFrame(
    {
        "Key": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    })

dataframe2 = pd.DataFrame(
    {
        "Key": ["K0", "K1", "K2", "K3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    })

dataframe = dataframe1.merge(dataframe2, on="Key")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
```
	
dataframe1:
Key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
dataframe2:
Key C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
3 K3 C3 D3
dataframe:
Key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3

In this case, both dataframes had a key with the same name (Key), but if we have dataframes where their keys are named differently, we can use the left_on and right_on parameters.

	
```python
dataframe1 = pd.DataFrame(
    {
        "Key1": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    })

dataframe2 = pd.DataFrame(
    {
        "Key2": ["K0", "K1", "K2", "K3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    })

dataframe = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
```
	
dataframe1:
Key1 A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
dataframe2:
Key2 C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
3 K3 C3 D3
dataframe:
Key1 A B Key2 C D
0 K0 A0 B0 K0 C0 D0
1 K1 A1 B1 K1 C1 D1
2 K2 A2 B2 K2 C2 D2
3 K3 A3 B3 K3 C3 D3

When one of the keys does not match, by default the corresponding row is left out of the merge.

	
```python
dataframe1 = pd.DataFrame(
    {
        "Key1": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    })

dataframe2 = pd.DataFrame(
    {
        "Key2": ["K0", "K1", "K2", np.nan],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    })

dataframe = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
```
	
dataframe1:
Key1 A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
dataframe2:
Key2 C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
3 NaN C3 D3
dataframe:
Key1 A B Key2 C D
0 K0 A0 B0 K0 C0 D0
1 K1 A1 B1 K1 C1 D1
2 K2 A2 B2 K2 C2 D2

To change this behavior, we can use the how parameter, which by default has the value inner, but we can pass it the values left, right, and outer.

	
```python
dataframe1 = pd.DataFrame(
    {
        "Key1": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    })

dataframe2 = pd.DataFrame(
    {
        "Key2": ["K0", "K1", "K2", np.nan],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    })

dataframe_inner = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="inner")
dataframe_left = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="left")
dataframe_right = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="right")
dataframe_outer = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="outer")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe inner:\n{dataframe_inner}")
print(f"\ndataframe left:\n{dataframe_left}")
print(f"\ndataframe right:\n{dataframe_right}")
print(f"\ndataframe outer:\n{dataframe_outer}")
```
	
dataframe1:
Key1 A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
dataframe2:
Key2 C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
3 NaN C3 D3
dataframe inner:
Key1 A B Key2 C D
0 K0 A0 B0 K0 C0 D0
1 K1 A1 B1 K1 C1 D1
2 K2 A2 B2 K2 C2 D2
dataframe left:
Key1 A B Key2 C D
0 K0 A0 B0 K0 C0 D0
1 K1 A1 B1 K1 C1 D1
2 K2 A2 B2 K2 C2 D2
3 K3 A3 B3 NaN NaN NaN
dataframe right:
Key1 A B Key2 C D
0 K0 A0 B0 K0 C0 D0
1 K1 A1 B1 K1 C1 D1
2 K2 A2 B2 K2 C2 D2
3 NaN NaN NaN NaN C3 D3
dataframe outer:
Key1 A B Key2 C D
0 K0 A0 B0 K0 C0 D0
1 K1 A1 B1 K1 C1 D1
2 K2 A2 B2 K2 C2 D2
3 K3 A3 B3 NaN NaN NaN
4 NaN NaN NaN NaN C3 D3

As can be seen, when left is chosen all rows of the left dataframe are kept, and when right is chosen, those of the right dataframe.

12.3. Join of dataframes

The last tool for joining dataframes is join. It is similar to merge, except that instead of looking for similarities based on specified columns, it looks for them based on the indices.

	
```python
dataframe1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    },
    index=["K0", "K1", "K2", "K3"])

dataframe2 = pd.DataFrame(
    {
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=["K0", "K1", "K2", "K3"])

dataframe = dataframe1.join(dataframe2)

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
```
	
dataframe1:
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
K3 A3 B3
dataframe2:
C D
K0 C0 D0
K1 C1 D1
K2 C2 D2
K3 C3 D3
dataframe:
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 C1 D1
K2 A2 B2 C2 D2
K3 A3 B3 C3 D3

In this case the indices are the same, but when they differ we can specify how to join the dataframes using the how parameter, which for join defaults to left and can also take the values right, inner, or outer.

	
```python
dataframe1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    },
    index=["K0", "K1", "K2", "K3"])

dataframe2 = pd.DataFrame(
    {
        "C": ["C0", "C2", "C3", "C4"],
        "D": ["D0", "D2", "D3", "D4"],
    },
    index=["K0", "K2", "K3", "K4"])

dataframe_inner = dataframe1.join(dataframe2, how="inner")
dataframe_left = dataframe1.join(dataframe2, how="left")
dataframe_right = dataframe1.join(dataframe2, how="right")
dataframe_outer = dataframe1.join(dataframe2, how="outer")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe inner:\n{dataframe_inner}")
print(f"\ndataframe left:\n{dataframe_left}")
print(f"\ndataframe right:\n{dataframe_right}")
print(f"\ndataframe outer:\n{dataframe_outer}")
```
	
dataframe1:
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
K3 A3 B3
dataframe2:
C D
K0 C0 D0
K2 C2 D2
K3 C3 D3
K4 C4 D4
dataframe inner:
A B C D
K0 A0 B0 C0 D0
K2 A2 B2 C2 D2
K3 A3 B3 C3 D3
dataframe left:
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 A3 B3 C3 D3
dataframe right:
A B C D
K0 A0 B0 C0 D0
K2 A2 B2 C2 D2
K3 A3 B3 C3 D3
K4 NaN NaN C4 D4
dataframe outer:
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 A3 B3 C3 D3
K4 NaN NaN C4 D4

13. Missing data (NaN)

In a DataFrame there can be missing data; Pandas represents it as np.nan

```python
diccionario = {
    "uno": pd.Series([1.0, 2.0, 3.0]),
    "dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}

dataframe = pd.DataFrame(diccionario)
dataframe
```
Out[ ]:
uno dos
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 NaN 7.0

13.1. Removal of Rows with Missing Data

To avoid having rows with missing data, these can be removed.

dataframe.dropna(how="any")
      
Out[ ]:
uno dos
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0

13.2. Dropping Columns with Missing Data

```python
dataframe.dropna(axis=1, how='any')
```
Out[ ]:
dos
0 4.0
1 5.0
2 6.0
3 7.0

13.3. Boolean mask with missing positions

```python
pd.isna(dataframe)
```
Out[ ]:
uno dos
0 False False
1 False False
2 False False
3 True False
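A common follow-up to this mask is counting the missing values per column by summing it, since True counts as 1:

```python
# Number of NaNs in each column
dataframe.isna().sum()
```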

13.4. Filling Missing Data

```python
dataframe.fillna(value=5.5, inplace=True)
dataframe
```
Out[ ]:
uno dos
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 5.5 7.0

Tip: by setting the parameter inplace=True, the DataFrame being operated on is modified in place, so there is no need to write dataframe = dataframe.fillna(value=5.5)
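In other words, the two forms below are equivalent; a minimal sketch:

```python
# Equivalent ways of filling missing values
dataframe = dataframe.fillna(value=5.5)    # reassign the result
dataframe.fillna(value=5.5, inplace=True)  # or modify in place
```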

14. Time series

Pandas offers the possibility of working with time series. For example, we create a Series of 100 random data points every second starting from 01/01/2021

	
```python
indices = pd.date_range("1/1/2021", periods=100, freq="S")
datos = np.random.randint(0, 500, len(indices))
serie_temporal = pd.Series(datos, index=indices)
serie_temporal
```
	
2021-01-01 00:00:00 241
2021-01-01 00:00:01 14
2021-01-01 00:00:02 190
2021-01-01 00:00:03 407
2021-01-01 00:00:04 94
...
2021-01-01 00:01:35 275
2021-01-01 00:01:36 56
2021-01-01 00:01:37 448
2021-01-01 00:01:38 151
2021-01-01 00:01:39 316
Freq: S, Length: 100, dtype: int64

This Pandas functionality is very powerful, for example, we can have a dataset at certain hours of one time zone and change them to another time zone.

	
```python
horas = pd.date_range("3/6/2021 00:00", periods=10, freq="H")
datos = np.random.randn(len(horas))
serie_horaria = pd.Series(datos, horas)
serie_horaria
```
	
2021-03-06 00:00:00 -0.853524
2021-03-06 01:00:00 -1.355372
2021-03-06 02:00:00 -1.267503
2021-03-06 03:00:00 -1.155787
2021-03-06 04:00:00 0.730935
2021-03-06 05:00:00 1.435957
2021-03-06 06:00:00 0.460912
2021-03-06 07:00:00 0.723451
2021-03-06 08:00:00 -0.853337
2021-03-06 09:00:00 0.456359
Freq: H, dtype: float64

We localize the data to a time zone

	
```python
serie_horaria_utc = serie_horaria.tz_localize("UTC")
serie_horaria_utc
```
	
2021-03-06 00:00:00+00:00 -0.853524
2021-03-06 01:00:00+00:00 -1.355372
2021-03-06 02:00:00+00:00 -1.267503
2021-03-06 03:00:00+00:00 -1.155787
2021-03-06 04:00:00+00:00 0.730935
2021-03-06 05:00:00+00:00 1.435957
2021-03-06 06:00:00+00:00 0.460912
2021-03-06 07:00:00+00:00 0.723451
2021-03-06 08:00:00+00:00 -0.853337
2021-03-06 09:00:00+00:00 0.456359
Freq: H, dtype: float64

And now we can convert them to another time zone

	
```python
serie_horaria_US = serie_horaria_utc.tz_convert("US/Eastern")
serie_horaria_US
```
	
2021-03-05 19:00:00-05:00 -0.853524
2021-03-05 20:00:00-05:00 -1.355372
2021-03-05 21:00:00-05:00 -1.267503
2021-03-05 22:00:00-05:00 -1.155787
2021-03-05 23:00:00-05:00 0.730935
2021-03-06 00:00:00-05:00 1.435957
2021-03-06 01:00:00-05:00 0.460912
2021-03-06 02:00:00-05:00 0.723451
2021-03-06 03:00:00-05:00 -0.853337
2021-03-06 04:00:00-05:00 0.456359
Freq: H, dtype: float64

15. Categorical Data

Pandas offers the possibility of adding categorical data in a DataFrame. Suppose the following DataFrame

```python
dataframe = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)

dataframe
```

```
   id raw_grade
0   1         a
1   2         b
2   3         b
3   4         a
4   5         a
5   6         e
```

We can convert the data in the raw_grade column to categorical data using the astype() method.

```python
dataframe['grade'] = dataframe["raw_grade"].astype("category")

dataframe
```

```
   id raw_grade grade
0   1         a     a
1   2         b     b
2   3         b     b
3   4         a     a
4   5         a     a
5   6         e     e
```

The raw_grade and grade columns look identical, but if we look at the DataFrame's info we can see that they are not.

```python
dataframe.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   id         6 non-null      int64
 1   raw_grade  6 non-null      object
 2   grade      6 non-null      category
dtypes: category(1), int64(1), object(1)
memory usage: 334.0+ bytes
```

We can see that the grade column is of type category.

We can check the categories of a categorical column through the cat.categories attribute

dataframe["grade"].cat.categories
Copy
	
Index(['a', 'b', 'e'], dtype='object')

We can also rename the categories by assigning a list with the new names to this attribute.

dataframe["grade"].cat.categories = ["very good", "good", "very bad"]
      dataframe
      
Out[7]:
id raw_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 e very bad
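Note that in recent versions of pandas, assigning directly to cat.categories is no longer allowed; the supported equivalent is the rename_categories method. A sketch of the same renaming in that style (reusing the dataframe above):

```python
# Equivalent renaming that also works in modern pandas versions
dataframe["grade"] = dataframe["grade"].cat.rename_categories(
    ["very good", "good", "very bad"]
)
```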

Pandas also gives us the possibility to one-hot encode categorical data using the get_dummies method.

```python
pd.get_dummies(dataframe["grade"])
```

```
   very good  good  very bad
0          1     0         0
1          0     1         0
2          0     1         0
3          1     0         0
4          1     0         0
5          0     0         1
```

16. Groupby

We can group a DataFrame by the values of one of its columns. Let's reload the DataFrame with the California house values.

```python
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.head()
```

```
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
0    -114.31     34.19                15.0       5612.0          1283.0      1015.0       472.0         1.4936             66900.0
1    -114.47     34.40                19.0       7650.0          1901.0      1129.0       463.0         1.8200             80100.0
2    -114.56     33.69                17.0        720.0           174.0       333.0       117.0         1.6509             85700.0
3    -114.57     33.64                14.0       1501.0           337.0       515.0       226.0         3.1917             73400.0
4    -114.57     33.57                20.0       1454.0           326.0       624.0       262.0         1.9250             65500.0
```

Now we can group the data by one of the columns. For example, let's group the houses by age and use count to see how many houses there are of each age

```python
california_housing_train.groupby("housing_median_age").count().head()
```

```
                    longitude  latitude  total_rooms  total_bedrooms  population  households  median_income  median_house_value
housing_median_age
1.0                         2         2            2               2           2           2              2                   2
2.0                        49        49           49              49          49          49             49                  49
3.0                        46        46           46              46          46          46             46                  46
4.0                       161       161          161             161         161         161            161                 161
5.0                       199       199          199             199         199         199            199                 199
```
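As a side note (my own sketch, not from the original): since count repeats the same number in every column, size returns that per-group row count directly as a single Series:

```python
import pandas as pd

california_housing_train = pd.read_csv(
    "https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv"
)

# Number of houses of each age, without repeating the count per column
california_housing_train.groupby("housing_median_age").size().head()
```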

As we can see, every column shows the same value, the number of houses of a given age. Instead of the count, we can get the average value of each column with mean

```python
california_housing_train.groupby("housing_median_age").mean().head()
```

```
                     longitude   latitude  total_rooms  total_bedrooms   population   households  median_income  median_house_value
housing_median_age
1.0                -121.465000  37.940000  2158.000000      335.500000   637.000000   190.000000       4.756800       190250.000000
2.0                -119.035306  35.410816  5237.102041      871.448980  2005.224490   707.122449       5.074237       229438.836735
3.0                -118.798478  35.164783  6920.326087     1190.826087  2934.673913  1030.413043       5.572013       239450.043478
4.0                -118.805093  34.987764  6065.614907     1068.192547  2739.956522   964.291925       5.196055       230054.105590
5.0                -118.789497  35.095327  4926.261307      910.924623  2456.979899   826.768844       4.732460       211035.708543
```

We can obtain several measures for each age using the agg (aggregate) method, passing it a list of the measures we want. For example, let's look at the minimum, maximum, and mean of each column for each age:

```python
california_housing_train.groupby("housing_median_age").agg(['min', 'max', 'mean']).head()
```

```
                   longitude                      latitude                    total_rooms                        total_bedrooms  ...   population  households                      median_income                      median_house_value
                         min     max        mean      min    max       mean           min      max         mean            min  ...         mean         min     max        mean            min      max      mean                 min       max           mean
housing_median_age                                                                                                              ...
1.0                  -122.00 -120.93 -121.465000    37.65  38.23  37.940000        2062.0   2254.0  2158.000000          328.0  ...   637.000000       112.0   268.0  190.000000         4.2500   5.2636  4.756800            189200.0  191300.0  190250.000000
2.0                  -122.51 -115.80 -119.035306    33.16  40.58  35.410816          96.0  21897.0  5237.102041           18.0  ...  2005.224490        16.0  2873.0  707.122449         1.9667  10.1531  5.074237             47500.0  500001.0  229438.836735
3.0                  -122.33 -115.60 -118.798478    32.87  38.77  35.164783         475.0  21060.0  6920.326087          115.0  ...  2934.673913       123.0  3112.0  1030.413043        2.1187  11.5199  5.572013             83200.0  500001.0  239450.043478
4.0                  -122.72 -116.76 -118.805093    32.65  39.00  34.987764           2.0  37937.0  6065.614907            2.0  ...  2739.956522         2.0  5189.0  964.291925         0.5360  13.4883  5.196055             42500.0  500001.0  230054.105590
5.0                  -122.55 -115.55 -118.789497    32.55  40.60  35.095327         111.0  25187.0  4926.261307           21.0  ...  2456.979899        20.0  3886.0  826.768844         0.7526  12.6320  4.732460             50000.0  500001.0  211035.708543

[5 rows x 24 columns]
```

We can specify which calculations to perform on which columns by passing a dictionary whose keys are the columns to operate on and whose values are lists of the calculations.

```python
california_housing_train.groupby("housing_median_age").agg({'total_rooms': ['min', 'max', 'mean'], 'total_bedrooms': ['min', 'max', 'mean', 'median']}).head()
```

```
                   total_rooms                        total_bedrooms
                           min      max         mean            min     max         mean  median
housing_median_age
1.0                     2062.0   2254.0  2158.000000          328.0   343.0   335.500000   335.5
2.0                       96.0  21897.0  5237.102041           18.0  3513.0   871.448980   707.0
3.0                      475.0  21060.0  6920.326087          115.0  3559.0  1190.826087   954.0
4.0                        2.0  37937.0  6065.614907            2.0  5471.0  1068.192547   778.0
5.0                      111.0  25187.0  4926.261307           21.0  4386.0   910.924623   715.0
```
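Since pandas 0.25 there is also named aggregation, which produces flat column names instead of the two-level headers above. A sketch (my own example, with made-up output column names) covering part of the same call:

```python
# Each keyword becomes a flat output column: name=(source_column, function)
california_housing_train.groupby("housing_median_age").agg(
    min_rooms=("total_rooms", "min"),
    max_rooms=("total_rooms", "max"),
    mean_bedrooms=("total_bedrooms", "mean"),
).head()
```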

We can group by more than one column; to do so, we pass the columns in a list

```python
california_housing_train.groupby(["housing_median_age", "total_bedrooms"]).mean()
```

```
                                   longitude  latitude  total_rooms  population  households  median_income  median_house_value
housing_median_age total_bedrooms
1.0                328.0             -120.93     37.65       2254.0       402.0       112.0         4.2500            189200.0
                   343.0             -122.00     38.23       2062.0       872.0       268.0         5.2636            191300.0
2.0                18.0              -115.80     33.26         96.0        30.0        16.0         5.3374             47500.0
                   35.0              -121.93     37.78        227.0       114.0        49.0         3.1591            434700.0
                   55.0              -117.27     33.93        337.0       115.0        49.0         3.1042            164800.0
...                                      ...       ...          ...         ...         ...            ...                 ...
52.0               1360.0            -118.35     34.06       3446.0      1768.0      1245.0         2.4722            500001.0
                   1535.0            -122.41     37.80       3260.0      3260.0      1457.0         0.9000            500001.0
                   1944.0            -118.25     34.05       2806.0      2232.0      1605.0         0.6775            350000.0
                   2509.0            -122.41     37.79       6016.0      3436.0      2119.0         2.5166            275000.0
                   2747.0            -122.41     37.79       5783.0      4518.0      2538.0         1.7240            225000.0

[13394 rows x 7 columns]
```
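The result of grouping by several columns carries a MultiIndex. When a flat table is more convenient, reset_index() turns the group keys back into ordinary columns (a sketch assuming the grouping above):

```python
# Turn the two group keys back into regular columns
california_housing_train.groupby(
    ["housing_median_age", "total_bedrooms"]
).mean().reset_index()
```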

17. Plots

Pandas can plot the data in our DataFrames, giving us a better view of it. For this it uses the matplotlib library, which we will cover in the next post.
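One practical detail worth noting (not covered in the original): in a plain Python script, unlike a notebook, matplotlib must be imported and the figure shown explicitly. A minimal sketch of that boilerplate:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

serie = pd.Series(np.random.randn(100)).cumsum()
serie.plot()
plt.show()  # needed in a script; notebooks render the figure automatically
```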

17.1. Basic Plot

To represent the data in a chart, the easiest way is to use the plot() method.

```python
serie = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
serie = serie.cumsum()

serie.plot()
```
image pandas 1

If we have a DataFrame, the plot() method will plot each of its columns

```python
# Note: the index of the previous Series is reused here
dataframe = pd.DataFrame(
    np.random.randn(1000, 4), index=serie.index, columns=["A", "B", "C", "D"]
)
dataframe = dataframe.cumsum()

dataframe.plot()
```
image pandas 2

17.2. Vertical Bar Chart

There are more methods to create charts, such as the vertical bar chart using plot.bar()

```python
dataframe = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])
dataframe.plot.bar()
```
image pandas 3

If we want to stack the bars, we indicate this via the parameter stacked=True

```python
dataframe.plot.bar(stacked=True)
```
image pandas 4

17.3. Horizontal Bar Chart

To create a horizontal bar chart we use plot.barh()

```python
dataframe.plot.barh()
```
image pandas 5

If we want to stack the bars, we indicate this via the parameter stacked=True

```python
dataframe.plot.barh(stacked=True)
```
image pandas 6

17.4. Histogram

To create a histogram we use plot.hist()

```python
dataframe = pd.DataFrame(
    {
        "a": np.random.randn(1000) + 1,
        "b": np.random.randn(1000),
        "c": np.random.randn(1000) - 1,
    }
)

dataframe.plot.hist(alpha=0.5)
```
image pandas 7

If we want to stack the bars, we indicate this via the parameter stacked=True

```python
dataframe.plot.hist(alpha=0.5, stacked=True)
```
image pandas 8

If we want more bars, that is, a finer-grained and more informative histogram, we set the number of bins via the bins parameter.

```python
dataframe.plot.hist(alpha=0.5, stacked=True, bins=20)
```
image pandas 9

17.5. Box Plots

To create a box plot we use plot.box()

```python
dataframe = pd.DataFrame(np.random.rand(10, 5), columns=["A", "B", "C", "D", "E"])

dataframe.plot.box()
```
image pandas 10

17.6. Area Charts

To create an area chart we use plot.area()

```python
dataframe.plot.area()
```
image pandas 11

17.7. Scatter plot

To create a scatter plot we use plot.scatter(), where you need to specify the x and y variables of the plot.

```python
dataframe.plot.scatter(x='A', y='B')
```
image pandas 12

17.8. Hexagonal Bin Plot

To create a hexagonal bin plot we use plot.hexbin(), where you need to specify the x and y variables of the plot and the grid size via gridsize.

```python
dataframe = pd.DataFrame(np.random.randn(1000, 2), columns=["a", "b"])
dataframe["b"] = dataframe["b"] + np.arange(1000)

dataframe.plot.hexbin(x="a", y="b", gridsize=25)
```
image pandas 13
