Data Handling with Pandas
Disclaimer: This post has been translated to English using a machine translation model. Please, let me know if you find any mistakes.
1. Summary
Let's take a short introduction to Pandas, the data manipulation and analysis library. With it we will be able to handle and process tabular data, operate on it, and extract valuable information from it.
2. What is Pandas?
Pandas is a Python library designed to make working with relational or labeled data easy and intuitive. Pandas is well suited to many different types of data:
* Tabular data with columns of heterogeneous types, such as in an SQL table or an Excel spreadsheet
* Ordered and unordered time series data (not necessarily of fixed frequency)
* Arbitrary matrix data (homogeneous or heterogeneous) with row and column labels
* Any other form of observational/statistical datasets. The data does not need to be labeled at all to be placed in a pandas data structure
The two main data structures in Pandas are Series (one-dimensional) and DataFrame (two-dimensional). Pandas is built on top of NumPy and is designed to integrate well within a scientific computing environment with many other third-party libraries.
For data scientists, working with data generally involves several stages: collecting and cleaning data, analyzing/modeling it, and then organizing the analysis results into a suitable form for plotting or displaying in tabular form. pandas is the ideal tool for all these tasks.
Another feature is that pandas is fast: many of its low-level algorithms are implemented in C.
2.1. Pandas as pd
When importing pandas, it is usually given the alias pd.
import pandas as pd

print(pd.__version__)
1.0.1
3. Pandas Data Structures
In Pandas there are two types of data structures: Series and DataFrames.
3.1. Series
The Series data type is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). Its axis labels are collectively referred to as the index.
The most common way to create a Series is

```
series = pd.Series(data, index=index)
```

Where `data` can be:
* A dictionary
* A list or tuple
* A NumPy ndarray
* A scalar value
As one of the data types can be a NumPy ndarray, we import NumPy in order to use it
import numpy as np
3.1.1. Series from a Dictionary
diccionario = {"b": 1, "a": 0, "c": 2}
serie = pd.Series(diccionario)
serie
b    1
a    0
c    2
dtype: int64
If an index is passed, the values of the data corresponding to the index labels will be extracted. If a label does not exist in the data, its value is set to NaN (not a number).
diccionario = {"b": 1, "a": 0, "c": 2}
serie = pd.Series(diccionario, index=["b", "c", "d", "a"])
serie
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
3.1.2. Series from a list or tuple
If the data comes from a list or tuple and no index is passed, one will be created with values [0, ..., len(data)-1]
serie = pd.Series([1, 2, 3, 4])
serie
0    1
1    2
2    3
3    4
dtype: int64
If an index is passed, it must have the same length as the data
serie = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
serie
a    1
b    2
c    3
d    4
dtype: int64
3.1.3. Series from an ndarray
If the data comes from an ndarray and no index is passed, one will be created with values [0, ..., len(data)-1]
serie = pd.Series(np.random.randn(5))
serie
0    1.267865
1   -0.877857
2   -0.138556
3   -0.132987
4   -0.827295
dtype: float64
If an index is passed, it must have the same length as the data
serie = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
serie
a   -1.091828
b   -0.584243
c    0.220398
d    1.248923
e    1.652351
dtype: float64
3.1.4. Series from a Scalar
If the series is created from a scalar, it will be created with a single item
serie = pd.Series(5.0)
serie
0    5.0
dtype: float64
If you want to create more items in the series, you need to pass the index with the number of items you want, so that all items will have the value of the scalar
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
serie
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64
3.1.5. Operations with Series
Just like with Numpy, we can perform operations on all the elements of a series without having to iterate over each one of them.
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(f"serie:\n{serie}")
print(f"\nserie + serie =\n{serie + serie}")
serie:
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

serie + serie =
a    10.0
b    10.0
c    10.0
d    10.0
e    10.0
dtype: float64
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(f"serie:\n{serie}")
print(f"\nexp(serie) =\n{np.exp(serie)}")
serie:
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

exp(serie) =
a    148.413159
b    148.413159
c    148.413159
d    148.413159
e    148.413159
dtype: float64
One difference between Series
and ndarrays is that operations between Series
automatically align the data according to their labels. Therefore, calculations can be written without considering whether the Series
involved have the same labels. If a label is not found in one Series
or another, the result will be marked as missing (NaN).
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(f"serie:\n{serie}")
print(f"\nserie[1:] + serie[:-1] =\n{serie[1:] + serie[:-1]}")
serie:
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

serie[1:] + serie[:-1] =
a     NaN
b    10.0
c    10.0
d    10.0
e     NaN
dtype: float64
3.1.6. Series Name Attribute
One of the attributes of Series
is name
, which corresponds to the name they will have when added to a DataFrame. Conversely, when a series is obtained from a DataFrame, this series will have the name it had in the DataFrame.
serie = pd.Series(np.random.randn(5), name="aleatorio")
serie
0   -0.191009
1   -0.793151
2   -0.907747
3   -1.440508
4   -0.676419
Name: aleatorio, dtype: float64
You can change the name of a series using the rename()
method
serie = serie.rename("random")
serie
0   -0.191009
1   -0.793151
2   -0.907747
3   -1.440508
4   -0.676419
Name: random, dtype: float64
3.2. DataFrames
A DataFrame
is a labeled, two-dimensional data structure with columns of potentially different types, meaning that one column can have integer data, another column string data, etc. You can think of it as a spreadsheet or an SQL table, or a dictionary of Series
objects.
It is the most commonly used pandas object. Like Series, DataFrames accept many different types of input, among them:
* A dictionary of Series
* A dictionary of ndarrays or lists
* A two-dimensional ndarray (matrix)
* A list of dictionaries
* A dictionary of tuples
* A Series
Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and/or columns, you guarantee the index and/or columns of the resulting DataFrame. Therefore, a dictionary of Series plus a specific index will discard all data that does not match the passed index.
If the axis labels are not provided, they will be constructed from the input data based on common sense rules.
3.2.1. DataFrames from a dictionary of Series
If a dictionary with Series
is passed, the DataFrame
will be created with as many columns as there are Series
in the dictionary.
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe
If each of the Series
has defined indexes, the resulting DataFrame
will be the union of these indexes
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0], index=["a", "b", "c", "d"])
}
dataframe = pd.DataFrame(diccionario)
dataframe
dataframe = pd.DataFrame(diccionario, index=["d", "b", "a"])
dataframe
If the columns are passed, they will appear in the given order; any column not present in the data is created filled with NaN
dataframe = pd.DataFrame(diccionario, columns=["dos", "tres"])
dataframe
3.2.2. DataFrames from a dictionary of ndarrays or lists
All ndarrays or lists must have the same length. If an index is passed, it must also have the same length as the ndarrays or lists.
diccionario = {
"uno": [1.0, 2.0, 3.0, 4.0],
"dos": [4.0, 3.0, 2.0, 1.0]
}
dataframe = pd.DataFrame(diccionario)
dataframe
3.2.3. DataFrames from a matrix
If an index is passed, it must have the same length as the number of rows in the matrix, and if columns are passed, they must have the same length as the columns in the matrix.
matriz = np.array([[1, 3], [2, 2], [3, 1]])
dataframe = pd.DataFrame(matriz, index=["a", "b", "c"], columns=["columna1", "columna2"])
dataframe
3.2.4. DataFrames from a list of dictionaries
lista = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
dataframe = pd.DataFrame(lista)
dataframe
3.2.5. DataFrames from a dictionary of tuples
diccionario = {
("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
}
dataframe = pd.DataFrame(diccionario)
dataframe
3.2.6. DataFrames from a Series
The result will be a DataFrame
with the same index as the input Series, and with a column whose name is the original name of the Series (only if no other column name is provided).
diccionario = {"b": 1, "a": 0, "c": 2}
serie = pd.Series(diccionario)
dataframe = pd.DataFrame(serie)
dataframe
4. Exploring a DataFrame
When a DataFrame
is very large it cannot be fully displayed
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train
So it is very useful to have methods for exploring it and obtaining information quickly.
4.1. Head of the DataFrame
To view the first few rows and get an idea of what the DataFrame looks like, there is the head() method, which by default shows the first 5 rows of the DataFrame. If you want to see a different number of rows, specify it with the n parameter.
california_housing_train.head(n=10)
4.2. DataFrame Tail
If you want to see the last rows, you can use the tail() method, using the n parameter to choose how many rows to display.
california_housing_train.tail()
4.3. DataFrame Information
Another very useful method is info()
which gives us information about the DataFrame
california_housing_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           17000 non-null  float64
 1   latitude            17000 non-null  float64
 2   housing_median_age  17000 non-null  float64
 3   total_rooms         17000 non-null  float64
 4   total_bedrooms      17000 non-null  float64
 5   population          17000 non-null  float64
 6   households          17000 non-null  float64
 7   median_income       17000 non-null  float64
 8   median_house_value  17000 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB
4.4. DataFrame Rows and Columns
The indices and columns of a DataFrame can be obtained through its index and columns attributes.
diccionario = {
    "uno": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "dos": pd.Series([4.0, 5.0, 6.0, 7.0], index=["a", "b", "c", "d"])
}
dataframe = pd.DataFrame(diccionario)

indices = dataframe.index
columnas = dataframe.columns

print(f"El DataFrame tiene los índices {indices}\n")
print(f"El DataFrame tiene las columnas {columnas}")
El DataFrame tiene los índices Index(['a', 'b', 'c', 'd'], dtype='object')

El DataFrame tiene las columnas Index(['uno', 'dos'], dtype='object')
4.5. DataFrame Description
The describe()
method shows a quick statistical summary of the DataFrame
data.
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.describe()
4.6. Sorting the DataFrame
Rows of a DataFrame
can be sorted alphabetically using the sort_index()
method.
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.sort_index().head()
As in this case the rows were already sorted, we set ascending=False
so that the order is reversed
california_housing_train.sort_index(ascending=False).head()
If what you want is to sort the columns, you need to pass axis=1, since by default it is 0
california_housing_train.sort_index(axis=1).head()
If what we want is to sort the DataFrame
by a specific column, we need to use the sort_values()
method and specify the label of the column we want to sort by.
california_housing_train.sort_values('median_house_value')
4.7. DataFrame Statistics
You can obtain statistics from the DataFrame
, such as the mean, mode, standard deviation
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")

print(f"media:\n{california_housing_train.mean()}")
print(f"\ndesviación estandar:\n{california_housing_train.std()}")
media:
longitude               -119.562108
latitude                  35.625225
housing_median_age        28.589353
total_rooms             2643.664412
total_bedrooms           539.410824
population              1429.573941
households               501.221941
median_income              3.883578
median_house_value    207300.912353
dtype: float64

desviación estandar:
longitude                  2.005166
latitude                   2.137340
housing_median_age        12.586937
total_rooms             2179.947071
total_bedrooms           421.499452
population              1147.852959
households               384.520841
median_income              1.908157
median_house_value    115983.764387
dtype: float64
If you want to obtain statistics on the rows instead of the columns, you need to indicate it with axis=1
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")

print(f"media:\n{california_housing_train.mean(axis=1)}")
print(f"\ndesviación estandar:\n{california_housing_train.std(axis=1)}")
media:
0         8357.597067
1        10131.527778
2         9664.642322
3         8435.029078
4         7567.436111
             ...
16995    12806.408567
16996     9276.770878
16997    12049.507922
16998    10082.053300
16999    10863.022744
Length: 17000, dtype: float64

desviación estandar:
0        22026.612445
1        26352.939272
2        28514.316588
3        24366.754747
4        21730.014569
             ...
16995    36979.676899
16996    26158.006771
16997    34342.876792
16998    28408.152329
16999    31407.119788
Length: 17000, dtype: float64
Another useful thing that can be obtained from DataFrames
is, for example, the number of times each item in a column repeats.
california_housing_train["total_rooms"].value_counts()
1582.0     16
1527.0     15
1717.0     14
1471.0     14
1703.0     14
           ..
157.0       1
2760.0      1
458.0       1
10239.0     1
4068.0      1
Name: total_rooms, Length: 5533, dtype: int64
For example, we can see that there are a total of 16 houses with 1582 rooms.
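If proportions are more informative than raw counts, value_counts() also accepts a normalize parameter; a small sketch on the same column:

```python
# Relative frequency of each distinct total_rooms value instead of raw counts
california_housing_train["total_rooms"].value_counts(normalize=True).head()
```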
4.8. Used Memory
We can see the memory used by the dataframe
california_housing_train.memory_usage(deep=True)
Index                    128
longitude             136000
latitude              136000
housing_median_age    136000
total_rooms           136000
total_bedrooms        136000
population            136000
households            136000
median_income         136000
median_house_value    136000
dtype: int64
5. Adding Data
5.1. Adding Columns
Columns can be easily added as operations from other columns
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe["tres"] = dataframe["uno"] + dataframe["dos"]
dataframe["flag"] = dataframe["tres"] > 7.0
dataframe
Columns can also be added indicating what value all their items will have.
dataframe["constante"] = 8.0
dataframe
If a Series
is added that does not have the same number of indices as the DataFrame
, it will be adjusted to the number of indices of the DataFrame
dataframe["Menos indices"] = dataframe["uno"][:2]
dataframe
With the previous methods, the column was added at the end, but if you want to add the column at a specific position, you can use the insert()
method.
For example, if you want to add a column at position 3 (counting from position 0), named "columna insertada", with a value double that of the column "tres", it would be done as follows
dataframe.insert(loc=3, column="columna insertada", value=dataframe["tres"]*2)
dataframe
If you want to add more than one column per command, you can use the assign()
method.
dataframe = dataframe.assign(
columna_asignada1 = dataframe["uno"] * dataframe["tres"],
columna_asignada2 = dataframe["dos"] * dataframe["tres"],
)
dataframe
5.2. Adding Rows
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe.head()
We can add a row at the end with the concat
method (which we will see in more detail later)
diccionario = {
"uno": [10.0],
"dos": [20.0]
}
dataframe = pd.concat([dataframe, pd.DataFrame(diccionario)])
dataframe
We see that the row has been added at the end, but it has index zero, so we reorder the indices using the reset_index(drop=True) method.
dataframe = dataframe.reset_index(drop=True)
dataframe
6. Data Deletion
6.1. Removal of Columns
A specific column can be removed using the pop()
method
dataframe.pop("constante")
dataframe
Or using del
del dataframe["flag"]
dataframe
6.2. Row Removal
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]),
"dos": pd.Series([11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]),
"tres": pd.Series([21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe.head()
If we want to delete a row, we can use the drop
method, specifying its position. For example, if we want to delete the row at position 1
dataframe = dataframe.drop(1)
dataframe
If we want to delete the last row
dataframe = dataframe.drop(len(dataframe)-1)
dataframe
If what we want is to delete a range of rows
dataframe = dataframe.drop(range(2, 5))
dataframe
If what we want is to delete a specific set of rows
dataframe = dataframe.drop([5, 7, 9])
dataframe
Just like when we add rows, we see that some indices have been removed, so we reorder the indices using the reset_index(drop=True)
method.
dataframe = dataframe.reset_index(drop=True)
dataframe
7. Operations on DataFrames
You can perform operations on DataFrame
s just as you could with Numpy
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]),
"dos": pd.Series([11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]),
"tres": pd.Series([21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe.head()
dataframe[ ["uno", "dos", "tres"] ] * 2
np.exp(dataframe[ ["uno", "dos", "tres"] ])
If you want to perform more complex operations, you can use the apply()
method.
dataframe = dataframe.apply(lambda x: x.max() - x.min())
dataframe
uno     9.0
dos     9.0
tres    9.0
dtype: float64
A lambda
function has been used because it is a simple function, but in case we want to apply more complex functions, we can define and apply them.
def funcion(x):
    if x < 10:
        return np.exp(x) - np.log(5*x) + np.sqrt(x)
    elif x < 20:
        return np.sin(x) + np.cos(x) + np.tan(x)
    else:
        return np.log(x) + np.log10(x) + np.log2(x)

dataframe = dataframe.apply(funcion)
dataframe
uno     8102.277265
dos     8102.277265
tres    8102.277265
dtype: float64
Using the apply
method in a dataframe is much faster than using a for
loop for each of the rows and performing the operation
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.head()
Let's calculate the percentage of bedrooms out of the total number of rooms
california_housing_train["percent_bedrooms"] = None
%time california_housing_train["percent_bedrooms"] = california_housing_train.apply(lambda x: x["total_bedrooms"] / x["total_rooms"], axis=1)
california_housing_train.head()
california_housing_train["percent_bedrooms"] = None
%time for i in range(len(california_housing_train)): california_housing_train["percent_bedrooms"][i] = california_housing_train["total_bedrooms"][i] / california_housing_train["total_rooms"][i]
california_housing_train.head()
With the lambda
function it took about 300 ms, whereas with the for
loop it took over 1 second
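In fact, for simple element-wise arithmetic like this, operating on whole columns at once is usually even faster than apply(), since pandas vectorizes the computation; a minimal sketch with the same columns:

```python
# Vectorized column arithmetic: no apply() and no explicit loop
california_housing_train["percent_bedrooms"] = (
    california_housing_train["total_bedrooms"] / california_housing_train["total_rooms"]
)
california_housing_train.head()
```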
8. Transpose
The transpose of a DataFrame can be obtained with the T attribute
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0], index=["fila a", "fila b", "fila c"]),
"dos": pd.Series([4.0, 5.0, 6.0], index=["fila a", "fila b", "fila c"])
}
dataframe = pd.DataFrame(diccionario)
dataframe["tres"] = dataframe["uno"] + dataframe["dos"]
dataframe["flag"] = dataframe["tres"] > 7.0
dataframe.T
9. Conversion to Numpy
If you want to convert a Series
or DataFrame
to NumPy you can use the to_numpy()
method or use the np.asarray()
function.
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0], index=["fila a", "fila b", "fila c"]),
"dos": pd.Series([4.0, 5.0, 6.0], index=["fila a", "fila b", "fila c"])
}
dataframe = pd.DataFrame(diccionario)
dataframe["tres"] = dataframe["uno"] + dataframe["dos"]
dataframe["flag"] = dataframe["tres"] > 7.0
dataframe
matriz_np = dataframe.to_numpy()
matriz_np
array([[1.0, 4.0, 5.0, False],
       [2.0, 5.0, 7.0, False],
       [3.0, 6.0, 9.0, True]], dtype=object)
matriz_np = np.asarray(dataframe)
matriz_np
array([[1.0, 4.0, 5.0, False],
       [2.0, 5.0, 7.0, False],
       [3.0, 6.0, 9.0, True]], dtype=object)
This example is not the most suitable, as it mixes numbers with booleans, and as explained in the previous post Matrix Calculation with NumPy, all elements of an ndarray must be of the same type. Since numbers and booleans are mixed here, NumPy converts everything to the generic object type.
To solve this, we keep only the numbers and convert them to an ndarray
matriz_np = dataframe[ ["uno", "dos", "tres"] ].to_numpy()
matriz_np, matriz_np.dtype
(array([[1., 4., 5.],
        [2., 5., 7.],
        [3., 6., 9.]]),
 dtype('float64'))
Now it can be seen that an ndarray has been created in which all the data is of type float.
10. Reading Data from External Sources
One of the greatest strengths of Pandas is its ability to read data from files, so it is not necessary to create a DataFrame
with the data to be processed, but rather it can be read from a file
In the same way that DataFrame
s can be created from external files, DataFrame
s can also be saved to files, allowing you to create your own dataset, configure it as you wish, and save it to a file for later use.
The following table shows the functions for reading and writing files in different formats
|Format|File type|Read function|Write function|
|---|---|---|---|
|text|CSV|read_csv|to_csv|
|text|Fixed-Width Text File|read_fwf||
|text|JSON|read_json|to_json|
|text|HTML|read_html|to_html|
|text|Local clipboard|read_clipboard|to_clipboard|
|binary|MS Excel|read_excel|to_excel|
|binary|OpenDocument|read_excel||
|binary|HDF5 Format|read_hdf|to_hdf|
|binary|Feather Format|read_feather|to_feather|
|binary|Parquet Format|read_parquet|to_parquet|
|binary|ORC Format|read_orc||
|binary|Msgpack|read_msgpack|to_msgpack|
|binary|Stata|read_stata|to_stata|
|binary|SAS|read_sas||
|binary|SPSS|read_spss||
|binary|Python Pickle Format|read_pickle|to_pickle|
|SQL|SQL|read_sql|to_sql|
|SQL|Google BigQuery|read_gbq|to_gbq|
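As a minimal sketch of this round trip (the file name housing_copy.csv is just an example):

```python
# Save the DataFrame to a CSV file; index=False avoids writing the index as a column
california_housing_train.to_csv("housing_copy.csv", index=False)

# Read it back into a new DataFrame
housing = pd.read_csv("housing_copy.csv")
housing.head()
```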
11. Indexing in DataFrames
There are many ways to index DataFrames:
fechas = pd.date_range('1/1/2000', periods=8)
dataframe = pd.DataFrame(np.random.randn(8, 4), index=fechas, columns=['A', 'B', 'C', 'D'])
dataframe
11.1. Column Indexing
To select columns within a DataFrame
we can do it by selecting the column between brackets []
, or by indicating the column as if it were a method of the DataFrame
.
dataframe['A']
2000-01-01    0.813153
2000-01-02   -0.244584
2000-01-03    0.125729
2000-01-04    0.352275
2000-01-05   -2.050976
2000-01-06   -0.312296
2000-01-07    0.897837
2000-01-08    0.271403
Freq: D, Name: A, dtype: float64
dataframe.A
2000-01-01    0.813153
2000-01-02   -0.244584
2000-01-03    0.125729
2000-01-04    0.352275
2000-01-05   -2.050976
2000-01-06   -0.312296
2000-01-07    0.897837
2000-01-08    0.271403
Freq: D, Name: A, dtype: float64
If specific columns are desired, they are passed as a list
dataframe[ ['A', 'B'] ]
11.2. Row Indexing by Positions
You can select a range of rows from a DataFrame
as follows
dataframe[0:3]
If you only want to select a single row, you need to specify a range of rows that includes only that one, for example, if you want to select row number 1.
dataframe[1:2]
Another method to select a row by its position is the iloc[]
method
dataframe.iloc[0:3]
If certain rows are desired, a list with their positions is passed.
dataframe.iloc[ [0, 2, 4] ]
11.3. Row Indexing by Labels
To select a row by its labels, we can use the loc[]
method
dataframe.loc['2000-01-01']
A    0.813153
B   -0.869356
C    0.934293
D    0.338644
Name: 2000-01-01 00:00:00, dtype: float64
If you want to select a range of rows, you can index them using the colon :
dataframe.loc['2000-01-01':'2000-01-03']
If specific rows are desired, they are passed through a list
dataframe.loc[ ['2000-01-01', '2000-01-03', '2000-01-05'] ]
11.4. Selecting a Portion of the DataFrame by Positions
dataframe.iloc[0:3, 0:2]
If specific rows and columns are desired, lists with the desired positions are provided.
dataframe.iloc[ [0, 2, 4], [0, 2] ]
11.5. Selecting a portion of the DataFrame using labels
dataframe.loc['2000-01-01':'2000-01-03', 'A':'B']
If specific rows and columns are desired, lists with the desired labels are passed
dataframe.loc[ ['2000-01-01', '2000-01-03', '2000-01-05'], ['A', 'C'] ]
11.6. Indexing by Lambda Function
You can select data from a DataFrame
that meets a condition given by a lambda
function.
dataframe.loc[lambda dataframe: 2*dataframe['A'] + 5*np.exp(dataframe['B']) > 0.2]
As you can see, this form of indexing is very powerful
11.7. Conditional Indexing
If we don't need complex functions to index, but only conditionals, we can do
dataframe[dataframe['A']>0.2]
We can make multiple conditions
dataframe[(dataframe['A']>0.2) & (dataframe['B']>0.2)]
11.8. Random indexing
Using the sample()
method we will obtain a random row from the DataFrame
dataframe.sample()
If we want more than one sample, we indicate it with the attribute n
dataframe.sample(n=3)
If what you want are random columns, you need to indicate it with axis=1
dataframe.sample(axis=1)
If you want a single item from the DataFrame
, you need to call the sample()
method twice.
dataframe.sample(axis=1).sample()
12. Merging DataFrames
12.1. Concatenation of DataFrames
To concatenate multiple DataFrame
s we use the concat()
method, where it will be passed a list with the DataFrame
s that you want to join.
dataframe1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

dataframe2 = pd.DataFrame({
    "A": ["A4", "A5", "A6", "A7"],
    "B": ["B4", "B5", "B6", "B7"],
    "C": ["C4", "C5", "C6", "C7"],
    "D": ["D4", "D5", "D6", "D7"],
})

dataframe3 = pd.DataFrame({
    "A": ["A8", "A9", "A10", "A11"],
    "B": ["B8", "B9", "B10", "B11"],
    "C": ["C8", "C9", "C10", "C11"],
    "D": ["D8", "D9", "D10", "D11"],
})

dataframe = pd.concat([dataframe1, dataframe2, dataframe3])

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"dataframe3:\n{dataframe3}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
dataframe2:
    A   B   C   D
0  A4  B4  C4  D4
1  A5  B5  C5  D5
2  A6  B6  C6  D6
3  A7  B7  C7  D7
dataframe3:
     A    B    C    D
0   A8   B8   C8   D8
1   A9   B9   C9   D9
2  A10  B10  C10  D10
3  A11  B11  C11  D11

dataframe:
     A    B    C    D
0   A0   B0   C0   D0
1   A1   B1   C1   D1
2   A2   B2   C2   D2
3   A3   B3   C3   D3
0   A4   B4   C4   D4
1   A5   B5   C5   D5
2   A6   B6   C6   D6
3   A7   B7   C7   D7
0   A8   B8   C8   D8
1   A9   B9   C9   D9
2  A10  B10  C10  D10
3  A11  B11  C11  D11
As you can see, the indexes 0
, 1
, 2
, and 3
are repeated because each dataframe has those indexes. To prevent this from happening, you need to use the ignore_index=True
parameter.
dataframe = pd.concat([dataframe1, dataframe2, dataframe3], ignore_index=True)

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"dataframe3:\n{dataframe3}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
dataframe2:
    A   B   C   D
0  A4  B4  C4  D4
1  A5  B5  C5  D5
2  A6  B6  C6  D6
3  A7  B7  C7  D7
dataframe3:
     A    B    C    D
0   A8   B8   C8   D8
1   A9   B9   C9   D9
2  A10  B10  C10  D10
3  A11  B11  C11  D11

dataframe:
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11
If you wanted to concatenate along the columns, you would have to pass axis=1
dataframe = pd.concat([dataframe1, dataframe2, dataframe3], axis=1)

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"dataframe3:\n{dataframe3}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
dataframe2:
    A   B   C   D
0  A4  B4  C4  D4
1  A5  B5  C5  D5
2  A6  B6  C6  D6
3  A7  B7  C7  D7
dataframe3:
     A    B    C    D
0   A8   B8   C8   D8
1   A9   B9   C9   D9
2  A10  B10  C10  D10
3  A11  B11  C11  D11

dataframe:
    A   B   C   D   A   B   C   D    A    B    C    D
0  A0  B0  C0  D0  A4  B4  C4  D4   A8   B8   C8   D8
1  A1  B1  C1  D1  A5  B5  C5  D5   A9   B9   C9   D9
2  A2  B2  C2  D2  A6  B6  C6  D6  A10  B10  C10  D10
3  A3  B3  C3  D3  A7  B7  C7  D7  A11  B11  C11  D11
12.1.1. Intersection of Concatenation
There are two ways to perform concatenation: by taking all the indices of the DataFrame
s or by taking only those that match. This is determined by the join
variable, which accepts the values 'outer'
(default) (takes all indices) or 'inner'
(only those that match).
Let's see an example of 'outer'
dataframe1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
}, index=[0, 1, 2, 3])

dataframe4 = pd.DataFrame({
    "B": ["B2", "B3", "B6", "B7"],
    "D": ["D2", "D3", "D6", "D7"],
    "F": ["F2", "F3", "F6", "F7"],
}, index=[2, 3, 6, 7])

dataframe = pd.concat([dataframe1, dataframe4], axis=1)

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe4}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
dataframe2:
    B   D   F
2  B2  D2  F2
3  B3  D3  F3
6  B6  D6  F6
7  B7  D7  F7

dataframe:
     A    B    C    D    B    D    F
0   A0   B0   C0   D0  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN
2   A2   B2   C2   D2   B2   D2   F2
3   A3   B3   C3   D3   B3   D3   F3
6  NaN  NaN  NaN  NaN   B6   D6   F6
7  NaN  NaN  NaN  NaN   B7   D7   F7
Let's see an example of 'inner'
dataframe = pd.concat([dataframe1, dataframe4], axis=1, join="inner")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe4}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
dataframe2:
    B   D   F
2  B2  D2  F2
3  B3  D3  F3
6  B6  D6  F6
7  B7  D7  F7

dataframe:
    A   B   C   D   B   D   F
2  A2  B2  C2  D2  B2  D2  F2
3  A3  B3  C3  D3  B3  D3  F3
12.2. Merge
of DataFrames
Previously we created a new dataframe by merging several dataframes, now we can complete a dataframe with another one, for this we use merge
, passing the on
parameter, indicating on which column we want the merge
to be done.
dataframe1 = pd.DataFrame({
    "Key": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
})

dataframe2 = pd.DataFrame({
    "Key": ["K0", "K1", "K2", "K3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

dataframe = dataframe1.merge(dataframe2, on="Key")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
  Key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3
dataframe2:
  Key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K2  C2  D2
3  K3  C3  D3

dataframe:
  Key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3
In this case, the two dataframes had a key that was called the same (Key
), but in the case of having dataframes where their key is called differently, we can use the parameters left_on
and right_on
dataframe1 = pd.DataFrame({
    "Key1": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
})

dataframe2 = pd.DataFrame({
    "Key2": ["K0", "K1", "K2", "K3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

dataframe = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
  Key1   A   B
0   K0  A0  B0
1   K1  A1  B1
2   K2  A2  B2
3   K3  A3  B3
dataframe2:
  Key2   C   D
0   K0  C0  D0
1   K1  C1  D1
2   K2  C2  D2
3   K3  C3  D3

dataframe:
  Key1   A   B Key2   C   D
0   K0  A0  B0   K0  C0  D0
1   K1  A1  B1   K1  C1  D1
2   K2  A2  B2   K2  C2  D2
3   K3  A3  B3   K3  C3  D3
If one of the keys does not match, the merge will not include the row for that key.
dataframe1 = pd.DataFrame({
    "Key1": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
})

dataframe2 = pd.DataFrame({
    "Key2": ["K0", "K1", "K2", np.nan],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

dataframe = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
  Key1   A   B
0   K0  A0  B0
1   K1  A1  B1
2   K2  A2  B2
3   K3  A3  B3
dataframe2:
  Key2   C   D
0   K0  C0  D0
1   K1  C1  D1
2   K2  C2  D2
3  NaN  C3  D3

dataframe:
  Key1   A   B Key2   C   D
0   K0  A0  B0   K0  C0  D0
1   K1  A1  B1   K1  C1  D1
2   K2  A2  B2   K2  C2  D2
To change this behavior we can use the how
parameter, which defaults to inner
, but we can pass it the values left
, right
, and outer
dataframe1 = pd.DataFrame({
    "Key1": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
})

dataframe2 = pd.DataFrame({
    "Key2": ["K0", "K1", "K2", np.nan],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

dataframe_inner = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="inner")
dataframe_left = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="left")
dataframe_right = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="right")
dataframe_outer = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="outer")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe inner:\n{dataframe_inner}")
print(f"\ndataframe left:\n{dataframe_left}")
print(f"\ndataframe right:\n{dataframe_right}")
print(f"\ndataframe outer:\n{dataframe_outer}")
dataframe1:
  Key1   A   B
0   K0  A0  B0
1   K1  A1  B1
2   K2  A2  B2
3   K3  A3  B3
dataframe2:
  Key2   C   D
0   K0  C0  D0
1   K1  C1  D1
2   K2  C2  D2
3  NaN  C3  D3

dataframe inner:
  Key1   A   B Key2   C   D
0   K0  A0  B0   K0  C0  D0
1   K1  A1  B1   K1  C1  D1
2   K2  A2  B2   K2  C2  D2

dataframe left:
  Key1   A   B Key2    C    D
0   K0  A0  B0   K0   C0   D0
1   K1  A1  B1   K1   C1   D1
2   K2  A2  B2   K2   C2   D2
3   K3  A3  B3  NaN  NaN  NaN

dataframe right:
  Key1    A    B Key2   C   D
0   K0   A0   B0   K0  C0  D0
1   K1   A1   B1   K1  C1  D1
2   K2   A2   B2   K2  C2  D2
3  NaN  NaN  NaN  NaN  C3  D3

dataframe outer:
  Key1    A    B Key2    C    D
0   K0   A0   B0   K0   C0   D0
1   K1   A1   B1   K1   C1   D1
2   K2   A2   B2   K2   C2   D2
3   K3   A3   B3  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN   C3   D3
As you can see, when left
is chosen, only the values from the left dataframe are added, and when right
is chosen, the values from the right dataframe are added.
12.3. Join
of dataframes
The last DataFrame joining tool is join
. It is similar to merge
, except that instead of looking for similarities based on specified columns, it looks for them based on the indices.
dataframe1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
}, index=["K0", "K1", "K2", "K3"])

dataframe2 = pd.DataFrame({
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
}, index=["K0", "K1", "K2", "K3"])

dataframe = dataframe1.join(dataframe2)

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
dataframe1:
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2
K3  A3  B3
dataframe2:
     C   D
K0  C0  D0
K1  C1  D1
K2  C2  D2
K3  C3  D3

dataframe:
     A   B   C   D
K0  A0  B0  C0  D0
K1  A1  B1  C1  D1
K2  A2  B2  C2  D2
K3  A3  B3  C3  D3
In this case, the indexes are the same, but when they are different, we can specify how to join the dataframes using the how
parameter, which defaults to inner
, but can also have the values left
, right
, and outer
.
dataframe1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
}, index=["K0", "K1", "K2", "K3"])

dataframe2 = pd.DataFrame({
    "C": ["C0", "C2", "C3", "C4"],
    "D": ["D0", "D2", "D3", "D4"],
}, index=["K0", "K2", "K3", "K4"])

dataframe_inner = dataframe1.join(dataframe2, how="inner")
dataframe_left = dataframe1.join(dataframe2, how="left")
dataframe_right = dataframe1.join(dataframe2, how="right")
dataframe_outer = dataframe1.join(dataframe2, how="outer")

print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe inner:\n{dataframe_inner}")
print(f"\ndataframe left:\n{dataframe_left}")
print(f"\ndataframe right:\n{dataframe_right}")
print(f"\ndataframe outer:\n{dataframe_outer}")
dataframe1:
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2
K3  A3  B3
dataframe2:
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3
K4  C4  D4

dataframe inner:
     A   B   C   D
K0  A0  B0  C0  D0
K2  A2  B2  C2  D2
K3  A3  B3  C3  D3

dataframe left:
     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2
K3  A3  B3   C3   D3

dataframe right:
      A    B   C   D
K0   A0   B0  C0  D0
K2   A2   B2  C2  D2
K3   A3   B3  C3  D3
K4  NaN  NaN  C4  D4

dataframe outer:
      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3   A3   B3   C3   D3
K4  NaN  NaN   C4   D4
13. Missing Data (NaN)
In a DataFrame
there may be some missing data, Pandas represents them as np.nan
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe
13.1. Removal of Rows with Missing Data
To avoid having rows with missing data, these can be eliminated
dataframe.dropna(how="any")
13.2. Removing Columns with Missing Data
dataframe.dropna(axis=1, how='any')
13.3. Boolean mask with missing positions
pd.isna(dataframe)
13.4. Filling Missing Data
dataframe.fillna(value=5.5, inplace=True)
dataframe
Tip: setting the parameter inplace=True modifies the DataFrame being operated on, so there is no need to write dataframe = dataframe.fillna(value=5.5)
14. Time Series
Pandas offers the ability to work with time series. For example, we create a Series
of 100 random data points every second starting from 01/01/2021
indices = pd.date_range("1/1/2021", periods=100, freq="S")
datos = np.random.randint(0, 500, len(indices))
serie_temporal = pd.Series(datos, index=indices)
serie_temporal
2021-01-01 00:00:00    241
2021-01-01 00:00:01     14
2021-01-01 00:00:02    190
2021-01-01 00:00:03    407
2021-01-01 00:00:04     94
                      ...
2021-01-01 00:01:35    275
2021-01-01 00:01:36     56
2021-01-01 00:01:37    448
2021-01-01 00:01:38    151
2021-01-01 00:01:39    316
Freq: S, Length: 100, dtype: int64
This functionality of Pandas is very powerful, for example, we can have a dataset at certain hours in one time zone and change them to another time zone.
horas = pd.date_range("3/6/2021 00:00", periods=10, freq="H")
datos = np.random.randn(len(horas))
serie_horaria = pd.Series(datos, horas)
serie_horaria
2021-03-06 00:00:00   -0.853524
2021-03-06 01:00:00   -1.355372
2021-03-06 02:00:00   -1.267503
2021-03-06 03:00:00   -1.155787
2021-03-06 04:00:00    0.730935
2021-03-06 05:00:00    1.435957
2021-03-06 06:00:00    0.460912
2021-03-06 07:00:00    0.723451
2021-03-06 08:00:00   -0.853337
2021-03-06 09:00:00    0.456359
Freq: H, dtype: float64
We locate the data in a time zone
serie_horaria_utc = serie_horaria.tz_localize("UTC")
serie_horaria_utc
2021-03-06 00:00:00+00:00   -0.853524
2021-03-06 01:00:00+00:00   -1.355372
2021-03-06 02:00:00+00:00   -1.267503
2021-03-06 03:00:00+00:00   -1.155787
2021-03-06 04:00:00+00:00    0.730935
2021-03-06 05:00:00+00:00    1.435957
2021-03-06 06:00:00+00:00    0.460912
2021-03-06 07:00:00+00:00    0.723451
2021-03-06 08:00:00+00:00   -0.853337
2021-03-06 09:00:00+00:00    0.456359
Freq: H, dtype: float64
And now we can convert them to another time zone
serie_horaria_US = serie_horaria_utc.tz_convert("US/Eastern")
serie_horaria_US
2021-03-05 19:00:00-05:00   -0.853524
2021-03-05 20:00:00-05:00   -1.355372
2021-03-05 21:00:00-05:00   -1.267503
2021-03-05 22:00:00-05:00   -1.155787
2021-03-05 23:00:00-05:00    0.730935
2021-03-06 00:00:00-05:00    1.435957
2021-03-06 01:00:00-05:00    0.460912
2021-03-06 02:00:00-05:00    0.723451
2021-03-06 03:00:00-05:00   -0.853337
2021-03-06 04:00:00-05:00    0.456359
Freq: H, dtype: float64
15. Categorical Data
Pandas offers the possibility of adding categorical data in a DataFrame
. Let's suppose the following DataFrame
dataframe = pd.DataFrame(
{"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
dataframe
We can convert the data in the raw_grade
column to categorical data using the astype()
method
dataframe['grade'] = dataframe["raw_grade"].astype("category")
dataframe
The columns raw_grade
and grade
appear to be the same, but if we look at the DataFrame
information we can see that this is not the case.
dataframe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   id         6 non-null      int64
 1   raw_grade  6 non-null      object
 2   grade      6 non-null      category
dtypes: category(1), int64(1), object(1)
memory usage: 334.0+ bytes
You can see that the grade
column is of categorical type
We can see the categories of a categorical column through the cat.categories attribute.
dataframe["grade"].cat.categories
Index(['a', 'b', 'e'], dtype='object')
We can also rename the categories by assigning a list with the new names to this same attribute.
dataframe["grade"].cat.categories = ["very good", "good", "very bad"]
dataframe
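Note that assigning directly to cat.categories works in the pandas version used here (1.0.x), but recent pandas versions remove this assignment in favor of the rename_categories() method, so a more portable sketch would be:

```python
# Equivalent renaming, preferred in recent pandas versions
dataframe["grade"] = dataframe["grade"].cat.rename_categories(
    ["very good", "good", "very bad"]
)
```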
Pandas gives us the possibility to numerically encode categorical data using the get_dummies
method
pd.get_dummies(dataframe["grade"])
16. Groupby
We can group the dataframes by values from one of the columns. Let's reload the dataframe with the value of houses in California.
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.head()
Now we can group the data by one of the columns; for example, let's group the houses by their median age (housing_median_age) and see how many houses there are of each age using count
california_housing_train.groupby("housing_median_age").count().head()
As we can see, every column shows the same value, namely the number of houses of a given age; but we can also obtain the average value of each column with mean
california_housing_train.groupby("housing_median_age").mean().head()
We can obtain multiple measures for each age using the agg
(aggregation) command, passing the measures we want with a list. For example, let's see the minimum, maximum, and mean of each column for each house age.
california_housing_train.groupby("housing_median_age").agg(['min', 'max', 'mean']).head()
We can specify which columns we want to perform certain calculations on by passing a dictionary, where the keys will be the columns on which we want to perform calculations and the values will be lists with the calculations.
california_housing_train.groupby("housing_median_age").agg({'total_rooms': ['min', 'max', 'mean'], 'total_bedrooms': ['min', 'max', 'mean', 'median']}).head()
We can group by more than one column, for this, the columns need to be passed in a list
california_housing_train.groupby(["housing_median_age", "total_bedrooms"]).mean()
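The result is indexed by a MultiIndex built from the two grouping columns; if you prefer to get them back as ordinary columns, you can chain reset_index(), as in this small sketch:

```python
# Turn the (housing_median_age, total_bedrooms) MultiIndex back into columns
california_housing_train.groupby(["housing_median_age", "total_bedrooms"]).mean().reset_index().head()
```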
17. Graphics
Pandas offers the possibility of representing the data of our DataFrame
s in charts to obtain a better representation of them. To do this, it makes use of the matplotlib
library, which we will see in the next post.
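Note that these charts are displayed automatically in a Jupyter notebook; in a plain Python script you would need to call matplotlib explicitly, something like this sketch:

```python
import matplotlib.pyplot as plt  # pandas delegates its plotting to matplotlib

serie = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
serie.cumsum().plot()
plt.show()  # open the figure window when not running in a notebook
```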
17.1. Basic Graph
To represent the data in a graph, the easiest way is to use the plot()
method.
serie = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
serie = serie.cumsum()
serie.plot()
In the case of having a DataFrame
, the plot()
method will represent each of the columns of the DataFrame
.
dataframe = pd.DataFrame(
    np.random.randn(1000, 4), index=serie.index, columns=["A", "B", "C", "D"]
)
dataframe = dataframe.cumsum()
dataframe.plot()
17.2. Vertical Bar Chart
There are more methods to create charts, such as the vertical bar chart using plot.bar()
dataframe = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])
dataframe.plot.bar()
If we want to stack the bars, we indicate it with the variable stacked=True
dataframe.plot.bar(stacked=True)
17.3. Horizontal Bar Chart
To create a horizontal bar chart we use plot.barh()
dataframe.plot.barh()
If we want to stack the bars, we indicate it using the variable stacked=True
dataframe.plot.barh(stacked=True)
17.4. Histogram
To create a histogram we use plot.hist()
dataframe = pd.DataFrame(
{
"a": np.random.randn(1000) + 1,
"b": np.random.randn(1000),
"c": np.random.randn(1000) - 1,
}
)
dataframe.plot.hist(alpha=0.5)
If we want to stack the bars, we indicate it with the variable stacked=True
.
dataframe.plot.hist(alpha=0.5, stacked=True)
If we want to add more columns, that is, if we want the histogram to be more informative or precise, we indicate it using the variable bins
dataframe.plot.hist(alpha=0.5, stacked=True, bins=20)
17.5. Box Plots
To create a box-and-whisker plot we use plot.box()
dataframe = pd.DataFrame(np.random.rand(10, 5), columns=["A", "B", "C", "D", "E"])
dataframe.plot.box()
17.6. Area Charts
To create an area chart we use plot.area()
dataframe.plot.area()
17.7. Scatter Plot
To create a scatter plot we use plot.scatter()
, where we need to specify the x
and y
variables of the plot
dataframe.plot.scatter(x='A', y='B')
17.8. Hexbin Plot
To create a hexagonal bin plot we use plot.hexbin()
, where we need to specify the x
and y
variables of the diagram and the mesh size using gridsize
.
dataframe = pd.DataFrame(np.random.randn(1000, 2), columns=["a", "b"])
dataframe["b"] = dataframe["b"] + np.arange(1000)
dataframe.plot.hexbin(x="a", y="b", gridsize=25)