Data management with Pandas

Data management with Pandas Data management with Pandas

Data Handling with Pandaslink image 69

Disclaimer: This post has been translated to English using a machine translation model. Please, let me know if you find any mistakes.

1. Summarylink image 70

Let's take a small introduction to the data manipulation and analysis library Pandas. With it, we will be able to handle and process tabular data which will help us operate with it and obtain information in a very valuable way.

No description has been provided for this image

2. What is Pandas?link image 71

Pandas is a Python library that is designed to make working with relational or labeled data easy and intuitive Pandas is designed for many different types of data:

  • Tabular data with columns of heterogeneous types, such as in an SQL table or an Excel spreadsheet* Ordered and unordered time series data (not necessarily of fixed frequency).* Arbitrary matrix data (homogeneous or heterogeneous) with row and column labels* Any other form of observational/statistical datasets. It is not necessary to label the data at all to place it in a pandas data structure.

The two main data structures in Pandas are Series (one-dimensional) and DataFrame (two-dimensional). Pandas is built on top of NumPy and is designed to integrate well within a scientific computing environment with many other third-party libraries. For data scientists, working with data generally involves several stages: collecting and cleaning data, analyzing/modeling it, and then organizing the analysis results into a suitable form for plotting or displaying in tabular form. pandas is the ideal tool for all these tasks. Another feature is that pandas is fast, many of the low-level algorithms have been built in C.

2.1. Pandas as pdlink image 72

Generally, when importing pandas, it is usually imported with the alias pd.

	
import pandas as pd
print(pd.__version__)
Copy
	
1.0.1

3. Pandas Data Structureslink image 73

In Pandas, there are two types of data structures: Series and DataFrames.

3.1. Serieslink image 74

The Series data type is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). It is divided into indices.

To create a Series data type the most common way is

```series = pd.Series(data, index=index)
Where `data` can be* A dictionary* A list or tuple* A Numpy ndarray* A scalar value
      

As one of the data types can be a NumPy ndarray, we import NumPy in order to use it

	
import numpy as np
Copy

3.1.1. Series from a Dictionarylink image 75

	
import numpy as np
diccionario = {"b": 1, "a": 0, "c": 2}
serie = pd.Series(diccionario)
serie
Copy
	
b 1
a 0
c 2
dtype: int64

If an index is passed, the values of the data corresponding to the index labels will be extracted. If they do not exist, they are created as NaN (not a number).

	
diccionario = {"b": 1, "a": 0, "c": 2}
serie = pd.Series(diccionario, index=["b", "c", "d", "a"])
serie
Copy
	
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64

3.1.2. Series from a list or tuplelink image 76

If the data comes from a list or tuple and no index is passed, one will be created with values [0, ..., len(data)-1]

	
serie = pd.Series([1, 2, 3, 4])
serie
Copy
	
0 1
1 2
2 3
3 4
dtype: int64

If an index is passed, it must have the same length as the data

	
serie = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
serie
Copy
	
a 1
b 2
c 3
d 4
dtype: int64

3.1.3. Series from an ndarraylink image 77

If the data comes from an ndarray and no index is passed, one with values [0, ..., len(data)-1] will be created.

	
serie = pd.Series(np.random.randn(5))
serie
Copy
	
0 1.267865
1 -0.877857
2 -0.138556
3 -0.132987
4 -0.827295
dtype: float64

If an index is passed, it must have the same length as the data

	
serie = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
serie
Copy
	
a -1.091828
b -0.584243
c 0.220398
d 1.248923
e 1.652351
dtype: float64

3.1.4. Series from a Scalarlink image 78

If the series is created from a scalar, it will be created with a single item

	
serie = pd.Series(5.0)
serie
Copy
	
0 5.0
dtype: float64

If you want to create more items in the series, you need to pass the index with the number of items you want, so that all items will have the value of the scalar

	
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
serie
Copy
	
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64

3.1.5. Operations with Serieslink image 79

Just like with Numpy, we can perform operations on all the elements of a series without having to iterate over each one of them.

	
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(f"serie:\n{serie}")
print(f"\nserie + serie =\n{serie + serie}")
Copy
	
serie:
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64
serie + serie =
a 10.0
b 10.0
c 10.0
d 10.0
e 10.0
dtype: float64
	
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(f"serie:\n{serie}")
print(f"\nexp(serie) =\n{np.exp(serie)}")
Copy
	
serie:
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64
exp(serie) =
a 148.413159
b 148.413159
c 148.413159
d 148.413159
e 148.413159
dtype: float64

One difference between Series and ndarrays is that operations between Series automatically align the data according to their labels. Therefore, calculations can be written without considering whether the Series involved have the same labels. If a label is not found in one Series or another, the result will be marked as missing (NaN).

	
serie = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(f"serie:\n{serie}")
print(f"\nserie[1:] + serie[:-1] =\n{serie[1:] + serie[:-1]}")
Copy
	
serie:
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64
serie[1:] + serie[:-1] =
a NaN
b 10.0
c 10.0
d 10.0
e NaN
dtype: float64

3.1.6. Series Name Attributelink image 80

One of the attributes of Series is name, which corresponds to the name they will have when added to a DataFrame. Conversely, when a series is obtained from a DataFrame, this series will have the name it had in the DataFrame.

	
serie = pd.Series(np.random.randn(5), name="aleatorio")
serie
Copy
	
0 -0.191009
1 -0.793151
2 -0.907747
3 -1.440508
4 -0.676419
Name: aleatorio, dtype: float64

You can change the name of a series using the rename() method

	
serie = serie.rename("random")
serie
Copy
	
0 -0.191009
1 -0.793151
2 -0.907747
3 -1.440508
4 -0.676419
Name: random, dtype: float64

3.2. DataFrameslink image 81

A DataFrame is a labeled, two-dimensional data structure with columns of potentially different types, meaning that one column can have integer data, another column string data, etc. You can think of it as a spreadsheet or an SQL table, or a dictionary of Series objects. It is the most commonly used pandas object. Like Series, DataFrames accept many different types of input:

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and/or columns, you are ensuring the index and/or columns of the resulting DataFrame. Therefore, a dictionary of Series plus a specific index will discard all data that does not match the passed index.

If the axis labels are not provided, they will be constructed from the input data based on common sense rules.

3.2.1. DataFrames from a dictionary of Serieslink image 82

If a dictionary with Series is passed, the DataFrame will be created with as many columns as there are Series in the dictionary.

diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
      
dataframe = pd.DataFrame(diccionario)
dataframe
      
Out[87]:
uno dos
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 NaN 7.0

If each of the Series has defined indexes, the resulting DataFrame will be the union of these indexes

diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0], index=["a", "b", "c", "d"])
}
      
dataframe = pd.DataFrame(diccionario)
dataframe
      
Out[88]:
uno dos
a 1.0 4.0
b 2.0 5.0
c 3.0 6.0
d NaN 7.0
dataframe = pd.DataFrame(diccionario, index=["d", "b", "a"])
dataframe
      
Out[89]:
uno dos
d NaN 7.0
b 2.0 5.0
a 1.0 4.0

If the columns are passed, they will appear in the given order

dataframe = pd.DataFrame(diccionario, columns=["dos", "tres"])
dataframe
      
Out[90]:
dos tres
a 4.0 NaN
b 5.0 NaN
c 6.0 NaN
d 7.0 NaN

3.2.2. DataFrames from a dictionary of ndarrays or listslink image 83

All ndarrays or lists must have the same length. If an index is passed, it must also have the same length as the ndarrays or lists.

diccionario = {
"uno": [1.0, 2.0, 3.0, 4.0],
"dos": [4.0, 3.0, 2.0, 1.0]
}
      
dataframe = pd.DataFrame(diccionario)
dataframe
      
Out[91]:
uno dos
0 1.0 4.0
1 2.0 3.0
2 3.0 2.0
3 4.0 1.0

3.2.3. DataFrames from a matrixlink image 84

If an index is passed, it must have the same length as the number of rows in the matrix, and if columns are passed, they must have the same length as the columns in the matrix.

matriz = np.array([[1, 3], [2, 2], [3, 1]])
      
dataframe = pd.DataFrame(matriz, index=["a", "b", "c"], columns=["columna1", "columna2"])
dataframe
      
Out[92]:
columna1 columna2
a 1 3
b 2 2
c 3 1

3.2.4. DataFrames from a list of dictionarieslink image 85

lista = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
      
dataframe = pd.DataFrame(lista)
dataframe
      
Out[93]:
a b c
0 1 2 NaN
1 5 10 20.0

3.2.5. DataFrames from a dictionary of tupleslink image 86

diccionario = {
("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
}
      
dataframe = pd.DataFrame(diccionario)
dataframe
      
Out[94]:
a b
b a c a b
A B 1.0 4.0 5.0 8.0 10.0
C 2.0 3.0 6.0 7.0 NaN
D NaN NaN NaN NaN 9.0

3.2.6. DataFrames from a Serieslink image 87

The result will be a DataFrame with the same index as the input Series, and with a column whose name is the original name of the Series (only if no other column name is provided).

diccionario = {"b": 1, "a": 0, "c": 2}
      
serie = pd.Series(diccionario)
      
dataframe = pd.DataFrame(serie)
dataframe
      
Out[95]:
0
b 1
a 0
c 2

4. Exploring a DataFramelink image 88

When a DataFrame is very large it cannot be fully displayed

california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train
      
Out[102]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0
... ... ... ... ... ... ... ... ... ...
16995 -124.26 40.58 52.0 2217.0 394.0 907.0 369.0 2.3571 111400.0
16996 -124.27 40.69 36.0 2349.0 528.0 1194.0 465.0 2.5179 79000.0
16997 -124.30 41.84 17.0 2677.0 531.0 1244.0 456.0 3.0313 103600.0
16998 -124.30 41.80 19.0 2672.0 552.0 1298.0 478.0 1.9797 85800.0
16999 -124.35 40.54 52.0 1820.0 300.0 806.0 270.0 3.0147 94600.0

17000 rows × 9 columns

So it is very useful to have methods for exploring it and obtaining information quickly.

4.1. Head of the DataFramelink image 89

To view the first few rows and get an idea of what the DataFrame looks like, there is the head() method, which by default shows the first 5 rows of the DataFrame. If you want to see a different number of rows, specify it using the n attribute.

california_housing_train.head(n=10)
      
Out[103]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0
5 -114.58 33.63 29.0 1387.0 236.0 671.0 239.0 3.3438 74000.0
6 -114.58 33.61 25.0 2907.0 680.0 1841.0 633.0 2.6768 82400.0
7 -114.59 34.83 41.0 812.0 168.0 375.0 158.0 1.7083 48500.0
8 -114.59 33.61 34.0 4789.0 1175.0 3134.0 1056.0 2.1782 58400.0
9 -114.60 34.83 46.0 1497.0 309.0 787.0 271.0 2.1908 48100.0

4.2. DataFrame Taillink image 90

If you want to see the last rows, you can use the tail() method, using the n attribute to choose how many rows to display.

california_housing_train.tail()
      
Out[104]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
16995 -124.26 40.58 52.0 2217.0 394.0 907.0 369.0 2.3571 111400.0
16996 -124.27 40.69 36.0 2349.0 528.0 1194.0 465.0 2.5179 79000.0
16997 -124.30 41.84 17.0 2677.0 531.0 1244.0 456.0 3.0313 103600.0
16998 -124.30 41.80 19.0 2672.0 552.0 1298.0 478.0 1.9797 85800.0
16999 -124.35 40.54 52.0 1820.0 300.0 806.0 270.0 3.0147 94600.0

4.3. DataFrame Informationlink image 91

Another very useful method is info() which gives us information about the DataFrame

	
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0], index=["a", "b", "c", "d"])
}
dataframe = pd.DataFrame(diccionario)
dataframe
dataframe = pd.DataFrame(diccionario, index=["d", "b", "a"])
dataframe
dataframe = pd.DataFrame(diccionario, columns=["dos", "tres"])
dataframe
diccionario = {
"uno": [1.0, 2.0, 3.0, 4.0],
"dos": [4.0, 3.0, 2.0, 1.0]
}
dataframe = pd.DataFrame(diccionario)
dataframe
matriz = np.array([[1, 3], [2, 2], [3, 1]])
dataframe = pd.DataFrame(matriz, index=["a", "b", "c"], columns=["columna1", "columna2"])
dataframe
lista = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
dataframe = pd.DataFrame(lista)
dataframe
diccionario = {
("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
}
dataframe = pd.DataFrame(diccionario)
dataframe
diccionario = {"b": 1, "a": 0, "c": 2}
serie = pd.Series(diccionario)
dataframe = pd.DataFrame(serie)
dataframe
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train
california_housing_train.head(n=10)
california_housing_train.tail()
california_housing_train.info()
Copy
	
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 17000 non-null float64
1 latitude 17000 non-null float64
2 housing_median_age 17000 non-null float64
3 total_rooms 17000 non-null float64
4 total_bedrooms 17000 non-null float64
5 population 17000 non-null float64
6 households 17000 non-null float64
7 median_income 17000 non-null float64
8 median_house_value 17000 non-null float64
dtypes: float64(9)
memory usage: 1.2 MB

4.4. DataFrame Rows and Columnslink image 92

Indices and columns of a DataFrame can be obtained using the methods index and columns.

	
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0], index=["a", "b", "c", "d"])
}
dataframe = pd.DataFrame(diccionario)
indices = dataframe.index
columnas = dataframe.columns
print(f"El DataFrame tiene los índices {indices}\n")
print(f"El DataFrame tiene las columnas {columnas}")
Copy
	
El DataFrame tiene los índices
Index(['a', 'b', 'c', 'd'], dtype='object')
El DataFrame tiene las columnas
Index(['uno', 'dos'], dtype='object')

4.5. DataFrame Descriptionlink image 93

The describe() method shows a quick statistical summary of the DataFrame data.

california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.describe()
      
Out[107]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000
mean -119.562108 35.625225 28.589353 2643.664412 539.410824 1429.573941 501.221941 3.883578 207300.912353
std 2.005166 2.137340 12.586937 2179.947071 421.499452 1147.852959 384.520841 1.908157 115983.764387
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.790000 33.930000 18.000000 1462.000000 297.000000 790.000000 282.000000 2.566375 119400.000000
50% -118.490000 34.250000 29.000000 2127.000000 434.000000 1167.000000 409.000000 3.544600 180400.000000
75% -118.000000 37.720000 37.000000 3151.250000 648.250000 1721.000000 605.250000 4.767000 265000.000000
max -114.310000 41.950000 52.000000 37937.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000

4.6. Sorting the DataFramelink image 94

Rows of a DataFrame can be sorted alphabetically using the sort_index() method.

california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.sort_index().head()
      
Out[108]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0

As in this case the rows were already sorted, we set ascending=False so that the order is reversed

california_housing_train.sort_index(ascending=False).head()
      
Out[109]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
16999 -124.35 40.54 52.0 1820.0 300.0 806.0 270.0 3.0147 94600.0
16998 -124.30 41.80 19.0 2672.0 552.0 1298.0 478.0 1.9797 85800.0
16997 -124.30 41.84 17.0 2677.0 531.0 1244.0 456.0 3.0313 103600.0
16996 -124.27 40.69 36.0 2349.0 528.0 1194.0 465.0 2.5179 79000.0
16995 -124.26 40.58 52.0 2217.0 394.0 907.0 369.0 2.3571 111400.0

If what you want is to sort the columns, you need to introduce axis=1 because by default it is 0

california_housing_train.sort_index(axis=1).head()
      
Out[110]:
households housing_median_age latitude longitude median_house_value median_income population total_bedrooms total_rooms
0 472.0 15.0 34.19 -114.31 66900.0 1.4936 1015.0 1283.0 5612.0
1 463.0 19.0 34.40 -114.47 80100.0 1.8200 1129.0 1901.0 7650.0
2 117.0 17.0 33.69 -114.56 85700.0 1.6509 333.0 174.0 720.0
3 226.0 14.0 33.64 -114.57 73400.0 3.1917 515.0 337.0 1501.0
4 262.0 20.0 33.57 -114.57 65500.0 1.9250 624.0 326.0 1454.0

If what we want is to sort the DataFrame by a specific column, we need to use the sort_values() method and specify the label of the column we want to sort by.

california_housing_train.sort_values('median_house_value')
      
Out[111]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
568 -117.02 36.40 19.0 619.0 239.0 490.0 164.0 2.1000 14999.0
16643 -122.74 39.71 16.0 255.0 73.0 85.0 38.0 1.6607 14999.0
16801 -123.17 40.31 36.0 98.0 28.0 18.0 8.0 0.5360 14999.0
3226 -117.86 34.24 52.0 803.0 267.0 628.0 225.0 4.1932 14999.0
7182 -118.33 34.15 39.0 493.0 168.0 259.0 138.0 2.3667 17500.0
... ... ... ... ... ... ... ... ... ...
15834 -122.42 37.81 52.0 1314.0 317.0 473.0 250.0 4.3472 500001.0
7927 -118.40 33.87 38.0 2398.0 431.0 911.0 392.0 5.2319 500001.0
3546 -117.90 33.63 28.0 2370.0 352.0 832.0 347.0 7.1148 500001.0
7924 -118.40 33.88 35.0 1060.0 191.0 444.0 196.0 8.0015 500001.0
14011 -122.04 37.26 24.0 4973.0 709.0 1692.0 696.0 7.8627 500001.0

17000 rows × 9 columns

4.7. DataFrame Statisticslink image 95

You can obtain statistics from the DataFrame, such as the mean, mode, standard deviation

	
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.describe()
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.sort_index().head()
california_housing_train.sort_index(ascending=False).head()
california_housing_train.sort_index(axis=1).head()
california_housing_train.sort_values('median_house_value')
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
print(f"media:\n{california_housing_train.mean()}")
print(f" desviación estandar:\n{california_housing_train.std()}")
Copy
	
media:
longitude -119.562108
latitude 35.625225
housing_median_age 28.589353
total_rooms 2643.664412
total_bedrooms 539.410824
population 1429.573941
households 501.221941
median_income 3.883578
median_house_value 207300.912353
dtype: float64
desviación estandar:
longitude 2.005166
latitude 2.137340
housing_median_age 12.586937
total_rooms 2179.947071
total_bedrooms 421.499452
population 1147.852959
households 384.520841
median_income 1.908157
median_house_value 115983.764387
dtype: float64

If you want to obtain statistics on the rows instead of the columns, you need to indicate it with axis=1

	
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
print(f"media:\n{california_housing_train.mean(axis=1)}")
print(f" desviación estandar:\n{california_housing_train.std(axis=1)}")
Copy
	
media:
0 8357.597067
1 10131.527778
2 9664.642322
3 8435.029078
4 7567.436111
...
16995 12806.408567
16996 9276.770878
16997 12049.507922
16998 10082.053300
16999 10863.022744
Length: 17000, dtype: float64
desviación estandar:
0 22026.612445
1 26352.939272
2 28514.316588
3 24366.754747
4 21730.014569
...
16995 36979.676899
16996 26158.006771
16997 34342.876792
16998 28408.152329
16999 31407.119788
Length: 17000, dtype: float64

Another useful thing that can be obtained from DataFrames is, for example, the number of times each item in a column repeats.

	
california_housing_train["total_rooms"].value_counts()
Copy
	
1582.0 16
1527.0 15
1717.0 14
1471.0 14
1703.0 14
..
157.0 1
2760.0 1
458.0 1
10239.0 1
4068.0 1
Name: total_rooms, Length: 5533, dtype: int64

For example, we can see that there are a total of 16 houses with 1582 rooms.

4.8. Used Memorylink image 96

We can see the memory used by the dataframe

	
california_housing_train.memory_usage(deep=True)
Copy
	
Index 128
longitude 136000
latitude 136000
housing_median_age 136000
total_rooms 136000
total_bedrooms 136000
population 136000
households 136000
median_income 136000
median_house_value 136000
dtype: int64

5. Adding Datalink image 97

5.1. Adding Columnslink image 98

Columns can be easily added as operations from other columns

diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
      
dataframe = pd.DataFrame(diccionario)
      
dataframe["tres"] = dataframe["uno"] + dataframe["dos"]
dataframe["flag"] = dataframe["tres"] > 7.0
      
dataframe
      
Out[ ]:
uno dos tres flag
0 1.0 4.0 5.0 False
1 2.0 5.0 7.0 False
2 3.0 6.0 9.0 True
3 NaN 7.0 NaN False

Columns can also be added indicating what value all their items will have.

dataframe["constante"] = 8.0
dataframe
      
Out[ ]:
uno dos tres flag constante
0 1.0 4.0 5.0 False 8.0
1 2.0 5.0 7.0 False 8.0
2 3.0 6.0 9.0 True 8.0
3 NaN 7.0 NaN False 8.0

If a Series is added that does not have the same number of indices as the DataFrame, it will be adjusted to the number of indices of the DataFrame

dataframe["Menos indices"] = dataframe["uno"][:2]
dataframe
      
Out[ ]:
uno dos tres flag constante Menos indices
0 1.0 4.0 5.0 False 8.0 1.0
1 2.0 5.0 7.0 False 8.0 2.0
2 3.0 6.0 9.0 True 8.0 NaN
3 NaN 7.0 NaN False 8.0 NaN

With the previous methods, the column was added at the end, but if you want to add the column at a specific position, you can use the insert() method. For example, if you want to add a column in position 3 (considering that counting starts from position 0), with the column name being inserted column and its value being double that of column three, it would be done as follows

dataframe.insert(loc=3, column="columna insertada", value=dataframe["tres"]*2)
dataframe
      
Out[ ]:
uno dos tres columna insertada flag constante Menos indices
0 1.0 4.0 5.0 10.0 False 8.0 1.0
1 2.0 5.0 7.0 14.0 False 8.0 2.0
2 3.0 6.0 9.0 18.0 True 8.0 NaN
3 NaN 7.0 NaN NaN False 8.0 NaN

If you want to add more than one column per command, you can use the assign() method.

dataframe = dataframe.assign(
columna_asignada1 = dataframe["uno"] * dataframe["tres"],
columna_asignada2 = dataframe["dos"] * dataframe["tres"],
)
dataframe
      
Out[ ]:
uno dos tres columna insertada flag constante Menos indices columna_asignada1 columna_asignada2
0 1.0 4.0 5.0 10.0 False 8.0 1.0 5.0 20.0
1 2.0 5.0 7.0 14.0 False 8.0 2.0 14.0 35.0
2 3.0 6.0 9.0 18.0 True 8.0 NaN 27.0 54.0
3 NaN 7.0 NaN NaN False 8.0 NaN NaN NaN

5.2. Adding Rowslink image 99

diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
      
dataframe = pd.DataFrame(diccionario)
      
dataframe.head()
      
Out[20]:
uno dos
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 NaN 7.0

We can add a row at the end with the concat method (which we will see in more detail later)

diccionario = {
"uno": [10.0],
"dos": [20.0]
}
      
dataframe = pd.concat([dataframe, pd.DataFrame(diccionario)])
dataframe
      
Out[21]:
uno dos
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 NaN 7.0
0 10.0 20.0

We see that the column has been added at the end, but it has index zero, so we reorder the indexes using the method reset_index(drop=True)

dataframe = dataframe.reset_index(drop=True)
dataframe
      
Out[23]:
index uno dos
0 0 1.0 4.0
1 1 2.0 5.0
2 2 3.0 6.0
3 3 NaN 7.0
4 0 10.0 20.0

6. Data Deletionlink image 100

6.1. Removal of Columnslink image 101

A specific column can be removed using the pop() method

dataframe.pop("constante")
dataframe
      
Out[ ]:
uno dos tres columna insertada flag Menos indices columna_asignada1 columna_asignada2
0 1.0 4.0 5.0 10.0 False 1.0 5.0 20.0
1 2.0 5.0 7.0 14.0 False 2.0 14.0 35.0
2 3.0 6.0 9.0 18.0 True NaN 27.0 54.0
3 NaN 7.0 NaN NaN False NaN NaN NaN

Or using del

del dataframe["flag"]
dataframe
      
Out[ ]:
uno dos tres columna insertada Menos indices columna_asignada1 columna_asignada2
0 1.0 4.0 5.0 10.0 1.0 5.0 20.0
1 2.0 5.0 7.0 14.0 2.0 14.0 35.0
2 3.0 6.0 9.0 18.0 NaN 27.0 54.0
3 NaN 7.0 NaN NaN NaN NaN NaN

6.1. Row Removallink image 102

diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]),
"dos": pd.Series([11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]),
"tres": pd.Series([21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0])
}
      
dataframe = pd.DataFrame(diccionario)
      
dataframe.head()
      
Out[49]:
uno dos tres
0 1.0 11.0 21.0
1 2.0 12.0 22.0
2 3.0 13.0 23.0
3 4.0 14.0 24.0
4 5.0 15.0 25.0

If we want to delete a row, we can use the drop method, specifying its position. For example, if we want to delete the row at position 1

dataframe = dataframe.drop(1)
dataframe
      
Out[50]:
uno dos tres
0 1.0 11.0 21.0
2 3.0 13.0 23.0
3 4.0 14.0 24.0
4 5.0 15.0 25.0
5 6.0 16.0 26.0
6 7.0 17.0 27.0
7 8.0 18.0 28.0
8 9.0 19.0 29.0
9 10.0 20.0 30.0

If we want to delete the last row

dataframe = dataframe.drop(len(dataframe)-1)
dataframe
      
Out[51]:
uno dos tres
0 1.0 11.0 21.0
2 3.0 13.0 23.0
3 4.0 14.0 24.0
4 5.0 15.0 25.0
5 6.0 16.0 26.0
6 7.0 17.0 27.0
7 8.0 18.0 28.0
9 10.0 20.0 30.0

If what we want is to delete a range of rows

dataframe = dataframe.drop(range(2, 5))
dataframe
      
Out[52]:
uno dos tres
0 1.0 11.0 21.0
5 6.0 16.0 26.0
6 7.0 17.0 27.0
7 8.0 18.0 28.0
9 10.0 20.0 30.0

If what we want is to delete a specific set of rows

dataframe = dataframe.drop([5, 7, 9])
dataframe
      
Out[53]:
uno dos tres
0 1.0 11.0 21.0
6 7.0 17.0 27.0

Just like when we add rows, we see that some indices have been removed, so we reorder the indices using the reset_index(drop=True) method.

dataframe = dataframe.reset_index(drop=True)
dataframe
      
Out[54]:
uno dos tres
0 1.0 11.0 21.0
1 7.0 17.0 27.0

7. Operations on DataFrameslink image 103

You can perform operations on DataFrames just as you could with Numpy

diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]),
"dos": pd.Series([11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]),
"tres": pd.Series([21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0])
}
      
dataframe = pd.DataFrame(diccionario)
      
dataframe.head()
      
Out[58]:
uno dos tres
0 1.0 11.0 21.0
1 2.0 12.0 22.0
2 3.0 13.0 23.0
3 4.0 14.0 24.0
4 5.0 15.0 25.0
dataframe[ ["uno", "dos", "tres"] ] * 2
      
Out[59]:
uno dos tres
0 2.0 22.0 42.0
1 4.0 24.0 44.0
2 6.0 26.0 46.0
3 8.0 28.0 48.0
4 10.0 30.0 50.0
5 12.0 32.0 52.0
6 14.0 34.0 54.0
7 16.0 36.0 56.0
8 18.0 38.0 58.0
9 20.0 40.0 60.0
np.exp(dataframe[ ["uno", "dos", "tres"] ])
      
Out[60]:
uno dos tres
0 2.718282 5.987414e+04 1.318816e+09
1 7.389056 1.627548e+05 3.584913e+09
2 20.085537 4.424134e+05 9.744803e+09
3 54.598150 1.202604e+06 2.648912e+10
4 148.413159 3.269017e+06 7.200490e+10
5 403.428793 8.886111e+06 1.957296e+11
6 1096.633158 2.415495e+07 5.320482e+11
7 2980.957987 6.565997e+07 1.446257e+12
8 8103.083928 1.784823e+08 3.931334e+12
9 22026.465795 4.851652e+08 1.068647e+13

If you want to perform more complex operations, you can use the apply() method.

	
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe["tres"] = dataframe["uno"] + dataframe["dos"]
dataframe["flag"] = dataframe["tres"] > 7.0
dataframe
dataframe["constante"] = 8.0
dataframe
dataframe["Menos indices"] = dataframe["uno"][:2]
dataframe
dataframe.insert(loc=3, column="columna insertada", value=dataframe["tres"]*2)
dataframe
dataframe = dataframe.assign(
columna_asignada1 = dataframe["uno"] * dataframe["tres"],
columna_asignada2 = dataframe["dos"] * dataframe["tres"],
)
dataframe
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe.head()
diccionario = {
"uno": [10.0],
"dos": [20.0]
}
dataframe = pd.concat([dataframe, pd.DataFrame(diccionario)])
dataframe
dataframe = dataframe.reset_index(drop=True)
dataframe
dataframe.pop("constante")
dataframe
del dataframe["flag"]
dataframe
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]),
"dos": pd.Series([11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]),
"tres": pd.Series([21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe.head()
dataframe = dataframe.drop(1)
dataframe
dataframe = dataframe.drop(len(dataframe)-1)
dataframe
dataframe = dataframe.drop(range(2, 5))
dataframe
dataframe = dataframe.drop([5, 7, 9])
dataframe
dataframe = dataframe.reset_index(drop=True)
dataframe
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]),
"dos": pd.Series([11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]),
"tres": pd.Series([21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe.head()
dataframe[ ["uno", "dos", "tres"] ] * 2
np.exp(dataframe[ ["uno", "dos", "tres"] ])
dataframe = dataframe.apply(lambda x: x.max() - x.min())
dataframe
Copy
	
uno 9.0
dos 9.0
tres 9.0
dtype: float64

A lambda function has been used because it is a simple function, but in case we want to apply more complex functions, we can define and apply them.

	
def funcion(x):
if x < 10:
return np.exp(x) - np.log(5*x) + np.sqrt(x)
elif x < 20:
return np.sin(x) + np.cos(x) + np.tan(x)
else:
return np.log(x) + np.log10(x) + np.log2(x)
dataframe = dataframe.apply(funcion)
dataframe
Copy
	
uno 8102.277265
dos 8102.277265
tres 8102.277265
dtype: float64

Using the apply method in a dataframe is much faster than using a for loop for each of the rows and performing the operation

california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.head()
      
Out[64]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0

Let's calculate the percentage of bedrooms out of the total number of rooms

california_housing_train["percent_bedrooms"] = None
%time california_housing_train["percent_bedrooms"] = california_housing_train.apply(lambda x: x["total_bedrooms"] / x["total_rooms"], axis=1)
california_housing_train.head()
      
CPU times: user 309 ms, sys: 86 µs, total: 309 ms
      Wall time: 309 ms
      
Out[74]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value percent_bedrooms
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0 0.228617
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0 0.248497
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0 0.241667
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0 0.224517
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0 0.224209
california_housing_train["percent_bedrooms"] = None
%time for i in range(len(california_housing_train)): california_housing_train["percent_bedrooms"][i] = california_housing_train["total_bedrooms"][i] / california_housing_train["total_rooms"][i]
california_housing_train.head()
      
/home/wallabot/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
      A value is trying to be set on a copy of a slice from a DataFrame
      
      See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
        """Entry point for launching an IPython kernel.
      
CPU times: user 1.72 s, sys: 12 ms, total: 1.73 s
      Wall time: 1.72 s
      
Out[77]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value percent_bedrooms
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0 0.228617
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0 0.248497
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0 0.241667
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0 0.224517
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0 0.224209

With the lambda function it took about 300 ms, whereas with the for loop it took over 1 second

8. Transposelink image 104

The transpose of a DataFrame can be done using the T method

diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0], index=["fila a", "fila b", "fila c"]),
"dos": pd.Series([4.0, 5.0, 6.0], index=["fila a", "fila b", "fila c"])
}
      
dataframe = pd.DataFrame(diccionario)
      
dataframe["tres"] = dataframe["uno"] + dataframe["dos"]
dataframe["flag"] = dataframe["tres"] > 7.0
      
dataframe.T
      
Out[ ]:
fila a fila b fila c
uno 1 2 3
dos 4 5 6
tres 5 7 9
flag False False True

9. Conversion to Numpylink image 105

If you want to convert a Series or DataFrame to NumPy you can use the to_numpy() method or use the np.asarray() function.

diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0], index=["fila a", "fila b", "fila c"]),
"dos": pd.Series([4.0, 5.0, 6.0], index=["fila a", "fila b", "fila c"])
}
      
dataframe = pd.DataFrame(diccionario)
      
dataframe["tres"] = dataframe["uno"] + dataframe["dos"]
dataframe["flag"] = dataframe["tres"] > 7.0
      
dataframe
      
Out[ ]:
uno dos tres flag
fila a 1.0 4.0 5.0 False
fila b 2.0 5.0 7.0 False
fila c 3.0 6.0 9.0 True
	
california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.head()
california_housing_train["percent_bedrooms"] = None
%time california_housing_train["percent_bedrooms"] = california_housing_train.apply(lambda x: x["total_bedrooms"] / x["total_rooms"], axis=1)
california_housing_train.head()
california_housing_train["percent_bedrooms"] = None
%time for i in range(len(california_housing_train)): california_housing_train["percent_bedrooms"][i] = california_housing_train["total_bedrooms"][i] / california_housing_train["total_rooms"][i]
california_housing_train.head()
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0], index=["fila a", "fila b", "fila c"]),
"dos": pd.Series([4.0, 5.0, 6.0], index=["fila a", "fila b", "fila c"])
}
dataframe = pd.DataFrame(diccionario)
dataframe["tres"] = dataframe["uno"] + dataframe["dos"]
dataframe["flag"] = dataframe["tres"] > 7.0
dataframe.T
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0], index=["fila a", "fila b", "fila c"]),
"dos": pd.Series([4.0, 5.0, 6.0], index=["fila a", "fila b", "fila c"])
}
dataframe = pd.DataFrame(diccionario)
dataframe["tres"] = dataframe["uno"] + dataframe["dos"]
dataframe["flag"] = dataframe["tres"] > 7.0
dataframe
matriz_np = dataframe.to_numpy()
matriz_np
Copy
	
CPU times: user 309 ms, sys: 86 µs, total: 309 ms
Wall time: 309 ms
/home/wallabot/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
array([[1.0, 4.0, 5.0, False],
[2.0, 5.0, 7.0, False],
[3.0, 6.0, 9.0, True]], dtype=object)
	
matriz_np = np.asarray(dataframe)
matriz_np
Copy
	
array([[1.0, 4.0, 5.0, False],
[2.0, 5.0, 7.0, False],
[3.0, 6.0, 9.0, True]], dtype=object)

This example is not the most suitable, as it mixes numbers with booleans, and as we already explained in the previous post Matrix Calculation with NumPy, all elements of an ndarray must be of the same type. In this case, we are mixing numbers with booleans, so to solve it NumPy converts them all to objects. To solve this, we keep only the numbers and convert them to an ndarray

	
matriz_np = dataframe[ ["uno", "dos", "tres"] ].to_numpy()
matriz_np, matriz_np.dtype
Copy
	
(array([[1., 4., 5.],
[2., 5., 7.],
[3., 6., 9.]]), dtype('float64'))

Now it can be seen that an ndarray has been created where all the data is of type float

10. Reading Data from External Sourceslink image 106

One of the greatest strengths of Pandas is its ability to read data from files, so it is not necessary to create a DataFrame with the data to be processed, but rather it can be read from a file In the same way that DataFrames can be created from external files, DataFrames can also be saved to files, allowing you to create your own dataset, configure it as you wish, and save it to a file for later use. The following table shows the functions for reading and writing files in different formats

|Formato|Tipo de archivo|Función de lectura|Función de escritura||---|---|---|---| |texto|CSV|read_csv|to_csv| |texto|Fixed-Width Text File|read_fwf|| |texto|JSON|read_json|to_json| |texto|HTML|read_html|to_html| |texto|Local clipboard|read_clipboard|to_clipboard| |binary|MS Excel|read_excel|to_excel| |binary|OpenDocument|read_excel| |binary|HDF5 Format|read_hdf|to_hdf| |binary|Feather Format|read_feather|to_feather| |binary|Parquet Format|read_parquet|to_parquet| |binary|ORC Format|read_orc||binary|Msgpack|read_msgpack|to_msgpack| |binary|Stata|read_stata|to_stata| |binary|SAS|read_sas||binary|SPSS|read_spss||binary|Python Pickle Format|read_pickle|to_pickle| |SQL|SQL|read_sql|to_sql| |SQL|Google BigQuery|read_gbq|to_gbq|

11. Indexing in DataFrameslink image 107

There are many ways to index in DataFrames,

fechas = pd.date_range('1/1/2000', periods=8)
dataframe = pd.DataFrame(np.random.randn(8, 4), index=fechas, columns=['A', 'B', 'C', 'D'])
      
dataframe
      
Out[56]:
A B C D
2000-01-01 -0.869298 -0.210502 0.477938 0.912121
2000-01-02 -0.502425 -1.897287 -0.084122 -1.203818
2000-01-03 -0.204297 0.711485 -1.271802 -0.138120
2000-01-04 1.512985 0.726718 0.960902 0.433124
2000-01-05 0.724005 -2.127668 0.674417 -0.297808
2000-01-06 -0.217175 -0.621172 0.668163 0.170576
2000-01-07 0.352484 0.260515 -1.576033 1.263213
2000-01-08 -0.032928 -0.648313 -0.622509 0.231683

11.1. Column Indexinglink image 108

To select columns within a DataFrame we can do it by selecting the column between brackets [], or by indicating the column as if it were a method of the DataFrame.

	
fechas = pd.date_range('1/1/2000', periods=8)
dataframe = pd.DataFrame(np.random.randn(8, 4), index=fechas, columns=['A', 'B', 'C', 'D'])
dataframe
dataframe['A']
Copy
	
2000-01-01 0.813153
2000-01-02 -0.244584
2000-01-03 0.125729
2000-01-04 0.352275
2000-01-05 -2.050976
2000-01-06 -0.312296
2000-01-07 0.897837
2000-01-08 0.271403
Freq: D, Name: A, dtype: float64
	
dataframe.A
Copy
	
2000-01-01 0.813153
2000-01-02 -0.244584
2000-01-03 0.125729
2000-01-04 0.352275
2000-01-05 -2.050976
2000-01-06 -0.312296
2000-01-07 0.897837
2000-01-08 0.271403
Freq: D, Name: A, dtype: float64

If specific rows are desired, they are passed through a list

dataframe[ ['A', 'B'] ]
      
Out[ ]:
A B
2000-01-01 -0.341962 0.639913
2000-01-02 0.507258 -0.942036
2000-01-03 -1.463557 1.041023
2000-01-04 1.781058 -1.849352
2000-01-05 0.318897 -0.229218
2000-01-06 0.362064 -0.193479
2000-01-07 -0.084632 -0.112474
2000-01-08 0.739424 0.253191

11.2. Row Indexing by Positionslink image 109

You can select a range of rows from a DataFrame as follows

dataframe[0:3]
      
Out[ ]:
A B C D
2000-01-01 0.813153 -0.869356 0.934293 0.338644
2000-01-02 -0.244584 0.536352 0.322248 0.238903
2000-01-03 0.125729 2.046910 -0.877466 -0.710034

If you only want to select a single row, you need to specify a range of rows that includes only that one, for example, if you want to select row number 1.

dataframe[1:2]
      
Out[ ]:
A B C D
2000-01-02 -0.244584 0.536352 0.322248 0.238903

Another method to select a row by its position is the iloc[] method

dataframe.iloc[0:3]
      
Out[ ]:
A B C D
2000-01-01 0.813153 -0.869356 0.934293 0.338644
2000-01-02 -0.244584 0.536352 0.322248 0.238903
2000-01-03 0.125729 2.046910 -0.877466 -0.710034

If certain rows are desired, a list with their positions is passed.

dataframe.iloc[ [0, 2, 4] ]
      
Out[ ]:
A B C D
2000-01-01 -0.341962 0.639913 0.765817 0.056692
2000-01-03 -1.463557 1.041023 -1.321715 2.822735
2000-01-05 0.318897 -0.229218 -1.095593 -0.186248

11.3. Row Indexing by Labelslink image 110

To select a row by its labels, we can use the loc[] method

	
dataframe[ ['A', 'B'] ]
dataframe[0:3]
dataframe[1:2]
dataframe.iloc[0:3]
dataframe.iloc[ [0, 2, 4] ]
dataframe.loc['2000-01-01']
Copy
	
A 0.813153
B -0.869356
C 0.934293
D 0.338644
Name: 2000-01-01 00:00:00, dtype: float64

If you want to select a range of rows, you can index them using the colon :

dataframe.loc['2000-01-01':'2000-01-03']
      
Out[ ]:
A B C D
2000-01-01 0.813153 -0.869356 0.934293 0.338644
2000-01-02 -0.244584 0.536352 0.322248 0.238903
2000-01-03 0.125729 2.046910 -0.877466 -0.710034

If specific rows are desired, they are passed through a list

dataframe.loc[ ['2000-01-01', '2000-01-03', '2000-01-05'] ]
      
Out[ ]:
A B C D
2000-01-01 -0.341962 0.639913 0.765817 0.056692
2000-01-03 -1.463557 1.041023 -1.321715 2.822735
2000-01-05 0.318897 -0.229218 -1.095593 -0.186248

11.4. Selecting a Portion of the DataFrame by Positionslink image 111

dataframe.iloc[0:3, 0:2]
      
Out[ ]:
A B
2000-01-01 0.813153 -0.869356
2000-01-02 -0.244584 0.536352
2000-01-03 0.125729 2.046910

If specific rows and columns are desired, lists with the desired positions are provided.

dataframe.iloc[ [0, 2, 4], [0, 2] ]
      
Out[ ]:
A C
2000-01-01 -0.341962 0.765817
2000-01-03 -1.463557 -1.321715
2000-01-05 0.318897 -1.095593

11.5. Selecting a portion of the DataFrame using labelslink image 112

dataframe.loc['2000-01-01':'2000-01-03', 'A':'B']
      
Out[ ]:
A B
2000-01-01 0.813153 -0.869356
2000-01-02 -0.244584 0.536352
2000-01-03 0.125729 2.046910

If specific rows and columns are desired, lists with the desired labels are passed

dataframe.loc[ ['2000-01-01', '2000-01-03', '2000-01-05'], ['A', 'C'] ]
      
Out[ ]:
A C
2000-01-01 -0.341962 0.765817
2000-01-03 -1.463557 -1.321715
2000-01-05 0.318897 -1.095593

11.6. Indexing by Lambda Functionlink image 113

You can select data from a DataFrame that meets a condition given by a lambda function.

dataframe.loc[lambda dataframe:2*dataframe['A']+5*np.exp(dataframe['B'])>0.2]
      
Out[58]:
A B C D
2000-01-01 -0.869298 -0.210502 0.477938 0.912121
2000-01-03 -0.204297 0.711485 -1.271802 -0.138120
2000-01-04 1.512985 0.726718 0.960902 0.433124
2000-01-05 0.724005 -2.127668 0.674417 -0.297808
2000-01-06 -0.217175 -0.621172 0.668163 0.170576
2000-01-07 0.352484 0.260515 -1.576033 1.263213
2000-01-08 -0.032928 -0.648313 -0.622509 0.231683

As you can see, this form of indexing is very powerful

11.7. Conditional Indexinglink image 114

If we don't need complex functions to index, but only conditionals, we can do

dataframe[dataframe['A']>0.2]
      
Out[61]:
A B C D
2000-01-04 1.512985 0.726718 0.960902 0.433124
2000-01-05 0.724005 -2.127668 0.674417 -0.297808
2000-01-07 0.352484 0.260515 -1.576033 1.263213

We can make multiple conditions

dataframe[(dataframe['A']>0.2) & (dataframe['B']>0.2)]
      
Out[69]:
A B C D
2000-01-04 1.512985 0.726718 0.960902 0.433124
2000-01-07 0.352484 0.260515 -1.576033 1.263213

11.8. Random indexinglink image 115

Using the sample() method we will obtain a random row from the DataFrame

dataframe.sample()
      
Out[ ]:
A B C D
2000-01-06 -0.312296 0.129097 -0.991085 1.704535

If we want more than one sample, we indicate it with the attribute n

dataframe.sample(n=3)
      
Out[ ]:
A B C D
2000-01-08 0.271403 1.527116 0.144970 1.175728
2000-01-01 0.813153 -0.869356 0.934293 0.338644
2000-01-03 0.125729 2.046910 -0.877466 -0.710034

If what you want are random columns, you need to indicate it with axis=1

dataframe.sample(axis=1)
      
Out[ ]:
D
2000-01-01 0.338644
2000-01-02 0.238903
2000-01-03 -0.710034
2000-01-04 0.504410
2000-01-05 -1.601926
2000-01-06 1.704535
2000-01-07 -0.584860
2000-01-08 1.175728

If you want a single item from the DataFrame, you need to call the sample() method twice.

dataframe.sample(axis=1).sample()
      
Out[ ]:
D
2000-01-05 -1.601926

12. Merging DataFrameslink image 116

12.1. Concatenation of DataFrameslink image 117

To concatenate multiple DataFrames we use the concat() method, where it will be passed a list with the DataFrames that you want to join.

	
dataframe.loc['2000-01-01':'2000-01-03']
dataframe.loc[ ['2000-01-01', '2000-01-03', '2000-01-05'] ]
dataframe.iloc[0:3, 0:2]
dataframe.iloc[ [0, 2, 4], [0, 2] ]
dataframe.loc['2000-01-01':'2000-01-03', 'A':'B']
dataframe.loc[ ['2000-01-01', '2000-01-03', '2000-01-05'], ['A', 'C'] ]
dataframe.loc[lambda dataframe:2*dataframe['A']+5*np.exp(dataframe['B'])>0.2]
dataframe[dataframe['A']>0.2]
dataframe[(dataframe['A']>0.2) & (dataframe['B']>0.2)]
dataframe.sample()
dataframe.sample(n=3)
dataframe.sample(axis=1)
dataframe.sample(axis=1).sample()
dataframe1 = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
})
dataframe2 = pd.DataFrame(
{
"A": ["A4", "A5", "A6", "A7"],
"B": ["B4", "B5", "B6", "B7"],
"C": ["C4", "C5", "C6", "C7"],
"D": ["D4", "D5", "D6", "D7"],
})
dataframe3 = pd.DataFrame(
{
"A": ["A8", "A9", "A10", "A11"],
"B": ["B8", "B9", "B10", "B11"],
"C": ["C8", "C9", "C10", "C11"],
"D": ["D8", "D9", "D10", "D11"],
})
dataframe = pd.concat([dataframe1, dataframe2, dataframe3])
print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"dataframe3:\n{dataframe3}")
print(f"\ndataframe:\n{dataframe}")
Copy
	
dataframe1:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
dataframe2:
A B C D
0 A4 B4 C4 D4
1 A5 B5 C5 D5
2 A6 B6 C6 D6
3 A7 B7 C7 D7
dataframe3:
A B C D
0 A8 B8 C8 D8
1 A9 B9 C9 D9
2 A10 B10 C10 D10
3 A11 B11 C11 D11
dataframe:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
0 A4 B4 C4 D4
1 A5 B5 C5 D5
2 A6 B6 C6 D6
3 A7 B7 C7 D7
0 A8 B8 C8 D8
1 A9 B9 C9 D9
2 A10 B10 C10 D10
3 A11 B11 C11 D11

As you can see, the indexes 0, 1, 2, and 3 are repeated because each dataframe has those indexes. To prevent this from happening, you need to use the ignore_index=True parameter.

	
dataframe = pd.concat([dataframe1, dataframe2, dataframe3], ignore_index=True)
print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"dataframe3:\n{dataframe3}")
print(f"\ndataframe:\n{dataframe}")
Copy
	
dataframe1:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
dataframe2:
A B C D
0 A4 B4 C4 D4
1 A5 B5 C5 D5
2 A6 B6 C6 D6
3 A7 B7 C7 D7
dataframe3:
A B C D
0 A8 B8 C8 D8
1 A9 B9 C9 D9
2 A10 B10 C10 D10
3 A11 B11 C11 D11
dataframe:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11

concat

If you wanted to concatenate along the columns, you would have needed to introduce the variable axis=1

	
dataframe = pd.concat([dataframe1, dataframe2, dataframe3], axis=1)
print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"dataframe3:\n{dataframe3}")
print(f"\ndataframe:\n{dataframe}")
Copy
	
dataframe1:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
dataframe2:
A B C D
0 A4 B4 C4 D4
1 A5 B5 C5 D5
2 A6 B6 C6 D6
3 A7 B7 C7 D7
dataframe3:
A B C D
0 A8 B8 C8 D8
1 A9 B9 C9 D9
2 A10 B10 C10 D10
3 A11 B11 C11 D11
dataframe:
A B C D A B C D A B C D
0 A0 B0 C0 D0 A4 B4 C4 D4 A8 B8 C8 D8
1 A1 B1 C1 D1 A5 B5 C5 D5 A9 B9 C9 D9
2 A2 B2 C2 D2 A6 B6 C6 D6 A10 B10 C10 D10
3 A3 B3 C3 D3 A7 B7 C7 D7 A11 B11 C11 D11

12.1.1. Intersection of Concatenationlink image 118

There are two ways to perform concatenation: by taking all the indices of the DataFrames or by taking only those that match. This is determined by the join variable, which accepts the values 'outer' (default) (takes all indices) or 'inner' (only those that match).

Let's see an example of 'outer'

	
dataframe1 = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
},
index=[0, 1, 2, 3])
dataframe4 = pd.DataFrame(
{
"B": ["B2", "B3", "B6", "B7"],
"D": ["D2", "D3", "D6", "D7"],
"F": ["F2", "F3", "F6", "F7"],
},index=[2, 3, 6, 7])
dataframe = pd.concat([dataframe1, dataframe4], axis=1)
print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe4}")
print(f"\ndataframe:\n{dataframe}")
Copy
	
dataframe1:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
dataframe2:
B D F
2 B2 D2 F2
3 B3 D3 F3
6 B6 D6 F6
7 B7 D7 F7
dataframe:
A B C D B D F
0 A0 B0 C0 D0 NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN
2 A2 B2 C2 D2 B2 D2 F2
3 A3 B3 C3 D3 B3 D3 F3
6 NaN NaN NaN NaN B6 D6 F6
7 NaN NaN NaN NaN B7 D7 F7

outlier

Let's see an example of 'inner'

	
dataframe = pd.concat([dataframe1, dataframe4], axis=1, join="inner")
print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe4}")
print(f"\ndataframe:\n{dataframe}")
Copy
	
dataframe1:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
dataframe2:
B D F
2 B2 D2 F2
3 B3 D3 F3
6 B6 D6 F6
7 B7 D7 F7
dataframe:
A B C D B D F
2 A2 B2 C2 D2 B2 D2 F2
3 A3 B3 C3 D3 B3 D3 F3

inner

12.2. Merge of DataFrameslink image 119

Previously we created a new dataframe by merging several dataframes, now we can complete a dataframe with another one, for this we use merge, passing the on parameter, indicating on which column we want the merge to be done.

	
dataframe1 = pd.DataFrame(
{
"Key": ["K0", "K1", "K2", "K3"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
})
dataframe2 = pd.DataFrame(
{
"Key": ["K0", "K1", "K2", "K3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
})
dataframe = dataframe1.merge(dataframe2, on="Key")
print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
Copy
	
dataframe1:
Key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
dataframe2:
Key C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
3 K3 C3 D3
dataframe:
Key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3

In this case, the two dataframes had a key that was called the same (Key), but in the case of having dataframes where their key is called differently, we can use the parameters left_on and right_on

	
dataframe1 = pd.DataFrame(
{
"Key1": ["K0", "K1", "K2", "K3"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
})
dataframe2 = pd.DataFrame(
{
"Key2": ["K0", "K1", "K2", "K3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
})
dataframe = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2")
print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
Copy
	
dataframe1:
Key1 A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
dataframe2:
Key2 C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
3 K3 C3 D3
dataframe:
Key1 A B Key2 C D
0 K0 A0 B0 K0 C0 D0
1 K1 A1 B1 K1 C1 D1
2 K2 A2 B2 K2 C2 D2
3 K3 A3 B3 K3 C3 D3

In the event that one of the keys does not match, the merge will not be performed on that key.

	
dataframe1 = pd.DataFrame(
{
"Key1": ["K0", "K1", "K2", "K3"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
})
dataframe2 = pd.DataFrame(
{
"Key2": ["K0", "K1", "K2", np.nan],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
})
dataframe = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2")
print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
Copy
	
dataframe1:
Key1 A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
dataframe2:
Key2 C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
3 NaN C3 D3
dataframe:
Key1 A B Key2 C D
0 K0 A0 B0 K0 C0 D0
1 K1 A1 B1 K1 C1 D1
2 K2 A2 B2 K2 C2 D2

To change this behavior we can use the how parameter, which defaults to inner, but we can pass it the values left, right, and outer

	
dataframe1 = pd.DataFrame(
{
"Key1": ["K0", "K1", "K2", "K3"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
})
dataframe2 = pd.DataFrame(
{
"Key2": ["K0", "K1", "K2", np.nan],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
})
dataframe_inner = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="inner")
dataframe_left = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="left")
dataframe_right = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="right")
dataframe_outer = dataframe1.merge(dataframe2, left_on="Key1", right_on="Key2", how="outer")
print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe inner:\n{dataframe_inner}")
print(f"\ndataframe left:\n{dataframe_left}")
print(f"\ndataframe right:\n{dataframe_right}")
print(f"\ndataframe outer:\n{dataframe_outer}")
Copy
	
dataframe1:
Key1 A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
dataframe2:
Key2 C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
3 NaN C3 D3
dataframe inner:
Key1 A B Key2 C D
0 K0 A0 B0 K0 C0 D0
1 K1 A1 B1 K1 C1 D1
2 K2 A2 B2 K2 C2 D2
dataframe left:
Key1 A B Key2 C D
0 K0 A0 B0 K0 C0 D0
1 K1 A1 B1 K1 C1 D1
2 K2 A2 B2 K2 C2 D2
3 K3 A3 B3 NaN NaN NaN
dataframe right:
Key1 A B Key2 C D
0 K0 A0 B0 K0 C0 D0
1 K1 A1 B1 K1 C1 D1
2 K2 A2 B2 K2 C2 D2
3 NaN NaN NaN NaN C3 D3
dataframe outer:
Key1 A B Key2 C D
0 K0 A0 B0 K0 C0 D0
1 K1 A1 B1 K1 C1 D1
2 K2 A2 B2 K2 C2 D2
3 K3 A3 B3 NaN NaN NaN
4 NaN NaN NaN NaN C3 D3

As you can see, when left is chosen, only the values from the left dataframe are added, and when right is chosen, the values from the right dataframe are added.

12.3. Join of dataframeslink image 120

The last DataFrame joining tool is join. It is similar to merge, except that instead of looking for similarities based on specified columns, it looks for them based on the indices.

	
dataframe1 = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
},
index=["K0", "K1", "K2", "K3"])
dataframe2 = pd.DataFrame(
{
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
},
index=["K0", "K1", "K2", "K3"])
dataframe = dataframe1.join(dataframe2)
print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe:\n{dataframe}")
Copy
	
dataframe1:
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
K3 A3 B3
dataframe2:
C D
K0 C0 D0
K1 C1 D1
K2 C2 D2
K3 C3 D3
dataframe:
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 C1 D1
K2 A2 B2 C2 D2
K3 A3 B3 C3 D3

In this case, the indexes are the same, but when they are different, we can specify how to join the dataframes using the how parameter, which defaults to inner, but can also have the values left, right, and outer.

	
dataframe1 = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
},
index=["K0", "K1", "K2", "K3"])
dataframe2 = pd.DataFrame(
{
"C": ["C0", "C2", "C3", "C4"],
"D": ["D0", "D2", "D3", "D4"],
},
index=["K0", "K2", "K3", "K4"])
dataframe_inner = dataframe1.join(dataframe2, how="inner")
dataframe_left = dataframe1.join(dataframe2, how="left")
dataframe_right = dataframe1.join(dataframe2, how="right")
dataframe_outer = dataframe1.join(dataframe2, how="outer")
print(f"dataframe1:\n{dataframe1}")
print(f"dataframe2:\n{dataframe2}")
print(f"\ndataframe inner:\n{dataframe_inner}")
print(f"\ndataframe left:\n{dataframe_left}")
print(f"\ndataframe rigth:\n{dataframe_right}")
print(f"\ndataframe outer:\n{dataframe_outer}")
Copy
	
dataframe1:
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
K3 A3 B3
dataframe2:
C D
K0 C0 D0
K2 C2 D2
K3 C3 D3
K4 C4 D4
dataframe:
A B C D
K0 A0 B0 C0 D0
K2 A2 B2 C2 D2
K3 A3 B3 C3 D3
dataframe:
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 A3 B3 C3 D3
dataframe:
A B C D
K0 A0 B0 C0 D0
K2 A2 B2 C2 D2
K3 A3 B3 C3 D3
K4 NaN NaN C4 D4
dataframe:
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 A3 B3 C3 D3
K4 NaN NaN C4 D4

13. Missing Data (NaN)link image 121

In a DataFrame there may be some missing data, Pandas represents them as np.nan

diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
      
dataframe = pd.DataFrame(diccionario)
dataframe
      
Out[ ]:
uno dos
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 NaN 7.0

13.1. Removal of Rows with Missing Datalink image 122

To avoid having rows with missing data, these can be eliminated

dataframe.dropna(how="any")
      
Out[ ]:
uno dos
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0

13.2. Removing Columns with Missing Datalink image 123

dataframe.dropna(axis=1, how='any')
      
Out[ ]:
dos
0 4.0
1 5.0
2 6.0
3 7.0

13.3. Boolean mask with missing positionslink image 124

pd.isna(dataframe)
      
Out[ ]:
uno dos
0 False False
1 False False
2 False False
3 True False

13.4. Filling Missing Datalink image 125

dataframe.fillna(value=5.5, inplace=True)
dataframe
      
Out[ ]:
uno dos
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 5.5 7.0

Tip: Setting the variable inplace=True modifies the DataFrame being operated on, so there's no need to write dataframe = dataframe.fillna(value=5.5)

14. Time Serieslink image 126

Pandas offers the ability to work with time series. For example, we create a Series of 100 random data points every second starting from 01/01/2021

	
diccionario = {
"uno": pd.Series([1.0, 2.0, 3.0]),
"dos": pd.Series([4.0, 5.0, 6.0, 7.0])
}
dataframe = pd.DataFrame(diccionario)
dataframe
dataframe.dropna(how="any")
dataframe.dropna(axis=1, how='any')
pd.isna(dataframe)
dataframe.fillna(value=5.5, inplace=True)
dataframe
indices = pd.date_range("1/1/2021", periods=100, freq="S")
datos = np.random.randint(0, 500, len(indices))
serie_temporal = pd.Series(datos, index=indices)
serie_temporal
Copy
	
2021-01-01 00:00:00 241
2021-01-01 00:00:01 14
2021-01-01 00:00:02 190
2021-01-01 00:00:03 407
2021-01-01 00:00:04 94
...
2021-01-01 00:01:35 275
2021-01-01 00:01:36 56
2021-01-01 00:01:37 448
2021-01-01 00:01:38 151
2021-01-01 00:01:39 316
Freq: S, Length: 100, dtype: int64

This functionality of Pandas is very powerful, for example, we can have a dataset at certain hours in one time zone and change them to another time zone.

	
horas = pd.date_range("3/6/2021 00:00", periods=10, freq="H")
datos = np.random.randn(len(horas))
serie_horaria = pd.Series(datos, horas)
serie_horaria
Copy
	
2021-03-06 00:00:00 -0.853524
2021-03-06 01:00:00 -1.355372
2021-03-06 02:00:00 -1.267503
2021-03-06 03:00:00 -1.155787
2021-03-06 04:00:00 0.730935
2021-03-06 05:00:00 1.435957
2021-03-06 06:00:00 0.460912
2021-03-06 07:00:00 0.723451
2021-03-06 08:00:00 -0.853337
2021-03-06 09:00:00 0.456359
Freq: H, dtype: float64

We locate the data in a time zone

	
serie_horaria_utc = serie_horaria.tz_localize("UTC")
serie_horaria_utc
Copy
	
2021-03-06 00:00:00+00:00 -0.853524
2021-03-06 01:00:00+00:00 -1.355372
2021-03-06 02:00:00+00:00 -1.267503
2021-03-06 03:00:00+00:00 -1.155787
2021-03-06 04:00:00+00:00 0.730935
2021-03-06 05:00:00+00:00 1.435957
2021-03-06 06:00:00+00:00 0.460912
2021-03-06 07:00:00+00:00 0.723451
2021-03-06 08:00:00+00:00 -0.853337
2021-03-06 09:00:00+00:00 0.456359
Freq: H, dtype: float64

And now we can change them to another use

	
serie_horaria_US = serie_horaria_utc.tz_convert("US/Eastern")
serie_horaria_US
Copy
	
2021-03-05 19:00:00-05:00 -0.853524
2021-03-05 20:00:00-05:00 -1.355372
2021-03-05 21:00:00-05:00 -1.267503
2021-03-05 22:00:00-05:00 -1.155787
2021-03-05 23:00:00-05:00 0.730935
2021-03-06 00:00:00-05:00 1.435957
2021-03-06 01:00:00-05:00 0.460912
2021-03-06 02:00:00-05:00 0.723451
2021-03-06 03:00:00-05:00 -0.853337
2021-03-06 04:00:00-05:00 0.456359
Freq: H, dtype: float64

15. Categorical Datalink image 127

Pandas offers the possibility of adding categorical data in a DataFrame. Let's suppose the following DataFrame

dataframe = pd.DataFrame(
{"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
      
dataframe
      
Out[2]:
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 e

We can convert the data in the raw_grade column to categorical data using the astype() method

dataframe['grade'] = dataframe["raw_grade"].astype("category")
      
dataframe
      
Out[3]:
id raw_grade grade
0 1 a a
1 2 b b
2 3 b b
3 4 a a
4 5 a a
5 6 e e

The columns raw_grade and grade appear to be the same, but if we look at the DataFrame information we can see that this is not the case.

	
dataframe = pd.DataFrame(
{"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
dataframe
dataframe['grade'] = dataframe["raw_grade"].astype("category")
dataframe
dataframe.info()
Copy
	
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 6 non-null int64
1 raw_grade 6 non-null object
2 grade 6 non-null category
dtypes: category(1), int64(1), object(1)
memory usage: 334.0+ bytes

You can see that the grade column is of categorical type

We can see the categories of categorical data types using the cat.categories() method.

	
dataframe["grade"].cat.categories
Copy
	
Index(['a', 'b', 'e'], dtype='object')

We can also rename the categories using the same method, but by providing a list with the new categories.

dataframe["grade"].cat.categories = ["very good", "good", "very bad"]
dataframe
      
Out[7]:
id raw_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 e very bad

Pandas gives us the possibility to numerically encode categorical data using the get_dummies method

pd.get_dummies(dataframe["grade"])
      
Out[8]:
very good good very bad
0 1 0 0
1 0 1 0
2 0 1 0
3 1 0 0
4 1 0 0
5 0 0 1

16. Groupbylink image 128

We can group the dataframes by values from one of the columns. Let's reload the dataframe with the value of houses in California.

california_housing_train = pd.read_csv("https://raw.githubusercontent.com/maximofn/portafolio/main/posts/california_housing_train.csv")
california_housing_train.head()
      
Out[43]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0

Now we can group the data by one of the columns, for example, let's group the houses by the number of years and see how many houses there are of each age with count

california_housing_train.groupby("housing_median_age").count().head()
      
Out[8]:
longitude latitude total_rooms total_bedrooms population households median_income median_house_value
housing_median_age
1.0 2 2 2 2 2 2 2 2
2.0 49 49 49 49 49 49 49 49
3.0 46 46 46 46 46 46 46 46
4.0 161 161 161 161 161 161 161 161
5.0 199 199 199 199 199 199 199 199

As we see in all the columns, we get the same value, which is the number of houses with a certain age, but we can find out the average value of each column with mean

california_housing_train.groupby("housing_median_age").mean().head()
      
Out[9]:
longitude latitude total_rooms total_bedrooms population households median_income median_house_value
housing_median_age
1.0 -121.465000 37.940000 2158.000000 335.500000 637.000000 190.000000 4.756800 190250.000000
2.0 -119.035306 35.410816 5237.102041 871.448980 2005.224490 707.122449 5.074237 229438.836735
3.0 -118.798478 35.164783 6920.326087 1190.826087 2934.673913 1030.413043 5.572013 239450.043478
4.0 -118.805093 34.987764 6065.614907 1068.192547 2739.956522 964.291925 5.196055 230054.105590
5.0 -118.789497 35.095327 4926.261307 910.924623 2456.979899 826.768844 4.732460 211035.708543

We can obtain multiple measures for each age using the agg (aggregation) command, passing the measures we want with a list. For example, let's see the minimum, maximum, and mean of each column for each house age.

california_housing_train.groupby("housing_median_age").agg(['min', 'max', 'mean']).head()
      
Out[10]:
longitude latitude total_rooms total_bedrooms ... population households median_income median_house_value
min max mean min max mean min max mean min ... mean min max mean min max mean min max mean
housing_median_age
1.0 -122.00 -120.93 -121.465000 37.65 38.23 37.940000 2062.0 2254.0 2158.000000 328.0 ... 637.000000 112.0 268.0 190.000000 4.2500 5.2636 4.756800 189200.0 191300.0 190250.000000
2.0 -122.51 -115.80 -119.035306 33.16 40.58 35.410816 96.0 21897.0 5237.102041 18.0 ... 2005.224490 16.0 2873.0 707.122449 1.9667 10.1531 5.074237 47500.0 500001.0 229438.836735
3.0 -122.33 -115.60 -118.798478 32.87 38.77 35.164783 475.0 21060.0 6920.326087 115.0 ... 2934.673913 123.0 3112.0 1030.413043 2.1187 11.5199 5.572013 83200.0 500001.0 239450.043478
4.0 -122.72 -116.76 -118.805093 32.65 39.00 34.987764 2.0 37937.0 6065.614907 2.0 ... 2739.956522 2.0 5189.0 964.291925 0.5360 13.4883 5.196055 42500.0 500001.0 230054.105590
5.0 -122.55 -115.55 -118.789497 32.55 40.60 35.095327 111.0 25187.0 4926.261307 21.0 ... 2456.979899 20.0 3886.0 826.768844 0.7526 12.6320 4.732460 50000.0 500001.0 211035.708543

5 rows × 24 columns

We can specify which columns we want to perform certain calculations on by passing a dictionary, where the keys will be the columns on which we want to perform calculations and the values will be lists with the calculations.

california_housing_train.groupby("housing_median_age").agg({'total_rooms': ['min', 'max', 'mean'], 'total_bedrooms': ['min', 'max', 'mean', 'median']}).head()
      
Out[12]:
total_rooms total_bedrooms
min max mean min max mean median
housing_median_age
1.0 2062.0 2254.0 2158.000000 328.0 343.0 335.500000 335.5
2.0 96.0 21897.0 5237.102041 18.0 3513.0 871.448980 707.0
3.0 475.0 21060.0 6920.326087 115.0 3559.0 1190.826087 954.0
4.0 2.0 37937.0 6065.614907 2.0 5471.0 1068.192547 778.0
5.0 111.0 25187.0 4926.261307 21.0 4386.0 910.924623 715.0

We can group by more than one column, for this, the columns need to be passed in a list

california_housing_train.groupby(["housing_median_age", "total_bedrooms"]).mean()
      
Out[16]:
longitude latitude total_rooms population households median_income median_house_value
housing_median_age total_bedrooms
1.0 328.0 -120.93 37.65 2254.0 402.0 112.0 4.2500 189200.0
343.0 -122.00 38.23 2062.0 872.0 268.0 5.2636 191300.0
2.0 18.0 -115.80 33.26 96.0 30.0 16.0 5.3374 47500.0
35.0 -121.93 37.78 227.0 114.0 49.0 3.1591 434700.0
55.0 -117.27 33.93 337.0 115.0 49.0 3.1042 164800.0
... ... ... ... ... ... ... ... ...
52.0 1360.0 -118.35 34.06 3446.0 1768.0 1245.0 2.4722 500001.0
1535.0 -122.41 37.80 3260.0 3260.0 1457.0 0.9000 500001.0
1944.0 -118.25 34.05 2806.0 2232.0 1605.0 0.6775 350000.0
2509.0 -122.41 37.79 6016.0 3436.0 2119.0 2.5166 275000.0
2747.0 -122.41 37.79 5783.0 4518.0 2538.0 1.7240 225000.0

13394 rows × 7 columns

17. Graphicslink image 129

Pandas offers the possibility of representing the data of our DataFrames in charts to obtain a better representation of them. To do this, it makes use of the matplotlib library, which we will see in the next post.

17.1. Basic Graphlink image 130

To represent the data in a graph, the easiest way is to use the plot() method.

serie = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
serie = serie.cumsum()
      
serie.plot()
      
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc5666b9990>
image pandas 1

In the case of having a DataFrame, the plot() method will represent each of the columns of the DataFrame.

dataframe = pd.DataFrame(
np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"]
)
dataframe = dataframe.cumsum()
      
dataframe.plot()
      
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc5663ce610>
image pandas 2

17.2. Vertical Bar Chartlink image 131

There are more methods to create charts, such as the vertical bar chart using plot.bar()

dataframe = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])
dataframe.plot.bar()
      
image pandas 3

If we want to stack the bars, we indicate it with the variable stacked=True

dataframe.plot.bar(stacked=True)
      
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc56265c5d0>
image pandas 4

17.3. Horizontal Bar Chartlink image 132

To create a horizontal bar chart we use plot.barh()

dataframe.plot.barh()
      
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc56247fa10>
image pandas 5

If we want to stack the bars, we indicate it using the variable stacked=True

dataframe.plot.barh(stacked=True)
      
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc562d1d2d0>
image pandas 6

17.4. Histogramlink image 133

To create a histogram we use plot.hist()

dataframe = pd.DataFrame(
{
    "a": np.random.randn(1000) + 1,
    "b": np.random.randn(1000),
    "c": np.random.randn(1000) - 1,
}
)
      
dataframe.plot.hist(alpha=0.5)
      
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc5650711d0>
image pandas 7

If we want to stack the bars, we indicate it with the variable stacked=True.

dataframe.plot.hist(alpha=0.5, stacked=True)
      
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc5625779d0>
image pandas 8

If we want to add more columns, that is, if we want the histogram to be more informative or precise, we indicate it using the variable bins

dataframe.plot.hist(alpha=0.5, stacked=True, bins=20)
      
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc562324990>
image pandas 9

17.5. Candlestick Chartslink image 134

To create a candlestick chart we use plot.box()

dataframe = pd.DataFrame(np.random.rand(10, 5), columns=["A", "B", "C", "D", "E"])
      
dataframe.plot.box()
      
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc56201a410>
image pandas 10

17.6. Area Chartslink image 135

To create an area chart we use plot.area()

dataframe.plot.area()
      
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc561e9ca50>
image pandas 11

17.7. Scatter Plotlink image 136

To create a scatter plot we use plot.scatter(), where we need to specify the x and y variables of the plot

dataframe.plot.scatter(x='A', y='B')
      
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc561e2ff10>
image pandas 12

17.8. Hexbin Plotlink image 137

To create a hexagonal bin plot we use plot.hexbin(), where we need to specify the x and y variables of the diagram and the mesh size using gridsize.

dataframe = pd.DataFrame(np.random.randn(1000, 2), columns=["a", "b"])
dataframe["b"] = dataframe["b"] + np.arange(1000)
      
dataframe.plot.hexbin(x="a", y="b", gridsize=25)
      
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc561cdded0>
image pandas 13

Continue reading

DoLa – Decoding by Contrasting Layers Improves Factuality in Large Language Models

DoLa – Decoding by Contrasting Layers Improves Factuality in Large Language Models

Have you ever talked to an LLM and they answered you something that sounds like they've been drinking machine coffee all night long 😂 That's what we call a hallucination in the LLM world! But don't worry, because it's not that your language model is crazy (although it can sometimes seem that way 🤪). The truth is that LLMs can be a bit... creative when it comes to generating text. But thanks to DoLa, a method that uses contrast layers to improve the feasibility of LLMs, we can keep our language models from turning into science fiction writers 😂. In this post, I'll explain how DoLa works and show you a code example so you can better understand how to make your LLMs more reliable and less prone to making up stories. Let's save our LLMs from insanity and make them more useful! 🚀

Last posts -->

Have you seen these projects?

Subtify

Subtify Subtify

Subtitle generator for videos in the language you want. Also, it puts a different color subtitle to each person

View all projects -->

Do you want to apply AI in your project? Contact me!

Do you want to improve with these tips?

Last tips -->

Use this locally

Hugging Face spaces allow us to run models with very simple demos, but what if the demo breaks? Or if the user deletes it? That's why I've created docker containers with some interesting spaces, to be able to use them locally, whatever happens. In fact, if you click on any project view button, it may take you to a space that doesn't work.

View all containers -->

Do you want to apply AI in your project? Contact me!

Do you want to train your model with these datasets?

short-jokes-dataset

Dataset with jokes in English

opus100

Dataset with translations from English to Spanish

netflix_titles

Dataset with Netflix movies and series

View more datasets -->