Regular expressions

February 10, 2024

	
		import re

This notebook has been automatically translated to make it accessible to more people. Please let me know if you see any typos.

Methods

Findall

With the findall() method we can find all matches of a pattern in a string.

	
		import re
string = "Hola, soy un string"
print(re.findall("Hola, soy", string))

	
		['Hola, soy']

Search

If we want to find the position where a pattern is located, we can use the search() method. It returns a Match object for the first match it finds, or None if there is no match.

	
		print(re.search("soy", string))

	
		<re.Match object; span=(6, 9), match='soy'>

Match

We can also use the match() method, which looks for the pattern only at the beginning of the string.

	
		print(re.match("Hola", string))
print(re.match("soy", string))

	
		<re.Match object; span=(0, 4), match='Hola'>
None
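Because search() and match() return None when nothing is found, calling span() or group() directly on the result can fail with an AttributeError. The following is a minimal sketch of a defensive way to use them. The helper name `find_span` is hypothetical, and `re.fullmatch` (not covered in this post) is included only for contrast.

```python
import re

text = "Hola, soy un string"

def find_span(pattern, text):
    """Return the span of the first match anywhere in text, or None."""
    found = re.search(pattern, text)
    return found.span() if found else None

span_soy = find_span("soy", text)        # search() looks anywhere in the string
span_missing = find_span("adios", text)  # no match, so None instead of an error

# match() only succeeds at the start; fullmatch() needs the whole string to match
soy_at_start = re.match("soy", text) is not None
hola_at_start = re.match("Hola", text) is not None
hola_is_everything = re.fullmatch("Hola", text) is not None

print(span_soy, span_missing, soy_at_start, hola_at_start, hola_is_everything)
```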

Span

If we want to get the position of the match, we can use the span() method which returns a tuple with the start and end position of the match.

	
		print(re.match("Hola", string).span())

	
		(0, 4)

Group

Besides the position of the match, we can use the group() method to get the substring that matches the pattern.

	
		print(re.match("Hola", string).group())

	
		Hola

We could also use the start and end of the match to make a slice of the string.

	
		start, end = re.match("Hola", string).span()
print(string[start:end])

	
		Hola

Split

With the split() method we can split a string into a list of substrings using a pattern as a separator.

	
		split = re.split("soy", string)
print(split)

	
		['Hola, ', ' un string']

The sentence has been divided into two strings using "soy" as separator.
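The separator does not have to be a fixed word. As a small sketch (the character class `[,\s]+` uses syntax explained later in this post, and `maxsplit` is an optional argument of `re.split`), we can split on any run of commas and whitespace:

```python
import re

text = "Hola, soy un string"

# Split on one or more commas or whitespace characters
words = re.split(r"[,\s]+", text)

# maxsplit limits how many splits are made, the rest stays together
first_and_rest = re.split(r"[,\s]+", text, maxsplit=1)

print(words)
print(first_and_rest)
```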

Sub

With the sub() method we can replace all matches of a pattern with another substring.

	
		sub = re.sub("soy", "eres", string)
print(sub)

	
		Hola, eres un string

It has replaced all matches of "soy" with "eres".
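If we do not want to replace every match, `re.sub` accepts an optional `count` argument, and `re.subn` also reports how many replacements were made. A short sketch with an illustrative sentence that is not part of the original notebook:

```python
import re

text = "soy yo, soy un string"

all_replaced = re.sub("soy", "eres", text)          # every match is replaced
first_only = re.sub("soy", "eres", text, count=1)   # only the first match
replaced, how_many = re.subn("soy", "eres", text)   # result plus number of replacements

print(all_replaced)
print(first_only)
print(how_many)
```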

Patterns

The `.` character

With the `.` character we can match any character, so every character in our string will be found.

	
		string = "Hola, soy un string"
print(re.findall(".", string))

	
		['H', 'o', 'l', 'a', ',', ' ', 's', 'o', 'y', ' ', 'u', 'n', ' ', 's', 't', 'r', 'i', 'n', 'g']

If, for example, we want sequences of two characters, we search with two consecutive dots, `..`.

	
		string1 = "Hola, soy un string"
string2 = "Hola, soy un string2"
print(re.findall("..", string1))
print(re.findall("..", string2))

	
		['Ho', 'la', ', ', 'so', 'y ', 'un', ' s', 'tr', 'in']
['Ho', 'la', ', ', 'so', 'y ', 'un', ' s', 'tr', 'in', 'g2']

As we can see string1 has an odd number of characters, so the last g is not taken, but string2 has an even number of characters, so it takes all characters.

Let's look at this another way: let's replace each sequence of three characters with a $ symbol.

	
		print(string1)
print(re.sub("...", "$  ", string1))

	
		Hola, soy un string
$  $  $  $  $  $  g

I have added two spaces after each $ so that the change is easy to see. Note that the last character is not replaced, because it does not complete a group of three.
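One detail worth knowing is that, by default, `.` matches any character except a line break. A minimal sketch, using the standard `re.DOTALL` flag (not used elsewhere in this post) to include line breaks as well:

```python
import re

text = "ab\ncd"

# By default the dot skips the newline
without_flag = re.findall(".", text)

# With re.DOTALL the dot also matches the newline
with_flag = re.findall(".", text, flags=re.DOTALL)

print(without_flag)
print(with_flag)
```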

Predefined and constructed classes

Digit

To find digits we use `\d`.

	
		string = "Hola, soy un string con 123 digitos"
print(re.findall(r"\d", string))

	
		['1', '2', '3']

As before, if we want two consecutive digits, we write `\d` twice.

	
		print(re.findall(r"\d\d", string))

	
		['12']
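The backslash is also the escape character of ordinary Python strings, so patterns such as `\d` are usually written as raw strings (`r"..."`). The sketch below compares a plain `d`, which only matches the letter, with the two equivalent ways of writing the digit class:

```python
import re

text = "Hola, soy un string con 123 digitos"

letter_d = re.findall("d", text)        # a plain d matches only the letter d
digits_raw = re.findall(r"\d", text)    # raw string: the backslash reaches the regex engine
digits_escaped = re.findall("\\d", text)  # same pattern with a doubled backslash

print(letter_d)
print(digits_raw)
print(digits_escaped)
```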

Letter

To find word characters we use `\w`. "Word" means all letters from `a` to `z` and from `A` to `Z`, the digits from `0` to `9`, and `_`.

	
		string = "Hola, soy un_string con, 123 digitos"
print(re.findall(r"\w", string))

	
		['H', 'o', 'l', 'a', 's', 'o', 'y', 'u', 'n', '_', 's', 't', 'r', 'i', 'n', 'g', 'c', 'o', 'n', '1', '2', '3', 'd', 'i', 'g', 'i', 't', 'o', 's']

As we can see, it takes everything except the spaces and the comma.
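In Python 3, for `str` patterns, `\w` is Unicode-aware, so it also matches letters such as `ñ` or `ç`. A small sketch, with an illustrative string that is not from the original, showing the difference when the standard `re.ASCII` flag is used:

```python
import re

text = "año_1 ç"

# Default behaviour: accented and other Unicode letters count as word characters
unicode_word = re.findall(r"\w", text)

# With re.ASCII only a-z, A-Z, 0-9 and _ are matched
ascii_word = re.findall(r"\w", text, flags=re.ASCII)

print(unicode_word)
print(ascii_word)
```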

Spaces

To find whitespace we use `\s`.

	
		string = "Hola, soy un_string con, 123 digitos"
print(re.sub(r"\s", "*", string))

	
		Hola,*soy*un_string*con,*123*digitos

Regular expressions treat line breaks as whitespace, so `\s` matches them too.

	
		string = """Hola, soy un string 
con un salto de línea"""
print(re.sub(r"\s", "*", string))

	
		Hola,*soy*un*string**con*un*salto*de*línea

Ranges

To search within a range we use `[]`. For example, to find the digits from 4 to 8 we use

	
		string = "1234567890"
print(re.findall("[4-8]", string))

	
		['4', '5', '6', '7', '8']

We can combine several ranges inside the same brackets.

	
		string = "1234567890"
print(re.findall("[2-57-9]", string))

	
		['2', '3', '4', '5', '7', '8', '9']

If we also want to find a specific character, we add it inside the brackets, optionally escaped with `\` (inside brackets a `.` is already literal).

	
		string = "1234567890."
print(re.findall("[2-57-9.]", string))

	
		['2', '3', '4', '5', '7', '8', '9', '.']

Bracket `[` and bracket `]`

As we have seen, we use `[]` to find ranges, but what if we want to find a literal `[` or `]`? For that we have to escape them as `\[` and `\]`.

	
		string = "[1234567890]"
print(re.findall(r"\[", string))
print(re.findall(r"\]", string))

	
		['[']
[']']
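When the text to search for comes from elsewhere and may contain special characters, the standard `re.escape` function can add the backslashes for us. A minimal sketch of that idea, together with the escaped brackets used inside a class:

```python
import re

text = "[1234567890]"

# re.escape adds the backslash needed to treat the bracket literally
escaped = re.escape("[")

# Either bracket, built from escaped pieces joined with | (alternation)
brackets = re.findall(re.escape("[") + "|" + re.escape("]"), text)

# The same search written as a class that contains both escaped brackets
inside_class = re.findall(r"[\[\]]", text)

print(escaped, brackets, inside_class)
```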

Delimiters `+`, `*`, `?`

Star `*` (zero or more)

The `*` delimiter matches zero or more repetitions of the preceding element, instead of exactly one as before.

	
		string = "Hola, soy un string con 12 123 digitos"
print(re.findall(r"\d", string))
print(re.findall(r"\d*", string))

	
		['1', '2', '1', '2', '3']
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '12', '', '123', '', '', '', '', '', '', '', '', '']

As you can see, with `*` every position produces a match: an empty string wherever there are zero digits, and the whole run wherever there are digits.

Plus `+` (one or more)

With the `+` delimiter we search for one or more repetitions.

	
		string = "Hola, soy un string con 1 12 123 digitos"
print(re.findall(r"\d+", string))

	
		['1', '12', '123']

Optional `?` (zero or one)

The `?` delimiter searches for zero or one repetition.

	
		string = "Hola, soy un string con 1 12 123 digitos"
print(re.sub(r"\d?", "-", string))

	
		-H-o-l-a-,- -s-o-y- -u-n- -s-t-r-i-n-g- -c-o-n- -- --- ---- -d-i-g-i-t-o-s-
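The substitution above is hard to read because a pattern that is entirely optional also matches the empty string at every position. A smaller sketch, with illustrative strings that are not from the original, may make the zero-or-one behaviour clearer when only part of the pattern is optional:

```python
import re

# The u is optional, so both spellings match, but a doubled u does not
spellings = re.findall("colou?r", "color colour colouur")

# The same idea is used later in this post for http and https
schemes = re.findall("https?", "http https")

print(spellings)
print(schemes)
```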

Counters

When we want to find something that appears a given number of times we use counters, written with braces `{}`. For example, to find sequences of exactly two digits

	
		string = "Hola, soy un string con 1 12 123 1234 1234digitos"
print(re.findall(r"\d{2}", string))

	
		['12', '12', '12', '34', '12', '34']

As you can see, it has found the sequences 12 and 34.

Counters accept a lower and an upper bound, `{min,max}`.

	
		string = "Hola, soy un string con 1 12 123 1234 1234digitos"
print(re.findall(r"\d{2,5}", string))

	
		['12', '123', '1234', '1234']

If the upper bound is not given, we want at least the indicated number of elements, with no upper limit.

	
		string = "Hola, soy un string con 1 12 123 1234 12345464168415641646451563416 digitos"
print(re.findall(r"\d{2,}", string))

	
		['12', '123', '1234', '12345464168415641646451563416']

With both bounds set, each match takes as many digits as it can up to the upper bound, so a long run of digits is cut into pieces. If we want a fixed number with this notation, we have to put that number in both bounds, for example `{2,2}`.

	
		string = "Hola, soy un string con 1 12 123 1234 12345464168415641646451563416 digitos"
print(re.findall(r"\d{2,3}", string))

	
		['12', '123', '123', '123', '454', '641', '684', '156', '416', '464', '515', '634', '16']
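As a quick check of the fixed-count notation, the sketch below compares `{2}` with `{2,2}` on one of the earlier strings. Both forms ask for exactly two digits, so they should return the same list:

```python
import re

string = "Hola, soy un string con 1 12 123 1234 1234digitos"

single_number = re.findall(r"\d{2}", string)     # exactly two digits
both_bounds = re.findall(r"\d{2,2}", string)     # lower and upper bound are the same

print(single_number)
print(both_bounds)
```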

Classes

You can create classes using `[]` brackets. We saw that they are used for ranges but, once you define what goes inside, you can treat the brackets as a class and apply delimiters and counters to it.

For example, suppose we have a telephone number, which can be given in one of the following ways

666-66-66-66
666-666-666
666 666 666
666 66 66 66
666666666

There are many ways to write a number, so let's see how to create a class that defines the separator.

First we look for all digit sequences that have at least two digits.

	
		string1 = "666-66-66-66"
string2 = "666-666-666"
string3 = "666 66 66 66"
string4 = "666 666 666"
string5 = "666666666"
print(f"string1: {string1} -->", re.findall(r"\d{2,}", string1))
print(f"string2: {string2} -->", re.findall(r"\d{2,}", string2))
print(f"string3: {string3} -->", re.findall(r"\d{2,}", string3))
print(f"string4: {string4} -->", re.findall(r"\d{2,}", string4))
print(f"string5: {string5} -->", re.findall(r"\d{2,}", string5))

	
		string1: 666-66-66-66 --> ['666', '66', '66', '66']
string2: 666-666-666 --> ['666', '666', '666']
string3: 666 66 66 66 --> ['666', '66', '66', '66']
string4: 666 666 666 --> ['666', '666', '666']
string5: 666666666 --> ['666666666']

Now we look for the separator, defined as a `-` or a whitespace character.

	
		string1 = "666-66-66-66"
string2 = "666-666-666"
string3 = "666 66 66 66"
string4 = "666 666 666"
string5 = "666666666"
print(f"string1: {string1} -->", re.findall(r"[-\s]", string1))
print(f"string2: {string2} -->", re.findall(r"[-\s]", string2))
print(f"string3: {string3} -->", re.findall(r"[-\s]", string3))
print(f"string4: {string4} -->", re.findall(r"[-\s]", string4))
print(f"string5: {string5} -->", re.findall(r"[-\s]", string5))

	
		string1: 666-66-66-66 --> ['-', '-', '-']
string2: 666-666-666 --> ['-', '-']
string3: 666 66 66 66 --> [' ', ' ', ' ']
string4: 666 666 666 --> [' ', ' ']
string5: 666666666 --> []

As you can see, nothing is found in the last string, so we add a `?` to match zero or one separator.

	
		string1 = "666-66-66-66"
string2 = "666-666-666"
string3 = "666 66 66 66"
string4 = "666 666 666"
string5 = "666666666"
print(f"string1: {string1} -->", re.findall(r"[-\s]?", string1))
print(f"string2: {string2} -->", re.findall(r"[-\s]?", string2))
print(f"string3: {string3} -->", re.findall(r"[-\s]?", string3))
print(f"string4: {string4} -->", re.findall(r"[-\s]?", string4))
print(f"string5: {string5} -->", re.findall(r"[-\s]?", string5))

	
		string1: 666-66-66-66 --> ['', '', '', '-', '', '', '-', '', '', '-', '', '', '']
string2: 666-666-666 --> ['', '', '', '-', '', '', '', '-', '', '', '', '']
string3: 666 66 66 66 --> ['', '', '', ' ', '', '', ' ', '', '', ' ', '', '', '']
string4: 666 666 666 --> ['', '', '', ' ', '', '', '', ' ', '', '', '', '']
string5: 666666666 --> ['', '', '', '', '', '', '', '', '', '']

Now we put everything together.

	
		string1 = "666-66-66-66"
string2 = "666-666-666"
string3 = "666 66 66 66"
string4 = "666 666 666"
string5 = "666666666"
print(f"string1: {string1} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?", string1))
print(f"string2: {string2} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?", string2))
print(f"string3: {string3} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?", string3))
print(f"string4: {string4} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?", string4))
print(f"string5: {string5} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?", string5))

	
		string1: 666-66-66-66 --> ['666-66-66-66']
string2: 666-666-666 --> []
string3: 666 66 66 66 --> ['666 66 66 66']
string4: 666 666 666 --> []
string5: 666666666 --> ['666666666']

As we can see, nothing is found in string2 and string4. We have written `\d{2,}[-\s]?` four times, i.e. we want a sequence of at least two digits followed by an optional hyphen or whitespace separator, repeated four times. The last `[-\s]?` is not needed, since a number never ends with a space or a hyphen.

	
		string1 = "666-66-66-66"
string2 = "666-666-666"
string3 = "666 66 66 66"
string4 = "666 666 666"
string5 = "666666666"
print(f"string1: {string1} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}", string1))
print(f"string2: {string2} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}", string2))
print(f"string3: {string3} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}", string3))
print(f"string4: {string4} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}", string4))
print(f"string5: {string5} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}", string5))

	
		string1: 666-66-66-66 --> ['666-66-66-66']
string2: 666-666-666 --> []
string3: 666 66 66 66 --> ['666 66 66 66']
string4: 666 666 666 --> []
string5: 666666666 --> ['666666666']

Still nothing is found for string2 and string4. This is because the pattern ends with a fourth `\d{2,}`, i.e. it expects a fourth group of at least two digits, and string2 and string4 only have three groups. So we write the following

	
		string1 = "666-66-66-66"
string2 = "666-666-666"
string3 = "666 66 66 66"
string4 = "666 666 666"
string5 = "666666666"
print(f"string1: {string1} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d*", string1))
print(f"string2: {string2} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d*", string2))
print(f"string3: {string3} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d*", string3))
print(f"string4: {string4} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d*", string4))
print(f"string5: {string5} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d*", string5))

	
		string1: 666-66-66-66 --> ['666-66-66-66']
string2: 666-666-666 --> ['666-666-666']
string3: 666 66 66 66 --> ['666 66 66 66']
string4: 666 666 666 --> ['666 666 666']
string5: 666666666 --> ['666666666']
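Since the same pattern is applied to several strings, it can be compiled once and reused. The sketch below assumes we want to validate whole strings, so it uses `fullmatch` instead of `findall`; the last candidate, with underscores, is an invented counterexample.

```python
import re

# Compile the final pattern once and reuse it for every candidate
phone_pattern = re.compile(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d*")

candidates = [
    "666-66-66-66",
    "666-666-666",
    "666 66 66 66",
    "666 666 666",
    "666666666",
    "666_666_666",  # hypothetical invalid format, underscore is not a separator
]

# Keep only the candidates that the pattern matches from start to end
valid = [c for c in candidates if phone_pattern.fullmatch(c)]
print(valid)
```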

The `?` delimiter as a lazy quantifier

The above example can also be filtered with `\d+?[- ]`.

	
		string1 = "666-66-66-66"
string2 = "666-666-666"
string3 = "666 66 66 66"
string4 = "666 666 666"
string5 = "666666666"
print(f"string1: {string1} -->", re.findall(r"\d+?[- ]", string1))
print(f"string2: {string2} -->", re.findall(r"\d+?[- ]", string2))
print(f"string3: {string3} -->", re.findall(r"\d+?[- ]", string3))
print(f"string4: {string4} -->", re.findall(r"\d+?[- ]", string4))
print(f"string5: {string5} -->", re.findall(r"\d+?[- ]", string5))

	
		string1: 666-66-66-66 --> ['666-', '66-', '66-']
string2: 666-666-666 --> ['666-', '666-']
string3: 666 66 66 66 --> ['666 ', '66 ', '66 ']
string4: 666 666 666 --> ['666 ', '666 ']
string5: 666666666 --> []

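Without the `?` we would have `\d+[- ]`, which means a sequence of one or more digits followed by a space or a hyphen. Placed after another delimiter, `?` makes it lazy: it takes as few characters as possible instead of as many as possible. Here both versions find the same matches, because the digits always have to reach the separator.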
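The difference between the greedy and the lazy form is easier to see when nothing forces the match to extend. A minimal sketch with illustrative strings that are not from the original:

```python
import re

# Greedy: each match takes as many digits as possible
greedy_digits = re.findall(r"\d+", "666-66")

# Lazy: each match stops as soon as the pattern is satisfied, after one digit
lazy_digits = re.findall(r"\d+?", "666-66")

tags = "<b>uno</b> <b>dos</b>"
greedy_tags = re.findall(r"<.+>", tags)   # runs from the first < to the last >
lazy_tags = re.findall(r"<.+?>", tags)    # stops at the first > it can reach

print(greedy_digits, lazy_digits)
print(greedy_tags, lazy_tags)
```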

Negation

We have seen that `\d` finds digits, so `\D` finds everything that is not a digit.

	
		string1 = "E3s4t6e e1s2t3r5i6n7g8 t9i0e4n2e1 d4i5g7i9t0o5s2"
print(re.findall(r"\D", string1))

	
		['E', 's', 't', 'e', ' ', 'e', 's', 't', 'r', 'i', 'n', 'g', ' ', 't', 'i', 'e', 'n', 'e', ' ', 'd', 'i', 'g', 'i', 't', 'o', 's']

The same happens with word characters: `\W` finds everything that is not a word character.

	
		string1 = "Letras ab27_ no letras ,.:;´ç"
print(re.findall(r"\W", string1))

	
		[' ', ' ', ' ', ' ', ',', '.', ':', ';', '´']

With `\S` we find everything that is not whitespace.

	
		print(re.findall(r"\S", string1))

	
		['L', 'e', 't', 'r', 'a', 's', 'a', 'b', '2', '7', '_', 'n', 'o', 'l', 'e', 't', 'r', 'a', 's', ',', '.', ':', ';', '´', 'ç']

For a class, we can negate its contents by starting it with `^`.

	
		string1 = "1234567890"
print(re.findall("[^5-9]", string1))

	
		['1', '2', '3', '4', '0']

Going back to the example of the phone numbers from before, we can filter them by the following

	
		string1 = "666-66-66-66"
string2 = "666-666-666"
string3 = "666 66 66 66"
string4 = "666 666 666"
string5 = "666666666"
print(f"string1: {string1} -->", re.findall(r"\d{2,}\D?\d{2,}\D?\d{2,}\D?\d*", string1))
print(f"string2: {string2} -->", re.findall(r"\d{2,}\D?\d{2,}\D?\d{2,}\D?\d*", string2))
print(f"string3: {string3} -->", re.findall(r"\d{2,}\D?\d{2,}\D?\d{2,}\D?\d*", string3))
print(f"string4: {string4} -->", re.findall(r"\d{2,}\D?\d{2,}\D?\d{2,}\D?\d*", string4))
print(f"string5: {string5} -->", re.findall(r"\d{2,}\D?\d{2,}\D?\d{2,}\D?\d*", string5))

	
		string1: 666-66-66-66 --> ['666-66-66-66']
string2: 666-666-666 --> ['666-666-666']
string3: 666 66 66 66 --> ['666 66 66 66']
string4: 666 666 666 --> ['666 666 666']
string5: 666666666 --> ['666666666']

What we are doing is asking for sequences of at least two digits followed by one or no non-digits.

The beginning `^` and end of line `$`.

With ^ we can search for the beginning of a line, for example, if we want to find a digit only at the beginning of a line

	
		string1 = "linea 1"
string2 = "2ª linea"
print(re.findall(r"^\d", string1))
print(re.findall(r"^\d", string2))

	
		[]
['2']

As you can see, only string2 has a digit at the beginning of the line.

Likewise, the end of a line can be found with $. If we want to find a digit only at the end of a line

	
		string1 = "linea 1"
string2 = "2ª linea"
print(re.findall(r"\d$", string1))
print(re.findall(r"\d$", string2))

	
		['1']
[]

This only occurs in string1.
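By default `^` and `$` refer to the beginning and end of the whole string. To apply them to every line of a multi-line string, Python offers the standard `re.MULTILINE` flag. A short sketch with an invented three-line text:

```python
import re

text = "linea 1\n2ª linea\n3 lineas"

# Without the flag, ^ only matches at the very start of the string
start_of_string = re.findall(r"^\d", text)

# With re.MULTILINE, ^ and $ also match at the start and end of each line
start_of_lines = re.findall(r"^\d", text, flags=re.MULTILINE)
end_of_lines = re.findall(r"\d$", text, flags=re.MULTILINE)

print(start_of_string, start_of_lines, end_of_lines)
```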

Practical examples

Logs

Suppose that in the following log we want to find only the WARN entries.

	
		log = """[LOG ENTRY] [ERROR] The system is unstable
[LOG ENTRY] [WARN] The system may be down
[LOG ENTRY] [WARN] Microsoft just bought Github
[LOG DATA] [LOG] Everything is OK
[LOG ENTRY] [LOG] [user:@beco] Logged in
[LOG ENTRY] [LOG] [user:@beco] Clicked here
[LOG DATA] [LOG] [user:@celismx] Did something
[LOG ENTRY] [LOG] [user:@beco] Rated the app
[LOG ENTRY] [LOG] [user:@beco] Logged out
[LOG LINE] [LOG] [user:@celismx] Logged in"""
result = re.findall(r"\[LOG.*\[WARN\].*", log)
result

	
		['[LOG ENTRY] [WARN] The system may be down',
 '[LOG ENTRY] [WARN] Microsoft just bought Github']
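If we also want the pieces of each entry, we can add groups with parentheses (explained in the "Search and replace" section). The following is a sketch under the assumption that every line has the form `[LOG word] [LEVEL] message`; with groups, `findall` returns one tuple per line.

```python
import re

log = """[LOG ENTRY] [ERROR] The system is unstable
[LOG ENTRY] [WARN] The system may be down
[LOG DATA] [LOG] Everything is OK"""

# Group 1 captures the level, group 2 the rest of the line
entries = re.findall(r"^\[LOG \w+\] \[(\w+)\] (.*)$", log, flags=re.MULTILINE)

# Keep only the messages whose level is WARN
warnings = [message for level, message in entries if level == "WARN"]

print(entries)
print(warnings)
```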

Phone number

Within a number we can find letters such as e for extension, # also for extension, or p for a pause when a computer dials. We can also find a + to indicate a country prefix, and separators such as spaces, -, ., etc.

	
		tel = """555658
56-58-11
56.58.11
56.78-98
65 09 87
76y87r98
45y78-56
78.87 65
78 54-56
+521565811
58-11-11#246
55256048p123
55256048e123"""
result = re.findall(r"\+?\d{2,3}[^\da-zA-Z\n]?\d{2,3}[^\da-zA-Z\n]?\d{2,3}[#pe]?\d*", tel)
result

	
		['555658',
 '56-58-11',
 '56.58.11',
 '56.78-98',
 '65 09 87',
 '78.87 65',
 '78 54-56',
 '+521565811',
 '58-11-11#246',
 '55256048p123',
 '55256048e123']

Here is an explanation

`\+?`: It may start with the character `+` (zero or one).
`\d{2,3}`: Followed by 2 to 3 digits.
`[^\da-zA-Z\n]?`: Next there can be zero or one character that is neither a digit, nor a letter from a to z, nor a letter from A to Z, nor a line break.
`\d{2,3}`: Followed by 2 to 3 digits.
`[^\da-zA-Z\n]?`: Next there can be zero or one character that is neither a digit, nor a letter from a to z, nor a letter from A to Z, nor a line break.
`\d{2,3}`: Followed by 2 to 3 digits.
`[#pe]?`: Then there can be zero or one character that is `#`, `p`, or `e`.
`\d*`: Lastly, zero or more digits.

URLs

	
		urls = """url: https://www.instagram.com/p/BXB4zsUlW5Z/?taken-by=beco.mx
url: http://instagram.com/p/blablablah
url: http://itam.mx/test
http://instagram.com/p/blablablah
https://www.vanguarsoft.com.ve
http://platzi.com
https://traetelo.net
https://traetelo.net/images archivo.jsp
url: https://subdominio.traetelo.net
url: https://www.instagram.com/p/BXB4zsUlW5Z/?taken-by=beco.mx
url: http://instagram.com/p/blablablah
url: http://itam.mx/test
http://instagram.com/p/blablablah
https://www.google.com.co/
https://sub.dominio.de.alguien.com/archivo.html
https://en.wikipedia.org/wiki/.org
https://cdn-microsoft.org/image/seixo2t9sjl_22.jpg
https://hola.pizza
https://platzi.com/clases/1301-expresiones-regulares/11860-urls9102/ clase
https://api.giphy.com/v1/gifs/search?q=Rick and Morty&limit=10&api_key=DG3hItPp5HIRNC0nit3AOR7eQZAe
http://localhost:3000/something?color1=red&color2=blue
http://localhost:3000/display/post?size=small
 http://localhost:3000/?name=satyam
 http://localhost:3000/scanned?orderid=234
 http://localhost:3000/getUsers?userId=12354411&name=Billy
 http://localhost:3000/getUsers?userId=12354411
http://localhost:3000/search?city=Barcelona
www.sitiodeejemplo.net/pagina.php?nombredevalor1=valor1&nombredevalor2=valor2"""
result = re.findall(r"https?://[\w\-.]+\.\w{2,6}/?\S*", urls)
result

	
		['https://www.instagram.com/p/BXB4zsUlW5Z/?taken-by=beco.mx',
 'http://instagram.com/p/blablablah',
 'http://itam.mx/test',
 'http://instagram.com/p/blablablah',
 'https://www.vanguarsoft.com.ve',
 'http://platzi.com',
 'https://traetelo.net',
 'https://traetelo.net/images',
 'https://subdominio.traetelo.net',
 'https://www.instagram.com/p/BXB4zsUlW5Z/?taken-by=beco.mx',
 'http://instagram.com/p/blablablah',
 'http://itam.mx/test',
 'http://instagram.com/p/blablablah',
 'https://www.google.com.co/',
 'https://sub.dominio.de.alguien.com/archivo.html',
 'https://en.wikipedia.org/wiki/.org',
 'https://cdn-microsoft.org/image/seixo2t9sjl_22.jpg',
 'https://hola.pizza',
 'https://platzi.com/clases/1301-expresiones-regulares/11860-urls9102/',
 'https://api.giphy.com/v1/gifs/search?q=Rick']

Here is an explanation

`http`: We want it to start with http.
`s?`: Then there may or may not be an s.
`://`: Followed by `://`.
`[\w\-.]+`: Followed by one or more word characters, hyphens or periods.
`\.`: Next, a period.
`\w{2,6}`: Between 2 and 6 letters for the TLD.
`/?`: Followed by zero or one `/`.
`\S*`: Zero or more characters that are not whitespace.

Mails

	
		mails = """esto.es_un.mail@mail.com
esto.es_un.mail+complejo@mail.com
dominio.com
rodrigo.jimenez@yahoo.com.mx
ruben@starbucks.com
esto_no$es_email@dominio.com
no_se_de_internet3@hotmail.com"""
result = re.findall(r"[\w\._]{5,30}\+?[\w\._]{0,10}@[\w\.-]{2,}\.\w{2,6}", mails)
result

	
		['esto.es_un.mail@mail.com',
 'esto.es_un.mail+complejo@mail.com',
 'rodrigo.jimenez@yahoo.com.mx',
 'ruben@starbucks.com',
 'es_email@dominio.com',
 'no_se_de_internet3@hotmail.com']

Here is an explanation

`[\w\._]{5,30}`: We want it to start with between 5 and 30 (the minimum and maximum that gmail supports) letters, dots or underscores.
`\+?`: Followed by zero or one `+`.
`[\w\._]{0,10}`: Then between 0 and 10 letters, dots or underscores.
`@`: Then the `@`.
`[\w\.-]{2,}`: Two or more letters, dots or hyphens (the domain).
`\.`: Followed by a `.`.
`\w{2,6}`: And finally between 2 and 6 letters for the TLD.

Locations

There are two possible ways to give locations, so we analyze both of them

	
		loc = """-99.205646,19.429707,2275.10
-99.205581, 19.429652,2275.10
-99.204654,19.428952,2275.58"""
result = re.findall(r"-?\d{1,3}\.\d{1,6},\s?-?\d{1,3}\.\d{1,6},.*", loc)
result

	
		['-99.205646,19.429707,2275.10',
 '-99.205581, 19.429652,2275.10',
 '-99.204654,19.428952,2275.58']

Here is an explanation

`-?`: We want it to start with zero or one minus sign.
`\d{1,3}`: Followed by between one and three digits.
`\.`: Next, a period.
`\d{1,6}`: Then one to six digits.
`,`: Then a comma.
`\s?`: Then zero or one space.
`-?`: Zero or one minus sign.
`\d{1,3}`: Then between one and three digits.
`\.`: Next, a period.
`\d{1,6}`: Followed by between one and six digits.
`,`: Then a comma.
`.*`: Lastly, zero or more characters of any type.

	
		loc = """-99 12' 34.08"W, 19 34' 56.98"N
-34 54' 32.00"E, -3 21' 67.00"S"""
result = re.findall(r"-?\d{1,3}\s\d{1,2}'\s\d{1,2}\.\d{2,2}\"[WE],\s?-?\d{1,3}\s\d{1,2}'\s\d{1,2}\.\d{2,2}\"[SN]", loc)
result

	
		['-99 12\' 34.08"W, 19 34\' 56.98"N', '-34 54\' 32.00"E, -3 21\' 67.00"S']

	
		print(result[0])
print(result[1])

	
		-99 12' 34.08"W, 19 34' 56.98"N
-34 54' 32.00"E, -3 21' 67.00"S

Here is an explanation

`-?`: We want it to start with zero or one minus sign.
`\d{1,3}`: Followed by between one and three digits.
`\s`: Then a space.
`\d{1,2}`: Then one or two digits.
`'`: Then a `'`.
`\s`: Followed by a space.
`\d{1,2}`: Then one or two digits.
`\.`: Then a period.
`\d{2,2}`: Followed by two digits.
`"`: Then a `"`.
`[WE]`: Then the letter W or the letter E.
`,`: Then a comma.
`\s?`: Followed by zero or one space.
`-?`: Then zero or one minus sign.
`\d{1,3}`: Then between one and three digits.
`\s`: Followed by a space.
`\d{1,2}`: Then one or two digits.
`'`: Then a `'`.
`\s`: Then a space.
`\d{1,2}`: Next, one or two digits.
`\.`: Followed by a period.
`\d{2,2}`: Then two digits.
`"`: Followed by a `"`.
`[SN]`: And finally the letter S or the letter N.

Names

	
		nombres = """Camilo Sarmiento Gálvez
Alejandro Pliego Abasto
Milagros Reyes Japón
Samuel París Arrabal
Juan Pablo Tafalla
Axel Gálvez Velázquez
Óscar Montreal Aparicio
Jacobo Pozo Tassis
Guillermo Ordóñez Espiga
Eduardo Pousa Curbelo
Ivanna Bienvenida Kevin
Ada Tasis López
Luciana Sáenz García
Florencia Sainz Márquz
Catarina Cazalla Lombarda
Paloma Gallo Perro
Margarita Quesada Florez
Vicente Fox Quesada
Iris Graciani
Asunción Carballar
Constanza Muñoz
Manuel Andres García Márquez"""
result = re.findall(r"[A-ZÁÉÍÓÚ][a-záéíóú]+\s[A-ZÁÉÍÓÚ][a-záéíóú]+\s[A-ZÁÉÍÓÚ][a-záéíóú]+", nombres)
result

	
		['Camilo Sarmiento Gálvez',
 'Alejandro Pliego Abasto',
 'Milagros Reyes Japón',
 'Samuel París Arrabal',
 'Juan Pablo Tafalla',
 'Axel Gálvez Velázquez',
 'Óscar Montreal Aparicio',
 'Jacobo Pozo Tassis',
 'Espiga\nEduardo Pousa',
 'Curbelo\nIvanna Bienvenida',
 'Kevin\nAda Tasis',
 'López\nLuciana Sáenz',
 'García\nFlorencia Sainz',
 'Márquz\nCatarina Cazalla',
 'Lombarda\nPaloma Gallo',
 'Perro\nMargarita Quesada',
 'Florez\nVicente Fox',
 'Quesada\nIris Graciani',
 'Asunción Carballar\nConstanza',
 'Manuel Andres García']

Here is an explanation

`[A-ZÁÉÍÓÚ]`: We want it to start with a capital letter, including accented ones.
`[a-záéíóú]+`: Followed by one or more lowercase letters, including accented ones.
`\s`: Followed by a space.
`[A-ZÁÉÍÓÚ]`: Followed by a capital letter, including accented ones.
`[a-záéíóú]+`: Followed by one or more lowercase letters, including accented ones.
`\s`: Followed by a space.
`[A-ZÁÉÍÓÚ]`: Followed by a capital letter, including accented ones.
`[a-záéíóú]+`: Followed by one or more lowercase letters, including accented ones.
Search and replace

We are going to download a file with a lot of historical films.

	
		# download file from url
import urllib.request
url = "https://static.platzi.com/media/tmp/class-files/github/moviedemo/moviedemo-master/movies.dat"
urllib.request.urlretrieve(url, "movies.dat")

	
		---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
Cell In[43], line 4
      2 import urllib.request
      3 url = "https://static.platzi.com/media/tmp/class-files/github/moviedemo/moviedemo-master/movies.dat"
----> 4 urllib.request.urlretrieve(url, "movies.dat")
(...)
HTTPError: HTTP Error 403: Forbidden

Let's print the first lines of the file to analyze it.

	
		file = open("movies.dat", "r")
for i, line in enumerate(file):
    print(line, end="")
    if i == 10:
        break
file.close()

	
		1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy
2::Jumanji (1995)::Adventure|Children|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama|Romance
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller
11::American President, The (1995)::Comedy|Drama|Romance

As you can see, we have an ID, followed by ::, then the name of the movie, in parentheses the year, followed by :: and then genres separated by |.

We can clean the file very easily with regular expressions, the compile and match functions, and groupings with parentheses. With groupings we select which parts of the text we want to keep and then work with them as we want. Let's see it with an example.

	
		pattern = re.compile(r"^\d+::([\w\s:,\(\)\.\-'&¡!/¿?ÁÉÍÓÚáéíóú+*\$#°\'\"\[\]@·]+)\s\((\d{4,4})\)::(.*)$")

file = open("movies.dat", "r")
file_filtered = open("movies.csv", "w")
file_filtered.write("title,year,genders\n")
sep = ";;"
for line in file:
    result = re.match(pattern, line)
    if result:
        file_filtered.write(f"{result.group(1)}{sep}{result.group(2)}{sep}{result.group(3)}\n")
    else:
        print(line, end="")
file.close()
file_filtered.close()

Let's see what we have done. First we have defined a pattern with the following:

`^`: We want it to start at the beginning of the line.
`\d+`: Next, one or more digits.
`::`: Followed by `::`.
`([\w\s:,\(\)\.\-'&¡!/¿?ÁÉÍÓÚáéíóú+*\$#°\'\"\[\]@·]+)`: This is the first grouping. We look for any word character, whitespace or character listed in the square brackets, appearing one or more times.
`\s`: Next, a space.
`\(`: The opening of a parenthesis.
`(\d{4,4})`: Here is the second grouping, we are looking for four digits.
`\)`: Then the closing of a parenthesis.
`::`: Next, `::`.
`(.*)`: The third grouping, any character appearing zero or more times.
`$`: Lastly, the end of the line.

Inside the for loop we check line by line whether the pattern matches, and if it does we write the three groups to the csv separated by sep, which in our case is ;;. We use this separator because some movie titles contain commas.
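The same idea can be tried without the file, on a few sample lines copied from the listing above. This is a simplified sketch: the title class is reduced to `.+`, and the group names `title`, `year` and `genres` are illustrative choices. It also shows `sub` doing the search and replace in one step with references to the groups.

```python
import re

# Named groups make the three captured parts easier to refer to
movie_pattern = re.compile(r"^\d+::(?P<title>.+)\s\((?P<year>\d{4})\)::(?P<genres>.*)$")

sample_lines = [
    "1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy",
    "11::American President, The (1995)::Comedy|Drama|Romance",
    "this line does not follow the format",
]

# Keep a dictionary per line that matches the pattern
parsed = []
for line in sample_lines:
    found = movie_pattern.match(line)
    if found:
        parsed.append(found.groupdict())

# Search and replace: rewrite a matching line using the captured groups
converted = movie_pattern.sub(r"\g<title>;;\g<year>;;\g<genres>", sample_lines[0])

print(parsed)
print(converted)
```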

We read the csv with Pandas.

	
		import pandas as pd
df = pd.read_csv("movies.csv", sep=";;", engine="python")
df.head()

	
		title,year,genders
Toy Story	1995	Adventure|Animation|Children|Comedy|Fantasy
Jumanji	1995	Adventure|Children|Fantasy
Grumpier Old Men	1995	Comedy|Romance
Waiting to Exhale	1995	Comedy|Drama|Romance
Father of the Bride Part II	1995	Comedy

Cheatsheet

Here you have a cheatsheet with a lot of patterns

davechild_regular-expressions
