Regular expressions

	
import re

This notebook has been automatically translated to make it accessible to more people, please let me know if you see any typos.

Methods

Findall

With the findall() method we can find all matches of a pattern in a string. The matches are returned as a list.

	
import re
string = "Hola, soy un string"
print(re.findall("Hola, soy", string))
	
['Hola, soy']

Search

If we want to know where a pattern is located, we can use the search() method. It scans the string and returns a Match object for the first match, or None if there is no match.

	
print(re.search("soy", string))
	
<re.Match object; span=(6, 9), match='soy'>
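Because search() returns None when nothing matches, calling a method directly on the result can raise an AttributeError. A minimal sketch of a guarded lookup follows; the helper name find_position is hypothetical and only serves the illustration.

```python
import re

def find_position(pattern, text):
    """Return the (start, end) span of the first match, or None if there is no match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.span()

string = "Hola, soy un string"
print(find_position("soy", string))    # (6, 9)
print(find_position("adiós", string))  # None
```

Checking for None first is the usual way to avoid errors when the pattern may be absent.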

Match

We can also use the match() method that looks for the pattern at the beginning of the string.

	
print(re.match("Hola", string))
print(re.match("soy", string))
	
<re.Match object; span=(0, 4), match='Hola'>
None

Span

If we want to get the position of the match, we can use the span() method which returns a tuple with the start and end position of the match.

	
print(re.match("Hola", string).span())
	
(0, 4)

Group

We can use the group() method of the Match object to get the substring that matches the pattern.

	
print(re.match("Hola", string).group())
	
Hola

We could also use the start and end of the match to make a slice of the string.

	
start, end = re.match("Hola", string).span()
print(string[start:end])
	
Hola

Split

With the split() method we can split a string into a list of substrings using a pattern as a separator.

	
split = re.split("soy", string)
print(split)
	
['Hola, ', ' un string']

The sentence has been divided into two strings using "soy" as separator.
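Since the separator is itself a pattern, it can also be a class of characters. A minimal sketch, assuming we want to split on any run of commas and whitespace (this pattern is an illustrative choice, not part of the original example):

```python
import re

string = "Hola, soy un string"
# Split on one or more commas or whitespace characters
words = re.split(r"[,\s]+", string)
print(words)  # ['Hola', 'soy', 'un', 'string']
```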

Sub

With the sub() method we can replace all matches of a pattern with another substring.

	
sub = re.sub("soy", "eres", string)
print(sub)
	
Hola, eres un string

It has replaced every match of "soy" with "eres".
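If only some of the matches should be replaced, sub() accepts a count argument, and subn() also reports how many replacements were made. A short sketch using the same string; replacing "o" with "0" is just an arbitrary example:

```python
import re

string = "Hola, soy un string"
# count limits how many matches are replaced
first_only = re.sub("o", "0", string, count=1)
# subn returns the new string together with the number of replacements made
replaced, n = re.subn("o", "0", string)
print(first_only)   # H0la, soy un string
print(replaced, n)  # H0la, s0y un string 2
```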

Patterns

The . character

The . character matches any single character, so every character in our string will be found.

	
string = "Hola, soy un string"
print(re.findall(".", string))
	
['H', 'o', 'l', 'a', ',', ' ', 's', 'o', 'y', ' ', 'u', 'n', ' ', 's', 't', 'r', 'i', 'n', 'g']

If, for example, we want sequences of two characters, we search with two consecutive . characters.

	
string1 = "Hola, soy un string"
string2 = "Hola, soy un string2"
print(re.findall("..", string1))
print(re.findall("..", string2))
	
['Ho', 'la', ', ', 'so', 'y ', 'un', ' s', 'tr', 'in']
['Ho', 'la', ', ', 'so', 'y ', 'un', ' s', 'tr', 'in', 'g2']

As we can see, string1 has an odd number of characters, so the last g is not taken, while string2 has an even number of characters, so all of them are taken.

Let's look at this another way: let's replace each sequence of three characters with a $ symbol.

	
print(string1)
print(re.sub("...", "$ ", string1))
	
Hola, soy un string
$ $ $ $ $ $ g

I have added a space after each $ so that the change is easier to see. Note that the last character is not replaced, because it does not complete a sequence of three characters.

Predefined and constructed classes

Digit

If we want to find digits we need to use \d.

	
string = "Hola, soy un string con 123 digitos"
print(re.findall(r"\d", string))
	
['1', '2', '3']

As before, if we want two consecutive digits, we write \d twice.

	
print(re.findall(r"\d\d", string))
	
['12']

Letter

If we want to find word characters we need to use \w. A word character is any letter from a to z or from A to Z, any digit from 0 to 9, or _. In Python 3, \w also matches accented and other Unicode letters by default.

	
string = "Hola, soy un_string con, 123 digitos"
print(re.findall(r"\w", string))
	
['H', 'o', 'l', 'a', 's', 'o', 'y', 'u', 'n', '_', 's', 't', 'r', 'i', 'n', 'g', 'c', 'o', 'n', '1', '2', '3', 'd', 'i', 'g', 'i', 't', 'o', 's']

As we can see, it takes everything except the spaces and the comma.

Spaces

If we want to find spaces we need \s.

	
string = "Hola, soy un_string con, 123 digitos"
print(re.sub(r"\s", "*", string))
	
Hola,*soy*un_string*con,*123*digitos

\s matches any whitespace character, so line breaks are treated as spaces too.

	
string = """Hola, soy un string
con un salto de línea"""
print(re.sub(r"\s", "*", string))
	
Hola,*soy*un*string**con*un*salto*de*línea
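To make the difference explicit, here is a small sketch comparing \s with a literal space. The sample text is made up for the illustration.

```python
import re

text = "a b\tc\nd"
# \s matches spaces, tabs and line breaks
any_whitespace = re.findall(r"\s", text)
# A literal space matches only the space character
only_spaces = re.findall(r" ", text)
print(any_whitespace)  # [' ', '\t', '\n']
print(only_spaces)     # [' ']
```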

Ranges

If we want to search for a range of characters we use []. For example, if we want the digits from 4 to 8 we use

	
string = "1234567890"
print(re.findall("[4-8]", string))
	
['4', '5', '6', '7', '8']

We can combine several ranges inside the same brackets.

	
string = "1234567890"
print(re.findall("[2-57-9]", string))
	
['2', '3', '4', '5', '7', '8', '9']

If we also want to find a specific character, we add that character inside the brackets, preceded by \.

	
string = "1234567890."
print(re.findall(r"[2-57-9\.]", string))
	
['2', '3', '4', '5', '7', '8', '9', '.']

Brackets [ and ]

As we have seen, if we want to find ranges we use [], but what if we want to find the [ or ] character itself? For that we have to escape them as \[ and \].

	
string = "[1234567890]"
print(re.findall(r"\[", string))
print(re.findall(r"\]", string))
	
['[']
[']']
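When the text to search for contains several special characters, escaping each one by hand is error-prone. The re module provides re.escape() for this. A brief sketch; the literal "[12" is just an example value:

```python
import re

string = "[1234567890]"
literal = "[12"
# re.escape adds the backslashes needed to match the text literally
pattern = re.escape(literal)
print(pattern)                      # \[12
print(re.findall(pattern, string))  # ['[12']
```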

Quantifiers *, +, ?

Star * (zero or more)

The * quantifier matches zero or more repetitions of the preceding element, instead of exactly one as before.

	
string = "Hola, soy un string con 12 123 digitos"
print(re.findall(r"\d", string))
print(re.findall(r"\d*", string))
	
['1', '2', '1', '2', '3']
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '12', '', '123', '', '', '', '', '', '', '', '', '']

As you can see, with * the search returns an empty string at every position where there are no digits, and the complete run of digits wherever there is one.

Plus + (one or more)

The + quantifier matches one or more repetitions of the preceding element.

	
string = "Hola, soy un string con 1 12 123 digitos"
print(re.findall(r"\d+", string))
	
['1', '12', '123']

Optional ? (zero or one)

The ? quantifier matches zero or one repetition of the preceding element.

	
string = "Hola, soy un string con 1 12 123 digitos"
print(re.sub(r"\d?", "-", string))
	
-H-o-l-a-,- -s-o-y- -u-n- -s-t-r-i-n-g- -c-o-n- -- --- ---- -d-i-g-i-t-o-s-

Counters

When we want to find something that appears a given number of times we use counters, written with braces {}. For example, if we want to find sequences of exactly two digits

	
string = "Hola, soy un string con 1 12 123 1234 1234digitos"
print(re.findall(r"\d{2}", string))
	
['12', '12', '12', '34', '12', '34']

As you can see, it has found the sequences 12 and 34.

Counters also accept a lower and an upper bound, {min,max}.

	
string = "Hola, soy un string con 1 12 123 1234 1234digitos"
print(re.findall(r"\d{2,5}", string))
	
['12', '123', '1234', '1234']

If the upper bound is omitted, we are asking for at least the indicated number of elements, with no upper limit.

	
string = "Hola, soy un string con 1 12 123 1234 12345464168415641646451563416 digitos"
print(re.findall(r"\d{2,}", string))
	
['12', '123', '1234', '12345464168415641646451563416']

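When both bounds are given, longer sequences are split into chunks no longer than the upper bound, as happens below with {2,3}. If we want to use this notation with a fixed number of repetitions, we have to put that number in both bounds, for example {2,2}.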

	
string = "Hola, soy un string con 1 12 123 1234 12345464168415641646451563416 digitos"
print(re.findall(r"\d{2,3}", string))
	
['12', '123', '123', '123', '454', '641', '684', '156', '416', '464', '515', '634', '16']
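The following sketch checks that {2} and {2,2} behave the same, and shows one way to keep only the numbers that have exactly two digits. The word boundary \b is not covered in this post; it is introduced here only for the illustration.

```python
import re

string = "Hola, soy un string con 1 12 123 1234"
# {2} and {2,2} are equivalent: exactly two digits per match
short_form = re.findall(r"\d{2}", string)
long_form = re.findall(r"\d{2,2}", string)
print(short_form == long_form)  # True
# \b (word boundary) rejects digits that belong to a longer number
exact = re.findall(r"\b\d{2}\b", string)
print(exact)  # ['12']
```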

Classes

You can create classes using [] brackets. We have already used them for ranges, but once you define what you want inside, you can treat the brackets as a single class and apply quantifiers to it.

For example, suppose we have a telephone number, which can be given in one of the following ways

  • 666-66-66-66
  • 666-666-666
  • 666 666 666
  • 666 66 66 66
  • 666666666

There are many ways to write a number, so let's see how to create a class that defines the separator.

First we look for all digit sequences that contain at least two digits.

	
string1 = "666-66-66-66"
string2 = "666-666-666"
string3 = "666 66 66 66"
string4 = "666 666 666"
string5 = "666666666"
print(f"string1: {string1} -->", re.findall(r"\d{2,}", string1))
print(f"string2: {string2} -->", re.findall(r"\d{2,}", string2))
print(f"string3: {string3} -->", re.findall(r"\d{2,}", string3))
print(f"string4: {string4} -->", re.findall(r"\d{2,}", string4))
print(f"string5: {string5} -->", re.findall(r"\d{2,}", string5))
	
string1: 666-66-66-66 --> ['666', '66', '66', '66']
string2: 666-666-666 --> ['666', '666', '666']
string3: 666 66 66 66 --> ['666', '66', '66', '66']
string4: 666 666 666 --> ['666', '666', '666']
string5: 666666666 --> ['666666666']

Now we define the separator as either a - or a space.

	
string1 = "666-66-66-66"
string2 = "666-666-666"
string3 = "666 66 66 66"
string4 = "666 666 666"
string5 = "666666666"
print(f"string1: {string1} -->", re.findall(r"[-\s]", string1))
print(f"string2: {string2} -->", re.findall(r"[-\s]", string2))
print(f"string3: {string3} -->", re.findall(r"[-\s]", string3))
print(f"string4: {string4} -->", re.findall(r"[-\s]", string4))
print(f"string5: {string5} -->", re.findall(r"[-\s]", string5))
	
string1: 666-66-66-66 --> ['-', '-', '-']
string2: 666-666-666 --> ['-', '-']
string3: 666 66 66 66 --> [' ', ' ', ' ']
string4: 666 666 666 --> [' ', ' ']
string5: 666666666 --> []

As you can see, nothing is found in the last string, so we add a ? to match zero or one separator.

	
string1 = "666-66-66-66"
string2 = "666-666-666"
string3 = "666 66 66 66"
string4 = "666 666 666"
string5 = "666666666"
print(f"string1: {string1} -->", re.findall(r"[-\s]?", string1))
print(f"string2: {string2} -->", re.findall(r"[-\s]?", string2))
print(f"string3: {string3} -->", re.findall(r"[-\s]?", string3))
print(f"string4: {string4} -->", re.findall(r"[-\s]?", string4))
print(f"string5: {string5} -->", re.findall(r"[-\s]?", string5))
	
string1: 666-66-66-66 --> ['', '', '', '-', '', '', '-', '', '', '-', '', '', '']
string2: 666-666-666 --> ['', '', '', '-', '', '', '', '-', '', '', '', '']
string3: 666 66 66 66 --> ['', '', '', ' ', '', '', ' ', '', '', ' ', '', '', '']
string4: 666 666 666 --> ['', '', '', ' ', '', '', '', ' ', '', '', '', '']
string5: 666666666 --> ['', '', '', '', '', '', '', '', '', '']

Now we put everything together.

	
string1 = "666-66-66-66"
string2 = "666-666-666"
string3 = "666 66 66 66"
string4 = "666 666 666"
string5 = "666666666"
print(f"string1: {string1} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?", string1))
print(f"string2: {string2} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?", string2))
print(f"string3: {string3} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?", string3))
print(f"string4: {string4} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?", string4))
print(f"string5: {string5} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?", string5))
	
string1: 666-66-66-66 --> ['666-66-66-66']
string2: 666-666-666 --> []
string3: 666 66 66 66 --> ['666 66 66 66']
string4: 666 666 666 --> []
string5: 666666666 --> ['666666666']

As we can see, nothing is found in string2 and string4. We have used the filter \d{2,}[-\s]? four times, i.e. we want a sequence of at least two digits followed by an optional hyphen or space separator, repeated four times. But the last [-\s]? is unnecessary, since a number never ends with a space or a hyphen.

	
string1 = "666-66-66-66"
string2 = "666-666-666"
string3 = "666 66 66 66"
string4 = "666 666 666"
string5 = "666666666"
print(f"string1: {string1} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}", string1))
print(f"string2: {string2} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}", string2))
print(f"string3: {string3} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}", string3))
print(f"string4: {string4} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}", string4))
print(f"string5: {string5} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}", string5))
	
string1: 666-66-66-66 --> ['666-66-66-66']
string2: 666-666-666 --> []
string3: 666 66 66 66 --> ['666 66 66 66']
string4: 666 666 666 --> []
string5: 666666666 --> ['666666666']

Still nothing is found for string2 and string4. This is because the filter ends with \d{2,}, i.e. after the third separator we expect at least two more digits, which does not happen in string2 and string4. So we use the following

	
string1 = "666-66-66-66"
string2 = "666-666-666"
string3 = "666 66 66 66"
string4 = "666 666 666"
string5 = "666666666"
print(f"string1: {string1} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d*", string1))
print(f"string2: {string2} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d*", string2))
print(f"string3: {string3} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d*", string3))
print(f"string4: {string4} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d*", string4))
print(f"string5: {string5} -->", re.findall(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d*", string5))
	
string1: 666-66-66-66 --> ['666-66-66-66']
string2: 666-666-666 --> ['666-666-666']
string3: 666 66 66 66 --> ['666 66 66 66']
string4: 666 666 666 --> ['666 666 666']
string5: 666666666 --> ['666666666']
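If the goal is to validate that a whole string is a phone number, rather than to extract numbers from a longer text, one option is to compile the pattern once and use fullmatch(). This is a sketch under that assumption; the helper name is_phone is hypothetical.

```python
import re

# Same pattern as above, compiled once and reused
phone = re.compile(r"\d{2,}[-\s]?\d{2,}[-\s]?\d{2,}[-\s]?\d*")

def is_phone(text):
    """Return True only if the whole string matches the phone pattern."""
    return phone.fullmatch(text) is not None

numbers = ["666-66-66-66", "666-666-666", "666 66 66 66", "666 666 666", "666666666"]
print([is_phone(n) for n in numbers])  # [True, True, True, True, True]
print(is_phone("666-66"))              # False
print(is_phone("tel: 666666666"))      # False
```

Unlike findall(), fullmatch() rejects strings that merely contain a number somewhere inside them.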

The ? quantifier as a lazy quantifier

The separators in the above example can also be found with \d+?[- ].

	
string1 = "666-66-66-66"
string2 = "666-666-666"
string3 = "666 66 66 66"
string4 = "666 666 666"
string5 = "666666666"
print(f"string1: {string1} -->", re.findall(r"\d+?[- ]", string1))
print(f"string2: {string2} -->", re.findall(r"\d+?[- ]", string2))
print(f"string3: {string3} -->", re.findall(r"\d+?[- ]", string3))
print(f"string4: {string4} -->", re.findall(r"\d+?[- ]", string4))
print(f"string5: {string5} -->", re.findall(r"\d+?[- ]", string5))
	
string1: 666-66-66-66 --> ['666-', '66-', '66-']
string2: 666-666-666 --> ['666-', '666-']
string3: 666 66 66 66 --> ['666 ', '66 ', '66 ']
string4: 666 666 666 --> ['666 ', '666 ']
string5: 666666666 --> []

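Without the ? we would have \d+[- ], which means a sequence of one or more digits followed by a space or a hyphen. When ? is placed right after another quantifier, it makes that quantifier lazy (non-greedy): it matches as few characters as possible instead of as many as possible. In this example the result is the same either way, because the digits must be followed by a separator in both cases.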
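The effect of the lazy form is easier to see on a pattern where the two versions give different results. A minimal sketch with a made-up string:

```python
import re

text = "<a><b>"
greedy = re.findall(r"<.+>", text)   # + takes as many characters as possible
lazy = re.findall(r"<.+?>", text)    # +? takes as few characters as possible
print(greedy)  # ['<a><b>']
print(lazy)    # ['<a>', '<b>']
```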

Negation

We saw before that \d finds digits; with \D we find everything that is not a digit.

	
string1 = "E3s4t6e e1s2t3r5i6n7g8 t9i0e4n2e1 d4i5g7i9t0o5s2"
print(re.findall(r"\D", string1))
	
['E', 's', 't', 'e', ' ', 'e', 's', 't', 'r', 'i', 'n', 'g', ' ', 't', 'i', 'e', 'n', 'e', ' ', 'd', 'i', 'g', 'i', 't', 'o', 's']

The same happens with word characters: \W finds everything that is not a word character.

	
string1 = "Letras ab27_ no letras ,.:;´ç"
print(re.findall(r"\W", string1))
	
[' ', ' ', ' ', ' ', ',', '.', ':', ';', '´']

With \S we find everything that is not whitespace.

	
print(re.findall(r"\S", string1))
	
['L', 'e', 't', 'r', 'a', 's', 'a', 'b', '2', '7', '_', 'n', 'o', 'l', 'e', 't', 'r', 'a', 's', ',', '.', ':', ';', '´', 'ç']

If we have defined our own class, we can negate it by putting ^ right after the opening bracket.

	
string1 = "1234567890"
print(re.findall("[^5-9]", string1))
	
['1', '2', '3', '4', '0']

Going back to the example of the phone numbers from before, we can filter them by the following

	
string1 = "666-66-66-66"
string2 = "666-666-666"
string3 = "666 66 66 66"
string4 = "666 666 666"
string5 = "666666666"
print(f"string1: {string1} -->", re.findall(r"\d{2,}\D?\d{2,}\D?\d{2,}\D?\d*", string1))
print(f"string2: {string2} -->", re.findall(r"\d{2,}\D?\d{2,}\D?\d{2,}\D?\d*", string2))
print(f"string3: {string3} -->", re.findall(r"\d{2,}\D?\d{2,}\D?\d{2,}\D?\d*", string3))
print(f"string4: {string4} -->", re.findall(r"\d{2,}\D?\d{2,}\D?\d{2,}\D?\d*", string4))
print(f"string5: {string5} -->", re.findall(r"\d{2,}\D?\d{2,}\D?\d{2,}\D?\d*", string5))
	
string1: 666-66-66-66 --> ['666-66-66-66']
string2: 666-666-666 --> ['666-666-666']
string3: 666 66 66 66 --> ['666 66 66 66']
string4: 666 666 666 --> ['666 666 666']
string5: 666666666 --> ['666666666']

What we are doing is asking for sequences of at least two digits followed by zero or one non-digit.

Beginning ^ and end $ of line

With ^ we can search at the beginning of a line. For example, if we want to find a digit only at the beginning of a line

	
string1 = "linea 1"
string2 = "2ª linea"
print(re.findall(r"^\d", string1))
print(re.findall(r"^\d", string2))
	
[]
['2']

As you can see there is only one digit at the beginning of the line in string2.

Likewise, the end of a line can be found with $. If we want to find a digit only at the end of a line

	
string1 = "linea 1"
string2 = "2ª linea"
print(re.findall(r"\d$", string1))
print(re.findall(r"\d$", string2))
	
['1']
[]

This only occurs in string1.
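In the examples above each string has a single line. With multi-line text, ^ and $ by default only match at the start and end of the whole string; the re.MULTILINE flag makes them match at every line. A short sketch with a made-up three-line text:

```python
import re

text = "linea 1\n2ª linea\n3 lineas"
# Without the flag, ^ only matches at the very start of the string
print(re.findall(r"^\d", text))                # []
# With re.MULTILINE, ^ and $ match at the start and end of each line
starts = re.findall(r"^\d", text, re.MULTILINE)
ends = re.findall(r"\d$", text, re.MULTILINE)
print(starts)  # ['2', '3']
print(ends)    # ['1']
```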

Practical examples

Logs

Suppose that in the following log we want to find only the WARN entries.

	
log = """[LOG ENTRY] [ERROR] The system is unstable
[LOG ENTRY] [WARN] The system may be down
[LOG ENTRY] [WARN] Microsoft just bought Github
[LOG DATA] [LOG] Everything is OK
[LOG ENTRY] [LOG] [user:@beco] Logged in
[LOG ENTRY] [LOG] [user:@beco] Clicked here
[LOG DATA] [LOG] [user:@celismx] Did something
[LOG ENTRY] [LOG] [user:@beco] Rated the app
[LOG ENTRY] [LOG] [user:@beco] Logged out
[LOG LINE] [LOG] [user:@celismx] Logged in"""
result = re.findall(r"\[LOG.*\[WARN\].*", log)
result
	
['[LOG ENTRY] [WARN] The system may be down',
'[LOG ENTRY] [WARN] Microsoft just bought Github']
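An alternative, sketched here on a shortened copy of the log, is to go line by line and keep the lines where search() finds the escaped [WARN] tag. The variable names are illustrative.

```python
import re

log = """[LOG ENTRY] [ERROR] The system is unstable
[LOG ENTRY] [WARN] The system may be down
[LOG DATA] [LOG] Everything is OK"""

# The brackets are escaped so they are matched literally
warn = re.compile(r"\[WARN\]")
warn_lines = [line for line in log.splitlines() if warn.search(line)]
print(warn_lines)  # ['[LOG ENTRY] [WARN] The system may be down']
```

This form can be convenient when each line needs further processing after filtering.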

Phone number

Within a number we can find letters such as e for extension, # also for extension, or p for a pause if a computer is dialing. We can also find a + to indicate a country prefix, and separators such as spaces, - or .

	
tel = """555658
56-58-11
56.58.11
56.78-98
65 09 87
76y87r98
45y78-56
78.87 65
78 54-56
+521565811
58-11-11#246
55256048p123
55256048e123"""
result = re.findall(r"\+?\d{2,3}[^\da-zA-Z\n]?\d{2,3}[^\da-zA-Z\n]?\d{2,3}[#pe]?\d*", tel)
result
	
['555658',
'56-58-11',
'56.58.11',
'56.78-98',
'65 09 87',
'78.87 65',
'78 54-56',
'+521565811',
'58-11-11#246',
'55256048p123',
'55256048e123']

Here is an explanation

  • \+?: It may start with a + character (zero or one).
  • \d{2,3}: Followed by 2 to 3 digits.
  • [^\da-zA-Z\n]?: Next there can be zero or one character that is neither a digit, nor a letter from a to z, nor a letter from A to Z, nor a line break.
  • \d{2,3}: Followed by 2 to 3 digits.
  • [^\da-zA-Z\n]?: Next there can be zero or one character that is neither a digit, nor a letter from a to z, nor a letter from A to Z, nor a line break.
  • \d{2,3}: Followed by 2 to 3 digits.
  • [#pe]?: Then there can be zero or one of the characters #, p or e.
  • \d*: Lastly, zero or more digits.
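A long pattern like this one can be easier to read when written with the re.VERBOSE flag, which ignores whitespace in the pattern and allows comments. The following is a sketch of the same pattern in that style, tried on three of the sample numbers:

```python
import re

phone = re.compile(r"""
    \+?                 # optional leading +
    \d{2,3}             # 2 to 3 digits
    [^\da-zA-Z\n]?      # optional separator: not a digit, letter or line break
    \d{2,3}             # 2 to 3 digits
    [^\da-zA-Z\n]?      # optional separator
    \d{2,3}             # 2 to 3 digits
    [#pe]?              # optional extension or pause marker
    \d*                 # extension digits, if any
""", re.VERBOSE)

sample = "56-58-11\n76y87r98\n58-11-11#246"
found = phone.findall(sample)
print(found)  # ['56-58-11', '58-11-11#246']
```

Inside a character class such as [#pe], the # is still treated as a literal character, not as the start of a comment.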

URLs

	
urls = """url: https://www.instagram.com/p/BXB4zsUlW5Z/?taken-by=beco.mx
url: http://instagram.com/p/blablablah
url: http://itam.mx/test
http://instagram.com/p/blablablah
https://www.vanguarsoft.com.ve
http://platzi.com
https://traetelo.net
https://traetelo.net/images archivo.jsp
url: https://subdominio.traetelo.net
url: https://www.instagram.com/p/BXB4zsUlW5Z/?taken-by=beco.mx
url: http://instagram.com/p/blablablah
url: http://itam.mx/test
http://instagram.com/p/blablablah
https://www.google.com.co/
https://sub.dominio.de.alguien.com/archivo.html
https://en.wikipedia.org/wiki/.org
https://cdn-microsoft.org/image/seixo2t9sjl_22.jpg
https://hola.pizza
https://platzi.com/clases/1301-expresiones-regulares/11860-urls9102/ clase
https://api.giphy.com/v1/gifs/search?q=Rick and Morty&limit=10&api_key=DG3hItPp5HIRNC0nit3AOR7eQZAe
http://localhost:3000/something?color1=red&color2=blue
http://localhost:3000/display/post?size=small
http://localhost:3000/?name=satyam
http://localhost:3000/scanned?orderid=234
http://localhost:3000/getUsers?userId=12354411&name=Billy
http://localhost:3000/getUsers?userId=12354411
http://localhost:3000/search?city=Barcelona
www.sitiodeejemplo.net/pagina.php?nombredevalor1=valor1&nombredevalor2=valor2"""
result = re.findall(r"https?://[\w\-.]+\.\w{2,6}/?\S*", urls)
result
	
['https://www.instagram.com/p/BXB4zsUlW5Z/?taken-by=beco.mx',
'http://instagram.com/p/blablablah',
'http://itam.mx/test',
'http://instagram.com/p/blablablah',
'https://www.vanguarsoft.com.ve',
'http://platzi.com',
'https://traetelo.net',
'https://traetelo.net/images',
'https://subdominio.traetelo.net',
'https://www.instagram.com/p/BXB4zsUlW5Z/?taken-by=beco.mx',
'http://instagram.com/p/blablablah',
'http://itam.mx/test',
'http://instagram.com/p/blablablah',
'https://www.google.com.co/',
'https://sub.dominio.de.alguien.com/archivo.html',
'https://en.wikipedia.org/wiki/.org',
'https://cdn-microsoft.org/image/seixo2t9sjl_22.jpg',
'https://hola.pizza',
'https://platzi.com/clases/1301-expresiones-regulares/11860-urls9102/',
'https://api.giphy.com/v1/gifs/search?q=Rick']

Here is an explanation

  • http: We want it to start with http.
  • s?: Optionally followed by an s.
  • ://: Followed by ://.
  • [\w\-.]+: Followed by one or more word characters, hyphens or periods.
  • \.: Next, a period.
  • \w{2,6}: Between 2 and 6 word characters for the TLD.
  • /?: Followed by zero or one /.
  • \S*: Lastly, zero or more characters that are not whitespace.

Emails

	
mails = """esto.es_un.mail@mail.com
esto.es_un.mail+complejo@mail.com
dominio.com
rodrigo.jimenez@yahoo.com.mx
ruben@starbucks.com
esto_no$es_email@dominio.com
no_se_de_internet3@hotmail.com"""
result = re.findall(r"[\w._]{5,30}\+?[\w._]{0,10}@[\w.-]{2,}\.\w{2,6}", mails)
result
	
['esto.es_un.mail@mail.com',
'esto.es_un.mail+complejo@mail.com',
'rodrigo.jimenez@yahoo.com.mx',
'ruben@starbucks.com',
'es_email@dominio.com',
'no_se_de_internet3@hotmail.com']

Here is an explanation

  • [\w._]{5,30}: We want it to start with between 5 and 30 word characters, dots or underscores (approximately the length range that Gmail allows).
  • \+?: Followed by zero or one +.
  • [\w._]{0,10}: Then between 0 and 10 word characters, dots or underscores.
  • @: Then an @.
  • [\w.-]{2,}: Two or more word characters, dots or hyphens (the domain).
  • \.: Followed by a period.
  • \w{2,6}: And finally between 2 and 6 word characters for the TLD.

Locations

Locations can be given in two formats, decimal values separated by commas or degrees, minutes and seconds with a compass letter, so we analyze both.

	
loc = """-99.205646,19.429707,2275.10
-99.205581, 19.429652,2275.10
-99.204654,19.428952,2275.58"""
result = re.findall(r"-?\d{1,3}\.\d{1,6},\s?-?\d{1,3}\.\d{1,6},.*", loc)
result
	
['-99.205646,19.429707,2275.10',
'-99.205581, 19.429652,2275.10',
'-99.204654,19.428952,2275.58']

Here is an explanation

  • -?: We want it to start with zero or one minus sign.
  • \d{1,3}: Followed by between one and three digits.
  • \.: Next, a period.
  • \d{1,6}: Then between one and six digits.
  • ,: Then a comma.
  • \s?: Then zero or one space.
  • -?: Zero or one minus sign.
  • \d{1,3}: Then between one and three digits.
  • \.: Next, a period.
  • \d{1,6}: Followed by between one and six digits.
  • ,: Then a comma.
  • .*: Lastly, zero or more characters of any kind.
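Once a line matches, parentheses can be used to pull out each value and convert it to a number. The sketch below is a variation of the pattern above with three groups added. The names first, second and third are placeholders, since the text does not say what each field represents.

```python
import re

# Same structure as the pattern above, with one group per value
coord = re.compile(r"(-?\d{1,3}\.\d{1,6}),\s?(-?\d{1,3}\.\d{1,6}),(.*)")

line = "-99.205581, 19.429652,2275.10"
match = coord.match(line)
# groups() returns the text captured by each pair of parentheses
first, second, third = (float(value) for value in match.groups())
print(first, second, third)  # -99.205581 19.429652 2275.1
```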
	
loc = """-99 12' 34.08"W, 19 34' 56.98"N
-34 54' 32.00"E, -3 21' 67.00"S"""
result = re.findall(r"-?\d{1,3}\s\d{1,2}'\s\d{1,2}\.\d{2,2}\"[WE],\s?-?\d{1,3}\s\d{1,2}'\s\d{1,2}\.\d{2,2}\"[SN]", loc)
result
	
['-99 12\' 34.08"W, 19 34\' 56.98"N', '-34 54\' 32.00"E, -3 21\' 67.00"S']
	
print(result[0])
print(result[1])
	
-99 12' 34.08"W, 19 34' 56.98"N
-34 54' 32.00"E, -3 21' 67.00"S

Here is an explanation

  • -?: We want it to start with zero or one minus sign.
  • \d{1,3}: Followed by between one and three digits.
  • \s: Then a space.
  • \d{1,2}: Then between one and two digits.
  • ': Then a '.
  • \s: Followed by a space.
  • \d{1,2}: Then between one and two digits.
  • \.: Then a period.
  • \d{2,2}: Followed by two digits.
  • \": Then a ".
  • [WE]: Then the letter W or the letter E.
  • ,: Then a comma.
  • \s?: Followed by zero or one space.
  • -?: Then zero or one minus sign.
  • \d{1,3}: Then between one and three digits.
  • \s: Followed by a space.
  • \d{1,2}: Then between one and two digits.
  • ': Then a '.
  • \s: Then a space.
  • \d{1,2}: Next, between one and two digits.
  • \.: Followed by a period.
  • \d{2,2}: Then two digits.
  • \": Followed by a ".
  • [SN]: And finally the letter S or the letter N.

Names

	
nombres = """Camilo Sarmiento Gálvez
Alejandro Pliego Abasto
Milagros Reyes Japón
Samuel París Arrabal
Juan Pablo Tafalla
Axel Gálvez Velázquez
Óscar Montreal Aparicio
Jacobo Pozo Tassis
Guillermo Ordóñez Espiga
Eduardo Pousa Curbelo
Ivanna Bienvenida Kevin
Ada Tasis López
Luciana Sáenz García
Florencia Sainz Márquz
Catarina Cazalla Lombarda
Paloma Gallo Perro
Margarita Quesada Florez
Vicente Fox Quesada
Iris Graciani
Asunción Carballar
Constanza Muñoz
Manuel Andres García Márquez"""
result = re.findall(r"[A-ZÁÉÍÓÚ][a-záéíóú]+\s[A-ZÁÉÍÓÚ][a-záéíóú]+\s[A-ZÁÉÍÓÚ][a-záéíóú]+", nombres)
result
	
['Camilo Sarmiento Gálvez',
'Alejandro Pliego Abasto',
'Milagros Reyes Japón',
'Samuel París Arrabal',
'Juan Pablo Tafalla',
'Axel Gálvez Velázquez',
'Óscar Montreal Aparicio',
'Jacobo Pozo Tassis',
'Espiga Eduardo Pousa',
'Curbelo Ivanna Bienvenida',
'Kevin Ada Tasis',
'López Luciana Sáenz',
'García Florencia Sainz',
'Márquz Catarina Cazalla',
'Lombarda Paloma Gallo',
'Perro Margarita Quesada',
'Florez Vicente Fox',
'Quesada Iris Graciani',
'Asunción Carballar Constanza',
'Manuel Andres García']

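Here is an explanation

  • [A-ZÁÉÍÓÚ]: We want it to start with an uppercase letter, including accented ones.
  • [a-záéíóú]+: Followed by one or more lowercase letters, including accented ones.
  • \s: Followed by a space.
  • [A-ZÁÉÍÓÚ]: Followed by an uppercase letter, including accented ones.
  • [a-záéíóú]+: Followed by one or more lowercase letters, including accented ones.
  • \s: Followed by a space.
  • [A-ZÁÉÍÓÚ]: Followed by an uppercase letter, including accented ones.
  • [a-záéíóú]+: Followed by one or more lowercase letters, including accented ones.

Note that from "Guillermo Ordóñez Espiga" onwards the matches are shifted and span two lines. The ñ is not included in the lowercase class, so that name does not match, and since \s also matches line breaks, the next match starts at "Espiga" and continues on the following line.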
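One possible way to keep each match within a single line is to add ñ to the classes and use a literal space instead of \s. This is a sketch of that variation on three of the names, not the pattern used above:

```python
import re

nombres = """Guillermo Ordóñez Espiga
Eduardo Pousa Curbelo
Iris Graciani"""

upper = "A-ZÁÉÍÓÚÑ"   # uppercase letters, accented vowels and Ñ
lower = "a-záéíóúñ"   # lowercase letters, accented vowels and ñ
word = f"[{upper}][{lower}]+"
# A literal space (instead of \s) cannot match a line break
pattern = re.compile(f"{word} {word} {word}")
full_names = pattern.findall(nombres)
print(full_names)  # ['Guillermo Ordóñez Espiga', 'Eduardo Pousa Curbelo']
```

Names with only two words, such as "Iris Graciani", are still skipped because the pattern requires three.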

Search and replace

We are going to download a file that lists a large number of historical films.

	
# download file from url
import urllib.request
url = "https://static.platzi.com/media/tmp/class-files/github/moviedemo/moviedemo-master/movies.dat"
urllib.request.urlretrieve(url, "movies.dat")
	
HTTPError: HTTP Error 403: Forbidden

The download failed with a 403 error when this notebook was run, so the file may no longer be available at that URL. The cells below assume that movies.dat is already in the working directory.

Let's print the first 11 lines to analyze it.

	
# Print lines 0 to 10 of the file and stop
with open("movies.dat", "r") as file:
    for i, line in enumerate(file):
        print(line, end="")
        if i == 10:
            break
	
1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy
2::Jumanji (1995)::Adventure|Children|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama|Romance
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller
11::American President, The (1995)::Comedy|Drama|Romance

As you can see, each line has an ID, followed by ::, then the name of the movie with the year in parentheses, followed by :: and then the genres separated by |.

We can clean the file very easily using regular expressions, the compile and match functions, and groupings with parentheses. With groupings we select which parts of the text we want to keep, and then we can work with them as we like. Let's see it with an example.

	
pattern = re.compile(r"^\d+::([\w\s:,\(\)\.\-'&¡!/¿?ÁÉÍÓÚáéíóú\+*\$#°\'\"\[\]@·]+)\s\((\d{4,4})\)::(.*)$")
sep = ";;"
with open("movies.dat", "r") as file, open("movies.csv", "w") as file_filtered:
    file_filtered.write("title,year,genders\n")
    for line in file:
        result = pattern.match(line)
        if result:
            # Write the three groups (title, year, genres) separated by sep
            file_filtered.write(f"{result.group(1)}{sep}{result.group(2)}{sep}{result.group(3)}\n")
        else:
            # Lines that do not match the pattern are printed for inspection
            print(line, end="")

Let's see what we have done. First we have defined a pattern with the following:

  • ^: We want it to start at the beginning of the line.
  • \d+: Next, one or more digits.
  • ::: Followed by ::.
  • ([\w\s:,\(\)\.\-'&¡!/¿?ÁÉÍÓÚáéíóú\+*\$#°\'\"\[\]@·]+): This is the first grouping. We look for any word character, whitespace or other character listed in the square brackets, appearing one or more times.
  • \s: Next, a space.
  • \(: An opening parenthesis.
  • (\d{4,4}): Here is the second grouping. We are looking for four digits.
  • \): Then a closing parenthesis.
  • ::: Next, ::.
  • (.*): The third grouping, any character appearing zero or more times.
  • $: Lastly, the end of the line.

Inside the for loop we check line by line whether the pattern matches, and if it does we write the three groups to the csv separated by sep, which in our case we have defined as ;;. We chose this separator because some movie titles contain commas.
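Groups can also be given names, which makes the code that reads them easier to follow. The sketch below applies this to one sample line. It uses a simplified title pattern (.+) instead of the full character class above, and the group names title, year and genres are illustrative choices.

```python
import re

# (?P<name>...) defines a named group; the title pattern is simplified to .+
movie = re.compile(r"^\d+::(?P<title>.+)\s\((?P<year>\d{4})\)::(?P<genres>.*)$")

line = "11::American President, The (1995)::Comedy|Drama|Romance"
match = movie.match(line)
title = match.group("title")
year = match.group("year")
genres = match.group("genres").split("|")
print(title)   # American President, The
print(year)    # 1995
print(genres)  # ['Comedy', 'Drama', 'Romance']
```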

We read the csv with Pandas.

import pandas as pd
df = pd.read_csv("movies.csv", sep=";;", engine="python")
df.head()

                                   title,year,genders
Toy Story                    1995  Adventure|Animation|Children|Comedy|Fantasy
Jumanji                      1995  Adventure|Children|Fantasy
Grumpier Old Men             1995  Comedy|Romance
Waiting to Exhale            1995  Comedy|Drama|Romance
Father of the Bride Part II  1995  Comedy

Cheatsheet

Here is a cheatsheet with many patterns.

davechild_regular-expressions
