Regular expressions
import re
This notebook has been automatically translated to make it accessible to more people, please let me know if you see any typos.
Methods
Findall
With the findall()
method we can find all matches of a pattern in a string
import restring = "Hola, soy un string"print(re.findall("Hola, soy", string))
['Hola, soy']
Search
But if we want to find the position where a pattern is located, we can use the search()
method to search for a pattern in a string. This method returns a Match object if it finds a match, otherwise it returns None.
print(re.search("soy", string))
<re.Match object; span=(6, 9), match='soy'>
Match
We can also use the match()
method that looks for the pattern at the beginning of the string.
print(re.match("Hola", string))print(re.match("soy", string))
<re.Match object; span=(0, 4), match='Hola'>None
Span
If we want to get the position of the match, we can use the span()
method which returns a tuple with the start and end position of the match.
print(re.match("Hola", string).span())
(0, 4)
Group
Knowing the position of the match, we can use the group()
method to get the substring that matches the pattern.
print(re.match("Hola", string).group())
Hola
We could also use the start and end of the match to make a slice of the string.
start, end = re.match("Hola", string).span()print(string[start:end])
Hola
Split
With the split()
method we can split a string into a list of substrings using a pattern as a separator.
split = re.split("soy", string)print(split)
['Hola, ', ' un string']
The sentence has been divided into two strings using "soy" as separator.
Sub
With the sub()
method we can replace all matches of a pattern with another substring.
sub = re.sub("soy", "eres", string)print(sub)
Hola, eres un string
It has replaced all "I am" matches with "you are".
Patterns
The .
character
With the .
character we can search for any character, any character in our string will be found.
string = "Hola, soy un string"print(re.findall(".", string))
['H', 'o', 'l', 'a', ',', ' ', 's', 'o', 'y', ' ', 'u', 'n', ' ', 's', 't', 'r', 'i', 'n', 'g']
If for example we want sequences of two characters we would search with two .
s followed by `.
string1 = "Hola, soy un string"string2 = "Hola, soy un string2"print(re.findall("..", string1))print(re.findall("..", string2))
['Ho', 'la', ', ', 'so', 'y ', 'un', ' s', 'tr', 'in']['Ho', 'la', ', ', 'so', 'y ', 'un', ' s', 'tr', 'in', 'g2']
As we can see string1
has an odd number of characters, so the last g
is not taken, but string2
has an even number of characters, so it takes all characters.
Let's look at this another way, let's change each sequence of three characters by a $
symbol.
print(string1)print(re.sub("...", "$ ", string1))
Hola, soy un string$ $ $ $ $ $ g
I have printed two spaces after each $
so that you can see the change, you can see how the last character does not convert it.
Predefined and constructed classes
Digit
If we want to find the digits we need to use d
.
string = "Hola, soy un string con 123 digitos"print(re.findall("d", string))
['1', '2', '3']
As before, if for example we want two digits, we put d
twice
print(re.findall("dd", string))
['12']
Letter
If we want to find letters we need to use w
. Wordmeans all letters from
ato
z, from
Ato
Z, numbers from
0to
9and
_`.
string = "Hola, soy un_string con, 123 digitos"print(re.findall("w", string))
['H', 'o', 'l', 'a', 's', 'o', 'y', 'u', 'n', '_', 's', 't', 'r', 'i', 'n', 'g', 'c', 'o', 'n', '1', '2', '3', 'd', 'i', 'g', 'i', 't', 'o', 's']
As we can see, it takes everything except the spaces and the comma.
Spaces
If we want to find spaces we need `s
string = "Hola, soy un_string con, 123 digitos"print(re.sub("s", "*", string))
Hola,*soy*un_string*con,*123*digitos
Regular expressions consider line breaks as spaces.
string = """Hola, soy un stringcon un salto de línea"""print(re.sub("s", "*", string))
Hola,*soy*un*string**con*un*salto*de*línea
Ranks
If we want to search a range we use []
, for example, if we want the numbers from 4 to 8 we use
string = "1234567890"print(re.findall("[4-8]", string))
['4', '5', '6', '7', '8']
We can extend the search range
string = "1234567890"print(re.findall("[2-57-9]", string))
['2', '3', '4', '5', '7', '8', '9']
If we also want to find a specific character, we enter the character followed by ``.
string = "1234567890."print(re.findall("[2-57-9.]", string))
['2', '3', '4', '5', '7', '8', '9', '.']
Bracket [
and bracket ]
As we have seen, if we want to find ranges we use []
, but what if we want to find only the [
or the ]
? For that we have to use []
and [
]`.
string = "[1234567890]"print(re.findall("[", string))print(re.findall("]", string))
['['][']']
Delimiters +
, *
, ?
, `?
Star *
(none or all)
The *
delimiter indicates that you want to search for none or all of them, not one by one as before.
string = "Hola, soy un string con 12 123 digitos"print(re.findall("d", string))print(re.findall("d*", string))
['1', '2', '1', '2', '3']['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '12', '', '123', '', '', '', '', '', '', '', '', '']
As you can see, putting the *
has found all the positions where there are zero characters or all characters
Plus +
(one or more)
With the delimiter +
you indicate that you want to search for one or more
string = "Hola, soy un string con 1 12 123 digitos"print(re.findall("d+", string))
['1', '12', '123']
Optional ?
(zero or one)
The ?
delimiter indicates that you want to search for zero or one.
string = "Hola, soy un string con 1 12 123 digitos"print(re.sub("d?", "-", string))
-H-o-l-a-,- -s-o-y- -u-n- -s-t-r-i-n-g- -c-o-n- -- --- ---- -d-i-g-i-t-o-s-
Counters
When we want to find something that appears x times we use the counters with the braces {}
. For example, if we want to find a sequence in which there are at least two digits
string = "Hola, soy un string con 1 12 123 1234 1234digitos"print(re.findall("d{2}", string))
['12', '12', '12', '34', '12', '34']
As you can see you have found the sequences 12
and 34
.
The counters accept an upper and lower dimension {inf, sup}
.
string = "Hola, soy un string con 1 12 123 1234 1234digitos"print(re.findall("d{2,5}", string))
['12', '123', '1234', '1234']
If the upper dimension is not defined, it means that you want at least the number of elements indicated, but with no upper limit.
string = "Hola, soy un string con 1 12 123 1234 12345464168415641646451563416 digitos"print(re.findall("d{2,}", string))
['12', '123', '1234', '12345464168415641646451563416']
If we want to use the notation of upper and lower dimension, but we want a fixed number, we have to put that number in both dimensions
string = "Hola, soy un string con 1 12 123 1234 12345464168415641646451563416 digitos"print(re.findall("d{2,3}", string))
['12', '123', '123', '123', '454', '641', '684', '156', '416', '464', '515', '634', '16']
Classes
You can create classes using []
brackets. Actually we saw that this was used for ranges, but, once you define what you want inside, you can consider it as a class and operate with the []
.
For example, suppose we have a telephone number, which can be given in one of the following ways
- 666-66-66-66
- 666-666-666
- 666 666 666
- 666 66 66 66
- 666666666
There are many ways to give a number, so let's see how to create a class to define the delimiter
First we are going to tell it to look for all number sequences in which there are at least two numbers.
string1 = "666-66-66-66"string2 = "666-666-666"string3 = "666 66 66 66"string4 = "666 666 666"string5 = "666666666"print(f"string1: {string1} -->", re.findall("d{2,}", string1))print(f"string2: {string2} -->", re.findall("d{2,}", string2))print(f"string3: {string3} -->", re.findall("d{2,}", string3))print(f"string4: {string4} -->", re.findall("d{2,}", string4))print(f"string5: {string5} -->", re.findall("d{2,}", string5))
string1: 666-66-66-66 --> ['666', '66', '66', '66']string2: 666-666-666 --> ['666', '666', '666']string3: 666 66 66 66 --> ['666', '66', '66', '66']string4: 666 666 666 --> ['666', '666', '666']string5: 666666666 --> ['666666666']
Now we define to find the separator as a -
or a space
string1 = "666-66-66-66"string2 = "666-666-666"string3 = "666 66 66 66"string4 = "666 666 666"string5 = "666666666"print(f"string1: {string1} -->", re.findall("[-s]", string1))print(f"string2: {string2} -->", re.findall("[-s]", string2))print(f"string3: {string3} -->", re.findall("[-s]", string3))print(f"string4: {string4} -->", re.findall("[-s]", string4))print(f"string5: {string5} -->", re.findall("[-s]", string5))
string1: 666-66-66-66 --> ['-', '-', '-']string2: 666-666-666 --> ['-', '-']string3: 666 66 66 66 --> [' ', ' ', ' ']string4: 666 666 666 --> [' ', ' ']string5: 666666666 --> []
As you can see in the last string it has not found, so we add a ?
to find when there is zero or one.
string1 = "666-66-66-66"string2 = "666-666-666"string3 = "666 66 66 66"string4 = "666 666 666"string5 = "666666666"print(f"string1: {string1} -->", re.findall("[-s]?", string1))print(f"string2: {string2} -->", re.findall("[-s]?", string2))print(f"string3: {string3} -->", re.findall("[-s]?", string3))print(f"string4: {string4} -->", re.findall("[-s]?", string4))print(f"string5: {string5} -->", re.findall("[-s]?", string5))
string1: 666-66-66-66 --> ['', '', '', '-', '', '', '-', '', '', '-', '', '', '']string2: 666-666-666 --> ['', '', '', '-', '', '', '', '-', '', '', '', '']string3: 666 66 66 66 --> ['', '', '', ' ', '', '', ' ', '', '', ' ', '', '', '']string4: 666 666 666 --> ['', '', '', ' ', '', '', '', ' ', '', '', '', '']string5: 666666666 --> ['', '', '', '', '', '', '', '', '', '']
Now we are looking for everything to be together
string1 = "666-66-66-66"string2 = "666-666-666"string3 = "666 66 66 66"string4 = "666 666 666"string5 = "666666666"print(f"string1: {string1} -->", re.findall("d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?", string1))print(f"string2: {string2} -->", re.findall("d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?", string2))print(f"string3: {string3} -->", re.findall("d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?", string3))print(f"string4: {string4} -->", re.findall("d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?", string4))print(f"string5: {string5} -->", re.findall("d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?", string5))
string1: 666-66-66-66 --> ['666-66-66-66']string2: 666-666-666 --> []string3: 666 66 66 66 --> ['666 66 66 66']string4: 666 666 666 --> []string5: 666666666 --> ['666666666']
As we see in string2
and string4
, it finds nothing. We have set the filter [\d{2,}[\s]?
4 times, i.e. we want a sequence of at least two numbers, followed by zero or a hyphen or space separator that repeats 4 times. But in the last sequence there is no need for the [\d{2,}[\s]?
, since it will never end a number with a space or a hyphen.
string1 = "666-66-66-66"string2 = "666-666-666"string3 = "666 66 66 66"string4 = "666 666 666"string5 = "666666666"print(f"string1: {string1} -->", re.findall("d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?d{2,}", string1))print(f"string2: {string2} -->", re.findall("d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?d{2,}", string2))print(f"string3: {string3} -->", re.findall("d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?d{2,}", string3))print(f"string4: {string4} -->", re.findall("d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?d{2,}", string4))print(f"string5: {string5} -->", re.findall("d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?d{2,}", string5))
string1: 666-66-66-66 --> ['666-66-66-66']string2: 666-666-666 --> []string3: 666 66 66 66 --> ['666 66 66 66']string4: 666 666 666 --> []string5: 666666666 --> ['666666666']
It is still not found for string2
and string4
. This is because the last thing in the filter is a d{2,}
, i.e. after the third separator we are expecting at least 2 numbers, but that in string2
and string4
doesn't happen, so we put the following
string1 = "666-66-66-66"string2 = "666-666-666"string3 = "666 66 66 66"string4 = "666 666 666"string5 = "666666666"print(f"string1: {string1} -->", re.findall("d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?d*", string1))print(f"string2: {string2} -->", re.findall("d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?d*", string2))print(f"string3: {string3} -->", re.findall("d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?d*", string3))print(f"string4: {string4} -->", re.findall("d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?d*", string4))print(f"string5: {string5} -->", re.findall("d{2,}[-s]?d{2,}[-s]?d{2,}[-s]?d*", string5))
string1: 666-66-66-66 --> ['666-66-66-66']string2: 666-666-666 --> ['666-666-666']string3: 666 66 66 66 --> ['666 66 66 66']string4: 666 666 666 --> ['666 666 666']string5: 666666666 --> ['666666666']
The delimiter ?
as a quick delimiter
The above example can be filtered by d+?[- ]
.
string1 = "666-66-66-66"string2 = "666-666-666"string3 = "666 66 66 66"string4 = "666 666 666"string5 = "666666666"print(f"string1: {string1} -->", re.findall("d+?[- ]", string1))print(f"string2: {string2} -->", re.findall("d+?[- ]", string2))print(f"string3: {string3} -->", re.findall("d+?[- ]", string3))print(f"string4: {string4} -->", re.findall("d+?[- ]", string4))print(f"string5: {string5} -->", re.findall("d+?[- ]", string5))
string1: 666-66-66-66 --> ['666-', '66-', '66-']string2: 666-666-666 --> ['666-', '666-']string3: 666 66 66 66 --> ['666 ', '66 ', '66 ']string4: 666 666 666 --> ['666 ', '666 ']string5: 666666666 --> []
If the ?
delimiter were not present, we would have \d+[- ]
, which means a sequence of one or more numbers followed by a space or a hyphen. But what the ?
delimiter does is to make this search faster.
The denier
Before we have seen that with d
we found digits, so with D
we find everything that are not digits.
string1 = "E3s4t6e e1s2t3r5i6n7g8 t9i0e4n2e1 d4i5g7i9t0o5s2"print(re.findall("D", string1))
['E', 's', 't', 'e', ' ', 'e', 's', 't', 'r', 'i', 'n', 'g', ' ', 't', 'i', 'e', 'n', 'e', ' ', 'd', 'i', 'g', 'i', 't', 'o', 's']
The same happens with letters, if we write W
it will find everything that is not letters.
string1 = "Letras ab27_ no letras ,.:;´ç"print(re.findall("W", string1))
[' ', ' ', ' ', ' ', ',', '.', ':', ';', '´']
If we put S
we will find everything other than spaces.
print(re.findall("S", string1))
['L', 'e', 't', 'r', 'a', 's', 'a', 'b', '2', '7', '_', 'n', 'o', 'l', 'e', 't', 'r', 'a', 's', ',', '.', ':', ';', '´', 'ç']
But in case we have a class or something else, we can deny by ^
string1 = "1234567890"print(re.findall("[^5-9]", string1))
['1', '2', '3', '4', '0']
Going back to the example of the phone numbers from before, we can filter them by the following
string1 = "666-66-66-66"string2 = "666-666-666"string3 = "666 66 66 66"string4 = "666 666 666"string5 = "666666666"print(f"string1: {string1} -->", re.findall("d{2,}D?d{2,}D?d{2,}D?d*", string1))print(f"string2: {string2} -->", re.findall("d{2,}D?d{2,}D?d{2,}D?d*", string2))print(f"string3: {string3} -->", re.findall("d{2,}D?d{2,}D?d{2,}D?d*", string3))print(f"string4: {string4} -->", re.findall("d{2,}D?d{2,}D?d{2,}D?d*", string4))print(f"string5: {string5} -->", re.findall("d{2,}D?d{2,}D?d{2,}D?d*", string5))string5 = "666 666 666"
string1: 666-66-66-66 --> ['666-66-66-66']string2: 666-666-666 --> ['666-666-666']string3: 666 66 66 66 --> ['666 66 66 66']string4: 666 666 666 --> ['666 666 666']string5: 666666666 --> ['666666666']
What we are doing is asking for sequences of at least two digits followed by one or no non-digits.
The beginning ^
and end of line $
.
With ^
we can search for the beginning of a line, for example, if we want to find a digit only at the beginning of a line
string1 = "linea 1"string2 = "2ª linea"print(re.findall("^d", string1))print(re.findall("^d", string2))
[]['2']
As you can see there is only one digit at the beginning of the line in string2
.
Likewise, the end of a line can be found with $
. If we want to find a digit only at the end of a line
string1 = "linea 1"string2 = "2ª linea"print(re.findall("d$", string1))print(re.findall("d$", string2))
['1'][]
This only occurs in string1
.
Practical examples
Logs
If in the following log we want to find only the WARN
s
log = """[LOG ENTRY] [ERROR] The system is unstable[LOG ENTRY] [WARN] The system may be down[LOG ENTRY] [WARN] Microsoft just bought Github[LOG DATA] [LOG] Everything is OK[LOG ENTRY] [LOG] [user:@beco] Logged in[LOG ENTRY] [LOG] [user:@beco] Clicked here[LOG DATA] [LOG] [user:@celismx] Did something[LOG ENTRY] [LOG] [user:@beco] Rated the app[LOG ENTRY] [LOG] [user:@beco] Logged out[LOG LINE] [LOG] [user:@celismx] Logged in"""result = re.findall("[LOG.*[WARN].*", log)result
['[LOG ENTRY] [WARN] The system may be down','[LOG ENTRY] [WARN] Microsoft just bought Github']
Phone number
Within a number we can find letters such as e
for extension, #
also for extension, or p
to pause if a computer calls. We can also find the +
to indicate a country prefix and separators such as spaces, -
, .
, .
, .
, .
, .
, .
, .
, .
.
tel = """55565856-58-1156.58.1156.78-9865 09 8776y87r9845y78-5678.87 6578 54-56+52156581158-11-11#24655256048p12355256048e123"""result = re.findall("+?d{2,3}[^da-zA-Z\n]?d{2,3}[^da-zA-Z\n]?d{2,3}[#pe]?d*", tel)result
['555658','56-58-11','56.58.11','56.78-98','65 09 87','78.87 65','78 54-56','+521565811','58-11-11#246','55256048p123','55256048e123']
Here is an explanation
+?
: Beginning with the character+
and containing either zero or one- ``d{2,3}`: To be followed by 2 to 3 digits
- Next there can be zero or a character that is neither a digit, nor a letter from
a
toz
, nor a letter fromA
toZ
, nor a line break. - ``d{2,3}`: To be followed by 2 to 3 digits
- Next there can be zero or a character that is neither a digit, nor a letter from
a
toz
, nor a letter fromA
toZ
, nor a line break. - ``d{2,3}`: To be followed by 2 to 3 digits
[#pe]?
: Then there can be zero or one character either#
, orp
, ore
.- Lastly, let there be zero or all numbers.
URLs
urls = """url: https://www.instagram.com/p/BXB4zsUlW5Z/?taken-by=beco.mxurl: http://instagram.com/p/blablablahurl: http://itam.mx/testhttp://instagram.com/p/blablablahhttps://www.vanguarsoft.com.vehttp://platzi.comhttps://traetelo.nethttps://traetelo.net/images archivo.jspurl: https://subdominio.traetelo.neturl: https://www.instagram.com/p/BXB4zsUlW5Z/?taken-by=beco.mxurl: http://instagram.com/p/blablablahurl: http://itam.mx/testhttp://instagram.com/p/blablablahhttps://www.google.com.co/https://sub.dominio.de.alguien.com/archivo.htmlhttps://en.wikipedia.org/wiki/.orghttps://cdn-microsoft.org/image/seixo2t9sjl_22.jpghttps://hola.pizzahttps://platzi.com/clases/1301-expresiones-regulares/11860-urls9102/ clasehttps://api.giphy.com/v1/gifs/search?q=Rick and Morty&limit=10&api_key=DG3hItPp5HIRNC0nit3AOR7eQZAehttp://localhost:3000/something?color1=red&color2=bluehttp://localhost:3000/display/post?size=smallhttp://localhost:3000/?name=satyamhttp://localhost:3000/scanned?orderid=234http://localhost:3000/getUsers?userId=12354411&name=Billyhttp://localhost:3000/getUsers?userId=12354411http://localhost:3000/search?city=Barcelonawww.sitiodeejemplo.net/pagina.php?nombredevalor1=valor1&nombredevalor2=valor2"""result = re.findall("https?://[w-.]+.w{2,6}/?S*", urls)result
['https://www.instagram.com/p/BXB4zsUlW5Z/?taken-by=beco.mx','http://instagram.com/p/blablablah','http://itam.mx/test','http://instagram.com/p/blablablah','https://www.vanguarsoft.com.ve','http://platzi.com','https://traetelo.net','https://traetelo.net/images','https://subdominio.traetelo.net','https://www.instagram.com/p/BXB4zsUlW5Z/?taken-by=beco.mx','http://instagram.com/p/blablablah','http://itam.mx/test','http://instagram.com/p/blablablah','https://www.google.com.co/','https://sub.dominio.de.alguien.com/archivo.html','https://en.wikipedia.org/wiki/.org','https://cdn-microsoft.org/image/seixo2t9sjl_22.jpg','https://hola.pizza','https://platzi.com/clases/1301-expresiones-regulares/11860-urls9102/','https://api.giphy.com/v1/gifs/search?q=Rick']
Here is an explanation
http
: We want it to start withhttp
.s?
: There may or may not be ans
in the following.:
://: Followed by
://`.- `[*]: Followed by one or more letters, gions or periods
- Next, a point.
- ``w{2,6}`: Between 2 and 6 letters for the tld
/?
: Followed by zero or a/
.- None or everything that is not a space.
Mails
mails = """esto.es_un.mail@mail.comesto.es_un.mail+complejo@mail.comdominio.comrodrigo.jimenez@yahoo.com.mxruben@starbucks.comesto_no$es_email@dominio.comno_se_de_internet3@hotmail.com"""result = re.findall("[w._]{5,30}+?[w._]{0,10}@[w.-]{2,}.w{2,6}", mails)result
['esto.es_un.mail@mail.com','esto.es_un.mail+complejo@mail.com','rodrigo.jimenez@yahoo.com.mx','ruben@starbucks.com','es_email@dominio.com','no_se_de_internet3@hotmail.com']
Here is an explanation
- {5,30}`: We want it to start with between 5 and 30 (which is the minimum and maximum that gmail supports) letters, dots or underscores.
+?
: Followed by zero or a+
.- {0,10}`: Then between 0 and 10 letters, dots or underscores.
@
: The@
: The@
: The@
: The@
- {[{2,}`: Between 2 and infinite letters, dots and dashes (domain)
.
: Followed by a `.- ``w{2,6}`: And finally between 2 and 6 letters for the tld
Locations
There are two possible ways to give locations, so we analyze both of them
loc = """-99.205646,19.429707,2275.10-99.205581, 19.429652,2275.10-99.204654,19.428952,2275.58"""result = re.findall("-?d{1,3}.d{1,6},s?-?d{1,3}.d{1,6},.*", loc)result
['-99.205646,19.429707,2275.10','-99.205581, 19.429652,2275.10','-99.204654,19.428952,2275.58']
Here is an explanation
- We want it to start with zero or a minus sign.
- Followed by between one and three numbers
- Next, a point.
- ``d{1,6}`: After one to six numbers
,
: Then a,
: Then a,
: Then a,
: Then a,
: Then a,
- ``s?`: After zero or a space
- ``-?`: Zero or a minus sign
d{1,3}`: Then between one and three numbers
- Next, a point.
- Followed by between one and six numbers.
,
: Then a comma.*
: Lastly none or all types of characters
loc = """-99 12' 34.08"W, 19 34' 56.98"N-34 54' 32.00"E, -3 21' 67.00"S"""result = re.findall("-?d{1,3}sd{1,2}'sd{1,2}.d{2,2}\"[WE],s?-?d{1,3}sd{1,2}'sd{1,2}.d{2,2}\"[SN]", loc)result
['-99 12' 34.08"W, 19 34' 56.98"N', '-34 54' 32.00"E, -3 21' 67.00"S']
print(result[0])print(result[1])
-99 12' 34.08"W, 19 34' 56.98"N-34 54' 32.00"E, -3 21' 67.00"S
Here is an explanation
- We want it to start with zero or a minus sign.
- Followed by between one and three numbers
s
: Then a space- ``d{1,2}`: Segment of one to two numbers
'
: Then a'
: Then a'
.- Followed by a space.
``d{1,2}
: Then between one and two numbers- After a period
- ``d{2,2}`: Followed by two numbers
"
: Then a"
: Then a"
: Then a"
: Then a"
: Then a"
.[WE]
: Then the letterW
or the letterE
.,
: After a comma- Followed by a zero or a space
- ``-?`: After zero or a minus sign
d{1,3}`: Then between one and three numbers
- Followed by a space.
``d{1,2}
: Then between one and two numbers'
: Then a'
: After a'
s
: Then a space- ``d{1,2}`: Next between one and two numbers
- Followed by a period
- ``d{2,2}`: After two numbers
"
: Followed by
": Followed by
"`.[SN]
: And finally the letterS
or the letterN
.
Names
nombres = """Camilo Sarmiento GálvezAlejandro Pliego AbastoMilagros Reyes JapónSamuel París ArrabalJuan Pablo TafallaAxel Gálvez VelázquezÓscar Montreal AparicioJacobo Pozo TassisGuillermo Ordóñez EspigaEduardo Pousa CurbeloIvanna Bienvenida KevinAda Tasis LópezLuciana Sáenz GarcíaFlorencia Sainz MárquzCatarina Cazalla LombardaPaloma Gallo PerroMargarita Quesada FlorezVicente Fox QuesadaIris GracianiAsunción CarballarConstanza MuñozManuel Andres García Márquez"""result = re.findall("[A-ZÁÉÍÓÚ][a-záéíóú]+s[A-ZÁÉÍÓÚ][a-záéíóú]+s[A-ZÁÉÍÓÚ][a-záéíóú]+", nombres)result
['Camilo Sarmiento Gálvez','Alejandro Pliego Abasto','Milagros Reyes Japón','Samuel París Arrabal','Juan Pablo Tafalla','Axel Gálvez Velázquez','Óscar Montreal Aparicio','Jacobo Pozo Tassis','Espiga Eduardo Pousa','Curbelo Ivanna Bienvenida','Kevin Ada Tasis','López Luciana Sáenz','García Florencia Sainz','Márquz Catarina Cazalla','Lombarda Paloma Gallo','Perro Margarita Quesada','Florez Vicente Fox','Quesada Iris Graciani','Asunción Carballar Constanza','Manuel Andres García']
Here is an explanation
[A-ZÁÉÍÓÚ]
: We want it to start with a capital letter, including accents.[a-záééíóú]+
: Followed by one or more lowercase letters, enclosed by spaces- Followed by a space.
[A-ZÁÉÍÓÓÚ]
: followed by an uppercase letter, including accents[a-záééíóú]+
: Followed by one or more lowercase letters, enclosed by spaces- Followed by a space.
[A-ZÁÉÍÓÓÚ]
: followed by an uppercase letter, including accents[a-záééíóú]+
: Followed by one or more lowercase letters, enclosed by spaces
Search and replace
We are going to download a file with a lot of historical films.
# download file from urlimport urllib.requesturl = "https://static.platzi.com/media/tmp/class-files/github/moviedemo/moviedemo-master/movies.dat"urllib.request.urlretrieve(url, "movies.dat")
---------------------------------------------------------------------------HTTPError Traceback (most recent call last)Cell In[43], line 42 import urllib.request3 url = "https://static.platzi.com/media/tmp/class-files/github/moviedemo/moviedemo-master/movies.dat"----> 4 urllib.request.urlretrieve(url, "movies.dat")File ~/miniconda3/envs/mybase/lib/python3.11/urllib/request.py:241, in urlretrieve(url, filename, reporthook, data)224 """225 Retrieve a URL into a temporary location on disk.226(...)237 data file as well as the resulting HTTPMessage object.238 """239 url_type, path = _splittype(url)--> 241 with contextlib.closing(urlopen(url, data)) as fp:242 headers = fp.info()244 # Just return the local path and the "headers" for file://245 # URLs. No sense in performing a copy unless requested.File ~/miniconda3/envs/mybase/lib/python3.11/urllib/request.py:216, in urlopen(url, data, timeout, cafile, capath, cadefault, context)214 else:215 opener = _opener--> 216 return opener.open(url, data, timeout)File ~/miniconda3/envs/mybase/lib/python3.11/urllib/request.py:525, in OpenerDirector.open(self, fullurl, data, timeout)523 for processor in self.process_response.get(protocol, []):524 meth = getattr(processor, meth_name)--> 525 response = meth(req, response)527 return responseFile ~/miniconda3/envs/mybase/lib/python3.11/urllib/request.py:634, in HTTPErrorProcessor.http_response(self, request, response)631 # According to RFC 2616, "2xx" code indicates that the client's632 # request was successfully received, understood, and accepted.633 if not (200 <= code < 300):--> 634 response = self.parent.error(635 'http', request, response, code, msg, hdrs)637 return responseFile ~/miniconda3/envs/mybase/lib/python3.11/urllib/request.py:563, in OpenerDirector.error(self, proto, *args)561 if http_err:562 args = (dict, 'default', 'http_error_default') + orig_args--> 563 return self._call_chain(*args)File ~/miniconda3/envs/mybase/lib/python3.11/urllib/request.py:496, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args)494 for handler in handlers:495 func = getattr(handler, meth_name)--> 496 result = func(*args)497 if result is not None:498 return resultFile ~/miniconda3/envs/mybase/lib/python3.11/urllib/request.py:643, in HTTPDefaultErrorHandler.http_error_default(self, req, fp, code, msg, hdrs)642 def http_error_default(self, req, fp, code, msg, hdrs):--> 643 raise HTTPError(req.full_url, code, msg, hdrs, fp)HTTPError: HTTP Error 403: Forbidden
Let's print the first 10 lines to analyze it.
file = open("movies.dat", "r")for i, line in enumerate(file):print(line, end="")if i == 10:breakfile.close()
1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy2::Jumanji (1995)::Adventure|Children|Fantasy3::Grumpier Old Men (1995)::Comedy|Romance4::Waiting to Exhale (1995)::Comedy|Drama|Romance5::Father of the Bride Part II (1995)::Comedy6::Heat (1995)::Action|Crime|Thriller7::Sabrina (1995)::Comedy|Romance8::Tom and Huck (1995)::Adventure|Children9::Sudden Death (1995)::Action10::GoldenEye (1995)::Action|Adventure|Thriller11::American President, The (1995)::Comedy|Drama|Romance
As you can see, we have an ID, followed by ::
, then the name of the movie, in parentheses the year, followed by ::
and then genres separated by |
.
We can make a cleaning of the file very easy by means of regular expressions, the compile
and match
functions and the use of groupings with parenthesis. When making groupings, we select which areas of the text we want to keep and then work with them as we want, let's see it with an example
pattern = re.compile(r"^d+::([ws:,().-'&¡!/¿?ÁÉÍÓÚáéíóú+*$#°'"[]@·]+)s((d{4,4}))::(.*)$")file = open("movies.dat", "r")file_filtered = open("movies.csv", "w")file_filtered.write("title,year,genders ")sep = ";;"for line in file:result = re.match(pattern, line)if result:file_filtered.write(f"{result.group(1)}{sep}{result.group(2)}{sep}{result.group(3)} ")else:print(line, end="")file.close()file_filtered.close()
Let's see what we have done, first we have defined a pattern with the following:
^
: We want it to start with the beginning of the line.- Next one or more numbers
::
: Followed by::
- `([(([([([([([([([("("("("("("": This is the first grouping, we look for any word, space or character in the square brackets that appears one or more times.
- ``s`: Next a space
: The tightening of a parenthesis
- (4,4})`: Here is the second grouping, we are looking for four numbers.
- After the closing of a parenthesis
::
: Next::
(.*)
: The third grouping, any character that appears none or all times.$
: Lastly the end of the line
Inside the for
we analyze line by line if the pattern we have defined is found, and if it is found we write the three patterns in the csv
separated by sep
, which in our case we have defined as ;;
. This separator has been defined, because there are movie titles that have ,
s.
We read the csv
with Pandas
.
import pandas as pd
df = pd.read_csv("movies.csv", sep=";;", engine="python")
df.head()
Cheatsheet
Here you have a cheatsheet with a lot of patterns