Speed up millions of regex replacements in Python 3

0 votes

I'm using Python 3.5.2

I have two lists

  • a list of about 750,000 "sentences" (long strings)
  • a list of about 20,000 "words" that I would like to delete from my 750,000 sentences

So, I have to loop through 750,000 sentences and perform about 20,000 replacements, but ONLY if my words are actually "words" and are not part of a larger string of characters.

I am doing this by pre-compiling my words so that they are flanked by the \b metacharacter

compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]

Then I loop through my "sentences"

import re

for sentence in sentences:
  for word in compiled_words:
    sentence = re.sub(word, "", sentence)
  # put sentence into a growing list

This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

  • Is there a way to using the str.replace method (which I believe is faster), but still requiring that replacements only happen at word boundaries?

  • Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping over re.sub if the length of my word is > than the length of my sentence, but it's not much of an improvement.

Thank you for any suggestions.

Nov 19, 2020 in Python by anonymous
• 8,910 points

1 answer to this question.

0 votes

One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b".

Because re rely on C code to do the actual matching, the savings can be dramatic.

As @pvg pointed out in the comments, it also benefits from single-pass matching.

If your words are not regex, 

answered Nov 19, 2020 by Gitika
• 65,850 points

Related Questions In Python

0 votes
1 answer

Is multi-threading supported in Python and can it speed up execution time as well?

The GIL does not prevent threading. All ...READ MORE

answered Nov 22, 2018 in Python by Nymeria
• 3,540 points
+2 votes
3 answers

what is the practical use of polymorphism in Python?

Polymorphism is the ability to present the ...READ MORE

answered Mar 31, 2018 in Python by anto.trigg4
• 3,440 points
+4 votes
6 answers

Use of "continue" in python

The break statement is used to "break" ...READ MORE

answered Jul 16, 2018 in Python by Omkar
• 69,210 points
0 votes
1 answer

How can I compare the content of two files in Python?

Assuming that your file unique.txt just contains ...READ MORE

answered Apr 16, 2018 in Python by charlie_brown
• 7,710 points
0 votes
2 answers
+1 vote
2 answers

how can i count the items in a list?

Syntax :            list. count(value) Code: colors = ['red', 'green', ...READ MORE

answered Jul 7, 2019 in Python by Neha
• 330 points

edited Jul 8, 2019 by Kalgi 2,830 views
0 votes
1 answer
+1 vote
8 answers

Count the frequency of an item in a python list

To count the number of appearances: from collections ...READ MORE

answered Oct 18, 2018 in Python by tinitales
+7 votes
8 answers

What exactly is the function of random.seed() in python?

The seed method is used to initialize the ...READ MORE

answered Oct 29, 2018 in Python by Rahul
Send OTP
webinar_success Thank you for registering Join Edureka Meetup community for 100+ Free Webinars each month JOIN MEETUP GROUP