Extracting text based on a regex pattern with cheerio nodejs

I'm trying to write a scraper in Node.js that pulls news headlines from many different sources. The sites are all structured differently, so the approach has to be as general as possible. I already have working Python code that uses Beautiful Soup and a regex: I define a set of keywords, and it returns the headlines that contain them. The relevant part of the Python code is shown below:

for items in soup(text=re.compile(r'\b(?:%s)\b' % '|'.join(keywords))):
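
For comparison, the same word-boundary keyword pattern can be sketched in JavaScript. The helper names escapeRegExp and buildKeywordRegex are hypothetical, and escaping the keywords is an extra precaution that the Python line above does not take.

```javascript
// Escape regex metacharacters so that characters such as "." or "+"
// inside a keyword are treated literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Rough equivalent of Python's r'\b(?:%s)\b' % '|'.join(keywords).
function buildKeywordRegex(keywords) {
  return new RegExp('\\b(?:' + keywords.map(escapeRegExp).join('|') + ')\\b');
}

const keywordRegex = buildKeywordRegex(['Time', 'lockdown']);
const headline = 'Time to end Clap for Carers, says founder';

console.log(keywordRegex.test(headline));             // true
console.log(keywordRegex.test('Sometimes it rains')); // false: "times" is not a whole-word, case-sensitive match
```

The pattern is built without the "g" flag, so repeated test() calls do not carry lastIndex state from one string to the next.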

To illustrate the expected output, let's assume there is a domain with news articles (below is an HTML snippet containing a headline):

<a class="gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor" href="/news/uk-52773032"><h3 class="gs-c-promo-heading__title gel-pica-bold nw-o-link-split__text">Time to end Clap for Carers, says founder</h3></a>

Given the keyword Time, the expected output is a string containing the headline: Time to end Clap for Carers, says founder

My question is: is it possible to do something similar with cheerio? What would be the best approach to achieve the same result in Node.js?

EDIT: This works for me now. In addition to matching headlines, I also wanted to extract the post URLs:

function match_headlines($) {

      const keywords = ['lockdown', 'quarantine'];

      // Match text that starts with a capital letter and contains any keyword.
      // No "g" flag: a global regex keeps lastIndex state between test() calls.
      const regexPattern = new RegExp('\\b[A-Z].*?(' + keywords.join('|') + ').*\\b');

      const matches = $('a').map((i, a) => {

          const text = $(a).text();

          if (!regexPattern.test(text)) {
              return null; // cheerio's map() skips null and undefined
          }

          return {
              headline: text,
              post_url: $(a).attr('href')
          };

      });

      // get() converts the cheerio wrapper into a plain array
      return matches.get();

}
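
The href values collected this way can be relative, like /news/uk-52773032 in the snippet earlier. A minimal sketch of resolving them with Node's built-in URL class follows, assuming the matches are available as a plain array of { headline, post_url } objects. The function name absolutize_posts and the base address https://www.example.com are placeholders for illustration only.

```javascript
// Resolve each post_url against the address of the page it was scraped from.
// baseUrl is an assumed placeholder; pass the real page address in practice.
function absolutize_posts(posts, baseUrl) {
  return posts
    // Skip anchors that had no href attribute at all.
    .filter((post) => post.post_url)
    .map((post) => ({
      headline: post.headline.trim(),
      // new URL(relative, base) leaves already-absolute URLs unchanged.
      post_url: new URL(post.post_url, baseUrl).href
    }));
}

const sample = [
  { headline: ' Time to end Clap for Carers, says founder ', post_url: '/news/uk-52773032' }
];

console.log(absolutize_posts(sample, 'https://www.example.com'));
```

Resolving against the page address rather than concatenating strings also handles links that are already absolute.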
Jun 22 in Node-js by Vaani
