Extracting text based on a regex pattern with cheerio in Node.js


I'm trying to build a scraper in Node.js that extracts news headlines from a large number of websites. The sites are all different, so my approach has to be as general as possible. I already have working Python code that uses Beautiful Soup and a regex: I declare a set of keywords and it returns the headlines that contain them. A relevant sample of the Python code is below:

for items in soup(text=re.compile(r'\b(?:%s)\b' % '|'.join(keywords))):
    print(items)  # each item is a text node that contains one of the keywords

To illustrate the expected output, let's assume there is a domain with news articles. Below is an HTML snippet containing a headline:

<a class="gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor" href="/news/uk-52773032"><h3 class="gs-c-promo-heading__title gel-pica-bold nw-o-link-split__text">Time to end Clap for Carers, says founder</h3></a>

Given the keyword "Time", the expected output would be a string containing the headline: Time to end Clap for Carers

Is it possible to do something similar with cheerio? What would be the best approach in Node.js to get similar results?
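
For reference, the keyword-pattern part of the Python line can be sketched in plain JavaScript without any library. This is only a minimal sketch under stated assumptions: the helper names escapeRegExp, buildKeywordRegex and filterHeadlines are hypothetical, and escaping the keywords is an addition that the Python snippet does not perform. With cheerio, the input strings would presumably come from calling .text() on the selected elements.

```javascript
// Escape regex metacharacters so each keyword is matched literally.
// (Assumption: keywords are plain words, not regex fragments.)
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// JavaScript counterpart of re.compile(r'\b(?:%s)\b' % '|'.join(keywords)).
function buildKeywordRegex(keywords) {
  return new RegExp('\\b(?:' + keywords.map(escapeRegExp).join('|') + ')\\b');
}

// Keep only the texts that contain at least one keyword as a whole word.
function filterHeadlines(texts, keywords) {
  const pattern = buildKeywordRegex(keywords);
  return texts.filter((text) => pattern.test(text));
}

const sample = [
  'Time to end Clap for Carers, says founder',
  'Timetable published for next term',
];

// "Timetable" is skipped because \b requires the keyword to end at a word boundary.
console.log(filterHeadlines(sample, ['Time']));
```

The pattern is built without the "g" flag, so repeated calls to test() do not carry state between strings.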

EDIT: This is now working for me. In addition to the matching headlines, I also wanted to get the URLs of the posts.

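function match_headlines($) {

      const keywords = ['lockdown', 'quarantine'];

      // Match text that starts with a capital letter and contains any keyword.
      // No "g" flag: with "g", match() returns a plain array without 'input'.
      const regexPattern = new RegExp('\\b[A-Z].*?' + "(" + keywords.join('|') + ")" +
                 '.*\\b');

      // cheerio's .map() ignores null/undefined return values, and .get()
      // converts the resulting cheerio object into a plain array.
      return $('a').map((i, a) => {

          const link = $(a).attr('href');
          const match = $(a).text().match(regexPattern);

          if (match === null) {
             return null;
          }

          return {
              headline: match['input'],
              post_url: link
          };

     }).get();

}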
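
A note on the match['input'] lookup used above: String.prototype.match only attaches the input property when the pattern has no "g" flag. The short sketch below illustrates the difference; the sample headline is taken from the HTML snippet earlier in the question, and the pattern is a simplified stand-in for the keyword regex.

```javascript
const headline = 'Time to end Clap for Carers, says founder';

// Without "g": a single match result that carries the searched string as `input`.
const single = headline.match(/\bTime\b/);
console.log(single.input);

// With "g": a plain array of matched substrings, so `input` is undefined.
const all = headline.match(/\bTime\b/g);
console.log(all.input);
```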
Jun 21 in Node-js by Vaani
• 7,020 points
10 views
