Extracting text based on a regex pattern with cheerio nodejs

0 votes

I'm trying to build a scraper in Node.js that extracts news headlines from a large number of websites. The sites are all different, so my approach has to be as general as possible. I already have working Python code that uses Beautiful Soup and a regex: I declare a set of keywords and it returns the headlines that contain them. The relevant line of the Python code is below:

for items in soup(text=re.compile(r'\b(?:%s)\b' % '|'.join(keywords))):

To illustrate the expected output, let's assume there is a domain with news articles (below is an HTML snippet containing a headline):

<a class="gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor" href="/news/uk-52773032"><h3 class="gs-c-promo-heading__title gel-pica-bold nw-o-link-split__text">Time to end Clap for Carers, says founder</h3></a>

Given the keyword Time, the expected output would be a string with the headline: Time to end Clap for Carers, says founder

I'm wondering if it's possible to accomplish something similar using cheerio. What would be the best strategy in Node.js to get similar results?
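As a starting point, the Python pattern can be rebuilt in plain JavaScript before any cheerio-specific code is involved. The following is only a minimal sketch: `escapeRegExp`, `buildKeywordRegex`, and the second sample string are hypothetical names and data introduced for illustration, and the matching is case-sensitive, as in the Python line above.

```javascript
// Escape regex metacharacters so characters like "." or "+" in a keyword
// are matched literally instead of being treated as regex syntax.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build the equivalent of Python's r'\b(?:%s)\b' % '|'.join(keywords).
function buildKeywordRegex(keywords) {
  return new RegExp('\\b(?:' + keywords.map(escapeRegExp).join('|') + ')\\b');
}

const keywordRegex = buildKeywordRegex(['Time', 'lockdown']);

// Sample strings: the first is the headline from the HTML snippet above,
// the second is made up to show a non-match.
const texts = [
  'Time to end Clap for Carers, says founder',
  'Sometimes it rains',
];

// Keep only the strings that contain one of the keywords as a whole word.
const headlines = texts.filter((text) => keywordRegex.test(text));
console.log(headlines);
```

The same `keywordRegex.test(...)` check could then be applied to the text of each element selected with cheerio.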

EDIT: This is now working for me. In addition to the matching headlines, I also wanted to get the URLs of the posts.

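function match_headlines($) {

      const keywords = ['lockdown', 'quarantine'];

      // Text that starts with a capital letter and contains any keyword.
      // No "g" flag is needed because the pattern is only used as a test.
      const regexPattern = new RegExp('\\b[A-Z].*?(' + keywords.join('|') + ').*\\b');

      const matches = $('a').map((i, a) => {

          const link = $(a).attr('href');
          const text = $(a).text();

          // Returning null makes cheerio's .map() skip this element.
          if (!regexPattern.test(text)) {
              return null;
          }

          return {
              headline: text,
              post_url: link
          };

      });

      // .get() converts the cheerio collection into a plain array.
      return matches.get();

}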
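One detail to keep in mind when reading `input` from a match result: `String.prototype.match()` only returns an object carrying `input` and `index` when the regex has no `g` flag. With `g`, it returns a plain array of the matched substrings. A small sketch, using a made-up headline and a simplified pattern:

```javascript
// Hypothetical headline used only for this demonstration.
const sampleText = 'UK lockdown to ease next week';

// Without "g": the result carries the original string in .input.
const single = sampleText.match(/lockdown|quarantine/);
console.log(single.input);

// With "g": a plain array of matched substrings, so .input is undefined.
const all = sampleText.match(/lockdown|quarantine/g);
console.log(all.input);
console.log(all);
```

If the full link text is what is wanted as the headline, reading it directly from the element's text avoids depending on the shape of the match result.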
Jun 21, 2022 in Node-js by Vaani
• 7,070 points
1,414 views

