How to extract specific tags from multiple HTML text files using Python

0 votes
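from bs4 import BeautifulSoup

addresses = []
with open("/rawhtml/greerwilsonchapel.com_executives_contact_us.txt") as fp:
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(fp, "html.parser")
    #thumb = soup.find('div',class_="widget widget_text")
    # Text of the div that holds the address, split into one entry per line
    address = soup.find('div',class_="locator-titles").get_text().rstrip('\n').split('\n')
    #address = add.find_All('p').get_text()
    addresses.append(address)
    print(addresses)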


Aug 4, 2020 in Python by pooja
• 120 points
4,563 views

1 answer to this question.

0 votes

Hello, @Pooja,

I ran into the same issue, and the code below helped me. I hope it helps you too.

from urllib.request import urlopen  # in Python 3, urlopen lives in urllib.request
from bs4 import BeautifulSoup

url = "http:Abc.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
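
Since the question is about many saved pages rather than a single URL, the same cleanup could be wrapped in a function and applied to each local file. This is only a sketch: the names clean_text and clean_files, and the idea that the saved pages are .txt files sitting together in one folder, are assumptions made for illustration and are not part of the original answer.

```python
from pathlib import Path
from bs4 import BeautifulSoup


def clean_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements so their contents do not end up in the text
    for tag in soup(["script", "style"]):
        tag.extract()
    lines = (line.strip() for line in soup.get_text().splitlines())
    # Split on double spaces, as in the code above, then drop empty chunks
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


def clean_files(folder, pattern="*.txt"):
    """Map each matching file name in folder to its cleaned text."""
    results = {}
    for path in sorted(Path(folder).glob(pattern)):
        # errors="replace" is an assumption for pages saved with mixed encodings
        html = path.read_text(encoding="utf-8", errors="replace")
        results[path.name] = clean_text(html)
    return results
```

Calling clean_files("/rawhtml") would then give one cleaned text per saved page, which can be searched for the address afterwards.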

answered Aug 5, 2020 by Kedaar Thomas
But how can I fetch only the address from many different files? Every file contains an address, but in different tags, class names, ids, and so on.

Hello @pooja,

You can extract it using CSS selectors.

When we use CSS selectors, we do not need to know in advance what the content we want looks like (as we would with regular expressions, where we specify the pattern of the data). Since HTML documents are structured as a network of nodes, CSS selectors make use of that structure to navigate through the nodes and select the data we want. We just need to know which nodes in an HTML file contain what we want to extract.

You can refer to this to see how it works.

Hope this is helpful to you!
Thank you!
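
As a rough illustration of the CSS-selector approach described above, one option is to try a short list of candidate selectors on each page and keep the first match. The selector list and the function name find_address are hypothetical; only div.locator-titles comes from the question, and real pages will likely need more entries.

```python
from bs4 import BeautifulSoup

# Hypothetical selectors; extend this list as new page layouts turn up
CANDIDATE_SELECTORS = ["div.locator-titles", "address", '[class*="address"]']


def find_address(html, selectors=CANDIDATE_SELECTORS):
    """Return the text of the first element matching any selector, else None."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in selectors:
        node = soup.select_one(selector)
        if node is not None:
            # Collapse the element's text onto a single line
            return " ".join(node.get_text().split())
    return None
```

This does not remove the need to know something about each layout, but a handful of general selectors (the address tag, class names containing "address") may cover many pages before any per-site rules are needed.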

Yes, I am doing the same thing, but every file uses different CSS selector names. In that case, how can I fetch each company's address by giving CSS selector names?

Hello @pooja,

You can refer to this for your related queries.

Hope this helps you!

I went through this, but it did not give me a solution.

I want something that can fetch only the contact-us details from many different files. Are there any NLP libraries that provide this?

Hello @pooja,

Have you installed Beautiful Soup? I'd recommend learning it.
It is a Python library that lets you extract tags and the text inside them.

There is also the requests_html library, which some people find better than Beautiful Soup.
There is also urllib3, which is designed for making HTTP requests.
I'd recommend reading about them and choosing what suits you best.

If you want to go with Beautiful Soup, the documentation is here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Hello @Niroj,

I read about BeautifulSoup, and it is helpful for extracting tags from HTML. But we need to extract addresses from many different files, and every file uses different class and id names for its address. It is not possible to hardcode thousands of class names.

Hello @pooja,

Try Selenium's XPath support. (An XPath expression is a useful way to locate an element; we use XPath when the element has no proper id or name attribute to access it by.)
If that doesn't solve your problem, I guess you'll have to do it manually.
Or you can write a script to get the raw data and find a pattern in the addresses to extract them.
E.g., if they're Gmail addresses, extract them using something.endswith("gmail.com").
Or use regular expressions.
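
A minimal sketch of the pattern-based idea, run on text already extracted from a page. It assumes US-style addresses that end in a two-letter state code and a ZIP code; that pattern and the name address_lines are assumptions for illustration, and other address formats would need their own patterns.

```python
import re

# Assumed pattern: two-letter state code followed by a 5-digit ZIP (optional +4)
STATE_ZIP = re.compile(r"\b[A-Z]{2}\s+\d{5}(?:-\d{4})?\b")


def address_lines(text):
    """Return the lines of text containing something shaped like 'IL 62704'."""
    return [line.strip() for line in text.splitlines() if STATE_ZIP.search(line)]
```

A pattern like this can produce false positives (any two capitals followed by five digits) and will miss addresses without a ZIP code, so the matches are best treated as candidates to review.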

If you want to learn about Selenium's XPath in Python, you can refer to this.

Thanks @Niroj,

I am using regular expressions only, because I think this is the only way to get results. If you have any other solutions, please tell me.

Hi, @pooja,

I would suggest going through this: https://bradleyboehmke.github.io/2015/12/scraping-html-text.html
