How to download in-text images with Beautiful Soup

0 votes

I'm trying to use Beautiful Soup and requests to write a website scraper in Python. I can easily collect all of the text I want, but some of that text has inline images that are important. I want to replace each image with its title and add the result to a string I can parse later, but I'm not sure how to do this.

This is an example of the kind of HTML I'm trying to parse:

    <td colspan="3"><b>"Assemble under Siegfried!"</b> 
        <a href="/wiki/index.php/File:Continuous.png" class="image" title="CONT"><img alt="CONT" src="/wiki/images/thumb/7/78/Continuous.png/14px-Continuous.png" width="14" height="17" srcset="/wiki/images/thumb/7/78/Continuous.png/21px-Continuous.png 1.5x, /wiki/images/7/78/Continuous.png 2x">
        </a> This unit gains +10 attack for each 
        <a href="/wiki/index.php/File:Black.png" class="image" title="Black"><img alt="Black" src="/wiki/images/thumb/7/71/Black.png/15px-Black.png" width="15" height="15" srcset="/wiki/images/thumb/7/71/Black.png/23px-Black.png 1.5x, /wiki/images/thumb/7/71/Black.png/30px-Black.png 2x">
        </a> and 
        <a href="/wiki/index.php/File:White.png" class="image" title="White"><img alt="White" src="/wiki/images/thumb/8/80/White.png/15px-White.png" width="15" height="15" srcset="/wiki/images/thumb/8/80/White.png/23px-White.png 1.5x, /wiki/images/thumb/8/80/White.png/30px-White.png 2x">
        </a> ally besides this unit.
    </td>

From this HTML I want to pull:

"Assemble under Siegfried! CONT This unit gains +10 attack for each Black and White ally besides this unit."

Using the normal get_text() method does not include the titles of the images, which is the problem.

Sep 20, 2018 in Python by bug_seeker

1 answer to this question.

0 votes

Ohh... I got what you need.

Try this:

html_data = """ <td colspan="3"><b>"Assemble under Siegfried!"</b> 
    <a href="/wiki/index.php/File:Continuous.png" class="image" title="CONT"><img alt="CONT" src="/wiki/images/thumb/7/78/Continuous.png/14px-Continuous.png" width="14" height="17" srcset="/wiki/images/thumb/7/78/Continuous.png/21px-Continuous.png 1.5x, /wiki/images/7/78/Continuous.png 2x">
    </a> This unit gains +10 attack for each 
    <a href="/wiki/index.php/File:Black.png" class="image" title="Black"><img alt="Black" src="/wiki/images/thumb/7/71/Black.png/15px-Black.png" width="15" height="15" srcset="/wiki/images/thumb/7/71/Black.png/23px-Black.png 1.5x, /wiki/images/thumb/7/71/Black.png/30px-Black.png 2x">
    </a> and 
    <a href="/wiki/index.php/File:White.png" class="image" title="White"><img alt="White" src="/wiki/images/thumb/8/80/White.png/15px-White.png" width="15" height="15" srcset="/wiki/images/thumb/8/80/White.png/23px-White.png 1.5x, /wiki/images/thumb/8/80/White.png/30px-White.png 2x">
    </a> ally besides this unit.
</td>"""
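from bs4 import BeautifulSoup

html = BeautifulSoup(html_data, "html.parser")

# Start with the bold heading, then alternate between each link's title
# and the plain text that follows the link.
texts = [html.find("b").get_text()]
for a in html.find_all("a"):
    texts.append(a.attrs.get("title"))
    # next_element walks the parse tree: <img>, then the whitespace inside
    # <a>, then the text node that follows </a>.
    texts.append(a.next_element.next_element.next_element.strip())
print(" ".join(texts))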
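
If the next_element chain feels fragile, another option is to do literally what the question describes: swap each image link for its title, then call get_text(). This is only a sketch. The function name text_with_image_titles is made up for illustration, the markup is a trimmed copy of the question's HTML, and it assumes the image links always carry class="image".

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (width/height/srcset removed).
html_data = """<td colspan="3"><b>"Assemble under Siegfried!"</b>
    <a href="/wiki/index.php/File:Continuous.png" class="image" title="CONT"><img alt="CONT" src="/wiki/images/thumb/7/78/Continuous.png/14px-Continuous.png">
    </a> This unit gains +10 attack for each
    <a href="/wiki/index.php/File:Black.png" class="image" title="Black"><img alt="Black" src="/wiki/images/thumb/7/71/Black.png/15px-Black.png">
    </a> and
    <a href="/wiki/index.php/File:White.png" class="image" title="White"><img alt="White" src="/wiki/images/thumb/8/80/White.png/15px-White.png">
    </a> ally besides this unit.
</td>"""


def text_with_image_titles(markup):
    """Return the text of markup with each image link replaced by its title."""
    soup = BeautifulSoup(markup, "html.parser")
    for link in soup.find_all("a", class_="image"):
        # Fall back to the <img> alt text if the link has no title attribute.
        img = link.find("img")
        label = link.get("title") or (img.get("alt", "") if img else "")
        link.replace_with(label)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(soup.get_text().split())


print(text_with_image_titles(html_data))
```

Because the replacement happens in the parsed tree, the text keeps its original order no matter how many images appear or where they sit in the cell.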

I'm not sure this is exactly what you want, but I think what you need is the attrs of the Tag.

Example:

from bs4 import BeautifulSoup

html = BeautifulSoup(html_data, "html.parser")
for a in html.find_all("a"):
    print(a.attrs.get("title"))

Output:

CONT
Black
White
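
One thing to watch for with attrs.get: a link without a title gives you None, which breaks a later " ".join(...). A small sketch of one way around that, using a default value (the helper name link_titles and the two-link sample markup are made up for illustration):

```python
from bs4 import BeautifulSoup


def link_titles(markup, default=""):
    """Return the title of every <a>, using default when the attribute is missing."""
    soup = BeautifulSoup(markup, "html.parser")
    # Tag.get behaves like dict.get on the tag's attributes.
    return [a.get("title", default) for a in soup.find_all("a")]


sample = '<a href="/x" title="CONT"></a><a href="/y"></a>'
print(link_titles(sample))
```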

If you want to download the images:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

cdn_url = "http://some.com/"  # root URL of the site that serves the static content
html = BeautifulSoup(html_data, "html.parser")
for img in html.find_all("img"):
    # src is site-relative, so resolve it against the root URL
    img_response = requests.get(urljoin(cdn_url, img.attrs.get("src")))
    # img_response.content holds the image bytes; save them to a file

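
To actually write the images to disk, something along these lines should work. Treat it as a rough sketch: the helper names image_targets and download_images, the out_dir default, and the 10-second timeout are all assumptions for illustration, and the base URL is still the placeholder from above.

```python
import os
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def image_targets(markup, base_url):
    """Return (absolute_url, filename) pairs for every <img> that has a src."""
    soup = BeautifulSoup(markup, "html.parser")
    targets = []
    for img in soup.find_all("img", src=True):
        url = urljoin(base_url, img["src"])
        # Use the last path segment of the URL as the local file name.
        filename = os.path.basename(urlparse(url).path)
        targets.append((url, filename))
    return targets


def download_images(markup, base_url, out_dir="images"):
    """Download each image into out_dir. This performs real network requests."""
    import requests  # imported here so image_targets works without requests

    os.makedirs(out_dir, exist_ok=True)
    for url, filename in image_targets(markup, base_url):
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # stop on 404s and other HTTP errors
        with open(os.path.join(out_dir, filename), "wb") as fh:
            fh.write(response.content)
```

Call it as download_images(html_data, cdn_url). Note that thumbnails of different images can share a file name, so you may want to add a prefix if that happens on your site.
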
answered Sep 20, 2018 by Priyaj
