Get all the "read more" links from Amazon jobs with Python

0 votes

I'm a beginner in Python, and I want to scrape all the "read more" links from the Amazon jobs page. For example, I want to scrape this page: https://www.amazon.jobs/en/search?base_query=&loc_query=Greater+Seattle+Area%2C+WA%2C+United+States&latitude=&longitude=&loc_group_id=seattle-metro&invalid_location=false&country=&city=&region=&county=

Below is the code I used.

#import the library used to query a website
import urllib2
#import the Beautiful soup functions to parse the data returned from the website
from bs4 import BeautifulSoup

#specify the url
url = "https://www.amazon.jobs/en/search?base_query=&loc_query=Greater+Seattle+Area%2C+WA%2C+United+States&latitude=&longitude=&loc_group_id=seattle-metro&invalid_location=false&country=&city=&region=&county="

#Query the website and return the html to the variable 'page'
page = urllib2.urlopen(url)

#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page, "lxml")
print soup.find_all("a")

Output:

[<a class="icon home" href="/en">Home</a>,
 <a class="icon check-status" data-target="#icims-portal-selector" data-toggle="modal">Review application status</a>,
 <a class="icon working" href="/en/working/working-amazon">Amazon culture &amp; benefits</a>,
 <a class="icon locations" href="/en/locations">Locations</a>,
 <a class="icon teams" href="/en/business_categories">Teams</a>,
 <a class="icon job-categories" href="/en/job_categories">Job categories</a>,
 <a class="icon help" href="/en/faqs">Help</a>,
 <a class="icon language" data-animate="false" data-target="#locale-options" data-toggle="collapse" href="#locale-options" id="current-locale">English</a>,
...
 <a href="/en/privacy/us">Privacy and Data</a>,
 <a href="/en/impressum">Impressum</a>]

I am only getting links to the static elements of the page, i.e. those that are the same for any query, but I need the links to the 4896 jobs. Can anyone tell me what I am doing wrong?

Sep 28, 2018 in AWS by bug_seeker
• 15,520 points
1,153 views

1 answer to this question.

0 votes

As you've noticed, your request returns only the static elements, because the job links are generated by JavaScript after the page loads. To get JavaScript-generated content from the page itself, you'd need Selenium or a similar client that runs JavaScript.
However, if you inspect the HTTP traffic, you'll notice that the job data is loaded by an XHR request to the API endpoint /search.json, which returns JSON.

So, using urllib2 and json, we can first ask for the total number of results and then request all of them in a single call:

import json
import urllib
import urllib2

# {results}, {location} and {query} are filled in with str.format() below
api_url = 'https://www.amazon.jobs/search.json?radius=24km&facets[]=location&facets[]=business_category&facets[]=category&facets[]=schedule_type_id&facets[]=employee_class&facets[]=normalized_location&facets[]=job_function_id&offset=0&result_limit={results}&sort=relevant&loc_group_id=seattle-metro&latitude=&longitude=&loc_group_id=seattle-metro&loc_query={location}&base_query={query}&city=&country=&region=&county=&query_options=&'
query = ''
# quote_plus escapes the spaces and commas so that the URL is valid
location = urllib.quote_plus('Greater Seattle Area, WA, United States')

# First request: fetch a small page just to read the total number of hits
response = urllib2.urlopen(api_url.format(query=query, location=location, results=10))
total_hits = json.loads(response.read())['hits']

# Second request: ask for every hit in one response
response = urllib2.urlopen(api_url.format(query=query, location=location, results=total_hits))
jobs = json.loads(response.read())['jobs']

# Turn each relative job path into an absolute URL
for job in jobs:
    job['job_path'] = 'https://www.amazon.jobs' + job['job_path']
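
On Python 3, urllib2 no longer exists; its functionality lives in urllib.request and urllib.parse. A minimal sketch of the same two-step request for Python 3 could look like the following. It assumes that /search.json still answers with 'hits' and 'jobs' keys as described above and that a reduced set of query parameters is enough, neither of which has been verified. The helper names build_search_url, fetch_json and fetch_all_jobs are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://www.amazon.jobs/search.json"


def build_search_url(query, location, limit, offset=0):
    """Build the search.json URL for one request (hypothetical helper)."""
    # Assumption: this subset of the original parameters is sufficient.
    params = {
        "base_query": query,
        "loc_query": location,
        "loc_group_id": "seattle-metro",
        "radius": "24km",
        "sort": "relevant",
        "offset": offset,
        "result_limit": limit,
    }
    # urlencode escapes the spaces and commas in the location string
    return API_URL + "?" + urlencode(params)


def fetch_json(url):
    """Download a URL and decode its body as JSON."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_all_jobs(query, location):
    """Read the hit count first, then request that many results."""
    total_hits = fetch_json(build_search_url(query, location, 10))["hits"]
    return fetch_json(build_search_url(query, location, total_hits))["jobs"]
```

Calling fetch_all_jobs('', 'Greater Seattle Area, WA, United States') would then play the role of the jobs list built above. Building the query string with urlencode avoids having to escape the location by hand.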

The jobs list holds one dictionary per job, containing all the job information (title, state, city, etc.). If you want a specific item, for example the links, just loop over the list and select that key.

links = [i['job_path'] for i in jobs]
print links
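
The same pattern works for any other field. Here is a small Python 3 sketch that pulls several fields at once; it runs on made-up sample records rather than a real API response. The keys title and city come from the description above, the sample values and paths are invented, and summarize_jobs is a hypothetical helper.

```python
BASE_URL = "https://www.amazon.jobs"


def summarize_jobs(jobs):
    """Return (title, city, absolute link) tuples, one per job dictionary."""
    summary = []
    for job in jobs:
        link = job["job_path"]
        # Only prefix the host if the path is still relative
        if not link.startswith("http"):
            link = BASE_URL + link
        summary.append((job.get("title"), job.get("city"), link))
    return summary


# Invented records standing in for the 'jobs' list returned by the API
sample_jobs = [
    {"title": "Example Engineer", "city": "Seattle", "job_path": "/en/jobs/000001/example-engineer"},
    {"title": "Example Analyst", "city": "Bellevue", "job_path": "https://www.amazon.jobs/en/jobs/000002/example-analyst"},
]

for title, city, link in summarize_jobs(sample_jobs):
    print(title, city, link)
```

Using job.get() for the optional fields means a record without a title or city yields None instead of raising a KeyError.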
answered Sep 28, 2018 by Priyaj
• 58,090 points
