Get all the "read more" links from amazon.jobs with Python

0 votes

I'm a beginner in Python and I just want to scrape all the "read more" links from the Amazon jobs page. For example, I want to scrape this page: https://www.amazon.jobs/en/search?base_query=&loc_query=Greater+Seattle+Area%2C+WA%2C+United+States&latitude=&longitude=&loc_group_id=seattle-metro&invalid_location=false&country=&city=&region=&county=

Below is the code I used.

#import the library used to query a website
import urllib2
#import the Beautiful soup functions to parse the data returned from the website
from bs4 import BeautifulSoup

#specify the url
url = "https://www.amazon.jobs/en/search?base_query=&loc_query=Greater+Seattle+Area%2C+WA%2C+United+States&latitude=&longitude=&loc_group_id=seattle-metro&invalid_location=false&country=&city=&region=&county="

#Query the website and return the html to the variable 'page'
page = urllib2.urlopen(url)

#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page, "lxml")
print soup.find_all("a")

Output:

[<a class="icon home" href="/en">Home</a>,
 <a class="icon check-status" data-target="#icims-portal-selector" data-toggle="modal">Review application status</a>,
 <a class="icon working" href="/en/working/working-amazon">Amazon culture &amp; benefits</a>,
 <a class="icon locations" href="/en/locations">Locations</a>,
 <a class="icon teams" href="/en/business_categories">Teams</a>,
 <a class="icon job-categories" href="/en/job_categories">Job categories</a>,
 <a class="icon help" href="/en/faqs">Help</a>,
 <a class="icon language" data-animate="false" data-target="#locale-options" data-toggle="collapse" href="#locale-options" id="current-locale">English</a>,
...
 <a href="/en/privacy/us">Privacy and Data</a>,
 <a href="/en/impressum">Impressum</a>]

I am only getting links to the static elements on the page, i.e. the ones that are the same for any query, but I need the links to all 4,896 jobs. Can anyone tell me where I am going wrong?

Sep 28, 2018 in AWS by bug_seeker
• 15,350 points
65 views

1 answer to this question.

0 votes

As you've noticed, your request returns only static elements, because the job links are generated by JavaScript. To get JavaScript-generated content you would normally need Selenium or a similar client that can run JS.
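(For reference, if you did go the browser route, a minimal Selenium sketch might look like the one below. The ChromeDriver setup and the CSS selector for the job links are assumptions for illustration, not taken from the actual page markup.)

# Rough Selenium sketch -- assumes ChromeDriver is installed and on PATH,
# and guesses a CSS selector for the job links (adjust after inspecting the page).
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = ('https://www.amazon.jobs/en/search?base_query=&loc_query='
       'Greater+Seattle+Area%2C+WA%2C+United+States&loc_group_id=seattle-metro')

driver = webdriver.Chrome()
driver.get(url)

# Wait until the JS-rendered job tiles are present before reading them
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.job a')))

links = [a.get_attribute('href')
         for a in driver.find_elements(By.CSS_SELECTOR, 'div.job a')]
print(links)
driver.quit()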
However, if you inspect the HTTP traffic, you'll notice that the job data is loaded by an XHR request to the /search.json API endpoint, which returns JSON.

So, using urllib2 and json, we can first read the total number of results and then collect all the data:

import urllib
import urllib2
import json

# The search page loads its results from this JSON API endpoint
api_url = 'https://www.amazon.jobs/search.json?radius=24km&facets[]=location&facets[]=business_category&facets[]=category&facets[]=schedule_type_id&facets[]=employee_class&facets[]=normalized_location&facets[]=job_function_id&offset=0&result_limit={results}&sort=relevant&loc_group_id=seattle-metro&latitude=&longitude=&loc_group_id=seattle-metro&loc_query={location}&base_query={query}&city=&country=&region=&county=&query_options=&'

query = ''
# URL-encode the location so the request URL contains no raw spaces or commas
location = urllib.quote_plus('Greater Seattle Area, WA, United States')

# First request: fetch a small result set just to read the total number of hits
request = urllib2.urlopen(api_url.format(query=query, location=location, results=10))
results = json.loads(request.read())['hits']

# Second request: ask for all of the results in one go
request = urllib2.urlopen(api_url.format(query=query, location=location, results=results))
jobs = json.loads(request.read())['jobs']

# Turn the relative job paths into absolute links
for i in jobs:
    i['job_path'] = 'https://www.amazon.jobs' + i['job_path']

The jobs list holds one dictionary per job with all the job information (title, state, city, etc.). If you want a specific field, for example the links, just loop over the list and select that key:

links = [i['job_path'] for i in jobs]
print links
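Note that urllib2 exists only on Python 2. If you're on Python 3, a minimal equivalent sketch using urllib.request and urllib.parse could look like the following; the /search.json endpoint and the 'hits'/'jobs'/'job_path' keys come from the answer above, while the trimmed parameter list is an assumption made to keep the example short:

import json
import urllib.parse
import urllib.request

# Same endpoint as above, keeping only the parameters this sketch actually varies
api_url = ('https://www.amazon.jobs/search.json?result_limit={results}'
           '&loc_group_id=seattle-metro&loc_query={location}&base_query={query}')

query = ''
location = urllib.parse.quote_plus('Greater Seattle Area, WA, United States')

# First request: read the total number of hits
with urllib.request.urlopen(api_url.format(query=query, location=location, results=10)) as resp:
    total = json.loads(resp.read().decode())['hits']

# Second request: fetch everything in one go (the API may cap result_limit,
# in which case you would page through the results using an offset parameter)
with urllib.request.urlopen(api_url.format(query=query, location=location, results=total)) as resp:
    jobs = json.loads(resp.read().decode())['jobs']

links = ['https://www.amazon.jobs' + job['job_path'] for job in jobs]
print(len(links), 'job links collected')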
answered Sep 28, 2018 by Priyaj
• 56,900 points
