Get all the "read more" links from amazon.jobs with Python

0 votes

I'm a beginner in Python and I just want to scrape all the "read more" links from the Amazon jobs page. For example, I want to scrape this page: https://www.amazon.jobs/en/search?base_query=&loc_query=Greater+Seattle+Area%2C+WA%2C+United+States&latitude=&longitude=&loc_group_id=seattle-metro&invalid_location=false&country=&city=&region=&county=

Below is the code I used.

#import the library used to query a website
import urllib2
#import the Beautiful soup functions to parse the data returned from the website
from bs4 import BeautifulSoup

#specify the url
url = "https://www.amazon.jobs/en/search?base_query=&loc_query=Greater+Seattle+Area%2C+WA%2C+United+States&latitude=&longitude=&loc_group_id=seattle-metro&invalid_location=false&country=&city=&region=&county="

#Query the website and return the html to the variable 'page'
page = urllib2.urlopen(url)

#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page, "lxml")
print soup.find_all("a")

Output:

[<a class="icon home" href="/en">Home</a>,
 <a class="icon check-status" data-target="#icims-portal-selector" data-toggle="modal">Review application status</a>,
 <a class="icon working" href="/en/working/working-amazon">Amazon culture &amp; benefits</a>,
 <a class="icon locations" href="/en/locations">Locations</a>,
 <a class="icon teams" href="/en/business_categories">Teams</a>,
 <a class="icon job-categories" href="/en/job_categories">Job categories</a>,
 <a class="icon help" href="/en/faqs">Help</a>,
 <a class="icon language" data-animate="false" data-target="#locale-options" data-toggle="collapse" href="#locale-options" id="current-locale">English</a>,
...
 <a href="/en/privacy/us">Privacy and Data</a>,
 <a href="/en/impressum">Impressum</a>]

I am only getting links to the static elements on the page, i.e. the ones that are the same for any query, but I need the links to all 4,896 jobs. Can anyone tell me where I am going wrong?

Sep 28, 2018 in AWS by bug_seeker
• 15,350 points
65 views

1 answer to this question.

0 votes

As you've noticed, your request returns only static elements, because the job links are generated by JavaScript. To get JavaScript-generated content you would normally need Selenium or a similar client that can run JS.
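(For reference, if you did go the browser route, a minimal Selenium sketch might look like the one below. The ChromeDriver setup and the CSS selector for the job links are assumptions for illustration, not taken from the actual page markup.)

# Rough Selenium sketch -- assumes ChromeDriver is installed and on PATH,
# and guesses a CSS selector for the job links (adjust after inspecting the page).
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = ('https://www.amazon.jobs/en/search?base_query=&loc_query='
       'Greater+Seattle+Area%2C+WA%2C+United+States&loc_group_id=seattle-metro')

driver = webdriver.Chrome()
driver.get(url)

# Wait until the JS-rendered job tiles are present before reading them
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.job a')))

links = [a.get_attribute('href')
         for a in driver.find_elements(By.CSS_SELECTOR, 'div.job a')]
print(links)
driver.quit()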
However, if you inspect the HTTP traffic, you'll notice that the job data is loaded by an XHR request to the /search.json API endpoint, which returns JSON.

So, using urllib2 and json, we can first read the total number of results and then collect all the data:

import urllib
import urllib2
import json

# The search page loads its results from this JSON API endpoint
api_url = 'https://www.amazon.jobs/search.json?radius=24km&facets[]=location&facets[]=business_category&facets[]=category&facets[]=schedule_type_id&facets[]=employee_class&facets[]=normalized_location&facets[]=job_function_id&offset=0&result_limit={results}&sort=relevant&loc_group_id=seattle-metro&latitude=&longitude=&loc_group_id=seattle-metro&loc_query={location}&base_query={query}&city=&country=&region=&county=&query_options=&'

query = ''
# URL-encode the location so the request URL contains no raw spaces or commas
location = urllib.quote_plus('Greater Seattle Area, WA, United States')

# First request: fetch a small result set just to read the total number of hits
request = urllib2.urlopen(api_url.format(query=query, location=location, results=10))
results = json.loads(request.read())['hits']

# Second request: ask for all of the results in one go
request = urllib2.urlopen(api_url.format(query=query, location=location, results=results))
jobs = json.loads(request.read())['jobs']

# Turn the relative job paths into absolute links
for i in jobs:
    i['job_path'] = 'https://www.amazon.jobs' + i['job_path']

The jobs list holds one dictionary per job with all the job information (title, state, city, etc.). If you want a specific field, for example the links, just loop over the list and select that key:

links = [i['job_path'] for i in jobs]
print links
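Note that urllib2 exists only on Python 2. If you're on Python 3, a minimal equivalent sketch using urllib.request and urllib.parse could look like the following; the /search.json endpoint and the 'hits'/'jobs'/'job_path' keys come from the answer above, while the trimmed parameter list is an assumption made to keep the example short:

import json
import urllib.parse
import urllib.request

# Same endpoint as above, keeping only the parameters this sketch actually varies
api_url = ('https://www.amazon.jobs/search.json?result_limit={results}'
           '&loc_group_id=seattle-metro&loc_query={location}&base_query={query}')

query = ''
location = urllib.parse.quote_plus('Greater Seattle Area, WA, United States')

# First request: read the total number of hits
with urllib.request.urlopen(api_url.format(query=query, location=location, results=10)) as resp:
    total = json.loads(resp.read().decode())['hits']

# Second request: fetch everything in one go (the API may cap result_limit,
# in which case you would page through the results using an offset parameter)
with urllib.request.urlopen(api_url.format(query=query, location=location, results=total)) as resp:
    jobs = json.loads(resp.read().decode())['jobs']

links = ['https://www.amazon.jobs' + job['job_path'] for job in jobs]
print(len(links), 'job links collected')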
answered Sep 28, 2018 by Priyaj
• 56,900 points
