A Step-by-Step “No-Code” Approach to Web Scraping a News Archive Site with Several Pages and Multiple News Links per Page

Fig.A: The Premium Times accident news archive

Getting Started

  1. First, make sure you have ParseHub downloaded and installed; this project uses it as the web scraper.
  2. Open ParseHub, click “New project”, and paste the URL of the Premium Times accident news results page (https://www.premiumtimesng.com/tag/accident) into the “enter a website you will like to extract data from” box (Fig.B1, Fig.B2).
Fig.B1: Opening the ParseHub app and clicking “New project”
Fig.B2: Pasting the Premium Times news archive URL
Fig.B3: Premium Times news page rendered inside the ParseHub app

Scraping Premium News Article Page

  1. Once the page has rendered, click the first accident title on it. The title you picked turns green to show that it has been selected (Fig.C). ParseHub highlights the remaining titles in yellow (Fig.D1); click the second title so that they all turn green (Fig.D2).
Fig.C: First accident title highlighted in green
Fig.D1: The rest of the accident titles are highlighted in yellow
Fig.D2: The remaining titles turn green
  2. In the left sidebar, rename the default selection (under “Select page”) to Accident_Title. ParseHub now extracts every accident title together with its URL.
  3. Click the PLUS (+) sign next to the Accident_Title selection in the left sidebar and choose the Relative Select command (Fig.E).
Fig.E: Choosing Relative Select in the pop-up
Fig.F: Arrow connecting the title to the “By” text
Fig.G: Expand icon
Fig.H: Accident title page results

Combining Two or More Paragraphs into a Single Cell or JSON Object

Extracting Full News Stories from the News Page Links

  1. To begin, click the three dots next to the main_template text in the left sidebar (Fig.I1).
  2. Rename your Template to Accident_results_page. Templates help ParseHub keep different page layouts separate (Fig.I2).
Fig.I1: In the left sidebar, click the three dots to rename the current template
Fig.I2: Renaming the current Template
Fig.J1: Click on “Click Command”
Fig.J2: Click “NO”
Fig.J3: Create New Template

Joining Multiple Paragraphs into One Cell

Fig K1: Extract Icon
Fig.K2: Selecting the first paragraph of the first accident news story
Fig.L: Renaming the selection to “para”, with “$e.text” set in the property box
Fig.M1: List icon, Extract, Advanced (More Commands)
Fig.M2: Results of all paragraphs selected
Fig.N: Hover over the “Begin new entry” command to delete it
Fig.O1: Click on the “Select allpara (14)” command
Fig.O2: Type “selection.index!=0” into the property box
Fig.P: Type ‘para+" |"+$e.text’ into the property box for the “Extract para” command
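
The two expressions in Fig.O2 and Fig.P work together: the first paragraph sets para, and every later paragraph (where selection.index is not 0) is appended to it after a " |" separator. For readers who want to see that logic outside ParseHub, here is a minimal Python sketch of the same accumulation. The function name join_paragraphs and the sample sentences are hypothetical; they are not part of the ParseHub project.

```python
def join_paragraphs(paragraphs, separator=" |"):
    """Accumulate paragraph texts into one string, mimicking the
    para + " |" + $e.text expression used when selection.index != 0."""
    para = ""
    for index, text in enumerate(paragraphs):
        if index == 0:
            # First paragraph: plain extraction, nothing to append to yet.
            para = text
        else:
            # Later paragraphs: append to the running value with the separator.
            para = para + separator + text
    return para


# Hypothetical sample paragraphs standing in for one news story.
story = [
    "A bus crashed on Monday.",
    "Three people were injured.",
    "Police are investigating.",
]
print(join_paragraphs(story))
```

Because the separator is placed only between paragraphs, the joined cell can later be split back into its original paragraphs on " |".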

Adding Pagination

  1. Return to the Accident_results_page template in the left sidebar. You may also need to switch the browser tab back to the first page of the accident archive.
  2. Click the PLUS (+) sign next to the “Select page” selection and choose the Select command.
  3. Then select the Next page link at the bottom of the accidents page. Rename the selection to “next_button”.
Fig.Q1: Click the + sign beside “Select page” and choose the Select command
Fig.Q2: Name the new selection next_button (shown as “Empty next_button”)
Fig.R1: Pop-up asking whether this is a next-page button
Fig.R2: Adding nine pages
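
What the pagination step does can be pictured as a simple loop: start on the first archive page and follow the next_button link a fixed number of times. The sketch below is only an illustration in Python, assuming the nine in Fig.R2 means nine further clicks on the next button (so ten pages in total). The function collect_pages, the get_next_url callback, and the FAKE_SITE dictionary are all hypothetical stand-ins; no real site is contacted.

```python
def collect_pages(start_url, get_next_url, max_clicks=9):
    """Follow "next page" links from start_url, at most max_clicks times."""
    visited = [start_url]
    current = start_url
    for _ in range(max_clicks):
        next_url = get_next_url(current)
        if next_url is None or next_url in visited:
            # No next button on this page, or a link back to a page already seen.
            break
        visited.append(next_url)
        current = next_url
    return visited


# Hypothetical stand-in for a site: each page maps to the page its
# "next" link points at. The last page (page/15) has no next link.
FAKE_SITE = {f"page/{n}": f"page/{n + 1}" for n in range(1, 15)}

pages = collect_pages("page/1", FAKE_SITE.get)
print(pages)
```

The loop stops early when a page has no next link, which is why the limit acts as an upper bound and not an exact page count.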
Run the project

Running and Exporting your Project

Fig.S1: Final ParseHub project setup for the first template
Fig.S2: Final setup for the second template, which combines multiple paragraphs into one cell
Final Results in JSON Format
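
Once the results are exported as JSON, the paragraphs that were joined with " |" can be split apart again for analysis. The Python sketch below shows one way to do that. The sample export is invented for illustration: the keys "name", "url" and "para" and the example.com address are assumptions about the export shape, and the real keys depend on the selection names used in your own project.

```python
import json

# Hypothetical sample shaped like an exported result; the real keys
# depend on the selection names chosen in the project.
SAMPLE_EXPORT = """
{
  "Accident_Title": [
    {
      "name": "Example headline",
      "url": "https://example.com/story-1",
      "para": "First paragraph. |Second paragraph. |Third paragraph."
    }
  ]
}
"""


def split_stories(export_text, list_key="Accident_Title", separator=" |"):
    """Parse the export and split each joined 'para' cell into paragraphs."""
    data = json.loads(export_text)
    stories = []
    for item in data.get(list_key, []):
        joined = item.get("para") or ""
        # An empty or missing cell yields an empty paragraph list.
        paragraphs = joined.split(separator) if joined else []
        stories.append(
            {
                "title": item.get("name"),
                "url": item.get("url"),
                "paragraphs": paragraphs,
            }
        )
    return stories


for story in split_stories(SAMPLE_EXPORT):
    print(story["title"], len(story["paragraphs"]))
```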

Ayanlowo Babatunde

Industrial Engineer with interests in Machine learning/Robotics/IOT