A Step-by-Step “No-Code” Approach to Web Scraping a News Archive Site with Several Pages and Multiple News Links per Page

Fig.A: The Premium Times accident news archive

For every accident news link on each archive page, we will extract the title of the story, the name of the reporter, the date shown on the page, and the complete accident story.
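To make the target concrete, here is a minimal Python sketch of the record we are aiming for per news link. The class and field names are my own illustration (ParseHub does not require them), and the values are placeholders rather than real scraped data.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class AccidentArticle:
    """One scraped record; the field names are hypothetical, not ParseHub's."""
    title: str
    reported_by: str
    date: str
    story: str


# Placeholder values only, to show the shape of one record.
example = AccidentArticle(
    title="<article title>",
    reported_by="<reporter name>",
    date="<date shown on the page>",
    story="<first paragraph> |<second paragraph>",
)

print(json.dumps(asdict(example), indent=2))
```

Every step that follows fills in one of these four fields.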

Getting Started

  1. First, make sure you have ParseHub downloaded and installed. This project uses it as the web scraper.
  2. Open ParseHub, click on “New project” (Fig.B1), and paste the URL of the Premium Times accident news archive (https://www.premiumtimesng.com/tag/accident) into the “enter a website you will like to extract data from” box (Fig.B2). ParseHub will then render the page inside the app (Fig.B3).
Fig.B1: Opening the ParseHub app and clicking “New project”
Fig.B2: Pasting the Premium Times archive URL
Fig.B3: Premium Times page rendered inside the ParseHub app

Scraping Premium News Article Page

  1. Once the accident archive page has rendered, click on the first accident title on the page. The title you’ve picked will turn green to show that it has been selected (Fig.C).
Fig.C: First accident title highlighted in green

2. The links to the remaining accident titles will be highlighted in yellow (Fig.D1). Click the second accident title in the list. All accident titles will now be highlighted in green (Fig.D2).

Fig.D1: The rest of the accident titles highlighted in yellow
Fig.D2: The remaining titles turn green
  1. In the left sidebar, rename your default selection (under “Select page”) to Accident_Title. You’ll note that ParseHub now extracts every accident title and the URL for each one.
  2. Click the PLUS(+) symbol next to the Accident_Title selection in the left sidebar and choose the RELATIVE SELECT command (Fig.E).
Fig.E: Select the relative select in the popup above

3. With the “RELATIVE SELECT” command active, click the first accident title on the rendered page, then click the reporter text beside the “BY:” keyword. An arrow will connect the two selections (Fig.F).

Fig.F: Arrow connecting the Title to By text

4. Expand the new “RELATIVE” command you’ve created and delete the URL that is also being extracted by default (Fig. G).

Fig. G: Expand Icon

To select the date listed on the news page, repeat steps 3-4 above.

The steps above extract only the news article’s title, reporter, and date for each accident news link on the first page. Our current ParseHub project will now look like Fig.H. Next, we’ll extract the full accident story for each news link on the first page, and then we’ll configure ParseHub to go through all of the pages and extract the same data. (Stay tuned and grab a cup of coffee!)

Fig.H: Accident Title page results
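For readers curious about what the select and relative-select steps do conceptually, here is a rough Python sketch. It is only an analogy: the markup below is a hypothetical, simplified stand-in (the real page’s HTML differs), and the function name and dictionary keys are my own.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page's HTML will differ.
SAMPLE = """
<div>
  <article>
    <h2><a href="https://example.com/story-1">Story one</a></h2>
    <span class="author">Reporter A</span>
    <span class="date">January 1, 2022</span>
  </article>
  <article>
    <h2><a href="https://example.com/story-2">Story two</a></h2>
    <span class="author">Reporter B</span>
    <span class="date">January 2, 2022</span>
  </article>
</div>
"""


def extract_listing(markup):
    """Return one row per article: title, URL, reporter, and date."""
    root = ET.fromstring(markup.strip())
    rows = []
    for article in root.iter("article"):
        # "Select": the title link gives both the text and the URL.
        link = article.find("./h2/a")
        rows.append({
            "Accident_Title": link.text,
            "url": link.get("href"),
            # "Relative select": look for the reporter and date near this title.
            "reported_by": article.find("./span[@class='author']").text,
            "date": article.find("./span[@class='date']").text,
        })
    return rows


for row in extract_listing(SAMPLE):
    print(row)
```

The point of the relative select is visible in the loop: the reporter and date are looked up relative to each title, so every title stays paired with its own details.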


Combining two or more paragraphs into a single cell or JSON object.

In the next section, we’ll configure ParseHub to click on each accident title link on the rendered Premium Times pages and extract the entire news story from each linked page. The goal is to store the full story for each article URL as a single cell (or a single JSON value), even though the story is spread across several paragraphs.

Extracting full news stories from the news page links

  1. To begin, click the three dots next to the main_template text in the left sidebar (Fig.I1).
  2. Rename your Template to Accident_results_page. Templates help ParseHub keep different page layouts separate (Fig.I2).
Fig.I1: On the left, click on the three dots to rename the current template
Fig.I2: Renaming the current Template

3. Click the PLUS(+) button next to the previously created Accident_Title selection and pick the “CLICK” command (Fig.J1). A popup will appear asking if this link is a “next page” button. Click “No” (Fig.J2), then, next to “Create New Template”, enter a new template name (Fig.J3); in this case, we will use Accident_page.

Fig.J1: Click on “Click Command”
Fig.J2: Click “NO”
Fig.J3: Create New Template

4. ParseHub will now automatically create this new Template and render the first Accident link page.
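If it helps to picture how the two templates relate, here is a small Python sketch. It is an analogy only, not how ParseHub is implemented: the in-memory “site” and its contents are hypothetical, and the two functions simply borrow the template names used in this tutorial.

```python
# Hypothetical in-memory "site": the links on the listing page and the
# paragraphs found on each linked article page.
LISTING = ["/story-1", "/story-2"]
ARTICLES = {
    "/story-1": ["First paragraph.", "Second paragraph."],
    "/story-2": ["Only paragraph."],
}


def accident_page(url):
    """Second template: runs on the page opened by the Click command."""
    return {"url": url, "paragraphs": ARTICLES[url]}


def accident_results_page(links):
    """First template: 'click' each title link and run the second template."""
    return [accident_page(url) for url in links]


results = accident_results_page(LISTING)
for item in results:
    print(item["url"], len(item["paragraphs"]))
```

Keeping the listing layout and the article layout in separate templates is what lets each one have its own set of selections.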


Joining Multiple Paragraphs into a Single Cell

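5. On the new accident page, click the first paragraph of the news story to select it and rename the selection to “firstpara” (Fig.K2). Now click on the Extract icon beside “Select & Extract firstpara” (Fig.K1) to expand the EXTRACT command out from the SELECT command.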

Fig K1: Extract Icon
Fig.K2: Selecting the first paragraph of the first accident news story

6. Rename the “Extract firstpara” command to “para”, so that it reads “Extract para”, and make sure $e.text is set in the property box on the left side of the ParseHub app (Fig.L).

Fig.L: Renaming “para” and “$e.text” set in the property box

7. Click the + symbol next to “Select page” to create a new SELECT command. Then click on the first paragraph of the accident news story to select it. It will be highlighted in green, while similar paragraphs will be highlighted in yellow. Click on the yellow-highlighted paragraphs until all of them are selected (Fig.M2). Next, rename the SELECT command to “allpara” (you should see “Select allpara(14)”), then click on the List icon (Fig.M1) to expand the “Begin new entry” command out from the SELECT command.

Fig.M1: List icon, Extract, Advanced (More Commands)
Fig.M2: Results of all paragraphs selected

Now that the “Begin new entry” command is visible, hover over it and delete it by clicking on the trash icon next to the command.

Fig.N: Hover over the Begin new entry command to delete

4. Click on the + sign next to “Select allpara”, click on Advanced, and then choose the “CONDITIONAL” command from the toolbox (Fig.O1). In the CONDITIONAL command’s property box, type $selection.index!=0 (Fig.O2). This means ParseHub will execute the commands nested under the Conditional command for every paragraph in the news story except the first one.

Fig.O1: Click on the Select allpara(14)
Fig.O2: Type “$selection.index!=0” into the property box

5. Click on the + sign next to the “CONDITIONAL” command (if $selection.index!=0) and add an Extract command to your project (under Advanced). Rename this Extract command to “para”, spelled exactly the same as the “EXTRACT” command under “Select firstpara”. In this Extract command’s property box, type para+" |"+$e.text (Fig.P). This takes the value of para already in your results, appends a “|” character to it, and then appends the text of the currently selected paragraph. The process is repeated for each paragraph selected on the current page, resulting in a single cell/JSON value containing all the paragraphs.
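The logic of the Conditional and Extract commands can be mirrored in a few lines of Python. This is only an analogy to show how para grows; the function name and the sample paragraphs are my own, and it is not ParseHub’s actual implementation.

```python
def join_paragraphs(paragraphs, sep=" |"):
    """Combine paragraph texts into one string, like the 'para' field."""
    if not paragraphs:
        return ""
    # "Extract para" under "Select firstpara": start with the first paragraph.
    para = paragraphs[0]
    for index, text in enumerate(paragraphs):
        # The Conditional command: skip the paragraph at index 0, because
        # it is already stored in para.
        if index != 0:
            # The nested Extract command: para + " |" + $e.text
            para = para + sep + text
    return para


print(join_paragraphs(["A", "B", "C"]))  # prints: A |B |C
```

Without the condition, the first paragraph would be appended to itself and appear twice in the result.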

You can double-check that the extraction has worked properly by looking at the results preview pane at the bottom of the ParseHub client. Note that the CSV/Excel preview cuts off results after a certain number of characters, so use the “JSON” preview to see the entire page extraction.

Fig.P: Type para+" |"+$e.text into the property box for the “Extract para” command

Adding Pagination

  1. Return to the Accident_results_page template in the left sidebar. You may also need to switch the browser tab back to the first page of the accident archive.
  2. Click on the PLUS(+) sign next to the “Select page” selection and choose the “SELECT” command.
  3. Then select the Next page link at the bottom of the accidents page. Rename the selection to “next_button”.
Fig.Q1: Click on the + sign beside “Select page” for the “SELECT” command
Fig.Q2: Name the new selection next_button (i.e. “Empty next_button”)

4. By default, ParseHub will extract the text and URL from this link. Click on the Extract icon to expand your new “next_button” selection, then remove these two commands.

5. Click on the PLUS(+) sign of your next_button selection and use the CLICK command.

6. A popup will appear asking if this is a “Next” page button (Fig.R1). Click “Yes” and enter the number of additional pages you’d like to navigate to. In this case, we will scrape nine additional pages (Fig.R2).

Fig.R1: Pop-up Asking If this is a next page button?
Fig.R2: Adding nine pages
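The pagination setting can be pictured as a bounded loop: scrape the first page, then follow the Next link up to nine more times. The Python sketch below is an analogy under my own assumptions (a hypothetical archive of 12 numbered pages and made-up function names), not ParseHub’s internals.

```python
def crawl(first_page, next_of, extra_pages=9):
    """Visit first_page, then follow the 'next' link up to extra_pages times."""
    visited = [first_page]
    current = first_page
    for _ in range(extra_pages):
        current = next_of(current)
        if current is None:  # no Next link: the last page was reached
            break
        visited.append(current)
    return visited


# Hypothetical archive with pages numbered 1 to 12.
def next_of(page):
    return page + 1 if page < 12 else None


pages = crawl(1, next_of)
print(len(pages))  # prints 10: the first page plus nine additional pages
```

Capping the number of pages keeps a test run short; the loop also stops early if the archive has fewer pages than requested.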

Running and Exporting your Project

Fig.S1: Final ParseHub project setup for the first template
Fig.S2: Final setup for template 2 combines multiple paragraphs into one cell.
Final Results in JSON Format

Click on the “Get Data” button in the left sidebar, then click on the “Run” button to run your scrape. For longer projects, we recommend doing a Test Run first to verify that your data will be formatted correctly. Once the scrape job is completed, you can download all the information you’ve requested as a spreadsheet or as a JSON file.
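If you later want to post-process the JSON download, a short Python sketch like the one below may help. Treat the layout of EXPORT as an assumption: the key names mirror the selections named in this tutorial, but a real ParseHub export may be structured differently, and the values here are placeholders.

```python
import csv
import io
import json

# Hypothetical export; the exact layout of a real ParseHub export is an
# assumption, and the values are placeholders.
EXPORT = """
{"Accident_Title": [
  {"name": "Story one", "url": "https://example.com/story-1", "para": "P1 |P2 |P3"},
  {"name": "Story two", "url": "https://example.com/story-2", "para": "Only paragraph"}
]}
"""


def to_csv(export_text):
    """Flatten the exported records into CSV text, one row per article."""
    rows = json.loads(export_text)["Accident_Title"]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["name", "url", "para"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def split_story(para, sep=" |"):
    """Recover the individual paragraphs from the combined 'para' cell."""
    return para.split(sep)


print(to_csv(EXPORT))
print(split_story("P1 |P2 |P3"))  # prints ['P1', 'P2', 'P3']
```

Because every story was joined with the same “ |” separator, the combined cell can be split back into paragraphs whenever you need them.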

Kudos 👏🏼👏🏼 ♚ 💪🏾 💯 You have come to the end of this article.

Interested in ParseHub training and free certification, from zero to hero?
Consider the following ParseHub links:
1) https://academy.parsehub.com/

2) https://www.parsehub.com/blog/
