A Step-by-Step “No-Code” Approach to Scraping a News Archive Site with Several Pages and Multiple News Links per Page
In this article, I’ll teach you how to use the no-code ParseHub app to scrape multi-page news sites with multiple news links per page. As an example, we’ll use the URL http://www.premiumtimesng.com/tag/accident throughout these how-to tips.
For each article linked from an accident news page, we will extract the title of the story, the author of the report, the date listed on the news page, and the complete accident story.
- First, make sure you have ParseHub downloaded and installed. This project uses this web scraper.
- Open ParseHub, click on “New project”, and enter the URL of the Premium Times accident news results page (Fig. B1) (https://www.premiumtimesng.com/tag/accident) into the “enter a website you will like to extract data from” field (Fig. B2).
Scraping the Premium Times News Article Page
- Once the accident site has rendered, click on the first accident title on the displayed page (Fig. C). The title you’ve picked will turn green to show that it has been selected (Fig. D).
2. The links to the remaining accident titles will be highlighted in yellow (Fig. D1). In the accident title list, click the second accident title link. All accident titles will now be highlighted in green (Fig. D2).
- In the left sidebar, rename your default selection (under “Select page”) to Accident_Title. You’ll notice that ParseHub now extracts every accident title along with its URL.
- Click the PLUS (+) symbol next to the Accident_Title selection in the left sidebar and choose the RELATIVE SELECT command (Fig. E).
3. Using the RELATIVE SELECT command, click the first accident title on the rendered page, then click the author text beside the “BY:” keyword. An arrow will connect the two selections (Fig. F).
4. Expand the new RELATIVE SELECT command you’ve created and delete the URL extraction that is added by default (Fig. G).
To select the date listed on the news page, repeat steps 3 - 4 above.
The steps above extract only the title, the author, and the date for each accident news link on the first page. Our ParseHub project will now look like Fig. H. Next, we’ll extract the full accident story for each news link on the first page, and then we’ll configure ParseHub to go through all of the pages and extract the same data. (Stay tuned, and grab a cup of coffee!)
Combining two or more paragraphs into a single cell or JSON object.
At times, you might want to select multiple elements on a page and combine their values into a single cell or JSON object. ParseHub can do this for something like an accident news details page with multiple paragraphs. In this tutorial, I’ll show you how to combine two or more paragraphs of text from a news page into one cell/JSON object in your results file.
In the next section, we’ll configure ParseHub to click on each of the accident title links on the rendered Premium Times pages and extract the entire news story from each linked page. The goal is to store the full text of each article as a single cell value for that article’s URL.
Extracting full news stories from a news page’s links
- To begin, click the three dots next to the main_template text in the left sidebar (Fig.I1).
- Rename your Template to Accident_results_page. Templates help ParseHub keep different page layouts separate (Fig.I2).
3. Pick the “CLICK” command using the PLUS (+) button next to the previously created Accident_Title selection. A popup will appear asking whether this link is a “next page” button. Click “No”, and next to “Create New Template”, enter a new template name; in this case, we will use Accident_page.
4. ParseHub will now automatically create this new Template and render the first Accident link page.
Joining Multiple Paragraphs into One Cell
5. In the left sidebar, make sure the Accident_page Template (the new Template you just created) is the current Template. Next, using the SELECT command that was automatically generated for you (Fig. K2), pick the first paragraph of the accident news story. Rename this command to “firstpara” by clicking on the command’s name.
Now click on the Extract icon beside “Select & Extract firstpara” (Fig. K1) to expand the EXTRACT command out from the SELECT command.
6. Rename the “Extract firstpara” command to “para”, so that it reads “Extract para” (Fig. L), and make sure $e.text is set in the property box on the left side of the ParseHub app (Fig. L).
7. Click the + symbol next to “Select page” to create a new SELECT command. Then click on the first paragraph of the accident news story to select it. It should be highlighted in green, while similar paragraph elements should be highlighted in yellow. Next, click on the yellow-highlighted paragraphs until all of them are selected. Rename the SELECT command to “allpara” (you should see “Select allpara (14)”), then click on the List icon to expand the “Begin new entry” command out from the SELECT command.
Now that the “Begin new entry” command is visible, hover over it and delete it by clicking on the trash icon next to the command.
8. Click on the + sign next to “Select allpara”, click on Advanced, and then choose the CONDITIONAL command from the toolbox (Fig. O1). In the CONDITIONAL command’s property box, type $selection.index != 0 (Fig. O2). This means that ParseHub will execute the commands nested under the Conditional command for every paragraph in the news story except the first paragraph on the page.
9. Click on the + sign next to the CONDITIONAL command (if $selection.index != 0) (Fig. P) and add an Extract command (under Advanced). Rename this Extract command to “para”, spelled exactly as you named the EXTRACT command under “Select firstpara”. In this Extract command’s settings, type para + " |" + $e.text (Fig. P). This takes the paragraph text already stored in your results, appends " |" as a separator, and then appends the text of the currently selected paragraph. The process repeats for each selected paragraph on the current page, resulting in a single cell/JSON object containing all the paragraphs.
You can double-check that the extraction has worked properly by looking at the results preview pane at the bottom of the ParseHub client. Note that the CSV/Excel preview cuts off results after a certain number of characters, so use the “JSON” results preview to see the entire page extraction.
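If it helps to see the accumulation logic outside the ParseHub interface, here is a minimal Python sketch of what the firstpara / allpara setup appears to do. It is only an illustration, not ParseHub’s internal implementation: the function name join_paragraphs and the sample paragraphs are made up for this example, and only the " |" separator and the index != 0 condition come from the steps above.

```python
def join_paragraphs(paragraphs, separator=" |"):
    """Mimic the firstpara / allpara accumulation described in the steps above."""
    para = None
    # "Select firstpara" + "Extract para": start with the first paragraph's text.
    if paragraphs:
        para = paragraphs[0]
    # "Select allpara": visit every paragraph; enumerate() plays the role
    # of $selection.index.
    for index, text in enumerate(paragraphs):
        # Conditional ($selection.index != 0): skip the first paragraph,
        # because it is already stored in para.
        if index != 0:
            # Extract para: para + " |" + $e.text
            para = para + separator + text
    return para


# Made-up sample paragraphs, used only to show the shape of the result.
story = ["First paragraph.", "Second paragraph.", "Third paragraph."]
print(join_paragraphs(story))
# prints: First paragraph. |Second paragraph. |Third paragraph.
```

Without the condition, the first paragraph would be appended to itself and appear twice in the joined cell, which is why the first index is skipped.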
For this project, we want to scrape data from several pages of the accident archive site. So far, we’ve only scraped page 1 of the Accident_results_page Template, along with the pages it links to. Let’s set up ParseHub to use the “next” button at the bottom of page 1 to navigate through all 9 accident pages. (Stretch your legs and arms while you drink that cup of coffee!)
- Return to the Accident_results_page Template in the left sidebar. You may also need to switch the browser tab back to the first page of the accident archive site.
- Click on the PLUS(+) sign next to the select page selection and choose the “SELECT” command.
- Then select the Next page link at the bottom of the Accidents page. Rename the selection to “next_button”.
4. By default, ParseHub will extract the text and URL from this link. Click on the Extract icon to expand your new “next_button” selection, and remove these two commands.
5. Click on the PLUS(+) sign of your next_button selection and use the CLICK command.
6. A popup will appear asking whether this is a “Next” page button (Fig. R1). Click “Yes” and enter the number of pages you’d like to navigate to. In this case, we will scrape nine additional pages (Fig. R2).
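For readers who like to see the overall flow written down, the two-template project can be pictured as a loop like the Python sketch below. This is a simplified model under stated assumptions, not how ParseHub is implemented: the crawl function, the in-memory pages dictionary, and the story names are all hypothetical, and no real website is contacted.

```python
def crawl(pages, start, max_next_clicks):
    """Walk results pages through their "next" link and collect each title link."""
    records = []
    current = start
    clicks = 0
    while current is not None:
        page = pages[current]
        # Accident_Title selection + Click: every title on this results page
        # would be opened with the Accident_page template.
        for title, article_url in page["titles"]:
            records.append({"Accident_Title": title, "url": article_url})
        # next_button + Click: stop once the requested number of
        # additional pages has been reached.
        if clicks >= max_next_clicks:
            break
        current = page.get("next")  # None when there is no next link
        clicks += 1
    return records


# Hypothetical in-memory stand-in for the archive pages.
pages = {
    "page1": {"titles": [("Story A", "/a"), ("Story B", "/b")], "next": "page2"},
    "page2": {"titles": [("Story C", "/c")], "next": "page3"},
    "page3": {"titles": [("Story D", "/d")], "next": None},
}

print(len(crawl(pages, "page1", max_next_clicks=1)))  # prints 3
print(len(crawl(pages, "page1", max_next_clicks=9)))  # prints 4
```

The max_next_clicks argument plays the role of the page count entered in the popup: the loop ends either when that limit is hit or when a page has no next link.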
Running and Exporting your Project
Now that we are done setting up the project, it’s time to run our scrape job.
Click on the “Get Data” button in the left sidebar, then click on the “Run” button to run your scrape. For longer projects, we recommend doing a Test Run first to verify that your data will be formatted correctly. Once the scrape job is completed, you will be able to download all the information you’ve requested as a handy spreadsheet or as a JSON file.
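Once you have the JSON file, you may want to turn the joined story text back into separate paragraphs. The Python sketch below shows one way to do that. Treat the layout of sample_export as an assumption: the key names mirror the selection names used in this tutorial (Accident_Title, para), but the exact nesting of your own export may differ, and the values are placeholders rather than real scraped data. The helper split_story is also a name invented for this example.

```python
import json

# Placeholder export, NOT real scraped data. The nesting is an assumption;
# check your own JSON file and adjust the keys if needed.
sample_export = """
{
  "Accident_Title": [
    {
      "name": "Placeholder headline",
      "url": "https://example.com/story-1",
      "para": "First paragraph. |Second paragraph. |Third paragraph."
    }
  ]
}
"""


def split_story(record, separator=" |"):
    """Split the joined "para" cell back into a list of paragraphs."""
    parts = record.get("para", "").split(separator)
    # Drop empty pieces so a missing or empty cell yields an empty list.
    return [part.strip() for part in parts if part.strip()]


data = json.loads(sample_export)
for record in data["Accident_Title"]:
    paragraphs = split_story(record)
    print(record["name"], "->", len(paragraphs), "paragraphs")
```

To use it on a real export, replace json.loads(sample_export) with json.load on the opened file you downloaded.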
Kudos 👏🏼👏🏼 ♚ 💪🏾 💯 You have come to the end of this article.
Interested in ParseHub training and free certification, from zero to hero?
Consider the following Parsehub links: