Working with ParseHub, BoW, and TF-IDF
This article walks through a text classification task on online game reviews. Using ParseHub's web scraping tool, reviews of Minecraft were scraped from the Common Sense Media website (https://www.commonsensemedia.org/game-reviews). Each review was labeled as having been written by a teen or a kid. The scraped data was then cleaned using preprocessing techniques such as stop-word removal, punctuation removal, and tokenization. Following that, we created two sets of embeddings: BoW and TF-IDF. We then used a machine learning model to classify each review as written by a teen or a kid. Finally, the two embeddings were evaluated and compared in terms of performance.
ParseHub is a free web scraping tool with a lot of features. Extracting data with ParseHub is as simple as clicking on the information you require in the web page rendered inside the ParseHub interface.
A customer review of ParseHub:
We were one of the first customers to sign up for a paid ParseHub plan. We were initially attracted by the fact that it could extract data from websites that other similar services could not (mainly due to its powerful Relative Select command). The team at ParseHub were helpful from the beginning and have always responded promptly to queries. Over the last few years we have witnessed great improvements in both functionality and reliability of the service. We use ParseHub to extract relevant data and include it on our travel website. This has drastically cut the time we spend on administering tasks regarding updating data. Our content is more up-to-date and revenues have increased significantly as a result. I would strongly recommend ParseHub to any developers wishing to extract data for use on their sites.
— David Mottershead, Owner at Visit North West
Some features of ParseHub include: a no-code web scraper, data extraction from millions of web pages, IP rotation, scheduled collection, regular expressions, API and webhook integration, and scraped data in JSON and CSV/Excel format for analysis.
NB: To learn ParseHub and get a free certification from ParseHub, visit their site. Happy learning!
How to scrape the Minecraft game reviews using ParseHub
- First, make sure to download and install ParseHub. We will use this web scraper for this project.
- Open ParseHub, click on "New Project," and use the URL of the Minecraft game reviews results page. On the landing page, paste the URL into the field "Enter a website you'd like to extract data from," then click on "Start a project from this URL." The page will now be rendered inside the ParseHub app.
- Once the site is rendered, click on the reviews, i.e. the (parent) reviews, 99 in all. Next, click on the plus (+) sign beside the page selection, choose the Select command, and click on an age. The rest of the ages will be highlighted in yellow. Click on another age to select the remaining ages on the page. The item you've clicked will turn green to indicate that it's been selected.
- On the left sidebar, rename your selection to Reviewers_age. You will notice that ParseHub has now extracted the age.
- Click the PLUS(+) sign next to the Reviewers_age selection and choose the Relative Select command on the left sidebar.
- Using the Relative Select command, click on the first age (say, age 6+) on the page and then on its review. You will see an arrow connecting the two selections.
Getting more information from the web page
- Repeat steps 3 through 4 to extract any other reviewer information, such as the rating. Make sure to rename your new selections accordingly.
On the results page, we've now selected all of the information we want to scrape from the site. Our project now looks like this:
For any project, you may want to scrape numerous pages of data. So far, we've only scraped the first page of Minecraft review results. Let's set up ParseHub to navigate through the next 20 review pages.
- On the left sidebar, click on the PLUS(+) sign next to the page selection and choose the Select command.
- Then select the Next page link at the bottom of the Minecraft reviews page. Rename the selection to next_button.
- By default, ParseHub will extract the text and URL from this link, so expand your new next_button selection and remove these two commands.
- Now, click on the PLUS(+) sign of your next_button selection and use the Click command.
- A pop-up will appear asking if this is a "Next" link. Click Yes and enter the number of pages you'd like to navigate to. In this case, we will scrape 20 additional pages.
As previously stated, the classification task is to determine whether a review was written by a kid or a teen.
First, I imported the necessary libraries for the project.
Second, the scraped Minecraft game reviews dataset was saved to Google Drive and then read from Drive into Colab for analysis.
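A minimal sketch of this loading step, assuming the ParseHub export is a CSV file. The `load_reviews` helper and the column names in the usage note are hypothetical, not taken from the original notebook; the Drive mount is shown only as a comment because it runs only inside Colab.

```python
import pandas as pd

# In Colab, Google Drive is usually mounted first (not run here):
#   from google.colab import drive
#   drive.mount("/content/drive")


def load_reviews(csv_path):
    """Read the scraped reviews CSV and drop rows that are completely empty."""
    df = pd.read_csv(csv_path)
    return df.dropna(how="all").reset_index(drop=True)
```

Calling `load_reviews` with the path of the exported file on the mounted drive would return a DataFrame with one row per scraped review.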
Third, a simple function gets the first word from the author's age column. This helps label each review as written by either a kid or a teen.
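Such a function could be sketched as follows. This assumes the age column holds strings such as "Teen, 13 years old" or "Kid, 10 years old"; that format and the name `first_word` are assumptions for illustration.

```python
import string


def first_word(age_text):
    """Return the first word of an age string, lowercased and without punctuation."""
    words = str(age_text).split()
    if not words:
        return ""
    return words[0].strip(string.punctuation).lower()


print(first_word("Teen, 13 years old"))  # teen
print(first_word("Kid, 10 years old"))   # kid
```

Stripping the trailing comma and lowercasing keeps the labels consistent even if the scraped text varies slightly in punctuation or case.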
The final dataset looks like this:
For our classification, only three columns are required: the author's age, the feature description (a concatenation of the author's age and the review), and the class (who wrote the review).
The three columns were created in code.
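One way these three columns might be built with pandas is sketched below. The sample rows are made up, and the column names `Reviewers_age`, `review`, `feature_description`, and `class` are assumptions rather than the names used in the original notebook.

```python
import pandas as pd

# Hypothetical sample rows standing in for the scraped data.
df = pd.DataFrame({
    "Reviewers_age": ["Teen, 13 years old", "Kid, 10 years old"],
    "review": ["Great for building with friends.", "I love the animals!"],
})

# Class label: the first word of the age column ("teen" or "kid").
df["class"] = df["Reviewers_age"].str.split().str[0].str.strip(",").str.lower()

# Feature description: the author's age concatenated with the review text.
df["feature_description"] = df["Reviewers_age"] + " " + df["review"]

# Keep only the three columns needed for classification.
final = df[["Reviewers_age", "feature_description", "class"]]
print(final)
```

Because the age text itself contains the word "Teen" or "Kid", a model trained on this concatenated feature can pick up the label directly from it, which is worth keeping in mind when reading the evaluation scores.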
Stop-words and punctuation that would impair classification must be removed from the text so that the model can classify it effectively.
There are several NLP modules available for removing stop-words and punctuation, for example NLTK (its stop-words list and stem module) and gensim.
The following steps were taken to remove stop-words and punctuation:
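A minimal sketch of this kind of cleaning, using only the Python standard library. The small `STOP_WORDS` set is an illustrative stand-in; NLTK's `stopwords.words("english")` provides a fuller list. The function name `preprocess` is hypothetical.

```python
import string

# Small illustrative stop-word set (an assumption; a library list would be larger).
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "this", "to", "of",
              "i", "my", "in", "for", "with"}


def preprocess(text):
    """Lowercase, remove punctuation, tokenize on whitespace, and drop stop-words."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in STOP_WORDS]


print(preprocess("This game is fun, and I play it with my friends!"))
# ['game', 'fun', 'play', 'friends']
```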
Model creation and Evaluation of Model Embeddings
BoW (Bag of Words) and TF-IDF were the two embeddings used.
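To make the difference between the two concrete, here is a small pure-Python sketch on two made-up tokenized reviews. It uses the textbook definitions (raw counts for BoW; term frequency times log(N / document frequency) for TF-IDF). Libraries such as scikit-learn apply a smoothed variant, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical tokenized reviews.
docs = [["fun", "game", "fun"], ["hard", "game"]]

# Vocabulary: every distinct token, in a fixed (sorted) order.
vocab = sorted({tok for doc in docs for tok in doc})


def bow(doc):
    """Bag of Words: how many times each vocabulary term occurs in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]


def tfidf(doc):
    """Textbook TF-IDF: (count / document length) * log(N / document frequency)."""
    n_docs = len(docs)
    vector = []
    for term, count in zip(vocab, bow(doc)):
        doc_freq = sum(term in d for d in docs)
        idf = math.log(n_docs / doc_freq)
        vector.append(count / len(doc) * idf)
    return vector


print(vocab)          # ['fun', 'game', 'hard']
print(bow(docs[0]))   # [2, 1, 0]
print(tfidf(docs[0]))
```

In this toy case "game" appears in every document, so its TF-IDF weight is zero, while BoW still counts it. That down-weighting of common terms is the main practical difference between the two representations.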
The steps for creating the models and evaluating them are outlined in the code below.
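A sketch of such a comparison with scikit-learn is shown below. The article does not name the classifier here, so logistic regression is an assumption, as are the accuracy metric, the split settings, and the tiny made-up dataset, which only demonstrates the workflow.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the real project uses the scraped reviews.
texts = [
    "i love building houses with my friends",
    "the creepers are scary but fun",
    "my mom lets me play on weekends",
    "i like the animals and the pets",
    "redstone circuits teach logic and engineering",
    "survival mode is challenging and rewarding",
    "the multiplayer servers can be toxic sometimes",
    "great sandbox with endless creative potential",
]
labels = ["kid", "kid", "kid", "kid", "teen", "teen", "teen", "teen"]

# Hold out a quarter of the data, keeping both classes in each split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Train the same classifier on each embedding and record its accuracy.
scores = {}
for name, vectorizer in [("BoW", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
    model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

Keeping the classifier and the train/test split identical for both runs means any difference in the scores comes from the embedding alone.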
The complete code for this article can be viewed on GitHub (see the link at the end of the article).
Kudos!!! ♚ 💪🏾 You have come to the end of this article.
A) GitHub notebook with the code: https://github.com/Ayanlola2002/DATA-SCIENCE-PROJECTS/blob/master/BP_NLP_PROJECT1.ipynb