What is web scraping?

OsiScraper makes it possible to extract, standardize and store information such as text, URLs, videos and images from web pages.

Scraping converts unstructured data (usually HTML) into structured data that can be stored and analyzed locally, for example in XLSX, CSV or SQL files.
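As a rough illustration of that conversion, the sketch below turns a small HTML fragment into CSV rows using only the Python standard library. It is not how OsiScraper itself works. The HTML snippet, the class names (product, name, price) and the helper names ProductParser and html_to_csv are all assumptions made up for this example.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment; real pages are far messier.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Desk lamp</span><span class="price">19.99</span></li>
  <li class="product"><span class="name">Notebook</span><span class="price">4.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of the name and price spans for each product item."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per product found so far
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            # A new product starts: open an empty record for it.
            self.rows.append({"name": "", "price": ""})
        elif tag == "span" and css_class in ("name", "price") and self.rows:
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside a name or price span.
        if self._field:
            self.rows[-1][self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def html_to_csv(html):
    """Return the extracted rows and the same rows rendered as CSV text."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(parser.rows)
    return parser.rows, buffer.getvalue()


rows, csv_text = html_to_csv(SAMPLE_HTML)
print(csv_text)
```

The parser plays the role that a recipe plays in a scraper: it states which elements of the page hold the values of interest, and everything else on the page is ignored.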

Because scraping stores the web page information offline on the user’s computer in a spreadsheet-like format, the information becomes easy to search. The data also remains available later, even if the information is no longer available on the web page.

To achieve this, a ‘recipe’ (script) must be created for each web page. OsiScraper includes many features that help you scrape even the most advanced websites.

It is important that the user knows how to find the correct elements in the source code of the page. Once the correct elements have been found, the recipe can be saved locally (Save This Scrape) as a YAML file.

This file can easily be distributed and loaded within the organization (Open Saved Scrape), so that users do not each have to compose the same recipe again.


Example

Suppose you have a recipe for Amazon. When opened via the ‘Open Saved Scrape’ button, this script tells OsiScraper exactly what to do the moment you place an Amazon product URL in the OsiScraper dashboard.

You just need to press ‘Start Browser’ and ‘Start Scraper’ and sit back. All product information, including description, ratings and prices, is then stored as plain text in a CSV, XLSX and/or SQL file. The media (product images) are stored in a separate folder. All data is stored locally on the user’s computer.
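To give an idea of why locally stored, structured data is easy to search, here is a minimal sketch that puts a few product records into an SQLite database and looks one up by keyword. The records, the table layout and the function name store_and_search are assumptions for illustration only and do not reflect OsiScraper's actual output format.

```python
import sqlite3

# Hypothetical records, standing in for what a scrape might produce.
records = [
    ("Desk lamp", "19.99", "4.5"),
    ("Notebook", "4.50", "4.8"),
]


def store_and_search(rows, keyword):
    """Store (name, price, rating) rows and return those whose name contains keyword."""
    conn = sqlite3.connect(":memory:")  # use a file path instead to keep the data on disk
    try:
        conn.execute("CREATE TABLE products (name TEXT, price TEXT, rating TEXT)")
        conn.executemany("INSERT INTO products VALUES (?, ?, ?)", rows)
        conn.commit()
        cursor = conn.execute(
            "SELECT name, price FROM products WHERE name LIKE ? ORDER BY name",
            (f"%{keyword}%",),
        )
        return cursor.fetchall()
    finally:
        conn.close()


matches = store_and_search(records, "lamp")
print(matches)
```

Values are kept as text here, in line with the plain-text storage described above; converting prices and ratings to numbers would be a separate step.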


Built-in components and the available recipe commands make it possible to mimic human behavior when scraping. This helps prevent the web page from recognizing OsiScraper as a bot.

OsiScraper contains many advanced functions, so it can scrape information that other scrapers cannot reach. Examples include mobile simulation and extracting information from the source code of the page, such as EXIF information or a Facebook ID.
