Changelog

Version 1.3.1 (2021-11-22)

  • Supported platforms:
    • macOS (tested on macOS 11 Big Sur and macOS 12 Monterey)
    • Windows (tested on Windows 10 and Windows 11)
    • Linux (.deb, tested on Ubuntu 20.x, Debian GNU/Linux 10, Linux Mint 20)
  • Changed:
    • commands not related to an element are now ignored by Color Elements (i.e., no colouring for commands such as ‘wait’, ‘current URL’, etc.)
    • when saving scrapes, the suffix .yaml is now always appended to the given name, even if the name already appears to contain a suffix; this prevents part of a scrape name containing a dot but no suffix from being replaced by the suffix (a sketch of this behaviour appears at the end of this version’s notes)
    • updated the user agents for mobile emulation (Nexus 5 and Pixel 2 XL); this prevents the warning ‘This web browser is no longer supported.’ that previously appeared on some websites
  • Fix:
    • the media directory is now also created for ‘download file from a link given in {arg}, filename incl. URL + .jpg’
    • handled the event being None (mostly resulting from a forced quit), which previously caused an error
    • corrected the background of the window opened by Save This Scrape, which was invisible when there were 12 or fewer saved scrapes
    • corrected the scraping process for the combination of scroll-first with scroll-to-top
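
  Example (sketch): the suffix handling described above can be pictured in a few lines of Python; this is an illustration rather than OsiScraper’s actual code, and keeping an existing ‘.yaml’ suffix as-is is an assumption:

      def scrape_filename(name):
          # Always append '.yaml' instead of replacing anything after a dot,
          # so 'shop.v2' becomes 'shop.v2.yaml' rather than 'shop.yaml'.
          # Deduplicating an existing '.yaml' is an assumption of this sketch.
          return name if name.endswith('.yaml') else name + '.yaml'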

Version 1.3.0 (Plus version: 2021-10-29, Standard version: TBA)

  • Supported platforms:
    • macOS (tested on macOS 11 Big Sur and macOS 12 Monterey)
    • Windows (tested on Windows 10 and Windows 11)
    • Linux (.deb, tested on Ubuntu 20.x, Debian GNU/Linux 10, Linux Mint 20)
  • New:
    • added a new tab Log in the main window, making it easy to follow the scraping process. It shows up to the 1000 most recent entries from the session log and the scrape logs. [Plus version: possibly extended by extra logs from specialised scrapers, including, e.g., a part of the scraped data such as the media footprint.]
    • added an initial recipe, currently including all ‘click’ commands (may be enhanced in future versions). It is similar to the before-scrape recipe but is executed only once, deals only with elements present when the recipe starts, and clicks each element only once; the before-scrape recipe, in contrast, acts in a loop, checking and (repeatedly) clicking any elements that (still) correspond to the given selectors
    • [Plus:] added (optional) output in SQL
    • in XLSX output, column widths are now set automatically based on each column’s content
    • added command ‘extract text, combine multiple items using {arg}’ as a variant of the existing command ‘extract text’ that combines possible multiple entries using the given separator
    • added commands ‘click and wait’ (waiting only when the specified element has been found and clicked) and ‘click if contains’ (applied only when the text of the to-be-clicked element matches the given text or regex)
    • added command ‘wait randomly between {arg} seconds’ (with the given min/max time); this is an extension of the existing command ‘wait randomly between 1 and {arg} seconds’, which uses a fixed minimum waiting time of 1 second
    • added command ‘wait until visible’ with a given timeout; it may be useful, e.g., before commands ‘hover’ or ‘click’. Please keep in mind that while the visibility of an element is usually an important signal concerning the progress of loading the data, an additional wait may still be needed because visible elements are not always immediately interactable.
    • added command ‘wait until stale’ (experimental)
    • added command ‘scroll into view – align to center’ as an addition to the commands ‘scroll into view – align to the top’ and ‘scroll into view – align to the bottom’
    • added command ‘random sleep {arg}’; argument syntax: 10_30s,5_10 means take a 10-30 s sleep on a randomly chosen call between the 5th and 10th call from now on (a sketch of this syntax appears at the end of this version’s notes)
    • added command ‘random scroll down {arg}’; argument syntax: -10_20px,3_7 means scroll by a randomly chosen amount between 10 px (up) and 20 px (down) on a randomly chosen call between the 3rd and 7th call from now on. Further, it is possible to specify a random sleep to be taken directly after the scroll: for example, the argument -10_20px,3_7,1_2s adds a random sleep of 1-2 seconds.
      • Tip: both ‘random sleep’ and ‘random scroll’ can be useful for, e.g., social media scrapes from sites using anti-bot mechanisms.
      • Limitation: each of these two commands should be used at most once per recipe; if used twice (or more often), the extra calls do not reset the randomly chosen values but only increase the call counter, so the effect is roughly that of dividing the number of cycles by 2 (or by the number of calls per recipe).
    • implemented downloading of background images from the style attribute of elements using the new commands ‘download background image from a link in the element’s style-attribute’ and ‘download background image from the element’s style, filename incl. URL’; these commands accept a regex to shorten the filenames (see the background-image sketch at the end of this version’s notes)
    • commands ‘download file from a link given in attribute {arg}, filename incl. URL’ and ‘download file from a link given in {arg}, filename incl. URL + .jpg’ now also accept a regex to shorten the filenames
    • command ‘scroll down by {arg} pixels’ is now also applicable to scrollable elements other than <body>
    • in all wait-commands, floats (non-integer numbers) are now allowed as arguments
    • added an option to disable SSL verification when downloading files from a server where the SSL-certificate verification fails (for example, in case of an incomplete certificate chain with a missing intermediate certificate). Please note that using unverified connections is generally strongly discouraged: without valid SSL certificates, the connection cannot be validated, so we cannot know for sure whether the website is who it claims to be (see the SSL sketch at the end of this version’s notes). There are various SSL checkers where you can analyse the SSL certificates of a site you are interested in, e.g., https://www.ssllabs.com/ssltest/analyze.html and https://www.digicert.com/help/
    • added option ‘This time, start at the current place on the site’ to
      • skip the check whether the current URL is identical to the one given in the OsiScraper dashboard (and the return to the given URL if it isn’t),
      • skip the initial recipe, and
      • skip scrolling to the starting position (i.e., the top/bottom of the site).
      • This option relates only to the current run and is therefore not included in recipes. In the GUI, it is shown in red when switched on (a non-standard situation).
    • added regex to the initial and before-scrape recipes, making it possible to use, e.g., ‘click if contains’
  • Changed:
    • the behaviour of the ‘click’ command in the before-scrape recipe has been redefined: all ‘click’ commands now act in a loop, searching for and (repeatedly) clicking any elements that (still) correspond to the given selectors.
      • Typical usage: exposing extra content via Show more comments, View replies, Read more, etc., where clicking some of these elements may reveal other, previously hidden elements that also have to be clicked to reveal the complete to-be-scraped content.
      • Please use the initial recipe (also added in this version; see above) for ‘click’ commands that need to be executed only once (e.g., opening a popup containing the to-be-scraped data).
    • elements to be clicked while executing the before-scrape recipe are now scrolled into view prior to clicking (this increases the success rate of clicking and thus retrieves more data)
    • scrolling at various moments (e.g., scrolling to a starting position, scrolling to reveal more content, etc.; the exact action may depend on the responsiveness of the scrollable element) now better reflects the situation, taking into account the scroll direction (top-to-bottom vs. bottom-to-top) and the scrollable element (<body> vs. another element)
    • a short ‘wait until visible’ is now automatically applied before the commands ‘element screenshot’ and ‘hover’.
  • Fix:
    • fixed a double suffix in the filenames of downloaded base64 images (these images are currently saved as PNG)
    • caught the exception thrown when the scrollable element is not found
    • if the scrollable element is not <body>, its scroll height is now taken into account to recognise the end of scrolling
    • ‘hover-away’ now checks the current position to prevent the mouse pointer from going out of bounds
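
  Example (sketch): the argument syntax of ‘random sleep {arg}’ can be pictured in a few lines of Python. This illustrates the semantics described above; it is not OsiScraper’s actual parser:

      import random
      import time

      class RandomSleep:
          # For arg='10_30s,5_10': take a 10-30 s sleep on one randomly chosen
          # call between the 5th and 10th call from now on. The values are
          # chosen once; later calls only increase the counter (cf. the
          # limitation noted above).
          def __init__(self, arg):
              sleep_spec, call_spec = arg.split(',')
              low, high = (float(x) for x in sleep_spec.rstrip('s').split('_'))
              first, last = (int(x) for x in call_spec.split('_'))
              self.chosen_call = random.randint(first, last)
              self.duration = random.uniform(low, high)
              self.calls = 0

          def on_call(self):
              self.calls += 1
              if self.calls == self.chosen_call:
                  time.sleep(self.duration)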
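
  Example (sketch): extracting a background-image URL from an element’s style attribute typically comes down to a regex such as the one below; an illustration assuming a Selenium WebElement, not OsiScraper’s actual code:

      import re

      def background_image_url(element):
          # The style attribute is expected to look like:
          #   background-image: url('https://cdn.example.com/img/abc.jpg');
          style = element.get_attribute('style') or ''
          match = re.search(r"""url\(\s*['"]?([^'")]+)['"]?\s*\)""", style)
          return match.group(1) if match else None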
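
  Example (sketch): what disabling SSL verification means at the HTTP-client level, illustrated with Python’s requests library (an assumed stack; the URL is a placeholder):

      import requests
      import urllib3

      # With verify=False the certificate chain is not checked, so the
      # connection is exposed to man-in-the-middle attacks -- hence the
      # strong discouragement above.
      urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
      response = requests.get('https://files.example.com/image.jpg',
                              verify=False, timeout=30)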

Version 1.2.0 (2021-09-07)

  • Supported platforms: macOS 11, Windows 10, Linux (.deb, tested on Ubuntu 20.x, Debian GNU/Linux 10, Linux Mint 20)
  • New:
    • [Plus:] added the possibility to execute multi-URL scrapes using the same scraping recipe, with all the data written into one common output file
    • in Scraping recipe, added an (optional) parameter Regex for entering regular expressions to be used with commands returning textual output
    • added an option to save output as XLSX only (please note that while scraping, CSV is still saved as a backup and is only removed after the XLSX has been saved at the end)
    • added command ‘reload/refresh the original URL’; this may be useful, e.g., in combination with the command ‘click’, which may change the site’s content
    • added command ‘current URL’; this may be useful, e.g., for multi-URL scrapes, where a column containing the current URL helps to distinguish which scraped data are related to which URL
    • added commands ‘page source: extract all the data matching regex’ and ‘element’s HTML: extract all the data matching regex’. If no regex is specified, these commands extract and output the whole page source or the whole element’s HTML, respectively; therefore, in most cases it is desirable to specify a regex to extract only the relevant content (see the sketch at the end of this version’s notes)
    • added ‘Stop scraper’ to the menu and the tray menu; this is especially useful for intensive scrapes that open many new tabs in rapid succession, bringing the browser in front of OsiScraper’s main window
  • Changed:
    • enhanced the code to limit the occurrence of stale main elements while scraping
  • Fix / macOS:
    • worked around a macOS bug causing a slow start on some Macs under some circumstances
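
  Example (sketch): the regex-extraction commands conceptually apply a regular expression to the page source; a minimal stand-alone illustration using Python and Selenium (an assumed stack; the URL and pattern are placeholders):

      import re
      from selenium import webdriver

      driver = webdriver.Chrome()
      driver.get('https://example.com')

      # Without a regex, the whole page source would be output; a regex keeps
      # only the relevant matches -- here, for example, e-mail-like strings:
      emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', driver.page_source)
      driver.quit()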

Version 1.1.0 (2021-07-24)

  • Supported platforms: macOS 11, Windows 10, Linux (.deb, tested on Ubuntu 20.x, Debian GNU/Linux 10, Linux Mint 20)
  • New:
    • added (optional) output in XLSX format along with the default CSV
    • added command ‘download base64-encoded image from attribute {arg}’ (for image data included directly in an attribute instead of in a file; downloads the data and saves it as PNG; see the sketch at the end of this version’s notes)
    • added command ‘download file from the link given in {arg}, filename incl. URL + .jpg’ (for image files without a suffix, or with a complex or too long URL; adds the suffix .jpg to the filename)
    • added command ‘take screenshot of the element’
    • added command ‘scroll the element into view, align to the bottom’; while the already existing command ‘scroll the element into view’ aligns the upper edge of the element to the upper edge of the scrollable ancestor (where it can be partially hidden behind a header), this new command aligns the lower edge of the element to the bottom; this may be useful, e.g., in combination with hover, to extract information that would otherwise be inaccessible
    • added command ‘scroll down by {arg} pixels’; scrolls down for positive values and up for negative values of the argument
    • added command ‘hover away from the element’; this is a complement to ‘hover over the element’. Example: ‘hover over the element’ can be used to show extra information in a popup window, followed by ‘hover away from the element’ to let the popup disappear and make other elements accessible again (see the hover sketch at the end of this version’s notes)
    • command ‘does the element have class {arg}?’ now also works with ‘’ (empty string / no argument given), meaning ‘does the element have no classes at all?’
    • for load-more buttons (i.e., load-more button, next-page button), added another type of click to deal with JavaScript-driven buttons (recognised automatically)
  • Changed:
    • the scrape recipe is now restored after each scrape; this reverts any changes made automatically when scraping both text and URL for some elements, so the scrape can be repeated with different preferences for scraping text / URL / both (only relevant for elements with no specified command)
    • while scraping, elements are now styled after being scraped successfully
    • in About (Ctrl+Shift+A), added link to Acknowledgements
  • Fix:
    • caught the exception thrown when the main element is missing; a warning is now shown
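
  Example (sketch): decoding a base64-encoded image taken from an attribute and saving it as PNG; a minimal illustration in Python, not OsiScraper’s actual code:

      import base64
      import re

      def save_base64_image(data_uri, path='image.png'):
          # data_uri comes from an attribute such as src, e.g.
          # 'data:image/png;base64,iVBORw0KGgo...'
          payload = re.sub(r'^data:image/[\w+]+;base64,', '', data_uri)
          with open(path, 'wb') as f:
              f.write(base64.b64decode(payload))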
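
  Example (sketch): the align-to-bottom scroll and the hover / hover-away pair, illustrated with Python and Selenium (an assumed stack; the URL and selector are placeholders):

      from selenium import webdriver
      from selenium.webdriver.common.action_chains import ActionChains
      from selenium.webdriver.common.by import By

      driver = webdriver.Chrome()
      driver.get('https://example.com')
      element = driver.find_element(By.CSS_SELECTOR, '.price')  # hypothetical

      # Align the element's lower edge to the bottom of the viewport
      # (scrollIntoView(false)), keeping it clear of a sticky header:
      driver.execute_script('arguments[0].scrollIntoView(false);', element)

      # Hover over the element to open its popup...
      ActionChains(driver).move_to_element(element).perform()
      # ...then hover away (a small move off the element) so that the popup
      # disappears and the elements behind it become accessible again:
      ActionChains(driver).move_by_offset(0, 150).perform()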

Version 1.0.1 (2021-06-26)

  • Supported platforms: macOS 11, Windows 10, Linux (.deb, tested on Ubuntu 20.x, Debian GNU/Linux 10, Linux Mint 20)
  • New:
    • added menu item File -> Open Logfile… for quick access to logfiles
    • while scraping, main elements that contain no value to scrape, or whose content merely duplicates a previous one, are now visualised differently; this helps to detect sub-optimally set main-element selectors
    • added checkbox Don’t scroll: an option to not scroll at all while scraping (except for any scrolls explicitly given in the recipes). This option is mutually exclusive with both Scroll bottom-to-top and Scroll to the end before scraping
    • (experimental) added command pair ‘open the link given in attribute {arg} in a new tab’ / ‘close tab’
    • added command ‘hover over the element’ to be able to extract more data
    • added keyboard shortcuts: Ctrl+Q for Quit and Ctrl+Shift+A for OsiScraper -> About OsiScraper
  • Changed:
    • suppressed empty rows in the output file (e.g., in case of a too broad definition of the main element)
    • for Scroll to the end before scraping, the default value True is now also used when opening saved scrapes that do not specify this value
  • Fix:
    • corrected the default command value in the before-scrape recipe (this bug also resulted in empty saved scrapes when that default value was used)
    • corrected the visibility of the input fields for load-more content after Open saved scrape (in case of no load-more/next-page button, the visibility of the respective input fields got mixed up)
    • corrected the behaviour after scraping has finished while an extra window is open (if scraping finished while the main window was not active, the event got lost and OsiScraper could not finish that scrape properly)
    • corrected a bug where ‘scroll_first=false’ was ignored when reading a saved scrape
    • changed the keyboard shortcut for Visit Community from Ctrl+C (reserved for Copy) to Ctrl+Y
    • suppressed duplicate rows in the output file (see the sketch at the end of this version’s notes, which also covers the empty-row suppression above)
  • macOS:
    • the menu text colour for the light appearance is now darker to improve readability
  • Windows:
    • added the OsiScraper icon to the taskbar
  • Windows/Linux:
    • it is now possible to switch to OsiScraper using Alt+Tab
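
  Example (sketch): the empty-row and duplicate-row suppression can be pictured as a simple filter over the output rows; an illustration, not OsiScraper’s actual code:

      def clean_rows(rows):
          # Drop rows that are entirely empty, and rows identical to a row
          # already seen (a simplified model of the clean-up described above).
          seen = set()
          for row in rows:
              key = tuple(row)
              if any(str(cell).strip() for cell in row) and key not in seen:
                  seen.add(key)
                  yield row

      rows = [['a', '1'], ['', ''], ['a', '1'], ['b', '2']]
      print(list(clean_rows(rows)))  # [['a', '1'], ['b', '2']]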

Version 1.0.0 (2021-06-07) – First public release

  • Supported platforms: macOS 11, Windows 10, Linux (.deb, tested on Ubuntu 20.x, Debian GNU/Linux 10)