Example usage

To use thatscraper in a project:

Fill simple form

Forms are usually form elements, so you can normally select the HTML form structure using “tag name”. In the example below, however, the form is a div with the class name “form”. Always inspect the page to check the structure of the element you want to select.

import time
import thatscraper

crawler = thatscraper.Crawler()
# open page
crawler.goto("https://phptravels.com/demo/")
# get form wrapper
form_element = crawler.element("form", "class name")
# form fields
elements = crawler.children_of(form_element, "input", "tag name")
# data to fill
data = {
    'first name': 'John',
    'last name': 'Doe',
    'bus name': 'Joe',
    'email': 'j.doe@gmail.com'
}
# filling
for element, value in zip(elements, data.values()):
    crawler.send_to_element(element, value)
# wait long enough so you can check the result
time.sleep(5)

# always quit the driver
crawler.quit()
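
The loop above pairs each input with a value purely by position, and zip stops silently at the shorter sequence, so a page with more or fewer inputs than expected goes unnoticed. Below is a minimal sketch of a stricter pairing. It uses plain strings as stand-ins for the web elements, and pair_fields is a hypothetical helper, not part of thatscraper:

```python
def pair_fields(elements, data):
    """Pair page inputs with form values by position.

    Raises ValueError when the counts differ, instead of letting zip()
    silently drop the extra inputs or values.
    """
    values = list(data.values())  # dicts keep insertion order (Python 3.7+)
    if len(elements) != len(values):
        raise ValueError(
            f"found {len(elements)} inputs but {len(values)} values to fill"
        )
    return list(zip(elements, values))


# Strings stand in for the elements that children_of() would return.
fake_inputs = ["input-1", "input-2", "input-3", "input-4"]
data = {
    "first name": "John",
    "last name": "Doe",
    "bus name": "Joe",
    "email": "j.doe@gmail.com",
}
pairs = pair_fields(fake_inputs, data)
for element, value in pairs:
    print(element, "<-", value)
```

In a real run, each pair could then be passed to crawler.send_to_element as in the example above.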

Extract a table

To scrape, collect, or handle data such as text, tables, or images, thatscraper comes with the extractor module, which works with elements or addresses and returns the data in the form you need. Here is an example of obtaining a table as a pandas.DataFrame object:

import time
import thatscraper as ts

crawler = ts.Crawler(browser='chrome')

crawler.goto("https://www.techlistic.com/p/demo-selenium-practice.html")
customers_table = ts.extractor.Table(crawler, "customers", "id")

# table as pandas dataframe (see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
print(customers_table.data[0])

# compare results
time.sleep(10)
crawler.quit()

       Company           Contact  Country
        Google      Maria Anders  Germany
          Meta   Francisco Chang   Mexico
     Microsoft     Roland Mendel  Austria
Island Trading     Helen Bennett       UK
         Adobe   Yoshi Tannamuri   Canada
        Amazon  Giovanni Rovelli    Italy
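
To make the idea of turning table markup into rows concrete, here is a rough standard-library sketch. It is not how thatscraper's Table is implemented (the source does not show that); the TableRows class and the sample markup are assumptions made for illustration, with values borrowed from the output above:

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hand-written sample shaped like the "customers" table.
sample_html = """
<table id="customers">
  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
  <tr><td>Google</td><td>Maria Anders</td><td>Germany</td></tr>
  <tr><td>Meta</td><td>Francisco Chang</td><td>Mexico</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
header, *records = parser.rows
print(header)
for record in records:
    print(dict(zip(header, record)))
```

If pandas is available, pandas.DataFrame(records, columns=header) would turn these rows into a dataframe.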

Get items from lists

As stated, the extractor module is meant for retrieving information. In the example above, Table gets the element and converts its HTML into a dataframe. Here is an example that obtains the items of a list as HTML.

import thatscraper as ts

crawler = ts.Crawler(browser='chrome', headless=True)

crawler.goto("https://www.techlistic.com/p/demo-selenium-practice.html")

items = ts.extractor.UnorderedList(crawler, "(//div[@dir='ltr'])[7]", "xpath")

for item in items:
    li = ts.extractor.html(item)
    print(li)
crawler.quit()

<li><a href="https://www.techlistic.com/2017/02/how-to-handle-dynamic-web-table-in.html">Verify that there are only 4 structure values present in the table with Selenium</a></li>
<li><a href="https://www.techlistic.com/2017/02/how-to-handle-dynamic-web-table-in.html">Verify that 6th row of the table (Last Row) has only two columns with Selenium</a></li>
<li><a href="https://www.techlistic.com/2017/02/how-to-handle-dynamic-web-table-in.html">Find the tallest structure in the table with Selenium</a></li>
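
Once an item is available as an HTML string like the ones printed above, its link target and text can be pulled out with the standard library. The following is only a sketch: LinkCollector is a hypothetical helper, not part of thatscraper, and the sample string is copied from the output above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) for every <a href="..."> element fed in."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the link being read, or None
        self._text = []    # text chunks inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


item_html = (
    '<li><a href="https://www.techlistic.com/2017/02/'
    'how-to-handle-dynamic-web-table-in.html">'
    "Find the tallest structure in the table with Selenium</a></li>"
)

collector = LinkCollector()
collector.feed(item_html)
for href, text in collector.links:
    print(text, "->", href)
```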

Download images

When searching a website, prefer placing the query in the URL over typing it into an input element. This helps avoid reCAPTCHA and other bot-detection methods.
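
Queries with spaces or special characters need to be percent-encoded before they go into the URL. Here is a minimal sketch using the standard library; search_url is a hypothetical helper, and it assumes the site accepts a percent-encoded query as a path segment:

```python
from urllib.parse import quote


def search_url(query, base="https://www.pexels.com/search/"):
    """Return a search URL with the query percent-encoded into the path."""
    # safe="" also encodes "/", so the query cannot add path segments
    return f"{base}{quote(query.strip(), safe='')}/"


print(search_url("cat"))
print(search_url("black cat"))
```

The example below uses a one-word query, so no encoding is needed there.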

import re
import thatscraper as ts

crawler = ts.Crawler()

# let's get some photos of cats
query = "cat"
crawler.goto(f"https://www.pexels.com/search/{query}/")


grid = crawler.element_id("-")
images = crawler.children_of(grid, "//article/a/img", "xpath")

# the first 6 results
files = []
for image in images[:6]:
    img_url = image.get_attribute('src')
    # get the filename from url using regex
    result = re.findall(r"./(.*?)\?", img_url)
    img_filename = result[0].split('/')[-1]
    files.append(img_filename)
    ts.extractor.download(img_url, img_filename)
crawler.quit()

# just for you to check out the result:
from IPython.display import HTML, display

def make_html(file_name):
    img_element = (
        f'<img src="{file_name}"'
        + ' style="display:inline;margin:1px;width:100px;"/>'
    )
    return img_element

images = ''.join([make_html(file) for file in files])
display(HTML(images))
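
The regular expression in the download loop only matches when the image URL contains a "?", so an address without a query string would leave result empty and result[0] would raise an IndexError. As an alternative sketch, the file name can be taken from the URL path with the standard library. filename_from_url is a hypothetical helper, and the example addresses are made up:

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download"):
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = unquote(urlsplit(url).path)
    name = PurePosixPath(path).name
    return name or default


# Made-up image addresses, with and without resize parameters.
with_query = "https://images.example.com/photos/12345/cat-photo.jpeg?auto=compress&w=500"
without_query = "https://images.example.com/photos/12345/cat-photo.jpeg"

print(filename_from_url(with_query))
print(filename_from_url(without_query))
print(filename_from_url("https://images.example.com/"))
```

The default argument covers URLs that have no file name at all.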

Click on buttons

The Crawler class has two methods for clicking buttons or any other clickable element:

click_element: receives a Selenium WebElement to click on.
click: receives the value and attribute of the element to be selected and clicked.

import time
import thatscraper

crawler = thatscraper.Crawler()

url = "https://unixpapa.com/js/testmouse.html"
crawler.goto(url)

parent = crawler.element("//tbody/tr/td", "xpath")
# if you inspect the page, you'll see that none of the elements
# are buttons, but two anchors and one image. Also,
# there's a non-clickable element: <br>. We can skip
# it by making sure the element has an 'onclick' handler on it.
buttons = crawler.children_of(parent, ".//*", "xpath")
for button in buttons:
    if "onclick" in thatscraper.extractor.html(button):
        crawler.click_element(button)
        time.sleep(1)

time.sleep(2)
crawler.quit()
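
The substring test above would also match an element whose text or children merely mention "onclick". A slightly stricter check looks only at the attributes of the element's own opening tag. The following is a sketch using the standard library; has_onclick is a hypothetical helper, not part of thatscraper, and the HTML strings are made up:

```python
from html.parser import HTMLParser


class _FirstTag(HTMLParser):
    """Remember the attributes of the first start tag seen."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if self.attrs is None:
            self.attrs = dict(attrs)


def has_onclick(outer_html):
    """True when the element's own opening tag has an onclick attribute."""
    parser = _FirstTag()
    parser.feed(outer_html)
    return parser.attrs is not None and "onclick" in parser.attrs


print(has_onclick('<a href="#" onclick="return false">click here</a>'))
print(has_onclick("<br>"))
print(has_onclick("<span>the onclick handler is documented below</span>"))
```

Since the elements are Selenium web elements, asking the element directly with get_attribute('onclick') may be another option.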