Example usage
To use thatscraper in a project:
Fill a simple form
Usually, forms are form tag elements; therefore you can select the HTML form structure using “tag name”. However, in the example below the form is in a div with the class name “form”. Always inspect the page to check the structure of the element you want to select.
import time
import thatscraper
crawler = thatscraper.Crawler()
# open page
crawler.goto("https://phptravels.com/demo/")
# get form wrapper
form_element = crawler.element("form", "class name")
# form fields
elements = crawler.children_of(form_element, "input", "tag name")
# data to fill
data = {
    'first name': 'John',
    'last name': 'Doe',
    'bus name': 'Joe',
    'email': 'j.doe@gmail.com'
}
# filling: fields and values are paired by position
for element, field in zip(elements, data):
    crawler.send_to_element(element, data[field])
# wait long enough so you can check the result
time.sleep(5)
# always quit the driver
crawler.quit()
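The filling loop pairs input elements with values purely by position: zip walks the elements and the dictionary keys together, dictionaries iterate in insertion order (Python 3.7+), and zip stops silently at the shorter of the two. The sketch below illustrates that pairing without a browser. FakeField and fill are hypothetical names invented for this illustration and are not part of thatscraper; the length check is an added safeguard, not something the library is known to do.

```python
class FakeField:
    """Hypothetical stand-in for a form input element (not a thatscraper class)."""

    def __init__(self, name):
        self.name = name
        self.value = ""


def fill(fields, data):
    """Assign values to fields by position, mirroring the zip-based loop."""
    # zip() would silently drop the extras, so fail loudly on a mismatch.
    if len(fields) != len(data):
        raise ValueError(f"expected {len(data)} fields, found {len(fields)}")
    for field, key in zip(fields, data):
        field.value = data[key]
    return {field.name: field.value for field in fields}


fields = [FakeField(name) for name in ("first", "last", "business", "email")]
data = {
    "first name": "John",
    "last name": "Doe",
    "bus name": "Joe",
    "email": "j.doe@gmail.com",
}
filled = fill(fields, data)
print(filled)
```

If the page contains extra inputs (hidden fields, for example), the positions shift, which is one more reason to inspect the page before relying on order.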
Extract a table
To scrape, collect, or handle data such as text, tables, or images, thatscraper comes with the extractor module, which works with elements or addresses and returns the desired data. Here is an example of obtaining a table as a pandas.DataFrame object:
import time
import thatscraper as ts
crawler = ts.Crawler(browser='chrome')
crawler.goto("https://www.techlistic.com/p/demo-selenium-practice.html")
customers_table = ts.extractor.Table(crawler, "customers", "id")
# table as pandas dataframe (see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
print(customers_table.data[0])
# compare results
time.sleep(10)
crawler.quit()
          Company           Contact  Country
0          Google      Maria Anders  Germany
1            Meta   Francisco Chang   Mexico
2       Microsoft     Roland Mendel  Austria
3  Island Trading     Helen Bennett       UK
4           Adobe   Yoshi Tannamuri   Canada
5          Amazon  Giovanni Rovelli    Italy
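To make the idea of turning table markup into rows concrete, here is a minimal sketch that uses only the Python standard library. It is not how thatscraper implements Table; TableRows is a hypothetical helper, and SAMPLE is a hand-written fragment that mirrors the first rows of the output above.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hand-written sample, assumed to resemble the "customers" table.
SAMPLE = """
<table id="customers">
  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
  <tr><td>Google</td><td>Maria Anders</td><td>Germany</td></tr>
  <tr><td>Meta</td><td>Francisco Chang</td><td>Mexico</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE)
header, *records = parser.rows
print(header)
print(records)
```

With pandas installed, the same rows could be loaded with pandas.DataFrame(records, columns=header).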
Get items from lists
As stated, the extractor module is suited to retrieving information. In the example above, Table gets the element and converts its HTML into a dataframe. Here is an example where we obtain list items in HTML format.
import thatscraper as ts
crawler = ts.Crawler(browser='chrome', headless=True)
crawler.goto("https://www.techlistic.com/p/demo-selenium-practice.html")
items = ts.extractor.UnorderedList(crawler, "(//div[@dir='ltr'])[7]", "xpath")
for item in items:
    li = ts.extractor.html(item)
    print(li)
crawler.quit()
<li><a href="https://www.techlistic.com/2017/02/how-to-handle-dynamic-web-table-in.html">Verify that there are only 4 structure values present in the table with Selenium</a></li>
<li><a href="https://www.techlistic.com/2017/02/how-to-handle-dynamic-web-table-in.html">Verify that 6th row of the table (Last Row) has only two columns with Selenium</a></li>
<li><a href="https://www.techlistic.com/2017/02/how-to-handle-dynamic-web-table-in.html">Find the tallest structure in the table with Selenium</a></li>
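Once the items are available as HTML strings like the ones printed above, the link text and target can be pulled out of each one. The sketch below does this with the standard library's xml.etree.ElementTree, which only works when the fragment is well-formed XML, as these samples happen to be. link_of is a hypothetical helper, not a thatscraper function.

```python
import xml.etree.ElementTree as ET

# Two of the items printed above, copied as plain strings.
li_html = [
    '<li><a href="https://www.techlistic.com/2017/02/how-to-handle-dynamic-web-table-in.html">'
    "Verify that there are only 4 structure values present in the table with Selenium</a></li>",
    '<li><a href="https://www.techlistic.com/2017/02/how-to-handle-dynamic-web-table-in.html">'
    "Find the tallest structure in the table with Selenium</a></li>",
]


def link_of(li):
    """Return (text, href) for the first <a> directly inside a well-formed <li>."""
    anchor = ET.fromstring(li).find("a")
    if anchor is None:
        return None
    return (anchor.text or "").strip(), anchor.get("href")


# Skip items that contain no link at all.
links = [pair for pair in map(link_of, li_html) if pair is not None]
for text, href in links:
    print(f"{text} -> {href}")
```

For markup that is not well-formed (unclosed tags, bare ampersands), a tolerant HTML parser would be the safer choice.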
Download images
When working with search on websites, prefer placing the query in the URL over sending it to an input element. This helps avoid reCAPTCHA and other bot-detection methods.
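When the query contains spaces or other special characters, it should be percent-encoded before it goes into the URL. A minimal sketch with the standard library follows; search_url is a hypothetical helper, and the assumption that the site accepts an encoded query as a path segment is untested here.

```python
from urllib.parse import quote


def search_url(query, base="https://www.pexels.com/search/"):
    """Build a search URL with the query percent-encoded as one path segment."""
    # safe="" also encodes "/" so the query cannot add extra path segments.
    return f"{base}{quote(query.strip(), safe='')}/"


print(search_url("cat"))
print(search_url("black cat"))
```

The example that follows uses a single-word query, so it needs no encoding.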
import re
import thatscraper as ts
crawler = ts.Crawler()
# let's get some photos of cats
query = "cat"
crawler.goto(f"https://www.pexels.com/search/{query}/")
grid = crawler.element_id("-")
images = crawler.children_of(grid, "//article/a/img", "xpath")
# the first 6 results
files = []
for image in images[:6]:
    img_url = image.get_attribute('src')
    # get the filename from url using regex (expects a "?" query string in the url)
    result = re.findall(r"./(.*?)\?", img_url)
    img_filename = result[0].split('/')[-1]
    files.append(img_filename)
    ts.extractor.download(img_url, img_filename)
crawler.quit()
# just for you to check out the result:
from IPython.display import HTML, display

def make_html(file_name):
    # build an inline <img> tag for one downloaded file
    img_element = (
        f'<img src="{file_name}"'
        + ' style="display:inline;margin:1px;width:100px;"/>'
    )
    return img_element
images = ''.join([make_html(file) for file in files])
display(HTML(images))
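The filename step in the download loop relies on a regular expression that expects a "?" in the image URL; when the URL has no query string, re.findall returns an empty list and result[0] raises IndexError. As an alternative sketch, the standard library can split the URL first and take the last path segment. filename_from_url is a hypothetical helper and the URLs are made-up examples.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring query and fragment."""
    return posixpath.basename(urlparse(url).path)


# Made-up URLs, with and without a query string.
with_query = "https://images.example.com/photos/123/sample-photo-123.jpeg?auto=compress&w=500"
without_query = "https://images.example.com/photos/123/sample-photo-123.jpeg"

print(filename_from_url(with_query))
print(filename_from_url(without_query))
```

Both calls print sample-photo-123.jpeg, so the same helper covers URLs with and without parameters.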





