Example usage
To use thatscrapper in a project:
import thatscrapper
print(thatscrapper.__version__)
Fill simple form
Forms are usually form tag elements, so you can normally select the HTML form structure using “tag name”. In the example below, however, the form is wrapped in a div with the class name “form”. Always inspect the page to check the structure of the element you want to select.
import time
import thatscrapper
crawler = thatscrapper.Crawler()
# open page
crawler.goto("https://phptravels.com/demo/")
# get form wrapper
form_element = crawler.element("form", "class name")
# form fields
elements = crawler.children_of(form_element, "input", "tag name")
# data to fill
data = {
    'first name': 'John',
    'last name': 'Doe',
    'bus name': 'Joe',
    'email': 'j.doe@gmail.com'
}
# filling: inputs and values are paired by position
for element, field in zip(elements, data):
    crawler.send_to_element(element, data[field])
# wait long enough so you can check the result
time.sleep(5)
# always quit the driver
crawler.quit()
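The loop above pairs inputs and values purely by position, so it can fill the wrong field if the page reorders or adds an input. Below is a minimal sketch of a sturdier pairing. It assumes the elements behave like Selenium elements, whose `get_attribute("placeholder")` returns the hint text of the input. The demo page's markup is not verified here, and `pair_by_placeholder` and `FakeInput` are hypothetical names, not part of thatscrapper.

```python
def pair_by_placeholder(elements, data):
    """Return (element, value) pairs matched on the placeholder text.

    Hypothetical helper: assumes each element exposes
    get_attribute("placeholder"), as Selenium WebElements do.
    """
    # normalise the keys once so matching is case-insensitive
    wanted = {key.strip().lower(): value for key, value in data.items()}
    pairs = []
    for element in elements:
        placeholder = (element.get_attribute("placeholder") or "").strip().lower()
        if placeholder in wanted:
            pairs.append((element, wanted[placeholder]))
    return pairs


class FakeInput:
    """Stand-in for a browser element so the sketch runs without a driver."""

    def __init__(self, placeholder):
        self.placeholder = placeholder

    def get_attribute(self, name):
        return self.placeholder if name == "placeholder" else None


# inputs deliberately out of order, plus one without a placeholder
inputs = [FakeInput("Email"), FakeInput("First Name"), FakeInput(None)]
data = {"first name": "John", "email": "j.doe@gmail.com"}
matched = pair_by_placeholder(inputs, data)
print([value for _, value in matched])
```

With a real crawler, each pair would then be passed to `crawler.send_to_element(element, value)`, as in the example above.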
Extract a table
To scrape, collect, or handle data such as text, tables, or images, thatscrapper ships with the extractor module, which works with elements or addresses and returns the data you need. Here is an example that retrieves a table as a pandas.DataFrame object:
import time
import thatscrapper as ts
crawler = ts.Crawler(browser='chrome')
crawler.goto("https://www.techlistic.com/p/demo-selenium-practice.html")
customers_table = ts.extractor.Table(crawler, "customers", "id")
# table as pandas dataframe (see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
print(customers_table.data)
# compare results
time.sleep(10)
crawler.quit()
[          Company           Contact  Country
0          Google      Maria Anders  Germany
1            Meta   Francisco Chang   Mexico
2       Microsoft     Roland Mendel  Austria
3  Island Trading     Helen Bennett       UK
4           Adobe   Yoshi Tannamuri   Canada
5          Amazon  Giovanni Rovelli    Italy]
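To make the HTML-to-table step less opaque, here is a minimal standard-library sketch that turns table markup into rows. It is not how `Table` is implemented; the source does not show that. The `TableRows` class and the sample markup are hypothetical and only mimic the first rows of the output above.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # ignore text that sits outside a cell (e.g. whitespace between tags)
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# hypothetical markup shaped like the table printed above
markup = """
<table id="customers">
  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
  <tr><td>Google</td><td>Maria Anders</td><td>Germany</td></tr>
  <tr><td>Meta</td><td>Francisco Chang</td><td>Mexico</td></tr>
</table>
"""

parser = TableRows()
parser.feed(markup)
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records[0])
```

A list of dictionaries like `records` can be passed directly to `pandas.DataFrame(records)` if a dataframe is needed.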
Get items from lists
As noted above, the extractor module is meant for retrieving information. In the previous example, Table locates the element and converts its HTML into a dataframe. The next example retrieves the items of a list as HTML.
import thatscrapper as ts
crawler = ts.Crawler(browser='chrome', headless=True)
crawler.goto("https://www.techlistic.com/p/demo-selenium-practice.html")
items = ts.extractor.UnorderedList(crawler, "(//div[@dir=\'ltr\'])[7]", "xpath")
for item in items:
    # raw HTML of each list item
    li = ts.extractor.html(item)
    print(li)
crawler.quit()
<li><a href="https://www.techlistic.com/2017/02/how-to-handle-dynamic-web-table-in.html">Verify that there are only 4 structure values present in the table with Selenium</a></li>
<li><a href="https://www.techlistic.com/2017/02/how-to-handle-dynamic-web-table-in.html">Verify that 6th row of the table (Last Row) has only two columns with Selenium</a></li>
<li><a href="https://www.techlistic.com/2017/02/how-to-handle-dynamic-web-table-in.html">Find the tallest structure in the table with Selenium</a></li>
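Since the items come back as HTML strings, a common next step is to pull out the link text and target. The sketch below does this with the standard library only; `LinkParser` and `links_in` are hypothetical helpers, not part of thatscrapper, and the sample string is one of the items printed above.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect (text, href) for each <a href=...> element in a fragment."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # anchors without an href leave _href as None and are skipped
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def links_in(fragment):
    """Return a list of (text, href) tuples found in an HTML fragment."""
    parser = LinkParser()
    parser.feed(fragment)
    parser.close()
    return parser.links


li = (
    '<li><a href="https://www.techlistic.com/2017/02/'
    'how-to-handle-dynamic-web-table-in.html">'
    "Find the tallest structure in the table with Selenium</a></li>"
)
print(links_in(li))
```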
Download images
When working with a website's search, prefer placing the query in the URL over typing it into an input element. This helps avoid reCAPTCHA and other bot-detection mechanisms.
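A query with spaces or symbols has to be percent-encoded before it goes into the URL. Here is a minimal sketch using the standard library; `search_url` is a hypothetical helper, and the base address is the one used in the example that follows.

```python
from urllib.parse import quote


def search_url(query, base="https://www.pexels.com/search/"):
    """Build a search URL with the query as a single, encoded path segment."""
    # safe='' also encodes "/" so the query cannot spill into extra segments
    return f"{base}{quote(query, safe='')}/"


print(search_url("cat"))
print(search_url("black cat"))
```

The example below uses a single-word query, so no encoding is needed there.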
import re
import thatscrapper as ts
crawler = ts.Crawler()
# let's get some photos of cats
query = "cat"
crawler.goto(f"https://www.pexels.com/search/{query}/")
grid = crawler.element_id("-")
images = crawler.children_of(grid, "//article/a/img", "xpath")
# the first 6 results
files = []
for image in images[:6]:
    img_url = image.get_attribute('src')
    # get the filename from url using regex (expects a "?" query string in the url)
    result = re.findall(r"./(.*?)\?", img_url)
    img_filename = result[0].split('/')[-1]
    files.append(img_filename)
    ts.extractor.download(img_url, img_filename)
crawler.quit()
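The regex above only matches when the image URL contains a "?" query string; otherwise `result` is empty and `result[0]` raises an IndexError. A sketch of an alternative that works with or without a query string, using only the standard library, is shown below. `filename_from_url` is a hypothetical helper and the sample URL is made up for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path component of a URL, ignoring query and fragment."""
    path = urlsplit(url).path
    return PurePosixPath(path).name


# made-up URLs, shaped like a typical image address
with_query = "https://images.example.com/photos/1/example-photo.jpeg?auto=compress&w=500"
without_query = "https://images.example.com/photos/1/example-photo.jpeg"

print(filename_from_url(with_query))
print(filename_from_url(without_query))
```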
from IPython.display import HTML, display
# build an inline <img> tag for one downloaded file
def make_html(file_name):
    img_element = (
        f"<img src=\"{file_name}\""
        + " style=\"display:inline;margin:1px;width:100px;\"/>"
    )
    return img_element
# show all downloaded images side by side in the notebook
images = ''.join([make_html(file) for file in files])
display(HTML(images))





