The system for gathering data from classifieds sites
The program gathers data from classifieds sites. Crawler robots imitate the actions of a website user and collect the required information. In addition to text data, the robots recognize information embedded in images, such as addresses and telephone numbers.
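As a rough illustration of the image-recognition step, the sketch below pulls phone numbers out of OCR text. It assumes the `pytesseract` wrapper and Pillow are used to call Tesseract; the project's actual integration is not described here. The regular expression, the 10 to 15 digit limit, and the names `tesseract_ocr` and `extract_phones` are hypothetical.

```python
import re
from typing import Callable, List

# Assumed pattern: an optional "+", then digits mixed with spaces,
# brackets, or dashes. Real listings may need site-specific rules.
PHONE_RE = re.compile(r"\+?\d[\d\s()\-]{8,}\d")


def tesseract_ocr(image_path: str) -> str:
    """Read text from an image with Tesseract (assumes pytesseract and Pillow are installed)."""
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path))


def extract_phones(image_path: str, ocr: Callable[[str], str] = tesseract_ocr) -> List[str]:
    """Return digit-only phone numbers found in the text recognized from an image."""
    text = ocr(image_path)
    phones = []
    for match in PHONE_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        # Keep only plausible lengths (an assumption, not a project rule).
        if 10 <= len(digits) <= 15:
            phones.append(digits)
    return phones
```

The OCR function is passed in as a parameter so the parsing logic can be exercised without Tesseract installed, for example `extract_phones("ad.png", ocr=lambda path: "Call +1 (555) 010-2030")`.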
We have implemented automated tests that check the functionality of the sites. Collecting information from one resource takes 3-6 days. Therefore, before launching a crawl, we need to check whether the functionality or the position of the page blocks has changed, so that the robots do not get "lost".
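A pre-crawl layout check of this kind might look like the minimal sketch below. It uses only the Python standard library and a well-formed sample page; a real check would more likely fetch the live page and use Scrapy selectors or Selenium. The XPath expressions, block names, and sample markup are hypothetical and not taken from the project.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical blocks a robot depends on, keyed by a readable name.
REQUIRED_BLOCKS: Dict[str, str] = {
    "title": ".//h1[@class='ad-title']",
    "price": ".//span[@class='price']",
    "phone_image": ".//img[@class='phone']",
}


def missing_blocks(page_markup: str, required: Dict[str, str] = REQUIRED_BLOCKS) -> List[str]:
    """Return the names of required blocks whose XPath matches nothing on the page."""
    # ElementTree needs well-formed markup; real-world HTML would need
    # a tolerant parser instead.
    root = ET.fromstring(page_markup.strip())
    return [name for name, xpath in required.items() if root.find(xpath) is None]


# Sample page in which the price block has been renamed by the site.
SAMPLE_PAGE = """
<html><body>
  <h1 class="ad-title">Bike for sale</h1>
  <div class="price-box"><span class="amount">120</span></div>
  <img class="phone" src="phone.png" />
</body></html>
"""
```

Calling `missing_blocks(SAMPLE_PAGE)` reports the `price` block as missing, which would be the signal to update the robot before starting a multi-day crawl. Wrapped in a pytest test, the same call could simply assert that the returned list is empty.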
Project in figures
10 robots developed
1 000 000 records per day
7 months of development
90% recognition of image data
Development: Scrapy, Spark, Scala, Java, Python, Tesseract
Testing tools: XPath, Selenium, pytest, JSON, Requests