Asynchronous web scraper
An asynchronous web scraper that works with an Elasticsearch cluster and MongoDB. It reads the list of organizations from the MongoDB database and builds a batch of tasks (coroutines) that run concurrently. For every organization in the database it crawls all the pages, cleans the extracted text, and updates the Elasticsearch cluster with the new content. All intermediate artifacts (sitemaps, visited pages, visiting reports, and "robots.txt" rules) are saved to a draft Elasticsearch index.
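The flow above (one coroutine per organization, each crawling pages, cleaning text, and indexing documents) can be sketched roughly as below. The `fetch` and `es_index` callables are hypothetical stand-ins for the real async HTTP client (e.g. aiohttp) and Elasticsearch client, so the orchestration itself stays self-contained:

```python
import asyncio
import re

def clean_text(html: str) -> str:
    """Strip HTML tags and collapse whitespace into clean plain text."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

async def process_org(org: dict, fetch, es_index) -> None:
    """Crawl every page of one organization and index the cleaned text."""
    for url in org["pages"]:
        html = await fetch(url)          # stand-in for an aiohttp GET
        await es_index({                 # stand-in for an Elasticsearch index call
            "org": org["name"],
            "url": url,
            "text": clean_text(html),
        })

async def run(orgs: list, fetch, es_index) -> None:
    """Run one coroutine per organization concurrently."""
    await asyncio.gather(*(process_org(org, fetch, es_index) for org in orgs))
```

In the real scraper the organization list would come from a MongoDB query (e.g. via motor) and `es_index` would write to the draft index; this sketch only shows the concurrency pattern.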
Services:
- Data mining and web scraping using Python
- Providing and consuming APIs (REST, GraphQL)
- Web automation and search automation using Selenium WebDriver
- Data exploration and processing, task and process automation using Python
- Data conversion from/to XLSX, JSON, XML, PDF, DOCX, etc.
- Desktop Windows applications using PyQt
- SQL and NoSQL database integration: MongoDB, MySQL, MariaDB, etc.

Data exploration & statistical analysis using R and Python (SciPy, NumPy, Pandas):
- Linear regression
- Logistic regression
- Pearson product-moment correlation coefficient
- Spearman's rank correlation coefficient
- Kendall tau rank correlation coefficient
- Pearson's chi-squared test
- Fisher's exact test
- Student's t-test
- Mann-Whitney U test
- Analysis of variance (ANOVA)
- Kruskal-Wallis test
- Cluster analysis
- Principal component analysis, etc.

I would be pleased to consider proposals for long-term projects!