# ToS;DR Crawler
The ToS;DR Crawler is important to the functionality of Phoenix.

By crawling a service we ensure that its documents are mirrored and cannot be altered unnoticed between crawls (verified using a CRC checksum).

We do not index websites on our own; all websites are crawled manually by curators or staff on our site.
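As a rough illustration of the idea (a sketch, not the actual Phoenix implementation), the snippet below computes a CRC-32 checksum of a document's text; if the checksum from a later crawl differs, the mirrored document has been altered. All names here are hypothetical.

```typescript
// Minimal CRC-32 (IEEE polynomial) sketch: detect whether a mirrored
// document changed between two crawls. For illustration only.

// Build the 256-entry lookup table once.
const CRC_TABLE = new Uint32Array(256).map((_, n) => {
  let c = n;
  for (let i = 0; i < 8; i++) {
    c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  }
  return c;
});

function crc32(text: string): number {
  const bytes = new TextEncoder().encode(text);
  let crc = 0xffffffff;
  for (const b of bytes) {
    crc = CRC_TABLE[(crc ^ b) & 0xff] ^ (crc >>> 8);
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// A later crawl compares checksums to detect alteration.
const firstCrawl = crc32("Terms of Service v1 ...");
const secondCrawl = crc32("Terms of Service v1 ...");
console.log(firstCrawl === secondCrawl ? "unchanged" : "document changed");
```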
## Identifying the ToS;DR Crawler
All ToS;DR crawlers send an identifying user agent with every request. Check for the following user agent:

```
ToSDRCrawler/$VERSION_STRING (+https://to.tosdr.org/bot) Region/$REGION_STRING
```
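A minimal sketch of matching this user agent on the server side, assuming `$VERSION_STRING` and `$REGION_STRING` are filled with arbitrary tokens (the regex and function name below are hypothetical):

```typescript
// Match the documented ToS;DR crawler user agent pattern.
// Example: "ToSDRCrawler/1.0 (+https://to.tosdr.org/bot) Region/eu-central"
const TOSDR_UA =
  /^ToSDRCrawler\/(\S+) \(\+https:\/\/to\.tosdr\.org\/bot\) Region\/(\S+)$/;

function isToSDRCrawler(userAgent: string): boolean {
  return TOSDR_UA.test(userAgent);
}

console.log(isToSDRCrawler("ToSDRCrawler/1.0 (+https://to.tosdr.org/bot) Region/eu-central")); // true
console.log(isToSDRCrawler("Mozilla/5.0")); // false
```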
## robots.txt
If you want to forbid crawling for some reason, you can add the following directive to your robots.txt:

```
User-agent: ToSDRCrawler
Disallow: YOUR_PATH
```
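For example, a robots.txt that blocks the crawler from a hypothetical drafts directory while leaving the rest of the site crawlable could look like this:

```
User-agent: ToSDRCrawler
Disallow: /legal/drafts/
```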
## Crawler Clusters
# | Crawler | Location | IP | Useragent | Notes
---|---|---|---|---|---
1 | crawler-eu-central-4 | 🇩🇪 EU-Central | 202.61.251.191 | ✅ | Ignores robots.txt
2 | crawler-eu-central-3 | 🇩🇪 EU-Central | 45.136.28.177 | ✅ | Ignores robots.txt
3 | crawler-us-east-1 | 🇺🇸 OR - US-East | 5.78.55.193 | ✅ | Ignores robots.txt
4 | crawler-us-east-2 | 🇺🇸 OR - US-East | 5.78.55.194 | ✅ | Ignores robots.txt
5 | crawler-us-east-3 | 🇺🇸 OR - US-East | 5.78.55.195 | ✅ | Ignores robots.txt
6 | crawler-us-west-1 | 🇺🇸 VA - US-West | 5.161.124.209 | ✅ | Ignores robots.txt
7 | crawler-eu-west-1 | 🇬🇧 EU-West | 86.152.8.108 | ✅ | Ignores robots.txt - Community
8 | crawler-eu-central-1 | 🇩🇪 EU-Central | 5.75.154.116 | ✅ | Ignores robots.txt
9 | crawler-eu-central-2 | 🇩🇪 EU-Central | 202.61.193.29 | ✅ | Ignores robots.txt
## Help us host a crawler!
Most services use IP-based content localization. This was an issue when our first cluster of crawlers was based in Germany: a couple of documents were crawled in German. We have since distributed crawlers across the globe in three different regions, including the US and the UK.

While the hardware requirements of each crawler are minimal, it still costs money and resources to host our cluster.

If you want to help us by hosting a crawler server in your homelab or datacenter, get in touch with us; the list of services and documents keeps growing, meaning we need more resources. As we've written above, the hardware requirements are minimal, but you still need to fulfill a couple of things:
### ToS;DR Crawler Hardware

- RAM: 2 GB minimum
- CPU: 2 cores
- HDD: 40 GB
- Architecture: 64-bit (this is a must!)
- Ability to run Docker
- Exposed firewall port 6874
- Static IP or DDNS
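As a rough sketch of what running a crawler node could look like once these requirements are met (the image name below is hypothetical; the actual setup details are provided when you get in touch with us):

```
docker run -d \
  --restart unless-stopped \
  -p 6874:6874 \
  tosdr/crawler:latest   # hypothetical image name
```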
## Crawler problems
If you are the provider of the website, common crawling issues are:

- Cloudflare
- robots.txt
- IPTables-based restrictions (see Crawler Clusters)
- User-agent based blocking

To fix these, add our servers or user agents to the respective whitelist, as in the sketch below.
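A minimal sketch of such a whitelist for a Node.js site using Express (the middleware itself is hypothetical; the IPs come from the Crawler Clusters table above):

```typescript
import express from "express";

// IPs from the Crawler Clusters table above.
const TOSDR_CRAWLER_IPS = new Set([
  "202.61.251.191", "45.136.28.177", "5.78.55.193",
  "5.78.55.194", "5.78.55.195", "5.161.124.209",
  "86.152.8.108", "5.75.154.116", "202.61.193.29",
]);

const app = express();

// Example middleware: skip bot-blocking logic for the ToS;DR crawler.
app.use((req, res, next) => {
  const ua = req.get("User-Agent") ?? "";
  const isToSDR =
    ua.startsWith("ToSDRCrawler/") || TOSDR_CRAWLER_IPS.has(req.ip ?? "");
  if (isToSDR) {
    // Bypass rate limiting / bot challenges for the crawler.
    return next();
  }
  // ... your normal bot checks here ...
  next();
});
```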
## Error codes, what do they mean?
Error | Explanation | Fix
---|---|---
Reason: Error Stacktrace: write EPROTO 140022019606400:error:141A318A:SSL routines:tls_process_ske_dhe:dh key too small:…/deps/openssl/openssl/ssl/statem/statem_clnt.c:2157: | This SSL error means a secure connection could not be established because a handshake cipher is possibly too old. | Update the cipher suites in your web server's SSL configuration.
Expected status code 200:OK; got 403:Forbidden | The website blocks our crawler; most likely this is Cloudflare. | Whitelist our crawler cluster or user agent.
Please check that the XPath and URL are accurate. | The XPath you retrieved possibly points into an iframe, which we cannot crawl, or it is simply the wrong XPath. | Get the raw link from the iframe and use the XPath there.
MimeType {MIMETYPE} is not in our whitelist | The document you crawled is not supported by our server. | Fix the MIME type, or suggest that the MIME type be supported.
Expected status code in range 2xx class; got 405:Method Not Allowed | The crawler sends a HEAD request first to determine the content type, content size and crawlability of the document (see the sketch after this table). If a service has not implemented HEAD requests, the crawl fails. | Contact the site owner to allow HEAD requests.
Waiting for element to be located By(xpath, //div) Wait timed out after 10000ms | The crawler did not find the specified XPath. | Adjust your XPath. Note: dynamically generated elements (e.g. randomly generated ids) are not supported.
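As an illustration of that HEAD-first behaviour (a sketch, not the crawler's actual code; the URL is a placeholder):

```typescript
// Probe a document the way the crawler does: HEAD first, then GET.
async function probeDocument(url: string): Promise<void> {
  // HEAD request: fetch headers only, to check type, size and crawlability.
  const head = await fetch(url, { method: "HEAD" });
  if (!head.ok) {
    // A 405 here is exactly the "Method Not Allowed" error from the table.
    throw new Error(`Expected status code in range 2xx class; got ${head.status}`);
  }
  console.log("Content-Type:", head.headers.get("content-type"));
  console.log("Content-Length:", head.headers.get("content-length"));

  // Only after a successful HEAD does the actual GET crawl proceed.
  const body = await fetch(url).then((r) => r.text());
  console.log("Fetched", body.length, "characters");
}

probeDocument("https://example.com/terms").catch(console.error);
```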