The ToS;DR Crawler is central to the functionality of Phoenix.

By crawling a service, we ensure that its documents are mirrored and cannot be altered until a subsequent crawl (verified using a CRC checksum).

We do not index websites on our own; all websites are crawled manually by curators or staff on our site.
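
The CRC verification mentioned above can be sketched with a CRC32 checksum. This is a minimal illustration of the idea, not the actual Phoenix implementation; the function name and sample strings are hypothetical:

```python
import zlib

def document_checksum(body: bytes) -> int:
    """Return the CRC32 checksum of a crawled document body."""
    return zlib.crc32(body) & 0xFFFFFFFF

# Compare the checksum of a new crawl against the stored one;
# a differing CRC means the mirrored document has changed.
old = document_checksum(b"Terms of Service v1")
new = document_checksum(b"Terms of Service v1 (updated)")
print(old != new)  # True: the document was altered between crawls
```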

## Identifying the ToS;DR Crawler

All ToS;DR crawlers send a distinctive user agent with every request.

Check for the following user agent:

```
ToSDRCrawler/1.0.0 (+https://to.tosdr.org/bot)
```
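
Server-side, the crawler can be recognized by matching this token at the start of the `User-Agent` header. A minimal sketch (the helper name is hypothetical; only the header value comes from above):

```python
def is_tosdr_crawler(user_agent: str) -> bool:
    """Return True if the request's User-Agent identifies the ToS;DR crawler."""
    return user_agent.startswith("ToSDRCrawler/")

print(is_tosdr_crawler("ToSDRCrawler/1.0.0 (+https://to.tosdr.org/bot)"))  # True
print(is_tosdr_crawler("Mozilla/5.0"))                                     # False
```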

## robots.txt

If you want to forbid crawling for some reason, you can add the following directive to your robots.txt:

```
User-Agent: ToSDRCrawler
Disallow: YOUR_PATH
```
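
You can verify such a directive with Python's standard `urllib.robotparser` before deploying it. A sketch, assuming a hypothetical `/private/` path in place of `YOUR_PATH`:

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt that forbids the ToS;DR crawler from /private/
rp = RobotFileParser()
rp.parse([
    "User-Agent: ToSDRCrawler",
    "Disallow: /private/",
])

print(rp.can_fetch("ToSDRCrawler", "https://example.com/private/terms"))  # False
print(rp.can_fetch("ToSDRCrawler", "https://example.com/terms"))          # True
```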

## Crawler Clusters

| Crawler | Location | IP | rDNS | Useragent | DNS | Port | Notes | Internal IP |
|---|---|---|---|---|---|---|---|---|
| Atlas | Austria - EU | 202.61.251.191 | | | atlas.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores robots.txt | 10.0.0.7 |
| Arachne | Germany - EU | 45.136.28.177 | | | arachne.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores robots.txt | 10.0.0.6 |
| AvidReader | Germany - EU | 37.120.165.131 | | | havidreader.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores robots.txt | 10.0.0.1 |
| Floppy | Germany - EU | 37.120.177.70 | | | floppy.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores robots.txt | 10.0.0.2 |
| James | Germany - EU | 185.228.137.101 | | | james.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores robots.txt | 10.0.0.3 |
| NosyPeeper | Germany - EU | 188.68.49.4 | | | nosypeeper.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores robots.txt | 10.0.0.4 |
| Terra | Germany - EU | 87.78.131.160 | 🚫 | | terra.crawler.api.tosdr.org | 0.0.0.0:6874:6874 | Backup only | N/A |
| Whale | Virginia - US | 157.245.142.64 | 🚫 | | whale.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores robots.txt | N/A |
| Dmitri | Washington - US | 71.227.178.97 | 🚫 | | dmitri.crawler.api.tosdr.org | 0.0.0.0:80:6874 | Ignores robots.txt | N/A |

## Crawler problems

If you are the provider of the website, common crawling issues are:

- Cloudflare
- robots.txt
- IPTables-based restrictions (see Crawler Clusters)
- User-Agent-based blocking

To fix these, add our servers or user agents to the respective whitelist.
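
For IP-based rules, the public cluster addresses can be whitelisted. A minimal sketch of such a check; the IP set is copied from the Crawler Clusters table above, and the surrounding firewall or web-framework integration is left out:

```python
import ipaddress

# Public IPs of the ToS;DR crawler cluster (see "Crawler Clusters")
TOSDR_CRAWLER_IPS = {
    "202.61.251.191",   # Atlas
    "45.136.28.177",    # Arachne
    "37.120.165.131",   # AvidReader
    "37.120.177.70",    # Floppy
    "185.228.137.101",  # James
    "188.68.49.4",      # NosyPeeper
    "87.78.131.160",    # Terra (backup only)
    "157.245.142.64",   # Whale
    "71.227.178.97",    # Dmitri
}

def is_tosdr_crawler_ip(addr: str) -> bool:
    """Return True if the client address belongs to the crawler cluster."""
    # ip_address() normalizes the string and rejects invalid input early.
    return str(ipaddress.ip_address(addr)) in TOSDR_CRAWLER_IPS

print(is_tosdr_crawler_ip("202.61.251.191"))  # True
print(is_tosdr_crawler_ip("8.8.8.8"))         # False
```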

## Error codes, what do they mean?

| Error | Explanation | Fix |
|---|---|---|
| Reason: Error Stacktrace: write EPROTO 140022019606400:error:141A318A:SSL routines:tls_process_ske_dhe:dh key too small:…/deps/openssl/openssl/ssl/statem/statem_clnt.c:2157: | This SSL error means a secure connection could not be established because the server's handshake cipher is too old (the DH key is too small). | Update the cipher suites in your web server's SSL configuration. |
| Expected status code 200:OK; got 403:Forbidden | The website blocks our crawler, most likely via Cloudflare. | Whitelist our crawler cluster or user agent. |
| Please check that the XPath and URL are accurate. | The element you selected is possibly inside an iframe, which we cannot crawl, or the XPath is simply wrong. | Get the raw link from the iframe and use the XPath there. |
| MimeType {MIMETYPE} is not in our whitelist | The document you crawled is not supported by our server. | Fix the MIME type, or suggest that the MIME type be supported. |
| Expected status code in range 2xx class; got 405:Method Not Allowed | The crawler first issues a HEAD request to determine the content type, content size, and crawlability of the document. If a service has not implemented HEAD requests, crawling fails. | Contact the site owner to allow HEAD requests. |
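
The HEAD pre-check described in the last row can be reproduced with Python's standard library. This is a sketch of the idea, not the crawler's actual code; the function name is hypothetical and the header names are standard HTTP:

```python
from urllib.request import Request, urlopen

def head_check(url: str) -> dict:
    """Issue a HEAD request, as the crawler does before fetching a document."""
    req = Request(url, method="HEAD", headers={
        "User-Agent": "ToSDRCrawler/1.0.0 (+https://to.tosdr.org/bot)",
    })
    with urlopen(req, timeout=10) as resp:
        return {
            "status": resp.status,                             # must be in the 2xx class
            "content_type": resp.headers.get("Content-Type"),  # checked against the MIME whitelist
            "content_length": resp.headers.get("Content-Length"),
        }
```

A server that answers 405 at this step makes the crawl fail before any content is fetched, which is why the fix has to happen on the site owner's side.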