Cloudflare protection often breaks crawlers because the site no longer serves plain content directly. Instead, Cloudflare may insert:
- Bot detection challenges
- JavaScript verification
- CAPTCHA checks
- Fingerprint analysis
- Rate limiting
- Browser integrity checks
The crawler may technically ‘connect’, but it may no longer receive the real page.
The response might look like this:
- HTTP 403
- Endless redirects
- Empty HTML
- CAPTCHA page
- ‘Just a moment...’
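As a rough illustration, a crawler could label these symptoms instead of treating every response as page content. The sketch below is only a heuristic under stated assumptions: the `Response` class, the `classify` function, the marker strings, and the redirect threshold are all hypothetical names and values chosen for this example, and real challenge pages vary and change over time.

```python
from dataclasses import dataclass

# Hypothetical marker strings (an assumption for this sketch).
# A page that merely discusses CAPTCHAs would also match, so treat hits as hints.
CHALLENGE_MARKERS = ("just a moment", "captcha")


@dataclass
class Response:
    """Minimal stand-in for an HTTP response, not a real library type."""
    status: int
    body: str = ""
    redirect_count: int = 0  # redirects followed before this response


def classify(resp: Response, max_redirects: int = 10) -> str:
    """Return a coarse label for the symptoms listed above."""
    text = resp.body.strip().lower()
    if resp.redirect_count >= max_redirects:
        return "redirect_loop"      # endless redirects
    if any(marker in text for marker in CHALLENGE_MARKERS):
        return "challenge_page"     # CAPTCHA or 'Just a moment...' page
    if resp.status == 403:
        return "forbidden"          # HTTP 403
    if not text:
        return "empty_html"         # connected, but no content
    return "ok"


if __name__ == "__main__":
    print(classify(Response(200, "<title>Just a moment...</title>")))
    print(classify(Response(403)))
```

The body check runs before the status check so that a 403 response carrying a challenge page is reported as a challenge rather than as a plain refusal.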
This happens because a traditional crawler assumes:
Request → HTML page
But a Cloudflare-protected site behaves more like:
Request
↓ Bot analysis
↓ JS challenge / fingerprinting
↓ Conditional access
If the request does not look browser-like, access to the page may be restricted.
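The difference between the two flows can be sketched as a toy model. This is purely illustrative and is not Cloudflare's actual logic: the `Request` fields, the `naive_origin` and `protected_site` functions, the user-agent check, and the rate threshold are all hypothetical, and a real system weighs far more signals than these.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request signals, invented for this sketch."""
    user_agent: str
    runs_javascript: bool
    requests_per_minute: int


def naive_origin(request: Request) -> str:
    # What a traditional crawler assumes: Request -> HTML page.
    return "html_page"


def protected_site(request: Request, rate_limit: int = 60) -> str:
    # Stage 1: bot analysis (toy checks on request rate and user agent).
    if request.requests_per_minute > rate_limit:
        return "rate_limited"
    if "Mozilla" not in request.user_agent:
        return "blocked_403"
    # Stage 2: JS challenge. A client that cannot run JavaScript
    # never gets past the interstitial page.
    if not request.runs_javascript:
        return "challenge_page"
    # Stage 3: conditional access is granted.
    return "html_page"


if __name__ == "__main__":
    crawler = Request("python-urllib/3.11", False, 10)
    browser = Request("Mozilla/5.0", True, 10)
    print(naive_origin(crawler), protected_site(crawler))
    print(naive_origin(browser), protected_site(browser))
```

The point of the model is that the same request can yield different outcomes depending on how it looks to the gate, which is why a crawler that only checks "did I connect" can end up storing a block page as if it were content.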