Resolve Article Scraping Issues via WAF
Introduction
When our system attempts to retrieve articles from your website, Audioboost Publisher Manager may flag scraping failures. These are typically caused by security measures blocking our requests. In order to resolve this, we provide two solutions:
- Whitelist our User-Agent (speakup-article) on your server / WAF.
- Whitelist a unique GET parameter appended to our article URLs.
This guide explains what Web Application Firewalls (WAFs) are and outlines the steps for whitelisting.
What is a Web Application Firewall (WAF)?
A WAF is a security layer between your web server and external traffic. It filters malicious requests (e.g., SQL injection, DDoS attacks) using predefined rules. While essential for security, WAFs can inadvertently block legitimate scrapers like ours.
How WAFs Interfere with Scraping
- User-Agent Blocking: WAFs may ban unknown / unrecognized user agents.
- Rate Limiting: Frequent requests from a single IP address or agent trigger blocks.
- Signature Detection: GET parameters or headers may resemble attack patterns.
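To make these three failure modes concrete, here is a deliberately simplified Python model of how a WAF might evaluate a request. It is a sketch for illustration only: the function name `waf_decision`, the list of known agents, the rate threshold and the suspicious patterns are all assumptions, not the behavior of any specific product.

```python
import re

# Hypothetical settings for illustration; real WAFs use far richer rule sets.
KNOWN_USER_AGENTS = ("Mozilla", "Googlebot")
MAX_REQUESTS_PER_MINUTE = 60
SUSPICIOUS_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"union\s+select", r"<script")
]


def waf_decision(user_agent: str, requests_last_minute: int, query_string: str) -> str:
    """Return "allow" or the reason this toy WAF would block the request."""
    # User-Agent blocking: agents the WAF does not recognize are rejected.
    if not any(token in user_agent for token in KNOWN_USER_AGENTS):
        return "blocked: unknown user agent"
    # Rate limiting: too many requests from one client in a short window.
    if requests_last_minute > MAX_REQUESTS_PER_MINUTE:
        return "blocked: rate limit"
    # Signature detection: the query string resembles an attack pattern.
    if any(pattern.search(query_string) for pattern in SUSPICIOUS_PATTERNS):
        return "blocked: suspicious signature"
    return "allow"
```

With these assumed settings, a request identifying itself as `speakup-article` is rejected at the first check, which is the situation the allowlist rules in this guide are meant to prevent.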
Common WAFs on the market
The WAF solutions below are widely used in enterprise deployments. Whitelisting steps vary by provider.
WAF Provider | Deployment Model |
---|---|
Cloudflare | Cloud-based (SaaS) |
Akamai Kona Site Defender | Cloud-based (SaaS) |
AWS WAF | Cloud (integrated with AWS services) |
Imperva | Cloud / hybrid |
F5 Advanced WAF | On-premises/hardware |
Solution 1: Whitelist Speakup User-Agent
Add our User-Agent speakup-article to your WAF’s allowlist.
Generic Steps for whitelisting
- Access your WAF dashboard (e.g., Cloudflare, AWS WAF).
- Navigate to "Security Rules" > "Allowlists" (or equivalent).
- Create a new rule:
  - Match type: User-Agent
  - Value: speakup-article
- Set the rule action to ALLOW (bypass other checks).
- Save and deploy changes.
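After deploying the rule, you may want a quick self-check before asking us to retry. The Python sketch below sends a GET request with the speakup-article User-Agent and reports the HTTP status code. It only approximates our scraper: the helper names `build_request` and `fetch_status` are hypothetical, and our real requests may differ in source IP and other headers, so a successful response is a good sign rather than a guarantee.

```python
import urllib.error
import urllib.request

USER_AGENT = "speakup-article"  # User-Agent value from this guide


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request that identifies itself as speakup-article."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT}, method="GET")


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server (or WAF) answers with."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # Blocked requests commonly surface as 4xx/5xx responses.
        return error.code
```

Calling `fetch_status` with one of your article URLs and comparing the result before and after the rule change shows whether the allowlist entry had an effect.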
Cloudflare Guide
- Select the website that you want to manage.
- In the right menu, select Security > WAF and click “Create rule”.
- Create a new rule with this information:
  - Field: "User Agent"
  - Operator: "contains"
  - Value: speakup-article
  - Action: Skip
- Select the WAF components that the rule should skip.
- Click the “Deploy” button.
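If you prefer Cloudflare's expression editor over the form fields, the same match can be written as a rule expression. The Python sketch below only assembles that expression string; `user_agent_expression` is a hypothetical helper name, and the expression is our understanding of how the Field, Operator and Value settings above translate into the Cloudflare Rules language.

```python
def user_agent_expression(value: str = "speakup-article") -> str:
    """Return a Cloudflare-style expression matching User-Agents that contain value."""
    # Escape backslashes and double quotes so the value stays a valid quoted string.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'(http.user_agent contains "{escaped}")'
```

Called with no arguments, the helper returns `(http.user_agent contains "speakup-article")`, which can be pasted into the expression editor and checked against the rule preview before deploying.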
Provider-Specific Guides
- AWS WAF: User-Agent Allowlisting
- Akamai: Modify Kona Rule Sets
Solution 2: Whitelist a specific GET Parameter
Use a unique parameter we'll append to the article URLs.
https://example.com/article?scraper_token=speakup_article
Whitelist the scraper_token parameter to bypass WAF rules.
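To test your rule against realistic URLs, you can generate them yourself. The Python sketch below adds the parameter to an article URL while keeping any existing query parameters and fragment. The helper name `with_scraper_token` is hypothetical, and exactly how our system places the parameter in URLs that already carry a query string is an assumption here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TOKEN_NAME = "scraper_token"     # parameter name from this guide
TOKEN_VALUE = "speakup_article"  # parameter value from this guide


def with_scraper_token(url: str) -> str:
    """Return url with scraper_token=speakup_article set exactly once."""
    parts = urlsplit(url)
    # Keep all other query parameters; drop any stale scraper_token value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != TOKEN_NAME
    ]
    query.append((TOKEN_NAME, TOKEN_VALUE))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, `with_scraper_token("https://example.com/article")` returns `https://example.com/article?scraper_token=speakup_article`, matching the URL shown above.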
Generic Steps
- In your WAF dashboard, locate "Allowlisted Parameters" (or "Ignore Rules").
- Add scraper_token to the allowlist.
- Ensure the rule:
  - Applies to the path: /* (all articles).
  - Action: BYPASS or IGNORE.
Example for AWS WAF
{
  "Name": "AllowSpeakupScraperToken",
  "Priority": 1,
  "Action": { "Allow": {} },
  "Statement": {
    "ByteMatchStatement": {
      "SearchString": "speakup_article",
      "FieldToMatch": {
        "SingleQueryArgument": { "Name": "scraper_token" }
      },
      "TextTransformations": [
        { "Priority": 0, "Type": "NONE" }
      ],
      "PositionalConstraint": "EXACTLY"
    }
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "AllowSpeakupScraperToken"
  }
}
Verification
Once you confirm that whitelisting is in place, our team will retry scraping your articles within 24 business hours.
For issues, you can contact us at support [at] audioboost [dot] it, with the following details:
- Your WAF provider name
- Example article URLs
- Blocked request logs (if available).
Conclusion
Whitelisting either our User-Agent or the GET parameter ensures uninterrupted article scraping. Most WAFs support these adjustments via their dashboards. For urgent issues, contact our support team.