Golang Colly: Use Random Fake User-Agents When Scraping
To use fake user-agents with Go Colly, you just need to set a User-Agent header every time a new request is sent, using the OnRequest() event.
```go
package main

import (
	"bytes"
	"log"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(colly.AllowURLRevisit())

	// Set Fake User Agent
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148")
	})

	// Print the Response
	c.OnResponse(func(r *colly.Response) {
		log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1))
	})

	// Fetch httpbin.org/headers five times
	for i := 0; i < 5; i++ {
		c.Visit("http://httpbin.org/headers")
	}
}
```
One of the most common reasons for getting blocked whilst web scraping is using bad user-agents.
However, integrating random fake user-agents into your Go Colly web scrapers is very easy.
So in this guide, we will go through:
- What Are Fake User-Agents?
- How To Set A Fake User Agent In Go Colly
- How To Rotate Through Random User-Agents
- How To Manage Thousands of Fake User-Agents
- Why Use Fake Browser Headers
- ScrapeOps Fake Browser Headers API
First, let's quickly go over some of the very basics.
What Are Fake User-Agents?
User Agents are strings that let the website you are scraping identify the application, operating system (OSX/Windows/Linux), browser (Chrome/Firefox/Internet Explorer), etc. of the user sending a request to their website. They are sent to the server as part of the request headers.
Here is an example User-Agent sent when you visit a website with a Chrome browser:
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36'
When scraping a website you also need to set a user-agent on every request, as otherwise the website may block your requests because it knows you aren't a real user.
For example, when you make a request with Go Colly, it sends the following user-agent with the request:
"User-Agent": "colly - https://github.com/gocolly/colly",
This user agent will clearly identify your requests are being made by the Go Colly library, so the website can easily block you from scraping the site.
That is why we need to manage the user-agents Go Colly sends with our requests.
How To Set A Fake User-Agent In Go Colly
Setting Go Colly to use a fake user-agent is very easy. We just need to set a User-Agent header every time a new request is sent, using the OnRequest() event.
```go
package main

import (
	"bytes"
	"log"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(colly.AllowURLRevisit())

	// Set Fake User Agent
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148")
	})

	// Print the Response
	c.OnResponse(func(r *colly.Response) {
		log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1))
	})

	// Fetch httpbin.org/headers five times
	for i := 0; i < 5; i++ {
		c.Visit("http://httpbin.org/headers")
	}
}
```
From here our scraper will use this user-agent for every request. (Colly also offers the colly.UserAgent() collector option if you prefer to set a static user-agent once when creating the collector.)
However, if you are scraping at scale, using the same user-agent for every request isn't best practice, as it makes it easier for the website to detect you as a scraper.
To solve this problem we need to configure our Go Colly scraper to use a random user-agent with every request.
How To Rotate Through Random User-Agents
With Go Colly, rotating through user-agents is also pretty straightforward. We just need a slice of user-agents and have our scraper use a random one with every request.
```go
package main

import (
	"bytes"
	"log"
	"math/rand"

	"github.com/gocolly/colly"
)

// RandomString returns a random user-agent from the list.
// Note: on Go versions before 1.20, seed math/rand first
// (e.g. rand.Seed(time.Now().UnixNano())), otherwise every
// run will pick the same sequence.
func RandomString(userAgentList []string) string {
	randomIndex := rand.Intn(len(userAgentList))
	return userAgentList[randomIndex]
}

func main() {
	// Instantiate default collector
	c := colly.NewCollector(colly.AllowURLRevisit())

	userAgentList := []string{
		"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36",
		"Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1",
		"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75",
		"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363",
	}

	// Set Random Fake User Agent
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", RandomString(userAgentList))
	})

	// Print the Response
	c.OnResponse(func(r *colly.Response) {
		log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1))
	})

	// Fetch httpbin.org/headers five times
	for i := 0; i < 5; i++ {
		c.Visit("http://httpbin.org/headers")
	}
}
```
This works, but it has a drawback: we need to build and maintain an up-to-date list of user-agents ourselves.
An alternative approach is to use the RandomUserAgent extension, which generates the list of user-agents for you. To use it, we just need to import it and add extensions.RandomUserAgent(c) to our code.
```go
package main

import (
	"bytes"
	"log"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/extensions"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(colly.AllowURLRevisit())

	// Add Random User Agents
	extensions.RandomUserAgent(c)

	// Print the Response
	c.OnResponse(func(r *colly.Response) {
		log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1))
	})

	// Fetch httpbin.org/headers five times
	for i := 0; i < 5; i++ {
		c.Visit("http://httpbin.org/headers")
	}
}
```
The RandomUserAgent extension works; however, if you look at its source code, the pool of fake user-agents it can generate from is pretty small.