Golang Colly: How to Use & Rotate Proxies
In this guide for The Golang Web Scraping Playbook, we will look at how to integrate the 3 most common types of proxies into our Go Colly based web scraper.
Using proxies with the Go Colly library allows you to spread your requests over multiple IP addresses making it harder for websites to detect & block your web scrapers.
In this guide we will walk you through the 3 most common proxy integration methods and show you how to use them with Go Colly:
- Using Proxy IPs With Go Colly
- Proxy Authentication With Go Colly
- The 3 Most Common Proxy Formats
- Proxy Integration #1: Rotating Through Proxy IP List
- Proxy Integration #2: Using Proxy Gateway
- Proxy Integration #3: Using Proxy API Endpoint
Let's begin...
Using Proxy IPs With Go Colly
Using a proxy with Go Colly is very straightforward. We simply need to set a proxy on the collector using the `SetProxy()` method.
```go
package main

import (
	"bytes"
	"log"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(colly.AllowURLRevisit())

	// Set Proxy
	c.SetProxy("http://proxy.example.com:8080")

	// Print the Response
	c.OnResponse(func(r *colly.Response) {
		log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1))
	})

	// On Error Print Error
	c.OnError(func(_ *colly.Response, err error) {
		log.Println("Something went wrong:", err)
	})

	// Request Page
	c.Visit("http://httpbin.org/ip")
}
```
This method will work for all request methods Go Colly supports: `GET`, `POST`, `PUT`, `DELETE`, `PATCH` and `HEAD`.
Proxy Authentication With Go Colly
To authenticate a proxy using a username and password, we simply need to add them to the proxy string.
```go
package main

import (
	"bytes"
	"log"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(colly.AllowURLRevisit())

	// Set Proxy (with username & password authentication)
	c.SetProxy("http://USERNAME:PASSWORD@proxy.example.com:8080")

	// Print the Response
	c.OnResponse(func(r *colly.Response) {
		log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1))
	})

	// On Error Print Error
	c.OnError(func(_ *colly.Response, err error) {
		log.Println("Something went wrong:", err)
	})

	// Request Page
	c.Visit("http://httpbin.org/ip")
}
```
The 3 Most Common Proxy Formats
That covers the basics of integrating a single proxy into Go Colly. In the next sections, we will show you how to integrate Go Colly with the 3 most common proxy formats:
- Rotating Through List of Proxy IPs
- Using Proxy Gateways
- Using Proxy APIs
A couple of years ago, proxy providers would sell you a list of proxy IP addresses, and you would configure your scraper to rotate through them, using a new IP address with each request.
Today, however, more and more proxy providers don't sell raw lists of proxy IP addresses. Instead, they provide access to their proxy pools via proxy gateways or proxy API endpoints.
We will look at how to integrate with all 3 proxy formats.
If you are looking to find a good proxy provider then check out our web scraping proxy comparison tool where you can compare the plans of all the major proxy providers.
Proxy Integration #1: Rotating Through Proxy IP List
With this approach, the proxy provider gives you a list of proxy IP addresses, and you configure your scraper to rotate through them, selecting a new IP address for every request.
The proxy list you receive will look something like this:
```
'http://Username:Password@85.237.57.198:20000',
'http://Username:Password@85.237.57.198:21000',
'http://Username:Password@85.237.57.198:22000',
'http://Username:Password@85.237.57.198:23000',
```
To integrate them into our Go Colly scraper, we can use Go Colly's proxy switcher functionality. First, install the `proxy` package:

```shell
go get github.com/gocolly/colly/proxy
```

Next, create a rotating proxy switcher and attach it to our collector using `SetProxyFunc()`.
```go
package main

import (
	"bytes"
	"log"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/proxy"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(colly.AllowURLRevisit())

	proxyList := []string{
		"http://Username:Password@85.237.57.198:20000",
		"http://Username:Password@85.237.57.198:21000",
		"http://Username:Password@85.237.57.198:22000",
		"http://Username:Password@85.237.57.198:23000",
	}

	// Create Rotating Proxy Switcher
	rp, err := proxy.RoundRobinProxySwitcher(proxyList...)
	if err != nil {
		log.Fatal(err)
	}

	// Set Collector To Use Proxy Switcher Function
	c.SetProxyFunc(rp)

	// Print the Response
	c.OnResponse(func(r *colly.Response) {
		log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1))
	})

	// On Error Print Error
	c.OnError(func(_ *colly.Response, err error) {
		log.Println("Something went wrong:", err)
	})

	// Fetch httpbin.org/ip five times
	for i := 0; i < 5; i++ {
		c.Visit("http://httpbin.org/ip")
	}
}
```
This is a simplified example: when scraping at scale, you would also need a mechanism to monitor the performance of each individual IP address and remove it from the rotation if it gets banned or blocked.
Proxy Integration #2: Using Proxy Gateway
Increasingly, a lot of proxy providers aren't selling lists of proxy IP addresses anymore. Instead, they give you access to their proxy pools via a proxy gateway.
Here, you only have to integrate a single proxy into your Go Colly scraper and the proxy provider will manage the proxy rotation, selection, cleaning, etc. on their end for you.
This is the most common way to use residential and mobile proxies, and it is becoming increasingly common with datacenter proxies too.
Here is an example of how to integrate BrightData's residential proxy gateway into our Go Colly scraper:
```go
package main

import (
	"bytes"
	"log"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(colly.AllowURLRevisit())

	// Set Proxy (BrightData residential gateway)
	c.SetProxy("http://zproxy.lum-superproxy.io:22225")

	// Print the Response
	c.OnResponse(func(r *colly.Response) {
		log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1))
	})

	// On Error Print Error
	c.OnError(func(_ *colly.Response, err error) {
		log.Println("Something went wrong:", err)
	})

	// Request Page
	c.Visit("http://httpbin.org/ip")
}
```
As you can see, this is easier to integrate than a proxy list, as you don't have to implement any of the proxy rotation logic yourself.