PycURL: Guide to Using cUrl With Python

In this guide for The Python Web Scraping Playbook, we will show you how to use cURL in Python using PycURL.

PycURL is a Python interface to libcurl. PycURL is targeted at advanced developers as it exposes most of the functionality that libcurl has to offer, making it a great choice if you need dozens of concurrent, fast and reliable connections or any of the sophisticated features that libcurl offers.

In this guide we will walk you through installing PycURL, making GET and POST requests, following redirects, writing responses to files, setting headers and user agents, and using proxies.

Let's begin...



Why Use PycURL?

PycURL is a thin Python layer over libcurl, the multiprotocol file transfer library that gives you deep, low-level control over how you make requests.

As PycURL is a relatively thin layer over libcurl, it doesn't have the nice Pythonic class hierarchies and user-experience features that are standard across many other Python HTTP client libraries like Python Requests, Python HTTPX, and Python aiohttp. This gives it a steeper learning curve.

However, what PycURL and libcurl lack in ease of use, they more than make up for in their feature set and customisability.

  • Multiple Protocols: PycURL supports not only HTTP/HTTPS but also DICT, FILE, FTP, FTPS, Gopher, IMAP, IMAPS, LDAP, LDAPS, POP3, POP3S, RTMP, RTSP, SCP, SFTP, SMTP, SMTPS, Telnet and TFTP.
  • Speed: PycURL has been shown to be several times faster than Requests in benchmarks.
  • More Features + Low-Level Control: PycURL gives you more low-level control and features, like the ability to use several TLS backends, more authentication options, and I/O multiplexing via the libcurl multi interface (see the sketch below).
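
As a taste of that low-level control, here is a minimal sketch of fetching several URLs concurrently with the libcurl multi interface via pycurl.CurlMulti. The URLs are just examples, and the driving loop is a simplified version of the usual multi-interface pattern:


import pycurl
import certifi
from io import BytesIO

urls = ['http://pycurl.io/', 'https://httpbin.org/get']

## Create a multi handle plus one easy handle per URL
multi = pycurl.CurlMulti()
handles = []
for url in urls:
    buffer = BytesIO()
    c = pycurl.Curl()
    c.setopt(c.URL, url)
    c.setopt(c.WRITEDATA, buffer)
    c.setopt(c.CAINFO, certifi.where())
    multi.add_handle(c)
    handles.append((c, buffer))

## Drive all transfers until none remain active
while True:
    ret, num_active = multi.perform()
    if ret != pycurl.E_CALL_MULTI_PERFORM:
        if num_active == 0:
            break
        multi.select(1.0)

## Clean up and read the buffered responses
for c, buffer in handles:
    multi.remove_handle(c)
    c.close()
    print(len(buffer.getvalue()), 'bytes received')
multi.close()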

An HTTP client library like Python Requests is generally easier to learn and use than PycURL, meaning PycURL is better suited to more advanced use cases which require more fine-grained control over how requests are being made.


Installing PycURL

As PycURL requires some C extensions, its installation can be complex (depending on your operating system). However, using pip should work for most users.


pip install pycurl

If this does not work, please see the PycURL Installation docs.

For a lot of applications you will also need to install Certifi, a library that provides Mozilla’s root certificates for SSL verification.


pip install certifi

PycURL does not provide security certificate bundles as they change over time. Some operating systems do provide them; if yours does not, you can use the Certifi Python package. This will allow you to access HTTPS servers.


Making GET Requests With PycURL

Making GET requests to HTTP servers is trickier with PycURL than with a library like Requests, however, it is still pretty straightforward once you get the hang of it.


import pycurl
import certifi
from io import BytesIO

## Create PycURL instance
c = pycurl.Curl()

## Define Options - Set URL we want to request
c.setopt(c.URL, 'http://pycurl.io/')

## Setup buffer to receive response
buffer = BytesIO()
c.setopt(c.WRITEDATA, buffer)

## Setup SSL certificates
c.setopt(c.CAINFO, certifi.where())

## Make Request
c.perform()

## Close Connection
c.close()

## Retrieve the content BytesIO & Decode
body = buffer.getvalue()
print(body.decode('iso-8859-1'))

Let's go through this code step by step so that we understand what is happening here:

  1. PycURL Instance: Using c = pycurl.Curl() we create a PycURL instance.
  2. Options: Use setopt to set options like the URL we want to scrape. Full list of options here.
  3. Buffer: PycURL does not provide storage for network responses, so we must set up a buffer buffer = BytesIO() and instruct PycURL to write to that buffer with c.setopt(c.WRITEDATA, buffer).
  4. SSL Certs: Set the filename holding the SSL certificates using the certifi library.
  5. Make Request: To make the request we use c.perform() and to close the connection we use c.close().
  6. Content: We then need to retrieve and decode the response from the buffer we defined.

Accessing Response Details

To access details about the curl session in PycURL you need to use c.getinfo(). With it you can get information like the response status code, the final URL, etc.

To access this data you must use c.getinfo() before you close the connection with c.close().


import pycurl
import certifi
from io import BytesIO

## Create PycURL instance
c = pycurl.Curl()

## Define Options - Set URL we want to request
c.setopt(c.URL, 'http://pycurl.io/')

## Setup buffer to receive response
buffer = BytesIO()
c.setopt(c.WRITEDATA, buffer)

## Setup SSL certificates
c.setopt(c.CAINFO, certifi.where())

## Make Request
c.perform()

## Response Status Code
print('Response Code:', c.getinfo(c.RESPONSE_CODE))

## Final URL
print('Response URL:', c.getinfo(c.EFFECTIVE_URL))

## Cert Info (only populated for HTTPS URLs, and requires enabling the OPT_CERTINFO option before the request)
print('Response Cert Info:', c.getinfo(c.INFO_CERTINFO))

## Close Connection
c.close()

## Retrieve the content BytesIO & Decode
body = buffer.getvalue()
print(body.decode('iso-8859-1'))


Making POST Requests With PycURL

You can also make POST requests to servers with PycURL using the option c.setopt(c.POSTFIELDS, post_data).

To use this we first need to encode the post body we want to send using urlencode and then pass that to our PycURL instance.

By using c.setopt(c.POSTFIELDS, post_data) we are telling PycURL to send the data with Content-Type equal to application/x-www-form-urlencoded.


import pycurl
import certifi
from io import BytesIO
from urllib.parse import urlencode

## Create PycURL instance
c = pycurl.Curl()

## Define Options - Set URL we want to request
c.setopt(c.URL, 'https://httpbin.org/post')

# Setting POST Data + Encoding Data
post_body = {'test': 'value'}
post_data = urlencode(post_body)
c.setopt(c.POSTFIELDS, post_data)

## Setup buffer to receive response
buffer = BytesIO()
c.setopt(c.WRITEDATA, buffer)

## Setup SSL certificates
c.setopt(c.CAINFO, certifi.where())

## Make Request
c.perform()

## Close Connection
c.close()

## Retrieve the content BytesIO & Decode
body = buffer.getvalue()
print(body.decode('iso-8859-1'))

To send the post body as JSON, we would need to set 'Content-Type: application/json' in the headers.


c.setopt(c.HTTPHEADER, ['Accept: application/json', 'Content-Type: application/json'])
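
For example, here is a minimal sketch of a JSON POST to https://httpbin.org/post. The payload is just an example, serialised with json.dumps before being passed to POSTFIELDS:


import pycurl
import certifi
import json
from io import BytesIO

c = pycurl.Curl()
c.setopt(c.URL, 'https://httpbin.org/post')

## Serialise the body as JSON and set the matching headers
post_data = json.dumps({'test': 'value'})
c.setopt(c.POSTFIELDS, post_data)
c.setopt(c.HTTPHEADER, ['Accept: application/json', 'Content-Type: application/json'])

## Setup buffer to receive response
buffer = BytesIO()
c.setopt(c.WRITEDATA, buffer)

## Setup SSL certificates
c.setopt(c.CAINFO, certifi.where())

c.perform()
c.close()

print(buffer.getvalue().decode('utf-8'))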

More information on PycURL's POST request functionality can be found here.


Follow Redirects With PycURL

By default PycURL doesn't follow redirects, however, you can enable redirect following by using the FOLLOWLOCATION option.


# Follow Redirects
c.setopt(c.FOLLOWLOCATION, True)
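
Putting that together, here is a minimal sketch of a request that follows redirects. The https://httpbin.org/redirect/2 endpoint is just a convenient test target that redirects twice, and MAXREDIRS (which caps how many redirects libcurl will follow) is optional:


import pycurl
import certifi
from io import BytesIO

c = pycurl.Curl()
c.setopt(c.URL, 'https://httpbin.org/redirect/2')

## Follow redirects, up to a maximum of 5
c.setopt(c.FOLLOWLOCATION, True)
c.setopt(c.MAXREDIRS, 5)

buffer = BytesIO()
c.setopt(c.WRITEDATA, buffer)
c.setopt(c.CAINFO, certifi.where())

c.perform()

## EFFECTIVE_URL shows the final URL after all redirects
print('Final URL:', c.getinfo(c.EFFECTIVE_URL))
c.close()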


Writing Data To Files

With PycURL we can write data directly to a file without having to decode it, as long as the file has been opened in binary mode.


import pycurl

"""

As long as the file is opened in binary mode, both Python 2 and Python 3
can write response body to it without decoding.

"""

with open('output.html', 'wb') as f:
c = pycurl.Curl()
c.setopt(c.URL, 'http://pycurl.io/')
c.setopt(c.WRITEDATA, f)
c.perform()
c.close()


Setting Headers & User Agents

To add headers and user-agents to your PycURL requests we just need to use the HTTPHEADER option:


c.setopt(c.HTTPHEADER, ['Accept: application/json', 'User-Agent: Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'])

Or if you just want to set the user agent, you can use the USERAGENT option. Note that the value should be the user-agent string itself, without the User-Agent: prefix.


c.setopt(c.USERAGENT, 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148')
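
To verify what is actually being sent, here is a minimal sketch that sends custom headers to https://httpbin.org/headers, an endpoint that simply echoes the request headers back as JSON (it is just a convenient test target):


import pycurl
import certifi
from io import BytesIO

c = pycurl.Curl()
c.setopt(c.URL, 'https://httpbin.org/headers')

## Custom headers, including the user agent
c.setopt(c.HTTPHEADER, [
    'Accept: application/json',
    'User-Agent: Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148',
])

buffer = BytesIO()
c.setopt(c.WRITEDATA, buffer)
c.setopt(c.CAINFO, certifi.where())

c.perform()
c.close()

## httpbin echoes back the headers it received
print(buffer.getvalue().decode('utf-8'))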


Using Proxies With PycURL

PycURL also lets you route your requests through proxy servers if you would like to hide your IP address.


# Set Proxy
c.setopt(pycurl.PROXY, f"https://{host}:{port}")

# Proxy Auth (If Needed)
c.setopt(pycurl.PROXYUSERPWD, f"{username}:{password}")

# Set Proxy Type = "HTTPS"
c.setopt(pycurl.PROXYTYPE, 2)

# Set Proxy as Insecure If Required
c.setopt(c.PROXY_SSL_VERIFYHOST, 0)
c.setopt(c.PROXY_SSL_VERIFYPEER, 0)
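
Putting the pieces together, here is a minimal sketch of routing a GET request through a proxy. The proxy host, port, username and password below are placeholders you would replace with your own proxy details:


import pycurl
import certifi
from io import BytesIO

## Placeholder proxy details - replace with your own
host = 'proxy.example.com'
port = 8080
username = 'proxy_user'
password = 'proxy_pass'

c = pycurl.Curl()
c.setopt(c.URL, 'https://httpbin.org/ip')

## Route the request through the proxy
c.setopt(pycurl.PROXY, f"https://{host}:{port}")
c.setopt(pycurl.PROXYUSERPWD, f"{username}:{password}")
c.setopt(pycurl.PROXYTYPE, 2)

buffer = BytesIO()
c.setopt(c.WRITEDATA, buffer)
c.setopt(c.CAINFO, certifi.where())

c.perform()
c.close()

## httpbin.org/ip returns the IP address the request came from
print(buffer.getvalue().decode('utf-8'))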


More Web Scraping Tutorials

So that's an introduction to PycURL.

If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.

Or check out one of our more in-depth guides: