LLM Extract

The ScrapeOps Proxy API Aggregator can extract content with Large Language Models (LLMs), automatically parsing web pages into structured data. The feature is part of the proxy service and is enabled through query parameters.

LLM extraction identifies and extracts structured data from a webpage without hand-written CSS selectors or custom parsing logic. It picks out the relevant information for the page type and returns it in your chosen format.

Usage Examples

# Extract data from a product page in JSON format
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://example-shop.com/product/123&llm_extract=true&llm_extract_response_type=json&llm_data_schema=product_page"

# Extract data in markdown format with auto-detection
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://example.com/article&llm_extract=true&llm_extract_response_type=markdown"
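
The curl commands above pass the target URL unencoded, which is only safe while that URL contains no `&` or `#` of its own. A minimal Python sketch of the same request follows. It assumes only the endpoint and parameters shown above; the helper name `build_llm_extract_url` is hypothetical, and the standard-library `urlencode` does the escaping.

```python
from urllib.parse import urlencode

PROXY_ENDPOINT = "https://proxy.scrapeops.io/v1/"  # endpoint taken from the curl examples


def build_llm_extract_url(api_key, target_url, response_type="json", schema=None):
    """Return the proxy URL for an LLM-extract request (hypothetical helper)."""
    params = {
        "api_key": api_key,
        "url": target_url,  # urlencode escapes '&', '?' and '=' inside the target URL
        "llm_extract": "true",
        "llm_extract_response_type": response_type,
    }
    if schema is not None:  # leave the schema out to rely on auto-detection
        params["llm_data_schema"] = schema
    return PROXY_ENDPOINT + "?" + urlencode(params)


print(build_llm_extract_url(
    "YOUR_API_KEY",
    "https://example-shop.com/product/123?ref=a&v=2",
    schema="product_page",
))
```

The resulting string can be passed to any HTTP client. Without the encoding step, the `&v=2` in the target URL would be read as a separate proxy parameter.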

Using llm_extract=true consumes 25 additional API Credits per request, on top of the standard proxy cost.
The extracted data is returned in the format you specify (json or markdown), structured according to the page content.

Choose JSON format for programmatic processing and integration with your applications. Use markdown format for human-readable content extraction and documentation purposes.
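
The choice of format determines how the response body should be handled on the client side. The sketch below assumes that a `json` response body is JSON text and a `markdown` response body is plain text. The exact layout of the returned data is not described in this section, so the sample payload is made up, and `decode_extract_body` is a hypothetical helper.

```python
import json


def decode_extract_body(body: bytes, response_type: str):
    """Decode a response body according to the requested llm_extract_response_type."""
    text = body.decode("utf-8")
    if response_type == "json":
        return json.loads(text)  # Python dict or list for programmatic use
    if response_type == "markdown":
        return text  # left as-is for human reading
    raise ValueError(f"unsupported response type: {response_type!r}")


# Made-up payload for illustration only, not real API output.
sample = b'{"name": "Example product", "price": "9.99"}'
print(decode_extract_body(sample, "json")["name"])
```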

Parameters

Parameter	Description	Options
`llm_extract`	Enable/disable LLM extraction	`true` or `false`
`llm_extract_response_type`	Format of the extracted data	`json` or `markdown`
`llm_data_schema`	Page type for optimized extraction	See supported schemas below
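
As a rough illustration of the options in the table, the following sketch checks parameter values before a request is sent. The name `check_llm_params` is hypothetical, and only the values listed above are treated as valid.

```python
ALLOWED_VALUES = {
    "llm_extract": {"true", "false"},
    "llm_extract_response_type": {"json", "markdown"},
}


def check_llm_params(params: dict) -> list:
    """Return a list of problems found in the LLM-extract parameters (empty if none)."""
    problems = []
    for name, allowed in ALLOWED_VALUES.items():
        value = params.get(name)
        if value is not None and value not in allowed:
            problems.append(f"{name}={value!r} is not one of {sorted(allowed)}")
    return problems


print(check_llm_params({"llm_extract": "true", "llm_extract_response_type": "xml"}))
```

The check insists on the lowercase strings from the table. A Python `True` would be serialized as `True` in a query string, and whether the API accepts that spelling is not stated here.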

Supported Page Schemas

LLM extraction supports the following page types, each optimized for a different content structure:

E-Commerce Pages

product_page - Individual product detail page showing specific item info, price, description, etc.
product_search_page - Search results page listing multiple products based on user query
product_reviews_page - Page dedicated to customer reviews/ratings for a specific product
product_seller_page - Page showing products from a specific seller/vendor/merchant

Search Pages

serp_search_page - Search engine results page displaying search query results

Company Pages

company_page - Main company profile/information page
company_search_page - Page for searching/browsing companies
company_review_page - Company reviews and ratings
company_location_page - Company office/branch locations
company_job_page - Company's career opportunities
company_social_media_page - Company's social media presence

Job Pages

job_page - Individual job posting details
job_search_page - Job listings search interface
job_advert_page - Featured/sponsored job posting

Real Estate Pages

real_estate_page - Individual real estate listing
real_estate_search_page - Real estate search interface
real_estate_profile_page - Real estate publisher/organization profile
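
The schema names above can be kept in a small lookup so that a typo is caught before a request is sent. This is only a sketch: the grouping mirrors the headings in this section, and `category_of` is a hypothetical helper.

```python
SCHEMAS = {
    "E-Commerce": ["product_page", "product_search_page",
                   "product_reviews_page", "product_seller_page"],
    "Search": ["serp_search_page"],
    "Company": ["company_page", "company_search_page", "company_review_page",
                "company_location_page", "company_job_page",
                "company_social_media_page"],
    "Job": ["job_page", "job_search_page", "job_advert_page"],
    "Real Estate": ["real_estate_page", "real_estate_search_page",
                    "real_estate_profile_page"],
}


def category_of(schema: str):
    """Return the category a schema name belongs to, or None if it is not listed."""
    for category, names in SCHEMAS.items():
        if schema in names:
            return category
    return None


print(category_of("job_advert_page"))
print(category_of("product_review_page"))  # not listed: the product schema is "reviews"
```

Note that the product schema is spelled `product_reviews_page` while the company schema is `company_review_page`, which makes this kind of check worthwhile.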

Intelligent Extraction

LLM extraction automatically detects and structures content regardless of the website's layout or CSS structure. It works with dynamic content and can understand context to extract the most relevant information.

If you need to scrape JavaScript-heavy sites, combine LLM extraction with render_js=true. For sites that require residential IP addresses, add residential=true to your requests.

Combined Usage with JavaScript Rendering

If you use render_js=true in addition to LLM extraction, the cost per request increases as follows:

llm_extract=true&render_js=true consumes 35 API Credits per request (25 for LLM extraction + 10 for JS rendering).
This combination is well suited to single-page applications (SPAs) and sites that load content dynamically.

Example Usage

# LLM extraction with JavaScript rendering
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://spa-example.com/product/123&llm_extract=true&llm_extract_response_type=json&render_js=true&llm_data_schema=product_page"

# LLM extraction with residential proxies
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://example.com/article&llm_extract=true&llm_extract_response_type=markdown&residential=true"

# LLM extraction with both JS rendering and residential proxies
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://protected-site.com/data&llm_extract=true&llm_extract_response_type=json&render_js=true&residential=true&llm_data_schema=company_page"

These combinations help with complex sites that need both LLM extraction and advanced proxy features to be scraped successfully.
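
Since each feature is just another query parameter, an existing proxy URL can be extended one flag at a time. The sketch below uses a hypothetical helper, `with_flags`, that sets the named flags to `true`. It assumes the target URL inside the proxy URL is either URL-encoded or contains no `&` of its own.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_flags(proxy_url: str, *flags: str) -> str:
    """Return proxy_url with each named flag (e.g. 'render_js') set to 'true'."""
    parts = urlsplit(proxy_url)
    params = dict(parse_qsl(parts.query))  # decode the existing query parameters
    for flag in flags:
        params[flag] = "true"
    # Re-encode everything, which also escapes the target URL.
    return urlunsplit(parts._replace(query=urlencode(params)))


base = ("https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY"
        "&url=https://example.com/article"
        "&llm_extract=true&llm_extract_response_type=markdown")
print(with_flags(base, "residential"))
print(with_flags(base, "render_js", "residential"))
```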

Combined Usage with Premium Proxies

You can also combine LLM extraction with premium proxy levels for higher reliability:

# LLM extraction with premium level 1 proxies
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://example.com/product&llm_extract=true&premium=level_1&llm_data_schema=product_page"

# LLM extraction with premium level 2 proxies and JS rendering
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://complex-site.com/data&llm_extract=true&premium=level_2&render_js=true&llm_extract_response_type=json"

llm_extract=true&premium=level_1 consumes 26.5 API Credits per request (25 for LLM extraction + 1.5 for premium level 1).
llm_extract=true&premium=level_2 consumes 28 API Credits per request (25 for LLM extraction + 3 for premium level 2).
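
The credit figures quoted in this document can be collected into one small calculator. This is a sketch that uses only the stated numbers (25 for LLM extraction, 10 for JS rendering, 1.5 and 3 for the premium levels). The document gives no figure for premium proxies combined with JS rendering, so the hypothetical `llm_extract_credits` helper refuses to guess in that case.

```python
LLM_EXTRACT = 25.0  # credits for LLM extraction, as stated in this document
RENDER_JS = 10.0    # credits for JS rendering, as stated in this document
PREMIUM = {"level_1": 1.5, "level_2": 3.0}


def llm_extract_credits(render_js=False, premium=None):
    """Credits for one LLM-extract request, using only the figures quoted here."""
    if render_js and premium is not None:
        # No figure is given for premium + JS rendering together, so do not guess.
        raise ValueError("cost of premium combined with render_js is not documented here")
    if premium is not None and premium not in PREMIUM:
        raise ValueError(f"unknown premium level: {premium!r}")
    total = LLM_EXTRACT
    if render_js:
        total += RENDER_JS
    if premium is not None:
        total += PREMIUM[premium]
    return total


for options in ({}, {"render_js": True}, {"premium": "level_1"}, {"premium": "level_2"}):
    print(options, llm_extract_credits(**options))
```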
