LLM Extract

The ScrapeOps Proxy API Aggregator can extract content with Large Language Models (LLMs), automatically parsing web pages into structured data. The feature is part of the proxy service and is enabled via query parameters.

With LLM extraction, a model identifies and extracts structured data from the page, so you do not need to maintain CSS selectors or custom parsing logic. It selects the relevant information based on the page type and returns it in your preferred format.

Usage Examples

# Extract data from a product page in JSON format
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://example-shop.com/product/123&llm_extract=true&llm_extract_response_type=json&llm_data_schema=product_page"

# Extract data in markdown format, letting the page type be auto-detected (no llm_data_schema)
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://example.com/article&llm_extract=true&llm_extract_response_type=markdown"

  • Setting llm_extract=true consumes 25 additional API Credits per request, on top of the standard proxy cost.
  • The extracted data is returned in the format you specify (json or markdown), structured according to the page content.

Choose JSON format for programmatic processing and integration with your applications. Use markdown format for human-readable content extraction and documentation purposes.

Parameters

Parameter                    Description                           Options
llm_extract                  Enable/disable LLM extraction         true or false
llm_extract_response_type    Format of the extracted data          json or markdown
llm_data_schema              Page type for optimized extraction    See supported schemas below
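
The parameters above can also be assembled programmatically. The following is a minimal sketch, not an official client: the function name build_llm_extract_url and the choice to always send llm_extract_response_type (defaulting to json) are assumptions made for illustration. It uses urlencode so that the target URL is percent-encoded, which matters once the target URL carries its own query string.

```python
from urllib.parse import urlencode

PROXY_ENDPOINT = "https://proxy.scrapeops.io/v1/"
RESPONSE_TYPES = {"json", "markdown"}  # options listed in the parameter table


def build_llm_extract_url(api_key, target_url, response_type="json", data_schema=None):
    """Return a proxy URL with LLM extraction enabled (hypothetical helper)."""
    if response_type not in RESPONSE_TYPES:
        raise ValueError(f"response_type must be one of {sorted(RESPONSE_TYPES)}")
    params = {
        "api_key": api_key,
        "url": target_url,  # urlencode percent-encodes this value
        "llm_extract": "true",
        "llm_extract_response_type": response_type,
    }
    # llm_data_schema is only sent when a page type is given
    if data_schema is not None:
        params["llm_data_schema"] = data_schema
    return PROXY_ENDPOINT + "?" + urlencode(params)


print(build_llm_extract_url(
    "YOUR_API_KEY",
    "https://example-shop.com/product/123",
    data_schema="product_page",
))
```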

Supported Page Schemas

LLM extraction supports the following page schemas, each optimized for a different content structure:

E-Commerce Pages

  • product_page - Individual product detail page showing specific item info, price, description, etc.
  • product_search_page - Search results page listing multiple products based on user query
  • product_reviews_page - Page dedicated to customer reviews/ratings for a specific product
  • product_seller_page - Page showing products from a specific seller/vendor/merchant

Search Pages

  • serp_search_page - Search engine results page displaying search query results

Company Pages

  • company_page - Main company profile/information page
  • company_search_page - Page for searching/browsing companies
  • company_review_page - Company reviews and ratings
  • company_location_page - Company office/branch locations
  • company_job_page - Company's career opportunities
  • company_social_media_page - Company's social media presence

Job Pages

  • job_page - Individual job posting details
  • job_search_page - Job listings search interface
  • job_advert_page - Featured/sponsored job posting

Real Estate Pages

  • real_estate_page - Individual real estate listing
  • real_estate_search_page - Real estate search interface
  • real_estate_profile_page - Real estate publisher/organization profile
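
Because llm_data_schema accepts only the values listed above, it can help to check a value before spending credits on a request. The sketch below simply transcribes the list into a lookup table; the category keys and the schema_category helper are illustrative names, not part of the API.

```python
# Schema names copied from the list above; category keys are informal labels.
SUPPORTED_SCHEMAS = {
    "e-commerce": [
        "product_page",
        "product_search_page",
        "product_reviews_page",
        "product_seller_page",
    ],
    "search": ["serp_search_page"],
    "company": [
        "company_page",
        "company_search_page",
        "company_review_page",
        "company_location_page",
        "company_job_page",
        "company_social_media_page",
    ],
    "job": ["job_page", "job_search_page", "job_advert_page"],
    "real_estate": [
        "real_estate_page",
        "real_estate_search_page",
        "real_estate_profile_page",
    ],
}


def schema_category(schema):
    """Return the category a schema belongs to, or raise ValueError if unknown."""
    for category, schemas in SUPPORTED_SCHEMAS.items():
        if schema in schemas:
            return category
    raise ValueError(f"unsupported llm_data_schema: {schema!r}")


print(schema_category("product_reviews_page"))
```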

Intelligent Extraction

LLM extraction automatically detects and structures content regardless of the website's layout or CSS structure. It works with dynamic content and can understand context to extract the most relevant information.

If you need to scrape JavaScript-heavy sites, combine LLM extraction with render_js=true. For sites that require residential IP addresses, add residential=true to your requests.

Combined Usage with JavaScript Rendering

If you use render_js=true in addition to LLM extraction, the cost per request increases as follows:

  • llm_extract=true&render_js=true will consume 35 API Credits per request (25 for LLM + 10 for JS rendering).
  • This combination is well suited to single-page applications (SPAs) and sites that load content dynamically.

Example Usage

# LLM extraction with JavaScript rendering
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://spa-example.com/product/123&llm_extract=true&llm_extract_response_type=json&render_js=true&llm_data_schema=product_page"

# LLM extraction with residential proxies
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://example.com/article&llm_extract=true&llm_extract_response_type=markdown&residential=true"

# LLM extraction with both JS rendering and residential proxies
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://protected-site.com/data&llm_extract=true&llm_extract_response_type=json&render_js=true&residential=true&llm_data_schema=company_page"

These combinations are useful for sites that need both LLM extraction and additional proxy features, such as JavaScript rendering or residential IPs, to be scraped successfully.
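
The same combinations can be sent from Python. This is a rough sketch using only the standard library; combined_params and fetch are hypothetical helpers, and the 120-second timeout is an arbitrary assumption. Unlike the curl examples, which pass -k, this sketch leaves TLS certificate verification on. The response body is returned as text because its exact structure is not described here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

PROXY_ENDPOINT = "https://proxy.scrapeops.io/v1/"


def combined_params(api_key, target_url, response_type="json",
                    data_schema=None, render_js=False, residential=False):
    """Build query parameters for LLM extraction plus optional proxy features."""
    params = {
        "api_key": api_key,
        "url": target_url,
        "llm_extract": "true",
        "llm_extract_response_type": response_type,
    }
    if data_schema:
        params["llm_data_schema"] = data_schema
    if render_js:
        params["render_js"] = "true"      # for JavaScript-heavy sites
    if residential:
        params["residential"] = "true"    # for sites needing residential IPs
    return params


def fetch(params, timeout=120):
    """Send the request through the proxy and return the response body as text."""
    with urlopen(PROXY_ENDPOINT + "?" + urlencode(params), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Example (not executed here, since it needs a real API key):
# body = fetch(combined_params("YOUR_API_KEY", "https://protected-site.com/data",
#                              data_schema="company_page",
#                              render_js=True, residential=True))
```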

Combined Usage with Premium Proxies

You can also combine LLM extraction with premium proxy levels for maximum reliability:

# LLM extraction with premium level 1 proxies
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://example.com/product&llm_extract=true&premium=level_1&llm_data_schema=product_page"

# LLM extraction with premium level 2 proxies and JS rendering
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://complex-site.com/data&llm_extract=true&premium=level_2&render_js=true&llm_extract_response_type=json"

  • llm_extract=true&premium=level_1 will consume 26.5 API Credits per request (25 for LLM + 1.5 for premium level 1).
  • llm_extract=true&premium=level_2 will consume 28 API Credits per request (25 for LLM + 3 for premium level 2).
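
The credit figures quoted on this page can be reproduced with a small calculation. The sketch below covers only the combinations stated here (LLM extraction alone, or with one of render_js, premium level 1, or premium level 2). The helper name and add-on keys are illustrative, and the cost of stacking several add-ons or of residential=true is not stated on this page, so it is deliberately left out.

```python
LLM_EXTRACT_CREDITS = 25

# Add-on credits as quoted on this page; keys are illustrative labels.
ADD_ON_CREDITS = {
    "render_js": 10,
    "premium_level_1": 1.5,
    "premium_level_2": 3,
}


def llm_request_credits(add_on=None):
    """Credits for llm_extract=true, optionally with one documented add-on."""
    if add_on is None:
        return LLM_EXTRACT_CREDITS
    if add_on not in ADD_ON_CREDITS:
        raise ValueError(f"no credit figure documented for add-on: {add_on!r}")
    return LLM_EXTRACT_CREDITS + ADD_ON_CREDITS[add_on]


for name in (None, "render_js", "premium_level_1", "premium_level_2"):
    print(name, llm_request_credits(name))
```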