LLM Extract
The ScrapeOps Proxy API Aggregator can use Large Language Models (LLMs) to automatically parse and structure data from web pages. This feature is part of the proxy service and is enabled via query parameters.
LLM extraction uses advanced AI models to intelligently understand and extract structured data from any webpage, eliminating the need for manual CSS selectors or complex parsing logic. It can identify and extract relevant information based on the page type and return it in your preferred format.
Usage Examples
# Extract data from a product page in JSON format
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://example-shop.com/product/123&llm_extract=true&llm_extract_response_type=json&llm_data_schema=product_page"
# Extract data in markdown format with auto-detection
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://example.com/article&llm_extract=true&llm_extract_response_type=markdown"
- Using llm_extract=true will consume 25 additional API Credits per request beyond the standard proxy cost.
- The extracted data is returned in your specified format (json or markdown) with intelligent structuring based on the page content.
Choose JSON format for programmatic processing and integration with your applications. Use markdown format for human-readable content extraction and documentation purposes.
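Handling the two response formats on the client side could look like the minimal sketch below. It assumes that a json response body is a valid JSON document and that a markdown response body is plain text; the exact shape of the returned payload is not specified here, and the helper name parse_extracted is hypothetical.

```python
import json


def parse_extracted(body: str, response_type: str):
    """Decode an LLM Extract response body (hypothetical helper).

    Assumption: a `json` response body is a valid JSON document and a
    `markdown` response body is plain text.
    """
    if response_type == "json":
        # Structured output for programmatic processing.
        return json.loads(body)
    if response_type == "markdown":
        # Human-readable output is returned unchanged.
        return body
    raise ValueError(f"unsupported response type: {response_type!r}")
```

Passing the same response_type value that was sent in llm_extract_response_type keeps the request and the decoding step consistent.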
Parameters
Parameter | Description | Options |
---|---|---|
llm_extract | Enable/disable LLM extraction | true or false |
llm_extract_response_type | Format of the extracted data | json or markdown |
llm_data_schema | Page type for optimized extraction | See supported schemas below |
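As a rough illustration of how these parameters fit together, the sketch below assembles a request URL in Python. The endpoint and parameter names come from the curl examples; the helper name build_llm_extract_url is hypothetical, and percent-encoding the target URL with urlencode is a general precaution assumed here rather than a documented requirement.

```python
from urllib.parse import urlencode

# Endpoint taken from the curl examples in this section.
ENDPOINT = "https://proxy.scrapeops.io/v1/"


def build_llm_extract_url(api_key, target_url, response_type, data_schema=None):
    """Return a request URL with LLM extraction enabled (hypothetical helper)."""
    if response_type not in ("json", "markdown"):
        raise ValueError("llm_extract_response_type must be 'json' or 'markdown'")
    params = {
        "api_key": api_key,
        "url": target_url,
        "llm_extract": "true",
        "llm_extract_response_type": response_type,
    }
    # llm_data_schema is only added when a page type is given.
    if data_schema is not None:
        params["llm_data_schema"] = data_schema
    # urlencode percent-encodes the target URL so its own query string
    # cannot be confused with the proxy parameters.
    return ENDPOINT + "?" + urlencode(params)


print(build_llm_extract_url("YOUR_API_KEY", "https://example.com/article", "markdown"))
```

Leaving data_schema unset mirrors the second curl example above, which relies on auto-detection.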
Supported Page Schemas
The LLM extraction supports various page types optimized for different content structures:
E-Commerce Pages
- product_page - Individual product detail page showing specific item info, price, description, etc.
- product_search_page - Search results page listing multiple products based on user query
- product_reviews_page - Page dedicated to customer reviews/ratings for a specific product
- product_seller_page - Page showing products from a specific seller/vendor/merchant
Search Pages
- serp_search_page - Search engine results page displaying search query results
Company Pages
- company_page - Main company profile/information page
- company_search_page - Page for searching/browsing companies
- company_review_page - Company reviews and ratings
- company_location_page - Company office/branch locations
- company_job_page - Company's career opportunities
- company_social_media_page - Company's social media presence
Job Pages
- job_page - Individual job posting details
- job_search_page - Job listings search interface
- job_advert_page - Featured/sponsored job posting
Real Estate Pages
- real_estate_page - Individual real estate listing
- real_estate_search_page - Real estate search interface
- real_estate_profile_page - Real estate publisher/organization profile
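Checking a schema name on the client before sending a request can catch typos early. The sketch below copies the schema names from the lists above into a lookup table; the category keys and the helper name schema_category are hypothetical and are not part of the API.

```python
# Schema names copied from the lists above, grouped by page category.
# The category keys are labels chosen for this sketch only.
SUPPORTED_SCHEMAS = {
    "e-commerce": [
        "product_page",
        "product_search_page",
        "product_reviews_page",
        "product_seller_page",
    ],
    "search": ["serp_search_page"],
    "company": [
        "company_page",
        "company_search_page",
        "company_review_page",
        "company_location_page",
        "company_job_page",
        "company_social_media_page",
    ],
    "job": ["job_page", "job_search_page", "job_advert_page"],
    "real_estate": [
        "real_estate_page",
        "real_estate_search_page",
        "real_estate_profile_page",
    ],
}


def schema_category(schema_name):
    """Return the category of a listed schema, or raise ValueError."""
    for category, names in SUPPORTED_SCHEMAS.items():
        if schema_name in names:
            return category
    raise ValueError(f"unknown llm_data_schema: {schema_name!r}")
```

A call such as schema_category("job_advert_page") returns "job", while a misspelled name raises ValueError before any credits are spent.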
LLM extraction automatically detects and structures content regardless of the website's layout or CSS structure. It works with dynamic content and can understand context to extract the most relevant information.
If you need to scrape JavaScript-heavy sites, combine LLM extraction with render_js=true. For sites that require residential IP addresses, add residential=true to your requests.
Combined Usage with JavaScript Rendering
If you use render_js=true in addition to LLM extraction, the cost per request increases as follows:
- llm_extract=true&render_js=true will consume 35 API Credits per request (25 for LLM + 10 for JS rendering).
- This combination is perfect for extracting data from single-page applications (SPAs) and sites with dynamic content loading.
Example Usage
# LLM extraction with JavaScript rendering
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://spa-example.com/product/123&llm_extract=true&llm_extract_response_type=json&render_js=true&llm_data_schema=product_page"
# LLM extraction with residential proxies
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://example.com/article&llm_extract=true&llm_extract_response_type=markdown&residential=true"
# LLM extraction with both JS rendering and residential proxies
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://protected-site.com/data&llm_extract=true&llm_extract_response_type=json&render_js=true&residential=true&llm_data_schema=company_page"
These combinations are useful for complex sites that need both intelligent content extraction and advanced proxy features.
Combined Usage with Premium Proxies
You can also combine LLM extraction with premium proxy levels for maximum reliability:
# LLM extraction with premium level 1 proxies
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://example.com/product&llm_extract=true&premium=level_1&llm_data_schema=product_page"
# LLM extraction with premium level 2 proxies and JS rendering
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://complex-site.com/data&llm_extract=true&premium=level_2&render_js=true&llm_extract_response_type=json"
- llm_extract=true&premium=level_1 will consume 26.5 API Credits per request (25 for LLM + 1.5 for premium level 1).
- llm_extract=true&premium=level_2 will consume 28 API Credits per request (25 for LLM + 3 for premium level 2).
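The credit figures quoted in this section can be reproduced with a small estimator. This is a sketch only: it uses just the numbers stated here (25 for LLM, 10 for JS rendering, 1.5 and 3 for premium levels 1 and 2), the function name llm_extract_credits is hypothetical, and because the cost of combining render_js with a premium level is not stated, the function refuses to guess it. The 25-credit figure is described earlier as additional to the standard proxy cost, which is not modelled.

```python
# Figures as stated in this section.
LLM_EXTRACT_CREDITS = 25
RENDER_JS_CREDITS = 10
PREMIUM_CREDITS = {"level_1": 1.5, "level_2": 3}


def llm_extract_credits(render_js=False, premium=None):
    """Return the stated credit figure for an LLM Extract request.

    Only combinations whose cost is given in this section are supported.
    """
    if render_js and premium is not None:
        # This combination appears in an example, but its cost is not stated.
        raise ValueError("cost of render_js combined with premium is not stated")
    total = LLM_EXTRACT_CREDITS
    if render_js:
        total += RENDER_JS_CREDITS
    if premium is not None:
        if premium not in PREMIUM_CREDITS:
            raise ValueError(f"unknown premium level: {premium!r}")
        total += PREMIUM_CREDITS[premium]
    return total


print(llm_extract_credits(render_js=True))      # 35
print(llm_extract_credits(premium="level_1"))   # 26.5
print(llm_extract_credits(premium="level_2"))   # 28
```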