Nimble x Orchestra: Powering end-to-end, AI-enabled web scraping
We’re excited to announce our integration and partnership with Orchestra, a data release pipeline tool that helps data teams automate their orchestration, monitoring, and metadata collection across their entire tech stack.
Through this partnership, users will be able to gather and utilize web data with unparalleled accuracy, speed, and reliability. Together, we are committed to continuously innovating in the field of web scraping. Let's dive into the value and use cases of web scraping, and the benefits and applications of our partnership.
The following has been repurposed from Orchestra; head over to their blog to see the original post.
The importance of web scraping in 2024
There are over 50 billion pages on the internet, and that's a lot of data. While incredibly valuable, this information is hard to access from production systems. Changes in HTML structure, the need for proxy network management, and long-running infrastructure requirements mean scraping data at scale is a difficult task, and not for the faint-hearted.
To this point, many intelligence or research providers, hedge funds, and software vendors in the travel market have historically had scores of developers whose role is to manage the infrastructure to scrape data at scale. This powers use cases such as real-time price intelligence, real-time weather updates, search engine optimization monitoring, and the monitoring of reviews.
By combining Nimble with Orchestra, anybody working in data, technical or otherwise, can leverage Nimble for these types of use cases. This unlocks an enormous range of previously inaccessible opportunities for deriving value from data, for teams of any size.
How does Nimble work?
Nimble is an AI-enabled API for web scraping. Nimble handles challenges such as proxy management, concurrency, and HTML parsing. This means users need only define the data source via a URL or list of URLs, provide credentials for a storage option (Amazon S3 or Google Cloud Storage (“GCS”)), and execute jobs on a schedule in order to gather data from the web at scale.
This requires users to write and execute Python code, while asynchronously polling Nimble’s API for job completion.
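For illustration, a minimal sketch of that submit-then-poll pattern is below. The base URL, endpoint paths, payload fields, and response shape are placeholders rather than Nimble's documented API, so treat this as the shape of the workflow, not a working client.

import os
import time

import requests

API_KEY = os.environ["NIMBLE_API_KEY"]
BASE_URL = "https://api.example-nimble.dev"  # placeholder, not Nimble's real endpoint
HEADERS = {"Authorization": f"Basic {API_KEY}"}

# Submit a scraping job (payload fields are illustrative placeholders)
job = requests.post(
    f"{BASE_URL}/jobs",
    headers=HEADERS,
    json={
        "urls": ["https://example.com/products"],
        "storage": {"type": "s3", "bucket": "my-scrapes"},  # hypothetical storage config
    },
    timeout=30,
).json()

# Poll asynchronously until the job finishes
while True:
    status = requests.get(f"{BASE_URL}/jobs/{job['id']}", headers=HEADERS, timeout=30).json()
    if status["state"] in ("completed", "failed"):
        break
    time.sleep(30)  # back off between polls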
Upon job completion, additional tasks such as JSON flattening or machine learning / generative AI model training can be undertaken. Another popular use case is to move this data from object storage (S3/GCS) to a data warehouse environment following some basic transformation, for use in dashboards or other analytical workloads.
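As a sketch of that warehouse-loading step, the snippet below loads newline-delimited JSON from GCS into BigQuery using the google-cloud-bigquery client. The bucket path and destination table are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # infer the schema from the flattened JSON
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Hypothetical bucket path and destination table
load_job = client.load_table_from_uri(
    "gs://my-scrapes/nimble/output/*.json",
    "my-project.web_data.scraped_items",
    job_config=job_config,
)
load_job.result()  # block until the load completes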
An example of how you might design such a data pipeline using an open-source workflow orchestration tool such as Airflow is below.
from datetime import datetime
import json

import requests
from bs4 import BeautifulSoup
from airflow import DAG
from airflow.operators.python import PythonOperator  # airflow.operators.python_operator is deprecated in Airflow 2.x

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 5, 8),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
}

def scrape_website():
    # Fetch and parse the target page
    url = 'YOUR_WEBSITE_URL_HERE'
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract one record per matching element
    data = []
    for item in soup.find_all('YOUR_HTML_TAG'):
        title = item.find('YOUR_TITLE_TAG').text.strip()
        description = item.find('YOUR_DESCRIPTION_TAG').text.strip()
        link = item.find('YOUR_LINK_TAG')['href']

        data.append({
            'title': title,
            'description': description,
            'link': link,
            # Add more fields as needed
        })
    return data

def flatten_json():
    # Scrape, then flatten each record into a single-level dictionary
    scraped_data = scrape_website()
    flattened_data = []
    for item in scraped_data:
        flattened_data.append({
            'title': item['title'],
            'description': item['description'],
            'link': item['link'],
            # Flatten more fields if needed
        })

    # Write the flattened records to disk
    with open('/path/to/output.json', 'w') as json_file:
        json.dump(flattened_data, json_file)

with DAG('webscraper_dag',
         default_args=default_args,
         description='A DAG to scrape a website and flatten JSON',
         schedule_interval='@daily') as dag:

    # Defining the task inside the DAG context registers it with the DAG;
    # the operator is not called directly.
    scrape_and_flatten = PythonOperator(
        task_id='scrape_and_flatten',
        python_callable=flatten_json,
    )
The benefits of Orchestra and Nimble
Orchestra’s integration makes this process significantly easier. We’ve built a market-leading integration with Nimble that means users need only define the frequency, data type, and storage location in order to get started.
This means there is a minimal learning curve required to make full use of Nimble’s AI web scraping capabilities, in stark contrast to trying to scrape the web at scale using a combination of Beautiful Soup, Selenium, and self-managed infrastructure.
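To give a feel for how little configuration is involved, here is an illustrative sketch of those three inputs as a plain Python dictionary. This is not Orchestra's actual configuration schema; it simply shows the shape of the information a user supplies.

# Illustrative only: not Orchestra's real configuration schema.
# These are the three inputs the integration asks for.
nimble_pipeline = {
    "frequency": "@daily",        # how often to run the scrape
    "data_type": "ecommerce",     # which Nimble API/parser to use
    "storage": {
        "provider": "s3",         # or "gcs"
        "bucket": "my-scrapes",   # hypothetical bucket
        "prefix": "nimble/output/",
    },
}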
Secondly, by handling the execution, monitoring, and alerting for the end-to-end data pipeline, Orchestra can save hundreds of engineering hours spent writing the boilerplate infrastructure required to interact robustly and efficiently with Nimble’s API. The architecture comparison diagram below illustrates this:
With a traditional architecture, engineering teams are required to maintain orchestration tools, Kubernetes clusters, EC2 instances, logging, proxy networks, and plugins. This is a huge drain on resources that makes web scraping at scale impossible for small to medium-sized teams. Data teams leveraging Nimble with Orchestra need only define basic parameters, and clean, flattened data can be pushed to warehouse environments with minimal lift while retaining all the good bits.
Finally, leveraging Orchestra is the most reliable way to ensure data is scraped accurately and consistently. By integrating deeply with Nimble’s API and handling mechanisms such as retries and timeouts, Orchestra provides an additional layer of defense against DAG failure.
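As a rough sketch of the kind of boilerplate this replaces, here is a minimal retry-with-backoff wrapper around an HTTP poll, the sort of defensive logic Orchestra implements for you. The function and its parameters are illustrative.

import time

import requests

def poll_with_retries(url, headers, max_retries=5, timeout=30):
    """Retry transient failures with exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()
            return response.json()
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == max_retries - 1:
                raise  # out of retries: let the failure surface
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...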
Integrating Nimble into Orchestra’s data lineage and Data Asset Graph is also extremely valuable. Orchestra’s engine is context/asset-aware, which ensures files scraped using Nimble render as Data Assets underpinning models in S3, Snowflake, Databricks, BigQuery, and so on.
Examples of leveraging AI-powered web scraping
Intelligence gathering
Building a pipeline to gather near real-time pricing information using Nimble and Orchestra takes minutes. You can read more about how to implement an intelligence gathering pipeline here.
Intelligence gathering can be used to scrape review data from websites for sentiment analysis, or to collect data from other public websites. This data can then be enriched and aggregated over time to build valuable predictive data products.
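As a concrete example of one downstream task, the sketch below scores scraped review text for sentiment using NLTK's VADER analyzer; the review records are hypothetical stand-ins for data gathered by the pipeline.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download
analyzer = SentimentIntensityAnalyzer()

# Hypothetical flattened review records scraped earlier in the pipeline
reviews = [
    {"title": "Great value", "description": "Arrived fast, works perfectly."},
    {"title": "Disappointed", "description": "Broke after two days of use."},
]

for review in reviews:
    scores = analyzer.polarity_scores(review["description"])
    review["sentiment"] = scores["compound"]  # -1 (negative) to +1 (positive)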
Real-time pricing information
Nimble’s e-Commerce API focuses on parsing data specifically from marketplaces and other retail-style websites. These are distinctive because they generally share a common HTML structure: cards with images, prices, descriptions, SKU numbers, and so on.
Companies in the retail or e-Commerce spaces can leverage Nimble and Orchestra to instantly gather real-time pricing data from the market in order to identify opportunities and trends. This represents an enormous opportunity to maximize revenue in real time during periods of high traffic such as Black Friday.
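As a simple illustration, scraped competitor prices can be joined against an internal catalog to flag repricing opportunities. The DataFrames and column names below are hypothetical.

import pandas as pd

# Hypothetical frames: your catalog vs. competitor prices scraped via Nimble
own = pd.DataFrame({"sku": ["A1", "B2"], "our_price": [19.99, 49.99]})
scraped = pd.DataFrame({"sku": ["A1", "B2"], "competitor_price": [17.49, 54.99]})

merged = own.merge(scraped, on="sku")
merged["delta"] = merged["our_price"] - merged["competitor_price"]

# Flag SKUs where we are undercut by more than 5%
undercut = merged[merged["delta"] > 0.05 * merged["competitor_price"]]
print(undercut)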
While the e-commerce industry has extremely sophisticated tools for analyzing internal data and web traffic, the extent to which these make use of external data is much lower. This changes now with the advent of AI-powered web scraping and enterprise-grade, near real-time data orchestration.
Conclusion
We’re thrilled to be working with the team at Orchestra and we cannot wait to see what you build with our integrated partnership.
If you’re interested in learning more or trying out Nimble for yourself, get in touch with our team for a demo or sign up for a free trial.
You can also reach out to Orchestra here and sign up for their trial here.