Crawling & Collecting Data in 2024
Welcome to our Definitive Guide to Crawling & Collecting Data in 2024! In this guide, we’ll expand on the foundations established in our Definitive Guide to Web Scraping and provide real-world use cases, explanations of the latest trends, and a more thorough examination of the current challenges facing the industry.
In our Definitive Guide to Web Scraping, we introduced the CCCD framework as a tool to help data engineers understand the full scope of the web scraping process and the unique challenges, tools, and methodologies of each phase. Crawling and Collection are the first two phases of the CCCD framework, taking us from the initial discovery of relevant data through to its extraction at scale.
Crawling
In their quest to deliver effective data products, data leaders and web scraping teams seek out the sources best positioned to deliver business impact. Optimal sources are identified according to several criteria, including data relevance, accuracy, and freshness, as well as technical aspects such as the complexity of anti-bot systems and other complicating factors.
Once the target sources are defined by the data or business leader, the engineering team begins preparing for web scraping operations, starting with crawling. Websites typically lack clear database schemas, so the engineering team must devise the logic to create one. Creating this schema is equal parts science and art: it essentially involves translating unstructured website data into a definable index that satisfies the organization’s business logic.
The crawling approach varies depending on the website's structure. For instance, crawling a single-page application requires different techniques than those used for traditional PHP server-based applications. The success of crawling is measured by the completeness of the data (such as covering all product listings), runtime, server costs, and reliability over time.
Let's explore various crawling techniques used to index relevant data source URLs for the collection phase:
- Search Engine Utilization: This method employs search engines like www.google.com to discover URLs across the web and identify target websites for scraping. For instance, if the goal is to scrape real estate property data in Chicago, a site like http://www.zillow.com can serve as an ideal primary source for simplicity. In a real-world scenario, however, queries such as 'Real estate properties in Chicago' can be run through a search engine like Google, and the SERP can then be crawled and analyzed to identify additional relevant websites.
- Sitemap Crawling: This technique utilizes the sitemap provided by the website: crawlers systematically access and navigate the listed URLs to retrieve the structured outline of pages across the site. For instance, in a scenario focused on comprehensive crawling, a sitemap URL like http://mortgageapi.zillow.com/sitemap.xml can be utilized for thorough exploration of the site’s content.
- Website Search-Based Crawling: This method leverages a website's own search functionality to discover URLs by submitting queries and crawling the resulting pages for links and data. For example, using Zillow’s search functionality, a crawler can submit queries like 'Homes in Chicago' and then crawl the resulting pages to gather links to individual listings. This approach is effective for focusing on specific sections of a website and uncovering specialized URLs, such as Zillow’s Chicago listings (http://www.zillow.com/chicago-il/).
- URL Formatting: This technique involves crafting URLs to discover and access data on different web pages by modifying or appending certain parameters to the base URL. It facilitates the retrieval of related content across various pages on the web and is particularly useful for managing URL complexities like pagination and parameter filtering.
For example, to ensure comprehensive data retrieval on a site like Zillow, understanding pagination and parameter filtering is crucial. Zillow uses a URL format of http://www.zillow.com/chicago-il/1_p/ for page 1, http://www.zillow.com/chicago-il/2_p/ for page 2, and so on, demonstrating how URL formatting can be used to walk through paginated results. A short sketch of the sitemap and URL-formatting techniques follows below.
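To make the sitemap and URL-formatting techniques concrete, here is a minimal Python sketch that builds paginated listing URLs and extracts URLs from an XML sitemap. It reuses the example URLs quoted above purely for illustration and assumes the pages respond to plain requests; in practice, large sites typically require additional headers, proxies, or an unblocking solution.

```python
import xml.etree.ElementTree as ET

import requests

BASE_URL = "http://www.zillow.com/chicago-il/{page}_p/"    # pagination pattern from the example above
SITEMAP_URL = "http://mortgageapi.zillow.com/sitemap.xml"  # sitemap URL from the example above


def paginated_urls(max_pages):
    """Build listing-page URLs by formatting the page number into the base URL."""
    return [BASE_URL.format(page=page) for page in range(1, max_pages + 1)]


def sitemap_urls(sitemap_url):
    """Fetch an XML sitemap and return every <loc> entry it lists."""
    response = requests.get(sitemap_url, timeout=30)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    # Sitemaps use the sitemaps.org namespace; each <loc> element holds one URL.
    namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in root.findall(".//sm:loc", namespace)]


if __name__ == "__main__":
    print(paginated_urls(3))               # first three paginated listing pages
    print(sitemap_urls(SITEMAP_URL)[:10])  # first ten URLs found in the sitemap
```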
Web crawlers are typically configured to follow links on discovered pages that may lead to other relevant pages, and in doing so execute the classic “crawling” methodology of mapping out the pages and structure of websites.
To continue our Zillow example, our crawler could follow links from the paginated listing pages to individual property detail pages. For example, it would follow links to pages like http://www.zillow.com/homedetails/1814-W-Newport-Ave-Chicago-IL-60657/3710168_zpid/ to gather detailed information on each property. These links are then fed into the upcoming Collection phase for data extraction.
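As a rough sketch of this link-following pattern (the '/homedetails/' filter simply mirrors the example URL above, and a plain request to a heavily protected site may well be blocked), a crawler could gather property-detail links from each paginated listing page like this:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def extract_detail_links(listing_url):
    """Fetch one listing page and return the property-detail links it contains."""
    response = requests.get(listing_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    links = set()
    for anchor in soup.select("a[href]"):
        href = urljoin(listing_url, anchor["href"])
        # The '/homedetails/' path filter mirrors the example URL above; a real
        # crawler would tune this to the target site's actual link structure.
        if "/homedetails/" in href:
            links.add(href)
    return links


# Feed each paginated listing page into the crawler and queue the results
# for the Collection phase.
detail_urls = set()
for page_url in ["http://www.zillow.com/chicago-il/1_p/",
                 "http://www.zillow.com/chicago-il/2_p/"]:
    detail_urls |= extract_detail_links(page_url)
```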
No crawling technique is inherently superior to the others; each has unique advantages depending on the task at hand. This highlights the flexible nature of web scraping in navigating the vast world of data.
Transitioning to Advanced Crawling Tools
The advent of Large Language Models (LLMs) and AI-powered crawling tools like Diffbot heralds a new era of web crawling, offering a chance to streamline data collection endeavors. Diffbot utilizes AI and machine learning to automate data extraction, making it a powerful solution for collecting data at scale, particularly with its Automatic APIs that can identify and extract data based on the page type.
SERP APIs like Nimble and SerpApi streamline data extraction from search engine pages, significantly enhancing crawling capabilities for scraping projects. For example, by inputting a relevant search phrase such as 'Chicago real estate listings' into one of these APIs, users can quickly obtain a list of URLs from a search engine's index. These URLs lead to pages containing the desired data, which can then be scraped directly, or the hosting websites can be traversed further to find more in-depth information or additional listings, making the data collection process more robust and comprehensive.
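Whichever provider you choose, the flow looks roughly like the sketch below. The endpoint, parameters, and response fields are placeholders rather than any vendor's actual API, so consult the provider's documentation for the real interface.

```python
import requests

SERP_API_ENDPOINT = "https://api.example-serp-provider.com/search"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                                            # placeholder credential


def discover_urls(query):
    """Send a search phrase to a SERP API and return the organic result URLs."""
    response = requests.get(
        SERP_API_ENDPOINT,
        params={"q": query, "api_key": API_KEY},
        timeout=30,
    )
    response.raise_for_status()
    payload = response.json()
    # 'results' and 'url' are placeholder field names; real providers document
    # their own response schema.
    return [item["url"] for item in payload.get("results", [])]


urls = discover_urls("Chicago real estate listings")
# Each URL can now be scraped directly, or its hosting site crawled further
# for deeper listings.
```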
Other popular crawling solutions include:
- Open Source Crawling Frameworks: Scrapy
- Browser Automation Tools: Selenium, Puppeteer
- Managed Crawling Services: Import.io, Octoparse
- Cloud-Based Crawling Solutions: Crawlera (by Scrapinghub)
In the pursuit of insights, careful selection of crawling methods and tools, tailored to the structure of the target websites and data sources, is pivotal. It ensures a smooth and effective data retrieval process and sets a strong foundation for the upcoming stages of collection, cleaning, and debugging.
Collection
Having successfully compiled relevant URLs during the Crawling phase, the CCCD framework guides us to the next phase - Collection. In this phase, the focus is shifted from identifying and accessing web pages to extracting data from them for further analysis.
To do so effectively, data teams need to contend with a series of challenges including scalability, JS rendering, and bot detection mechanisms. Let’s take a deeper look at each of these challenges, as well as the strategies and technologies being deployed to address them.
Collection Challenges
Collection at Scale
Efficiently scaling web scraping collection operations involves navigating a range of technical and logistical challenges.
- Resource and System Scalability: The efficiency of collection at scale hinges on the strategic allocation of computational resources and the orchestration of scraping infrastructure. This involves not only managing server loads and bandwidth but also ensuring seamless coordination across distributed systems. The challenge is to maintain this balance to support scalable data collection, all while adapting to dynamic web environments and preserving data integrity.
- Scraper Maintenance and Adaptability: Websites frequently update their structures, necessitating regular updates to scraping scripts. Adapting to these changes quickly is crucial to maintain data accuracy and access.
- Data Storage and Processing: Storing and processing large datasets demand robust infrastructure. This can be expensive and requires efficient database management and processing systems to handle the volume and variety of data.
- Managing Billing Costs: Large-scale scraping can incur significant costs, particularly when using cloud services or paid proxy servers. Efficiently managing these costs is crucial to maintaining a cost-effective scraping operation.
To effectively tackle the challenges of large-scale collection, various specialized solutions have been developed:
Nimble's serverless, fully-managed cloud API addresses resource management and scalability. It dynamically adjusts to changing workloads and system demands, making it ideal for extensive scraping operations.
Bright Data, meanwhile, focuses on scraper maintenance and adaptability. Its automated script updates and AI-driven monitoring systems are essential for quickly adapting to website changes and ensuring data accuracy.
For data storage and processing, cloud-based solutions like Snowflake and Amazon S3 offer a robust, scalable infrastructure, crucial for efficiently handling large datasets.
Additionally, Bright Data and Nimble provide cost-optimization features for cloud services and proxy usage, helping manage billing costs. This includes monitoring and adjusting resource usage to control expenses effectively.
JS Rendering
The growing use of JavaScript for dynamic content loading on websites poses significant challenges for traditional scraping methods that rely on static HTML. This shift complicates data extraction, creating interpretative and interactional hurdles for scraping tools:
- Complexity in Handling Dynamic Content: Websites increasingly use JavaScript to load content dynamically, making it difficult for traditional scrapers that rely on static HTML to extract data.
- Increased Use of Advanced JS Frameworks: Modern web applications often use complex JavaScript frameworks and libraries, which can pose challenges for scraping tools to interpret and interact with the rendered content.
- Execution Time and Resource Consumption: Scraping JavaScript-heavy websites often requires the use of headless browsers, which can be more resource-intensive and slower compared to scraping static content.
- Detecting and Handling AJAX Calls: Asynchronous JavaScript and XML (AJAX) are commonly used in web applications for dynamic content loading. Detecting and properly handling these AJAX calls during scraping can be challenging.
- Rendering Completeness Issues: Ensuring that all necessary content is fully rendered and available before scraping can be a challenge, especially with sites that have complex loading mechanisms or rely heavily on user interactions to display content.
These challenges collectively increase the complexity of scraping processes, necessitating more sophisticated tools and techniques. Tools like Selenium, Puppeteer, and Playwright have been relied on heavily in this context, offering robust environments that effectively interact with JavaScript-heavy sites. These headless browsers are capable of processing JavaScript and managing AJAX calls effectively, but this comes at a significant cost in resource consumption and stability.
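As a minimal illustration of the headless-browser approach (the URL and selector below are hypothetical placeholders), a Playwright script can wait for dynamically rendered content before reading the page:

```python
from playwright.sync_api import sync_playwright

URL = "https://www.example.com/listings"  # hypothetical JS-heavy page
SELECTOR = "div.listing-card"             # hypothetical element rendered by JavaScript

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")  # let pending AJAX calls settle
    page.wait_for_selector(SELECTOR)          # confirm the dynamic content has rendered
    html = page.content()                     # full post-render HTML, ready for parsing
    browser.close()
```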
Bot Detection Systems
Headless browsers may have solved many issues related to JavaScript rendering and webpage navigation, but their popularity led to a rise in bot detection systems that identify and challenge suspected headless browsers. Increasingly sophisticated anti-bot measures, such as those implemented by Cloudflare and DataDome, include CAPTCHA tests, rate limiting, and other countermeasures.
Bot Detection Mechanisms:
- JavaScript Challenge: Cloudflare presents a JavaScript challenge to visitors, requiring them to solve a math problem and return the result. Bots lacking a JavaScript engine often fail these challenges, assisting in bot identification.
- Browser Fingerprinting: DataDome uses browser fingerprinting to distinguish bots from genuine users, for example detecting Chrome browsers driven by the Selenium framework through client-side JavaScript (one such signal is shown in the sketch after this list).
- TLS Fingerprinting: Employing machine learning, DataDome detects inconsistencies in new TLS fingerprints using signals from other fingerprinting methods, aiding in bot detection.
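To make the browser-fingerprinting idea concrete, one widely checked client-side signal is the navigator.webdriver property, which off-the-shelf automated browsers typically report as true. The sketch below (using Playwright against a hypothetical page) simply reads that flag:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.example.com")  # hypothetical target page
    # Off-the-shelf automated browsers typically report navigator.webdriver as
    # true, a property that fingerprinting scripts read as a strong bot signal.
    is_automated = page.evaluate("() => navigator.webdriver")
    print("navigator.webdriver:", is_automated)
    browser.close()
```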
AI-Powered Bypass
Once again, AI is proving to be a pivotal technology in overcoming web scraping challenges. Building on the transformative impact of generative AI in web scraping, tools are now utilizing AI to generate synthetic fingerprints, mimicking human-like browsing profiles to evade anti-bot systems.
Fingerprints vs Synthetic Fingerprints
In web browsing, a fingerprint serves as a unique identifier, typically generated by aggregating various attributes and characteristics, such as IP address, browser type, and operating system, associated with a user or a browser. Websites and online platforms utilize these fingerprints to identify, track, and sometimes block or restrict users, especially automated bots used in web scraping.
Synthetic fingerprints are artificially generated identifiers that mimic the attributes of a real user or browser. By crafting a synthetic fingerprint, web scraping tools can present themselves as genuine users to the websites they are interacting with, thus evading detection mechanisms that aim to block automated data collection activities.
A telltale fingerprint characteristic is an attribute that reveals unique details about the user or browser, making it possible for websites to track or identify users. The User-Agent string is a prime example, disclosing the browser type, version, and operating system of a user. A simplistic User-Agent string like:
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
reveals a user operating on Windows 10 with a Chrome browser version 58.0. When crafting synthetic fingerprints, it's crucial to adeptly mimic such details to avoid detection by anti-bot measures, ensuring successful data collection endeavors in web scraping.
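For instance, even the default User-Agent sent by a common HTTP library is a telltale sign on its own. The brief sketch below contrasts an out-of-the-box request with one carrying the browser-like User-Agent string from above (overriding a single header is, of course, only one small piece of a consistent synthetic fingerprint); https://httpbin.org/headers is used here simply because it echoes back the headers it receives.

```python
import requests

URL = "https://httpbin.org/headers"  # echoes back the request headers it receives

# Out of the box, requests announces itself as "python-requests/<version>",
# an immediate giveaway to even basic bot detection.
default_response = requests.get(URL, timeout=30)
print(default_response.json()["headers"]["User-Agent"])

# A browser-like User-Agent (the example string above) blends in better, but on
# its own it is only one attribute of a consistent synthetic fingerprint.
browser_headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/58.0.3029.110 Safari/537.3"
    )
}
spoofed_response = requests.get(URL, headers=browser_headers, timeout=30)
print(spoofed_response.json()["headers"]["User-Agent"])
```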
Tools such as the Nimble Browser and Undetect are at the forefront of employing AI to generate and manage synthetic fingerprints. Through AI's learning and adaptive capabilities, these synthetic fingerprints can continually evolve, adapting to the ever-changing landscapes of web design and anti-bot measures. This not only boosts the effectiveness of web scraping ventures but also underscores the ingenious synergy between AI and web scraping, enabling scrapers to evade bot detection mechanisms deployed by entities like Cloudflare and DataDome.
To see fingerprinting in action, you can check your own fingerprint using tools like http://amiunique.org/, which gives a firsthand understanding of how websites track individual users or browsers based on their unique fingerprints.
In a practical demonstration of Synthetic Fingerprint effectiveness, Nimble Browser was tested against robust anti-bot solutions such as DataDome, PerimeterX, and Cloudflare, the latter holding a significant market share in anti-bot solutions. In the test, Nimble Browser successfully navigated the high-level security of Cloudflare, PerimeterX, and DataDome, obtaining the necessary data. This demonstrates the efficacy of AI in enhancing web scraping technologies amidst evolving anti-bot measures.
Data Collection Strategies:
With these challenges in mind, let's explore the main data collection strategies available, delineating the advantages and disadvantages of each.
- Traditional HTML Scraping: This involves scripting scrapers to handle static HTML content. It relies on parsing HTML elements, using tools to navigate through HTML tags, attributes, and structures, with identifiers like CSS classes and XPath used to locate and extract data. The effectiveness of traditional HTML scraping hinges on the consistent structure of the HTML elements present on the page, making it a poor choice for dynamically loaded content (see the sketch after this list).
- Headless Browsers: For dynamic web pages where content changes via JavaScript, headless browsers like Puppeteer or Selenium are used. These browsers render JavaScript just like a standard browser, allowing access to dynamically loaded content.
- Internal APIs: Many websites have internal APIs which can be accessed for scraping. These APIs provide a more structured and reliable way of extracting data, offering a programmatic interface that is often more efficient than traditional scraping methods.
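To illustrate the first strategy, the sketch below parses a static HTML page with CSS selectors. The URL and class names are hypothetical placeholders that would need to match the target site's actual markup:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.example.com/listings"  # hypothetical static listing page

response = requests.get(URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# The CSS classes below are placeholders; inspect the target page to find the
# real identifiers before relying on them.
listings = []
for card in soup.select("div.listing-card"):
    title = card.select_one("h2.listing-title")
    price = card.select_one("span.listing-price")
    listings.append({
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    })

print(listings)
```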
Popular Collection Tools
In 2024, some of the most popular tools used for the collection phase include:
- Automated Browsers: Selenium, Puppeteer, and Playwright are tools that automate web interactions and collect data from dynamic sites, making them suitable for websites requiring user interaction like clicking buttons, filling out forms, and scrolling pages.
- Bright Data: Offers a comprehensive web data collection platform with various tools and solutions designed to cater to different data collection needs at scale.
- Scrapinghub: Known for providing a suite of AI tools and solutions to efficiently collect data at a large scale.
- Browse.ai: Known for its intuitive and automated web data collection capabilities. Browse.ai enables users of all skill levels to efficiently gather data without requiring coding expertise.
- Nimble: Utilizes AI to revolutionize web scraping, overcoming anti-bot challenges through an innovative browser and AI Fingerprinting engine, enabling effortless large-scale data collection.
Scraping Best Practices: DIY vs Managed Infrastructure
| | DIY Infrastructure | Managed Infrastructure |
| --- | --- | --- |
| Pros | 1. Greater control over the scraping process<br>2. Potentially more cost-effective<br>3. Enhances programming skills | 1. User-friendly with simplified interfaces<br>2. Advanced features (AI parsing, IP rotation)<br>3. Efficiently scalable for large-scale tasks |
| Cons | 1. Requires regular maintenance<br>2. Complex, especially for dynamic content<br>3. Resource-intensive | 1. Higher costs due to service fees<br>2. Limited customization options |
The decision between Do-It-Yourself (DIY) and managed infrastructure for web scraping hinges on the specific needs of the project, the technical expertise at hand, and resource availability.
DIY methods grant greater control and can be more cost-effective initially but may entail more effort and higher long-term costs for maintenance and scaling, especially for large-scale projects. Conversely, managed solutions offer ease of use, scalability, and access to advanced features, streamlining the scraping process at a higher financial outlay.
What is next?
With crawling and collection finished, we are now in possession of relevant, impactful data - possibly in HTML form. Even if we’ve used a solution that includes data parsing, our data still needs to be validated before we can trust its reliability.
Furthermore, the entire process we’ve gone through may work well in the development phase, but out in the real world it may encounter hard-to-predict issues. To ensure that we detect and mitigate issues as effectively as possible, debugging and data observability tools are a must.
To round out the CCCD framework, keep reading by heading over to our Cleaning Data & Debugging Scraping Ops in 2024 guide.