The engineers on vteam #261 were given the task of automating content extraction. Before this solution, the client relied on a laborious manual process: visiting numerous sites and copying and pasting content by hand.
The team looked at various open source libraries and shortlisted two main contenders:
- Abot C# Web Crawler
- Html Agility Pack
After a technical evaluation, they chose the Html Agility Pack for the following reasons:
- Supports LINQ to Objects (via a LINQ to XML-like interface).
- Page fixing and generation. You can fix a page the way you want, modify the DOM, add nodes, copy nodes, etc.
- Web scanners. You can easily get to img/src or a/href values with a handful of XPath queries.
- Web scrapers. You can easily scrape any existing web page into an RSS feed, with just an XSLT file serving as the binding.
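The "web scanner" idea above — pulling every img/src and a/href out of raw HTML — can be sketched without the Html Agility Pack itself. The team's actual VB.NET code is not shown in this write-up, so the following is an illustrative Python stand-in using the standard library's `html.parser`; the sample HTML is invented for the example.

```python
from html.parser import HTMLParser

class LinkScanner(HTMLParser):
    """Collects img/@src and a/@href values, the same information an
    Html Agility Pack XPath query like //a[@href] would return."""
    def __init__(self):
        super().__init__()
        self.srcs, self.hrefs = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "src" in attrs:
            self.srcs.append(attrs["src"])
        elif tag == "a" and "href" in attrs:
            self.hrefs.append(attrs["href"])

# Hypothetical raw HTML, standing in for a fetched page.
raw = '<html><body><a href="/page1">One</a><img src="/logo.png"></body></html>'
scanner = LinkScanner()
scanner.feed(raw)
print(scanner.hrefs)  # -> ['/page1']
print(scanner.srcs)   # -> ['/logo.png']
```

In the Html Agility Pack itself, the equivalent is a single XPath call against the parsed document node rather than a hand-written parser subclass.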
The Html Agility Pack library was integrated into a Windows Forms application developed with VB.NET, Visual Studio 2010, and .NET Framework 3.5. The application sent a web request to the desired URL and received the raw HTML in return. LINQ queries then parsed the raw HTML to extract the required information, and the content was saved to a SQL Server 2008 R2 database for further use.
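The fetch → parse → store pipeline described above can be sketched end to end. Since the original VB.NET source is not included here, this is a minimal Python outline of the same flow: the `TitleExtractor` fields, the sample HTML, and the table schema are all invented for illustration, and an in-memory SQLite database stands in for SQL Server 2008 R2.

```python
import sqlite3
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Pulls the text of every <h2> heading -- a stand-in for whatever
    fields the real application extracted with its LINQ queries."""
    def __init__(self):
        super().__init__()
        self.titles, self._in_h2 = [], False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.titles.append(data.strip())

def scrape(raw_html):
    parser = TitleExtractor()
    parser.feed(raw_html)
    return parser.titles

# In production the raw HTML came from a live web request, e.g.:
#   from urllib.request import urlopen
#   raw = urlopen("https://example.com/articles").read().decode("utf-8")
raw = "<html><body><h2>First story</h2><h2>Second story</h2></body></html>"

# sqlite3 stands in for the SQL Server 2008 R2 database used by the team.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE content (title TEXT)")
db.executemany("INSERT INTO content (title) VALUES (?)",
               [(t,) for t in scrape(raw)])
db.commit()
print(db.execute("SELECT title FROM content").fetchall())
```

The parameterized `INSERT` mirrors how the extracted records would be written to the database one row per scraped item.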
The task was implemented successfully.