Venom is an open source focused crawler for the deep web.

Install with Maven

                         <dependency>
                             <groupId>ai.preferred</groupId>
                             <artifactId>venom</artifactId>
                             <version>[4.0,5.0)</version>
                         </dependency>

Build Your Own Crawler

                        public class Example {

                            private static class VenomHandler implements Handler {

                                @Override
                                public void handle(Request request,
                                                   VResponse response,
                                                   Scheduler scheduler,
                                                   Session session,
                                                   Worker worker) {

                                    String about = response.getJsoup().select(".about p").text();
                                    System.out.println("ABOUT: " + about);

                                }

                            }

                            public static void main(String[] args) throws Exception {
                                try (Crawler c = Crawler.buildDefault().start()) {
                                    Request r = new VRequest("https://venom.preferred.ai");
                                    c.getScheduler().add(r, new VenomHandler());
                                }
                            }

                        }

Crawling with Venom

One can hardly overestimate the role that data play in our everyday lives. Tapping the right data source and having relevant information at hand is a form of wealth in itself. Here we focus on collecting open web data, such as blogs, reviews, tweets, and other public opinions, and we want to do it effectively and efficiently.

Searching for a lightweight, robust, flexible, and fast Java web-crawling framework, we explored several alternatives but decided to develop our own. We wanted a focused crawler, one that selectively looks for the pages and information relevant to our tasks. Rather than collecting and indexing every accessible web document, we specify which parts are important and how we want to represent them in our storage, skipping irrelevant parts of a web resource. This leads to significant savings in hardware and network resources.

Venom, a new kind of crawler, meets all these requirements. It is an asynchronous, event-driven framework built around a request-handler page-processing scheme, comparable to the way a human looks for the important information on a page. Every freshly fetched page is processed individually by a handler, which defines where the important content is, how it should be interpreted, and which browsing actions should follow.

Venom is easy and intuitive to use. It integrates content parsing and processing libraries, which makes development quick and pleasant. Venom suits both quick-and-dirty prototyping and heavily loaded production systems.

Blazing Fast

Venom is designed around an asynchronous, event-driven I/O model and can send and process a massive number of requests. Time-consuming jobs such as data parsing or indexing can run in separate threads, thereby maximizing network throughput. Venom uses all available resources to process each request as fast as possible.
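As a rough illustration of this division of labor (the class and method names below are ours, not Venom's internals), a worker pool can absorb slow parsing jobs so the I/O side never blocks:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// A minimal sketch of the pattern: the (fast) network side hands freshly
// fetched pages to a worker pool, so slow jobs such as parsing or indexing
// never stall the I/O threads.
public class WorkerPoolSketch {

    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    // Called from the I/O side; returns immediately with a Future.
    public Future<Integer> submitPage(String html) {
        return workers.submit(() -> parse(html));
    }

    // Stand-in for a time-consuming job such as Jsoup parsing or indexing.
    private int parse(String html) {
        return html.length();
    }

    public void shutdown() {
        workers.shutdown();
    }

    public static void main(String[] args) throws Exception {
        WorkerPoolSketch pool = new WorkerPoolSketch();
        List<Future<Integer>> results = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            // Submitting is cheap; the heavy work happens on worker threads.
            results.add(pool.submitPage("<html>page " + i + "</html>"));
        }
        for (Future<Integer> f : results) {
            System.out.println("parsed length: " + f.get());
        }
        pool.shutdown();
    }
}
```

The key property is that `submitPage` returns immediately, so the thread driving network I/O stays free to issue the next request.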

Customizable

Venom combines a high-level API with low-level fine-tuning of throughput and processing resources. The event-driven request-handler scheme enables page-specific content analysis and minimizes unnecessary processing.
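To see why a request-handler scheme keeps processing page-specific, consider a standalone sketch (our own illustrative registry, not Venom's Scheduler or Handler types) where each URL pattern gets its own handler, so a listing page is never run through detail-page logic and irrelevant pages are not processed at all:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: maps URL prefixes to per-page handlers, mirroring the
// idea that each kind of page gets exactly the analysis it needs.
public class HandlerRegistry {

    private final Map<String, Function<String, String>> handlers = new LinkedHashMap<>();

    public void register(String urlPrefix, Function<String, String> handler) {
        handlers.put(urlPrefix, handler);
    }

    // Dispatch the fetched body to the first handler whose prefix matches.
    public String dispatch(String url, String body) {
        for (Map.Entry<String, Function<String, String>> e : handlers.entrySet()) {
            if (url.startsWith(e.getKey())) {
                return e.getValue().apply(body);
            }
        }
        return "skipped"; // irrelevant page: no handler, no processing
    }

    public static void main(String[] args) {
        HandlerRegistry registry = new HandlerRegistry();
        registry.register("https://example.com/list", body -> "extracted links");
        registry.register("https://example.com/item", body -> "extracted fields");

        System.out.println(registry.dispatch("https://example.com/item/42", "<html>...</html>"));
        System.out.println(registry.dispatch("https://example.com/about", "<html>...</html>"));
    }
}
```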

Robust

Venom handles network and content-provider issues gracefully. Failed requests fit into the request-handler scheme and are rescheduled when necessary. Incoming pages can be passed through validators to ensure they contain the expected data.
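The validate-and-retry idea can be sketched in plain Java as follows (illustrative names only; Venom's actual validator and scheduler interfaces differ, and a failed request would go back onto the scheduler rather than into a loop):

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Illustrative sketch: each fetched page is checked before its handler runs,
// and an invalid response is retried up to a limit.
public class RetrySketch {

    // Fetch via `fetch`, accept only bodies passing `valid`, retry otherwise.
    public static String fetchWithRetry(Supplier<String> fetch,
                                        Predicate<String> valid,
                                        int maxTries) {
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            String body = fetch.get();
            if (valid.test(body)) {
                return body;
            }
            // A real crawler would reschedule the request here; we just loop.
        }
        return null; // give up after maxTries
    }

    public static void main(String[] args) {
        // Simulated flaky source: fails twice, then succeeds.
        int[] calls = {0};
        Supplier<String> flaky =
                () -> (++calls[0] < 3) ? "503 Service Unavailable" : "<html>ok</html>";

        String page = fetchWithRetry(flaky, b -> b.startsWith("<html>"), 5);
        System.out.println("got: " + page + " after " + calls[0] + " tries");
    }
}
```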

Simple and Handy

Venom defines sensible default crawling parameters and seamlessly integrates content parsing. It is just as easy to use for prototyping as in production: a full-fledged crawler fits in a few lines of code!