Niocchi is a Java crawler library implementing synchronous I/O multiplexing.
This type of implementation allows crawling tens of thousands of hosts in parallel on a single low-end server. Niocchi has been designed for big search engines that need to crawl massive amounts of data, but it can also be used to write no-frills crawlers. It is currently used in production by Enormo and Vitalprix.

javadoc

Index

  1. Introduction
  2. Requirements
  3. License
  4. Package organization
  5. Architecture
  6. Usage
  7. Caveats
  8. To Do
  9. Download
  10. Change history
  11. About the authors

Introduction

Most Java crawling libraries use the standard Java IO package, which is blocking.
That means crawling N documents in parallel requires at least N running
threads. Even if each thread does not consume many resources while
fetching content, that approach becomes costly when crawling at a
large scale. In contrast, synchronous I/O multiplexing with the NIO
package introduced in Java 1.4 allows many documents to be crawled in
parallel using a single thread.
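
To make the idea concrete, here is a minimal sketch of I/O multiplexing with a single thread and one java.nio Selector. It is not Niocchi's API: the class and method names (MultiplexSketch, readAll, demo) are hypothetical, and in-memory Pipe channels stand in for network sockets so the example runs without any network access. A real crawler would presumably register SocketChannel instances instead and also handle connect and write events.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; none of these names come from Niocchi.
class MultiplexSketch {

    /** Reads every source to end-of-stream using one thread and one Selector. */
    static List<String> readAll(List<Pipe.SourceChannel> sources) throws IOException {
        List<StringBuilder> bodies = new ArrayList<>();
        try (Selector selector = Selector.open()) {
            for (int i = 0; i < sources.size(); i++) {
                Pipe.SourceChannel ch = sources.get(i);
                ch.configureBlocking(false);                    // required before register()
                bodies.add(new StringBuilder());
                ch.register(selector, SelectionKey.OP_READ, i); // attachment = document index
            }
            ByteBuffer buf = ByteBuffer.allocate(4096);
            int open = sources.size();
            while (open > 0) {
                selector.select();                              // wait until some channel is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    Pipe.SourceChannel ch = (Pipe.SourceChannel) key.channel();
                    buf.clear();
                    int n = ch.read(buf);
                    if (n < 0) {                                // end of stream: this document is done
                        key.cancel();
                        ch.close();
                        open--;
                    } else {
                        buf.flip();
                        int index = (Integer) key.attachment();
                        bodies.get(index).append(StandardCharsets.US_ASCII.decode(buf));
                    }
                }
            }
        }
        List<String> result = new ArrayList<>();
        for (StringBuilder body : bodies) {
            result.add(body.toString());
        }
        return result;
    }

    /** Builds three in-memory "documents" and reads them all from the calling thread. */
    static List<String> demo() {
        try {
            List<Pipe.SourceChannel> sources = new ArrayList<>();
            for (int i = 0; i < 3; i++) {
                Pipe pipe = Pipe.open();
                try (Pipe.SinkChannel sink = pipe.sink()) {     // closing the sink signals end-of-stream
                    sink.write(ByteBuffer.wrap(("doc-" + i).getBytes(StandardCharsets.US_ASCII)));
                }
                sources.add(pipe.source());
            }
            return readAll(sources);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the sketch is that no thread is ever dedicated to a single channel: the loop only touches channels the selector reports as ready, so adding more channels does not add more threads.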

Requirements

Niocchi requires Java 1.5 or above.

License

This software is licensed under the Apache License, Version 2.0.

Package organization

Architecture