## Prerequisites

* [Node.js + npm](https://nodejs.org/) (current version)
* Tested with Node.js v12.16.1

## Includes the following libs

* TypeScript
* mongoose
* node-fetch
* tldjs
* lodash
* puppeteer
* user-agents
* workerpool
* shelljs

## Project Structure

* `data/inputs/tranco_domains`: input domains
* `data/inputs/language`: used for tld language mappings
* `data/outputs`: where large outputs/payloads are stored
* `src`: TypeScript source files
* `src/config`: config files for the crawler
* `src/controllers`: methods that wrap the CRUD operations of the db models
* `src/models`: db models
* `src/types`: TypeScript data types
* `src/crawler.ts`: crawler entry point
* `src/worker`: crawler worker (one browser launch per URL)
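
The "one browser launch per URL" pattern noted for `src/worker` can be sketched roughly as follows. This is an illustration only, not the code in `src/worker`: the names `BrowserLike` and `crawlOne` are hypothetical, and the browser is passed in through a `launch` callback so the sketch does not depend on any particular library.

```typescript
// Hypothetical minimal browser handle: anything with an async close() method.
interface BrowserLike {
  close(): Promise<void>;
}

// Launch a fresh browser for a single URL, run the visit callback,
// and always close the browser afterwards.
async function crawlOne<B extends BrowserLike, T>(
  url: string,
  launch: () => Promise<B>,
  visit: (browser: B, url: string) => Promise<T>
): Promise<T> {
  const browser = await launch();
  try {
    return await visit(browser, url);
  } finally {
    // Closing in `finally` avoids leaking browser processes when a visit throws.
    await browser.close();
  }
}
```

With puppeteer, `launch` could be `() => puppeteer.launch()` and `visit` could open a page with `browser.newPage()` and call `page.goto(url)`. Launching per URL costs more startup time than reusing one browser, but it keeps each visit isolated from the state left behind by earlier ones.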

## Setup

```
npm install
```

## Build

```
npm run compile
```

## Launch the crawler

```
npm run init_db
npm run start_crawler
```

## Stop (Soft) The Crawler

```
npm run pause_data_flow
npm run start_data_flow
```

## Tip

Chromium profiles are created as part of crawling. Check `/tmp` (on Ubuntu servers) and, every now and then, clean up the temporary profile folders by running the following from that directory:

$ sudo find . -name 'puppeteer*' | xargs rm -Rf
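
If you would rather do this cleanup from Node, a minimal sketch could look like the one below. It is not part of this repository: the function name `cleanPuppeteerProfiles` and the one-hour age threshold are assumptions added here as a rough guard against deleting profiles of a crawl that is still running. Unlike the `find` command, it only looks at the top level of the directory, and `fs.rmSync` requires Node 14.14 or newer.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Remove entries in `dir` whose names start with "puppeteer" and whose
// modification time is older than `maxAgeMs`. Returns the removed names.
// The age threshold is an assumption of this sketch, not a project setting.
function cleanPuppeteerProfiles(
  dir: string = os.tmpdir(),
  maxAgeMs: number = 60 * 60 * 1000
): string[] {
  const removed: string[] = [];
  const now = Date.now();
  for (const name of fs.readdirSync(dir)) {
    if (!name.startsWith("puppeteer")) continue; // same prefix as the find pattern
    const full = path.join(dir, name);
    const stat = fs.lstatSync(full);
    if (now - stat.mtimeMs < maxAgeMs) continue; // recently touched, may be in use
    fs.rmSync(full, { recursive: true, force: true }); // Node 14.14+
    removed.push(name);
  }
  return removed;
}
```

Calling `cleanPuppeteerProfiles()` with no arguments targets the OS temp directory (`/tmp` on most Linux servers). The process needs permission to delete the profile folders, which is why the shell command uses `sudo`.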
