This weekend I will try to build a sophisticated web scraping interface. The idea: make a request to the API, which tells the BOT to crawl the requested url next, meaning the API puts the user's requested url at the front of the BOT's queue. So, when the BOT finishes its current crawl, it will crawl the requested url.
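A minimal sketch of the "crawl this next" idea, assuming the BOT's queue is a simple in-memory double-ended queue. The names `CrawlQueue`, `request_next` and `pop_next` are made up for illustration, and the example.com URLs are placeholders.

```python
from collections import deque


class CrawlQueue:
    """URLs waiting for the BOT; user requests jump to the front."""

    def __init__(self, scheduled=()):
        self._urls = deque(scheduled)

    def request_next(self, url):
        # Drop any existing copy so the url is not crawled twice,
        # then put it at the front so it is the next one popped.
        if url in self._urls:
            self._urls.remove(url)
        self._urls.appendleft(url)

    def pop_next(self):
        # Called by the BOT when it finishes its current crawl.
        return self._urls.popleft() if self._urls else None


queue = CrawlQueue(["https://example.com/a", "https://example.com/b"])
queue.request_next("https://example.com/requested")
next_url = queue.pop_next()  # the user's url comes out first
```

One thing to decide: with this approach a second user's request jumps ahead of the first user's. If that is unfair, user requests could go into their own first-in-first-out queue that is drained before the regular one.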

The BOT will have to be different.

Currently it uses predefined selectors to get the content, with only a little automated logic: usually you only have to specify the selector for each blog post or repeating article, and it will figure out the title/description/date/price, etc. If it fails, then you can add specific selectors for whichever of the title/description/date/price it failed to parse.
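Roughly what such a per-site config could look like, as a sketch. The dictionary layout, the `resolve_selectors` helper and the selector strings are all assumptions for illustration, not the BOT's real format.

```python
FIELDS = ("title", "description", "date", "price")


def resolve_selectors(site_config):
    """Return a selector per field; None means "let the BOT guess"."""
    overrides = site_config.get("fields", {})
    return {field: overrides.get(field) for field in FIELDS}


# Only the repeating-item selector is required. The "date" override is
# the kind of thing you would add after the automatic guess failed.
site_config = {
    "item": "article.post",
    "fields": {"date": "time.published"},
}

selectors = resolve_selectors(site_config)
```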

But now it's time to upgrade it to parse repeating content automatically. This will be difficult. Maybe I'll even be able to figure out some machine learning?!
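Before any machine learning, a crude heuristic might get part of the way: find the parent element that has the most children with the same tag and classes, and treat those children as the repeating items. This is only a sketch using the standard library's `html.parser`; the names `RepeatFinder` and `guess_item_selector` are made up, and a long nav menu or a comment list could easily fool it.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class RepeatFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = [("", 0)]   # (signature, element id); id 0 is the document root
        self.next_id = 1
        self.counts = Counter()  # (parent id, depth, child signature) -> occurrences

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        signature = ".".join([tag] + sorted(classes))
        parent_id = self.stack[-1][1]
        self.counts[(parent_id, len(self.stack), signature)] += 1
        if tag not in VOID_TAGS:  # void tags never get an end tag
            self.stack.append((signature, self.next_id))
            self.next_id += 1

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0].split(".")[0] == tag:
                del self.stack[i:]
                break


def guess_item_selector(html):
    finder = RepeatFinder()
    finder.feed(html)
    if not finder.counts:
        return None
    # Most repeated sibling signature wins; on ties prefer the shallowest.
    (_, _, signature), _ = max(
        finder.counts.items(), key=lambda kv: (kv[1], -kv[0][1]))
    return signature


page = """
<html><body>
<nav><a href="/">Home</a><a href="/about">About</a></nav>
<div class="posts">
  <article class="post"><h2>One</h2><p>first</p><p>more</p></article>
  <article class="post"><h2>Two</h2><p>second</p></article>
  <article class="post"><h2>Three</h2><p>third</p></article>
</div>
</body></html>
"""
item_selector = guess_item_selector(page)  # "article.post" for this page
```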


(the APP) makes a request to (the API), which tells (the BOT) to crawl the requested url. If the url has already been crawled and its content is not too old, then (the API) responds immediately without waiting for new data. Otherwise, (the APP) keeps polling (the API) every second or so until it receives the result...?
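A sketch of both halves of that flow, under a few assumptions: the API keeps crawled pages in a dict with a `fetched_at` timestamp, "too old" means older than an hour (`MAX_AGE_SECONDS` is an arbitrary pick), and a finished response carries `"status": "done"`. The function names are hypothetical.

```python
import time

MAX_AGE_SECONDS = 3600  # assumption: content older than an hour is "too old"


def fresh_content(cache, url, now=None):
    """API side: return cached content if it is recent enough, else None."""
    now = time.time() if now is None else now
    entry = cache.get(url)
    if entry is not None and now - entry["fetched_at"] <= MAX_AGE_SECONDS:
        return entry["content"]
    return None


def poll_until_done(fetch_status, interval=1.0, max_attempts=30, sleep=time.sleep):
    """APP side: call fetch_status() every `interval` seconds until done."""
    for _ in range(max_attempts):
        response = fetch_status()
        if response["status"] == "done":
            return response
        sleep(interval)
    return None  # gave up; the caller decides what to show the user
```

Here `fetch_status` stands in for whatever HTTP call (the APP) makes to (the API). Capping the number of attempts keeps the APP from polling forever if the BOT gets stuck.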

Aha! Each user query (poll) to (the API) should get back either the url's results or, if those aren't ready, the status of the request. The API should find out whether the web crawler is currently crawling the requested site; if so, it will respond with status "crawling". Or, if the crawler hasn't gotten to it yet, the API will respond with status "waiting... to finish previous job ...will start yours soon" or something.
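Something like this, as a sketch. It assumes the API can see three things: the finished results, the url the BOT is crawling right now, and the list of urls still queued. The function name, the response shape, and the extra "done" and "unknown" statuses are my own guesses.

```python
def request_status(url, results, current_url, queued_urls):
    """Build the API's answer to one poll for `url`."""
    if url in results:
        return {"status": "done", "result": results[url]}
    if url == current_url:
        return {"status": "crawling"}
    if url in queued_urls:
        # Position 1 means "next up once the current job finishes".
        return {"status": "waiting", "position": queued_urls.index(url) + 1}
    return {"status": "unknown"}


results = {"https://example.com/old": "<html>...</html>"}
current_url = "https://example.com/busy"
queued_urls = ["https://example.com/requested", "https://example.com/later"]

status = request_status("https://example.com/requested",
                        results, current_url, queued_urls)
```

Returning the queue position would let (the APP) show something more useful than a bare "waiting".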

Maybe not... (FLIP PAGE >>)
