How-Tos¶
Setting a scraper’s scheduler¶
You can schedule a scraper in as granular detail as you want; however, we only support granularity down to the minute. We currently support CRON syntax for scheduling. You set the ‘schedule’ field for the schedule itself, and the ‘timezone’ field to specify the timezone in which the job will be started. If no timezone is specified, it defaults to “UTC”. Timezone values use the IANA format; please see the list of available timezones. The scheduler also has the option to cancel the current job whenever it starts a new one. By default, a scraper’s scheduler does not start a new job if there is already an active job.
$ answersengine scraper create <scraper_name> <git_repo> --schedule "0 1 * * * *" --timezone "America/Toronto" --cancel-current-job
$ answersengine scraper update <scraper_name> --schedule "0 1 * * * *" --timezone "America/Toronto"
The following are allowed CRON values:
Field Name | Mandatory? | Allowed Values | Allowed Special Characters |
---|---|---|---|
Minutes | Yes | 0-59 | * / , - |
Hours | Yes | 0-23 | * / , - |
Day of month | Yes | 1-31 | * / , - L W |
Month | Yes | 1-12 or JAN-DEC | * / , - |
Day of week | Yes | 0-6 or SUN-SAT | * / , - L # |
Year | No | 1970-2099 | * / , - |
The W character specifies the nearest weekday to the given day of the month. So, if you specify 15W and the 15th is a Saturday, the trigger fires on Friday the 14th. If the 15th is a Sunday, the trigger fires on Monday the 16th. If the 15th is a Tuesday, then it fires on Tuesday the 15th. However, if you specify 1W as the value for day-of-month, and the 1st is a Saturday, the trigger fires on Monday the 3rd, as it does not ‘jump’ over the boundary of a month’s days.
The W character can be specified only when the day-of-month is a single day, not a range or list of days.
The W character can also be combined with L, i.e. LW to mean “the last business day of the month.”
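As an illustration, the following hypothetical schedule (using the same six-field format as the examples above, with the optional year as the last field) fires at 00:00 on the nearest weekday to the 15th of every month:
$ answersengine scraper update <scraper_name> --schedule "0 0 15W * * *"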
Changing a Scraper’s or a Job’s Proxy Type¶
We support several types of proxies:
Proxy Type | Description |
---|---|
standard | The standard rotating proxy that gets randomly used per request. This is the default. |
To change a scraper’s proxy type, use the command line:
$ answersengine scraper update <scraper_name> --proxy-type sticky1
Keep in mind that the above will only take effect when a new scrape job is created.
To change the proxy of an existing job, first cancel the job, then change the proxy_type, and then resume the job:
$ answersengine scraper job cancel <scraper_name>
$ answersengine scraper job update <scraper_name> --proxy-type sticky1
$ answersengine scraper job resume <scraper_name>
Setting a specific ruby version¶
By default, we use Ruby 2.4.4. However, if you want to specify a different Ruby version, you can do so by creating a .ruby-version file at the root of your project directory (see the example after the list below).
NOTE: we currently only allow the following ruby versions:
- 2.4.4
- 2.5.3
- If you need a specific version other than these, please let us know
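For example, assuming you want Ruby 2.5.3, the .ruby-version file contains just the version string; commit it to your Git repository along with the rest of your code:
$ echo "2.5.3" > .ruby-version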
Setting a specific Ruby Gem¶
To add dependencies to your code, we use Bundler. Simply create a Gemfile at the root of your project directory:
$ echo "gem 'roo', '~> 2.7.1'" > Gemfile
$ bundle install # this will create a Gemfile.lock
$ ls -alth | grep Gemfile
total 32
-rw-r--r-- 1 johndoe staff 22B 19 Dec 23:43 Gemfile
-rw-r--r-- 1 johndoe staff 286B 19 Dec 22:07 Gemfile.lock
$ git add . # and then you should commit the whole thing into Git repo
$ git commit -m 'added Gemfile'
$ git push origin
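Once deployed, gems from your Gemfile can be required from your seeder, parser, or finisher scripts as usual. For example, here is a minimal sketch (the file path and output shape are illustrative) that uses the roo gem added above to turn spreadsheet rows into outputs:
require 'roo'

# Hypothetical example: parse a locally downloaded spreadsheet using the roo gem
sheet = Roo::Spreadsheet.open('/tmp/products.xlsx')
sheet.each_row_streaming do |row|
  outputs << { 'row' => row.map(&:value) } # one output record per spreadsheet row
end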
Changing a Scraper’s Standard worker count¶
The more workers you use on your scraper, the faster your scraper will be. You can use the command line to change a scraper’s worker count:
$ answersengine scraper update <scraper_name> --workers N
Keep in mind that this will only take effect when a new scrape job is created.
Changing a Scraper’s Browser worker count¶
The more browser workers you use on your scraper, the faster your scraper will be. You can use the command line to change a scraper’s browser worker count:
$ answersengine scraper update <scraper_name> --browsers N
NOTE: Keep in mind that this will only take effect when a new scrape job is created.
Changing an existing scrape job’s worker count¶
You can use the command line to change a scraper job’s worker count:
$ answersengine scraper job update <scraper_name> --workers N --browsers N
This will only take effect after you cancel and then resume the scrape job:
$ answersengine scraper job cancel <scraper_name> # cancel first
$ answersengine scraper job resume <scraper_name> # then resume
Enqueueing a page to Browser Fetcher’s queue¶
You can enqueue a page to the Browser Fetcher from your script. The following will enqueue a page to be fetched by a headless browser:
pages << {
url: "http://test.com",
fetch_type: "browser" # This will enqueue headless browser
}
Or use the command line:
$ answersengine scraper page add <scraper_name> <url> --fetch-type browser
The following will enqueue a page to be fetched by a full (non-headless) browser:
pages << {
url: "http://test.com",
fetch_type: "fullbrowser" # This will enqueue headless browser
}
Or use the command line:
$ answersengine scraper page add <scraper_name> <url> --fetch-type fullbrowser
Setting fetch priority to a Job Page¶
The following will enqueue a higher priority page. NOTE: From a script you can only create a page with a priority; you cannot update an existing page’s priority. Updating the priority of an existing page only works via the command line tool.
pages << {
url: "http://test.com",
priority: 1 # defaults to 0. Higher numbers get fetched sooner
}
Or use the command line:
$ answersengine scraper page add <scraper_name> <url> --priority N
$ answersengine scraper page update <job> <gid> --priority N
Setting a user-agent-type of a Job Page¶
You can enqueue a page like so in your script:
pages << {
url: "http://test.com",
ua_type: "desktop" # defaults to desktop, other available is mobile.
}
Or use the command line:
$ answersengine scraper page add <scraper_name> <url> --ua-type mobile
Setting the request method of a Job Page¶
You can enqueue a page like so in your script:
pages << {
url: "http://test.com",
method: "POST" # defaults to GET.
}
Or use the command line:
$ answersengine scraper page add <scraper_name> <url> --method POST
Setting the request headers of a Job Page¶
You can enqueue a page like so in your script:
pages << {
url: "http://test.com",
headers: {"Cookie": "name=value; name2=value2; name3=value3"} # set this
}
Or use the command line:
$ answersengine scraper page add <scraper_name> <url> --headers '{"Cookie": "name=value; name2=value2; name3=value3"}'
Setting the request body of a Job Page¶
You can enqueue a page like so in your script:
pages << {
url: "http://test.com",
body: "your request body here" # set this field
}
Or use the command line:
$ answersengine scraper page add <scraper_name> <url> --body 'your request body here'
Setting the page_type of a Job Page¶
You can enqueue a page like so in your script:
pages << {
url: "http://test.com",
page_type: "page_type_here" # set this field
}
Or use the command line:
$ answersengine scraper page add <scraper_name> <url> --page-type page_type_here
Reset a Job Page¶
You can reset a scrape-job page’s parsing and fetching from the command line:
$ answersengine scraper page reset <scraper_name> <gid>
You can also reset a page from any parser or seeder script by setting the reset field to true while enqueueing it, like so:
pages << {
url: "http://test.com",
reset: true # set this field
}
Handling cookies¶
There are two ways to handle cookies in Fetch: at a lower level via the request and response headers, or at a higher level via the cookie jar.
Low level cookie handling using Request/Response Headers¶
To handle cookies at a lower level, you can set the “Cookie” request header:
pages << {
url: "http://test.com",
headers: {"Cookie": "name=value; name2=value2; name3=value3"},
}
You can also read cookies from the “Set-Cookie” response header:
page['response_headers']['Set-Cookie']
High level cookie handling using the Cookie Jar¶
To handle cookies at a higher level, you can set the “cookie” field directly on the page, and it will be saved into the cookie jar during that request.
pages << {
url: "http://test.com",
cookie: "name=value; name2=value2; name3=value3",
}
You can also do so from the command line:
$ answersengine scraper page add <scraper_name> <url> --cookie "name=value; name2=value2"
You can then read the cookies from the cookie jar with:
page['response_cookie']
The method above reads from the cookie jar. This is especially useful when a cookie is set by the target server during a redirect.
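For example, a minimal sketch of carrying the accumulated cookies forward to the next page you enqueue (the URL is illustrative):
pages << {
  url: "http://test.com/next-page",
  cookie: page['response_cookie'] # reuse cookies from the jar, including any set during a redirect
}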
Force Fetching a specific unfresh page¶
To enqueue a page and force fetch it, set the freshness field and the force_fetch field. Freshness must be now or in the past; it cannot be in the future. Essentially it means “how long ago you still consider this page fresh.” Keep in mind that this only resets the page fetch; it does not affect whether your parser has executed. In your parser script you can do the following:
pages << {
url: "http://test.com",
freshness: "2018-12-12T13:59:29.91741Z", # has to be this string format
force_fetch: true
}
Or use the command line:
$ answersengine scraper page add <scraper_name> <url> --page-type page_type_here --force-fetch --freshness "2018-12-12T13:59:29.91741Z"
Handling JavaScript¶
To render JavaScript, use the Browser Fetcher. First, you need to add a browser worker to your scraper:
$ answersengine scraper update <scraper_name> --browsers 1
Next, for every page that you add, you need to specify the correct fetch_type:
$ answersengine scraper page add <scraper_name> <url> --fetch-type browser
Or in the script, by doing the following:
pages << {
url: "http://test.com",
fetch_type: "browser"
}
Doing dry-run of your script locally¶
Using the try command allows you to dry-run a parser or seeder script locally. It works by downloading the necessary data from the Fetch cloud and then executing your script locally; it does not upload any data back to the Fetch cloud.
$ answersengine parser try ebay parsers/details.rb
$ answersengine seeder try ebay seeder/seeder.rb
Executing your script locally, and uploading to Fetch¶
Using the exec command allows you to execute a parser or seeder script locally and upload the results to the Fetch cloud. It works by downloading the necessary data from the Fetch cloud and executing your script locally. When done, it uploads the resulting outputs and pages back to the Fetch cloud.
$ answersengine parser exec <scraper_name> <parser_file> <gid>
$ answersengine seeder exec <scraper_name> <seeder_file>
The exec command is really useful for end-to-end testing of your script, to ensure not only that the execution works, but also that it properly uploads the resulting data to the Fetch cloud. Any errors generated during the exec command are logged to the Fetch cloud’s log, so they are accessible in the following ways:
$ answersengine scraper log <scraper_name>
$ answersengine scraper page log <scraper_name> <gid>
Once you’ve successfully executed the command locally using exec, you can check your stats, collection lists, and outputs using the following commands:
$ answersengine scraper stats <scraper_name>
$ answersengine scraper output collection <scraper_name>
$ answersengine scraper output list <scraper_name> --collection <collection_name>
Querying scraper outputs¶
We currently support querying a scraper’s outputs based on an arbitrary JSON key. Only exact matches are currently supported. In your parser script, you can do the following to find many output results:
# find_outputs(collection='default', query={}, page=1, per_page=30)
# will return an array of output records
records = find_outputs('foo_collection', {"_id":"123"}, 1, 500)
Or you can do this to find one output result:
# find_output(collection='default', query={})
# will return one output record
record = find_output('foo_collection', {"_id":"123"})
Or use the command line, to query an output:
$ answersengine scraper output list <scraper_name> --collection home --query '{"_id":"123"}'
You can also query outputs from another scraper or job. To find outputs from another job, do the following:
records = find_outputs('foo_collection', {"_id":"123"}, 1, 500, job_id: 1234)
To find output from another scraper, do the following:
records = find_outputs('foo_collection', {"_id":"123"}, 1, 500, scraper_name: 'my_scraper')
Restart a scraping job¶
To restart a job, you need to cancel an existing job first, then start a new one:
$ answersengine scraper job cancel <scraper_name>
$ answersengine scraper start <scraper_name>
Setting Environment Variables and Secrets on your account.¶
You can set environment variables and secrets on your account, which you can then use in any of your scrapers.
Environment variables and secrets are similar in that both are equally accessible from any of your seeder, parser, and finisher scripts. The difference is that secrets are encrypted.
Secrets are useful for storing things such as passwords, or connection strings if you need to connect to a database.
Another benefit of using environment variables and secrets is that you don’t have to store any values in your Git repository. This makes your code more secure and more reusable.
This example scraper shows usage of environment variables.
There are three steps that you need to do in order to use environment variables and secrets:
1. Set the environment variable or secrets on your account.¶
To set an environment variable using command line:
$ answersengine var set <var_name> <value>
To set a secret environment variable using command line:
$ answersengine var set <var_name> <value> --secret
2. Change your config.yaml to use the variables or secrets.¶
Add the following to your config.yaml file.
env_vars:
  - name: foo
    global_name: bar # Optional. If specified, refers to your account's environment variable of this name.
    disabled: false # Optional
  - name: baz
    default: bazvalue
In the example above, Fetch will look up your account’s environment variable named bar and make it available to your script as ENV['foo']. It will also look up the baz variable on your account and make it available to your script as ENV['baz'].
IMPORTANT: The name of the env var must be the same as the env var you specified on your account in step 1. If you intend to use a different variable name in the scraper than on the account, use global_name.
3. Access the environment variables and secrets in your script.¶
Once you’ve done steps 1 and 2 above, you can access the environment variables or secrets from any of your seeder, parser, or finisher scripts like so:
ENV['your_env_var_here']
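For instance, here is a minimal sketch of using the variables from the example config above inside a seeder or parser script (the Authorization header is purely illustrative):
# 'foo' resolves to your account's 'bar' variable, per the config above
api_token = ENV['foo']

pages << {
  url: "http://test.com",
  headers: { "Authorization" => "Bearer #{api_token}" } # hypothetical use of the value
}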
Setting Input Variables and Secrets on your scraper and scrape job.¶
You can set input variables and secrets on your scraper, similar to how you use environment variables.
Input variables and secrets are similar in that both are equally accessible from any of your seeder, parser, and finisher scripts. The difference is that secrets are encrypted.
Secrets are useful for storing things such as passwords, or connection strings if you need to connect to a database.
Another benefit of using input variables and secrets is that you don’t have to store any values in your Git repository. This makes your code more secure and more reusable.
Once you’ve specified input variables on your scraper, any new scrape job will start with the input variables taken from your scraper.
This example scraper shows usage of input variables.
There are three steps that you need to do in order to use input variables and secrets:
1. Set the input variable or secrets on your scraper.¶
To set an input variable on a scraper using command line:
$ answersengine scraper var set <var_name> <value>
To set a secret input variable on a scraper using command line:
$ answersengine scraper var set <var_name> <value> --secret
To set an input variable on a scrape job using command line:
$ answersengine scraper job var set <var_name> <value>
IMPORTANT: For this to take effect, you must pause and resume the job.
To set a secret input variable on a scraper job using command line:
$ answersengine scraper job var set <var_name> <value> --secret
IMPORTANT: For this to take effect, you must pause and resume the job.
2. Change your config.yaml to use the variables or secrets.¶
Add the following to your config.yaml file.
input_vars:
  - name: starting_url
    title: Starting URL # Optional
    description: Enter the starting URL for the scraper to run # Optional
    default: https://www.ebay.com/sch/i.html?_nkw=macbooks # Optional
    type: text # Available values include: string, text, secret, date, datetime. This displays the appropriate input on the form.
    required: true # Optional. Makes the input field in the form required.
    disabled: false # Optional
  - name: baz
In the example above, Fetch will look up your scrape job’s input variable starting_url and make it available to your script as ENV['starting_url']. It will also look up the baz variable on your scrape job and make it available to your script as ENV['baz'].
3. Access the input variables and secrets in your script.¶
Once you’ve done steps 1 and 2 above, you can access the input variables or secrets from any of your seeder, parser, or finisher scripts like so:
ENV['your_input_var_here']
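For instance, here is a minimal sketch of a seeder that enqueues the first page from the starting_url input variable defined above (the page_type value is illustrative):
pages << {
  url: ENV['starting_url'], # the value of the starting_url input variable
  page_type: "listings" # hypothetical page_type handled by one of your parsers
}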
Using a custom docker image for the scraper¶
We support custom Docker images for the scraper to run on. This means you can install any dependencies that you’d like. Please let AnswersEngine support know so that the image can be created for you.
IMPORTANT: Only docker images that are compatible with Fetch can be run. Please contact us for more info.
Our base Docker image is based on Alpine 3.7:
FROM alpine:3.7
So, if you want a package to be installed, make sure that it builds correctly on your local machine first.
Once it builds correctly, please let us know which Dockerfile commands to add to the custom image. The following format is preferred:
RUN apk add --update libreoffice
Once we have built the image for you, you can use it by modifying your config.yaml file to include the following line:
scraper_image: <url-to-your-docker-image>
After you have modified and deployed this change, you need to restart your job.
How to debug page fetch¶
Debugging page fetches can be both easy and hard, depending on how much work it takes to find the cause of the problem. Below are some common and uncommon page fetching issues that happen on websites, along with their fixes:
no_url_encode: true¶
This option forces a page to keep its URL as is. By default, Fetch decodes and re-encodes the URL to fix any errors in it, which is useful for standardizing URLs for the cache; this option disables that behavior.
Example:
pages << {
'url' => 'https://example.com/?my_sensitive_value'
}
# => url is re-encoded as "https://example.com/?my_sensitive_value="
pages << {
'url' => 'https://example.com/?my_sensitive_value',
'no_url_encode' => true
}
# => url is left as is "https://example.com/?my_sensitive_value"
http2: true¶
This changes the standard fetch from HTTP/1 to HTTP/2, which not only makes fetching faster on websites that support it, but also helps bypass some anti-scraping technology that blocks HTTP/1 requests.
Example:
pages << {
'url' => 'https://example.com'
}
# => page fetching will use HTTP/1
pages << {
'url' => 'https://example.com',
'http2' => true
}
# => page fetching will use HTTP/2
response headers and request headers are different¶
There have been a few cases where a developer includes a response header within headers: {}, causing the fetch to fail on websites that validate the headers they receive. Check which request headers your browser sends in its dev tools to see if an extra header is being used by mistake.
Example: let’s say a page is enqueued this way:
pages << {'url' => 'https://www.example.com'}
It is fetched and comes back with response_headers like:
response_headers: {
'content-type' => 'json'
}
Then, on the next page, you enqueue the following, adding one or more response headers by mistake:
pages << {
  'url' => 'https://www.example.com/abc',
  'headers' => {
    'content-type' => 'json'
  }
}
In this example, a content-type header on a request is fine as long as the method is POST, but this one is GET, so in this case it is invalid, and a website that validates its headers will fail the fetch.
bzip compression headers¶
Most browsers include headers indicating that the response can be compressed with bzip or another compression format. Most of the time this does not affect anything, but on a few sites, including these headers will cause the content fetch to fail.
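For illustration, a hypothetical example of such a header copied over from the browser’s dev tools; if a site’s content fails to fetch, try enqueueing the page again without it:
pages << {
  'url' => 'https://www.example.com',
  'headers' => {
    'Accept-Encoding' => 'gzip, deflate, br' # compression header copied from the browser; on a few sites this makes the fetched content fail
  }
}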
Advanced Usage¶
Parsing Failed Responses¶
Fetch comes with a lot of safety harnesses to make scraping easy and delightful for developers. This means we only allow successfully fetched pages to be parsed. However, if you need to go down into the details and deal with your own failed pages, or other types of responses, we allow you to do so. In your config.yaml, add the following:
parse_failed_pages: true
After doing the above, don’t forget to deploy your scraper, and restart your job.
This removes your safety harness: from now on, you have to deal with your own page resets and page response statuses. Typically, your parser should deal with two kinds of responses, successful and failed. Look at the following example parser file to see how to deal with the different responses in the same parser:
if page['response_status_code'] # if response is successful
body = Nokogiri.HTML(content)
elsif page['failed_response_status_code'] # if response is not successful
body = Nokogiri.HTML(failed_content)
end
doc = {
text: body.text,
url: page['url']
}
outputs << doc