2.12 Recursive Accept/Reject Options
====================================
‘-A ACCLIST --accept ACCLIST’
‘-R REJLIST --reject REJLIST’
Specify comma-separated lists of file name suffixes or patterns to
accept or reject (⇒Types of Files). Note that if any of the
wildcard characters, ‘*’, ‘?’, ‘[’ or ‘]’, appear in an element of
ACCLIST or REJLIST, it will be treated as a pattern, rather than a
suffix. In this case, you have to enclose the pattern in quotes
to prevent your shell from expanding it, as in ‘-A "*.mp3"’ or
‘-A '*.mp3'’.
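As a sketch (the host and path below are hypothetical examples), a recursive retrieval restricted to MP3 files might look like:

```shell
# Recursively download, keeping only files whose names match *.mp3.
# The pattern is quoted so the shell does not expand it before
# Wget sees it; the URL is a hypothetical example.
wget -r -A '*.mp3' https://example.com/music/
```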
‘--accept-regex URLREGEX’
‘--reject-regex URLREGEX’
Specify a regular expression to accept or reject the complete URL.
‘--regex-type REGEXTYPE’
Specify the regular expression type. Possible types are ‘posix’ or
‘pcre’. Note that to use the ‘pcre’ type, Wget must be compiled
with libpcre support.
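For example (the site is a hypothetical one), to skip every URL whose path contains a calendar component while accepting everything else:

```shell
# Reject any URL containing /calendar/ in its path. POSIX is the
# default --regex-type, so no extra flag is needed here; the URL
# is a hypothetical example.
wget -r --reject-regex '/calendar/' https://example.com/
```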
‘-D DOMAIN-LIST’
‘--domains=DOMAIN-LIST’
Set domains to be followed. DOMAIN-LIST is a comma-separated list
of domains. Note that it does _not_ turn on ‘-H’.
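Because ‘-D’ does not imply ‘-H’, it is typically combined with ‘-H’ to span hosts while still keeping the recursion within a known set of domains. A sketch, with hypothetical hostnames:

```shell
# Span hosts, but only follow links whose domain is example.com
# or cdn.example.net; both hostnames are hypothetical examples.
wget -r -H -D example.com,cdn.example.net https://example.com/
```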
‘--exclude-domains DOMAIN-LIST’
Specify the domains that are _not_ to be followed (⇒Spanning
Hosts).
‘--follow-ftp’
Follow FTP links from HTML documents. Without this option, Wget
ignores all FTP links.
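A minimal sketch (hypothetical URL), for a recursive retrieval that also descends into FTP links found in the HTML pages:

```shell
# Follow ftp:// links encountered during the recursive retrieval;
# the URL is a hypothetical example.
wget -r --follow-ftp https://example.com/downloads/
```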
‘--follow-tags=LIST’
Wget has an internal table of HTML tag / attribute pairs that it
considers when looking for linked documents during a recursive
retrieval. To consider only a subset of those tags, specify them
in a comma-separated LIST with this option.
‘--ignore-tags=LIST’
This is the opposite of the ‘--follow-tags’ option. To skip
certain HTML tags when recursively looking for documents to
download, specify them in a comma-separated LIST.
In the past, this option was the best bet for downloading a single
page and its requisites, using a command-line like:
wget --ignore-tags=a,area -H -k -K -r http://SITE/DOCUMENT
However, the author of this option came across a page with tags
like ‘<LINK REL="home" HREF="/">’ and came to the realization that
specifying tags to ignore was not enough. One can’t just tell Wget
to ignore ‘<LINK>’, because then stylesheets will not be
downloaded. Now the best bet for downloading a single page and its
requisites is the dedicated ‘--page-requisites’ option.
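A sketch of that recommended approach (the URL is a hypothetical example):

```shell
# Download one page together with the images, stylesheets, and
# other resources needed to display it, converting links so the
# copy works locally; the URL is a hypothetical example.
wget --page-requisites --convert-links https://example.com/article.html
```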
‘--ignore-case’
Ignore case when matching files and directories. This influences
the behavior of the ‘-R’, ‘-A’, ‘-I’, and ‘-X’ options, as well as
the globbing used when downloading from FTP sites. For example, with
this option, ‘-A "*.txt"’ will match ‘file1.txt’, but also
‘file2.TXT’, ‘file3.TxT’, and so on. The quotes in the example are
to prevent the shell from expanding the pattern.
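Putting the example above into a full command line (the URL is a hypothetical example):

```shell
# Accept .txt files regardless of case: file1.txt, FILE2.TXT,
# file3.TxT all match; the URL is a hypothetical example.
wget -r --ignore-case -A '*.txt' https://example.com/notes/
```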
‘-H’
‘--span-hosts’
Enable spanning across hosts when doing recursive retrieving (⇒
Spanning Hosts).
‘-L’
‘--relative’
Follow relative links only. Useful for retrieving a specific home
page without any distractions, not even those from the same hosts
(⇒Relative Links).
‘-I LIST’
‘--include-directories=LIST’
Specify a comma-separated list of directories you wish to follow
when downloading (⇒Directory-Based Limits). Elements of
LIST may contain wildcards.
‘-X LIST’
‘--exclude-directories=LIST’
Specify a comma-separated list of directories you wish to exclude
from download (⇒Directory-Based Limits). Elements of LIST
may contain wildcards.
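A sketch combining the two options (the paths and URL are hypothetical examples):

```shell
# Restrict the recursion to /docs and its subdirectories, but
# skip anything under /docs/old; paths and URL are hypothetical.
wget -r -I /docs -X /docs/old https://example.com/
```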
‘-np’
‘--no-parent’
Do not ever ascend to the parent directory when retrieving
recursively. This is a useful option, since it guarantees that
only the files _below_ a certain hierarchy will be downloaded.
⇒Directory-Based Limits, for more details.
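For instance, to mirror only one subtree of a site (the URL is a hypothetical example):

```shell
# Retrieve manual/ and everything below it; --no-parent keeps the
# recursion from climbing up to /docs/ or the site root.
# The URL is a hypothetical example.
wget -r -np https://example.com/docs/manual/
```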