Quickstart
Install
Goose3 is a Python 3 fork of the python-goose library, so it must be run with Python 3. All python commands below assume the correct Python version.
Using pip
The easiest way to install goose3 is to use pip:
$ pip install goose3
From source
To install from source, clone the repository from GitHub, then run the following command from the cloned folder:
$ python setup.py install
Setup
Setting up Goose3 using the standard configuration is fairly straightforward:
from goose3 import Goose
g = Goose()
article = g.extract(url='http://this-url.html')
print(article.cleaned_text)
g.close()
For extracting lots of HTML files or URLs, one can also use it as a context manager:
from goose3 import Goose
urls = [...]
with Goose() as g:
    for tmp in urls:
        article = g.extract(url=tmp)
        print(article.cleaned_text)
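As a minimal sketch of the batch case, the helper below walks a mixed list of local HTML files and URLs and records None for any source that fails, instead of ending the loop. The names extract_many and run are hypothetical. The sketch also assumes that extract accepts a raw_html keyword for markup that is already loaded, which is worth confirming against the installed goose3 version. goose3 is imported inside run only so that the helper can be tried with a stand-in object.

```python
from pathlib import Path


def extract_many(extractor, sources):
    """Map each source (file path or URL) to its cleaned text, or None on failure."""
    results = {}
    for source in sources:
        try:
            if Path(source).is_file():
                # Local file: read the markup and hand it over directly.
                # Assumes UTF-8 files and a raw_html keyword on extract().
                html = Path(source).read_text(encoding="utf-8")
                article = extractor.extract(raw_html=html)
            else:
                # Anything that is not an existing file is treated as a URL.
                article = extractor.extract(url=source)
            results[source] = article.cleaned_text
        except Exception as exc:  # broad on purpose: one bad source should not end the batch
            print(f"skipping {source}: {exc}")
            results[source] = None
    return results


def run(sources):
    """Hypothetical entry point that reuses one Goose instance for the whole batch."""
    from goose3 import Goose

    with Goose() as g:
        return extract_many(g, sources)
```

Calling run(urls) with the list from the example above would return a dictionary keyed by source, which is easier to post-process than printed output.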
Setting Config Options
One can also alter how goose3 performs the extraction and what items are extracted by passing a configuration to Goose. There are several ways to set the configuration options.
For more details on the available configuration settings, see Configuration.
Use Configuration object
from goose3 import Goose
from goose3.configuration import Configuration
config = Configuration()
config.strict = False # turn off strict exception handling
config.browser_user_agent = 'Mozilla 5.0' # set the browser agent string
config.http_timeout = 5.05 # set http timeout in seconds
with Goose(config) as g:
    ...
Use Dictionary
One can pass in a dictionary with keys that match the configuration properties one would like to change:
from goose3 import Goose
config = {}
config['strict'] = False # turn off strict exception handling
config['browser_user_agent'] = 'Mozilla 5.0' # set the browser agent string
config['http_timeout'] = 5.05 # set http timeout in seconds
with Goose(config) as g:
    pass
Or if there are only a few changes:
from goose3 import Goose
with Goose({'http_timeout': 5.0}) as g:
    pass
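When several scripts share the same overrides, the dictionary form makes them easy to compose. The sketch below merges per-call overrides onto a set of project defaults. PROJECT_DEFAULTS and build_config are hypothetical names, not part of goose3, and the keys are the ones used in the examples above.

```python
# Hypothetical project-wide defaults; the keys mirror Configuration properties.
PROJECT_DEFAULTS = {
    "strict": False,
    "browser_user_agent": "Mozilla 5.0",
    "http_timeout": 5.05,
}


def build_config(**overrides):
    """Return a new dict: project defaults with per-call overrides applied on top."""
    return {**PROJECT_DEFAULTS, **overrides}


config = build_config(http_timeout=10.0)
print(config)
```

The resulting dictionary can be passed to the constructor in the same way as the dictionaries above, for example Goose(build_config(http_timeout=10.0)).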
After Object Creation
One can also change configuration options after the Goose object has been created:
from goose3 import Goose
g = Goose()
g.config.browser_user_agent = 'Mozilla 5.0'
Configuration Helper Classes
For some of the more complex configuration options, helper classes are available to ensure that the correct values are provided. Using them is optional, but it does make things a bit simpler.
from goose3 import Goose
from goose3.configuration import Configuration, ArticleContextPattern, PublishDatePattern, AuthorPattern
config = Configuration()
# we know of a particular article location in the site we are pulling from
config.known_context_patterns = ArticleContextPattern(attr="id", value="my-site-article")
# publish date
config.known_publish_date_tags = PublishDatePattern(attr="id", value="pubdate", content="content")
# author
config.known_author_patterns = AuthorPattern(attr="id", value="writer", content="content")
Reading Results
Extraction results are returned as an Article object, and each result is read from the corresponding property. The most commonly requested property is cleaned_text, which holds the text of the extracted article with the HTML markup removed.
For more details and for all available properties, see Article.
from goose3 import Goose
urls = [...]
with Goose() as g:
    for tmp in urls:
        article = g.extract(url=tmp)
        print(article.cleaned_text)
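Several properties are often read together. The following is a minimal sketch of a hypothetical summarize helper that collects a few of them into a dictionary. Only cleaned_text is taken from the text above. The names title, authors and publish_date are assumptions to check against the Article reference, so they are read with getattr and fall back to None when absent.

```python
def summarize(article, preview_chars=80):
    """Collect a few Article properties into a plain dict for logging or storage."""
    text = article.cleaned_text or ""
    return {
        # Property names other than cleaned_text are assumptions; missing ones become None.
        "title": getattr(article, "title", None),
        "authors": getattr(article, "authors", None),
        "publish_date": getattr(article, "publish_date", None),
        # A short preview is usually enough to confirm that extraction worked.
        "preview": text[:preview_chars],
        "length": len(text),
    }
```

Inside the loop above, print(summarize(article)) would replace the bare print of cleaned_text.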
Using with PyInstaller
It should be possible to use goose3 with tools such as PyInstaller to add goose to your executable program. To do so, you will need to add the required resources to the executable. You will need to add the files to a folder in your executable called goose3/resources/ to match the location that goose3 checks for the required files.
pyinstaller \
    --add-data="goose3/resources/images/*;goose3/resources/images/" \
    --add-data="goose3/resources/text/*;goose3/resources/text/" \
    my_prog.py
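PyInstaller has traditionally split the --add-data value on the platform path separator, so the ; shown above is the Windows form and : is the form for Linux and macOS. As a hedged sketch, the same arguments can be assembled in Python with os.pathsep so that one build script works on either platform. The names RESOURCE_DIRS, add_data_args and pyinstaller_command are hypothetical, and the resource paths are the two from the command above.

```python
import os

# The two resource folders named in the command above.
RESOURCE_DIRS = ["goose3/resources/images", "goose3/resources/text"]


def add_data_args(resource_dirs=RESOURCE_DIRS, sep=os.pathsep):
    """Build one --add-data argument per resource folder, using the given separator."""
    return [f"--add-data={d}/*{sep}{d}/" for d in resource_dirs]


def pyinstaller_command(script, sep=os.pathsep):
    """Return the full argument list; nothing is executed here."""
    return ["pyinstaller", *add_data_args(sep=sep), script]


print(pyinstaller_command("my_prog.py"))
```

The returned list can be handed to subprocess.run without a shell, which also avoids the quoting differences between shells.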