Goose3 API¶

Goose3¶

class goose3.Goose(config=None)[source]¶

Extract the most likely article content and additional metadata from a URL or a previously fetched HTML document.

Parameters:	config (Configuration, dict) – A Configuration object or a dictionary representation of the configuration
Returns:	An instance of the goose extraction object
Return type:	Goose

close()[source]¶: Close the network connection and perform any other required cleanup

Note

Closed automatically when Goose is used as a context manager or when it is garbage collected

extract(url=None, raw_html=None)[source]¶

Extract the most likely article content from the html page

Parameters:	url (str) – URL to pull and parse
	raw_html (str) – String representation of the HTML page
Returns:	Representation of the article contents including other parsed and extracted metadata
Return type:	Article

shutdown_network()[source]¶: Close the network connection

Note

Closed automatically when Goose is used as a context manager or when it is garbage collected
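The following is a minimal sketch of how the pieces above might fit together: Goose used as a context manager, with extract() called on an already fetched HTML string. The helper name extract_title_and_text is hypothetical, and image fetching is switched off through the dictionary form of the configuration only so that the sketch does not try to download images.

```python
def extract_title_and_text(html):
    """Return (title, cleaned_text) for an HTML string using goose3."""
    # Imported lazily so the sketch can be loaded even where goose3
    # is not installed; calling the function does require goose3.
    from goose3 import Goose

    # The dictionary is a partial configuration; keys not listed here
    # are assumed to keep their documented defaults.
    config = {"enable_image_fetching": False}

    # Leaving the with-block closes the network connection, so no
    # explicit close() or shutdown_network() call is made.
    with Goose(config) as g:
        article = g.extract(raw_html=html)
        return article.title, article.cleaned_text
```

Passing url="..." instead of raw_html lets Goose fetch the page itself; in that case final_url on the returned Article is expected to hold the URL that was pulled.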

Configuration¶

Configuration options to change how and what goose3 extracts and parses.

class goose3.Configuration[source]¶

available_parsers¶: list(str) – A list of all possible parser values for the parser_class

Note

Not settable

browser_user_agent¶

Browser user agent string to use when making URL requests

Note

Defaults to Goose/{goose3.__version__}

Examples

Setting a browser user agent string other than the default is advised when pulling pages frequently

>>> config.browser_user_agent = (
...     'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) '
...     'AppleWebKit/534.52.7 (KHTML, like Gecko) '
...     'Version/5.1.2 Safari/534.52.7'
... )

debug¶: bool – Turn on or off debugging

Note

Defaults to False

Warning

Debugging is currently not implemented

enable_image_fetching¶: bool – Turn on or off image extraction

Note

Defaults to True

get_parser()[source]¶

Retrieve the current parser class to use for extraction

Returns:	The parser to use
Return type:	Parser

http_auth¶: tuple – Authentication class and information to pass to the requests library

See also

Requests Authentication

http_headers¶: dict – Custom headers to pass directly to the supporting requests object

See also

Requests Custom Headers

http_proxies¶: dict – Proxy information to pass directly to the supporting requests object

See also

Requests Proxy Support

http_timeout¶: float – The timeout, in seconds, passed to requests when waiting for a response

Note

Defaults to 30.0
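As an illustration of the HTTP-related options, the sketch below builds a dictionary representation of the configuration, which the Goose constructor is documented to accept. Every value is a placeholder: the user agent, proxy host, and credentials are invented for the example. The http_auth tuple follows the (username, password) form that requests treats as HTTP Basic authentication.

```python
# Dictionary form of the configuration; every value is a placeholder.
config = {
    "browser_user_agent": "ExampleBot/1.0",          # hypothetical agent string
    "http_headers": {"Accept-Language": "en-US"},    # extra request headers
    "http_proxies": {                                # requests-style proxy mapping
        "http": "http://proxy.example.com:3128",
        "https": "http://proxy.example.com:3128",
    },
    "http_auth": ("user", "password"),               # HTTP Basic credentials
    "http_timeout": 10.0,                            # seconds, instead of 30.0
}
```

A dictionary like this would then be handed to the constructor, for example Goose(config).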

imagemagick_convert_path¶: str – Path to the convert program that is part of imagemagick

Note

Defaults to “/opt/local/bin/convert”

Warning

Currently not used / implemented

imagemagick_identify_path¶: str – Path to the identify program that is part of imagemagick

Note

Defaults to “/opt/local/bin/identify”

Warning

Currently not used / implemented

images_min_bytes¶: int – Minimum size, in bytes, for an image to be considered as the main image of the site

Note

Defaults to 4500 bytes

known_context_patterns¶: list – The context patterns to search to find the likely article content

Note

Each entry must be a dictionary with either the keys attr and value, or only the key tag
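To make the expected shape of an entry concrete, here is a small standalone check written from a strict reading of the note above. It is not part of goose3; the function name is_valid_context_pattern and the sample class name are made up for illustration, and goose3 itself may be more permissive about extra keys.

```python
def is_valid_context_pattern(entry):
    """Return True if entry is {'attr': ..., 'value': ...} or {'tag': ...}."""
    if not isinstance(entry, dict):
        return False
    keys = set(entry)
    return keys == {"attr", "value"} or keys == {"tag"}


patterns = [
    {"attr": "class", "value": "article-body"},  # hypothetical CSS class
    {"tag": "article"},                          # match on the tag name only
    {"attr": "id"},                              # invalid: attr without value
]

# One boolean per entry, in the same order as the list above.
results = [is_valid_context_pattern(p) for p in patterns]
```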

local_storage_path¶: str – The local path to store temporary files

Note

Defaults to the value of os.path.join(tempfile.gettempdir(), ‘goose’)

parser_class¶: str – The key of the parser to use

Note

Defaults to lxml

stopwords_class¶: StopWords – The StopWords class to use when analyzing article content

Note

Defaults to the English stop words

Note

Current stop words available in goose3.text include:

StopWords, StopWordsChinese, StopWordsArabic, and StopWordsKorean

strict¶: bool – Enable strict mode and throw exceptions instead of swallowing them.

Note

Defaults to True

target_language¶: str – The default target language if the language is not extractable or if use_meta_language is set to False

Note

Default language is ‘en’

use_meta_language¶: bool – Determine if language should be extracted from the meta tags or not. If this is set to False then the target_language will be used. Also, if extraction fails then the target_language will be utilized.

Note

Defaults to True
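The interaction between use_meta_language and target_language described above can be summarized in a few lines. This is only a sketch of the documented behavior, not goose3's actual implementation; resolve_language is a hypothetical helper.

```python
def resolve_language(meta_lang, use_meta_language=True, target_language="en"):
    """Pick the language to use, following the documented fallback rules."""
    # Use the language from the meta tags only when that is enabled
    # and a value was actually extracted.
    if use_meta_language and meta_lang:
        return meta_lang
    # Otherwise fall back to the configured target language.
    return target_language


from_meta = resolve_language("fr")                         # meta tag wins
forced = resolve_language("fr", use_meta_language=False)   # meta tag ignored
fallback = resolve_language(None)                          # extraction failed
```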

Article¶

A goose3 extraction returns an Article object that contains the results of the parsing process.

class goose3.Article[source]¶

additional_data¶: dict – A property bucket for consumers of goose3 to store custom data extractions

Note

Read only

authors¶: list – A listing of authors as parsed from the meta tags

Note

Read only

canonical_link¶: str – The canonical link of the article if found in the meta data

Note

Read only

cleaned_text¶: str – Cleaned text of the article without HTML tags; most commonly desired property

Note

Read only

doc¶: etree – lxml document that is being processed

Note

Read only

domain¶: str – Domain of the article parsed

Note

Read only

final_url¶: str – The URL that was pulled and parsed; None if raw_html was used

Note

Read only

infos¶: dict – A summary of all data available about the extracted article

Note

Not settable

link_hash¶: str – The MD5 of the final url to be used for various identification tasks

Note

Read only
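For readers who want to reproduce a comparable identifier outside goose3, the sketch below computes the hexadecimal MD5 digest of a URL with the standard library. It assumes UTF-8 encoding of the URL; whether goose3 uses exactly this encoding, or adds anything to the digest, is not stated here. The URL is a placeholder.

```python
import hashlib


def md5_of_url(url):
    """Return the hexadecimal MD5 digest of a URL string (UTF-8 encoded)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# Placeholder URL; the digest is always 32 hexadecimal characters long.
digest = md5_of_url("https://example.com/news/story.html")
```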

links¶: list – A listing of URL links within the article

Note

Read only

meta_description¶: str – Contents of the meta-description field from the HTML source

Note

Read only

meta_favicon¶: str – Contents of the meta-favicon field from the HTML source

Note

Read only

meta_keywords¶: str – Contents of the meta-keywords field from the HTML source

Note

Read only

meta_lang¶: str – Contents of the meta-lang field from the HTML source

Note

Read only

movies¶: list(Video) – A listing of all videos within the article such as YouTube or Vimeo

Note

Read only

Todo

Document the goose3 Video class members

opengraph¶: dict – All opengraph tag data

Note

Read only

publish_date¶: str – The date the article was published based on meta tag extraction

Note

Read only

raw_doc¶: etree – Original, uncleaned, and untouched lxml document to be processed

Note

Read only

raw_html¶: str – The HTML represented as a string

Note

Read only

tags¶: list – List of article tags (non-metadata tags)

Note

Read only

title¶: str – Title extracted from the HTML source

Note

Read only

top_image¶: Image – The top image object that likely represents the article

Note

Read only

Todo

Document the goose3 Image class members

top_node¶: etree – The top Element that is a candidate for the main body of the article

Note

Read only

tweets¶: list – A listing of embedded tweets in the article

Note

Read only