Goose3 API¶
Goose3¶
-
class
goose3.Goose(config=None)[source]¶ Extract the most likely article content and additional metadata from a URL or a previously fetched HTML document
Parameters: config (Configuration, dict) – A Configuration object or a dictionary representation of one
Returns: An instance of the goose extraction object
Return type: Goose
-
close()[source]¶ Close the network connection and perform any other required cleanup
Note
Closed automatically when Goose is used as a context manager or when the object is garbage collected
-
extract(url=None, raw_html=None)[source]¶ Extract the most likely article content from the HTML page
Parameters: - url (str) – URL to pull and parse
- raw_html (str) – String representation of the HTML page
Returns: Representation of the article contents including other parsed and extracted metadata
Return type: Article
-
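The Goose class described above can be exercised with a short sketch like the following. It assumes goose3 is installed; `SAMPLE_HTML` and the `extract_article` helper are hypothetical names introduced here for illustration and are not part of the library.

```python
# Minimal HTML document used as input; any previously fetched page source would work.
SAMPLE_HTML = """
<html>
  <head><title>Example article</title></head>
  <body><article><p>Goose3 extracts the main text of an article.</p></article></body>
</html>
"""


def extract_article(raw_html, config=None):
    """Return the Article that goose3 produces for an HTML string."""
    # Imported inside the function so this sketch can be loaded
    # even where goose3 is not installed.
    from goose3 import Goose

    # Used as a context manager, Goose is closed automatically on exit.
    with Goose(config) as goose:
        return goose.extract(raw_html=raw_html)
```

Calling `extract_article(SAMPLE_HTML)` should return an Article whose `title` and `cleaned_text` can then be inspected. Passing `url=` to `extract` instead of `raw_html=` would fetch the page over the network first.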
Configuration¶
Configuration options to change how and what goose3 extracts and parses.
-
class
goose3.Configuration[source]¶ -
available_parsers¶ list(str) – A list of all possible parser values for the parser_class
Note
Not settable
-
browser_user_agent¶ Browser user agent string to use when making URL requests
Note
Defaults to Goose/{goose3.__version__}
Examples
Setting a non-default browser user agent string is advised when pulling pages frequently
>>> config.browser_user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2)'
>>> config.browser_user_agent = 'AppleWebKit/534.52.7 (KHTML, like Gecko)'
>>> config.browser_user_agent = 'Version/5.1.2 Safari/534.52.7'
-
debug¶ bool – Turn on or off debugging
Note
Defaults to False
Warning
Debugging is currently not implemented
-
enable_image_fetching¶ bool – Turn on or off image extraction
Note
Defaults to True
-
get_parser()[source]¶ Retrieve the current parser class to use for extraction
Returns: The parser to use
Return type: Parser
-
http_auth¶ tuple – Authentication class and information to pass to the requests library
-
http_headers¶ dict – Custom headers to pass directly to the supporting requests object
-
http_proxies¶ dict – Proxy information to pass directly to the supporting requests object
-
http_timeout¶ float – The time, in seconds, that requests waits for a response
Note
Defaults to 30.0
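As an illustration of the HTTP options above, the sketch below builds a dictionary form of the configuration. It assumes the values follow the usual requests conventions (a proxies mapping keyed by URL scheme, and a `(username, password)` tuple for basic authentication). Every host name and credential is a placeholder, and `http_config` is a hypothetical variable name.

```python
# Dictionary form of the HTTP-related settings; all hosts and
# credentials below are placeholders, not real endpoints.
http_config = {
    # Wait at most 10 seconds for a response instead of the default 30.0.
    "http_timeout": 10.0,
    # Extra headers sent with each request.
    "http_headers": {"Accept-Language": "en-US"},
    # Proxy URLs keyed by scheme, following the requests convention.
    "http_proxies": {
        "http": "http://proxy.example.com:8080",
        "https": "http://proxy.example.com:8080",
    },
    # Basic-auth credentials as a (username, password) tuple.
    "http_auth": ("user", "password"),
}
```

Because Goose accepts a dictionary for its `config` parameter, a mapping like this could be passed as `Goose(http_config)`.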
-
imagemagick_convert_path¶ str – Path to the convert program that is part of imagemagick
Note
Defaults to “/opt/local/bin/convert”
Warning
Currently not used / implemented
-
imagemagick_identify_path¶ str – Path to the identify program that is part of imagemagick
Note
Defaults to “/opt/local/bin/identify”
Warning
Currently not used / implemented
-
images_min_bytes¶ int – Minimum size in bytes for an image to be considered as the main image of the site
Note
Defaults to 4500 bytes
-
known_context_patterns¶ list – The context patterns to search to find the likely article content
Note
Each entry must be a dictionary containing either the keys attr and value, or just the key tag
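The shape of a context pattern entry can be sketched as below. `is_valid_context_pattern` is a hypothetical helper written for this example, not a goose3 function. It only mirrors the rule stated in the note, and the class name `article-body` is an invented value.

```python
def is_valid_context_pattern(entry):
    """Return True if entry is a dict with a 'tag' key, or with both 'attr' and 'value'."""
    if not isinstance(entry, dict):
        return False
    keys = set(entry)
    return "tag" in keys or {"attr", "value"} <= keys


patterns = [
    {"attr": "class", "value": "article-body"},  # match on an attribute value
    {"tag": "article"},                          # match on a tag name only
]

# Check every entry before assigning the list to known_context_patterns.
all_valid = all(is_valid_context_pattern(p) for p in patterns)
```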
-
local_storage_path¶ str – The local path to store temporary files
Note
Defaults to the value of os.path.join(tempfile.gettempdir(), 'goose')
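The default location can be computed with the standard library alone, which is handy for checking where temporary files would land on a given machine. The variable name `default_storage_path` is an assumption made for this example.

```python
import os
import tempfile

# Same expression as the documented default: a "goose" directory
# inside the platform's temporary directory.
default_storage_path = os.path.join(tempfile.gettempdir(), "goose")
print(default_storage_path)
```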
-
parser_class¶ str – The key of the parser to use
Note
Defaults to lxml
-
stopwords_class¶ StopWords – The StopWords class to use when analyzing article content
Note
Defaults to the English stop words
Note
Current stop words available in goose3.text include:
StopWords, StopWordsChinese, StopWordsArabic, and StopWordsKorean
-
strict¶ bool – Enable strict mode and raise exceptions instead of swallowing them.
Note
Defaults to True
-
target_language¶ str – The default target language if the language is not extractable or if use_meta_language is set to False
Note
Default language is ‘en’
-
use_meta_language¶ bool – Determine if language should be extracted from the meta tags or not. If this is set to False then the target_language will be used. Also, if extraction fails then the target_language will be utilized.
Note
Defaults to True
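The interaction between use_meta_language and target_language can be summarised in a small sketch. `resolve_language` is a hypothetical helper that restates the documented behaviour. It is not how goose3 implements it internally.

```python
def resolve_language(meta_lang, use_meta_language=True, target_language="en"):
    """Pick the language to use, following the documented fallback rules."""
    # Use the language from the meta tags only when that is enabled
    # and a value was actually extracted.
    if use_meta_language and meta_lang:
        return meta_lang
    # Otherwise fall back to the configured target language.
    return target_language
```

For example, `resolve_language("es")` yields the extracted language, while `resolve_language(None)` and `resolve_language("es", use_meta_language=False)` both fall back to the target language.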
-
Article¶
A goose3 extraction returns an Article object that contains the results of the parsing process.
-
class
goose3.Article[source]¶ -
additional_data¶ dict – A property bucket for consumers of goose3 to store custom data extractions
Note
Read only
-
authors¶ list – A listing of authors as parsed from the meta tags
Note
Read only
-
canonical_link¶ str – The canonical link of the article if found in the meta data
Note
Read only
-
cleaned_text¶ str – Cleaned text of the article without HTML tags; most commonly desired property
Note
Read only
-
doc¶ etree – lxml document that is being processed
Note
Read only
-
domain¶ str – Domain of the article parsed
Note
Read only
-
final_url¶ str – The URL that was fetched and parsed; None if raw_html was used
Note
Read only
-
infos¶ dict – The summation of all data available about the extracted article
Note
Not settable
-
link_hash¶ str – The MD5 hash of the final URL, to be used for various identification tasks
Note
Read only
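For reference, an MD5 hex digest of a URL can be computed with the standard library as sketched below. This only illustrates the hashing step, and the exact string goose3 stores in link_hash may differ. `url_md5` and the example URL are hypothetical.

```python
import hashlib


def url_md5(url):
    """Return the MD5 hex digest of a URL encoded as UTF-8."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# The same URL always produces the same 32-character digest.
digest = url_md5("https://example.com/article")
```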
-
links¶ list – A listing of URL links within the article
Note
Read only
-
meta_description¶ str – Contents of the meta-description field from the HTML source
Note
Read only
-
meta_favicon¶ str – Contents of the meta-favicon field from the HTML source
Note
Read only
-
meta_keywords¶ str – Contents of the meta-keywords field from the HTML source
Note
Read only
-
meta_lang¶ str – Contents of the meta-lang field from the HTML source
Note
Read only
-
movies¶ list(Video) – A listing of all videos within the article such as YouTube or Vimeo
Note
Read only
Todo
Document the goose3 Video class members
-
opengraph¶ dict – All opengraph tag data
Note
Read only
-
publish_date¶ str – The date the article was published based on meta tag extraction
Note
Read only
-
raw_doc¶ etree – Original, uncleaned, and untouched lxml document to be processed
Note
Read only
-
raw_html¶ str – The HTML represented as a string
Note
Read only
-
tags¶ list – List of article tags (non-metadata tags)
Note
Read only
-
title¶ str – Title extracted from the HTML source
Note
Read only
-
top_image¶ Image – The top image object that likely represents the article
Note
Read only
Todo
Document the goose3 Image class members
-
top_node¶ etree – The top Element that is a candidate for the main body of the article
Note
Read only
-
tweets¶ list – A listing of embedded tweets in the article
Note
Read only
-
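The read-only Article properties listed above can be gathered into a compact summary, as in the sketch below. `summarize` is a hypothetical helper, and a `SimpleNamespace` stands in for a real Article so the example runs without goose3. Only attribute names documented in this section are used.

```python
from types import SimpleNamespace


def summarize(article, preview_chars=80):
    """Collect a few Article properties into a plain dictionary."""
    text = article.cleaned_text or ""
    return {
        "title": article.title,
        "domain": article.domain,
        "url": article.final_url,  # None when raw_html was used
        "preview": text[:preview_chars],
        "link_count": len(article.links or []),
    }


# Stand-in object exposing the same attribute names as an Article.
fake_article = SimpleNamespace(
    title="Example",
    domain="example.com",
    final_url=None,
    cleaned_text="Body text of the article.",
    links=["https://example.com/a"],
)
summary = summarize(fake_article)
```

With a real extraction, the object returned by `Goose.extract` would be passed to `summarize` in place of `fake_article`.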