Goose3 API¶
Goose3¶
-
class
goose3.
Goose
(config=None)[source]¶ Extract most likely article content and additional metadata from a URL or previously fetched HTML document
Parameters: config (Configuration, dict) – A configuration file or dictionary representation of the configuration file
Returns: An instance of the goose extraction object
Return type: Goose
-
close
()[source]¶ Close the network connection and perform any other required cleanup
Note
Auto closed when using goose as a context manager or when garbage collected
-
extract
(url=None, raw_html=None)[source]¶ Extract the most likely article content from the HTML page
Parameters: - url (str) – URL to pull and parse
- raw_html (str) – String representation of the HTML page
Returns: Representation of the article contents including other parsed and extracted metadata
Return type: Article
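As a minimal sketch of the calls documented above, the helper below runs an extraction on an HTML string that has already been fetched, using Goose as a context manager so that close() is called automatically. The function name extract_cleaned_text is hypothetical; goose3 is imported inside the function so that the module can be loaded even where the package is not installed.

```python
def extract_cleaned_text(raw_html: str) -> str:
    """Return the cleaned article text for an already-fetched HTML page.

    Hypothetical helper: only Goose, extract(raw_html=...) and
    cleaned_text come from the goose3 API documented here.
    """
    # Imported lazily so this module loads even without goose3 installed.
    from goose3 import Goose

    # Leaving the with-block closes the network connection and cleans up.
    with Goose() as g:
        article = g.extract(raw_html=raw_html)
        return article.cleaned_text
```

To have goose3 fetch the page itself, pass url=... to extract in place of raw_html.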
-
Configuration¶
Configuration options to change how and what goose3 extracts and parses.
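Because the Goose constructor accepts either a Configuration instance or a dictionary, one way to keep settings together is a plain dictionary keyed by the property names listed in this section. The following is a sketch under that assumption; the helper name build_settings and the chosen values are illustrative only.

```python
def build_settings(user_agent: str, timeout: float = 30.0) -> dict:
    """Collect goose3 settings in a dict keyed by Configuration property names."""
    return {
        "browser_user_agent": user_agent,  # sent with every URL request
        "http_timeout": timeout,           # seconds to wait for a response
        "enable_image_fetching": False,    # documented default, stated explicitly
        "strict": True,                    # raise exceptions instead of swallowing them
    }


settings = build_settings("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2)", timeout=10.0)
```

The resulting dictionary can then be passed to the constructor as Goose(settings).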
-
class
goose3.
Configuration
[source]¶ -
available_parsers
¶ A list of all possible parser values for the parser_class
Note
Not settable
Type: list(str)
-
browser_user_agent
¶ Browser user agent string to use when making URL requests
Note
Defaults to Goose/{goose3.__version__}
Examples
Using a non-default browser user agent string is advised when pulling pages frequently
>>> config.browser_user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2)'
>>> config.browser_user_agent = 'AppleWebKit/534.52.7 (KHTML, like Gecko)'
>>> config.browser_user_agent = 'Version/5.1.2 Safari/534.52.7'
-
debug
¶ Turn on or off debugging
Note
Defaults to False
Warning
Debugging is currently not implemented
Type: bool
-
enable_image_fetching
¶ Turn on or off image extraction
Note
Defaults to False
Type: bool
-
get_parser
()[source]¶ Retrieve the current parser class to use for extraction
Returns: The parser to use
Return type: Parser
-
http_auth
¶ Authentication class and information to pass to the requests library
See also
Type: tuple
-
http_headers
¶ Custom headers to pass directly to the supporting requests object
See also
Type: dict
-
http_proxies
¶ Proxy information to pass directly to the supporting requests object
See also
Type: dict
-
http_timeout
¶ The timeout, in seconds, to pass to requests when waiting for the response
Note
Defaults to 30.0
Type: float
-
imagemagick_convert_path
¶ Path to the convert program that is part of imagemagick
Note
Defaults to “/opt/local/bin/convert”
Warning
Currently not used / implemented
Type: str
-
imagemagick_identify_path
¶ Path to the identify program that is part of imagemagick
Note
Defaults to “/opt/local/bin/identify”
Warning
Currently not used / implemented
Type: str
-
images_min_bytes
¶ Minimum size in bytes for an image to be considered as the main image of the page
Note
Defaults to 4500 bytes
Type: int
-
keep_footnotes
¶ Specify if footnotes should be kept or not in the cleaned_text output
Note
Defaults to True
Type: bool
-
known_author_patterns
¶ The tags to search to find the likely article authors
Note
Each entry must be a dictionary with the following keys: attribute, value, and content.
Type: list
-
known_context_patterns
¶ The context patterns to search to find the likely article content
Note
Each entry must be a dictionary with the following keys: attr and value or just tag
Type: list
-
known_publish_date_tags
¶ The tags to search to find the likely published date
Note
Each entry must be a dictionary with the following keys: attribute, value, and content.
Type: list
-
local_storage_path
¶ The local path to store temporary files
Note
Defaults to the value of os.path.join(tempfile.gettempdir(), 'goose')
Type: str
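For illustration, the documented default can be evaluated with the standard library alone to see where temporary files would be placed on the current machine. The variable name default_storage_path is hypothetical.

```python
import os
import tempfile

# Same expression as the documented default for local_storage_path.
default_storage_path = os.path.join(tempfile.gettempdir(), "goose")
print(default_storage_path)
```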
-
parse_headers
¶ Specify if headers should be pulled or not in the cleaned_text output
Note
Defaults to True
Type: bool
-
parser_class
¶ The key of the parser to use
Note
Defaults to lxml
Type: str
-
pretty_lists
¶ Specify if lists should be pretty printed in the cleaned_text output
Note
Defaults to True
Type: bool
-
stopwords_class
¶ The StopWords class to use when analyzing article content
Note
Defaults to the English stop words
Note
Current stop words available in goose3.text include:
StopWords, StopWordsChinese, StopWordsArabic, and StopWordsKorean
Type: StopWords
-
strict
¶ Enable strict mode and throw exceptions instead of swallowing them.
Note
Defaults to True
Type: bool
-
target_language
¶ The default target language if the language is not extractable or if use_meta_language is set to False
Note
Default language is ‘en’
Type: str
-
use_meta_language
¶ Determine if language should be extracted from the meta tags or not. If this is set to False then the target_language will be used. Also, if extraction fails then the target_language will be utilized.
Note
Defaults to True
Type: bool
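The interaction between use_meta_language and target_language described above can be summarised as a small decision function. This is an illustrative sketch of the documented behaviour, not goose3's internal code; resolve_language and its parameter names are hypothetical.

```python
from typing import Optional


def resolve_language(meta_lang: Optional[str],
                     use_meta_language: bool = True,
                     target_language: str = "en") -> str:
    """Pick the language to use for an article."""
    # Use the meta tag language only when enabled and actually extracted.
    if use_meta_language and meta_lang:
        return meta_lang
    # Otherwise fall back to the configured target language.
    return target_language
```

With the defaults, a page whose meta language could not be extracted resolves to 'en'.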
-
Configuration Helper Classes¶
-
class
goose3.configuration.
ArticleContextPattern
(*, attr=None, value=None, tag=None, domain=None)[source]¶ Help ensure correctly generated article context patterns
Parameters: - attr (str) – The attribute type: class, id, etc
- value (str) – The value of the attribute
- tag (str) – The type of tag, such as article that contains the main article body
- domain (str) – The domain to which this pattern pertains (optional)
Note
Must provide, at a minimum, (attr and value) or (tag)
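The minimum requirement in the note above, (attr and value) or (tag), can be expressed as a simple check. The function below is a hypothetical stand-alone sketch of that rule for dictionary entries; it is not part of goose3, and the library's own validation may differ.

```python
def has_minimum_pattern_keys(pattern: dict) -> bool:
    """Return True if a pattern dict provides (attr and value) or (tag)."""
    has_attr_pair = bool(pattern.get("attr")) and bool(pattern.get("value"))
    has_tag = bool(pattern.get("tag"))
    return has_attr_pair or has_tag
```

An entry such as {"attr": "class"} fails the check because the attribute has no accompanying value.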
-
class
goose3.configuration.
AuthorPattern
(*, attr=None, value=None, content=None, tag=None, subpattern=None)[source]¶ Ensures that the author patterns are correctly formed for use with the known_author_patterns of configuration
Parameters: - attr (str) – The attribute type: class, id, etc
- value (str) – The value of the attribute
- content (str) – The name of another attribute (of the element) that contains the value
- tag (str) – The type of tag, such as author that contains the author information
- subpattern (str) – A subpattern for elements within the main attribute
-
class
goose3.configuration.
PublishDatePattern
(*, attr=None, value=None, content=None, subcontent=None, tag=None, domain=None)[source]¶ Ensure correctly formed publish date patterns; to be used in conjunction with the configuration known_publish_date_tags property
Parameters: - attr (str) – The attribute type: class, id, etc
- value (str) – The value of the attribute
- content (str) – The name of another attribute (of the element) that contains the value
- subcontent (str) – The name of a json object key (optional)
- tag (str) – The type of tag, such as time that contains the publish date
- domain (str) – The domain to which this pattern pertains (optional)
Note
Must provide, at a minimum, (attr and value) or (tag)
Article¶
A goose3 extraction returns an Article object that contains the results of the parsing process.
-
class
goose3.
Article
[source]¶ -
additional_data
¶ A property bucket for consumers of goose3 to store custom data extractions
Note
Read only
Type: dict
-
authors
¶ A listing of authors as parsed from the meta tags
Note
Read only
Type: list(str)
-
canonical_link
¶ The canonical link of the article if found in the meta data
Note
Read only
Type: str
-
cleaned_text
¶ Cleaned text of the article without HTML tags; most commonly desired property
Note
Read only
Type: str
-
doc
¶ lxml document that is being processed
Note
Read only
Type: etree
-
domain
¶ Domain of the article parsed
Note
Read only
Type: str
-
final_url
¶ The URL that was pulled and parsed; None if raw_html was used and no url element was found.
Note
Read only
Type: str
-
infos
¶ The summation of all data available about the extracted article
Note
Read only
Type: dict
-
link_hash
¶ The MD5 of the final url to be used for various identification tasks
Note
Read only
Type: str
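For reference, an MD5 digest of a URL can be computed with the standard library as sketched below. The helper name url_md5 is hypothetical, and the exact string goose3 stores in link_hash may contain more than the bare digest, so treat this as an illustration of the hashing step only.

```python
import hashlib


def url_md5(url: str) -> str:
    """Return the hexadecimal MD5 digest of a URL encoded as UTF-8."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()
```

The digest is always 32 hexadecimal characters and is stable for a given URL, which is what makes it usable as an identifier.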
-
links
¶ A listing of URL links within the article
Note
Read only
Type: list(str)
-
meta_description
¶ Contents of the meta-description field from the HTML source
Note
Read only
Type: str
-
meta_encoding
¶ Contents of the encoding/charset field from the HTML source
Note
Read only
Type: str
-
meta_favicon
¶ Contents of the meta-favicon field from the HTML source
Note
Read only
Type: str
-
meta_keywords
¶ Contents of the meta-keywords field from the HTML source
Note
Read only
Type: str
-
meta_lang
¶ Contents of the meta-lang field from the HTML source
Note
Read only
Type: str
-
movies
¶ A listing of all videos within the article such as YouTube or Vimeo
Returns: See more information on the goose3.Video class
Return type: list(Video)
Note
Read only
Type: list(Video)
-
opengraph
¶ All opengraph tag data
Note
Read only
Type: dict
-
publish_date
¶ The date the article was published based on meta tag extraction
Note
Read only
Type: str
-
publish_datetime_utc
¶ The date time version of the published date based on meta tag extraction in the UTC timezone, if timezone information is known
Note
Read only
Type: datetime.datetime
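The phrase "in the UTC timezone, if timezone information is known" can be illustrated with the standard library. The sketch below is an assumption about the general idea rather than goose3's parsing code: an aware datetime is converted to UTC, and a naive one is returned unchanged. The helper name to_utc and the sample timestamp are hypothetical.

```python
from datetime import datetime, timezone


def to_utc(value: datetime) -> datetime:
    """Convert a timezone-aware datetime to UTC; leave naive values unchanged."""
    if value.tzinfo is None:
        # No timezone information is known, so no conversion is attempted.
        return value
    return value.astimezone(timezone.utc)


published = datetime.fromisoformat("2024-01-15T10:30:00+02:00")
published_utc = to_utc(published)
```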
-
raw_doc
¶ Original, uncleaned, and untouched lxml document to be processed
Note
Read only
Type: etree
-
raw_html
¶ The HTML represented as a string
Note
Read only
Type: str
-
schema
¶ All schema tag data
Note
Read only
Type: dict
-
tags
¶ List of article tags (non-metadata tags)
Note
Read only
Type: list(str)
-
title
¶ Title extracted from the HTML source
Note
Read only
Type: str
-
top_image
¶ The top image object that likely represents the article
Returns: See more information on the goose3.Image class
Return type: Image
Note
Read only
Type: Image
-
top_node
¶ The top Element that is a candidate for the main body of the article
Note
Read only
Type: etree
-
tweets
¶ A listing of embedded tweets in the article
Note
Read only
Type: list(str)
-
Image¶
-
class
goose3.
Image
[source]¶ -
bytes
¶ The size of the image in bytes
Note
Read only
Type: int
-
confidence_score
¶ The confidence score that this is the main image
Note
Read only
Type: float
-
extraction_type
¶ The extraction type used
Note
Read only
Type: str
-
height
¶ The image height in pixels
Note
Read only
Type: int
-
src
¶ Source URL for the image
Note
Read only
Type: str
-
top_image_node
¶ The most likely top image element node
Note
Read only
Type: etree
-
width
¶ The image width in pixels
Note
Read only
Type: int
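A short sketch of reading the Image properties listed above from an article's top_image. The helper describe_top_image is hypothetical, and it assumes that top_image is None when no image was selected (for example, when enable_image_fetching is left at its default of False).

```python
from typing import Optional


def describe_top_image(article) -> Optional[dict]:
    """Summarise article.top_image using the Image properties listed above."""
    image = article.top_image
    if image is None:
        # Assumption: a missing top image is represented as None.
        return None
    return {
        "src": image.src,
        "size": (image.width, image.height),
        "bytes": image.bytes,
        "confidence": image.confidence_score,
    }
```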
-
Video¶
-
class
goose3.
Video
[source]¶ Video object
-
embed_code
¶ The embed code of the video
Note
Read only
Type: str
-
embed_type
¶ The type of embedding such as embed, object, or iframe
Note
Read only
Type: str
-
height
¶ The video height in pixels
Note
Read only
Type: int
-
provider
¶ The video provider
Note
Read only
Type: str
-
src
¶ The URL source of the video
Note
Read only
Type: str
-
width
¶ The video width in pixels
Note
Read only
Type: int
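The Video properties above are reached through Article.movies. As a minimal sketch, the hypothetical helper below collects the provider and source URL of each video; it relies only on the movies, provider, and src attributes documented on this page.

```python
def list_video_sources(article) -> list:
    """Return (provider, src) pairs for every video found in the article."""
    return [(video.provider, video.src) for video in article.movies]
```

An article without embedded videos yields an empty list.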
-