regparser.tree package¶
Subpackages¶
- regparser.tree.appendix package
- regparser.tree.depth package
- regparser.tree.xml_parser package
- Submodules
- regparser.tree.xml_parser.appendices module
- regparser.tree.xml_parser.extended_preprocessors module
- regparser.tree.xml_parser.flatsubtree_processor module
- regparser.tree.xml_parser.import_category module
- regparser.tree.xml_parser.interpretations module
- regparser.tree.xml_parser.paragraph_processor module
- regparser.tree.xml_parser.preprocessors module
- regparser.tree.xml_parser.reg_text module
- regparser.tree.xml_parser.simple_hierarchy_processor module
- regparser.tree.xml_parser.tree_utils module
- regparser.tree.xml_parser.us_code module
- regparser.tree.xml_parser.xml_wrapper module
- Module contents
Submodules¶
regparser.tree.build module¶
regparser.tree.interpretation module¶
regparser.tree.paragraph module¶
-
class
regparser.tree.paragraph.
ParagraphParser
(p_regex, node_type)[source]¶ -
best_start
(text, p_level, paragraph, starts, exclude=None)[source]¶ Given a list of potential paragraph starts, pick the best based on knowledge of subparagraph structure. Do this by checking if the id following the subparagraph (e.g. ii) is between the first match and the second. If so, skip it, as that implies the first match was a subparagraph.
-
build_tree
(text, p_level=0, exclude=None, label=None, title='')[source]¶ Build a dict to represent the text hierarchy.
-
find_paragraph_start_match
(text, p_level, paragraph, exclude=None)[source]¶ Find the positions for the start and end of the requested label. p_level is one of 0, 1, 2, 3; paragraph is the index within that label. Return None if not present. Does not return results in the exclude list (a list of start/stop indices).
-
static
matching_subparagraph_ids
(p_level, paragraph)[source]¶ Return a list of matches if this paragraph id matches one of the subparagraph ids (e.g. letter (i) and roman numeral (i)).
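As a rough illustration of the ambiguity this method detects, the sketch below uses a hypothetical four-level marker table (lowercase letters, digits, roman numerals, uppercase letters) and reports the other levels in which the same marker text appears. The table `SKETCH_LEVELS`, the function name, and the `(level, index)` return shape are assumptions made for the example, not the library's data or return format.

```python
import string

# Hypothetical marker table; the real parser's levels may differ.
SKETCH_LEVELS = [
    list(string.ascii_lowercase),                                  # level 0: a, b, c
    [str(i) for i in range(1, 51)],                                # level 1: 1, 2, 3
    ["i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"],  # level 2
    list(string.ascii_uppercase),                                  # level 3: A, B, C
]


def sketch_matching_ids(p_level, paragraph):
    """Return (level, index) pairs where the same marker text also occurs."""
    marker = SKETCH_LEVELS[p_level][paragraph]
    return [
        (level, ids.index(marker))
        for level, ids in enumerate(SKETCH_LEVELS)
        if level != p_level and marker in ids
    ]


letter_i = string.ascii_lowercase.index("i")
print(sketch_matching_ids(0, letter_i))  # letter (i) also reads as roman numeral (i)
print(sketch_matching_ids(0, 0))         # letter (a) has no counterpart
```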
-
-
regparser.tree.paragraph.
hash_for_paragraph
(text)[source]¶ Hash a chunk of text and convert it into an integer for use with a MARKERLESS paragraph identifier. We’ll trim to just 8 hex characters for legibility. We don’t need to fear hash collisions as we’ll have 16**8 ~ 4 billion possibilities. The birthday paradox tells us we’d only expect collisions after ~ 60 thousand entries. We’re expecting at most a few hundred.
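The truncation and the collision estimate can be checked with a small sketch. The choice of SHA-1 over UTF-8 bytes and the name `sketch_hash_for_paragraph` are assumptions for illustration; the library may use a different digest or normalise the text first.

```python
import hashlib
import math


def sketch_hash_for_paragraph(text):
    """Hash text, keep the first 8 hex characters, and return them as an int."""
    # Assumption: SHA-1 over UTF-8 bytes; the real function may differ.
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    return int(digest[:8], 16)


space = 16 ** 8                      # number of distinct 8-hex-character values
birthday_scale = math.isqrt(space)   # order of magnitude where collisions become likely

# Approximate chance of any collision among n entries: n * (n - 1) / (2 * space)
n = 300
approx_collision_probability = n * (n - 1) / (2 * space)

print(sketch_hash_for_paragraph("Some markerless paragraph text"))
print(space, birthday_scale, approx_collision_probability)
```

With a few hundred paragraphs the approximate collision probability stays far below one percent, which is consistent with the reasoning in the docstring.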
regparser.tree.priority_stack module¶
-
class
regparser.tree.priority_stack.
PriorityStack
[source]¶ Bases:
object
-
add
(node_level, node)[source]¶ Add a new node with level node_level to the stack. Unwind the stack when necessary. Returns self for chaining.
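A minimal sketch of the add-and-unwind idea follows. It is not the PriorityStack implementation: the class `SketchStack`, its attributes, and the dict-based nodes are hypothetical, and the real class may track levels and unwinding differently.

```python
class SketchStack:
    """Hypothetical stand-in that builds a tree from (level, node) pairs."""

    def __init__(self):
        self.roots = []   # nodes with no open ancestor
        self.open = []    # (level, node) pairs from outermost to innermost

    def add(self, node_level, node):
        # Unwind: close every open node at the same or a deeper level.
        while self.open and self.open[-1][0] >= node_level:
            self.open.pop()
        if self.open:
            self.open[-1][1]["children"].append(node)
        else:
            self.roots.append(node)
        self.open.append((node_level, node))
        return self  # returning self allows chained calls


def make_node(name):
    return {"name": name, "children": []}


a, b, c, d = (make_node(name) for name in "abcd")
stack = SketchStack().add(1, a).add(2, b).add(2, c).add(1, d)
print([node["name"] for node in stack.roots])
print([node["name"] for node in a["children"]])
```

Adding `d` at level 1 unwinds past `c` and `a`, so `d` becomes a sibling of `a` rather than a child of `c`.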
-
regparser.tree.reg_text module¶
-
regparser.tree.reg_text.
build_empty_part
(part)[source]¶ When a regulation doesn’t have a subpart, we give it an emptypart (a dummy subpart) so that the regulation tree is consistent.
-
regparser.tree.reg_text.
build_subjgrp
(title, part, letter_list)[source]¶ We’re constructing a fake “letter” here by taking the first letter of each word in the subjgrp’s title, or using the first two letters of the first word if there’s just one. We’re avoiding single letters to make sure we don’t duplicate an existing subpart, and we’re hoping that the initialisms created by this method are unique for this regulation. We can make this more robust by accepting a list of existing initialisms and returning both that list and the Node, and checking against the list as we construct them.
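The initialism rule on its own can be sketched as below. The function name `sketch_subjgrp_letters` is hypothetical, and the sketch ignores `part`, `letter_list`, punctuation handling, and Node construction, all of which the real function may treat differently.

```python
def sketch_subjgrp_letters(title):
    """Build a fake subpart 'letter' from a subject group title."""
    words = title.split()
    if len(words) == 1:
        # A single word: take two letters so the result is never one letter long.
        return words[0][:2]
    # Several words: take the first letter of each.
    return "".join(word[0] for word in words)


print(sketch_subjgrp_letters("General Provisions"))  # GP
print(sketch_subjgrp_letters("Definitions"))         # De
```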
-
regparser.tree.reg_text.
find_next_section_start
(text, part)[source]¶ Find the start of the next section (e.g. 205.14)
-
regparser.tree.reg_text.
find_next_subpart_start
(text)[source]¶ Find the start of the next Subpart (e.g. Subpart B)
-
regparser.tree.reg_text.
next_section_offsets
(text, part)[source]¶ Find the start/end of the next section
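One way to picture the start/end search is the regular-expression sketch below. It assumes section headers look like `§ 205.14`; that pattern, the function name, and the `(start, end)` tuple are assumptions for illustration, and the library may locate sections with a different grammar.

```python
import re


def sketch_next_section_offsets(text, part):
    """Return (start, end) of the first section for this part, or None."""
    # Assumption: a section header is '§', optional spaces, then part.number.
    pattern = re.compile(r"§\s*" + re.escape(str(part)) + r"\.\d+")
    starts = [match.start() for match in pattern.finditer(text)]
    if not starts:
        return None
    # The section ends where the following section begins, or at end of text.
    end = starts[1] if len(starts) > 1 else len(text)
    return starts[0], end


sample = "Intro\n§ 205.1 Scope\nbody\n§ 205.2 Definitions\nmore"
start, end = sketch_next_section_offsets(sample, 205)
print(repr(sample[start:end]))
```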
regparser.tree.struct module¶
-
class
regparser.tree.struct.
FrozenNode
(text='', children=(), label=(), title='', node_type=u'regtext', tagged_text='')[source]¶ Bases:
object
Immutable interface for nodes. No guarantees about internal state.
-
child_labels
¶
-
children
¶
-
clone
(**kwargs)[source]¶ Implement a namedtuple _replace style functionality, copying all fields that aren’t explicitly replaced.
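For readers unfamiliar with the namedtuple behaviour that clone mirrors, the standard-library `_replace` method works as sketched here. `SketchNode` is a hypothetical namedtuple, not FrozenNode itself.

```python
from collections import namedtuple

# Hypothetical stand-in with a few of the documented fields.
SketchNode = namedtuple("SketchNode", ["text", "label", "title"])

original = SketchNode(text="Some text", label=("205", "14"), title="")

# _replace returns a new tuple: named fields are swapped, the rest are copied.
copy = original._replace(title="New title")

print(original)
print(copy)
```

The original value is left untouched, which is the property an immutable node interface relies on.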
-
static
from_node
(node)[source]¶ Convert a struct.Node (or similar) into a struct.FrozenNode. This also checks if this node has already been instantiated. If so, it returns the instantiated version (i.e. only one of each identical node exists in memory)
-
hash
¶
-
label
¶
-
label_id
¶
-
node_type
¶
-
prototype
()[source]¶ When we instantiate a FrozenNode, we add it to _pool if we’ve not seen an identical FrozenNode before. If we have, we want to work with that previously seen version instead. This method returns the _first_ FrozenNode with identical fields.
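The pooling idea can be sketched as follows. The class `SketchFrozen`, its two fields, and the dict-based `_pool` are assumptions for the example; the actual FrozenNode may key and store its pool differently.

```python
class SketchFrozen:
    """Hypothetical immutable-style node with an instance pool."""

    _pool = {}  # maps a tuple of field values to the first instance seen

    def __init__(self, text="", label=()):
        self._text = text
        self._label = tuple(label)

    def _key(self):
        return (self._text, self._label)

    def prototype(self):
        # setdefault stores this instance only if no identical one exists yet,
        # and always returns the first instance registered under the key.
        return SketchFrozen._pool.setdefault(self._key(), self)


first = SketchFrozen("x", ("205", "1")).prototype()
second = SketchFrozen("x", ("205", "1")).prototype()
other = SketchFrozen("y", ("205", "2")).prototype()

print(first is second)  # identical fields share one object
print(first is other)   # different fields do not
```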
-
tagged_text
¶
-
text
¶
-
title
¶
-
-
class
regparser.tree.struct.
FullNodeEncoder
(skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, encoding='utf-8', default=None)[source]¶ Bases:
json.encoder.JSONEncoder
Encodes Nodes into JSON without losing any of the fields.
-
FIELDS
= set(['tagged_text', 'title', 'text', 'source_xml', 'label', 'node_type', 'children'])¶
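A custom `json.JSONEncoder` that emits every listed field might look like the sketch below. `SketchNode` and `SketchFullEncoder` are hypothetical; the field names come from FIELDS above, but how the real encoder serialises `source_xml` is not shown in this documentation, so the sketch leaves it as None.

```python
import json


class SketchNode:
    """Hypothetical node carrying the fields listed in FIELDS."""

    def __init__(self, text="", children=None, label=None, title=None,
                 node_type="regtext", source_xml=None, tagged_text=""):
        self.text = text
        self.children = children or []
        self.label = label or []
        self.title = title
        self.node_type = node_type
        self.source_xml = source_xml
        self.tagged_text = tagged_text


class SketchFullEncoder(json.JSONEncoder):
    FIELDS = {"tagged_text", "title", "text", "source_xml",
              "label", "node_type", "children"}

    def default(self, obj):
        # Called only for objects the base encoder cannot serialise.
        if isinstance(obj, SketchNode):
            return {field: getattr(obj, field) for field in self.FIELDS}
        return super().default(obj)


child = SketchNode(text="Scope", label=["205", "1"])
root = SketchNode(text="", label=["205"], title="Part 205", children=[child])
encoded = json.dumps(root, cls=SketchFullEncoder, sort_keys=True)
decoded = json.loads(encoded)
print(sorted(decoded))
```

Child nodes are encoded through the same `default` hook, so the nesting carries through without extra code.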
-
-
class
regparser.tree.struct.
Node
(text='', children=None, label=None, title=None, node_type=u'regtext', source_xml=None, tagged_text='')[source]¶ Bases:
object
-
APPENDIX
= u'appendix'¶
-
EMPTYPART
= u'emptypart'¶
-
EXTRACT
= u'extract'¶
-
INTERP
= u'interp'¶
-
INTERP_MARK
= 'Interp'¶
-
MARKERLESS_REGEX
= <_sre.SRE_Pattern object>¶
-
NOTE
= u'note'¶
-
REGTEXT
= u'regtext'¶
-
SUBPART
= u'subpart'¶
-
cfr_part
¶
-
-
class
regparser.tree.struct.
NodeEncoder
(skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, encoding='utf-8', default=None)[source]¶ Bases:
json.encoder.JSONEncoder
Custom JSON encoder to handle Node objects
-
regparser.tree.struct.
filter_walk
(node, fn)[source]¶ Perform fn on the label for every node in the tree and return a list of nodes on which the function returns truthy.
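A recursive sketch of this walk, using plain dicts in place of Node objects, is given below. The function name and the dict layout (`label`, `children`) are assumptions for illustration.

```python
def sketch_filter_walk(node, fn):
    """Collect every node in the tree whose label makes fn return truthy."""
    matches = [node] if fn(node["label"]) else []
    for child in node["children"]:
        matches.extend(sketch_filter_walk(child, fn))
    return matches


tree = {
    "label": ["205"],
    "children": [
        {"label": ["205", "1"], "children": []},
        {"label": ["205", "2"], "children": [
            {"label": ["205", "2", "a"], "children": []},
        ]},
    ],
}

# Select nodes whose label has exactly two components.
sections = sketch_filter_walk(tree, lambda label: len(label) == 2)
print([node["label"] for node in sections])
```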
-
regparser.tree.struct.
find
(root, label)[source]¶ Search through the tree to find the node with this label.
-
regparser.tree.struct.
find_first
(root, predicate)[source]¶ Walk the tree and find the first node which matches the predicate
-
regparser.tree.struct.
find_parent
(root, label)[source]¶ Search through the tree to find the _parent_ of a node with this label.
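The label search and the parent search can both be sketched as depth-first walks. The sketch uses dict-based nodes and list labels; the function names and the label representation are assumptions, and the real functions may expect a different label form.

```python
def sketch_find(root, label):
    """Return the node whose label equals `label`, or None."""
    if root["label"] == label:
        return root
    for child in root["children"]:
        found = sketch_find(child, label)
        if found is not None:
            return found
    return None


def sketch_find_parent(root, label):
    """Return the node that directly contains the labelled node, or None."""
    for child in root["children"]:
        if child["label"] == label:
            return root
        found = sketch_find_parent(child, label)
        if found is not None:
            return found
    return None


leaf = {"label": ["205", "2", "a"], "children": []}
section = {"label": ["205", "2"], "children": [leaf]}
part = {"label": ["205"], "children": [section]}

print(sketch_find(part, ["205", "2", "a"]) is leaf)
print(sketch_find_parent(part, ["205", "2", "a"]) is section)
```

In this sketch the root has no parent, so searching for the root's own label with `sketch_find_parent` returns None.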
-
regparser.tree.struct.
merge_duplicates
(nodes)[source]¶ Given a list of nodes with the same-length label, merge any duplicates (by combining their children)
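A simplified sketch of merging by label is shown below. It combines children in one pass and does not recurse into them; whether the library merges nested duplicates as well is not stated here, so treat the function name, the dict-based nodes, and that behaviour as assumptions.

```python
def sketch_merge_duplicates(nodes):
    """Merge nodes that share a label by concatenating their children."""
    merged = {}  # insertion-ordered, so first-seen order is kept
    for node in nodes:
        key = tuple(node["label"])
        if key in merged:
            merged[key]["children"].extend(node["children"])
        else:
            # Copy so the input nodes are not mutated.
            merged[key] = {"label": list(node["label"]),
                           "children": list(node["children"])}
    return list(merged.values())


nodes = [
    {"label": ["205", "1"], "children": ["a"]},
    {"label": ["205", "2"], "children": ["x"]},
    {"label": ["205", "1"], "children": ["b"]},
]
result = sketch_merge_duplicates(nodes)
print(result)
```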