Metadata-Version: 1.0
Name: acora
Version: 1.2
Summary: Fast multi-keyword search engine for text strings
Home-page: http://pypi.python.org/pypi/acora
Author: Stefan Behnel
Author-email: stefan_ml@behnel.de
License: UNKNOWN
Download-URL: http://pypi.python.org/packages/source/a/acora/acora-1.2.tar.gz
Description: Acora
        ======
        
        Author: Stefan Behnel
        
        
        What is Acora?
        ---------------
        
        Acora is 'fgrep' for Python, a fast multi-keyword text search engine.
        
        Based on a set of keywords, it generates a search automaton (DFA) and
        runs it over string input, either unicode or bytes.
        
        It is based on the Aho-Corasick algorithm and an NFA-to-DFA powerset
        construction.
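        
        The following is a minimal sketch of the Aho-Corasick idea, not Acora's
        actual code.  It assumes a plain trie with failure links (the NFA form)
        instead of the DFA that Acora generates, and the names ``build_automaton``
        and ``find_all`` are made up for this illustration.
        
        ```python
        from collections import deque
        
        
        def build_automaton(keywords):
            """Build a keyword trie with failure links (illustrative only)."""
            goto = [{}]     # goto[state][char] -> next state
            fail = [0]      # fail[state] -> state of the longest proper suffix
            output = [[]]   # output[state] -> keywords that end in this state
        
            # Phase 1: insert every keyword into the trie.
            for kw in keywords:
                state = 0
                for ch in kw:
                    if ch not in goto[state]:
                        goto.append({})
                        fail.append(0)
                        output.append([])
                        goto[state][ch] = len(goto) - 1
                    state = goto[state][ch]
                output[state].append(kw)
        
            # Phase 2: breadth-first pass that computes the failure links.
            queue = deque(goto[0].values())
            while queue:
                state = queue.popleft()
                for ch, child in goto[state].items():
                    queue.append(child)
                    fallback = fail[state]
                    while fallback and ch not in goto[fallback]:
                        fallback = fail[fallback]
                    fail[child] = goto[fallback].get(ch, 0)
                    # Keywords ending in the suffix state also end here.
                    output[child] = output[child] + output[fail[child]]
            return goto, fail, output
        
        
        def find_all(text, automaton):
            """Return (keyword, start position) pairs for all matches in text."""
            goto, fail, output = automaton
            state = 0
            matches = []
            for i, ch in enumerate(text):
                # Follow failure links until a transition for ch exists.
                while state and ch not in goto[state]:
                    state = fail[state]
                state = goto[state].get(ch, 0)
                for kw in output[state]:
                    matches.append((kw, i - len(kw) + 1))
            return matches
        
        
        automaton = build_automaton(['ab', 'bc', 'de', 'a', 'b'])
        print(find_all('abc', automaton))
        # [('a', 0), ('ab', 0), ('b', 1), ('bc', 1)]
        ```
        
        The text is scanned once, character by character, regardless of how
        many keywords are in the set.  A DFA removes the inner ``while`` loop by
        precomputing the target state for every failure case.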
        
        Acora comes with both a pure Python implementation and a fast binary module
        written in Cython.
        
        
        Features
        ---------
        
        * works with unicode strings and byte strings
        * about 2-3x as fast as Python's regular expression engine
        * finds overlapping matches, i.e. all matches of all keywords
        * support for case insensitive search (~10x as fast as 're')
        * frees the GIL while searching
        * additional (slow but short) pure Python implementation
        * support for Python 2.5+ and 3.x
        * support for searching in files
        * permissive BSD license
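        
        To illustrate what "overlapping matches" means, the sketch below uses
        only the standard library.  It compares an alternation pattern in 're',
        which skips over text it has already matched, with a naive scan that
        reports every keyword at every position.  The helper ``naive_findall``
        is hypothetical and only mimics the kind of result Acora returns; it is
        not how Acora searches.
        
        ```python
        import re
        
        keywords = ['ab', 'bc', 'a', 'b']
        
        # 're' consumes the matched text, so keywords that overlap are skipped.
        pattern = re.compile('|'.join(re.escape(kw) for kw in keywords))
        regex_matches = pattern.findall('abc')
        print(regex_matches)    # ['ab']
        
        
        def naive_findall(text, keywords):
            """Report every keyword at every position (slow, for illustration)."""
            return [(kw, pos)
                    for pos in range(len(text))
                    for kw in sorted(keywords, key=len)
                    if text.startswith(kw, pos)]
        
        
        all_matches = naive_findall('abc', keywords)
        print(all_matches)      # [('a', 0), ('ab', 0), ('b', 1), ('bc', 1)]
        ```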
        
        
        How do I use it?
        -----------------
        
        Import the package::
        
        >>> from acora import AcoraBuilder
        
        Collect some keywords::
        
        >>> builder = AcoraBuilder('ab', 'bc', 'de')
        >>> builder.add('a', 'b')
        
        Generate the Acora search engine for the current keyword set::
        
        >>> ac = builder.build()
        
        Search a string for all occurrences::
        
        >>> ac.findall('abc')
        [('a', 0), ('ab', 0), ('b', 1), ('bc', 1)]
        >>> ac.findall('abde')
        [('a', 0), ('ab', 0), ('b', 1), ('de', 2)]
        
        Iterate over the search results as they come in::
        
        >>> for kw, pos in ac.finditer('abde'):
        ...     print("%2s[%d]" % (kw, pos))
        a[0]
        ab[0]
        b[1]
        de[2]
        
        
        FAQs and recipes
        -----------------
        
        #) How do I run a greedy search for the longest matching keywords?
        
        >>> builder = AcoraBuilder('a', 'ab', 'abc')
        >>> ac = builder.build()
        
        >>> for kw, pos in ac.finditer('abbabc'):
        ...     print(kw)
        a
        ab
        a
        ab
        abc
        
        >>> from itertools import groupby
        >>> from operator import itemgetter
        
        >>> def longest_match(matches):
        ...     for pos, match_set in groupby(matches, itemgetter(1)):
        ...         yield max(match_set)
        
        >>> for kw, pos in longest_match(ac.finditer('abbabc')):
        ...     print(kw)
        ab
        abc
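        
        Note that ``itertools.groupby`` only merges adjacent items, so the
        recipe above relies on matches with the same start position arriving
        consecutively.  If that cannot be guaranteed for a given keyword set,
        a variant along the following lines may help.  It is a sketch that
        assumes the matches are ``(keyword, position)`` tuples as shown above;
        the name ``longest_per_position`` is made up, and the function buffers
        all matches instead of streaming them.
        
        ```python
        def longest_per_position(matches):
            """Keep only the longest keyword found at each start position."""
            longest = {}
            for kw, pos in matches:
                if pos not in longest or len(kw) > len(longest[pos]):
                    longest[pos] = kw
            # Report the surviving matches in order of their start position.
            return [(longest[pos], pos) for pos in sorted(longest)]
        
        
        # Matches for the same position need not be adjacent here.
        sample = [('a', 0), ('b', 1), ('abc', 0)]
        print(longest_per_position(sample))   # [('abc', 0), ('b', 1)]
        ```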
        
        #) How do I parse line by line, as fgrep does, but with arbitrary line endings?
        
        >>> def group_by_lines(s, *keywords):
        ...     builder = AcoraBuilder('\r', '\n', *keywords)
        ...     ac = builder.build()
        ...
        ...     current_line_matches = []
        ...     last_ending = None
        ...
        ...     for kw, pos in ac.finditer(s):
        ...         if kw in '\r\n':
        ...             if last_ending == '\r' and kw == '\n':
        ...                 last_ending = None  # LF completes a CRLF pair
        ...                 continue
        ...             yield tuple(current_line_matches)
        ...             del current_line_matches[:]
        ...             last_ending = kw
        ...         else:
        ...             last_ending = None
        ...             current_line_matches.append(kw)
        ...     yield tuple(current_line_matches)
        
        >>> kwds = ['ab', 'bc', 'de']
        >>> for matches in group_by_lines('a\r\r\nbc\r\ndede\n\nab', *kwds):
        ...     print(matches)
        ()
        ()
        ('bc',)
        ('de', 'de')
        ()
        ('ab',)
        
        
        Changelog
        ----------
        
        * 1.2 [2010-01-30]
        
        - deep-copy support for AcoraBuilder class
        - doc/test fixes
        - include .hg repo in source distribution
        - built using Cython 0.12.1 (beta0)
        
        * 1.1 [2010-01-29]
        
        - doc updates
        - some cleanup
        - built using Cython 0.12.1 (beta0)
        
        * 1.0 [2010-01-29]
        
        - initial release
        
Platform: UNKNOWN
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Cython
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.5
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.0
Classifier: Programming Language :: Python :: 3.1
Classifier: Operating System :: OS Independent
Classifier: Topic :: Text Processing
