flatisfy.filters package
Submodules
flatisfy.filters.cache module
Caching function for pictures.
-
class flatisfy.filters.cache.ImageCache(max_items=200, storage_dir=None)
Bases: flatisfy.filters.cache.MemoryCache
A cache for images, stored in memory.
-
class flatisfy.filters.cache.MemoryCache
Bases: object
A cache in memory.
-
get(key)
Get an element from the cache. Calls on_miss if the item is not already cached.
Parameters: key – Key of the element to retrieve.
Returns: Requested element.
-
hit_rate()
Get the hit rate, that is, the proportion of requests for an item which was already in the cache.
Returns: The hit rate, as a percentage.
-
miss_rate()
Get the miss rate, that is, the proportion of requests for an item which was not already in the cache.
Returns: The miss rate, as a percentage.
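The get / on_miss / hit_rate / miss_rate contract described above can be illustrated with a minimal sketch. This is not flatisfy's implementation: the class name SketchMemoryCache, the choice of passing on_miss as a constructor argument, and the value returned when the cache has never been queried are all assumptions made for the example.

```python
class SketchMemoryCache:
    """Hypothetical in-memory cache; not flatisfy's actual MemoryCache."""

    def __init__(self, on_miss):
        # Assumption: on_miss is a callable that builds the value for a key.
        self._on_miss = on_miss
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        """Return the cached value, computing it through on_miss if absent."""
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._on_miss(key)
        return self._store[key]

    def hit_rate(self):
        """Percentage of requests served from the cache (0.0 if unused)."""
        total = self.hits + self.misses
        return 100.0 * self.hits / total if total else 0.0

    def miss_rate(self):
        """Percentage of requests that had to call on_miss (0.0 if unused)."""
        total = self.hits + self.misses
        return 100.0 * self.misses / total if total else 0.0


cache = SketchMemoryCache(on_miss=lambda key: key.upper())
for requested_key in ["a", "a", "b", "a"]:
    cache.get(requested_key)
```

In this run, two of the four requests are served from the cache, so both rates are 50.0.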
-
flatisfy.filters.duplicates module
Filtering functions to detect and merge duplicates.
-
flatisfy.filters.duplicates.compare_photos(photo1, photo2, photo_cache, hash_threshold)
Compare two photos using the average hash method.
Parameters:
- photo1 – First photo URL.
- photo2 – Second photo URL.
- photo_cache – An instance of ImageCache to use to cache images.
- hash_threshold – The hash threshold between two images. Usually two different photos have a hash difference of 30.
Returns: True if the photos are identical, else False.
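To make the average hash comparison concrete, here is a minimal sketch operating on small grayscale pixel grids rather than downloaded images. It is an illustration only: the function names, the use of nested lists as images, and the choice of "distance less than or equal to the threshold" as the matching rule are assumptions, not flatisfy's code. Real implementations typically downscale each image to a small grayscale grid before hashing.

```python
def average_hash(pixels):
    """Return one bit per pixel: 1 if the pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash1, hash2):
    """Count the positions at which the two hashes differ."""
    return sum(bit1 != bit2 for bit1, bit2 in zip(hash1, hash2))


def photos_match(pixels1, pixels2, hash_threshold):
    """Assumption: photos match when the hash difference is <= threshold."""
    distance = hamming_distance(average_hash(pixels1), average_hash(pixels2))
    return distance <= hash_threshold


bright_left = [[200, 200, 10, 10], [200, 200, 10, 10]]
darker_copy = [[190, 190, 5, 5], [190, 190, 5, 5]]
bright_right = [[10, 10, 200, 200], [10, 10, 200, 200]]
```

Here darker_copy hashes to the same bits as bright_left despite the change in brightness, while bright_right differs in every bit.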
-
flatisfy.filters.duplicates.deep_detect(flats_list, config)
Deeper detection of duplicates based on any available data.
Parameters:
- flats_list – A list of flat dicts.
- config – A config dict.
Returns: A tuple of the deduplicated list of flat dicts and the list of all the flat objects that should be removed and considered duplicates (they were already merged).
-
flatisfy.filters.duplicates.detect(flats_list, key='id', merge=True, should_intersect=False)
Detect obvious duplicates within a given list of flats.
Duplicates may be found because some queries overlap (in particular, when asked for a given place, websites tend to return housing in nearby locations as well). We need to handle them, either by deleting the duplicates (merge=False) or by merging them together into a single flat object.
Parameters:
- flats_list – A list of flat dicts.
- key – The flat dict key on which the duplicate detection should be done.
- merge – Whether the duplicates found should be merged, or whether only one of them should be kept.
- should_intersect – Set to True if the values in the flat dicts are lists and you want to deduplicate on non-empty intersection (typically if they have a common URL).
Returns: A tuple of the deduplicated list of flat dicts and the list of all the flat objects that should be removed and considered duplicates (they were already merged).
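The difference between equality-based and intersection-based detection can be sketched as follows. This is a simplified illustration, not flatisfy's implementation: detect_sketch is a hypothetical name, it only covers the "keep one of them" behaviour (no merging), and it keeps the first flat seen for each group, which is an assumption.

```python
def detect_sketch(flats_list, key="id", should_intersect=False):
    """Split flats into (kept, duplicates); no merging is attempted."""
    kept, duplicates = [], []
    for flat in flats_list:
        for seen in kept:
            if should_intersect:
                # Values are lists: duplicates share at least one element.
                same = bool(set(flat[key]) & set(seen[key]))
            else:
                same = flat[key] == seen[key]
            if same:
                duplicates.append(flat)
                break
        else:
            # No already-kept flat matched, so this one is kept.
            kept.append(flat)
    return kept, duplicates


flats = [
    {"id": "a", "urls": ["u1"]},
    {"id": "b", "urls": ["u2", "u1"]},
    {"id": "a", "urls": ["u3"]},
]
by_id = detect_sketch(flats)
by_url = detect_sketch(flats, key="urls", should_intersect=True)
```

With key="id", the third flat duplicates the first. With key="urls" and should_intersect=True, the second flat duplicates the first because both list u1.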
-
flatisfy.filters.duplicates.find_number_common_photos(flat1_photos, flat2_photos, photo_cache, hash_threshold)
Compute the number of common photos between the two lists of photos for the flats.
Fetch the photos and compare them using the average hash method.
Parameters:
- flat1_photos – First list of flat photos. Each photo should be a dict with (at least) a url key.
- flat2_photos – Second list of flat photos. Each photo should be a dict with (at least) a url key.
- photo_cache – An instance of ImageCache to use to cache images.
- hash_threshold – The hash threshold between two images.
Returns: The number of common photos found.
-
flatisfy.filters.duplicates.get_duplicate_score(flat1, flat2, photo_cache, hash_threshold)
Compute the duplicate score between two flats. The higher the score, the more likely the two flats are to be duplicates.
Parameters:
- flat1 – First flat dict.
- flat2 – Second flat dict.
- photo_cache – An instance of ImageCache to use to cache images.
- hash_threshold – The hash threshold between two images.
Returns: The duplicate score as int.
-
flatisfy.filters.duplicates.get_or_compute_photo_hash(photo, photo_cache)
Get the computed hash from the photo dict or compute it if not found.
Parameters:
- photo – A photo, as a dict with (at least) a url key.
- photo_cache – An instance of ImageCache to use to cache images.
-
flatisfy.filters.duplicates.homogeneize_phone_number(numbers)
Homogenize the phone numbers by stripping any space, dash or dot, as well as the international prefix. Assumes it is dealing with French phone numbers (starting with a zero and having 10 characters).
Parameters: numbers – The phone number string to homogenize (can contain multiple phone numbers).
Returns: The cleaned phone number, or None if the number is not valid.
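As an illustration of the cleaning rules described above, the following sketch normalizes a single French phone number. It is not flatisfy's code: clean_french_number is a hypothetical helper, recognizing the international prefix as +33 or 0033 is an assumption, and the handling of strings containing several numbers is left out because the separator is not documented here.

```python
import re


def clean_french_number(number):
    """Return a 10-digit French number starting with 0, or None if invalid."""
    # Strip spaces, dots and dashes.
    digits = re.sub(r"[ .\-]", "", number)
    # Assumption: the international prefix is written +33 or 0033 and
    # replaces the national leading zero.
    digits = re.sub(r"^(\+33|0033)", "0", digits)
    if re.fullmatch(r"0\d{9}", digits):
        return digits
    return None


cleaned = [
    clean_french_number("+33 1 23 45 67 89"),
    clean_french_number("01.23.45.67.89"),
    clean_french_number("12345"),
]
```

The first two inputs clean to the same national form, and the third is rejected because it does not have 10 digits.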
flatisfy.filters.metadata module
Filtering functions to handle flatisfy-specific metadata.
This includes functions to guess metadata (postal codes, stations) from the actual fetched data.
-
flatisfy.filters.metadata.compute_travel_times(flats_list, constraint, config)
Compute the travel time between each flat and the points listed in the constraints.
Parameters:
- flats_list – A list of flat dicts.
- constraint – The constraint that the flats_list should satisfy.
- config – A config dict.
Returns: An updated list of flat dicts with computed travel times.
Note
Requires a Navitia or CityMapper API key in the config.
-
flatisfy.filters.metadata.fuzzy_match(query, choices, limit=3, threshold=75)
Custom search for the best element in choices matching the query.
Parameters:
- query – The string to match.
- choices – The list of strings to match with.
- limit – The maximum number of items to return. Set to None to return all values above threshold.
- threshold – The score threshold to use.
Returns: Tuples of matching items and associated confidence.
Note
This function works by removing any fancy character from the query and choices strings (replacing any non-alphabetic and non-numeric character with a space), converting them to lower case and normalizing them (collapsing multiple spaces, etc.). It also converts any Roman numerals to the decimal system. It then compares the strings and looks for the longest string in choices which is a substring of query. The longest one gets a confidence of 100. The shorter ones get a confidence proportional to their length.
See also
flatisfy.tools.normalize_string
Example:
>>> fuzzy_match("Paris 14ème", ["Ris", "ris", "Paris 14"], limit=1)
[("Paris 14", 100)]
>>> fuzzy_match(
...     "Saint-Jacques, Denfert-Rochereau (Colonel Rol-Tanguy), "
...     "Mouton-Duvernet",
...     ["saint-jacques", "denfert rochereau", "duvernet", "toto"],
...     limit=4
... )
[('denfert rochereau', 100), ('saint-jacques', 76)]
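The matching strategy described in the note can be approximated with the sketch below. It is a simplified illustration under stated assumptions, not the library's implementation: normalize and fuzzy_match_sketch are hypothetical names, accents are stripped with unicodedata, Roman numeral conversion is omitted, and a choice is kept when its confidence is greater than or equal to the threshold.

```python
import re
import unicodedata


def normalize(text):
    """Lower-case, strip accents, and turn any other symbol into a space."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", " ", ascii_text.lower()).strip()


def fuzzy_match_sketch(query, choices, limit=3, threshold=75):
    """Score the choices that are substrings of the query by their length."""
    normalized_query = normalize(query)
    matches = [
        (choice, len(normalize(choice)))
        for choice in choices
        if normalize(choice) and normalize(choice) in normalized_query
    ]
    if not matches:
        return []
    longest = max(length for _, length in matches)
    # The longest match gets 100, shorter ones a proportional confidence.
    scored = [
        (choice, int(round(100 * length / longest)))
        for choice, length in matches
    ]
    scored = [item for item in scored if item[1] >= threshold]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored if limit is None else scored[:limit]


stations = fuzzy_match_sketch(
    "Saint-Jacques, Denfert-Rochereau (Colonel Rol-Tanguy), Mouton-Duvernet",
    ["saint-jacques", "denfert rochereau", "duvernet", "toto"],
    limit=4,
)
```

In this sketch, "denfert rochereau" (17 characters) is the longest substring and scores 100, "saint-jacques" (13 characters) scores 76, and "duvernet" (8 characters) falls below the threshold.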
-
flatisfy.filters.metadata.guess_postal_code(flats_list, constraint, config, distance_threshold=20000)
Try to guess the postal code from the location of the flats.
Parameters:
- flats_list – A list of flat dicts.
- constraint – The constraint that the flats_list should satisfy.
- config – A config dict.
- distance_threshold – Maximum distance in meters between the constraint postal codes (from config) and the one found by this function, to avoid bad fuzzy matching. Can be None to disable thresholding.
Returns: An updated list of flat dicts with guessed postal code.
-
flatisfy.filters.metadata.guess_stations(flats_list, constraint, config)
Try to match the station field with a list of available stations nearby.
Parameters:
- flats_list – A list of flat dicts.
- constraint – The constraint that the flats_list should satisfy.
- config – A config dict.
Returns: An updated list of flat dicts with guessed nearby stations.
-
flatisfy.filters.metadata.init(flats_list, constraint)
Create a flatisfy key containing a dict of metadata fetched by flatisfy for each flat in the list. Also perform some basic transformations on flat objects to prepare for the metadata fetching.
Parameters:
- flats_list – A list of flat dicts.
- constraint – The constraint that the flats_list should satisfy.
Returns: The updated list.
Module contents
This module contains all the filtering functions. It exposes the first_pass and second_pass functions, which apply a set of filters during the first pass and the second pass, respectively.
-
flatisfy.filters.refine_with_details_criteria(flats_list, constraint)
Filter a list of flats according to the criteria which require the full details to be fetched. These include the minimum number of photos and terms that should appear in the description.
Note
This has to be done in a separate function, and not with the other criteria, as photos and full descriptions are only fetched in the second pass.
Parameters:
- flats_list – A list of flat dicts to filter.
- constraint – The constraint that the flats_list should satisfy.
Returns: A tuple of flats to keep and flats to delete.
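The keep/delete split on details criteria could look like the sketch below. It is only an illustration: the structure of the constraint is not documented here, so the function takes hypothetical min_photos and required_terms arguments directly, and the flat keys "photos" and "text" are assumptions as well.

```python
def refine_with_details_sketch(flats_list, min_photos, required_terms):
    """Return (keep, delete) based on photo count and description terms."""
    keep, delete = [], []
    for flat in flats_list:
        # Assumption: the description is stored under a "text" key.
        description = flat.get("text", "").lower()
        enough_photos = len(flat.get("photos", [])) >= min_photos
        has_terms = all(
            term.lower() in description for term in required_terms
        )
        if enough_photos and has_terms:
            keep.append(flat)
        else:
            delete.append(flat)
    return keep, delete


candidates = [
    {"id": 1, "photos": ["p1", "p2"], "text": "Bright flat with balcony"},
    {"id": 2, "photos": [], "text": "Flat with balcony"},
    {"id": 3, "photos": ["p1", "p2"], "text": "Ground floor studio"},
]
kept, deleted = refine_with_details_sketch(
    candidates, min_photos=1, required_terms=["balcony"]
)
```

Only the first flat has both at least one photo and the required term, so it is the only one kept.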
-
flatisfy.filters.refine_with_housing_criteria(flats_list, constraint)
Filter a list of flats according to criteria.
Housing listing websites tend to return broader results than what was actually asked for. We should therefore filter the list to match the user's criteria and avoid exposing unwanted flats.
Parameters:
- flats_list – A list of flat dicts to filter.
- constraint – The constraint that the flats_list should satisfy.
Returns: A tuple of flats to keep and flats to delete.