flatisfy.filters package

Submodules

flatisfy.filters.cache module

Caching function for pictures.

class flatisfy.filters.cache.ImageCache(max_items=200, storage_dir=None)[source]

Bases: flatisfy.filters.cache.MemoryCache

A cache for images, stored in memory.

static compute_filename(url)[source]

Compute filename (hash of the URL) for the cached image.

Parameters:url – The URL of the image.
Returns:The filename, with its extension.

on_miss(url)[source]

Helper to actually retrieve photos if not already cached.
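As an illustration of what compute_filename describes (a filename derived from a hash of the URL, with its extension), here is a minimal sketch. The choice of SHA-1 and the way the extension is extracted are assumptions made for this example, not necessarily what ImageCache does; compute_filename_sketch is a hypothetical name.

```python
import hashlib
import os
from urllib.parse import urlparse


def compute_filename_sketch(url):
    """Return a filename made of a hash of the URL plus the URL's extension."""
    # Hash the full URL so that two different URLs get different filenames.
    # (SHA-1 is an assumption; the real implementation may use another hash.)
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # Take the extension from the URL path only, ignoring any query string.
    extension = os.path.splitext(urlparse(url).path)[1]
    return digest + extension
```

The same URL always maps to the same filename, which is what makes the name usable as a cache key on disk.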

class flatisfy.filters.cache.MemoryCache[source]

Bases: object

A cache in memory.

get(key)[source]

Get an element from the cache. Calls on_miss if the item is not already cached.

Parameters:key – Key of the element to retrieve.
Returns:Requested element.

hit_rate()[source]

Get the hit rate, that is the rate at which we requested an item which was already in the cache.

Returns:The hit rate, as a percentage.

miss_rate()[source]

Get the miss rate, that is the rate at which we requested an item which was not already in the cache.

Returns:The miss rate, as a percentage.

static on_miss(key)[source]

Method to be called whenever an object is requested from the cache but was not already cached. Typically, it makes an HTTP request to fetch it.

Parameters:key – Key of the requested object.
Returns:The object content.

total()[source]

Get the total number of calls to the cache (either hits, or misses followed by a fetch with on_miss).

Returns:Total number of accesses.
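The interplay between get, on_miss and the hit/miss counters can be sketched as follows. This is a simplified illustration, not the actual MemoryCache code: MemoryCacheSketch is a hypothetical class, and passing on_miss as a constructor argument (rather than overriding a static method) is an assumption made to keep the example short.

```python
class MemoryCacheSketch:
    """Dict-backed cache that counts hits and misses."""

    def __init__(self, on_miss):
        self._store = {}
        self._on_miss = on_miss  # callable fetching the value for a key
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
        else:
            # Not cached yet: fetch the value and remember it.
            self.misses += 1
            self._store[key] = self._on_miss(key)
        return self._store[key]

    def total(self):
        return self.hits + self.misses

    def hit_rate(self):
        # Percentage of accesses served from the cache; 0 if never accessed.
        return 100.0 * self.hits / self.total() if self.total() else 0.0

    def miss_rate(self):
        # Percentage of accesses that required a call to on_miss.
        return 100.0 * self.misses / self.total() if self.total() else 0.0
```

With this sketch, hit_rate() and miss_rate() always sum to 100 once the cache has been accessed at least once.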

flatisfy.filters.duplicates module

Filtering functions to detect and merge duplicates.

flatisfy.filters.duplicates.compare_photos(photo1, photo2, photo_cache, hash_threshold)[source]

Compares two photos using the average hash method.

Parameters:
  • photo1 – First photo url.
  • photo2 – Second photo url.
  • photo_cache – An instance of ImageCache to use to cache images.
  • hash_threshold – The hash threshold between two images. Usually two different photos have a hash difference of 30.
Returns:

True if the photos are identical, else False.
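To make the average hash comparison concrete, here is a minimal pure-Python sketch. It assumes each image has already been downscaled and converted to a flat list of grayscale values, and that two photos are considered identical when the hash difference is strictly below the threshold; both points are assumptions, and the function names are hypothetical.

```python
def average_hash(pixels):
    """Return a list of bits: 1 where a pixel is brighter than the mean."""
    mean = sum(pixels) / len(pixels)
    return [1 if value > mean else 0 for value in pixels]


def hash_difference(hash1, hash2):
    """Hamming distance: number of positions where the bits differ."""
    return sum(bit1 != bit2 for bit1, bit2 in zip(hash1, hash2))


def photos_match(pixels1, pixels2, hash_threshold):
    """Treat two images as identical when their hashes are close enough."""
    difference = hash_difference(average_hash(pixels1), average_hash(pixels2))
    # Strict comparison is an assumption of this sketch.
    return difference < hash_threshold
```

Because each bit only records whether a pixel is above the image's own mean, a uniform brightness shift leaves the hash unchanged, which is why this method tolerates re-encoded copies of the same photo.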

flatisfy.filters.duplicates.deep_detect(flats_list, config)[source]

Deeper detection of duplicates based on any available data.

Parameters:
  • flats_list – A list of flat dicts.
  • config – A config dict.
Returns:

A tuple of the deduplicated list of flat dicts and the list of all the flat objects that should be removed and considered as duplicates (they were already merged).

flatisfy.filters.duplicates.detect(flats_list, key=u'id', merge=True, should_intersect=False)[source]

Detect obvious duplicates within a given list of flats.

Duplicates may be found, as some queries can overlap (in particular, when asked for a given place, websites tend to return housing in nearby locations as well). We need to handle them, either by deleting the duplicates (merge=False) or by merging them together into a single flat object.

Parameters:
  • flats_list – A list of flat dicts.
  • key – The flat dicts key on which the duplicate detection should be done.
  • merge – Whether the found duplicates should be merged or we should only keep one of them.
  • should_intersect – Set to True if the values in the flat dicts are lists and you want to deduplicate on non-empty intersection (typically if they have a common url).
Returns:

A tuple of the deduplicated list of flat dicts and the list of all the flat objects that should be removed and considered as duplicates (they were already merged).
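The key-based detection described above can be sketched as grouping flats by the value of key and keeping one flat per group. This is only an illustration: detect_sketch is a hypothetical name, the merge strategy (filling in missing fields) is an assumption, and should_intersect is left out for brevity.

```python
def detect_sketch(flats_list, key="id", merge=True):
    """Group flats sharing the same value for `key` and keep one per group."""
    groups = {}
    for flat in flats_list:
        groups.setdefault(flat[key], []).append(flat)

    unique_flats, duplicates = [], []
    for matching in groups.values():
        kept = dict(matching[0])
        if merge:
            # Naive merge (assumption): fill in fields missing from the kept flat.
            for other in matching[1:]:
                for field, value in other.items():
                    if kept.get(field) is None:
                        kept[field] = value
        unique_flats.append(kept)
        # Every flat after the first one in a group is reported as a duplicate.
        duplicates.extend(matching[1:])
    return unique_flats, duplicates
```

As in the documented return value, the result is a tuple of the deduplicated list and the list of flats considered duplicates.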

flatisfy.filters.duplicates.find_number_common_photos(flat1_photos, flat2_photos, photo_cache, hash_threshold)[source]

Compute the number of common photos between the two lists of photos for the flats.

Fetch the photos and compare them using the average hash method.

Parameters:
  • flat1_photos – First list of flat photos. Each photo should be a dict with (at least) a url key.
  • flat2_photos – Second list of flat photos. Each photo should be a dict with (at least) a url key.
  • photo_cache – An instance of ImageCache to use to cache images.
  • hash_threshold – The hash threshold between two images.
Returns:

The found number of common photos.

flatisfy.filters.duplicates.get_duplicate_score(flat1, flat2, photo_cache, hash_threshold)[source]

Compute the duplicate score between two flats. The higher the score, the more likely the two flats are duplicates.

Parameters:
  • flat1 – First flat dict.
  • flat2 – Second flat dict.
  • photo_cache – An instance of ImageCache to use to cache images.
  • hash_threshold – The hash threshold between two images.
Returns:

The duplicate score, as an int.

flatisfy.filters.duplicates.get_or_compute_photo_hash(photo, photo_cache)[source]

Get the computed hash from the photo dict or compute it if not found.

Parameters:
  • photo – A photo, as a dict with (at least) a url key.
  • photo_cache – An instance of ImageCache to use to cache images.

flatisfy.filters.duplicates.homogeneize_phone_number(numbers)[source]

Homogenize the phone numbers by stripping any space, dash or dot, as well as the international prefix. Assumes French phone numbers (starting with a zero and 10 characters long).

Parameters:numbers – The phone number string to homogenize (can contain multiple phone numbers).
Returns:The cleaned phone number. None if the number is not valid.
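The cleaning steps can be sketched for a single number as follows. This is a hedged illustration under stated assumptions: homogenize_one_number is a hypothetical helper, only the +33 and 0033 forms of the international prefix are handled, and the handling of strings containing several numbers is not covered.

```python
import re


def homogenize_one_number(number):
    """Return a 10-digit French number starting with 0, or None if invalid."""
    # Strip spaces, dashes and dots.
    cleaned = re.sub(r"[\s.\-]", "", number)
    # Replace the French international prefix (+33 or 0033) with a leading 0.
    cleaned = re.sub(r"^(\+33|0033)", "0", cleaned)
    # Valid only if it is a zero followed by exactly nine digits.
    if re.fullmatch(r"0\d{9}", cleaned):
        return cleaned
    return None
```

Normalizing to a single canonical form means two listings that write the same number differently can still be compared with a plain string equality.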

flatisfy.filters.metadata module

Filtering functions to handle flatisfy-specific metadata.

This includes functions to guess metadata (postal codes, stations) from the actual fetched data.

flatisfy.filters.metadata.compute_travel_times(flats_list, constraint, config)[source]

Compute the travel time between each flat and the points listed in the constraints.

Parameters:
  • flats_list – A list of flat dicts.
  • constraint – The constraint that the flats_list should satisfy.
  • config – A config dict.
Returns:

An updated list of flat dicts with computed travel times.

Note

Requires a Navitia or CityMapper API key in the config.

flatisfy.filters.metadata.fuzzy_match(query, choices, limit=3, threshold=75)[source]

Custom search for the best element in choices matching the query.

Parameters:
  • query – The string to match.
  • choices – The list of strings to match with.
  • limit – The maximum number of items to return. Set to None to return all values above threshold.
  • threshold – The score threshold to use.
Returns:

Tuples of matching items and associated confidence.

Note

This function works by removing any fancy character from the query and choices strings (replacing any non-alphabetic and non-numeric character with a space), converting them to lower case and normalizing them (collapsing multiple spaces, etc.). It also converts any Roman numerals to the decimal system. It then compares the strings and looks for the longest string in choices which is a substring of query. The longest one gets a confidence of 100. The shorter ones get a confidence proportional to their length.

See also

flatisfy.tools.normalize_string
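The matching procedure described in the note can be sketched as below. This is a simplified illustration, not the library's code: normalize and substring_match are hypothetical names, the normalization is ASCII-only (accented letters become spaces, which may differ from flatisfy.tools.normalize_string), scores are truncated to integers, and Roman numeral conversion is omitted.

```python
import re


def normalize(text):
    """Lower-case, replace non-alphanumeric characters with spaces, collapse spaces."""
    text = re.sub(r"[^a-z0-9]", " ", text.lower())
    return " ".join(text.split())


def substring_match(query, choices, limit=3, threshold=75):
    """Score each choice found inside the query relative to the longest one found."""
    normalized_query = normalize(query)
    # Keep only the choices that appear as a substring of the query.
    found = [c for c in choices if normalize(c) and normalize(c) in normalized_query]
    if not found:
        return []
    longest = max(len(normalize(c)) for c in found)
    # The longest match scores 100; shorter ones score proportionally to length.
    scored = [(c, int(100 * len(normalize(c)) / longest)) for c in found]
    scored = [item for item in scored if item[1] >= threshold]
    scored.sort(key=lambda item: item[1], reverse=True)
    # A limit of None returns every match above the threshold.
    return scored[:limit]
```

Under these assumptions, the sketch returns the same results as the two documented examples for this function.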

Example:

>>> fuzzy_match("Paris 14ème", ["Ris", "ris", "Paris 14"], limit=1)
[("Paris 14", 100)]

>>> fuzzy_match(
...     "Saint-Jacques, Denfert-Rochereau (Colonel Rol-Tanguy), "
...     "Mouton-Duvernet",
...     ["saint-jacques", "denfert rochereau", "duvernet", "toto"],
...     limit=4
... )
[('denfert rochereau', 100), ('saint-jacques', 76)]

flatisfy.filters.metadata.guess_postal_code(flats_list, constraint, config, distance_threshold=20000)[source]

Try to guess the postal code from the location of the flats.

Parameters:
  • flats_list – A list of flat dicts.
  • constraint – The constraint that the flats_list should satisfy.
  • config – A config dict.
  • distance_threshold – Maximum distance in meters between the constraint postal codes (from config) and the one found by this function, to avoid bad fuzzy matching. Can be None to disable thresholding.
Returns:

An updated list of flat dicts with guessed postal code.
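The role of distance_threshold can be illustrated with a great-circle distance check. This sketch assumes postal codes can be mapped to (latitude, longitude) pairs and uses the haversine formula; whether guess_postal_code computes distances this way is an assumption, and both function names are hypothetical.

```python
import math


def haversine_distance(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (latitude, longitude) points."""
    earth_radius = 6371000  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lng2 - lng1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * earth_radius * math.asin(math.sqrt(a))


def within_threshold(candidate, references, distance_threshold=20000):
    """Accept a candidate point if it lies close enough to any reference point."""
    if distance_threshold is None:
        return True  # thresholding disabled
    return any(
        haversine_distance(*candidate, *reference) <= distance_threshold
        for reference in references
    )
```

A guessed postal code far from every postal code in the constraint is most likely the result of a bad fuzzy match, which is what such a check is meant to reject.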

flatisfy.filters.metadata.guess_stations(flats_list, constraint, config)[source]

Try to match the station field with a list of available stations nearby.

Parameters:
  • flats_list – A list of flat dicts.
  • constraint – The constraint that the flats_list should satisfy.
  • config – A config dict.
Returns:

An updated list of flat dicts with guessed nearby stations.

flatisfy.filters.metadata.init(flats_list, constraint)[source]

Create a flatisfy key containing a dict of metadata fetched by flatisfy for each flat in the list. Also perform some basic transform on flat objects to prepare for the metadata fetching.

Parameters:
  • flats_list – A list of flat dicts.
  • constraint – The constraint that the flats_list should satisfy.
Returns:

The updated list.

Module contents

This module contains all the filtering functions. It exposes the first_pass and second_pass functions, which are sets of filters applied during the first pass and the second pass respectively.

flatisfy.filters.refine_with_details_criteria(flats_list, constraint)[source]

Filter a list of flats according to the criteria which require the full details to be fetched. These include minimum number of photos and terms that should appear in description.

Note

This has to be done in a separate function and not with the other criteria, as photos and full description are only fetched in the second pass.

Parameters:
  • flats_list – A list of flat dicts to filter.
  • constraint – The constraint that the flats_list should satisfy.
Returns:

A tuple of flats to keep and flats to delete.
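The keep/delete split on details criteria can be sketched as below. The flat keys "photos" and "text", the explicit min_photos and required_terms arguments (instead of a constraint object), and the function name are all assumptions made for this illustration.

```python
def refine_with_details_sketch(flats_list, min_photos, required_terms):
    """Split flats into (kept, deleted) using photo count and description terms."""
    kept, deleted = [], []
    for flat in flats_list:
        # "text" and "photos" are hypothetical keys for the description and photos.
        description = flat.get("text", "").lower()
        has_enough_photos = len(flat.get("photos", [])) >= min_photos
        has_all_terms = all(term.lower() in description for term in required_terms)
        if has_enough_photos and has_all_terms:
            kept.append(flat)
        else:
            deleted.append(flat)
    return kept, deleted
```

Returning both lists, rather than only the kept flats, mirrors the documented return value and lets the caller record why a flat was discarded.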

flatisfy.filters.refine_with_housing_criteria(flats_list, constraint)[source]

Filter a list of flats according to criteria.

Housing listing websites tend to return broader results than what was actually asked for. We should therefore filter the list to match the user criteria, and avoid exposing unwanted flats.

Parameters:
  • flats_list – A list of flat dicts to filter.
  • constraint – The constraint that the flats_list should satisfy.
Returns:

A tuple of flats to keep and flats to delete.