flatisfy.filters package

Submodules

flatisfy.filters.cache module

Caching function for pictures.

class flatisfy.filters.cache.ImageCache(max_items=200, storage_dir=None)

Bases: flatisfy.filters.cache.MemoryCache

A cache for images, stored in memory.

static compute_filename(url)

Compute filename (hash of the URL) for the cached image.

Parameters:	url – The URL of the image.
Returns:	The filename, with its extension.
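
As a rough illustration of the "hash of the URL" naming scheme described above, here is a minimal sketch. The function name `sketch_compute_filename`, the choice of SHA-1, and the way the extension is taken from the URL path are all assumptions made for this example; the actual `compute_filename` may differ.

```python
import hashlib
import os
from urllib.parse import urlparse


def sketch_compute_filename(url):
    """Return a filename built from a hash of `url`, keeping its extension."""
    # Hash the full URL so that distinct URLs get distinct names.
    # SHA-1 is an assumption of this sketch, not a documented choice.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # Take the extension from the URL path only, ignoring any query string.
    extension = os.path.splitext(urlparse(url).path)[1]
    return digest + extension


example_name = sketch_compute_filename("https://example.com/photos/1.jpg?size=large")
```

Hashing the URL gives a stable, filesystem-safe name, so the same URL always maps to the same cached file.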

on_miss(url)

Helper that retrieves a photo when it is not already cached.

class flatisfy.filters.cache.MemoryCache

Bases: object

A cache in memory.

get(key)

Get an element from the cache, calling on_miss if the item is not already cached.

Parameters:	key – Key of the element to retrieve.
Returns:	Requested element.

hit_rate()

Get the hit rate, that is, the share of requests for an item that was already in the cache.

Returns:	The hit rate, as a percentage.

miss_rate()

Get the miss rate, that is, the share of requests for an item that was not already in the cache.

Returns:	The miss rate, as a percentage.

static on_miss(key)

Method called whenever an object is requested from the cache but is not already cached. Typically, it makes an HTTP request to fetch the object.

Parameters:	key – Key of the requested object.
Returns:	The object content.

total()

Get the total number of calls to the cache, counting both hits and misses (which fetch the item with on_miss).

Returns:	The total number of accesses.
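
The interplay between get, on_miss and the hit/miss counters can be sketched as follows. This is a minimal stand-in written for illustration, assuming a plain dict as storage and a zero rate when the cache has never been queried; the class names `SketchMemoryCache` and `UpperCaseCache` are hypothetical and the real `MemoryCache` may be implemented differently.

```python
class SketchMemoryCache:
    """Illustrative in-memory cache (not the flatisfy implementation)."""

    def __init__(self):
        self.map = {}    # cached objects, indexed by key
        self.hits = 0    # requests served from the cache
        self.misses = 0  # requests that had to call on_miss

    @staticmethod
    def on_miss(key):
        # Subclasses override this to fetch the object (e.g. over HTTP).
        raise NotImplementedError

    def get(self, key):
        if key in self.map:
            self.hits += 1
        else:
            self.misses += 1
            self.map[key] = self.on_miss(key)
        return self.map[key]

    def total(self):
        return self.hits + self.misses

    def hit_rate(self):
        total = self.total()
        # Assumption: report 0 when the cache has not been queried yet.
        return 100.0 * self.hits / total if total else 0.0

    def miss_rate(self):
        total = self.total()
        return 100.0 * self.misses / total if total else 0.0


class UpperCaseCache(SketchMemoryCache):
    """Toy subclass whose "fetch" just upper-cases the key."""

    @staticmethod
    def on_miss(key):
        return key.upper()


cache = UpperCaseCache()
values = [cache.get(key) for key in ["a", "a", "b", "a"]]
```

Here only the first request for each key reaches on_miss, so two of the four calls count as hits.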

flatisfy.filters.duplicates module

Filtering functions to detect and merge duplicates.

flatisfy.filters.duplicates.compare_photos(photo1, photo2, photo_cache, hash_threshold)

Compare two photos using the average hash method.

Parameters:	photo1 – First photo URL.
	photo2 – Second photo URL.
	photo_cache – An instance of `ImageCache` used to cache images.
	hash_threshold – The hash threshold between two images. Usually, two different photos have a hash difference of 30.
Returns:	`True` if the photos are identical, else `False`.
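
To make the average hash comparison more concrete, here is a small pure-Python sketch. It assumes images are already available as small grids of grayscale values (real implementations usually shrink the image first, often to 8x8) and that two photos match when the Hamming distance between their hashes is at most the threshold. The helper names and the `<=` comparison are assumptions of this example, not the library's code.

```python
def average_hash(pixels):
    """Return the average hash of a grayscale image as a list of bits.

    `pixels` is a list of rows of brightness values.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    # Each bit records whether a pixel is brighter than the image mean.
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash1, hash2):
    """Count the positions at which two equal-length hashes differ."""
    return sum(bit1 != bit2 for bit1, bit2 in zip(hash1, hash2))


def sketch_compare(pixels1, pixels2, hash_threshold):
    """Return True when the two images hash closely enough (assumed rule)."""
    distance = hamming_distance(average_hash(pixels1), average_hash(pixels2))
    return distance <= hash_threshold


image = [[10, 200], [220, 30]]
brighter = [[20, 210], [230, 40]]   # same picture, slightly brighter
inverted = [[200, 10], [30, 220]]   # light and dark areas swapped
```

Because each bit is relative to the image's own mean, a uniform brightness change leaves the hash unchanged, while a different picture flips many bits.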

flatisfy.filters.duplicates.deep_detect(flats_list, config)

Deeper detection of duplicates based on any available data.

Parameters:	flats_list – A list of flat dicts.
	config – A config dict.
Returns:	A tuple of the deduplicated list of flat dicts and the list of all the flat objects that should be removed and considered as duplicates (they were already merged).

flatisfy.filters.duplicates.detect(flats_list, key='id', merge=True, should_intersect=False)

Detect obvious duplicates within a given list of flats.

Duplicates may be found because some queries overlap (in particular, when asked for a given place, websites tend to return housing in nearby locations as well). We need to handle them, either by deleting the duplicates (merge=False) or by merging them into a single flat object.

Parameters:	flats_list – A list of flat dicts.
	key – The flat dict key on which the duplicate detection should be done.
	merge – Whether the duplicates found should be merged, or whether only one of them should be kept.
	should_intersect – Set to `True` if the values in the flat dicts are lists and you want to deduplicate on non-empty intersection (typically if they have a common URL).
Returns:	A tuple of the deduplicated list of flat dicts and the list of all the flat objects that should be removed and considered as duplicates (they were already merged).
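
The key-based detection described above can be sketched as follows. This is only an illustration: the function name `sketch_detect` is hypothetical, the merge rule (fill in fields the kept flat lacks) is an assumption since the actual merge strategy is not documented here, and the should_intersect mode is not covered.

```python
def sketch_detect(flats_list, key="id", merge=True):
    """Return (unique_flats, duplicates) for flats sharing the same `key`."""
    kept_by_value = {}  # first flat seen for each value of `key`
    unique_flats = []
    duplicates = []
    for flat in flats_list:
        value = flat.get(key)
        kept = kept_by_value.get(value) if value is not None else None
        if kept is None:
            # First flat with this value (or no value at all): keep it.
            unique_flats.append(flat)
            if value is not None:
                kept_by_value[value] = flat
            continue
        if merge:
            # Assumed merge rule: only fill in fields the kept flat lacks.
            # Note that this updates the kept flat dict in place.
            for field, field_value in flat.items():
                kept.setdefault(field, field_value)
        duplicates.append(flat)
    return unique_flats, duplicates


merged, merged_dups = sketch_detect(
    [{"id": 1, "area": 30}, {"id": 1, "rooms": 2}, {"id": 2}]
)
kept_only, kept_dups = sketch_detect(
    [{"id": 1, "area": 30}, {"id": 1, "rooms": 2}, {"id": 2}], merge=False
)
```

With merge=True the surviving flat gathers the fields of its duplicates; with merge=False the duplicates are simply set aside.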

flatisfy.filters.duplicates.find_number_common_photos(flat1_photos, flat2_photos, photo_cache, hash_threshold)

Compute the number of common photos between the two lists of photos for the flats.

Fetch the photos and compare them with the average hash method.

Parameters:	flat1_photos – First list of flat photos. Each photo should be a `dict` with (at least) a `url` key.
	flat2_photos – Second list of flat photos. Each photo should be a `dict` with (at least) a `url` key.
	photo_cache – An instance of `ImageCache` used to cache images.
	hash_threshold – The hash threshold between two images.
Returns:	The found number of common photos.

flatisfy.filters.duplicates.get_duplicate_score(flat1, flat2, photo_cache, hash_threshold)

Compute the duplicate score between two flats. The higher the score, the more likely the two flats are to be duplicates.

Parameters:	flat1 – First flat dict.
	flat2 – Second flat dict.
	photo_cache – An instance of `ImageCache` used to cache images.
	hash_threshold – The hash threshold between two images.
Returns:	The duplicate score as `int`.

flatisfy.filters.duplicates.get_or_compute_photo_hash(photo, photo_cache)

Get the computed hash from the photo dict or compute it if not found.

Parameters:	photo – A photo, as a `dict` with (at least) a `url` key.
	photo_cache – An instance of `ImageCache` used to cache images.

flatisfy.filters.duplicates.homogeneize_phone_number(numbers)

Homogenize the phone numbers by stripping any space, dash or dot, as well as the international prefix. Assumes French phone numbers (starting with a zero and 10 characters long).

Parameters:	numbers – The phone number string to homogenize (can contain multiple phone numbers).
Returns:	The cleaned phone number. `None` if the number is not valid.
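
The cleaning steps above could look roughly like the following sketch. It assumes that multiple numbers are separated by commas and that the international prefix is written `+33` or `0033`; neither detail is stated in this documentation, and the function name `sketch_homogenize_phone_number` is hypothetical.

```python
import re


def sketch_homogenize_phone_number(numbers):
    """Return cleaned French phone numbers, or None if none is valid."""
    if not numbers:
        return None
    cleaned = []
    # Assumption: several numbers in one string are comma-separated.
    for number in numbers.split(","):
        # Strip spaces, dots and dashes.
        number = re.sub(r"[ .\-]", "", number)
        # Replace the French international prefix with the leading zero.
        number = re.sub(r"^(\+33|0033)", "0", number)
        # Keep only 10-digit numbers starting with a zero.
        if len(number) == 10 and number.startswith("0") and number.isdigit():
            cleaned.append(number)
    return ", ".join(cleaned) if cleaned else None
```

For instance, `"+33 1 23 45 67 89"` and `"01.23.45.67.89"` both reduce to the same string, which is what makes phone numbers usable as a duplicate signal.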

flatisfy.filters.metadata module

Filtering functions to handle flatisfy-specific metadata.

This includes functions to guess metadata (postal codes, stations) from the actual fetched data.

flatisfy.filters.metadata.compute_travel_times(flats_list, constraint, config)

Compute the travel time between each flat and the points listed in the constraints.

Parameters:	flats_list – A list of flat dicts.
	constraint – The constraint that the `flats_list` should satisfy.
	config – A config dict.
Returns:	An updated list of flat dicts with computed travel times.

Note

Requires a Navitia or CityMapper API key in the config.

flatisfy.filters.metadata.fuzzy_match(query, choices, limit=3, threshold=75)

Custom search for the best element in choices matching the query.

Parameters:	query – The string to match.
	choices – The list of strings to match with.
	limit – The maximum number of items to return. Set to `None` to return all values above threshold.
	threshold – The score threshold to use.
Returns:	Tuples of matching items and associated confidence.

Note

This function works by removing any fancy characters from the query and choices strings (replacing any non-alphabetic and non-numeric character with a space), converting them to lower case and normalizing them (collapsing multiple spaces, etc.). It also converts any Roman numerals to the decimal system. It then compares the strings and looks for the longest string in choices which is a substring of query. The longest one gets a confidence of 100. The shorter ones get a confidence proportional to their length.
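
The matching strategy in the note can be approximated with the sketch below. It leaves out the Roman numeral conversion, uses integer division for the proportional confidence, and keeps scores greater than or equal to the threshold; these are simplifications and assumptions of this example, and `normalize` and `sketch_fuzzy_match` are hypothetical names.

```python
import re


def normalize(text):
    """Lower-case `text` and collapse non-alphanumeric runs into one space."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()


def sketch_fuzzy_match(query, choices, limit=3, threshold=75):
    """Return (choice, confidence) tuples for choices found in `query`."""
    normalized_query = normalize(query)
    # Keep the choices whose normalized form is a substring of the query.
    lengths = {}
    for choice in choices:
        normalized = normalize(choice)
        if normalized and normalized in normalized_query:
            lengths[choice] = len(normalized)
    if not lengths:
        return []
    longest = max(lengths.values())
    # The longest match scores 100, shorter ones proportionally less.
    scored = [(choice, 100 * length // longest) for choice, length in lengths.items()]
    scored = [item for item in scored if item[1] >= threshold]
    scored.sort(key=lambda item: item[1], reverse=True)
    # Slicing with limit=None returns every remaining match.
    return scored[:limit]


query = "Appartement proche Gare de Lyon, Paris"
stations = ["Gare de Lyon", "Lyon", "Gare du Nord"]
strict = sketch_fuzzy_match(query, stations)
loose = sketch_fuzzy_match(query, stations, limit=None, threshold=0)
```

With the default threshold only the longest match survives; lowering the threshold also returns the shorter substring with a lower confidence.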

Module contents

This module contains all the filtering functions. It exposes the first_pass and second_pass functions, each of which applies a set of filters during the corresponding pass.

flatisfy.filters.refine_with_details_criteria(flats_list, constraint)

Filter a list of flats according to the criteria which require the full details to be fetched. These include the minimum number of photos and terms that should appear in the description.

Note

This has to be done in a separate function and not with the other criteria, as photos and full descriptions are only fetched in the second pass.

Parameters:	flats_list – A list of flat dicts to filter.
	constraint – The constraint that the `flats_list` should satisfy.
Returns:	A tuple of flats to keep and flats to delete.
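
As an illustration of a details-based filter returning a (keep, delete) tuple, consider the sketch below. The constraint keys `minimum_nb_photos` and `description_should_contain`, the flat fields `photos` and `text`, and the function name are all assumptions made for this example and may not match the actual configuration schema.

```python
def sketch_refine_with_details(flats_list, constraint):
    """Split flats into (keep, delete) using details-only criteria."""
    # Hypothetical constraint keys, chosen for this sketch only.
    min_photos = constraint.get("minimum_nb_photos")
    terms = constraint.get("description_should_contain", [])

    keep, delete = [], []
    for flat in flats_list:
        # "photos" and "text" are assumed field names for the flat dict.
        photos = flat.get("photos") or []
        description = (flat.get("text") or "").lower()
        enough_photos = min_photos is None or len(photos) >= min_photos
        # Every required term must appear, ignoring case.
        has_terms = all(term.lower() in description for term in terms)
        if enough_photos and has_terms:
            keep.append(flat)
        else:
            delete.append(flat)
    return keep, delete


flats = [
    {"id": "a", "photos": [{"url": "u1"}, {"url": "u2"}], "text": "Bright flat with balcony"},
    {"id": "b", "photos": [], "text": "Balcony and cellar"},
    {"id": "c", "photos": [{"url": "u3"}, {"url": "u4"}], "text": "Ground floor studio"},
]
constraint = {"minimum_nb_photos": 2, "description_should_contain": ["balcony"]}
keep, delete = sketch_refine_with_details(flats, constraint)
```

Returning both lists, rather than only the flats to keep, lets the caller record why a flat was discarded instead of silently dropping it.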

flatisfy.filters.refine_with_housing_criteria(flats_list, constraint)

Filter a list of flats according to criteria.

Housing listing websites tend to return broader results than what was actually asked for. We should therefore filter the list to match the user criteria and avoid exposing unwanted flats.

Parameters:	flats_list – A list of flat dicts to filter.
	constraint – The constraint that the `flats_list` should satisfy.
Returns:	A tuple of flats to keep and flats to delete.
