Pages

Friday, July 5, 2019

Diff Between Two Strings task

I recently came across an interesting task on the Internet, and I'd like to share my solution.

Task Description
Given two strings of uppercase letters source and target, list (in string form) a sequence of edits to convert from source to target that uses the least edits possible.
For example, with strings source = "ABCDEFG" and target = "ABDFFGH", we might return: ["A", "B", "-C", "D", "-E", "F", "+F", "G", "+H"]
More formally, for each character C in source, we will either write the token C, which does not count as an edit; or write the token -C, which counts as an edit.
Additionally, between any tokens that we write, we may write +D, where D is any letter, which counts as an edit.
At the end, when reading the tokens from left to right, and not including tokens prefixed with a minus-sign, the letters should spell out target (when ignoring plus-signs.)
In the example, the answer A B -C D -E F +F G +H has 4 edits in total (the minimum possible), and, ignoring the subtraction tokens, spells out A, B, D, F, +F, G, +H, which represents the string target.
If there are multiple answers, use the answer that favors removing from the source first.
Constraints:
  • [time limit] 5000ms
  • [input] string source
    • 2 ≤ source.length ≤ 12
  • [input] string target
    • 2 ≤ target.length ≤ 12
  • [output] array.string


Solution: Dynamic Programming + Construction

Our solution will be built in two steps: first we’ll find the editing distance, and then use it to construct the answer.

Finding the Editing Distance

First, let dp(i, j) = the minimum number of edits required for the problem with strings source[i:] and target[j:]. We will perform what is known as “top-down dynamic programming.” What this means is: Since dp(i, j) may depend only on the values of source[i], target[j], dp(i+1, j), dp(i, j+1), and dp(i+1, j+1), we can write a function dp(i, j) that returns the desired answer recursively in terms of these values.
Additionally, every time we calculate dp(i, j), we will remember the result and record it in memo[i][j]. That ensures that we only perform the effort of calculating dp(i, j) once: afterwards, we simply return the cached result.
In general, when source[i] == target[j], then dp(i, j) = dp(i+1, j+1), because we simply write source[i]. When source[i] != target[j], then we either edited source[i] (subtraction) and have the problem of transforming source[i+1:] to target[j:] left over (which has answer dp(i+1, j)), or we edited target[j] (addition) and have the problem of transforming source[i:] to target[j+1:] left over (which has answer dp(i, j+1)).
Let’s look at the first part of our solution:
function diffBetweenTwoStrings(source, target):
    # memo[i][j] will be our cached answer to dp(i, j)
    memo = new Array(source.length, target.length)

    # dp(i, j) is the minimum edits to transform
    # string source[i:] into string target[j:].
    function dp(i, j):
        # If one of the strings is exhausted, we know the answer
        # already: add every remaining letter of target, or
        # delete every remaining letter of source.
        if i == source.length:
            return target.length - j
        else if j == target.length:
            return source.length - i

        # Otherwise, if we don't have a cached answer,
        # then find one.
        else if memo[i][j] == null:
            if source[i] == target[j]:
                memo[i][j] = dp(i+1, j+1)
            else:
                memo[i][j] = 1 + min(dp(i+1, j), dp(i, j+1))

        return memo[i][j]

Constructing the Answer

Now we need to build our answer. We should iterate through the source and target, and in each step decide whether we need to delete or add another letter.
To figure this out, we leverage the information available in dp(i, j). When we have a decision to make between subtracting source[i] or adding target[j], we use our knowledge of the minimum edit distances dp(i+1, j) and dp(i, j+1) to make the decision.
function diffBetweenTwoStrings(source, target):
    ans = []
    i = 0
    j = 0
    # We are always considering strings source[i:] and target[j:]
    while i < source.length AND j < target.length:
        if source[i] == target[j]:
            # Write the string with no edits
            ans.push(source[i])
            i += 1
            j += 1
        else:
            # We have to decide whether to subtract source[i],
            # or add target[j].
            if dp(i+1, j) <= dp(i, j+1):
                ans.push('-' + source[i])
                i += 1
            else:
                ans.push('+' + target[j])
                j += 1

    # Delete any remaining characters of source...
    while i < source.length:
        ans.push('-' + source[i])
        i += 1

    # ...then add any remaining characters of target.
    while j < target.length:
        ans.push('+' + target[j])
        j += 1

    return ans
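For readers who want to run this, here is a minimal Python sketch of the same approach (the memoized dp plus the construction pass). The function and variable names are mine; only the algorithm follows the pseudocode above.

from functools import lru_cache

def diff_between_two_strings(source, target):
    # dp(i, j) = minimum edits to transform source[i:] into target[j:].
    @lru_cache(maxsize=None)
    def dp(i, j):
        if i == len(source):
            return len(target) - j              # add the rest of target
        if j == len(target):
            return len(source) - i              # delete the rest of source
        if source[i] == target[j]:
            return dp(i + 1, j + 1)
        return 1 + min(dp(i + 1, j), dp(i, j + 1))

    ans = []
    i = j = 0
    while i < len(source) and j < len(target):
        if source[i] == target[j]:
            ans.append(source[i])
            i += 1
            j += 1
        elif dp(i + 1, j) <= dp(i, j + 1):      # favor removing from source
            ans.append('-' + source[i])
            i += 1
        else:
            ans.append('+' + target[j])
            j += 1
    while i < len(source):                      # delete leftover source letters
        ans.append('-' + source[i])
        i += 1
    while j < len(target):                      # add leftover target letters
        ans.append('+' + target[j])
        j += 1
    return ans

# diff_between_two_strings("ABCDEFG", "ABDFFGH")
# -> ['A', 'B', '-C', 'D', '-E', 'F', '+F', 'G', '+H']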
We could have also used a “bottom-up” dynamic program for calculating the initial edit-distance function, as follows:
function diffBetweenTwoStrings(source, target):
    dp = new Array(source.length + 1, target.length + 1)

    # Base cases: if target is exhausted, every remaining character
    # of source must be deleted; if source is exhausted, every
    # remaining character of target must be added.
    for i from 0 to source.length:
        dp[i][target.length] = source.length - i
    for j from 0 to target.length:
        dp[source.length][j] = target.length - j

    for i from source.length - 1 to 0:
        for j from target.length - 1 to 0:
            if source[i] == target[j]:
                dp[i][j] = dp[i+1][j+1]
            else:
                dp[i][j] = 1 + min(dp[i+1][j], dp[i][j+1])
Instead of caching our results, we compute them in an order that guarantees the components used in computing our answer are correct.
Here, in the main “for i from source.length - 1 to 0, for j from target.length - 1 to 0” loop, we iterate i and j in an order that guarantees that dp[i+1][j], dp[i][j+1], and dp[i+1][j+1] have already been computed.
The rest of the solution proceeds similarly.

Time Complexity: O(S.length * T.length), from our construction of dp.
Space Complexity: O(S.length * T.length), the space taken by dp.

Friday, June 21, 2019

Introducing Microsoft.FeatureManagement

Andrew Lock has published a series of posts: Adding feature flags to an ASP.NET Core app.

The idea of feature flags is described in detail on Martin Fowler's portal.

Microsoft.FeatureManagement is built on top of the Microsoft.Extensions.Configuration configuration system used in ASP.NET Core (but which can also be used in any .NET Standard app). It provides a centralised but extensible way of adding feature flags to your system. This lets you roll out new features to a subset of users, limit the availability of a feature by time, or perform A/B tests, for example.

Microsoft.FeatureManagement is currently in preview, so some details may change when it's properly released.

Monday, November 26, 2018

MongoDb Atlas in Azure vs Azure CosmosDb

Intro

At the beginning of development on one of my projects, we planned to use Azure CosmosDB for MongoDB database hosting. We chose Cosmos mainly because it is easy to set up, offers good features like autoscaling, autoindexing, and an SLA, and we could work with it as a black box without deep expertise in that technology. But during the development process we found some disadvantages. The two main limitations for us:
  1. CosmosDB does not support a MongoDB API version greater than 3.4, so we could not use many important MongoDB features like transactions, schema validation, etc.
  2. There is no proper CosmosDB emulator for a local dev environment.

Description

MongoDB Atlas is a fully-managed cloud database developed by the same people that build MongoDB. Atlas handles all the complexity of deploying, managing, and healing your deployments on the cloud service provider of your choice (AWS, Azure, and GCP). Follow the links below to get started.

Main advantages and disadvantages

Advantages:
  • Full MongoDB API version support.
  • We can confidently work with a local MongoDB database in dev environments.

Disadvantages:
  • We have to choose between a sharded cluster and a replica set. We also have to deal with Global Clusters, Global Write Zones and Zone Mapping.
  • Either way, we need more expertise in MongoDB.

Comparison with CosmosDB

MongoDB Connector for BI

The MongoDB Connector for Business Intelligence (BI) allows users to create queries with SQL and visualize, graph, and report on their MongoDB Enterprise data using existing relational business intelligence tools such as Tableau, MicroStrategy, and Qlik.
The MongoDB Connector for BI acts as a layer that translates queries and data between a mongod or mongos instance and your reporting tool. The BI Connector stores no data, and purely serves to bridge your MongoDB cluster with business intelligence tools.


  • User Authentication and Authorization with LDAP - Connect to your clusters via LDAP / Active Directory
  • Connect via BI Connector for Atlas
  • Database Auditing - Audit activity on your database
  • MongoDB Stitch - Create a serverless application
  • MongoDB Charts - Visualize data in MongoDB

Cluster Tier selection

Select your preferred cluster instance size. The selected instance size dictates the memory, storage, and IOPS specification for each data-bearing server [1] in the cluster.
Atlas categorizes the instance sizes into tiers as follows:

Shared Clusters

Sandbox replica set clusters for getting started with MongoDB. These instances deploy to a shared environment with access to a subset of Atlas features and functionality. For complete documentation on shared cluster limits and restrictions, see Atlas M0 (Free Tier), M2, and M5 Limitations.

Dedicated Development Clusters

Instances that support development environments and low-traffic applications.
These instances support replica set deployments only, but otherwise, provide full access to Atlas features and functionality.

Dedicated Production Clusters

Instances that support production environments with high traffic applications and large datasets.
These instances support replica set and sharded cluster deployments with full access to Atlas features and functionality.
The following table highlights key differences between an M0 Free Tier cluster, an M2 or M5 shared starter cluster, and an M10+ dedicated cluster.


Atlas M0 (Free Tier), M2, and M5 Limitations

Maximum of 100 operations per second allowed for M0 Free Tier and M2/M5 shared clusters.
M0 Free Tier and M2/M5 shared clusters are allowed a maximum of 100 connections.
M0/M2/M5 clusters limit the total data transferred into or out of the cluster as follows:
  • M0: 10 GB per week
  • M2: 20 GB per week
  • M5: 50 GB per week
Atlas throttles the network speed of clusters which exceed the posted limits.
M0 Free Tier and M2/M5 shared clusters have a maximum of 100 databases and 500 collections total.

Global Clusters

Atlas Global Clusters use a highly curated implementation of sharded cluster zones to support location-aware read and write operations for globally distributed application instances and clients. Global Clusters support deployment patterns such as:
  • Low-latency read and write operations for globally distributed clients.
  • Uptime protection during partial or full regional outages.
  • Location-aware data storage in specific geographic regions.

How does MongoDB Atlas deliver high availability?

Billing

Deploying clusters onto Microsoft Azure

All MongoDB docs: https://docs.mongodb.com/
MongoDB Connector for BI docs: https://docs.mongodb.com/bi-connector/current/

Tuesday, August 28, 2018

How to locally test SignalR in an ASP.NET Core app with different connection issues

For the examples, I used an ASP.NET Core app with SignalR.

How to limit speed between localhost apps using NetLimiter 4

Can be downloaded at https://www.netlimiter.com/

  1. Run the application from Visual Studio and select Google Chrome (or the browser you use) in NetLimiter.

  2. Right-click the selected row and choose the connection speed you want. Then enable the incoming/outgoing limit rule by selecting the checkbox on that row.
My tests of SignalR apps showed that the WebSocket connection works perfectly at a speed of 5 KB/s. It even worked for me at a very slow speed (0.5 KB/s), with some delay.
To view WebSocket frames, use Google Chrome Developer Tools (F12), select the Network tab, and filter by WS.


How to test network disconnection between machines:
You can use a virtual machine to test this case. I prefer using the Hyper-V built into Windows 10 (available in Windows 10 Professional and higher).
  1. Hyper-V setup documentation: https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/quick-start/enable-hyper-v
  2. Install Windows (or another OS) from an OS image via Hyper-V Manager.
  3. Run your app on the host machine with dotnet run --urls http://0.0.0.0:5000 to allow external connections to Kestrel.
  4. Set up the Windows firewall to allow external connections: netsh advfirewall firewall add rule name="Http Port 5000" dir=in action=allow protocol=TCP localport=5000
  5. Run a browser in your virtual machine and connect to your host machine by IP and the specified port (5000 in my example).

Tuesday, January 30, 2018

Introduction to MongoDB Aggregation Pipeline

The main goal of this document is to describe the most commonly used commands of the aggregation pipeline and also to give some recommendations for implementing aggregation requests. There is also a sample solution for a C# environment at the end of the document.

Quick Reference

$match

Filters the documents to pass only the documents that match the specified condition(s) to the next pipeline stage.
{ $match: { <query> } }

$project

Passes along the documents with the requested fields to the next stage in the pipeline. The specified fields can be existing fields from the input documents or newly computed fields.
{ $project: { <specification(s)> } }
Specifications can be the following:
  • <field>: <1 or true> - Specifies the inclusion of a field.
  • _id: <0 or false> - Specifies the suppression of the _id field.
  • <field>: <expression> - Adds a new field or resets the value of an existing field.
  • <field>: <0 or false> - Specifies the exclusion of a field.

$sort

Sorts all input documents and returns them to the pipeline in sorted order.
{ $sort: { <field1>: <sort order>, <field2>: <sort order> ... } }
Sort order can be the following values:
  • 1 to specify ascending order.
  • -1 to specify descending order.

$lookup

Performs a left outer join to an unsharded collection in the same database to filter in documents from the “joined” collection for processing. To each input document, the $lookup stage adds a new array field whose elements are the matching documents from the “joined” collection. The $lookup stage passes these reshaped documents to the next stage.
{
    $lookup:
    {
        from: <collection to join>,
        localField: <field from the input documents>,
        foreignField: <field from the documents of the "from" collection>,
        as: <output array field>
    }
}
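To make the quick reference concrete, here is a small pymongo sketch that chains these stages. The shop database, the orders and users collections, and all field names are made up for the example.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local instance
db = client["shop"]                                 # hypothetical database

pipeline = [
    # $match: keep only paid orders.
    {"$match": {"status": "paid"}},
    # $project: keep selected fields and add a computed one.
    {"$project": {"_id": 0, "userId": 1, "total": 1,
                  "isLarge": {"$gte": ["$total", 100]}}},
    # $sort: largest totals first.
    {"$sort": {"total": -1}},
    # $lookup: join each order with its user document.
    {"$lookup": {"from": "users",
                 "localField": "userId",
                 "foreignField": "_id",
                 "as": "user"}},
]

for doc in db["orders"].aggregate(pipeline):
    print(doc)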

Aggregation Pipeline Optimization

All of these optimizations aim to minimize the amount of data sent between pipeline stages. They are performed automatically by the MongoDB engine, but it is probably still a good idea to write pipelines that need as little of this optimization as possible, so that the DB engine can work a bit faster.
The optimizations are done in two phases: sequence optimization and coalescence optimization. As a result, long chains of aggregation stages can sometimes be transformed into fewer stages that require less memory.

Pipeline Sequence Optimization

$project or $addFields + $match

If $match follows $project or $addFields, then the parts of the $match stage that do not depend on fields computed in the projection stage are moved before the projection stage.

$sort + $match

In this case $match is moved before $sort to minimize the number of items to sort.

$redact + $match

If $redact comes before $match, then sometimes a portion of the $match statement can be moved before $redact to limit the number of documents aggregated.

$skip + $limit

During optimization $limit is moved before $skip, and the $limit value is increased by the $skip amount.

$project + $skip or $limit

Obviously, in this case $skip or $limit goes before $project to limit the number of documents to be projected.
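In practice, the same idea can simply be applied by hand: write $match (and $skip/$limit) as early as the logic allows, so later stages see fewer documents. A small sketch with hypothetical field names:

# Less optimal order: project everything, then filter and sort.
pipeline_slow = [
    {"$project": {"name": 1, "country": 1, "ordersCount": 1}},
    {"$match": {"country": "DE"}},
    {"$sort": {"ordersCount": -1}},
]

# Equivalent result, but the filter runs first, so $project and $sort
# only have to process the matching documents.
pipeline_fast = [
    {"$match": {"country": "DE"}},
    {"$sort": {"ordersCount": -1}},
    {"$project": {"name": 1, "country": 1, "ordersCount": 1}},
]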

Pipeline Coalescence Optimization

When possible, the optimization phase coalesces a pipeline stage into its predecessor. Generally, coalescence occurs after any sequence reordering optimization.

$sort + $limit

When a $sort immediately precedes a $limit, the optimizer can coalesce the $limit into the $sort. This allows the sort operation to only maintain the top n results as it progresses, where n is the specified limit, and MongoDB only needs to store n items in memory.

$limit + $limit

When a $limit immediately follows another $limit, the two stages can coalesce into a single $limit where the limit amount is the smaller of the two initial limit amounts.

$skip + $skip

When $skip immediately follows another $skip, the two stages can coalesce into a single $skip where the skip amount is the sum of the two initial skip amounts.

$match + $match

When a $match immediately follows another $match, the two stages can coalesce into a single $match combining the conditions with an $and.

$lookup + $unwind

When a $unwind immediately follows a $lookup, and the $unwind operates on the as field of the $lookup, the optimizer can coalesce the $unwind into the $lookup stage. This avoids creating large intermediate documents.
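Below is a sketch of pipeline shapes that these coalescence rules apply to (collections and fields are hypothetical): the $unwind directly follows the $lookup and operates on its as field, and $limit directly follows $sort.

# $lookup + $unwind on the "as" field: the optimizer can merge them,
# so the joined array never materializes as one large document.
lookup_unwind_pipeline = [
    {"$lookup": {"from": "orders",
                 "localField": "_id",
                 "foreignField": "userId",
                 "as": "orders"}},
    {"$unwind": "$orders"},
]

# $sort + $limit: the optimizer only keeps the top 10 documents in memory.
sort_limit_pipeline = [
    {"$sort": {"total": -1}},
    {"$limit": 10},
]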

Aggregation Pipeline Limits

Each document in the result set is limited by the maximum BSON document size, which is currently 16 megabytes. If any single document exceeds this limit, the 'aggregate' command will produce an error. The limit only applies to the returned documents; during pipeline processing, the documents may exceed this size.
Pipeline stages have a limit of 100 megabytes of RAM. If a stage exceeds this limit, MongoDB will produce an error. To allow for the handling of large datasets, use the allowDiskUse option to enable aggregation pipeline stages to write data to temporary files.
The $graphLookup stage must stay within the 100 megabyte memory limit even if allowDiskUse: true is specified for the aggregate operation; the $graphLookup stage ignores the option. If there are other stages in the aggregate operation, the allowDiskUse: true option will affect those other stages.
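With pymongo, the option can be passed to aggregate() like this (a sketch; the database, collection, and fields are placeholders):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]    # hypothetical database

pipeline = [
    {"$group": {"_id": "$userId",
                "spent": {"$sum": "$total"}}},
    {"$sort": {"spent": -1}},     # sorting a large result set may need a lot of RAM
]

# allowDiskUse lets memory-hungry stages spill to temporary files
# instead of failing at the 100 MB per-stage limit.
cursor = db["orders"].aggregate(pipeline, allowDiskUse=True)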

Aggregation Pipeline and Sharded Collections

The aggregation pipeline supports operations on sharded collections.

Behavior

If the pipeline starts with an exact $match on a shard key, the entire pipeline runs on the matching shard only. Previously, the pipeline would have been split, and the work of merging it would have had to be done on the primary shard.
For aggregation operations that must run on multiple shards, if the operations do not require running on the database’s primary shard, these operations will route the results to a random shard to merge the results to avoid overloading the primary shard for that database. The $out stage and the $lookup stage require running on the database’s primary shard.

Optimization

When splitting the aggregation pipeline into two parts, the pipeline is split to ensure that the shards perform as many stages as possible with consideration for optimization.
To see how the pipeline was split, include the explain option in the db.collection.aggregate() method.
Optimizations are subject to change between releases.

Time measure experiment

For this experiment I used the following data:
  • 100 000 records for Users
  • 100 000 records for Items
  • 1 000 000 records for Orders, each containing a user id and up to 10 item ids
The experiment consisted of the following tests: a simple match, a match using the collection 'contains' function, a lookup with match, and lookup & match & unwind & group combinations.
Time was also measured for several foreign-key cases: an ObjectId with an ascending index or a hash index, and a GUID with an ascending index or a hash index (plus the other variants shown below).
I got the following results:

Each configuration below lists the timings in this order: simple match / match with 'contains' / lookup & match / lookup & unwind & match & unwind & group / lookup & unwind & match & unwind & lookup & unwind & group.
  • ObjectId & Ascending Index: 0.168s / 0.709s / 66.373s / 75.856s / 73.797s
  • ObjectId & Hash Index: 0.168s / 0.722s / 79.796s / 74.563s / 74.129s
  • GUID & Ascending Index: 0.171s / 0.814s / 79.823s / 81.012s / 83.504s
  • GUID & Hash Index: 0.18s / 0.83s / 97.733s / 86.546s / 87.338s
  • Ascending GUID & Ascending Index: 0.183s / 0.793s / 83.317s / 84.847s / 85.418s
  • Ascending GUID & Hash Index: 0.179s / 0.828s / 97.171s / 86.375s / 86.321s
  • CombGUID & Ascending Index: 0.185s / 0.798s / 85.767s / 86.045s / 86.303s
  • CombGUID & Hash Index: 0.183s / 0.787s / 98.76s / 85.928s / 86.572s
  • Non-ID without Index: 0.194s / 0.79s / 42501.798s (≈ 11h 48m 21.798s) / 44605.215s (≈ 12h 23m 25.215s) / 44763.693s (≈ 12h 26m 03.693s)
  • Non-ID & Hash: 0.177s / 0.781s / 83.502s / 82.692s / 82.749s
As we can see, the results are pretty bad. But there is a way to do it right!
These measurements were made by running the aggregation pipeline on Orders, joined ($lookup) with Users and filtered on username AFTER that. But what if we run the aggregation pipeline on Users, filter it on username first, and only then do a lookup with the Orders collection? The results I got are the following:
  • simple match: 0.187s
  • match with 'contains': 0.807s
  • lookup & match: 0.062s
  • lookup & unwind & match & unwind & group: 0.038s
  • lookup & unwind & match & unwind & lookup & unwind & group: 0.014s
So, as we can see, we now get very good request speeds with the aggregation pipeline. We should therefore think very carefully about optimizations while using it.
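Here is a sketch of this "filter Users first, then lookup Orders" pipeline with pymongo; the username value and the userId foreign-key field are hypothetical.

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]    # hypothetical database

pipeline = [
    # Filter the small Users collection first...
    {"$match": {"username": "john_doe"}},
    # ...and only then join the big Orders collection to it.
    {"$lookup": {"from": "Orders",
                 "localField": "_id",
                 "foreignField": "userId",
                 "as": "orders"}},
    {"$unwind": "$orders"},
    {"$group": {"_id": "$_id",
                "ordersCount": {"$sum": 1}}},
]

results = list(db["Users"].aggregate(pipeline))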
I also ran the same tasks without using the aggregation pipeline. The results are the following:
  • simple match: 0.008s
  • match with 'contains': 0.012s
  • lookup & match: 0.566s
  • lookup & unwind & match & unwind & group: 0.652s
  • lookup & unwind & match & unwind & lookup & unwind & group: 0.838s
Of course, these tests were done without things such as indexes, etc., which is why they took a bit longer to execute.