Bansal Blog!!

Learning To Rank

2020-05-02T00:00:00+00:00

Lottery Ticket Hypothesis

2019-12-30T00:00:00+00:00

In deep learning, we usually build a deep model and then prune the weights that do not affect its accuracy. This gives a smaller network (a 75%-90% reduction in size) at the same accuracy. But what happens if we train the pruned subnetwork again? Will the accuracy improve? Can we reduce the size of the network further? Researchers had found that accuracy decreases when a pruned network is retrained, so it seemed it could not be pruned further. Researchers from MIT challenged this assumption and proposed a smart method that prunes the model down to as little as 0.3% of its original size with somewhat similar accuracy.

They experimented with the initialization of the network and found that if the pruned network is reinitialized with new random values, its performance decreases. But if the surviving weights are reset to the same values they had at the start of training, the pruned network can match, and sometimes exceed, the accuracy of the trained model. They call such a subnetwork a winning ticket inside the big deep network.

Algorithm:

1. Randomly initialize a neural network f(x; θ) (call the initial parameters θ_0).
2. Train the network for k iterations, so the parameters become θ_k.
3. Prune p% of the parameters in θ_k (create a mask m for that).
4. Reset the remaining parameters to their values in θ_0, creating the winning ticket f(x; m*θ_0).
5. Repeat steps 2-4 until the change in accuracy stays within a threshold level.
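The loop above can be sketched in a few lines of plain Python. This is only a toy illustration, not the authors' code: the `train` function is a hypothetical stand-in that pulls the weights toward a fixed target instead of fitting a real network, and pruning is assumed to remove the smallest-magnitude surviving weights.

```python
import random

def train(theta, mask, steps=100, lr=0.1):
    """Toy stand-in for training: pull unmasked weights toward a fixed target."""
    n = len(theta)
    target = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    theta = list(theta)
    for _ in range(steps):
        for i in range(n):
            if mask[i]:
                theta[i] -= lr * 2.0 * (theta[i] - target[i])
    return theta

def prune(theta_k, mask, p):
    """Clear the mask for the fraction p of surviving weights with the smallest magnitude."""
    alive = [i for i in range(len(mask)) if mask[i]]
    n_prune = int(p * len(alive))
    smallest = sorted(alive, key=lambda i: abs(theta_k[i]))[:n_prune]
    new_mask = list(mask)
    for i in smallest:
        new_mask[i] = 0
    return new_mask

def find_winning_ticket(theta_0, p=0.2, rounds=3):
    mask = [1] * len(theta_0)
    for _ in range(rounds):
        start = [w * m for w, m in zip(theta_0, mask)]  # survivors restart from theta_0
        theta_k = train(start, mask)                     # train for k iterations
        mask = prune(theta_k, mask, p)                   # prune p% of what is left
    ticket = [w * m for w, m in zip(theta_0, mask)]      # f(x; m * theta_0)
    return mask, ticket

rng = random.Random(0)
theta_0 = [rng.gauss(0.0, 1.0) for _ in range(100)]
mask, ticket = find_winning_ticket(theta_0)
print(sum(mask), "of", len(mask), "weights survive")
```

The important detail is the last line of `find_winning_ticket`: the surviving weights keep their original values from θ_0 rather than new random ones.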

Their experimental results are striking. Iterative pruning makes training much faster, with better generalization. They show that we can achieve the same test accuracy with only 10%-20% of the original model. In principle, this technique can be applied to any neural network architecture.

In the following image, we can see that the model's performance is 0.3% higher than that of the original model, using only 1.2% of the original model.

Let's ask whether this concept relates to our own learning mechanism. I believe our brain has a similar pruning mechanism. When we read about a topic, the brain forms connections with weak and strong synaptic weights. When we go through the topic again and again, some of the weak connections break and others become stronger. The brain also creates new connections based on the relation between the current topic and our prior experience. Apart from this last point, the proposed technique behaves the same way. This raises questions such as: how should we choose the hyper-parameter k when training a model after a pruning step? What is the best pruning percentage p? The authors have done a detailed analysis of these questions; please check out the paper.

data-science-I

2019-11-17T00:00:00+00:00

Things to talk about?

Distance Metrics ?

This post is in progress.

Categorical variables are called qualitative variables/predictors and numerical variables are called quantitative variables.

Common steps for data cleaning:

The better the data, the fancier the algorithm can be.

Memory optimization of numerical features: int64 -> int16, int32
Duplicate observation removal
Filter outliers
- num feature: use a box/distribution plot, then either drop the outliers or replace them with the mean or with a boundary value such as the upper bound (preferred)
- cat feature: create a separate category for outliers or fill them as null. We can also add a new feature that flags outliers
Handle missing values using the inter-quartile range
For text data, we have the following steps:
1. Convert all words to lower case.
2. Contraction mapping: isn't -> is not.
3. Extra space removal.
4. Punctuation removal.
5. Digit and other special character removal.
6. Clean HTML markup or other stray characters.
Type conversion: integer -> category for cat features

Standardization/Normalization

Feature transformation: X -> log X
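The text-cleaning steps listed above can be sketched with the standard library alone. The `CONTRACTIONS` table is a tiny illustrative subset made up for this example, and the order of the steps is one reasonable choice rather than the only one (HTML is stripped first so that tags do not leak into the words).

```python
import html
import re
import string

# Illustrative subset only; a real mapping would be much larger.
CONTRACTIONS = {"isn't": "is not", "don't": "do not", "can't": "cannot"}

def clean_text(text):
    text = re.sub(r"<[^>]+>", " ", text)         # step 6: strip HTML tags
    text = html.unescape(text)                    # step 6: decode entities such as &amp;
    text = text.lower()                           # step 1: lower case
    for short, full in CONTRACTIONS.items():      # step 2: contraction mapping
        text = text.replace(short, full)
    text = text.translate(str.maketrans("", "", string.punctuation))  # step 4
    text = re.sub(r"[^a-z\s]", " ", text)         # step 5: digits, other special characters
    return re.sub(r"\s+", " ", text).strip()      # step 3: extra spaces

print(clean_text("<p>This  ISN'T   a Test, 123!</p>"))
```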

IQR:

The IQR approximates the amount of spread in the middle half of the data. Steps to find the IQR:

Sort the data
Split the sorted data into a lower and an upper bucket
- even size: the data splits cleanly into two halves
- odd size: leave the median out, so that both buckets have the same size
Pick the median of each bucket and take the difference (upper median minus lower median)
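A minimal sketch of these steps, assuming the median-of-halves convention described above (other quartile conventions exist and can give slightly different numbers):

```python
from statistics import median

def iqr(values):
    data = sorted(values)                      # sort the data
    half = len(data) // 2
    lower = data[:half]                        # lower bucket
    # For an odd number of points the middle element is left out of both buckets.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(upper) - median(lower)       # Q3 - Q1

print(iqr([1, 2, 3, 4, 5, 6, 7, 8]))
print(iqr([7, 1, 3, 5, 9]))
```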

Missing Data Handling:

There are 3 types of missing data cases:

Missing completely at random (MCAR)
Missing at random (MAR)
Missing not at random (MNAR)

Here is a graph that shows the bias that arises in these three cases. Source: Nakagawa & Freckleton (2008)

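To test these cases:

We can partition the data into two big chunks (rows where the value is missing and rows where it is observed) and compare the other variables across the two chunks with a t-test.
1. If there is no significant difference, the data is consistent with MCAR.
2. Otherwise it may be a case of MAR or MNAR.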

Methods for missing value imputation:

Mean, mode, median imputation
KNN based: we can take the average of the k nearest neighbours (not very good for high-dimensional data)
- Euclidean, Manhattan, cosine similarity
- Hamming distance, Jaccard (very good for sparse data)
Tree based imputation
EM (iterative approach)
Linear regression based
- treat the attribute with missing values as the dependent variable and all other variables as independent
- predict the batch of missing values and include them as well for training
- repeat till convergence
- it adds linearity to the data, which can make the model worse
MICE [Multiple Imputation by Chained Equations]:
- Treat the feature/predictor with missing values as the dependent variable and the rest as independent variables.
- Fit any predictive modelling algorithm, such as linear regression, and predict the missing values
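As an illustration of the KNN-based option from the list above, here is a small sketch that fills each missing entry (marked as `None`) with the average of that column over the k nearest complete rows. It assumes Euclidean distance measured on the columns that are present; the function name and data are made up for the example.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the column mean of the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        known = [j for j, v in enumerate(row) if v is not None]

        def dist(other):
            # Euclidean distance using only the columns this row actually has
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in known))

        neighbours = sorted(complete, key=dist)[:k]
        filled.append([
            v if v is not None else sum(n[j] for n in neighbours) / len(neighbours)
            for j, v in enumerate(row)
        ])
    return filled

rows = [[1.0, 10.0], [2.0, 20.0], [10.0, 100.0], [1.5, None]]
print(knn_impute(rows)[3])
```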

Sampling Techniques:

To model the behaviour of a population, we need a good strategy for choosing a sample that can describe that behaviour.

We can't deal with the entire population; it is better to choose a sample that has approximately the same empirical mean as the entire population.

Example: when we are building an application or running an experiment, we never have the whole population, only a subset of it. Our objective is to approximate the behaviour of the population using empirical observations, and this is where sampling helps. Bootstrap sampling is very important in this context: if we build a model using bootstrap sampling and repeat the experiment a large number of times, the average of the empirical means is approximately equal to the population mean.

Sampling can be broadly categorized into two buckets:

Probability based sampling
- assign some probability to each sample; it can be weighted or uniform (mostly)
- weighted sampling follows user experience. For example, in a survey on medical diagnosis, a doctor's response will be more important than a patient's response.
  1. Random Sampling
  2. Stratified Sampling
  - divide the population into groups/strata and then use random sampling on each group
  3. Bootstrap sampling
  - random sampling with replacement
  - an average bootstrap sample contains 63.2% of the original observations and omits 36.8%.
  - The probability that a particular observation is not chosen in a single draw from a set of n observations is 1 - 1/n, and over the n draws of one bootstrap sample it becomes (1 - 1/n)^n.
  - proof: As n → ∞, (1 - 1/n)^n tends to 1/e. Therefore, when n is large, the probability that an observation is not chosen is approximately 1/e ≈ 0.368.
  - very important for building decorrelated models in bagging
Non-Probability based
1. Convenience Sampling
  - choose whatever you can find
  - biased
  - not efficient
  - poor representation of population
2. Quota sampling
  - order based sampling
  - select some random number and then choose k samples in ascending (or some other) order
  - biased
  - poor representation
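The 63.2% figure quoted for bootstrap sampling above can be checked numerically. The sketch below compares the closed form (1 - 1/n)^n with 1/e and also simulates the share of distinct observations in a bootstrap sample; the sample size and number of repetitions are arbitrary choices for the demo.

```python
import math
import random

def unique_fraction(n, rng):
    """Draw n indices with replacement and return the share of distinct ones."""
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n

n = 1000
rng = random.Random(42)
avg_unique = sum(unique_fraction(n, rng) for _ in range(200)) / 200
p_missed = (1 - 1 / n) ** n

print("P(not chosen) for n=1000:", round(p_missed, 4), " 1/e:", round(math.exp(-1), 4))
print("average share of distinct observations:", round(avg_unique, 3))
```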

Basic steps of ML practice:

Explore the data
- draw histograms, cross-plots and so on to understand the data distribution
Feature Engineering
- Come up with a hypothesis (with assumptions) and prove your hypothesis
- Color can be important when buying a second-hand car. It is better to embed the color as a feature instead of feeding the raw image data as it is.
- In a text data-set, the length, average and other statistics of sentences can be additional features
- In tree based models, these statistics can be helpful
- log(x), log(1 + x), or fitting a Poisson distribution for count variables
- For a feature with many categories, mean encoding is very helpful, and it also helps the model converge faster. First check the feature's distribution before and after encoding
Fit a model

Stacking (stack net)

It is a meta-modelling approach.
At the base level, we train weak learners, and their predictions are then used by another model to get the final prediction.
It can be seen as a NN model where each node is replaced by one model.

Process:

Split the data into K parts
Train each weak learner on K-1 parts and hold out one part for that learner's predictions
Algorithm steps with an example:
1. We split the dataset into 4 parts.
2. Train the first weak learner on parts 1, 2, 3 and predict on the 4th.
3. Train the 2nd weak learner on parts 1, 2, 4 and predict on the 3rd.
4. Repeat for the remaining parts.
5. Now we have each learner's predictions on a separate hold-out part, and after combining them all we get predictions on the entire data-set.
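The hold-out scheme above can be sketched as follows. To keep the example self-contained, the base learner is a hypothetical `mean_learner` that always predicts the mean of its training targets; in practice it would be a real model, and the resulting out-of-fold column would become an input feature for the meta-model.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def mean_learner(xs, ys):
    """Hypothetical weak learner: always predicts the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def out_of_fold_predictions(xs, ys, fit, k=4):
    oof = [None] * len(xs)
    for held_out in kfold_indices(len(xs), k):
        held = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in held_out:              # predict only on the part the model never saw
            oof[i] = model(xs[i])
    return oof

xs = list(range(8))
ys = [0, 0, 2, 2, 4, 4, 6, 6]
oof = out_of_fold_predictions(xs, ys, mean_learner, k=4)
print(oof)
```

Every row gets exactly one prediction, and that prediction comes from a model that was not trained on the row.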

Data-Leakage

Data leakage makes the model learn something other than what we intended.
It produces bias in the model.
If the training data-set contains information or a feature that comes from outside the training data-set, or a feature that has no correlation with the training data distribution, that is data leakage.
How do we induce data leakage (generally)? While building a model, if we standardize using the entire data (train + test), the scaling parameters carry knowledge of the entire distribution, whereas our aim is to learn that distribution by training the model on the training data-set only.

Standardize on the training data-set, and at test time normalize the test data with the same parameters that were computed at training time.
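A minimal sketch of this rule, with made-up numbers: the mean and standard deviation are computed from the training values only and then reused, unchanged, on the test values. The helper names are illustrative.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn the scaling parameters from the training data only."""
    return mean(train_values), pstdev(train_values)

def transform(values, mu, sigma):
    return [(v - mu) / sigma for v in values]

train = [1.0, 2.0, 3.0]
test = [4.0]

mu, sigma = fit_scaler(train)                 # the test data is never seen here
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)      # same parameters as at training time
print(train_scaled, test_scaled)
```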

Cross Validation

We generally split our data-set into training and testing sets. From the training data-set, we further take some part for validation. This is the classical setting. We use the K-fold validation strategy to obtain a less biased estimate of performance, i.e. the sum of all folds' scores / K.

Note that K-fold validation is run on the training data only.

Nested Validation

This is a more robust method, especially for time-series datasets, where data leakage commonly occurs and affects model performance by an enormous amount.

The idea is that there are two loops. The outer loop is the same as the classical validation step. In the inner loop, the training data of one K-fold step is further divided into training and validation parts, and the fold that is held out in the outer loop acts as the testing dataset.

Using nested cross-validation, we train K models with different parameters, and each model uses grid search to find its optimal parameters. If our model is stable, each model will end up with the same hyper-parameters.

Why is Cross-Validation Different with Time Series?

When dealing with time series data, traditional cross-validation (like k-fold) should not be used for two reasons:

Temporal Dependencies
Arbitrary choice of Test data-set

Nested CV method

Predict second half
- Choose any random test set, and on the remaining data-set maintain the temporal relation between training and validation
- Not very robust, because of the random test-set selection.
Forward chaining
- Maintain the temporal relation between all three of the train, validation and test sets.
For example, we have data for 10 days.
1. Train on the 1st day, validate on the 2nd and test on the rest.
2. Train on the first two days, validate on the third and test on the rest.
3. Repeat.
This method produces many different train/test splits, and the error on each split is averaged in order to compute a robust estimate of the model error.
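The forward-chaining splits for the 10-day example can be generated with a short helper. This is a sketch of the splitting logic only; the function name is made up, and days are numbered from 1 to match the text.

```python
def forward_chaining_splits(n_days):
    """Yield (train, validation, test) day lists that respect temporal order."""
    days = list(range(1, n_days + 1))
    for t in range(1, n_days - 1):
        # train on days 1..t, validate on day t+1, test on everything after
        yield days[:t], [days[t]], days[t + 1:]

splits = list(forward_chaining_splits(10))
for train_days, val_days, test_days in splits[:2]:
    print("train", train_days, "validate", val_days, "test", test_days)
```

The model error would then be averaged over all the splits, as described above.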

Feature Selection [src: Analytics Vidhya]:

Filter Methods
Wrapper Methods
Embedded Methods
Difference between Filter and Wrapper methods

Filter Methods

Pearson's Correlation: It is used as a measure for quantifying the linear dependence between two continuous variables X and Y. Its value varies from -1 to +1.
LDA: Linear discriminant analysis is used to find a linear combination of features that characterizes or separates two or more classes (or levels) of a categorical variable.
Chi-Square: It is a statistical test applied to groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distribution.

NOTE: Filter methods do not remove multicollinearity.
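To make the Pearson filter and the multicollinearity note concrete, here is a small sketch with made-up data. Features are ranked by the absolute value of their correlation with the target; `size_copy` is an exact multiple of `size`, and the filter still ranks both near the top because it scores each feature on its own.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rank_by_correlation(features, y):
    """Filter method: rank feature names by |correlation with the target|."""
    return sorted(features, key=lambda name: abs(pearson(features[name], y)), reverse=True)

features = {
    "size":      [1.0, 2.0, 3.0, 4.0, 5.0],
    "size_copy": [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly collinear with "size"
    "noise":     [3.0, 1.0, 4.0, 1.0, 5.0],
}
y = [1.1, 1.9, 3.2, 3.9, 5.1]
ranking = rank_by_correlation(features, y)
print(ranking)
```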

Wrapper methods:

Here, we try to use a subset of features and train a model using them. Based on the inferences that we draw from the previous model, we decide to add or remove features from the subset.

This is computationally very expensive.

Methods:

forward feature selection
- we start with no features in the model. At each iteration, we add the feature that best improves our model
backward feature elimination
- we start with all the features and at each iteration remove the least significant feature, whose removal improves the performance of the model
recursive feature elimination
- It is a greedy optimization algorithm which aims to find the best performing feature subset.
  1. It repeatedly creates models and sets aside the best or the worst performing feature at each iteration.
  2. It constructs the next model with the remaining features until all the features are exhausted.
  3. It then ranks the features based on the order of their elimination.
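Forward feature selection from the list above can be sketched as a greedy loop. The `toy_score` function is a hypothetical stand-in for "train a model on this subset and return its validation score"; here it simply rewards useful features and charges a small penalty per feature.

```python
def forward_selection(features, score, max_features):
    """Greedily add the feature that improves the score most; stop when nothing helps."""
    selected, best = [], float("-inf")
    remaining = list(features)
    while remaining and len(selected) < max_features:
        top_score, top_feature = max((score(selected + [f]), f) for f in remaining)
        if top_score <= best:          # no candidate improves the model any more
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best

# Hypothetical usefulness of each feature, for illustration only.
USEFULNESS = {"a": 0.5, "b": 0.3, "c": 0.0}

def toy_score(subset):
    return sum(USEFULNESS[f] for f in subset) - 0.05 * len(subset)

selected, best = forward_selection(["a", "b", "c"], toy_score, max_features=3)
print(selected, round(best, 2))
```

Each call to `score` stands for a full model fit, which is why wrapper methods are so much more expensive than filter methods.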

Difference between Filter and Wrapper methods

The main differences between the filter and wrapper methods for feature selection are:

Filter methods measure the relevance of features by their correlation with the dependent variable, while wrapper methods measure the usefulness of a subset of features by actually training a model on it.
Filter methods are much faster than wrapper methods as they do not involve training models. Wrapper methods, on the other hand, are computationally very expensive.
Filter methods use statistical methods to evaluate a subset of features, while wrapper methods use cross validation.
Filter methods might fail to find the best subset of features on many occasions, whereas wrapper methods are more likely to find a good subset because they evaluate features with the model itself.
Using the subset of features from wrapper methods makes the model more prone to overfitting than using the subset of features from filter methods.

From here on, this post is in progress.

Feature Selection: More Info

1) Feature selection with correlation and random forest classification

correlation map

import matplotlib.pyplot as plt
import seaborn as sns

# x is assumed to be a pandas DataFrame holding the features
f, ax = plt.subplots(figsize=(18, 18))
sns.heatmap(x.corr(), annot=True, linewidths=.5, fmt='.1f', ax=ax)

Using this correlation map, we select some of the features and check our algorithm's prediction accuracy.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score,confusion_matrix
from sklearn.metrics import accuracy_score

# split data: train 70 % and test 30 %
x_train, x_test, y_train, y_test = train_test_split(x_1, y, test_size=0.3, random_state=42)

# random forest classifier with the default n_estimators (10 before scikit-learn 0.22, 100 since)
clf_rf = RandomForestClassifier(random_state=43)
clr_rf = clf_rf.fit(x_train,y_train)

ac = accuracy_score(y_test,clf_rf.predict(x_test))
print('Accuracy is: ',ac)
cm = confusion_matrix(y_test,clf_rf.predict(x_test))
sns.heatmap(cm,annot=True,fmt="d")

Accuracy is:  0.9532163742690059

2) Univariate feature selection and random forest classification

In univariate feature selection, we use SelectKBest, which removes all but the k highest-scoring features.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# find best scored 5 features
select_feature = SelectKBest(chi2, k=5).fit(x_train, y_train)

print('Score list:', select_feature.scores_)
print('Feature list:', x_train.columns)

Using these selection scores, we obtain the top k features with the transform function.

x_train_2 = select_feature.transform(x_train)
x_test_2 = select_feature.transform(x_test)
# random forest classifier with the default n_estimators (10 before scikit-learn 0.22, 100 since)
clf_rf_2 = RandomForestClassifier() 
clr_rf_2 = clf_rf_2.fit(x_train_2,y_train)
ac_2 = accuracy_score(y_test,clf_rf_2.predict(x_test_2))
print('Accuracy is: ',ac_2)
cm_2 = confusion_matrix(y_test,clf_rf_2.predict(x_test_2))
sns.heatmap(cm_2,annot=True,fmt="d")

Accuracy is:  0.9590643274853801

3) Recursive feature elimination (RFE) with random forest

Basically, it uses one of the classification methods (random forest in our example) to assign a weight to each feature. The features whose absolute weights are the smallest are pruned from the current feature set. That procedure is recursively repeated on the pruned set until the desired number of features is reached.

from sklearn.feature_selection import RFE
# Create the RFE object and rank each feature
clf_rf_3 = RandomForestClassifier()
rfe = RFE(estimator=clf_rf_3, n_features_to_select=5, step=1)
rfe = rfe.fit(x_train, y_train)

print('Chosen best 5 feature by rfe:',x_train.columns[rfe.support_])

Chosen best 5 feature by rfe: Index(['area_mean', 'concavity_mean', 'area_se', 'concavity_worst',
       'symmetry_worst'],
      dtype='object')

In this method we choose the number of features ourselves. What if a different number of features would give a much higher accuracy than this?

4) Recursive feature elimination with cross validation and random forest classification

Now we will not only find the best features but also find how many features we need for the best accuracy.

from sklearn.feature_selection import RFECV

# The "accuracy" scoring is proportional to the number of correct classifications
clf_rf_4 = RandomForestClassifier() 
rfecv = RFECV(estimator=clf_rf_4, step=1, cv=5,scoring='accuracy')   #5-fold cross-validation
rfecv = rfecv.fit(x_train, y_train)

print('Optimal number of features :', rfecv.n_features_)
print('Best features :', x_train.columns[rfecv.support_])

# Optimal number of features : 14
# Best features : Index(['texture_mean'....]

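# grid_scores_ was removed in scikit-learn 1.2; newer versions expose the
# per-step scores as rfecv.cv_results_['mean_test_score'] instead
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)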

5) Tree based feature selection and random forest classification

Random forest chooses features randomly at each split, so the order of the feature importance list can change between runs.

import numpy as np

clf_rf_5 = RandomForestClassifier()
clr_rf_5 = clf_rf_5.fit(x_train,y_train)
importances = clr_rf_5.feature_importances_
# spread of each feature's importance across the trees of this forest
std = np.std([tree.feature_importances_ for tree in clf_rf_5.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(x_train.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest

plt.figure(1, figsize=(14, 13))
plt.title("Feature importances")
plt.bar(range(x_train.shape[1]), importances[indices],
       color="g", yerr=std[indices], align="center")
plt.xticks(range(x_train.shape[1]), x_train.columns[indices],rotation=90)
plt.xlim([-1, x_train.shape[1]])
plt.show()

Feature ranking:
1. feature 1 (0.213700) ....

Cat var: qualitative variable. Num var: quantitative variable.

t-statistic: find the coefficient of the feature in the model and its standard error; t-stat = coefficient / standard error.

document-retrieval-system

2019-08-27T00:00:00+00:00

This will be a very quick post on document retrieval systems, where we see how to retrieve a set of similar documents on the basis of a query.

❤ Problem statement ❤

Let's say we have 1000 unlabeled documents. Our objective is to respond to a query document with the most similar documents (semantically). This problem is similar to topic modelling.

A few approaches:

There are some topic modelling approaches, which we will discuss later.

Latent Semantic Analysis (LSA)
Probabilistic LSA (PLSA)
Latent Dirichlet Allocation (Bayesian version of PLSA)

Document Retrieval Approach using deep learning.

We will go through a step-by-step procedure to prepare the pipeline.

We prepare a bag-of-words for these documents. For text cleaning, we:
1. remove punctuation, stop words and spaces.
2. perform stemming (removing suffixes such as -ize, -s, -es etc).
We prepare a count vector of the top words from all documents. To use them as features in a neural network, we can prepare tf-idf features. Tf-idf is short for term frequency-inverse document frequency, which is calculated as term-freq * idf, where
1. term-freq = count-of-word / total-words
2. idf = log(total-documents / #documents-in-which-word-appears)
Prepare an auto-encoder model with a very small latent dimension. For example, if we select the 5000 top words, we can choose the following configuration: 5000 -> 1000 -> 200 -> 10 -> 200 -> 1000 -> 5000. Note: our loss function is the reconstruction error of the original features.
Now that we have latent features (10 dims), we extract these features for each document as well as for the query. The only remaining step is to find the cosine similarity between each document and the query. If we want to categorize the documents, we can do the same using similarity metrics. Note: for this we need to compare each pair, which can be huge (N-choose-2 combinations).

That's it: we get documents similar to the query, if the NN is trained well.
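The tf-idf and cosine-similarity steps can be sketched in plain Python. Note the simplifications: the auto-encoder is skipped, so similarity is computed directly on tf-idf vectors rather than on 10-dimensional latent features, and the query is treated as one more document when counting document frequencies. The documents are made up for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    df = Counter(w for doc in tokenized for w in set(doc))   # document frequency
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # term-freq * idf, with idf = log(total-documents / documents containing the word)
        vectors.append([(counts[w] / len(doc)) * math.log(len(docs) / df[w]) for w in vocab])
    return vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_documents(query, docs):
    vectors = tfidf_vectors(docs + [query])
    query_vec = vectors[-1]
    sims = [cosine(query_vec, v) for v in vectors[:-1]]
    return sorted(range(len(docs)), key=lambda i: -sims[i]), sims

docs = ["deep learning for text", "learning to rank documents", "cooking pasta at home"]
ranking, sims = rank_documents("deep learning text retrieval", docs)
print(ranking)
```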

Applications of information retrieval

better labeling of products on, say, Amazon or Flipkart (let's say we have a wooden chair for children that is labeled as furniture; using the above method, we can find other products similar to it, and from the history of their buyers and the other products they bought, we can estimate a better label such as baby-product)
discovering similar neighbourhoods (for house price estimation, violence/crime forecasting etc)
structuring web search results (categorize the results; for example, if we search for watson, it will show ibm-watson, emma-watson or other things, so we can display these results structured by category)
meta features to train another model

flip-the-matrix-to-maximize-sum-in-top-quadrant

2019-08-05T00:00:00+00:00

Problem: Sean invented a game involving a matrix where each cell of the matrix contains an integer. He can reverse any of its rows or columns any number of times. The goal of the game is to maximize the sum of the elements in the n X n submatrix located in the upper-left quadrant of the 2n X 2n matrix.

For example, given the matrix:

1 2
3 4

It is 2 X 2, so we want to maximize the top-left submatrix, which is 1 X 1.

Reverse the 2nd row
Reverse the 1st column, and we get:
```
4 2
1 3
```
The maximal sum is 4

Example 2:

The maximal sum is 414

Explanation:

Reverse 3rd column
then reverse 1st row

Approach: To bring the bottom-right corner element (2n, 2n) to the first position (1,1), we need two flip operations. Similarly, we can flip the matrix so that the element at (2n-1, 2n-1) reaches (2,2). More precisely, we can move the element at (i,j) to any of the three other positions that mirror it across the centre of the matrix, which means an element can be swapped with its 3 corresponding elements. So each position in the top-left quadrant has 4 candidate elements, and we take the maximum of the four. We use this idea to get the maximum sum in the top quadrant of the matrix.

Note: The description above assumes 1-based indexing; the code below uses 0-based indexing.

int flippingMatrix(vector<vector<int>> A) {
    int sum = 0;
    int n = A.size(), m = A[0].size();
    int cur, right, down, diag, ans;
    for(int i=0; i<n/2; i++){
        for(int j=0; j<m/2; j++){
            cur = A[i][j];
            right = A[i][m-j-1];
            down = A[n-i-1][j];
            diag = A[n-i-1][m-j-1];
            ans = max({cur, right, down, diag});
            sum += ans;
        }
    }
    return sum;
}
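For a quick check of the idea, the same logic can be written in Python and run on the 2 X 2 example from the problem statement. The 4 X 4 matrix in the second call is an illustrative input chosen for this sketch, not one taken from the post.

```python
def flipping_matrix(a):
    size = len(a)            # the matrix is 2n x 2n
    half = size // 2
    total = 0
    for i in range(half):
        for j in range(half):
            # the four cells that can be moved into position (i, j) by row/column flips
            total += max(
                a[i][j],
                a[i][size - 1 - j],
                a[size - 1 - i][j],
                a[size - 1 - i][size - 1 - j],
            )
    return total

print(flipping_matrix([[1, 2], [3, 4]]))
print(flipping_matrix([
    [112, 42, 83, 119],
    [56, 125, 56, 49],
    [15, 78, 101, 43],
    [62, 98, 114, 108],
]))
```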

recommendation-system

2019-08-04T00:00:00+00:00

Recommendation system

In this post, we visit the various approaches used in recommendation systems. For each approach, I will walk through the major points. This post is not an in-depth explanation but a revision of the approaches followed in research and industry.

The content of this post is as follows:

Content Based
Collaborative filtering
Hybrid Approach
Matrix factorization
Deep Learning Based recSys
Graph Based inference model

Content Based Method
1. Based on user history
2. Not helpful for the cold-start problem
3. Features would be product likes, location, features of the products we bought etc
Collaborative filtering Approach:
1. Interaction based features
  - user-user interaction (I like sitcom TV series; it finds the similarity between users and recommends products/movies of users who have the same taste as me)
  - item-item interaction (recommend similar products on Amazon)
2. Suffers from the cold-start problem, since a new user or item has no interactions to compare
3. KNN algorithm to measure similarity
Hybrid Approach:
1. User history
2. Interaction of user-user or item-item
3. Most companies use this approach
Matrix factorization:
1. We need to predict the missing entries in the user-item rating matrix. It can be done using SVD as R = U S V', where U holds the user features, S the singular values and V the item features
2. It can be done using an alternating optimization approach with the loss function |r_{ij} - u_i S v_j|^2
Probabilistic Matrix Factorization:
1. We can include known user and item history/features as well, to learn a better latent representation.
2. The loss function is |r_{mn} - u_n v_m|^2 + |u_n - W_u a_n|^2 + |v_m - W_v b_m|^2, where a_n is the feature or history vector of the nth user and b_m is the history/feature vector of the mth item.
3. We can even add regularization on u_n and v_m
4. Works very well on the Netflix movie data
Deep Learning Based
1. deep and shallow network approach
2. The deep network uses word embeddings of the product description and the shallow network uses the user history as features or meta-data
3. implemented in the YouTube recSys
4. There are many possible ways to build the network and use features; play with word embeddings
  - time series content
  - time distributed layer for parsing documents on items
  - tf-idf features
  - SVD features
5. We can even use doc2vec for each document. Use gensim doc2vec for training. The main idea is that during training we use the same approach as word2vec, except that we add a document tag, which maintains the context for each doc as well
Another Deep Learning Approach:
1. two networks, one for users and the other for items
2. Compute the user-item interaction by dot product or cosine similarity
3. Build more dense layers to get a more complex representation
4. use multiclass cross entropy loss for 5-star ratings
5. We can also use sparse implicit feedback features, which are binary in nature (implicit feedback is generated by the system from clicks, while explicit feedback is collected from likes, reviews and purchase history)
Graph Based Network
1. use graph embeddings to build the NN (deep walk, random walk)
2. use each feature as a node and each interaction as an edge. For example, for a movie recommendation system with 3 users, 5 movies, 8 actors, 5-star ratings and 10 genres, we can use each attribute as a node and then create an edge for each interaction, such as user1 likes actor1 and gave 4 stars
3. can use node2vec, where each node is represented by a vector that is trained on the same random-walk concept
4. Current state-of-the-art recSys platforms follow this.
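The matrix factorization item above can be sketched with stochastic gradient descent. This is a simplified variant, not the exact losses written in the list: it drops the S matrix and the side-information terms and minimizes (r - u·v)^2 plus L2 regularization on the observed ratings only. The ratings, rank, learning rate and epoch count are all made-up values for the demo.

```python
import math
import random

def factorize(ratings, n_users, n_items, rank=2, lr=0.02, reg=0.02, epochs=2000, seed=0):
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            err = r - sum(a * b for a, b in zip(U[u], V[i]))
            for f in range(rank):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)   # gradient step on the user factor
                V[i][f] += lr * (err * uf - reg * vf)   # gradient step on the item factor
    return U, V

def rmse(ratings, U, V):
    sq = [(r - sum(a * b for a, b in zip(U[u], V[i]))) ** 2 for (u, i), r in ratings.items()]
    return math.sqrt(sum(sq) / len(sq))

# (user, item) -> rating; missing pairs are the entries we want to predict
ratings = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 2): 1, (2, 1): 1, (2, 2): 5}
U, V = factorize(ratings, n_users=3, n_items=3)
print("training RMSE:", round(rmse(ratings, U, V), 3))
print("predicted rating for user 0, item 2:", round(sum(a * b for a, b in zip(U[0], V[2])), 2))
```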

Some practical insights of Recommender System

Netflix uses 100s of different base models; the final prediction is the weighted average of all of them. (Generally non-linear blending is preferred.)
Weighted Hybrid: Choose 10 items from the user's ratings with collaborative filtering and 10 with content based filtering. Now make a list weighting the collaborative items 60% and the content items 40%. Finally sort the list.
Mix Hybrid: Take 5 items from content, 5 from user history, 5 from trending and 5 from others.
Switching: If the user is logged in, switch to collaborative filtering, otherwise switch to content filtering.

largest-rectangle-in-histogram

2019-08-01T00:00:00+00:00

Problem Statement

Given n non-negative integers representing the histogram's bar heights, where the width of each bar is 1, find the area of the largest rectangle in the histogram.

Example 1:

Input: [3,4,5,3,4] Output: 15

Example 2:

Input: [2,1,5,6,2,3] Output: 10

Example 3:

Input: [2,1,5,6,2,3] Output: 10

[Figure: the histogram for this input, drawn with numbered cells]

Approach:

In the numbered figure above, notice which rectangles can be formed and which is the largest. Think of it this way: from the current position we go back to the position where the element is smaller than the current value, and from there we take the area of the rectangle running right up to the current index.
- For example: in [3,4,5], when we are at 5, we go back to its previous smaller element, which is 4, then we see that the area of the rectangle in the forward direction is 5.
- Another example: in [3,4,9,5], when we are at 5, we go back to its previous smaller element, which is 4, then we see that the area of the rectangle going forward is 10.
- Another case: in [3,4,1,5], when we are at 5, we go back to its previous smaller element, which is 1, then we see that the area of the rectangle going forward is 5.

Approach:

If the elements are in increasing order, push the element on the stack. Note: we push the index of the element, rather than the element itself, to keep track of its position in the array.
Else
1. We pop elements from the stack until the top of the stack is no larger than the element at the current position.
2. We also calculate the area of the rectangle using the above logic:
```
 if(s.empty()) curArea = h[lastTop]*i;
 else curArea = h[lastTop]*(i-s.top()-1);
```

long largestRectangle(vector<int> h) {
    int n = h.size();
    stack<long> s;
    long maxArea = 0, curArea, idx;   // heights are non-negative, so 0 is a safe lower bound
    int i;
    for(i=0; i<n; i++){
        if(s.empty()) s.push(i);
        else if(h[i] >= h[s.top()]) s.push(i);
        else{
            while(!s.empty() && h[s.top()] > h[i]){
                idx = s.top(); s.pop();
                if(s.empty()) curArea = h[idx]*i;
                else curArea = h[idx]*(i-s.top()-1);
                maxArea = max(maxArea, curArea);
            }
            s.push(i);
        }
    }
    while(!s.empty()){
        idx = s.top(); s.pop();
        if(s.empty()) curArea = h[idx]*i;
        else curArea = h[idx]*(i-s.top()-1);
        maxArea = max(maxArea, curArea);
    }
    return maxArea;
}
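The same stack idea can be written compactly in Python to check the examples from the problem statement. This sketch uses a sentinel bar of height 0 at the end, which is a small variation on the code above: it flushes the stack inside the main loop instead of in a second loop.

```python
def largest_rectangle(heights):
    stack = []   # indices of bars whose heights are non-decreasing
    best = 0
    for i, h in enumerate(heights + [0]):        # trailing 0 forces every bar to be popped
        while stack and heights[stack[-1]] > h:
            top = stack.pop()
            # the popped bar extends from just after the new stack top up to i - 1
            width = i if not stack else i - stack[-1] - 1
            best = max(best, heights[top] * width)
        stack.append(i)
    return best

print(largest_rectangle([3, 4, 5, 3, 4]))
print(largest_rectangle([2, 1, 5, 6, 2, 3]))
```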

smallest-sufficient-team-leetcode

2019-07-25T00:00:00+00:00

Problem Statement

In a project, you have a list of required skills req_skills, and a list of people. The i-th person people[i] contains a list of skills that person has.

Consider a sufficient team: a set of people such that for every required skill in req_skills, there is at least one person in the team who has that skill. We can represent these teams by the index of each person: for example, team = [0, 1, 3] represents the people with skills people[0], people[1], and people[3].

Return any sufficient team of the smallest possible size, represented by the index of each person.

Example 1:

Input: req_skills = ["java","nodejs","reactjs"], people = [["java"],["nodejs"],["nodejs","reactjs"]] Output: [0,2]

Example 2:

Input: req_skills = ["algorithms","math","java","reactjs","csharp","aws"], people = [["algorithms","math","java"],["algorithms","math","reactjs"],["java","csharp","aws"],["reactjs","csharp"],["csharp","math"],["aws","java"]] Output: [1,2]

Constraints:

<= req_skills.length <= 16
<= people.length <= 60
<= people[i].length, req_skills[i].length, people[i][j].length <= 16

Approach:

As the number of skills is very small and we need to check subsets, we can use bit masking to represent the whole skill set.
Now our task boils down to taking the OR of the skill masks in each subset; if it equals all 1s (the target), we have found a team.
Can we break this problem down into smaller parts, where we can optimize subproblems? Yes, you guessed right. We use DP to do that.

Time Complexity: O(n^2) for the interval DP in the code below, which only combines contiguous ranges of people

Space Complexity: O(n^2)

Note: Space complexity can be optimized to O(n) by storing only the last optimization step. For example, first we check whether a single person can cover the whole skill set, then we check for 2 persons, and so on.
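For comparison, here is a sketch of the commonly used bitmask DP over skill sets. It is a different technique from the interval DP in the C++ code that follows: the state is the set of covered skills, so any combination of people is considered, not only contiguous ranges, at a cost of roughly O(n * 2^m) for n people and m skills. It assumes, as the problem guarantees, that a sufficient team exists.

```python
def smallest_sufficient_team(req_skills, people):
    index = {skill: i for i, skill in enumerate(req_skills)}
    target = (1 << len(req_skills)) - 1                 # all skill bits set
    masks = [sum(1 << index[s] for s in set(p) if s in index) for p in people]

    best = {0: []}   # covered-skill mask -> smallest team found so far
    for i, person_mask in enumerate(masks):
        if person_mask == 0:
            continue
        # iterate over a snapshot so that person i is added at most once per team
        for covered, team in list(best.items()):
            merged = covered | person_mask
            if merged not in best or len(team) + 1 < len(best[merged]):
                best[merged] = team + [i]
    return best[target]

print(smallest_sufficient_team(
    ["java", "nodejs", "reactjs"],
    [["java"], ["nodejs"], ["nodejs", "reactjs"]],
))
```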

#include <bits/stdc++.h>
using namespace std;

void smallestSufficientTeam(vector<string>& req_skills, vector<vector<string>>& people) {
        unordered_map<string, int> mapping;
        int target = 0;
        for(int i=0; i<req_skills.size(); i++){
            target |= (1 << i);   // set bit i for the i-th required skill
            mapping[req_skills[i]] = i;
        }
        cout<<target<<endl;
        int n = people.size();
        vector<int> skill_people(n,0);
        int temp;
        for(int i=0; i<n; i++){
            temp = 0;
            for(int j=0; j<people[i].size(); j++){
                temp |= (1 << mapping[people[i][j]]);   // set the bit of this person's skill
            }
            skill_people[i] = temp;
        }
        
        for(auto itr : skill_people) cout<<itr<<" ";
        cout<<endl;
        // return skill_people;
        cout<<(skill_people[1] | skill_people[2] | skill_people[3])<<endl;
        int ans;
        vector<vector<int>> dp(n, vector<int>(n,0));
        for(int k=0; k<n; k++){
            for(int i=0; i<n-k; i++){
                int j = i+k;
                if(i == j) dp[i][j] = skill_people[i];
                else{
                    int result = (dp[i+1][j] | dp[i][j-1]);
                    dp[i][j] = result;
                    if(result == target){
                        cout<<i<<"-----"<<j<<endl;
                        i = n;
                        k = n;
                    }
                }
            }
        }
        
        cout<<endl;
        for(auto itr1 : dp){
            for(auto itr2 : itr1){
                cout<<itr2<<" ";
            }
            cout<<endl;
        }
    }
    
int main()
{
    int test; cin>>test;
    while(test--){
        int n; cin>>n;
        vector<string> req_skills(n);
        for(int i=0; i<n; i++){
            cin>>req_skills[i];
        }
        int m; cin>>m;
        vector<vector<string>> people;
        for(int i=0; i<m; i++){
            int p; cin>>p;
            vector<string> temp(p);
            for(int j=0; j<p; j++){
                cin>>temp[j];
            }
            people.push_back(temp);
            temp.clear();
        }
        smallestSufficientTeam(req_skills, people);
        
    }
    return 0;
}

Input:

1
"algorithms" "math" "java" "reactjs" "csharp" "aws"
6
"algorithms" "math" "java"
"algorithms" "math" 
"java" "csharp" "aws"
"reactjs" "csharp"
"csharp" "math"
"aws" "java"

Output:

63
3 52 24 18 36 
63
1-----3

7 55 0 0 0 
3 55 63 0 0 
0 52 60 0 0 
0 0 24 26 0 
0 0 0 18 54 
0 0 0 0 36