---
slug: c2numpy-pandas-and-xgboost-tree-learning
datepublished: 2018-07-27T22:05:14
dateupdated: 2018-07-27T22:14:09
tags: English Posts, Acedemic Notes
---


Our use case is to run C2Numpy inside ROOT and then tackle a particle classification problem using XGBoost and pandas.

Our data set has 6 kinematic variables (jet pt, dilepton mass, etc.). A label set from the Monte Carlo then tells us whether each particle comes from the process of interest or not. We also have a weight array, because events that come from Monte Carlo need to be weighted.

Assuming we have prepared these in .npy files, we can load them and put the data into a pandas.DataFrame:

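    import glob

    import numpy as np
    import pandas as pd

    # Collect the feature (x) and label (y) chunks in a stable, matching order.
    xfiles = sorted(glob.glob("./xdata_*.npy"))
    yfiles = sorted(glob.glob("./ydata_*.npy"))

    # Load every chunk and stack the chunks into one array each.
    xarrays = [np.load(f) for f in xfiles]
    rawdata = np.concatenate(xarrays)
    yarrays = [np.load(f) for f in yfiles]
    rawydata = np.concatenate(yarrays)

    dfx = pd.DataFrame(rawdata)
    dfy = pd.DataFrame(rawydata)
    setsize = rawydata.shape[0]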
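The later steps select a column by name (`dfx['weight']`), which works if the .npy files hold NumPy structured arrays, since pandas turns the field names into column names. A minimal sketch of that behavior, assuming that layout; the field names and values here are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical field names and values, for illustration only.
dtype = [("jet_pt", "f8"), ("dilepton_mass", "f8"), ("weight", "f8")]
chunk_a = np.array([(50.0, 91.2, 1.0), (35.5, 88.7, -0.5)], dtype=dtype)
chunk_b = np.array([(72.1, 90.4, 0.8)], dtype=dtype)

# Chunks with the same structured dtype concatenate into one record array.
raw = np.concatenate([chunk_a, chunk_b])

# Each field becomes a named DataFrame column.
df = pd.DataFrame(raw)
print(list(df.columns))
print(df.shape)
```

If the arrays were plain 2D float arrays instead, the columns would be numbered 0, 1, 2, ... and would have to be named by hand before indexing by `'weight'`.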

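Next, we shuffle the data. To avoid getting wrecked by np.random.seed (shuffling the two frames separately only keeps them in sync if the seed is reset in between), generate a single permutation of the data set's length first and apply it to both frames.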

    perm = np.random.permutation(setsize)
    dfx = dfx.iloc[perm]
    dfy = dfy.iloc[perm]
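To check that a shared permutation keeps features and labels paired, here is a small sketch on toy frames. The column names and the use of `np.random.default_rng` are assumptions for the example, not part of the original workflow:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data: the feature is always 1.5 * label, so any mismatch is easy to spot.
dfx = pd.DataFrame({"feature": np.arange(10) * 1.5})
dfy = pd.DataFrame({"label": np.arange(10)})

# One permutation, applied to both frames.
perm = rng.permutation(len(dfx))
sx = dfx.iloc[perm]
sy = dfy.iloc[perm]

# The original row index travels with each row, so the two frames still line up.
aligned = bool((sx.index == sy.index).all())
print(aligned)
```

Because `iloc[perm]` reorders rows by position, the original index labels are kept, which makes it easy to confirm afterwards that both frames went through the same shuffle.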

Then you may need to pull out and drop some columns (for us, the weight column) for separate use in XGBoost:

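    weight = dfx['weight']
    dfx = dfx.drop('weight', axis=1)

    # Assumption: binaryy is the 0/1 label array. The post does not show how
    # it is built, so here it is taken straight from dfy.
    binaryy = dfy.values.ravel()

    # separate into train and test set (first 70% / last 30% of the shuffled rows)
    ntrain = int(setsize * 0.7)
    ntest = int(setsize * 0.3)
    weight_train = weight.head(ntrain)
    weight_test = weight.tail(ntest)
    trainx = dfx.head(ntrain)
    testx = dfx.tail(ntest)
    trainy = binaryy[:ntrain]
    testy = binaryy[-ntest:]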
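The head/tail split relies on `int()` truncation, so the two pieces never overlap, but a row in the middle can end up in neither set. A small sketch with a toy frame of 11 rows (the size and column name are chosen only for illustration):

```python
import numpy as np
import pandas as pd

setsize = 11
df = pd.DataFrame({"x": np.arange(setsize)})

# Same sizing rule as the split above: truncate 70% and 30% to integers.
ntrain = int(setsize * 0.7)
ntest = int(setsize * 0.3)

train = df.head(ntrain)
test = df.tail(ntest)

# Rows shared by both sets, and rows that landed in neither.
overlap = set(train.index) & set(test.index)
unused = setsize - ntrain - ntest
print(ntrain, ntest, len(overlap), unused)
```

Losing a single event is harmless at this scale of data, but if every row must be used, slicing once at `ntrain` (`df.iloc[:ntrain]` and `df.iloc[ntrain:]`) avoids the gap.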

And then you're basically good to go!

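    import xgboost as xgb

    # Wrap features, labels and (absolute) event weights for XGBoost.
    dtrain = xgb.DMatrix(trainx.values, label=trainy, weight=np.abs(weight_train.values))
    dtest = xgb.DMatrix(testx.values, label=testy, weight=np.abs(weight_test.values))

    # Note: with several entries, early stopping watches the last one ('train' here).
    evallist = [(dtest, 'eval'), (dtrain, 'train')]
    num_round = 700

    param = {}
    param['objective'] = 'binary:logistic'
    param['eta'] = 0.05
    param['max_depth'] = 4
    param['silent'] = 1  # newer XGBoost releases use 'verbosity' instead
    param['nthread'] = 12
    param['eval_metric'] = "auc"
    param['subsample'] = 0.6
    param['colsample_bytree'] = 0.5

    # Pass the dict itself; it carries the same settings as param.items().
    bst = xgb.train(param, dtrain, num_round, evallist, early_stopping_rounds=200)
    bst.save_model('./001.model')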