Figure Source

Our use case is to run C2Numpy inside ROOT and then tackle a particle classification problem with XGBoost and pandas.

Our data set has 6 kinematic variables (jet pt, dilepton mass, etc.), plus a label set that tells us, from Monte Carlo, whether the particle comes from an interesting process or not. We also have a weight array, because events that come from Monte Carlo need to be weighted.

Assuming we have prepared these accordingly in .npy files, we can load them and put the data into a pandas.DataFrame:

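import glob
import numpy as np
import pandas as pd

# sort both lists so that the x and y files are paired up in the same order
xfiles = sorted(glob.glob("./xdata_*.npy"))
yfiles = sorted(glob.glob("./ydata_*.npy"))
xarrays = [np.load(f) for f in xfiles]
rawdata = np.concatenate(xarrays)
yarrays = [np.load(f) for f in yfiles]
rawydata = np.concatenate(yarrays)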

dfx = pd.DataFrame(rawdata)
dfy = pd.DataFrame(rawydata)
setsize = rawydata.shape[0]
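
The `dfx['weight']` lookup used further down only works if the columns carry names. A minimal sketch of how that can come about, assuming the .npy files hold NumPy structured arrays (the field names and values here are hypothetical, chosen only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical field names; the real ones are whatever was written to the .npy files.
dtype = [("jet_pt", "f8"), ("dilepton_mass", "f8"), ("weight", "f8")]
toy = np.array([(45.2, 91.0, 1.0), (30.1, 88.5, -0.5)], dtype=dtype)

# A structured array turns into a DataFrame with one named column per field.
toy_df = pd.DataFrame(toy)
columns = list(toy_df.columns)
second_weight = toy_df["weight"].iloc[1]
```

If the arrays were plain 2-D float arrays instead, the columns would be numbered and would have to be named by hand before selecting `weight`.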

Next, we want to shuffle the data. To avoid getting tripped up by np.random.seed, generate one permutation of the row indices from the length of the data set first, then apply it to both frames so that the rows of dfx and dfy stay aligned.

perm = np.random.permutation(setsize)
dfx = dfx.iloc[perm]
dfy = dfy.iloc[perm]
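
To see why one shared permutation is enough, here is a small sketch on toy data. The frames, column names, and the seed are made up for the illustration, and `np.random.default_rng` is used here only to make the example reproducible; it is not what the snippet above uses.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility

# Toy features and labels where label == feature % 2, so alignment is easy to check.
toy_x = pd.DataFrame({"feature": np.arange(10)})
toy_y = pd.DataFrame({"label": np.arange(10) % 2})

# One permutation, applied to both frames.
perm = rng.permutation(len(toy_x))
shuf_x = toy_x.iloc[perm]
shuf_y = toy_y.iloc[perm]

# Every shuffled row still carries its own label.
aligned = bool((shuf_x["feature"].values % 2 == shuf_y["label"].values).all())
```

Shuffling each frame with its own call would need the random state to be reset in between to get the same order, which is easy to get wrong.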

You may then need to extract and drop columns (for us, the weight column) for separate use in XGBoost:

weight = dfx['weight']
dfx = dfx.drop('weight', axis=1)
# separate into train and test set
ntrain = int(setsize*0.7)
ntest = int(setsize*0.3)
weight_train = weight.head(ntrain)
weight_test = weight.tail(ntest)
trainx = dfx.head(ntrain)
testx = dfx.tail(ntest)
# binaryy is not defined in this snippet; it is assumed to be a 1-D array of
# 0/1 labels built from the shuffled dfy
trainy = binaryy[:ntrain]
testy = binaryy[-ntest:]
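
Because both sizes are rounded down separately, the 70% head and the 30% tail can leave a row unused (with 11 rows they take 7 and 3). One possible alternative, sketched on toy data, is to cut at a single index. The helper `split_at` and the column names are hypothetical and not part of the code above.

```python
import numpy as np
import pandas as pd

def split_at(frame, n_train):
    """Return (first n_train rows, remaining rows) of a DataFrame or Series."""
    return frame.iloc[:n_train], frame.iloc[n_train:]

# Toy frame with 11 rows, so that 70% / 30% does not divide evenly.
toy = pd.DataFrame({"x": np.arange(11), "weight": np.linspace(-1.0, 1.0, 11)})
n_train = int(len(toy) * 0.7)

toy_weight = toy["weight"]
toy_features = toy.drop("weight", axis=1)

train_x, test_x = split_at(toy_features, n_train)
train_w, test_w = split_at(toy_weight, n_train)
```

With a single cut index every row lands in exactly one of the two sets, at the cost of the test set being slightly larger than 30% when the sizes do not divide evenly.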

And then you're basically good to go!

import xgboost as xgb

dtrain = xgb.DMatrix(trainx.values, label=trainy, weight=np.abs(weight_train.values))
dtest = xgb.DMatrix(testx.values, label=testy, weight=np.abs(weight_test.values))
# early stopping watches the last entry of this list, so the test set goes last
evallist = [(dtrain,'train'), (dtest,'eval')]
num_round = 700
param = {}
param['objective'] = 'binary:logistic'
param['eta'] = 0.05
param['max_depth'] = 4
param['silent'] = 1  # newer XGBoost releases use 'verbosity' instead of 'silent'
param['nthread'] = 12
param['eval_metric'] = "auc"
param['subsample'] = 0.6
param['colsample_bytree'] = 0.5
# pass the dict itself; param.items() is a view object in Python 3
bst = xgb.train(param, dtrain, num_round, evallist, early_stopping_rounds=200)