Introduction

If a picture equals a thousand words...

What pictures are hiding in globus logs...

50GB logs per day, 300M log lines

David Raila - Sr. Engineer NCSA Storage Group/Blue Waters

I'm going to talk about

My problems ... I mean challenges...

operating large globus endpoints

the tools I used and how they work

and how you can use them too

From 50GB of

Oct 11 03:58:42 hpss12 globus-gridftp-server[45623]: dsi[../../../source/module/dsi.c:117]dsi_init: GridFTP HPSS DSI 2.4: COMMIT: log level 3
Oct 11 03:58:43 hpss12 globus-gridftp-server[45623]: Transfer stats: DATE=20171011085843.469040 HOST=hpss-md12.ncsa.illinois.edu PROG=globus-gridftp-server NL.EVNT=FTP_INFO START=20171011085842.725508 USER=liska SHARE=0 SHAREE=none FILE="/u/sciteam/liska/project.bady/07042017/TDISKS25A93T45S/dumps515/new_dumpdiag11" BUFFER=87380 BLOCK=262144 NBYTES=311040000 VOLUME=/ STREAMS=1 STRIPES=1 DEST=[141.142.176.66] TYPE=RETR CODE=226 TASKID=97b43ee2-ae0f-11e7-afcf-22000a92523b retrans=0
Oct 11 03:58:45 hpss12 globus-gridftp-server[45623]: Transfer stats: DATE=20171011085845.011820 HOST=hpss-md12.ncsa.illinois.edu PROG=globus-gridftp-server NL.EVNT=FTP_INFO START=20171011085844.313072 USER=liska SHARE=0 SHAREE=none FILE="/u/sciteam/liska/project.bady/07042017/TDISKS25A93T45S/dumps523/new_dumpdiag6" BUFFER=87380 BLOCK=262144 NBYTES=311040000 VOLUME=/ STREAMS=1 STRIPES=1 DEST=[141.142.176.66] TYPE=RETR CODE=226 TASKID=97b43ee2-ae0f-11e7-afcf-22000a92523b retrans=0
Oct 11 03:58:46 hpss12 globus-gridftp-server[45623]: Transfer stats: DATE=20171011085846.564444 HOST=hpss-md12.ncsa.illinois.edu PROG=globus-gridftp-server NL.EVNT=FTP_INFO START=20171011085845.837304 USER=liska SHARE=0 SHAREE=none FILE="/u/sciteam/liska/project.bady/07042017/TDISKS25A93T45S/dumps524/new_dumpdiag6" BUFFER=87380 BLOCK=262144 NBYTES=311040000 VOLUME=/ STREAMS=1 STRIPES=1 DEST=[141.142.176.66] TYPE=RETR CODE=226 TASKID=97b43ee2-ae0f-11e7-afcf-22000a92523b retrans=0
Oct 11 03:58:48 hpss12 globus-gridftp-server[45623]: Transfer stats: DATE=20171011085848.028754 HOST=hpss-md12.ncsa.illinois.edu PROG=globus-gridftp-server NL.EVNT=FTP_INFO START=20171011085847.391681 USER=liska SHARE=0 SHAREE=none FILE="/u/sciteam/liska/project.bady/07042017/TDISKS25A93T45S/dumps529/new_dumpdiag11" BUFFER=87380 BLOCK=262144 NBYTES=311040000 VOLUME=/ STREAMS=1 STRIPES=1 DEST=[141.142.176.66] TYPE=RETR CODE=226 TASKID=97b43ee2-ae0f-11e7-afcf-22000a92523b retrans=0
Oct 11 03:58:49 hpss12 globus-gridftp-server[45623]: Transfer stats: DATE=20171011085849.599464 HOST=hpss-md12.ncsa.illinois.edu PROG=globus-gridftp-server NL.EVNT=FTP_INFO START=20171011085848.890210 USER=liska SHARE=0 SHAREE=none FILE="/u/sciteam/liska/project.bady/07042017/TDISKS25A93T45S/dumps529/new_dumpdiag12" BUFFER=87380 BLOCK=262144 NBYTES=311040000 VOLUME=/ STREAMS=1 STRIPES=1 DEST=[141.142.176.66] TYPE=RETR CODE=226 TASKID=97b43ee2-ae0f-11e7-afcf-22000a92523b retrans=0
Oct 11 03:58:50 hpss12 xinetd[11582]: EXIT: gsiftp status=0 pid=45623 duration=8(sec)
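
Each "Transfer stats" record is a flat run of KEY=VALUE fields, so it can be split mechanically. A minimal Python sketch of that idea, assuming DATE is the completion timestamp and START the start timestamp (the log itself does not say so); the function names are hypothetical:

```python
import re
from datetime import datetime

# KEY=VALUE pairs; values are either double-quoted (FILE="...") or bare tokens.
FIELD_RE = re.compile(r'([A-Za-z][\w.]*)=("[^"]*"|\S*)')
TS_FORMAT = "%Y%m%d%H%M%S.%f"  # e.g. 20171011085843.469040


def parse_transfer_stats(line):
    """Return the KEY=VALUE fields of one 'Transfer stats' line, else None."""
    if "Transfer stats:" not in line:
        return None
    return {key: value.strip('"') for key, value in FIELD_RE.findall(line)}


def transfer_summary(fields):
    """Duration in seconds and throughput in bytes/second for one transfer."""
    # Assumption: DATE marks the end of the transfer and START its beginning.
    end = datetime.strptime(fields["DATE"], TS_FORMAT)
    start = datetime.strptime(fields["START"], TS_FORMAT)
    seconds = (end - start).total_seconds()
    nbytes = int(fields["NBYTES"])
    return seconds, (nbytes / seconds if seconds > 0 else 0.0)


# One of the sample log lines, copied verbatim.
line = (
    'Oct 11 03:58:43 hpss12 globus-gridftp-server[45623]: Transfer stats: '
    'DATE=20171011085843.469040 HOST=hpss-md12.ncsa.illinois.edu '
    'PROG=globus-gridftp-server NL.EVNT=FTP_INFO START=20171011085842.725508 '
    'USER=liska SHARE=0 SHAREE=none '
    'FILE="/u/sciteam/liska/project.bady/07042017/TDISKS25A93T45S/dumps515/new_dumpdiag11" '
    'BUFFER=87380 BLOCK=262144 NBYTES=311040000 VOLUME=/ STREAMS=1 STRIPES=1 '
    'DEST=[141.142.176.66] TYPE=RETR CODE=226 '
    'TASKID=97b43ee2-ae0f-11e7-afcf-22000a92523b retrans=0'
)

fields = parse_transfer_stats(line)
seconds, bytes_per_second = transfer_summary(fields)
print(fields["USER"], fields["TYPE"], f"{seconds:.3f}s", f"{bytes_per_second / 1e6:.0f} MB/s")
```

This works for one line at a time; at 300M lines per day, the per-line cost is why a streaming scraper is a better fit than ad hoc scripts.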

Blue Waters challenges

~ 30PB online storage

~ 300PB nearline storage

80 endpoint servers

>3 Tbps aggregate bandwidth

>1000 users, 10-100 TB per day

Operational Challenges

Users - transfer performance questions

Collaborators - performance investigations

Scale - monitoring 80 nodes @ 40Gbps is hard

Searching/correlating 50GB of logs - intractable

Drowning in production - no time to improve

Metrics could help?

Collect higher-order operational data

See what the system is doing visually

Enable Dashboards and automated alerting

Enable investigations into problems

Initial Results

More powerful than expected

Able to monitor in detail

Enable Dashboards and alerts

Enable investigations into problems

Became a daily-use go-to tool

Understanding Metrics/Logs

Logs are an exact account of discrete events

- A bank statement is a log, precise/auditable

Metrics are sampled indicators of performance

- A credit score is a metric, an approximate indicator

Logs and Metrics are complementary - use both

Monitoring Strategies

Resource metrics - apply the USE method

Utilization, Saturation, Error-rate

Service metrics - apply the RED method

Request-rate, Error-rate, Duration-of-request
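
As an illustration of RED applied to transfers, here is a minimal Python sketch. The `Transfer` record and `red_summary` function are hypothetical, and treating any non-2xx FTP reply code as an error is an assumption; the error rate is expressed as a fraction of requests rather than errors per second.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    start: float  # seconds since some epoch
    end: float    # seconds since the same epoch
    code: int     # FTP reply code, e.g. 226 on a completed transfer


def red_summary(transfers, window_seconds):
    """Request rate, error rate and mean duration over one time window."""
    total = len(transfers)
    if total == 0:
        return {"request_rate": 0.0, "error_rate": 0.0, "mean_duration": 0.0}
    # Assumption: any reply code outside 2xx counts as an error.
    errors = sum(1 for t in transfers if not 200 <= t.code < 300)
    durations = [t.end - t.start for t in transfers]
    return {
        "request_rate": total / window_seconds,  # requests per second
        "error_rate": errors / total,            # fraction of requests
        "mean_duration": sum(durations) / total, # seconds per request
    }


window = [
    Transfer(0.0, 1.0, 226),
    Transfer(1.0, 3.0, 226),
    Transfer(2.0, 3.0, 451),
    Transfer(4.0, 8.0, 226),
]
summary = red_summary(window, window_seconds=60)
print(summary)
```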

Practical Tooling - Prometheus

Best of breed metrics system

Cloud Native Computing Foundation project

Golang implementation - efficient and easy

Single binary, no packaging or dependencies

Simple yaml configuration - 2 file deployment

Prometheus Components

Prometheus - Time Series Database + Query Language

Collectors - Tiny, specific, metrics collectors

Hundreds available: OSes, databases, filesystems, languages, logs ...

Grafana - Graphical Interface for navigation/drill-downs

Alert Manager - Automated alerts based on metrics

Primarily tiny, tight, efficient Go apps

Prometheus + globus

You can observe a lot by just watching ...

A few metrics can provide a LOT of insight

Per DTN and summary endpoint performance

Per-transfer/per-user information

Some errors

Set

Need to count globus events

No globus metrics endpoint, use syslogs

Mtail - a syslog scraper

Regexp line parsing

Converts fields to metrics/dimensions

gridftp.mtail

const M_GRIDFTP_USER /USER=/ + /(?P<gridftp_user>/ + P_WORD + /)/
const M_GRIDFTP_TASKID /TASKID=/ + /(?P<gridftp_taskid>/ + P_UUID + /)/
const M_GRIDFTP_STATS /Transfer stats:/ + P_SP + M_GRIDFTP_DATE + P_ANY + M_GRIDFTP_START_TIME + P_SP + M_GRIDFTP_USER + P_ANY + M_GRIDFTP_BUFFER + P_SP + M_GRIDFTP_BLOCK + P_SP + M_GRIDFTP_NBYTES + P_ANY + M_GRIDFTP_TYPE + P_ANY +  M_GRIDFTP_TASKID + P_ENDLN
...

counter gridftp_transfer_type_sum by gridftp_type
counter gridftp_xfer_count by hostname, gridftp_pid, gridftp_user, gridftp_taskid, gridftp_type
counter gridftp_xfer_bytes_count by hostname, gridftp_user, gridftp_taskid, gridftp_type
 ...

// + M_GRIDFTP_STATS {
	debug["gridftp_xferstats_count"]++
	gridftp_transfer_type_sum[$gridftp_type]++
	gridftp_transfers_sum++
	gridftp_bytes_transferred_sum[$gridftp_type] += $gridftp_nbytes
}
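
For readers who do not know mtail, the same pattern (regexp match, named groups as dimensions, counters incremented per line) can be sketched in Python. This is only an analogue for illustration, not how mtail is implemented: `STATS_RE` is a simplified stand-in for M_GRIDFTP_STATS, and `observe` and `exposition` are hypothetical names. The output follows the Prometheus text exposition format.

```python
import re
from collections import defaultdict

# Simplified stand-in for M_GRIDFTP_STATS: only the fields used below.
STATS_RE = re.compile(
    r"Transfer stats:.*?USER=(?P<gridftp_user>\w+)"
    r".*?NBYTES=(?P<gridftp_nbytes>\d+)"
    r".*?TYPE=(?P<gridftp_type>\w+)"
)

# Counters keyed by the gridftp_type dimension.
transfer_type_sum = defaultdict(int)
bytes_transferred_sum = defaultdict(int)


def observe(line):
    """Update the counters from one log line; return True if it matched."""
    m = STATS_RE.search(line)
    if m is None:
        return False
    transfer_type_sum[m["gridftp_type"]] += 1
    bytes_transferred_sum[m["gridftp_type"]] += int(m["gridftp_nbytes"])
    return True


def exposition(name, counter):
    """Render one labelled counter in the Prometheus text format."""
    lines = [f"# TYPE {name} counter"]
    for label, value in sorted(counter.items()):
        lines.append(f'{name}{{gridftp_type="{label}"}} {value}')
    return "\n".join(lines)
```

Feeding each syslog line to `observe` and serving `exposition(...)` over HTTP is, roughly, the job mtail does as a collector.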

	

Setup - Prometheus

So simple - url to mtail(s)

Scrape globus collector(s) every 15s

Gives ~30 sec visibility (Nyquist: two samples per resolvable period)
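
Why 30 seconds: a rate needs at least two counter samples, so with a 15 s scrape interval the shortest usable window is 2 x 15 s. A simplified Python sketch of a counter rate, assuming (timestamp, value) samples; this is not Prometheus's actual `rate()` implementation, which also extrapolates to the window edges.

```python
def counter_rate(samples):
    """Per-second rate from a list of (timestamp, value) counter samples.

    Returns None when fewer than two samples are available, which is why
    a 15 s scrape interval gives roughly 30 s visibility.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # A drop means the counter was reset (e.g. the collector
            # restarted); as a simplification, count the new value from zero.
            increase += cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Three scrapes, 15 s apart.
scrapes = [(0, 100), (15, 160), (30, 250)]
print(counter_rate(scrapes))
```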

Setup - Grafana

Tell it the Prometheus URL

Drag/drop tools onto screens

Add some queries to get the data

Save/share the elements/dashboard
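
The queries are PromQL expressions. As a hedged example, the sketch below builds a request URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The host and port (Prometheus's default 9090 on localhost) are assumptions, and the metric and label names are taken from the gridftp.mtail counters, assuming they are exported under those names.

```python
from urllib.parse import urlencode

# Assumption: Prometheus is listening on its default port on this host.
PROMETHEUS_URL = "http://localhost:9090"


def instant_query_url(promql, base_url=PROMETHEUS_URL):
    """URL for a Prometheus HTTP API instant query of the given expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


# Transfers per second by transfer type, averaged over the last minute.
TRANSFER_RATE = "sum by (gridftp_type) (rate(gridftp_transfer_type_sum[1m]))"

url = instant_query_url(TRANSFER_RATE)
print(url)
```

In Grafana the same PromQL expression would go directly into a panel's query field; the URL form is mainly useful for checking a query from a script.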

Experiences and recommendations

Scraping logs is OK

Limited by what's logged

Detailed GCS metrics would be powerful

Will be very important - move to components

Other NCSA GO work

GO + Jenkins workflow performance monitor

Verify GO functionality/performance with a realistic test

Jenkins manages the workflow

Using GO python SDK script components

Create Jira issues automatically with GO task/event details

Catches improper use/abuse of BW Globus endpoints

Thank You - Resources

https://github.com/ncsa

https://prometheus.io/

https://github.com/google/mtail

https://github.com/ncsa/endpoint_task_monitor

https://github.com/ncsa/endpoint_task_errors

Slides: https://davidraila.github.io/gw18