Condor week summary
'''Condor week 2012,
UW-Madison,
May 1 - May 5, 2012'''

'''Attendees:'''
John-Paul Robinson,
Poornima Pochana,
Thomas Anthony
'''Website:''' http://research.cs.wisc.edu/condor/CondorWeek2012/

Condor Week is a four-day annual event that gives collaborators and users the chance to exchange ideas and experiences, to learn about the latest research, and to influence the project's short- and long-term research and development directions.
==Day 1: Tutorials==
===Basic Introduction to using Condor: Karen Miller===
Background: HTC
Definitions: Job, ClassAds, Match Making, Central Manager, Submit Host, Execute Host
What Condor does: on submit, Condor bundles up the executable and input files, locates a machine, runs the job, and gets the output back to the submit host.
Requirements (needs), Rank (preferences)
Condor ClassAds: used to describe aspects of each item outside Condor.
Job ClassAd:
Machine ClassAd:
Match making: requirements, rank and priorities (fair-share allocation)
Getting started:
universe, make the job batch-ready, submit file, condor_submit
Universe - environment
Batch-ready - runs without interaction (as if in the background); make input, output and data files available
Submit description file - # comments; commands on the left are not case sensitive, filenames are
'''Good advice: always have a log file'''
File transfer: Transfer_Input_Files, Transfer_Output_Files
Should_Transfer_Files: YES (no shared file system), NO (use shared FS), IF_NEEDED
Emails: NOTIFICATION = complete, never, error, always
Job identifier: cluster.process, e.g. 20.1, 20.2, etc.
Multiple jobs: create directories (based on the process id)
InitialDir=run_0, run_1, etc.
Queue all 1,000,000 jobs:
Queue 100000
$(Process)
Use macro substitution: $(Process) gives the process id
InitialDir=run_$(Process)
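Putting these pieces together, a minimal submit description file might look like the following sketch (the executable and file names are hypothetical):
<pre>
universe                = vanilla
executable              = analyze
arguments               = input_$(Process).dat
transfer_input_files    = input_$(Process).dat
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
initialdir              = run_$(Process)
log                     = job.log
output                  = job.out
error                   = job.err
notification            = Error
queue 100
</pre>
Submitting this with condor_submit creates one cluster with 100 processes, and the log file ties each cluster.process identifier back to its run directory.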
===Condor and Workflows: Nathan Panike===
Introduction: workflows - a sequence of connected steps
launch and forget
DAGMan - dependencies define the possible order of job execution
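A minimal sketch of a DAGMan input file, assuming two submit files a.sub and b.sub (hypothetical names) where job B may only start after job A completes:
<pre>
# diamond.dag - B depends on A
JOB A a.sub
JOB B b.sub
PARENT A CHILD B
</pre>
It is submitted with condor_submit_dag diamond.dag, after which DAGMan itself runs as a Condor job and can be left alone ("launch and forget") until the whole workflow finishes.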
===Pegasus - A system to run, manage and debug complex workflows on top of Condor: Karan Vahi===
Scientific workflows: larger monolithic applications broken into smaller jobs
'''Why workflows:''' portable, scalable, reuse, reproduce, WMS recovery
'''Pegasus:''' local desktop, local Condor pool, campus cluster
Pegasus GUI
'''Mapping:'''
Workflow monitoring: SQLite and MySQL, Python API to query; transfers executables as part of the workflow
===Basic Condor Administration: Alan De Smet===
'''Starting jobs:'''
condor_master - on all machines (starts the other processes)
'''Central manager:''' master, negotiator, collector
'''Collector:''' daemon that knows about the other daemons
'''Submit machine:''' master, schedd
schedd ---> shadow
'''Compute machine:''' master, startd
startd ---> starter ---> launches the job
condor_compile ---> calls the Condor syscall library
<pre>
***** configuration file *****
/etc/condor/condor_config
LOCAL_CONFIG_FILE (comma-separated list)
a long entry ending in \ splits across multiple lines
</pre>
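For illustration, a condor_config fragment along these lines (file names hypothetical) shows both the comma-separated local config list and line continuation with a trailing backslash:
<pre>
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local, /etc/condor/policy.conf
START = ( (KeyboardIdle > 15 * 60) && \
          (LoadAvg < 0.3) )
</pre>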
'''Policy'''
specified in condor_config
ends up in the slot ClassAd
'''Machine''' - one computer, managed by one startd
START policy:
RANK - floating point, larger numbers are higher ranked
Suspend and continue:
Preempt (polite), Kill (SIGKILL)
Slot states:
Custom slot attributes:
dynamic attribute settings: STARTD_CRON_*
'''Job priorities'''
condor_userprio: a lower number means more machines
real priority and priority factor:
priority factor - default is 1, assigned on a per-user basis
PREEMPTION_REQUIREMENTS = False (no preemption)
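Pulling a few of these knobs together, a minimal policy sketch in condor_config might look like the following; the thresholds are made up for illustration and the user name is hypothetical (PREEMPTION_REQUIREMENTS is a negotiator-side setting on the central manager):
<pre>
# Only start jobs when the machine looks idle
START        = (KeyboardIdle > 300) && (LoadAvg < 0.5)
# Prefer jobs from a particular user (larger RANK wins)
RANK         = (Owner == "jdoe") * 10
# Suspend rather than kill when the owner comes back
WANT_SUSPEND = TRUE
SUSPEND      = (KeyboardIdle < 60)
CONTINUE     = (KeyboardIdle > 300)
# Negotiator-side: disable preemption entirely
PREEMPTION_REQUIREMENTS = False
</pre>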
'''Tools'''
<pre>
condor_config_val
condor_config_val -v CONDOR_HOST
condor_config_val -config
condor_status -master
condor_status -long         (everything)
condor_status -format '%s ' Arch -format '%s\n' OpSys
  -constraint, -format
condor_q -analyze
Debug level: D_FULLDEBUG, D_COMMAND
</pre>
===Security: Lockdown a Condor Pool: Zach Miller===
Trust, authentication > authorization
machines, users
schedd - daemon-to-daemon authentication
Pool password - hashed (Unix), registry (Windows)
condor_store_cred -c add
SSL instead of pool password
Condor map file --> maps to a specific canonical name
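As a rough sketch (not the full lockdown recipe from the talk), pool-password authentication between daemons is typically wired up with settings along these lines, with the password itself stored via condor_store_cred -c add:
<pre>
SEC_PASSWORD_FILE                 = /etc/condor/pool_password
SEC_DAEMON_AUTHENTICATION         = REQUIRED
SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
SEC_DAEMON_INTEGRITY              = REQUIRED
ALLOW_DAEMON                      = condor_pool@*
</pre>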
===Condor High Availability: Rob Rati / Will Benton (Red Hat)===
Master-based high availability: Wallaby project, cluster suite (Red Hat)
===Remote Condor: Jeff Dost (UCSD)===
What is remote Condor:
Condor over ssh, available in Condor contrib as RCONDOR
authentication using ssh, no local servers, wifi friendly, easy
Install: get the source tarball, make and install
install sshfs
rcondor_config
===Configuration of Partitionable Slots: Greg Thain===
Out of the box: 8 cores, 8GB (1GB/core); the default does not work well because jobs have different memory requirements
Static configurations work but are still not ideal (wait for the entire machine to be free, then reinitialize the entire machine as a single slot)
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=WholeMachineSlots
Partitionable slots: introduced in v7.2
8 cores and 8GB can be partitioned as, for example, 1 core + 4GB with the remaining 4GB distributed among the other 7 cores (585.14 MB each), or any other way the user wants to partition the slot.
The slots are defragmented occasionally so that jobs with different requirements can run.
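A minimal sketch, assuming a single partitionable slot covering the whole machine; the request_* values in the submit file are whatever each job actually needs:
<pre>
# condor_config on the execute node
NUM_SLOTS_TYPE_1          = 1
SLOT_TYPE_1               = cpus=100%, memory=100%, disk=100%
SLOT_TYPE_1_PARTITIONABLE = TRUE

# submit description file
request_cpus   = 1
request_memory = 4096
request_disk   = 1048576
</pre>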
===Condor Statistics on your Submit Node: TJ Knoeller, Ken Hanh, Becky Gietzel===
condor_status -direct name -schedd
-statistics schedd:2
Ganglia plugin
==Day 2: Talks==
Day 2 talks focused on use cases of Condor at different institutions, research labs, and companies, and the specific implementations behind them.
===Session 1===
* Brookhaven National Lab: Virtualization
* UAB: Pilot
* Syracuse University: Virtualized desktop grid (Condor VM coordinator, runs as a non-privileged user, distrust of 3rd-party applications i.e. Condor, uses the MS task scheduler)
* RENCI: Condor in networked clouds (ORCA - Open Resource Control Architecture)
===Session 2===
* Red Hat: Red Hat and the Condor developer community (MRG - Messaging, Realtime, Grid)
* The Hartford: GPU computing with Condor (250 M2070s, two pools, Windows, ~7000 cores, actuarial and financial modelling, everything written in CUDA in-house, 40x-60x improvement)
* Pacific Life: MoSes on Condor (actuarial, insurance statistics modelling, no GPUs yet, excess computing moved to Amazon EC2)
* Aptina: Condor in a 24x7 manufacturing environment
===Session 3===
* UW-Madison: Deep Linguistic Processing with Condor (ad-hoc MapReduce cluster, daily crawl of millions of webpages using Condor, making logical relations)
* U.S. Geological Survey: Hydrological modelling
* UW-Madison: Machine design optimization
* UW-Madison: JMAG and Condor (multiphysics)
* OSMOSIX: Condor-integrated cloud bursting (brain imaging applications)
===Session 4===
* MIT: MW for mixed-integer nonlinear problems (Master-Worker, branch and bound, Coupe - Couenne parallel extensions)
* Notre Dame: Compiling and linking workflows (student of Douglas Thain, working off Makeflow and Weaver; Abstraction: simple programming interface that hides details of the distributed system; Workflows: exploit natural parallelism, large applications, abstractions, DAGs)
* Information Sciences Institute: Pegasus for large workflows
* Fermilab: GlideinWMS in the cloud
* U. Chicago: Cloud-based services
* U. of Buffalo: Stork - cloud-hosted data transfer optimization
==Day 3: Talks==
===Session 1===
* Fermilab: FermiCloud - dynamic resource provisioning (high availability, OpenNebula cloud 2.0, command-line launch; virtual machines distributed via SCP; typical VM: 1 virtual CPU, HT, 2GB RAM; Nagios and RSV-based monitoring; Gratia accounting reports)
* U. Chicago: UC3 - a framework for cooperative computing
* Argonne National Lab: BOSCO (campus grids, OSG, 100 campuses; BOSCO connects to a local cluster, Condor pool, OSG, Engage, CHTC, etc.; download, untar, install, cluster_add, cluster_test; beta testers needed)
* Fermilab: OSG (background, resources, provisioning, opportunistic use, HTC in parallel)
===Session 2===
* Cycle Computing: Cycle Computing (Condor in the cloud, 50,000-core cluster, commercial use)
* Morgridge Institute for Research: NGS (genome sequencing, interest in Galaxy)
* Eli Lilly: Condor and Eli Lilly (use case)
* Embraer: Condor at Embraer (use case, multiphysics, additional scripts)
===Session 3===
* UCSD: CMS - move requirements out of users' hands
* Dreamworks: Condor at Dreamworks (Condor render farms; each film ~5 yrs, 65 million CPU hours, 200TB of data, 500+ million files)
* Universitat Autonoma de Barcelona: Common programming mistakes (buffer overflow, numeric errors, race conditions, exceptions, too much information, directory traversal, SQL injection, command injection, code injection, web attacks)
* Newcastle University: Simulating Condor (modelling and simulation of Condor: energy saved, throughput, etc.)
* Condor Project: Love Preemption
===Session 4===
* Purdue: BLAST and bioinformatics on DiaGrid (use case)
* U. Nebraska-Lincoln: Putting Condor in a container (cgroups, namespaces)
* Cisco: Automated provisioning (Cisco UCS - Unified Computing System)
* Condor Project: Use of Cisco UCS
* Condor Project: What's new and what's coming (v7.8, dynamic partitionable slots, statistics, IPv6, support for SGE, Globus, resource management)
==Day 4: Discussion Panels==
===Panel 1: Trust but verify===
Monitoring the pool, scheduling policy, health, botnets?
Machines reboot each day and are checked against the software image they are supposed to run; they are reimaged automatically if the image is not exact.
Hash of the operating system.
Hold jobs after too many restarts.
Cloud execute nodes are fresh nodes.
Make the job not run on the host it failed on last (Requirements); see the sketch after these notes.
Chirp command: to write back to the logs.
Exploratory activity into jobs, to see if everything is running fine with the pool.
Botnets: shared port daemon.
Condor security mechanisms: slot users, privilege separation, limit the number of services on the machine, a lot of watching.
Insulate the users and only run blessed applications; Windows without shared file systems.
Password authentication between nodes, encryption between nodes.
Monitor the thing you provision.
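One way to express the "do not rerun on the host that just failed" idea above is a submit-file requirement along these lines (a sketch, not necessarily the exact expression discussed on the panel):
<pre>
requirements = (Machine =!= LastRemoteHost)
</pre>
The =!= operator handles the first run correctly, since LastRemoteHost is undefined until the job has executed somewhere.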
===Panel 2: Feedback Forum===
How-to recipes on the Wiki
Release cycle: 1 developer release, 1 stable release
===Panel 3: Growing your Condor Pool===
Strategies and challenges for growing the size of your Condor installation
Growing the Condor pool: flocking is the best bet, BOSCO, CMS
Advertise pool utilization and the work done / hours used
Carrot-and-stick approach for soliciting participation at your organization