Querying service contracts UCSD Technical Report Elio Damaggio UC San Diego [email protected]

Alin Deutsch UC San Diego [email protected]

Dayou Zhou UC San Diego [email protected]

April 1, 2010 Abstract Considering a broad definition for service contracts (beyond web services and software, e.g. airline tickets and insurance policies), we tackle the challenges of building a high performance broker in which contracts are both specified and queried through their temporal behavior. The temporal dimension, in conjunction with traditional relational attributes, enables our system to better address difficulties arising from the great deal of information regarding the temporal interaction of the various events cited in contracts (e.g. ”No refunds are allowed after a reschedule of the flight, which can be requested only before any flight leg has been used”). On the other hand, querying big repositories of temporal specifications poses many interesting challenges. In this paper, we introduce two distinct and complementary optimization techniques that enable our system to scale the evaluation of a novel and theoretically sound notion of permission of a temporal property by a service contract. Our notion of permission is inspired by previous work on model checking but, given the specific characteristic of our problem, does not reduce to it. We evaluate experimentally our implementation, showing that it scales well with both the number and the complexity of the contracts.

1

Introduction

Service contracts are seldom completely represented by a set of predefined attributes and as such are usually modeled only partially in IT systems. As an example, consider the market of airfares. Every plane ticket is actually a complex contract with dozens of conditions regarding validity, refundability and changeability, among others. Some fixed categories exist (e.g. economy tickets, business, first class), but they do not express all the subtleties of the contracts that the customers are required to sign. Any nonstandard travel arrangements require sifting through the full text of the contracts of the fares returned by such brokers as Expedia, Orbitz and Travelocity, or asking an expert travel agent. In this work, we envision a contract brokering system that, in addition to standard relational attributes, allows contracts to be specified and queried by their temporal behavior. For illustration purposes, we introduce now a running example in the airfares scenario. Airlines, as contract providers, register all their fares on our brokering system. Each airfare specifies its temporal aspects, in addition to the usual relational attributes (e.g. baggage conditions, prices). These aspects involve how tickets are changed (if and when it is possible), how the customers can be refunded and so on. A possible query could look for: the cheapest fare on 2/19/2010 from San Diego to New York, that allows a partial ticket refund after the first leg has been used. In order to answer this query, our broker would first retrieve the list of fares available from San Diego to New York, along with their prices, using a standard DBMS. It would then use the temporal specifications to filter the fares that do not satisfy the temporal portion of the query (e.g. ‘... that allows a date change even in the case of a missed flight’). In this work, we will focus on the temporal aspect of our system. Finding the appropriate representation for the full variety of possible interactions allowed by the contracts is not trivial. Let us consider a customer’s specific requirement regarding the refundability options of a partially used ticket. It would be easy to add an attribute ‘refundable’ with a set of predefined values that code the various situations: no refunds, only full refunds, partial refunds allowed. The problem lies in the interaction of these categories with other ticket features like changeability, e.g. 1

a refund might not be available for a ticket that was changed. Moreover, both aspects interact with the various phases of the trip (e.g. before or after the first flight leg was completed). In principle, every new aspect can interact with all the others, leading to an exponential blow up of the number of possible interaction ‘categories’ (e.g. ‘partial refunds allowed only if requested before the date of the first flight leg’, ‘partial refunds allowed even if the first leg is missed’). These categories then become too numerous to specify as schema attributes. Moreover, they are too hard to define and specify for the contract providers and, even more importantly, hard for the consumer to understand. The alternative we propose is to consider a common vocabulary of events that is small enough to overview and will act as the interface between providers and customers. This vocabulary refers to a set of familiar events pertaining to the application domain, and is much more compact than the rich family of interaction categories arising from the various sequences of these events. This family is never explicitly listed. Instead, when interested in a particular sequence of events, customers formulate an ad-hoc query against the vocabulary, detailing the desired sequence of events. Each contract specifies in a similar way the allowed sequences of events. So the query simply returns all contracts that allow the sequence required by the query. We say that such contracts permit the query. Example 1 We use the following event vocabulary to crudely model single-trip flights: purchase (ticket purchased), use (ticket used), missedFlight (customer missed the flight), refund (customer is refunded for the ticket), dateChange (the flight is rescheduled). An airfare that allows flight reschedulings will contain sequences in which this happens, but also the ones in which the ticket is used in the originally scheduled date: 1. (purchase) 99K (use) 99K ... 2. (purchase) 99K (dateChange) 99K (use) 99K ... 3. ... A customer looking for a ticket that allows reschedules even after a missed flight, would think of sequences of the form: (purchase) 99K (missedFlight) 99K (dateChange) 99K ... We believe that thinking of contracts as a set of allowed sequences provides a simple way to describe and query their behavior. However, it would be impractical and error-prone to specify airfares by listing explicitly the set of all allowed sequences. The same observation holds for queries, as a customer may only wish to focus on the temporal relationship of two events of interest, not caring how the other events relate to them on the timeline. The customer in Example 1, for instance, is interested in the events missedFlight and dateChange. She does not care about refunds that might happen after the date change or the class upgrade that may happen for the first leg of the flight. Given these observations, we define contracts and queries in a declarative way. Analogously to the real world, contracts are specified as a set of declarative clauses, each of them specifying a required property of the allowed sequences of events. Intuitively, the allowed sequences of events of a contract will be the set of all sequences that simultaneously satisfy all contract clauses. With the same intuition we will use declarative clauses also to specify properties required by the customer at query time. Example 2 Using the vocabulary of Example 1 we show here three different ticket offers, which differ in their policies regarding refunds and changeability. Ticket A

1. No refunds are allowed after date changes

2. Unlimited date changes Ticket B

1. Refunds always allowed

2. Date changes only before the scheduled departure Ticket C

1. No refunds are allowed

2. Only one date change is allowed 3. Date changes only before the scheduled departure 2

We also state some clauses that are valid for all contracts in our domain and that model the intuitive meaning of the events in the vocabulary. C1 The ticket is purchased once. C2 The ticket has to be purchased before any of the following events: it is used, the flight missed or rescheduled, or the customer is refunded. C3 If the flight is missed, the ticket is unusable unless rescheduled C4 If a refund is issued, then no other event can happen C5 If the ticket is used, then no other event can happen The above clauses identify the sequences of events allowed by each ticket. The idea is that a customer might ask: Q1 Is there a possible sequence of events that after a missed flight contains either a refund or a date change? Intuitively, the system should return Ticket A (because it would allow a reschedule) and Ticket B (that would allow a refund). It should not return Ticket C because it does not allow any refund nor rescheduling after a missed flight. The reader should not be tempted to fall back on the usual categorization scheme (e.g. ’no reschedule after misses’, ’reschedules allowed’), which is always possible for simple examples like this one. Roundtrip tickets, multiple legs flights and class upgrades would all contribute to the proliferation of categories, which would contain unmanageable entries such as ’reschedules cannot happen on different legs of the same flight’ or ’if the first flight is used, a refund can be issued only if the second flight is not missed’. We now summarize our problem setting: - We focus on the behavioral aspect of querying contracts, so we assume a traditional DBMS takes care of the features modeled as relational attributes. - A contract describes what is allowed and what is not, so it represents a set of allowed sequences of events. - Contracts are specified with declarative clauses (i.e. properties of sequences of events). - A contract permits a query (also specified as a declarative clause) if it contains a sequence satisfying the query, - Our system stores the temporal specification of all contracts and has to provide a scalable and high performance way to retrieve all contracts that permit an online temporal query. Solving this problem involved the following contributions: • We model an intuitive notion of permission with respect to service contracts and queries, which captures temporal behavior aspects of services that are not handled currently by state-of-the-art systems. Our notion gracefully handles the fact that contracts might not specify the behavior of all events mentioned in the query, avoiding both false positives and the non-realistic requirements of fully-detailed contract specifications. • Aiming to represent and reason about temporal aspects, we picked a standard and well established formalism, namely Linear Temporal Logic. Checking permission does not reduce to any of the many decision problems studied for LTL, so we developed a novel algorithm. Note that we do not expect final users to utilize LTL directly. Analogously to SQL in database-powered e-commerce systems, LTL will be used in our system only by application developers.

3

• We develop a novel indexing scheme that allows our system to quickly generate a set of candidate contracts. This eliminates the need to execute the complex permission algorithm on every single contract specification in the temporal repository and is extremely effective for highly selective complex queries. • We develop a novel way to automatically simplify temporal specifications of contracts based on the events cited in a query. This allows our system to execute the permission algorithm on smaller contract specifications and provides the best results for simple queries that mention few events. • Finally, we provide an experimental evaluation of our techniques to prove the feasibility of our approach. Exploiting both our optimization techniques, we show an optimized average query evaluation time of 6sec over databases of 3000 contracts, improving by more than an order of magnitude over the unoptimized average time of 98sec. The size of the tested databases is adequate for most applications as a complete implementation would use relational attributes (e.g. travel date and destination) to pre-select contracts from a potentially much larger database. The rest of the paper is organized as follows. Section 2 introduces the temporal search problem along with our novel algorithm and a high level unoptimized system architecture, Section 3 describes our indexing scheme, Section 4 describes our automatic simplification technique, Section 5 presents our experimental results, Section 6 discusses some related work and Section 7 concludes the paper.

2

Querying contracts

In §1 we argued that a compact vocabulary of events is a much better choice than complex categories to query temporal behavior of service contracts. In particular we introduced the concept of sequence of events as a natural way to describe temporal behavior. Refining the intuition, we want to add the notion of snapshot. At every moment in the temporal sequence, many events could be happening, a snapshot captures all the information regarding the events in a particular moment (i.e. a truth assignment for every event in the vocabulary). A temporal sequence is then just a list of snapshots. Intuitively, we are able to define declaratively properties of these temporal sequences using clauses, so we can define a contract as a set of clauses that identify the set of allowed temporal sequences of snapshots, and a query as a clause that specifies the required property of a temporal sequence.

2.1

Designing the Semantics

Usually, customers are interested in knowing if a certain sequence is possible once they have subscribed to a contract. More precisely, they are interested in a particular property of that sequence (e.g. the fact that a ticket is partially refundable), which they specify using the query clause. A natural semantics would be that a contract is returned as a result of a query Q, if at least one of its allowed sequences satisfies Q. This semantics, however, has a subtle problem. Example 3 Let us consider Ticket A of Example 2, but with a vocabulary that now contains also the event class upgrade. The airfare policy is still that no refund is allowed after a date change and that date changes are unlimited. Let us assume that a customer issues the following query: Q2 An airfare that allows a class upgrade after a date change. If we think of a sequence of events in which the customer obtains a date change and then obtains a class upgrade, we can easily see that it satisfies all clauses of Ticket A (since they impose no restriction of class upgrades, which is not mentioned). It follows that under this semantics Ticket A will be returned as part of the result. In Example 3, we showed how, using the previous semantics, a contract, which did not say anything about class upgrades, would be returned as part of the result of a query regarding class upgrades. It

4

is clear that the contract is not fully specified. We do not want, however, to force full specification on the part of contract publishers, as it would raise the publishing cost considerably. It would also prevent a simple introduction of new events in the common vocabulary, forcing a revision of all contract specifications in the system. On the other hand, we contend that returning underspecified contracts as part of the answer (as in Example 3) would not serve customers well for two reasons. First, customers would be required to read all returned contracts in order to know which contracts explicitly permit the query and which ones were underspecified. Second, in order to gain visibility for their contracts, publishers would be incentivized to underspecify them, so that their contracts would be selected by more queries. This would exacerbate the first problem and make our system useless. For these reasons we refine the semantics in order to exclude contracts that do not explicitly allow some events. Our semantics takes into account only the cited events of each contract, and assumes that the contract makes no commitment on events not explicitly cited in its clauses. In Example 3, Ticket A would not be returned because it never cites events regarding class upgrades. This discussion motivates the following semantics for permission, stated formally in Appendix A and informally as: Definition 1 A contract permits a query if and only if there exists a sequence that is allowed by the contract, consists only of the events mentioned in the contract and satisfies the query. Note that the restriction to the events cited in the specification of the contract does not rule out the permission of all queries citing events not mentioned in the contract. Considering Ticket B of Example 2 (i.e. all refunds, date changes only before scheduled departure), it is easy to see that it should be returned if the customer is interested in the following contracts: Q3 After a date change, airfare allows a class upgrade or a refund. This is because, even though Ticket B does not specify a class upgrade policy, it explicitly allows refunds after date changes.

2.2

Declarative language

Many domains deal with temporal behavior of systems, most notably model checking and verification ([23]). In all these domains, temporal logic has become the accepted standard formalism for declaration of properties over behavioral sequences. Many techniques based on this kind of logic are successfully applied to industrial level implementations of tools [12], [3], [15]. We use the simplest form of temporal logic (Linear Temporal Logic, LTL [19]) as the formalism to capture declarative clauses, informally introduced in §1. We report here the LTL specification of Ticket C from Example 2. Ticket C

1. G(¬ref und)

2. G(dateChange → X(¬FdateChange)) 3. G(missedF light → ¬FdateChange) Clause 1 is read ‘globally not refund ’, globally means that a certain formula has to be true in all the snapshots of a sequence. Clause 2 reads ‘globally, if a date change occurs, it implies that in the next instant (X) it is not the case that eventually another date change occurs’, eventually (F) means that a certain formula will be true in some future instant. Clause 3 reads ‘globally, if a flight is missed, it is never the case that eventually a date chage occurs’. It is easy to see that these formulas are very close to the informal description of Example 2. We also want to stress the fact that the usage of LTL in our scenario is similar to the use of SQL in database powered e-commerce applications: LTL will be used by the application developers. User-friendly GUIs for LTL have already been studied [4] and will save most final users from ever having to use the formal LTL language. They are not, however, the focus of the present paper. For the formal treatment of the use of LTL and the complete LTL specifications for Example 2, refer to Appendix A.

5

init dateChange

purchase state_1

dateChange

init

missedFlight

missedFlight missedFlight d a t e C h a n g e use || refund

true

state_2

state_3

use

state_1

true

refund

refund state_4

state_2

(a)

true

(b)

Figure 1: BAs for: (a) Ticket A of Example 2 (no refund after a date change AND unlimited date changes, (b) a query asking for a refund after a missed flight. Double circled states are final. For readability: In (a) transitions show only the positive literals, e.g. ref und = ref und ∧ ¬purchase ∧ ¬missedF light ∧ ¬dateChange ∧ ¬use.

2.3

Data model

In order to reason on temporal clauses expressed in LTL, we make use of an important connection [24] between LTL and a particular kind of finite state automata, called B¨ uchi automata (BA). Exploiting the result in [24], there are many tools (e.g. [10]) which build a BA accepting all and only the sequences that satisfy a given set of declarative LTL clauses. As we will see, this reduces our problem to searching a database of BAs representing contracts, while checking an individual contract reduces to exploring paths in its BA. B¨ uchi automata (BA) are similar to standard finite state automata, with the exception that they recognize infinite strings. To some readers the choice of using infinite sequences might seem unnecessarily complicated. For technical reasons, however, it is the standard formalization when reasoning about temporal properties. Moreover, everything we said until now is unaffected by this decision. For the convenience of readers unfamiliar with LTL, we summarize the motivation for the use of infinite sequences in Appendix A. We can now state the BA acceptance condition: a string is accepted if the recognizing automaton traverses a final state infinitely many times. We provide now an intuitive example of BAs in our context (the formal definitions can be found in Appendix B.1). Example 4 Let us consider the B¨ uchi automaton in figure 1b. If we use the automaton to recognize a sequence of snapshots, it is easy to check that it will accept it if and only if this sequence contains a snapshot in which missedF light is true and in any snapshot after that ref und is true. Indeed, recall that the accepting condition is that the automaton traverses state 2 infinitely many times. Since this state has a self loop with true as label, it means that once that state is reached, it will always be part of a cycle, and thus traversed infinitely many times. Note that the labels of our BAs are conjunctions of literals (a literal is an atomic condition on an event, e.g. purchase, ¬use) that, during recognition, have to be satisfied by the current snapshot. Moreover, it should be intuitive from the BA accepting condition and Example 4 that, in order to accept a sequence, a recognizing BA can traverse its states in a path that it is formed by a finite prefix leading to a final state k (called knot) and by a cycle leading back to k. These kind of paths are called lasso paths and are always finite. The intuition is that, despite the finiteness of lasso paths, their cycle may be repeated infinitely many times in order to satisfy the BA accepting condition. It should not be surprising, then, that we can state that a sequence ρ satisfies a temporal property ϕ if and only if any BA for ϕ traverses a lasso path while recognizing ρ [24]. We refer again to §B.1 for a formal treatment of BA in our context.

6

2.4

Algorithm

Recall Definition 1 in §2.1. Assuming that both the contract and the query are represented by the BAs of all their allowed sequences, we provide below the intuition that the lasso path concept can be trasfered to a permission checking algorithm. Example 5 Let us consider the contract BA in figure 1a and the query in figure 1b. It is easy to see that the sequences that are accepted by both automata satisfy a lasso path of the form {init, 1, ..., 1, 3, ..., 3, 4, 4, ... } for the contract automaton, and of the form {init, .., init, 1, ..., 1, 2, 2, ... } for the query automaton. Intuitively, we see that we can build a sequence that satisfies a lasso path for both the contract automaton and the query automaton (i.e. we just need to synchronize the traversal of the transitions 1 → 3 in the contract BA with the init → 1 in the query BA and 3 → 4 with 1 → 2). Note that in this process we tried to pair the labels on the mentioned transitions (evidently not all labels are ’compatible’). The intuition in Example 5 rests on the concept of pairing a lasso path on the contract BA with another lasso path on the query BA. We can provide a more precise intuition by introducing the concept of simultaneous lasso path. Definition 2 A simultaneous lasso path is a sequence of pairs hsi , qi ii≥0 so that: 1. hsi ii≥0 is a lasso path in the contract BA (with the cycle repeated infinitely many times) 2. hqi ii≥0 is a lasso path in the query BA (with the cycle repeated infinitely many times) 3. in every i, the ith transition of the query lasso path contains only events from the contract specification and does not conflict with the events in the ith transition of the contract path. In §B.1 we formally treat this intuition resulting in the following: Theorem 1 Let C be a contract and q a query. There exists a simultaneous-lassopath in the BAs representing C and q if and only if C permits q. Note that simultaneous lasso paths as in Definition 2 are infinite. However, it is easy to see that we can finitely represent them with the prefixes and cycles for both the contract and query lasso paths. The finiteness of the representation of simultanoues lasso paths and Theorem 1 imply that the permission problem is decidable. 2.4.1

Implementation

Our verification algorithm looks for a simultaneous lasso path in a way inspired by the nested depth first search technique used in model checking [23], which though it solves a different problem also relies on finding lasso paths. Recall that a lasso path is formed by a prefix and a cycle we now define similar concepts for simultaneous lasso paths. We call knot any pair hsi , qi i s.t. it appears infinitely many times in the simultaneous lasso path and qi is a final state for the query BA. We call prefix w.r.t. a knot k, the portion of the simultaneous lasso path from the beginning to the first appearance of k. In order to define the cycles for simultaneous lasso paths, we need to force the existence of a final state in the contract BA. More formally, let i be the instant of the first appearance of a knot k and let j > i be the instant of the first appearance of a pair hsi , qi i where si is a final state for the contract BA, we call cycle w.r.t. a knot k, the portion of the simultaneous lasso path between i and the first subsequent appearance of k after j. In order to verify the existence of a prefix and a cycle, we start considering the initial pair, then we generate all possible prefixes using a recursive depth first strategy: for every pair we inspect both input BAs (contract and query), build all possible successor pairs (i.e. those satisfying condition 3 of Definition 9) and recurse on each of those. Since we do not know in advance the length of the prefix, at every step, if the considered pair s = hsi , qi i is a potential knot (i.e. qi is a final state for the query BA), we verify if there exists a cycle containing s. We call seeds all potential knots in which we start a search for a cycle. In order to verify the existence of a cycle containing a seed s, we start a nested search in s which looks for a path 7

Algorithm 1: Verifying that C(ϕ) permits property ψ Input: A = {QA , iA , δA , FA } is the service BA, B = {QB , iB , δB , FB } is the query BA, (w.l.o.g. they have a single initial state) ψ, an LTL query property Output: returns true if S(ϕ) permits ψ 1 begin 2 visited = ∅ 3 found = f alse 4 visit(hiA , iB i) 5 return found 6 end 7 procedure visit(hss , sq i) 8 begin 9 if hss , sq i ∈visited then return 10 visited ⇐ visited ∪{hss , sq i} 11 if sq ∈ FB then 12 visited2 ⇐ ∅ 13 cycle search(hss , sq i, hss , sq i, f alse) 14 if found then return 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

foreach hss , λA , s0s i ∈ δA do foreach hsq , λB , s0q i ∈ δB s.t. compatible(λA , λB ) do visit(hs0s , s0q i) if found then return end procedure cycle search(hss , sq i, start, foundFinal ) begin foundFinal2 ⇐ foundFinal ∨(ss ∈ FA ) if hss , sq i ∈ visited2 then return visited2 ⇐ visited2 ∪{hss , sq i} foreach hss , λA , s0s i ∈ δA do foreach hsq , λB , s0q i ∈ δB s.t. compatible(λA , λB ) do if hs0s , s0q i = start then found ⇐ foundFinal2 cycle search(hs0s , s0q i, start, foundFinal2 ) if found then return visited2 ⇐ visited2 \{hss , sq i} end

that leads back to s but that also contain a pair hsi , qi i in which si is a contract final state. The most straightforward way to look for such cycles is to perform a backtracking depth first search that keeps track of the paths that contain a pair with a contract final state and returns true iff one of these paths lead back to s. Algorithm 1, implements the strategy we just outlined. Lines 1 through 6 initialize the sets of visited pairs for the depth first visit, and the global variable that will contain true iff a simultaneous lasso path is found. It then invokes the procedure visit which recursively generates all prefixes. Lines 9 and 10 verify that we are visiting every pair once. Lines 11 through 14 test if the current pair is a seed (qi is a final state). If this is the case, the data structure visited2 is initialized and the procedure cycle search is invoked. If it returns successfully, the procedure terminates. Lines 15-18 generate all possible successor pairs by inspecting the transitions in the contract BA (line 15) and in the query BA (line 16). A recursive call to visit is then issued for each such pair. The predicate compatible at line

8

16 (and also later at line 32) implements condition 3 of Definition 9. The procedure cycle search accesses the visited2 variable which stores the path it is currently exploring. The variable f oundF inal2 is true if the current path contains a pair with a contract final state. Lines 23 and 24, avoid creating paths that contain cycles.Lines 25 and 26, generate all possible successor pairs. Line 27 and 28, update the global variable f ound if a cycle is present which contains a pair with a contract final state. If this is not the case, line 29 and 30 recursively continue to extend the current path.Line 31 backtracks to the previous explored path by removing the current pair from the visited2 variable. We can now state formally: Theorem 2 Algorithm 1 returns true iff C(ϕ) permits ψ. Note that the procedure cycle search can exploit a simple memoization scheme which stores for every visited pair a boolean variable, which is true if that pair can eventually lead to the original seed. In this way, we can code the whole procedure as a depth first visit, never visiting any pair more than once. 2.4.2

Seeds optimization

Algorithm 1 starts a nested search for a cycle in every reachable pair hsi , qi i s.t. qi is a query final state. As pointed out in Section 2.4.1, we know that in order to build a simultaneous lasso path we need a cycle, which has to contain a pair with a contract final state. It should be easy to see that if the contract state si of the seeding pair is not part, in the contract BA, of a cycle containing a final state, it is impossible to build a simultaneous lasso path cycle that contains the seed hsi , qi i. This is trivially true because, by Definition 9, hsi ii≥0 has to form a lasso path for the contract BA. In some cases we can avoid many cycle searches by knowing in advance which contract states are actually part of a cycle containing a final state. We can thus modify our notion of seed, by considering seeds all and only the pairs hsi , qi i so that qi is a query final state and si is contained in a cycle that contains a final state in the contract BA. Our system precomputes all contract states with this property at registration time and Algorithm 1 is easily modified to exploit this information by modifying line 11, which will check if the current pair satisfies the updated definition of seed. 2.4.3

Complexity

As we saw in §2.4.1, the search for a simultaneous lasso path can be coded with a simple backtracking search on the two BAs, which results in a LOGSPACE complexity in the size of the BAs. The size of BAs, however, is known to be worst case exponential in the size of the original clauses (i.e. contract specifications and queries) [24]. This results in a PSPACE upper bound for the permission decision problem. With a reduction from LTL satisfiability [21] we can prove that the problem is PSPACE-hard, which gives us the following complexity result: Theorem 3 Deciding permission is PSPACE-complete. This result is not completely unexpected because it is the same complexity class of the model checking problem. The algorithm used for model checking checks the non-emptiness of the intersection of the sets of runs. In our case, we need to check the existence of a whole projection class in the intersection, however we are able to prove that it can be done using the same basic technical idea, originated in the model checking context, of looking for a lasso path [23].

2.5

High level architecture

The algorithm in §2.4.1 suggests a straightforward implementation. At registration time, a contract LTL specification is translated to an equivalent BA representation that is the one stored. In our prototype we use an existing tool [10] to perform this stage. At query time, the query in LTL form is converted to an equivalent BA in the same way and then checked (with the algorithm described in §2.4) against all the contract BAs in the database. Given the high complexity of the permission problem, however, such an approach is impractical. We expect our contract database to be fairly static and that each contract will be queried multiple times. So all our techniques share the common idea of precomputing some auxiliary information during 9

init

purchase

init

state_1

purchase

dateChange use OR missedFlight

state_2

d a t e C h a n g e useFirst

state_2

state_3

use OR missedFlight

useFirst

state_4

askRefund useSecond refund

state_5

state_3

(a)

state_6

(b)

init dateChange state_1

*

init

miss

state_1

*

* flightCanceled c h a n g e A p p r o v e d

dateChange state_2

*

use state_3

state_2

*

*

state_3

*

requestChange

requestChange

(c)

changeApproved

state_4

*

(d)

Figure 2: B¨ uchi automata for (a) a ticket that allows no refunds and date changes only before the scheduled departure (Ticket C of Example 2), (b) a round-trip ticket allowing a single change before the first flight and a refund for the second one, (c) a query asking for two date changes, (d) a query asking for tickets that can be changed indefinitely even after a flight is canceled or it is missed and already rescheduled once. Double circled states are final. Note: BA (a) and (b) follow the same convention as Figure 1a. the registration step, in order to achieve faster online query processing. In this paper, we describe two optimization techniques.

3

Prefiltering optimization

The technique that we present in this section reduces the number of executions of the verification algorithm by pruning the number of contracts to test. At registration time, an index data structure is updated with information about the contract. At query time, the index is used to identify a set of contracts which is guaranteed to contain all contracts permitting the query. The permission algorithm can then run only on these contracts (candidates). At a high level, the technique extracts a necessary condition for permission from the query BA (called pruning condition), it then uses the index data structure to quickly extract the set of contracts satisfying that condition (i.e. candidates).

3.1

Pruning Conditions

Example 6 Let us consider the query BA in Figure 2c. Since permission is checked looking for a simultaneous lasso path, it is clear that, in order to permit this query, a contract BA must have some transitions that are ‘compatible’ (as in Definition 2, point 3.) with both formulas: dateChange and use. Otherwise, no simultaneous lasso path can be built. While this condition is necessary, it is clearly not sufficient: the contract in Figure 2a has such transitions but does not permit the stated query. We now generalize the idea of Example 6. From Theorem 1 we know that we have to build a simultaneous lasso path. This means that, by definition, a lasso path has to exist in the query BA. We also know that all transitions in this query lasso path have to be ‘compatible’ to their simultaneous

10

transitions in the contract BA. It follows that in the contract BA, at the very least, there has to exist one compatible label for every label in the query lasso path. Using this intuition, it is clear that if we find all lasso paths in the query BA, we can create a set of candidates by gathering all contracts whose BAs contain a compatible label for every label of a query lasso path. The problem is that the number of query lasso paths is exponential in the size of the query automaton. However, we never explicitly compute all lasso paths, as we can extract a more compact representation directly from the query BA. In order to explain this, we will assume to have a data structure that allows us to obtain S(λ), the set of contracts in the repository that contain a label compatible with a query BA label λ. We will describe the implementation of this data structure in §3.2. Example 7 Let us consider the query BA in Figure 2d. It only contains a single final state state 2. In order to build a lasso path knotted in state 2, we need to have both a prefix and a cycle. By visual inspection, there are only two possible prefixes that are simple paths (i.e. do not contain cycles): one with label f lightCanceled, the other with miss and changeApproved. We do not consider the self-loops in init and state 1 (and any other cycle) because their labels are not strictly necessary to build a prefix, so we cannot exclude any contract for not having them. Knowing all the possible prefixes, it follows that all the contracts that contain labels compatible with these prefixes are in the following set: S(f lightCancelled) ∪ (S(miss) ∩ S(changeApproved)) The idea is that this set will be smaller than the whole database. We can also prune other contracts by considering the cycles of the query lasso paths knotted in state 2. Since the transition from state 2 to state 3 is labeled with true, it provides no way to prune any service. It follows that in order to go from state 2 to state 4 we can only force the presence of a transition compatible with requestChange. In order to reach back state 2 we need changeApproved. Adding this additional condition to the previous one (i.e. intersecting the two sets), we know that we need to run the permission algorithm only for contracts in the following set: (S(f lightCancelled) ∪ (S(miss) ∩ S(changeApproved))) ∩ (S(requestChange) ∩ S(changeApproved)) In Example 7 we generated a set of candidates from a condition derived from the analysis of lasso paths knotted in state 2. We call conditions of this kind lasso pruning conditions. Intuitively, a lasso pruning condition for a knot k selects all contracts that permit the query with a simultaneous lasso path whose query lasso is knotted in k. Formally, we call pruning condition for a query BA A, a formula containing union and intersection operations and in which the elements are only S(λ), where λ is a disjunction-free propositional formula using variables in the query LTL formula. Definition 3 Given a CDB and a query q whose BA is A, a lasso pruning condition for f ∈ FA is a pruning condition evaluating to a set containing every contract s ∈ CDB s.t. there exists a simultaneous lasso path hsi , qi ii≥0 for s and q knotted in f . Many heuristics can be used to build a lasso pruning condition (e.g. considering only the transitions entering the knot). In Example 7, the lasso pruning condition is formed by a first part which considers the possible prefixes of the query lasso paths, and a second one which considers the possible cycles. It should be clear that this is a valid lasso pruning condition and that it is always possible to build such conditions (i.e. trivially, enumerating all lasso paths). Our specific implementation, similar to the one in Example 7, is detailed later. In Example 7, state 2 is the only final state, so its lasso pruning condition selects all contracts in the answer. In general, a query BA contains more than one final state.Since any simultaneous lasso path is enough to guarantee permission, we have to take the union of the sets retrieved by the lasso pruning conditions of every final state. Theorem 4 Given a CDB and a query q whose BA is A, let lf be a lasso pruning condition for f ∈ FA . The union of lf for all f ∈ FA is a superset of the result of q on CDB.

11

Algorithm 2: Implemented algorithm to compute pruning conditions for query BAs Input: A = {QA , iA , δA , FA } is the query BA Output: returns a pruning condition for A 1 begin 2 condition ⇐ true 3 C ⇐ connected components(A) 4 foreach c ∈ C do 5 foreach final state t ∈ c do 6 current path ⇐ ∅ 7 condition ⇐ condition ∨(( cycle condition(t, c) ) ∧ ( compute path from init(t) )) 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28

return condition end function cycle condition(t, c): DNF formula begin cycle ⇐ true foreach hi, λ, ti ∈ c do cycle ⇐ cycle ∨λ return cycle end function compute path from init(s: state): DNF formula begin paths ⇐ true if s = iA ∨ s ∈ current path then return paths current path ⇐ current path ∪{s} foreach hi, λ, si ∈ δA do ψ ⇐ compute path W from init (i) foreach χi with i χ = ψ do paths ⇐ paths ∨(χi ∧ λ) current path ⇐ current path \{s} return paths end

The algorithm that we use to compute the lasso pruning conditions computes an approximated version of the ones presented in Example 7. Given a final state f , it does not consider the possible cycles, but only the possible incoming transitions of f from inside its strongly connected component (in Example 7, it would be only changeApproved ). This is an approximation because those incoming transitions might not all be part of a cycle, moreover, we are not considering the rest of the cycles. However, it is much faster and in experimental results it has nearly the same number of false positives as the complete pruning conditions. A reference implementation of this pruning condition algorithm is shown in Algorithm 2. Lines 3-5 simply cycle through all the final states after having identified the connected components of the query BA. As already explained pruning conditions are disjunctions of lasso pruning conditions knotted on final states. Lines 7 generates this disjunction after having created the lasso pruning condition for the current final state t. Lasso pruning condition are formed by a condition derived from the possible paths from the initial state to t in conjunction with the condition on the cycle containing t. We generate these two conditions with the two functions cycle condition and compute path from init. It is immediate to see that the cycle condition computed in lines 12-15 simply takes the disjunctions of the incoming edges of the final state in its connected component. Note that we can consider only the connected component because we know that any cycle has to be contained in a connected component. The function compute path from init generates all backward paths from state s to the intial state, and generates a condition that is the disjunction of the conjunctions of all labels in every such paths. For clarity we show an exponential algorithm. In our implementation we exploit a memoization scheme that stores for

12

each state s the pruning condition generated from the possible paths from the init to s, which gives a linear time algorithm (w.r.t. the size of the BA). Moreover, we remove redundant conditions during the computation in order to minimize the number of conditions we have to iterate in lines 24-25.

3.2

Index data structure

In §3.1, we assumed a way to easily compute S(λ), i.e. the set of contracts in the repository that contain a label compatible (point 3 of Def.2) with a query BA label λ. We will now introduce the index data structure that we use in order to perform this operation. At a high level, we have to find a way to identify labels that are compatible with λ and then gather all contracts containing them. Assuming that a contract label γ involves all events in λ, it is easy to see that we can simply check if the set of literals in γ contains the ones in λ. However, by definition, we might have to consider not only events in γ but also the ones cited in its contract c. This is the case, for instance, if γ is the boolean formula ref und, in a contract that cites both ref und and dateChange. We have then that both labels, ref und ∧ dateChange and ref und ∧ ¬dateChange are compatible with γ. In order to treat compatibility checks as containment checks, we create a set for γ (called the expansion, E(γ)) that contains all literals in γ plus all literals (both positive and negative) for any remaining event cited in its contract. Example 8 Let us consider a transition label t = p ∧ c from a contract that cites the events p, c and m. Given the above definition we have that its expansion E(p ∧ c) = {p, c, m, ¬m}. We have that a query transition q = p ∧ m is compatible with t, while q 0 = p ∧ ¬c and q 00 = c ∧ r are not. Using expansions to check compatibility, our prefiltering index data structure is conceptually a map of sets of contracts, indexed by expansion. The idea is to store for each expansion the set of contract BAs in which there exists a label that generates that expansion. Having this data structure, it is immediate that, given an input formula λ, we can compute S(λ) by taking the union of the contract sets associated to the expansions compatible with λ. Example 9 We consider the two contracts in Figure 1a (Ticket A) and 2a (Ticket C), called A and C. The following prefiltering index contains all the expansions of the labels of both contracts along with the list of contracts in which a particular expansion appear. For readability: p = purchase, c = dateChange, m = missedFlight, u = use, r = refund. {p,¬c,¬m,¬u,¬r} {¬p,c,¬m,¬u,¬r} {¬p,¬c,m,¬u,¬r}

{A,C} {A,C} {A,C}

{¬p,¬c,¬m,u,¬r} {¬p,¬c,¬m,¬u,r}

{A,C} {A }

Note that in this example all the labels in the contract automata contained already all the events, but this is not always the case. If we consider the query in Figure 1b, it is easy to see that a lasso pruning condition will be S(m)∩S(r). Trivially, S(m) evaluates to {A, C} and S(r) to {A}. The whole condition selects the candidate {A}. Using the prefiltering technique, we avoid the execution of the permission checking algorithm on contract C. In order to implement and scale this index we face two problems. The first is that we want to identify all compatible expansions without a full index scan of the index. We solve this problem using a simple hierarchical data structure that enables access to any expansion set in a number of steps bounded by its size. An example of this structure is shown in Figure 3. The expansions are arranged in a graph so that if the literals in expansion a contain literals in expansion b an edge hb, ai is added to the graph. This structure has two important properties. The first is that it can be incrementally maintained (adding and removing expansions) with local modification to the graph structure. The second is that any expansion can be accessed in a number of steps that is linear to the size of the expansion. We achieve this second result by maintaining a direct list of all linked nodes. Note that this list is bounded by the number of possible literals in the system, which is usually contained. The second, and more important, scalability problem is due to the fact that the number of possible expansions is exponential in the size of the vocabulary. This means that the index grows exponentially in 13

{}

{!m,c}

{!m}

{c}

{m}

{!c}

{!m,m}

{m,c}

{!m,!c}

{!c,c}

{!m,m,c}

{!m,!c,c}

{!m,m,!c}

{m,!c}

{m,!c,c}

Figure 3: Hierarchical structure of a prefilter index with a vocabulary of two variables (m and c) and bounded to expansion sets of size 3. The nodes are labeled with the expansion sets and contain the sets of contracts with at least transition that generate it. Note that the structure does not need to contain all possible expansions like in this example. size, leading to intractability. In order to avoid this exponential blow up, we create a data structure that, while maintaining the soundness and completeness of the technique, returns for every label λ a superset S 0 (λ) of S(λ). This modification does not affect the soundness and completeness of the technique because pruning conditions needs to evaluate to a set that just contains a certain set of contracts (§3.1). Since conditions use only union and intersection operators, they are monotonic. This means that the fact that the S 0 () returned by the data structure are supersets of the required S() results in a set of candidates that is a superset of the one returned if we were using the correct S(). Since all candidates will be checked with our permission algorithm and we are returning a superset of the original set of candidates, soundness and completeness are preserved. This data structure avoids the exponential blow up by limiting the size of the expansions in the index structure to a predefined number k. The set of contracts associated to any expansion e of size k will then contain all contracts whose BA representations contains a label whose expansion contains e. In this way, when retrieving S 0 (λ) with |λ| > k, we can just pick any set of contracts (called expansion sets) associated with an expansion that is contained in the set of literals of λ. In order to refine the set of candidates we retrieve, we can take the intersection of many of these expansion sets. In our prototype, we implemented this refinement without any performance impact, since usually query labels are small and do not have a high number of k-subsets.

4

Bisimulation optimization

In this section we describe an optimization technique that is based on the observation that the query usually pertains to a much smaller set of events than the full contract specification. At registration time, we precompute a set of simplified versions of every contract BA. Intuitively, these simplified versions can be thought as ‘projections’ of the full contract specification on a subset of events (i.e. albeit smaller, they faithfully represent the behavior of the contract with respect only to the considered events). At query time, our system retrieves the appropriate precomputed ‘projection’ for each candidate contract that has to be checked for permission.

4.1

Simplified service automaton

We will show now how we are able to use a simplified version in lieu of the full contract BA.First we define a notion of equivalence with respect to a query q: two contracts BA A and B are equivalent if they both either permit or do not permit q. From Theorem 1, it follows that this is the case if and only if we can build simultaneous lasso paths with the query BA for both A and B.

14

init

!d

state_2

!d state_3

init

!d

state_4 !d

!d

!d

state_2

!d

!d

!d state_5

!d

!d

state_3

!d !d

!d state_6

!d

state_4

!d

!d

!d

(a)

state_6

!d

(b)

Figure 4: For readability: d = dateChange. (a) projection of BA in Figure 2b on ¬ dateChange. (b) simplified version of (a). Remember that in order to build a simultaneous lasso path we need two lasso paths (one for the contract BA, one for the query BA) so that in every instant their transitions have ‘compatible’ labels (point 3 of Definition 2). It follows that the only information that we really need in order to find this simultaneous lasso path is if a contract label is ‘compatible’ with a query label. For instance, given the definition of compatibility, if a query BA has transitions that contain only the labels true and dateChange, the only information we need to derive compatibility (besides knowing all events in the contract) is the presence of ¬dateChange in the contract transitions. It should be intuitive now why it is useful to introduce the concept of projection. Given a contract BA A and a set of literals L, we call projection of A on L (πL (A)) the BA built from A by keeping in every label only literals in L, e.g. Figure 4a represents the projection BA on ¬dateChange of BA in Figure 2b. This results in the following theorem, formalized in §C.1: Theorem 5 Given a BA A and a set of literals L, the projection of A on E is equivalent to A for all query BAs that refer to literals that are either negations of literals in L or pertaining to events not in cited in A. The use of projections alone does not improve the performance of the permission checking algorithm as the size of the projection BA is the same as the original one. Projections are, however, simpler BAs than their original counterparts in the sense that many transitions that had different labels have now the same label. This provides the opportunity to use a standard state reduction technique [9] that produces a BA that is equivalent to the original one (i.e. it accepts the same set of strings). This classical polynomial-time technique is based on collapsing bisimilar states and is the same used for the minimization of standard finite state automata [13]. In the case of BAs it does not output the minimum BA (BA minimization is PSPACE-hard [22]), it does provide, however, significant reductions in the number of states of our projections. This allows us to equivalently check our query on a contract BA that is a projection of our original one but that has significatly fewer states. Example 10 In Figure 4a we show a BA derived from a projection, where we have two states (state 4 and state 5) that can be collapsed. Intuitively, once the recognizing BA reaches either of those states, it will accept the exact same set of sequences, i.e. an infinite sequence of snapshots containing ¬dateChange. Its simplification is shown in 4b.

4.2

Using simplified BAs

The previous section details how we can equivalently use simplified projections to check permission. As we already anticipated at the beginning of the section, for every contract we precompute the projections at registration time, so that when our system has to check permission of a query, it can use the precomputed simplified projection. Clearly, the most appropriate projection is the smallest one that is still equivalent to the original BA for the current query.

15

Given the result in Theorem 5, it is easy to see that in order to have the smallest projection for every possible query, we need to have one projection for every subset of literals cited in the contract. The problem that immediately arises is the fact that the number of possible projections is, in the worst case, exponential in the size of the set of events of the contract. We have, however, two important observations that enable us to pursue the precomputation route. The first observation is that we do not necessarily need to precompute all projections, as we can use any projection that contains all required literals for the query, albeit with a possible perfomance penalty. The second observation is that, in practice, not all subsets of literals generate distinct simplified projection BAs, e.g. in our datasets the number of distinct simplified projection BAs was ∼5% of the number of subsets. Intuitively, removing a single event from a projection does not usually lead to a simplification of the observed behavior of the others. It follows that it is often the case that the simplifications are triggered only when whole sets of ‘independent’ events are excluded from the projection. Both these observations provide us great freedom in designing a system that balances runtime performance with precomputation time and storing needs. Storing needs are quite contained, since projections can be represented as partitions of collapsed states of the original BA. To address precomputation time, we designed a novel algorithm that efficiently computes all simplifications in parallel. We could not use the standard bisimulation algorithm (Paige and Tarjan [18]), as we would have had to run it for every subset of literals. Our algorithm avoids testing subsets that are not going to result in simplifications and reuses partial computations between different subsets. We describe this algorithm in §4.3. In our experiments we found it feasible to precompute all projections. However, in case contract complexity grows so much that full precomputation cannot be pursued, other possible approaches include the precomputation of subsets up to a certain size (queries are usually smaller than contracts and contain less literals) or the use of heuristics based on historical data and/or expected workloads to determine which simplification to precompute.

4.3

Computing all simplified BAs in parallel

First of all, notice that the bisimulation is an equivalence relation, so its computation can be thought as the partitioning of the set of nodes into subsets that contains only bisimilar nodes. Now we can state more formally that we want to compute for every service automaton a correspondence P between sets of literals and partitions. For every pair hL, ρi ∈ P , we have that ρ is a partition of the nodes in the service automaton that represents the bisimulation relation of the relevant service automaton w.r.t. the literals in L. In order to describe the main idea we briefly introduce the basic concepts behind the Paige and Tarjan algorithm. Considering a bisimulation relation that ignores the distinction between final and non-final states (i.e. point 1 of Definition 11), the basic algorithm proceeds by successively refining an initial partition that contains the set of all states. We say that a partition ρ is a refinement of ρ0 iff for every set B ∈ ρ there exists B 0 ∈ ρ0 s.t. B ∈ B 0 . In the algorithm, the refining is based on the concept of stability of a partition. Informally, a partition ρ is stable w.r.t. a set B ∈ ρ, if for all sets X ∈ ρ (and for all labels a), either all states in X lead to B (i.e. there exists a transition with label a to a node in B) or none of them. An operator can be designed that iteratively splits all sets X ∈ ρ w.r.t. a B ∈ rho in order to force the stability property. The paper [8] proves that applying this operator for all B ∈ rho defines a compound operator Φ whose fixpoint yields the coarsest partition of bisimilar states (i.e. the coarsest refinement of the initial partition that contains sets of bisimilar nodes). This means that a simplified version of the algorithm iteratively applies this splitting operator, generating a sequence of partitions that terminates with the coarsest one. Our case is a little different since we have to consider final states, so we use an initial partition containing a set with all final states and a set with all remaining ones (analogous to Hopcroft [13]). However, this does not change the properties of the algorithm. We exploit the following property. Theorem 6 Let L be a set of literals and ρL the coarsest bisimulation partition for it, we have that if L0 ⊇ L, then ρL0 is a refinement of ρL Intuitively, adding literals to the considered set will produce a more refined partition, until we consider all literals in the automaton which will produce the partition containing only singleton sets (assuming 16

the starting automaton does not contain redundant states). This property can be exploited by computing first the refinements of smaller subsets literals, since we know that all containing sets of literals will produce partitions that are refinements of those previously computed. In this way, we avoid repeating the same refinement steps multiple times. Another optimization comes from the fact that during the splitting process we generate and consider only those subsets of literals that could result in a split. This is done in the splitting operator by analyzing the labels of the transitions from the nodes in X to those in B (i.e. if a literal does not appear in any transition, it is useless to test sets of literals containing it). Usually, we do not have to test all subsets of literals for all sets.

5 5.1

Experimental evaluation Prototype Architecture

Our prototype is written in Java and consists of four independent modules. The first is the data generator and produces databases of LTL formulae as described in §5.2. It also converts the formulae to BAs using the freely available library LTL2BA [10]. Its output is a database text file. The second module generates the prefiltering index from a database text file. The third precomputes the simplified BAs from a database text file. The fourth module is the runtime module. It uses the prefiltering index and the simplified BAs in order to evaluate queries. It takes as input a query workload dataset text file and outputs statistics regarding their evaluation. We executed all code on an AMD Phenom 2.2 GHz with 4Gb of RAM and we used Sun 32bit JVM 1.6.0.10. The code does not utilize multithreading in order to isolate the technique performance from system level effects.

5.2

Data generation

To avoid a completely random generation of LTL formulae, we adopted a sophisticated generation method based on real-life usage patterns of temporal properties ([7]). The work in [7] surveys over 500 reallife specifications of temporal properties to be formally verified on finite state systems (theoretically equivalent to our contracts) and extracts recurrently appearing patterns that cover over 92% of the surveyed cases, along with their appearing frequencies. While the specifications of these systems come from different application domains (e.g. communication protocols, GUIs, distributed object systems, operating systems, databases), we contend that the sort of properties that form these specifications are very similar to the ones found in service contracts, as they are effectively describing the allowed temporal sequences of interactions between multiple actors. In [7], patterns are classified along two dimensions: required behavior and scope (i.e. temporal interval in which the behavior has to show itself). The possible scopes include global (the whole timeline), before (up to certain event), after (after a given event), between (any part of the timeline between specified event p and event q). The possible behaviors are: Absence An event does not occur in the scope. Existence A given event must occur within a scope. A variation can force the occurrence number to k. Universality A given event occurs throughout the scope. Precedence An event p must always be preceded by an event q. A variation can consider sets of events instead of p and q. Response An event p must always be followed by an event q. A variation can consider sets of events instead of p and q. As an example, we show in Table 1 (from [7]) the LTL expressions for the cited scopes. Contracts and queries are generated as conjunctions of clauses expressed as LTL properties. Our generator randomly generates LTL properties using the distribution reported in [7]; at each generation it substitutes the variable placeholders (e.g. p, q) with variables from the common vocabulary. Given a parameter n, it generates a contract or query specification formed by the conjunction of n properties thus generated. 17

Scope Global Before r After q Between q and r Global Before r After q Between q and r Global Before r After q Between q and r Global Before r After q Between q and r Global Before r After q Between q and r

Table 1: LTL patterns, from [7]. Description Absence G(¬p) p is never true Fr → (¬pUr) p is never true before r G(q → G(¬p)) p is never true after q G((q ∧ ¬r ∧ Fr) → (¬pUr) p is never true between q and r Existence F (p) p is eventually true ¬rW(p ∧ ¬r) p is true some time before r G(¬q) ∨ F(q ∧ Fp) p is true some time after q G(q ∧ ¬r → (¬rW(p ∧ ¬r))) p is true some time between q and r Universality Gp p is always true Fr → (pUr) p is true in every instant before r G((q ∧ ¬r ∧ Fr) → (pUq)) p is true in every instant after q G((q ∧ ¬r ∧ Fr) → (pUr)) p is true in any instant between q and r Precedence Fp → (¬pU(s ∨ G(¬p))) s precedes p at any time Fr → (¬pU(s ∨ r)) if p happens before r, s precedes p G(¬q) ∨ F(q ∧ (¬pU(s ∨ G(¬p)))) if p happens after r, s precedes p and follows q G((q ∧ ¬r ∧ Fr) → (¬pU(s ∨ r))) s predeces p, both events between q and r Response G(p → Fs) if p is true, s will follow Fr → (p → (¬rU(s ∧ ¬r)))Ur if p is true before r, s will follow p and preceed r G(q → G(p → Fs)) if p is true after q, s will follow p G((q ∧ ¬r ∧ Fr) → (p → (¬U(s ∧ s follows p, between q and r ¬r)))Ur) LTL pattern

Example 11 We report here one sample contract from our database of simple contracts. As already said, these contracts are generated by taking the conjunction of randomly generated temporal properties: P1 (G((p14 ∧ ¬p1 ∧ Fp1) → (p17Up1)))∧ P2 (G(¬p2))∧ P3 (¬p7U(p16 ∨ G¬p7))∧ P4 (G(p19 ∧ ¬p12 → ((p6 → (¬p12U(p1 ∧ ¬p12)))U(p12 ∨ G(p6 → (¬p12U(p1 ∧ ¬p12)))))))∧ P5 (G(p20 → Fp17)) Every property is generated from a pattern described in [7], from a vocabulary of 20 events: P1 This property forces the existence of p17 between p14 and p1. In our airline scenario it could force the eligibility of a partial refund only between the first leg use and the second one. P2 This property forces the absence of p2. In our scenario, it could be that this contract does not allow class upgrades. P3 This property forces the existence of p16 before p7. In our scenario P4 This property forces the response of p1 to p6 (i.e. if p1 happens p6 is bound to eventually happen), after p19 and until p12. In our scenario, a contract could postulate that a customer will be refunded if she asks for it but only if it happens after a flight is cancelled and before she is scheduled on another flight. 18

Dataset

size

Simple services Medium services Complex services Simple queries Medium queries Complex queries

3000 1000 1000 100 100 100

Table 2: Datasets statistics #LTL #states #states patterns avg stddev 5 30.94 34.24 6 41.82 43.23 7 50.85 47.5 1 2.31 1.41 2 5.44 4.81 3 9.6 11.11

#trans. avg 622.48 964.69 1291.63 5.2 23.86 92.84

#trans. stddev 1245.77 1628.66 1904.82 5.4 33.18 203.42

P5 This property forces the response of p17 to p20. In our scenario, this could model the fact that when a ticket is purchased it will eventually expire. Given the description of some random generated properties, we want to make two observations. First, the properties are often related between each other as some variables appear in multiple statements (e.g. p17 in P1 and P5). This creates complex interactions that model the fact that a contract is not a series of independent clauses. Second, the temporal patterns represents our contract fairly well. Indeed, if we look at the specification of the tickets in Example 12 in §A (i.e. the LTL formalization of the specification in Example 2) we can see that the properties that we use fall in the patterns that we use in our random generator. For instance, Ticket A property 1, is a simple absence pattern with an ‘after’ scope (i.e. ‘a date change cannot happen after a missed flight’). Note that some properties (e.g. C4 and C5) might be a conjunction of pattern properties. In the particular case of C4 and C5, they are the conjunction of absence patterns with ‘after’ scope. Although we believe that a real contract specification database would have been the best test case for our techniques, given these observations we contend that our random generation method produces fairly realistic test data and allows us to test our system on database sizes hard to obtain in practice. Table 2 shows the statistics of our generated datasets (i.e. contract databases and query workloads). Along with the name of the dataset we report its size (i.e. number of contracts or queries) and the number of LTL properties that form each contract or query. Since we found that the actual complexity of the LTL formulae is better characterized by the statistics of their associated BAs, we also report the average number of states of the BAs in the specification and its standard deviation along with the average number of transitions with its standard deviation.

5.3

Experiments

We measure the running times of query evaluation for the unoptimized system (§2.5) and for the system using both the prefiltering and the bisimulation techniques. The query evaluation time for the unoptimized system is the sum of the conversion time of the query from LTL to BA plus the evaluation on the contract database stored as set of contract BAs. The query evaluation time for the optimized system include the query conversion time plus the time used to evaluate the query using our precomputed data (i.e. prefilter index, simplified BAs) in addition to the contract BAs database. A first batch of measurements was aimed to prove scalability w.r.t. the number of contracts in the database. We evaluated all queries (all complexity levels) on databases of sizes 100 to 3000 of simple contracts (subsets of the Simple Contract database in Table 2). For every contract database size, we recorded the average query evaluation time for the unoptimized and optimized systems. To some readers, these databases might seem too small for practical use. Recall, however, that a complete implementation would use relational attributes (e.g. travel date and destination) to select the contracts that have to be checked for permission from a potentially much larger database. A second batch of measurements was aimed at scalability w.r.t. complexities of both queries and contract specification. We evaluated individually every query database against contract databases of size 1000 (simple, medium, complex). Again, for every combination of queries and contract database complexities we recorded the query evaluation time for the unoptimized and optimized systems.

19

80 speedup w.r.t. no optimization

1e+07

avg speedup (with stdev) avg scan time avg optimized time

1e+06

70

100000

60

10000

50

1000

40

100

30

10

20

1

10

0.1

0

running time (ms)

90

0.01 100

500 1000 2000 number of services in database

3000

Figure 5: Average speedup and running times (unoptimized and optimized) w.r.t. database sizes (small contracts, average of all query complexities)

5.4

Results

First batch - Scaling w.r.t. database size. In Figure 5 the two lines at the top of the graph represent the average running time in the case of the unoptimized approach, called scan time, and the one of the optimized system, called optimized time. The times scale is on the right y axis and it is in milliseconds and logscale. The bar chart on the same picture shows the average speedup (along with standard deviation bars) achieved by the optimized system compared to the unoptimized one (the speedup scale is on the left y axis). Notice that the running times for both evaluations scale nearly linearly w.r.t. the size of the database. The unoptimized time ranges from 2sec for a small 100 contracts database to 100sec in the case of a 3000 contracts database. These times would limit the usage of our approach in interactive systems. Our optimization techniques, however, achieve an average speedup of at least 20 (scaling nicely for larger databases), reducing the average running time in the case of the 3000 contracts database to few seconds instead of minutes. Note that the speedups increase together with the size of the database, a common effect of indexing schemes. Finally, we rarely have speedups less than 10. Second batch - Scaling w.r.t. complexity. Figure 6 reports the results for the scalabitility w.r.t. query and contract complexity. The bars represent the average speedups (along with standard deviation bars) of the optimized system compared to the unoptimized one w.r.t. complexities of both contracts in the database and queries. We explain the reduced speedup of complex queries noting that, using more variables, they cannot take advantage of the most simplified contract BAs. The bisimulation technique is also responsible for the increasing speedup w.r.t. complexities of the contracts. Usually complex contracts cite more variables and with the bisimulation technique the algorithm can ignore the behavior of all variables that are not used in the query. This keeps the complexity down to the minimum required to evaluate the query. Index building and size. For our optimized evaluation, each contract database required the precomputation of two data structures. The computation of the prefiltering index (§3.2) lasted less than 25 mins for our biggest database of 3000 contracts. The average insertion time was around 500ms. The size is also minimal as our biggest prefilter index (3000 contracts) was ∼10Mb. The computation of the simplified BAs is more expensive as we compute all possible projections. The workload is inherently parallel as each contract is simplified independently. For our 3000 contract database the average insertion time of a contract is 42secs, and the total computation of our biggest datasets took 11 hours on our workstation using three cores. We want to note that our technique does not require this massive precomputation to be effective, as motivated in §4.2. The size of the simplified

20

speedup w.r.t. no optimization

100

Simple Queries Medium Queries Complex Queries

80

60

40

20

0 Simple Services

Medium Services

Complex Services

Figure 6: Average speedup w.r.t. contracts and query complexities (database size = 1000 contracts) BAs data is in average around 80% of the database original size and in absolute terms quite contained: our biggest database of 3000 contracts had a combined size (database plus simplified BAs) of 112Mb.

6

Related work

To the best of our knowledge, this is the first work that deals with the discovery of contracts based on their temporal behavior. Previous works in electronic commerce and enterprise computing have dealt with the problem of managing contracts. The field of e-contracting [1] deals with the automation of contracting practices and it is mainly tailored for the B2B case. The direction that is most relevant to the present work is the verification and monitoring of real world contracts. As explained in [16], contracts are usually represented using rules written in some form of temporal and/or deontic logic (i.e. stating permissions, obligations and prohibitions), [14], [17]. Both [14],[17] perform verification on contracts specified with deontic constraints. These works focus on the internal consistency of contracts, leading to the need of more expressive specification language. Queries on a database of contracts are never considered and we are not aware of any contract indexing scheme. Moreover, in [14] the higher order logic used to represent contracts is internally translated to B¨ uchi Automata, opening the way to the application of our optimization techniques also for such higher order formalisms. Web Service search exhibits many similarities with our problem. Web Services are complex pieces of software that expose their functionality on the web. Many works have addressed this problem. Some of them do not consider the temporal behavior, focusing on search in their annotated natural language description ([6]), hence are not applicable to our context. In [11], a method based on graph matching is proposed in order to implement behavioral similarity search of web services. This system returns approximated results, expecting a domain-knowledgeable user to read the actual specification of the returned contracts. Our system, on the other hand, implements an exact query evaluation algorithm, which avoid this. In [20], web services are modeled as directed graphs with nodes representing either actions or messages. An exact evaluation algorithm is proposed to query for graph patterns in service specifications. We contend that such a system would not be suited for our context in which both contracts and queries are inherently declarative specifications and not directly related to a procedural implementation. The problem of querying behavioral objects is present also in the area of business process management. In [2], business processes are represented as directed graphs encoding the BPEL specification. An exact query evaluation algorithm is presented that allows to query the structure of the specification graphs 21

using graph queries. The framework is extended in [5], in order to consider some behavioral semantics. The premise of this line of work differs from ours in that business processes are handled as procedural implementations (workflows), while in our work we manage contracts in the form of declarative clauses. Asking for a full workflow registration by the contract provider would significantly hinder registration into the contract database, as contracts specify only some key properties of the workflow. Moreover, a workflow query will likely miss all contracts whose workflows do not match its structure yet are semantically equivalent to one that does. Finally, many techniques we used are inspired by the work done in model checking [23]. Our permission problem is a novel variation on the standard verification of satisfiability of LTL formulae.

7

Conclusions

We present a broker that enables providers to register service contracts and consumers to query them, in both cases based on the contracts’ temporal behavior. Our novel semantics for permission of a query by a contract takes into account the fact that contracts may not mention some of the events of interest to the query. Permission does not reduce to any standard temporal decision problem. We establish the complexity of the permission problem (PSPACE-complete), and design an algorithm for checking it. We show that the theoretical worst case is not an impediment to scalable implementation, presenting two distinct and complementary techniques for querying large collections of service contracts. We evaluate experimentally our implementation, showing that it scales well with both the number and the complexity of the contracts.

References [1] S. Angelov and P. Grefen. The business case for b2b e-contracting. In Proc. of the 6th Int. Conf. on Electronic Commerce, 2004. [2] C. Beeri, A. Eyal, S. Kamenkovich, and T. Milo. Querying business processes. In VLDB ’06, pages 343–354. VLDB Endowment, 2006. [3] J. Bengtsson, K. Larsen, F. Larsson, P. Pettersson, and W. Yi. Uppaal - a tool suite for automatic verification of real-time systems. In Hybrid Systems III. 1996. [4] M. Brambilla, A. Deutsch, L. Sui, and V. Vianu. The role of visual tools in a web application design and verification framework : A visual notation for ltl formulae. Int. Conf. on Web Engineering, 2005. [5] D. Deutch and T. Milo. Querying structural and behavioral properties of business processes. In Database Programming Languages. 2007. [6] X. Dong, A. Halevy, J. Madhavan, E. Nemes, and J. Zhang. Similarity search for web services. In VLDB, 2004. [7] M. B. Dwyer, G. S. Avrunin, and J. C. Corbett. Property specification patterns for finite-state verification. In Workshop on Formal Methods in Software Practice, 1998. [8] J. C. Fernandez. An implementation of an efficient algorithm for bisimulation equivalence. Sci. Comput. Program., 1990. [9] K. Fisler and M. Y. Vardi. Bisimulation minimization in an automata-theoretic verification framework. In Formal Methods in Computer-Aided Design, 1998. [10] P. Gastin and D. Oddoux. Fast ltl to b¨ uchi automata translation. In CAV, 2001. [11] D. Grigori, J. C. Corrales, and M. Bouzeghoub. Behavioral matchmaking for service retrieval. In Int. Conf. on Web Services, ’06. [12] G. J. Holzmann. The model checker spin. IEEE Trans. Soft. Eng., 1997.

22

[13] J. E. Hopcroft. An n log n algorithm for minimizing states in a finite automaton. Technical report, Stanford, CA, USA, 1971. [14] M. Kyas, C. Prisacariu, and G. Schneider. Run-time monitoring of electronic contracts. In ATVA ’08: Automated Technology for Verification and Analysis, 2008. [15] M. R. Lowry. Software construction and analysis tools for future space missions. In European Conf. on Theory and Practice of Software, 2002. [16] O. Marjanovic and Z. Milosevic. Towards formal modeling of e-contracts. In Enterprise Distributed Object Computing, 2001. [17] Z. Milosevic and R. G. Dromey. On expressing and monitoring behaviour in contracts. In EDOC, ’02. [18] R. Paige and R. E. Tarjan. Three partition refinement algorithms. SIAM J. Comput., 1987. [19] A. Pnueli. The temporal logic of programs. In FOCS, 1977. [20] Z. Shen and J. Su. Web service discovery based on behavior signatures. In Int. Conf. on Services Computing, 2005. [21] A. P. Sistla and E. M. Clarke. The complexity of propositional linear temporal logics. J. ACM, 1985. [22] L. J. Stockmeyer and A. R. Meyer. Word problems requiring exponential time. In STOC ’73: ACM Symp. on Theory of Computing. [23] M. Vardi and P. Wolper. An automata-theoretic approach to automatic program verification. In LICS, 1986. [24] P. Wolper, M. Y. Vardi, and P. A. Sistla. Reasoning about infinite computation paths. In FOCS, 1983.

A

Formalization

We consider a common vocabulary of propositional variables V, intuitively associated to the modeled events. LTL is essentially a propositional logic which, in addition to the usual boolean operators, uses some temporal operators with the following intuitive meaning: • Xp, p is true on the next instant • Fp, eventually p is true • pUq, q is eventually true and p is true until q is true • Gp, globally (i.e. in every instant) p is true • pBq, p is true before q is true • pWq, weak until : p is always true or p is true until q is true Note that all temporal operators can be derived from X and U (e.g. Fϕ ≡ true U ϕ, Gϕ ≡ ¬F(¬ϕ), ϕBψ ≡ ¬(¬ϕ U ψ)). Example 12 Let us see how the clauses of the contracts in Example 2 can be expressed in LTL. We assume a vocabulary V containing the variables { purchase, use, missedFlight, refund, dateChange } that are true in the instant in which the event happens.

23

C0 In this case, we force the fact that only one event happens in each instant, G(purchase → ¬use ∧ ¬missedF light ∧ ¬ref und∧ ¬dateChange), G(use → ¬purchase ∧ ¬missedF light ∧ ¬ref und∧ ¬dateChange), ..., C1 G(purchase → X(¬Fpurchase)) C2 G(purchase B (use ∨ missedF light ∨ ref und ∨ dateChange)) C3 G((missedF light → ¬Fuse) W dateChange) C4 G(ref und → ¬F(use ∨ missedF light ∨ ref und ∨ dateChange)) C5 G(use → ¬F(use ∨ missedF light ∨ ref und ∨ dateChange)) Ticket A

1. G(dateChange → ¬Fref und)

2. no clause necessary Ticket B

1. no clause necessary

2. G(missedF light → ¬FdateChange) Ticket C

1. G(¬ref und)

2. G(dateChange → X(¬FdateChange)) 3. G(missedF light → ¬FdateChange) Note that the specifications of the contracts will be the conjunction of all common clauses along with the specific ones (e.g. for Ticket A: C0,C1,C2,C3,C4,C5,TicketA). Also, note that some natural language clauses (e.g. ’Unlimited date changes’) did not require any additional LTL clause, as they are implicit in the LTL semantics. This is because they did not add any additional constraints to what was already specified in the common clauses (e.g. for date changes: C0,C2,C3,C4,C5). The formal counterpart of our intuitive notion of temporal sequence is the run. A run ρ is a function ρ(t) : N → T (V), where N are the natural numbers, T (V) is the set of truth assignments for the variables in V, and t a particular instant. The ’tail’ of the run starting at instant i is written as ρ|i (formally: ρ|i = ρ(t + i)). The semantics of LTL is given inductively for all boolean operators and for the universal temporal operators (X, U). A run ρ satisfies an LTL formula ϕ (i.e. ρ |= ϕ), iff, inductively: • ρ |= p, with p ∈ V, iff p is true in ρ(0) • ρ |= Xϕ, iff ρ|1 |= ϕ • ρ |= φUψ, iff ∃k ≥ 0 s.t. ρ|k |= ψ and ∀0 ≤ i < k(ρ|i |= φ) • closure w.r.t. boolean operators (∧, ¬) Remember from the discussion after Example 3 that our intuitive semantics did not take into account the behavior of events that were not explicitly cited in the contract. In particular, we wanted to avoid cases where the behavior of a variable not cited in the contract specification results in the contract permission of a query, that is not intended to be supported by the contract. In order to formalize this intuition we introduce the following concepts.

24

Definition 4 The projection of a run ρ w.r.t a set of variables V (a V -projection), is a sequence σ that assigns to the variables in V the same truth values assigned by ρ, and does not assign any truth value to other variables. A run ρ is compatible with a V -projection σ, if σ assigns in every instant the same truth values of ρ to all variables in V . The set of all runs compatible with a V -projection is called projection class. Definition 5 Given an LTL formula ϕ, let V be the set of its variables. We call P (ϕ) ( set of allowed projections of a contract specified by ϕ) the set of V -projections of all runs satisfying ϕ. Definition 6 Given an LTL formula ϕ, let V be the set of its variables. The contract defined by ϕ (referred to as C(ϕ)) permits an LTL query ψ iff there exists a V -projection σ ∈ P (ϕ) s.t. all runs compatible with σ satisfy ψ. Intuitively, forcing the property to be true for all runs in a projection class will avoid the fact that a particular behavior of some variables not mentioned in the contract will trigger a false positive permission of a query. For instance, let us consider that a query q is satisfied by a run ρ of a contract specification because of the behavior of variable v. Let us also assume that without the specific behavior of the variable v, q would not be permitted as it is the case in Example 3 for the variable ‘class upgrade’. We note that in the projection class of ρ there will be a run ρ0 that does not exhibit the same behavior for v, as v is not part of the vocabulary of the contract. It follows that q will not be satisfied by all runs in a projection class and thus it will not be permitted by the contract. We can now formally define the result of the contract search problem. Definition 7 Given a contract database CDB (i.e. a set of LTL specifications for service contracts) and an LTL formula φ, the result R of the query φ on CDB is the set of all and only the contracts that permit φ. Note on infinite runs.To some readers, the decision to choose an infinite list of truth assignments to model sequences might seem an unnecessary complication. We want, however, to make two important observations. The first is that using infinite runs does not reduce the expressive power of the formalism, as any finite sequence can be encoded simply appending an infinite list of dummy snapshots. The second is that we need infinite runs to identify a single witness run for every temporal property. Let us assume a standard contract requirement: being able to change the date of a flight an unlimited number of times. Any finite run would not be able to satisfy this property, preventing us from verifying properties with the exhibition of a single satisfying run.

B

Algorithm

It is easily seen that Definition 6 can be rephrased to: Definition 8 A contract C(ϕ) with vocabulary V permits a query ψ iff the intersection of the set of runs satisfying ϕ with the set of runs satisfying ψ, contains a projection class w.r.t. V . This alternative definition is more suited for the design of an algorithm as we could think of computing the intersection of the two sets, and then checking the presence of a projection class in the intersection set. The first part of this algorithm is very similar to the non-empty intersection problem, extensively studied in other domains (e.g. model checking [23]). To solve this problem the theoretical tool of choice are B¨ uchi automata (BA), which have a deeply exploited connection with LTL [24]. Using BAs, we are able to implement the stated algorithm with a single step: scanning the intersection of the two sets of runs and, at the same time, checking the existence of a complete projection class.

B.1

B¨ uchi automata

Formally, a B¨ uchi automaton is a tuple {Q, I, δ, F }, where Q is the set of states, I the initial states and F the final ones. The transition relation is δ ⊆ Q × Σ × Q, where Σ is the set of disjunction-free propositional formula over V (e.g. Figure 1). 25

In order to define the semantics we introduce the concept of path as a sequence of states σ = σ1 , σ2 , ..., linked by transitions in δ (i.e. ∀i(∃λ(hσi , λ, σi+1 i ∈ δ))). A path is called lasso path if it is formed by a finite prefix leading to a final state k (called knot) and by a cycle leading back to k. The cycle is then repeated infinitely many times. Intuitively, a run ρ satisfies a path σ if at any instant i the truth assignment ρ(i) satisfies a formula on the transition between σi and σi+1 . Now we can state the accepting condition: a BA A accepts a run ρ, if ρ satisfies a lasso path of A. We call BA(ϕ) any BA that accepts all and only the runs satisfying an LTL formula ϕ. We assume transition labels of BA(ϕ) contain only variables in ϕ. We consider now the problem of verifying that a contract permits a certain property. They are both specified as LTL formulae, but our approach will handle them in form of BAs. As in §2.4, we formalize this idea by introducing the concept of simultaneous lasso path. Definition 9 Given a contract defined by ϕ (referred to as C(ϕ)), with a set of variables V , and a query ψ. Let A(ϕ) = {QA , IA , δA , FA } be a BA for C(ϕ) and B(ψ) = {QB , IB , δB , FB } a BA for ψ, we define a simultaneous lasso path an infinite sequence of pairs hsi , qi ii≥0 s.t.: 1. hsi ii≥0 form a lasso path for A 2. hqi ii≥0 form a lasso path for B 3. for every i, there exist θi a formula s.t. hsi , θi , si+1 i ∈ δA and τi s.t. hqi , τi , qi+1 i ∈ δB , and we have that θi ∧ τi is satisfiable (i.e. they are not conflicting), and τi contains only variables in V (θi and τi are compatible). The intuition in §2.4 is then formalized as follows: Lemma 1 Let σ be a lasso path of a BA A(ϕ), P (ϕ) be the set of projection classes of the contract specified by ϕ, and ρ be a run in any p ∈ P (ϕ). We have: ρ |= σ → ∀ρ0 ∈ p(ρ0 |= σ) We assume w.l.o.g. that all the formulae in the transition of BA(ϕ) contain only the variables that appear in ϕ. Let us assume by contradiction that there exists a run ρ0 ∈ s and ρ0 2 σ. So there must exists an instant i where it does not exist a hσi , λ, σi+1 i ∈ δ ∧ρ0 (i) |= λ. However, we know that a λ0 exists for that particular instant i s.t. ρ |= λ0 . Since ρ0 is in s, we know that ∀i(∀v ∈ ϕ(ρ(i)(v) = ρ0 (i)(v))). We also assumed that λ0 does not contain any variable not in ϕ. It follows that ρ0 |= λ0 . Contradiction. Theorem 7 Let P (ϕ) the set of projection classes of a contract with variables in V and ψ a query. There exists σ = hsi , qi i a simultaneous lasso path for S(ϕ) and φ iff ∃s ∈ S(ϕ)(∀ρ ∈ s(ρ |= ψ)). ⇒) For every i, we have by definition that there exists a formula θi (hsi , θi , si+1 i ∈ δA ), a τi (hqi , τi , qi+1 i ∈ δB ) and a set of truth assignements Ti s.t. ∀ti ∈ Ti (ti |= θi ∧ ti |= τi ). Let ρ be a run ρ(i) = ti ∈ Ti , picking for each instant any ti ∈ Ti . Let σ be the laso path formed by si . It follows that ρ |= σ. From the result in Lemma 1, we know that ∀ρ0 ∈ s(ρ0 |= σ)). Since they all share the same assignments on the variables in V (by definition of service), it follows (by definition of service-query-lasopath) that ∀ρ0 ∈ s(∀i(ρ0 (i) |= τi )). Let π be the laso path formed by qi , the previous result means that ∀ρ0 ∈ s(ρ0 |= π), which (being π a laso path for B(ψ)) imply that ∀ρ0 ∈ s(ρ0 |= ψ), which is what we wanted to prove. ⇐) We assume that ∃s ∈ S(ϕ)(∀ρ ∈ s(ρ |= ψ)). From Lemma 1, we know that there exists a laso path σ in A(ϕ) s.t. ∀ρ ∈ s(ρ |= σ). Let ρ be a run in s, since ρ |= ψ, we know that there must be a laso path π in B(ψ) s.t. ρ |= π. Let us consider now the service-query-lasopath χ formed by the states of σ and π. We have to prove that χ is a valid service-query-lasopath by proving the third condition of the definition. For every instant i, let Ti = {ρ(i)|ρ ∈ s}. Since ϕ cannot distinguish on variables not in V , it follows that Ti contains all truth assignments that assign the same truth values to the variables in V . Again, we assume w.l.o.g. that all the formulae in the transition of BA(ϕ) contain only the variables that appear in ϕ. Since all ρ ∈ s satisfy ψ, we have that ∀ρ ∈ s(ρ |= π). We also know 26

that there exists a formula θi on a transition from σi to σi+1 in A and a formula τi between πi and πi+1 in B, so that ∀ρ ∈ s(ρ(i) |= θi ∧ ρ(i) |= τi ). It follows, by construction of Ti , that for every i, ∀ti ∈ Ti (ti |= θi ∧ ti |= τi ). This proves that χ is a valid service-query-lasopath. This theorem implies that the permission problem is decidable. This follows from the fact that we can check permission by looking for simultaneous-lassopaths. These are finite objects (completely described by the two prefixes and cycles of the lassopaths for the contract and query BAs) and any pair of BAs has a finite number of distinct simultaneous-lassopaths. It is immediately clear that this identifies a na¨ıve way to look for simultaneous-lassopaths, that consists in exhaustive exploration.

C

Bisimulation Optimization

C.1

Simplified service automaton

We say that a BA A is equivalent to a BA B w.r.t. a query BA Q iff there exists a simultaneous lasso path for A and Q iff there exists one for B and Q. We consider a query BA Q, and let LQ be the set of literals that appear in labels of Q. LnQ is the set that contains the negation of every literal in LQ . Definition 10 We call the relevant BA Ar of a service BA A w.r.t. a query BA Q, a BA that has the same states as A but that for every transition hs, χ, s0 i ∈ δA we have hs, χ0 , s0 i ∈ δAr s.t. χ0 is a conjunction of all the literals in χ that are in LnQ . We have: Theorem 8 Given a contract BA A and a query BA B, there exists a simultaneous lasso path for A and B iff there exists one for its relevant BA Ar and B. ⇒) Let hsi , qi ii≥0 be a simultaneous lasso path for A and B. We argue that it is a simultaneous lasso path also for Ar and B. Clearly hsi ii≥0 and hqi ii≥0 are lasso paths for Ar and B, respectively. Now, let us assume by contradiction that there exists θi s.t. hsi , θi , si+1 i ∈ δA compatible with τi r comatible with τi . By s.t. hqi , τi , qi+1 i ∈ δB but there are no transitions λi s.t. hsi , λi , si+1 i ∈ δA 0 0 r construction, there exists λi s.t. hsi , λi , si+1 i ∈ δA and that contains only the negated query literals of θi . It follows that λ0i is not conflicting with τi (otherwise θi would also be conflicting). We also know that τi does not contain any literal not in the contract vocabulary (otherwise it would not be compatible with θi ). So λ0i is compatible with τi . Contradiction. ⇐) Let hsi , qi ii≥0 be a simultaneous lasso path for Ar and B. We argue that it is a simultaneous lasso path also for A and B. Clearly hsi ii≥0 and hqi ii≥0 are lasso paths for Ar and B, respectively. r Now, let us assume by contradiction that there exists λi s.t. hsi , λi , si+1 i ∈ δA compatible with τi s.t. hqi , τi , qi+1 i ∈ δB but there are no transitions θi s.t. hsi , θi , si+1 i ∈ δA comatible with τi . By construction, there exists θi0 s.t. hsi , θi0 , si+1 i ∈ δA and that contains all and only the negated query literals of λi . It follows that θi0 is not conflicting with τi (otherwise λi would also be conflicting). We also know that τi does not contain any literal not in the contract vocabulary (otherwise it would not be compatible with θi ). So θi0 is compatible with τi . Contradiction. We make use of the bisimulation concept to reduce the number of states. Definition 11 Let a and b be two nodes of a BA A. We say that a ∼ b (read bisimulates) iff: (1) if a ∈ FA then b ∈ FA and vice versa, and (2) for every edge ha, λ, a0 i ∈ δA we have hb, λ, b0 i ∈ δA with a0 = b0 or a0 ∼ b0 , and vice versa. As with the minimization of traditional finite state automata, we create equivalence classes of BA states (called B(A)) w.r.t. bisimulation, and we use them as the states of a new BA. Definition 12 Given a BA A, we call As (its simplification) a BA s.t. (1) QAs = B(A), (2) IAs is any s ∈ B(A) that contains a state in IA , (3) FAs is any s ∈ B(A) that contains only states in FA , (4) let C(a) be the bisimulation equivalence class for the state a, for every ha, λ, a0 i ∈ δA we have hC(a), λ, C(a0 )i ∈ δAs .

27

The simplification of a BA is equivalent to the original one in the following sense: Theorem 9 For every path σ = σ1 , σ2 , ... for a BA A there exists a path σ s = σ1s , σ2s , ... for As and vice versa, where for all i, σis is the bisimulation class of σi . ⇒) Let us assume by contradiction that there exists a λ s.t. hσi , λ, σi+1 i ∈ δA , but there exists no λ0 s i ∈ δAs . This is a contradition with point 4 of Definition 12. s.t. hσis , λ0 , σi+1 ⇐) By induction on the steps of the path on As . Since σ1s is an initial state for As , by definition it contains σ1 the initial state for A. For the induction step we assume to have built a path σ1 , σ2 , ..., σi for A from the first states of s the path σ1s , σ2s , ..., σis for As . Also, since there exists a transition hσis , λ, σi+1 i ∈ δAs , we know, by s 0 s 0 0 0 . Since σi0 ∈ σi+1 construction that there must be a transition hσi , λ, σi+1 i ∈ δA s.t. σi ∈ σi and σi+1 s 00 and σi are in the same bisimulation class σi we know that there exists a transition hσi , λ, σi+1 i ∈ δA s 00 00 s.t. σi+1 ∈ σi + 1 . This means that we can build a path of length i + 1, σ1 , σ2 , ..., σi , σi+1 in As . Using Theorems 8 and 9, we prove now the main result that allows our bisimulation optimization technique. Theorem 10 Given a service BA A and a query BA B, there exists a simultaneous lasso path for A and B iff there exists one for the simplification of its relevant BA (Ar )s and B. ⇒) Let hsi , qi ii≥0 be a simultaneous lasso path for A and B. We argue that there exists a simultaneous lasso path also for (Ar )s and B. We already know from Theorem 8 that there exists a simultaneous lasso path hs0i , qi ii≥0 for Ar and r B. From Theorem 9 we know that we can build a path hs0s i ii≥0 in (A )s from the contract path 0 r hsi ii≥0 in A . It is easy to see that this is also a lasso path. By construction (definition of the simplified BA), we have the same transition labels on hs0s i ii≥0 as we have on hs0i ii≥0 . It follows that the compatibility requirement with the labels on the query r lasso path hqi ii≥0 is satisfied. So hs0s i , qi ii≥0 is a simultaneous lasso path for (A )s and B. ⇐) Let hssi , qi ii≥0 be a simultaneous lasso path for (Ar )s and B. We argue that there exists a simultaneous lasso path also for A and B. From Theorem 9 we know that we can build a path hsi ii≥0 in Ar from the contract path hssi ii≥0 in (Ar )s . It is easy to see that this is also a lasso path. Using the same induction as in Theorem 9, it is easy to see that we have the same transition labels on hsi ii≥0 as we have on hssi ii≥0 . It follows that the compatibility requirement with the labels on the query lasso path hqi ii≥0 is satisfied. So hsi , qi ii≥0 is a simultaneous lasso path for Ar and B. From Theorem 8 we know that hsi , qi ii≥0 is also a simultaneous lasso path for A and B. Note how Theorem 9 formalizes Theorem 5.

28

Alin Deutsch UC San Diego [email protected]

Dayou Zhou UC San Diego [email protected]

April 1, 2010 Abstract Considering a broad definition for service contracts (beyond web services and software, e.g. airline tickets and insurance policies), we tackle the challenges of building a high performance broker in which contracts are both specified and queried through their temporal behavior. The temporal dimension, in conjunction with traditional relational attributes, enables our system to better address difficulties arising from the great deal of information regarding the temporal interaction of the various events cited in contracts (e.g. ”No refunds are allowed after a reschedule of the flight, which can be requested only before any flight leg has been used”). On the other hand, querying big repositories of temporal specifications poses many interesting challenges. In this paper, we introduce two distinct and complementary optimization techniques that enable our system to scale the evaluation of a novel and theoretically sound notion of permission of a temporal property by a service contract. Our notion of permission is inspired by previous work on model checking but, given the specific characteristic of our problem, does not reduce to it. We evaluate experimentally our implementation, showing that it scales well with both the number and the complexity of the contracts.

1

Introduction

Service contracts are seldom completely represented by a set of predefined attributes and as such are usually modeled only partially in IT systems. As an example, consider the market of airfares. Every plane ticket is actually a complex contract with dozens of conditions regarding validity, refundability and changeability, among others. Some fixed categories exist (e.g. economy tickets, business, first class), but they do not express all the subtleties of the contracts that the customers are required to sign. Any nonstandard travel arrangements require sifting through the full text of the contracts of the fares returned by such brokers as Expedia, Orbitz and Travelocity, or asking an expert travel agent. In this work, we envision a contract brokering system that, in addition to standard relational attributes, allows contracts to be specified and queried by their temporal behavior. For illustration purposes, we introduce now a running example in the airfares scenario. Airlines, as contract providers, register all their fares on our brokering system. Each airfare specifies its temporal aspects, in addition to the usual relational attributes (e.g. baggage conditions, prices). These aspects involve how tickets are changed (if and when it is possible), how the customers can be refunded and so on. A possible query could look for: the cheapest fare on 2/19/2010 from San Diego to New York, that allows a partial ticket refund after the first leg has been used. In order to answer this query, our broker would first retrieve the list of fares available from San Diego to New York, along with their prices, using a standard DBMS. It would then use the temporal specifications to filter the fares that do not satisfy the temporal portion of the query (e.g. ‘... that allows a date change even in the case of a missed flight’). In this work, we will focus on the temporal aspect of our system. Finding the appropriate representation for the full variety of possible interactions allowed by the contracts is not trivial. Let us consider a customer’s specific requirement regarding the refundability options of a partially used ticket. It would be easy to add an attribute ‘refundable’ with a set of predefined values that code the various situations: no refunds, only full refunds, partial refunds allowed. The problem lies in the interaction of these categories with other ticket features like changeability, e.g. 1

a refund might not be available for a ticket that was changed. Moreover, both aspects interact with the various phases of the trip (e.g. before or after the first flight leg was completed). In principle, every new aspect can interact with all the others, leading to an exponential blow up of the number of possible interaction ‘categories’ (e.g. ‘partial refunds allowed only if requested before the date of the first flight leg’, ‘partial refunds allowed even if the first leg is missed’). These categories then become too numerous to specify as schema attributes. Moreover, they are too hard to define and specify for the contract providers and, even more importantly, hard for the consumer to understand. The alternative we propose is to consider a common vocabulary of events that is small enough to overview and will act as the interface between providers and customers. This vocabulary refers to a set of familiar events pertaining to the application domain, and is much more compact than the rich family of interaction categories arising from the various sequences of these events. This family is never explicitly listed. Instead, when interested in a particular sequence of events, customers formulate an ad-hoc query against the vocabulary, detailing the desired sequence of events. Each contract specifies in a similar way the allowed sequences of events. So the query simply returns all contracts that allow the sequence required by the query. We say that such contracts permit the query. Example 1 We use the following event vocabulary to crudely model single-trip flights: purchase (ticket purchased), use (ticket used), missedFlight (customer missed the flight), refund (customer is refunded for the ticket), dateChange (the flight is rescheduled). An airfare that allows flight reschedulings will contain sequences in which this happens, but also the ones in which the ticket is used in the originally scheduled date: 1. (purchase) 99K (use) 99K ... 2. (purchase) 99K (dateChange) 99K (use) 99K ... 3. ... A customer looking for a ticket that allows reschedules even after a missed flight, would think of sequences of the form: (purchase) 99K (missedFlight) 99K (dateChange) 99K ... We believe that thinking of contracts as a set of allowed sequences provides a simple way to describe and query their behavior. However, it would be impractical and error-prone to specify airfares by listing explicitly the set of all allowed sequences. The same observation holds for queries, as a customer may only wish to focus on the temporal relationship of two events of interest, not caring how the other events relate to them on the timeline. The customer in Example 1, for instance, is interested in the events missedFlight and dateChange. She does not care about refunds that might happen after the date change or the class upgrade that may happen for the first leg of the flight. Given these observations, we define contracts and queries in a declarative way. Analogously to the real world, contracts are specified as a set of declarative clauses, each of them specifying a required property of the allowed sequences of events. Intuitively, the allowed sequences of events of a contract will be the set of all sequences that simultaneously satisfy all contract clauses. With the same intuition we will use declarative clauses also to specify properties required by the customer at query time. Example 2 Using the vocabulary of Example 1 we show here three different ticket offers, which differ in their policies regarding refunds and changeability. Ticket A

1. No refunds are allowed after date changes

2. Unlimited date changes Ticket B

1. Refunds always allowed

2. Date changes only before the scheduled departure Ticket C

1. No refunds are allowed

2. Only one date change is allowed 3. Date changes only before the scheduled departure 2

We also state some clauses that are valid for all contracts in our domain and that model the intuitive meaning of the events in the vocabulary. C1 The ticket is purchased once. C2 The ticket has to be purchased before any of the following events: it is used, the flight missed or rescheduled, or the customer is refunded. C3 If the flight is missed, the ticket is unusable unless rescheduled C4 If a refund is issued, then no other event can happen C5 If the ticket is used, then no other event can happen The above clauses identify the sequences of events allowed by each ticket. The idea is that a customer might ask: Q1 Is there a possible sequence of events that after a missed flight contains either a refund or a date change? Intuitively, the system should return Ticket A (because it would allow a reschedule) and Ticket B (that would allow a refund). It should not return Ticket C because it does not allow any refund nor rescheduling after a missed flight. The reader should not be tempted to fall back on the usual categorization scheme (e.g. ’no reschedule after misses’, ’reschedules allowed’), which is always possible for simple examples like this one. Roundtrip tickets, multiple legs flights and class upgrades would all contribute to the proliferation of categories, which would contain unmanageable entries such as ’reschedules cannot happen on different legs of the same flight’ or ’if the first flight is used, a refund can be issued only if the second flight is not missed’. We now summarize our problem setting: - We focus on the behavioral aspect of querying contracts, so we assume a traditional DBMS takes care of the features modeled as relational attributes. - A contract describes what is allowed and what is not, so it represents a set of allowed sequences of events. - Contracts are specified with declarative clauses (i.e. properties of sequences of events). - A contract permits a query (also specified as a declarative clause) if it contains a sequence satisfying the query, - Our system stores the temporal specification of all contracts and has to provide a scalable and high performance way to retrieve all contracts that permit an online temporal query. Solving this problem involved the following contributions: • We model an intuitive notion of permission with respect to service contracts and queries, which captures temporal behavior aspects of services that are not handled currently by state-of-the-art systems. Our notion gracefully handles the fact that contracts might not specify the behavior of all events mentioned in the query, avoiding both false positives and the non-realistic requirements of fully-detailed contract specifications. • Aiming to represent and reason about temporal aspects, we picked a standard and well established formalism, namely Linear Temporal Logic. Checking permission does not reduce to any of the many decision problems studied for LTL, so we developed a novel algorithm. Note that we do not expect final users to utilize LTL directly. Analogously to SQL in database-powered e-commerce systems, LTL will be used in our system only by application developers.

3

• We develop a novel indexing scheme that allows our system to quickly generate a set of candidate contracts. This eliminates the need to execute the complex permission algorithm on every single contract specification in the temporal repository and is extremely effective for highly selective complex queries. • We develop a novel way to automatically simplify temporal specifications of contracts based on the events cited in a query. This allows our system to execute the permission algorithm on smaller contract specifications and provides the best results for simple queries that mention few events. • Finally, we provide an experimental evaluation of our techniques to prove the feasibility of our approach. Exploiting both our optimization techniques, we show an optimized average query evaluation time of 6sec over databases of 3000 contracts, improving by more than an order of magnitude over the unoptimized average time of 98sec. The size of the tested databases is adequate for most applications as a complete implementation would use relational attributes (e.g. travel date and destination) to pre-select contracts from a potentially much larger database. The rest of the paper is organized as follows. Section 2 introduces the temporal search problem along with our novel algorithm and a high level unoptimized system architecture, Section 3 describes our indexing scheme, Section 4 describes our automatic simplification technique, Section 5 presents our experimental results, Section 6 discusses some related work and Section 7 concludes the paper.

2

Querying contracts

In §1 we argued that a compact vocabulary of events is a much better choice than complex categories to query temporal behavior of service contracts. In particular we introduced the concept of sequence of events as a natural way to describe temporal behavior. Refining the intuition, we want to add the notion of snapshot. At every moment in the temporal sequence, many events could be happening, a snapshot captures all the information regarding the events in a particular moment (i.e. a truth assignment for every event in the vocabulary). A temporal sequence is then just a list of snapshots. Intuitively, we are able to define declaratively properties of these temporal sequences using clauses, so we can define a contract as a set of clauses that identify the set of allowed temporal sequences of snapshots, and a query as a clause that specifies the required property of a temporal sequence.

2.1

Designing the Semantics

Usually, customers are interested in knowing if a certain sequence is possible once they have subscribed to a contract. More precisely, they are interested in a particular property of that sequence (e.g. the fact that a ticket is partially refundable), which they specify using the query clause. A natural semantics would be that a contract is returned as a result of a query Q, if at least one of its allowed sequences satisfies Q. This semantics, however, has a subtle problem. Example 3 Let us consider Ticket A of Example 2, but with a vocabulary that now contains also the event class upgrade. The airfare policy is still that no refund is allowed after a date change and that date changes are unlimited. Let us assume that a customer issues the following query: Q2 An airfare that allows a class upgrade after a date change. If we think of a sequence of events in which the customer obtains a date change and then obtains a class upgrade, we can easily see that it satisfies all clauses of Ticket A (since they impose no restriction of class upgrades, which is not mentioned). It follows that under this semantics Ticket A will be returned as part of the result. In Example 3, we showed how, using the previous semantics, a contract, which did not say anything about class upgrades, would be returned as part of the result of a query regarding class upgrades. It

4

is clear that the contract is not fully specified. We do not want, however, to force full specification on the part of contract publishers, as it would raise the publishing cost considerably. It would also prevent a simple introduction of new events in the common vocabulary, forcing a revision of all contract specifications in the system. On the other hand, we contend that returning underspecified contracts as part of the answer (as in Example 3) would not serve customers well for two reasons. First, customers would be required to read all returned contracts in order to know which contracts explicitly permit the query and which ones were underspecified. Second, in order to gain visibility for their contracts, publishers would be incentivized to underspecify them, so that their contracts would be selected by more queries. This would exacerbate the first problem and make our system useless. For these reasons we refine the semantics in order to exclude contracts that do not explicitly allow some events. Our semantics takes into account only the cited events of each contract, and assumes that the contract makes no commitment on events not explicitly cited in its clauses. In Example 3, Ticket A would not be returned because it never cites events regarding class upgrades. This discussion motivates the following semantics for permission, stated formally in Appendix A and informally as: Definition 1 A contract permits a query if and only if there exists a sequence that is allowed by the contract, consists only of the events mentioned in the contract and satisfies the query. Note that the restriction to the events cited in the specification of the contract does not rule out the permission of all queries citing events not mentioned in the contract. Considering Ticket B of Example 2 (i.e. all refunds, date changes only before scheduled departure), it is easy to see that it should be returned if the customer is interested in the following contracts: Q3 After a date change, airfare allows a class upgrade or a refund. This is because, even though Ticket B does not specify a class upgrade policy, it explicitly allows refunds after date changes.

2.2

Declarative language

Many domains deal with temporal behavior of systems, most notably model checking and verification ([23]). In all these domains, temporal logic has become the accepted standard formalism for declaration of properties over behavioral sequences. Many techniques based on this kind of logic are successfully applied to industrial level implementations of tools [12], [3], [15]. We use the simplest form of temporal logic (Linear Temporal Logic, LTL [19]) as the formalism to capture declarative clauses, informally introduced in §1. We report here the LTL specification of Ticket C from Example 2. Ticket C

1. G(¬ref und)

2. G(dateChange → X(¬FdateChange)) 3. G(missedF light → ¬FdateChange) Clause 1 is read ‘globally not refund ’, globally means that a certain formula has to be true in all the snapshots of a sequence. Clause 2 reads ‘globally, if a date change occurs, it implies that in the next instant (X) it is not the case that eventually another date change occurs’, eventually (F) means that a certain formula will be true in some future instant. Clause 3 reads ‘globally, if a flight is missed, it is never the case that eventually a date chage occurs’. It is easy to see that these formulas are very close to the informal description of Example 2. We also want to stress the fact that the usage of LTL in our scenario is similar to the use of SQL in database powered e-commerce applications: LTL will be used by the application developers. User-friendly GUIs for LTL have already been studied [4] and will save most final users from ever having to use the formal LTL language. They are not, however, the focus of the present paper. For the formal treatment of the use of LTL and the complete LTL specifications for Example 2, refer to Appendix A.

5

init dateChange

purchase state_1

dateChange

init

missedFlight

missedFlight missedFlight d a t e C h a n g e use || refund

true

state_2

state_3

use

state_1

true

refund

refund state_4

state_2

(a)

true

(b)

Figure 1: BAs for: (a) Ticket A of Example 2 (no refund after a date change AND unlimited date changes, (b) a query asking for a refund after a missed flight. Double circled states are final. For readability: In (a) transitions show only the positive literals, e.g. ref und = ref und ∧ ¬purchase ∧ ¬missedF light ∧ ¬dateChange ∧ ¬use.

2.3

Data model

In order to reason on temporal clauses expressed in LTL, we make use of an important connection [24] between LTL and a particular kind of finite state automata, called B¨ uchi automata (BA). Exploiting the result in [24], there are many tools (e.g. [10]) which build a BA accepting all and only the sequences that satisfy a given set of declarative LTL clauses. As we will see, this reduces our problem to searching a database of BAs representing contracts, while checking an individual contract reduces to exploring paths in its BA. B¨ uchi automata (BA) are similar to standard finite state automata, with the exception that they recognize infinite strings. To some readers the choice of using infinite sequences might seem unnecessarily complicated. For technical reasons, however, it is the standard formalization when reasoning about temporal properties. Moreover, everything we said until now is unaffected by this decision. For the convenience of readers unfamiliar with LTL, we summarize the motivation for the use of infinite sequences in Appendix A. We can now state the BA acceptance condition: a string is accepted if the recognizing automaton traverses a final state infinitely many times. We provide now an intuitive example of BAs in our context (the formal definitions can be found in Appendix B.1). Example 4 Let us consider the B¨ uchi automaton in figure 1b. If we use the automaton to recognize a sequence of snapshots, it is easy to check that it will accept it if and only if this sequence contains a snapshot in which missedF light is true and in any snapshot after that ref und is true. Indeed, recall that the accepting condition is that the automaton traverses state 2 infinitely many times. Since this state has a self loop with true as label, it means that once that state is reached, it will always be part of a cycle, and thus traversed infinitely many times. Note that the labels of our BAs are conjunctions of literals (a literal is an atomic condition on an event, e.g. purchase, ¬use) that, during recognition, have to be satisfied by the current snapshot. Moreover, it should be intuitive from the BA accepting condition and Example 4 that, in order to accept a sequence, a recognizing BA can traverse its states in a path that it is formed by a finite prefix leading to a final state k (called knot) and by a cycle leading back to k. These kind of paths are called lasso paths and are always finite. The intuition is that, despite the finiteness of lasso paths, their cycle may be repeated infinitely many times in order to satisfy the BA accepting condition. It should not be surprising, then, that we can state that a sequence ρ satisfies a temporal property ϕ if and only if any BA for ϕ traverses a lasso path while recognizing ρ [24]. We refer again to §B.1 for a formal treatment of BA in our context.

6

2.4

Algorithm

Recall Definition 1 in §2.1. Assuming that both the contract and the query are represented by the BAs of all their allowed sequences, we provide below the intuition that the lasso path concept can be trasfered to a permission checking algorithm. Example 5 Let us consider the contract BA in figure 1a and the query in figure 1b. It is easy to see that the sequences that are accepted by both automata satisfy a lasso path of the form {init, 1, ..., 1, 3, ..., 3, 4, 4, ... } for the contract automaton, and of the form {init, .., init, 1, ..., 1, 2, 2, ... } for the query automaton. Intuitively, we see that we can build a sequence that satisfies a lasso path for both the contract automaton and the query automaton (i.e. we just need to synchronize the traversal of the transitions 1 → 3 in the contract BA with the init → 1 in the query BA and 3 → 4 with 1 → 2). Note that in this process we tried to pair the labels on the mentioned transitions (evidently not all labels are ’compatible’). The intuition in Example 5 rests on the concept of pairing a lasso path on the contract BA with another lasso path on the query BA. We can provide a more precise intuition by introducing the concept of simultaneous lasso path. Definition 2 A simultaneous lasso path is a sequence of pairs hsi , qi ii≥0 so that: 1. hsi ii≥0 is a lasso path in the contract BA (with the cycle repeated infinitely many times) 2. hqi ii≥0 is a lasso path in the query BA (with the cycle repeated infinitely many times) 3. in every i, the ith transition of the query lasso path contains only events from the contract specification and does not conflict with the events in the ith transition of the contract path. In §B.1 we formally treat this intuition resulting in the following: Theorem 1 Let C be a contract and q a query. There exists a simultaneous-lassopath in the BAs representing C and q if and only if C permits q. Note that simultaneous lasso paths as in Definition 2 are infinite. However, it is easy to see that we can finitely represent them with the prefixes and cycles for both the contract and query lasso paths. The finiteness of the representation of simultanoues lasso paths and Theorem 1 imply that the permission problem is decidable. 2.4.1

Implementation

Our verification algorithm looks for a simultaneous lasso path in a way inspired by the nested depth first search technique used in model checking [23], which though it solves a different problem also relies on finding lasso paths. Recall that a lasso path is formed by a prefix and a cycle we now define similar concepts for simultaneous lasso paths. We call knot any pair hsi , qi i s.t. it appears infinitely many times in the simultaneous lasso path and qi is a final state for the query BA. We call prefix w.r.t. a knot k, the portion of the simultaneous lasso path from the beginning to the first appearance of k. In order to define the cycles for simultaneous lasso paths, we need to force the existence of a final state in the contract BA. More formally, let i be the instant of the first appearance of a knot k and let j > i be the instant of the first appearance of a pair hsi , qi i where si is a final state for the contract BA, we call cycle w.r.t. a knot k, the portion of the simultaneous lasso path between i and the first subsequent appearance of k after j. In order to verify the existence of a prefix and a cycle, we start considering the initial pair, then we generate all possible prefixes using a recursive depth first strategy: for every pair we inspect both input BAs (contract and query), build all possible successor pairs (i.e. those satisfying condition 3 of Definition 9) and recurse on each of those. Since we do not know in advance the length of the prefix, at every step, if the considered pair s = hsi , qi i is a potential knot (i.e. qi is a final state for the query BA), we verify if there exists a cycle containing s. We call seeds all potential knots in which we start a search for a cycle. In order to verify the existence of a cycle containing a seed s, we start a nested search in s which looks for a path 7

Algorithm 1: Verifying that C(ϕ) permits property ψ Input: A = {QA , iA , δA , FA } is the service BA, B = {QB , iB , δB , FB } is the query BA, (w.l.o.g. they have a single initial state) ψ, an LTL query property Output: returns true if S(ϕ) permits ψ 1 begin 2 visited = ∅ 3 found = f alse 4 visit(hiA , iB i) 5 return found 6 end 7 procedure visit(hss , sq i) 8 begin 9 if hss , sq i ∈visited then return 10 visited ⇐ visited ∪{hss , sq i} 11 if sq ∈ FB then 12 visited2 ⇐ ∅ 13 cycle search(hss , sq i, hss , sq i, f alse) 14 if found then return 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

foreach hss , λA , s0s i ∈ δA do foreach hsq , λB , s0q i ∈ δB s.t. compatible(λA , λB ) do visit(hs0s , s0q i) if found then return end procedure cycle search(hss , sq i, start, foundFinal ) begin foundFinal2 ⇐ foundFinal ∨(ss ∈ FA ) if hss , sq i ∈ visited2 then return visited2 ⇐ visited2 ∪{hss , sq i} foreach hss , λA , s0s i ∈ δA do foreach hsq , λB , s0q i ∈ δB s.t. compatible(λA , λB ) do if hs0s , s0q i = start then found ⇐ foundFinal2 cycle search(hs0s , s0q i, start, foundFinal2 ) if found then return visited2 ⇐ visited2 \{hss , sq i} end

that leads back to s but that also contain a pair hsi , qi i in which si is a contract final state. The most straightforward way to look for such cycles is to perform a backtracking depth first search that keeps track of the paths that contain a pair with a contract final state and returns true iff one of these paths lead back to s. Algorithm 1, implements the strategy we just outlined. Lines 1 through 6 initialize the sets of visited pairs for the depth first visit, and the global variable that will contain true iff a simultaneous lasso path is found. It then invokes the procedure visit which recursively generates all prefixes. Lines 9 and 10 verify that we are visiting every pair once. Lines 11 through 14 test if the current pair is a seed (qi is a final state). If this is the case, the data structure visited2 is initialized and the procedure cycle search is invoked. If it returns successfully, the procedure terminates. Lines 15-18 generate all possible successor pairs by inspecting the transitions in the contract BA (line 15) and in the query BA (line 16). A recursive call to visit is then issued for each such pair. The predicate compatible at line

8

16 (and also later at line 32) implements condition 3 of Definition 9. The procedure cycle search accesses the visited2 variable which stores the path it is currently exploring. The variable f oundF inal2 is true if the current path contains a pair with a contract final state. Lines 23 and 24, avoid creating paths that contain cycles.Lines 25 and 26, generate all possible successor pairs. Line 27 and 28, update the global variable f ound if a cycle is present which contains a pair with a contract final state. If this is not the case, line 29 and 30 recursively continue to extend the current path.Line 31 backtracks to the previous explored path by removing the current pair from the visited2 variable. We can now state formally: Theorem 2 Algorithm 1 returns true iff C(ϕ) permits ψ. Note that the procedure cycle search can exploit a simple memoization scheme which stores for every visited pair a boolean variable, which is true if that pair can eventually lead to the original seed. In this way, we can code the whole procedure as a depth first visit, never visiting any pair more than once. 2.4.2

Seeds optimization

Algorithm 1 starts a nested search for a cycle in every reachable pair hsi , qi i s.t. qi is a query final state. As pointed out in Section 2.4.1, we know that in order to build a simultaneous lasso path we need a cycle, which has to contain a pair with a contract final state. It should be easy to see that if the contract state si of the seeding pair is not part, in the contract BA, of a cycle containing a final state, it is impossible to build a simultaneous lasso path cycle that contains the seed hsi , qi i. This is trivially true because, by Definition 9, hsi ii≥0 has to form a lasso path for the contract BA. In some cases we can avoid many cycle searches by knowing in advance which contract states are actually part of a cycle containing a final state. We can thus modify our notion of seed, by considering seeds all and only the pairs hsi , qi i so that qi is a query final state and si is contained in a cycle that contains a final state in the contract BA. Our system precomputes all contract states with this property at registration time and Algorithm 1 is easily modified to exploit this information by modifying line 11, which will check if the current pair satisfies the updated definition of seed. 2.4.3

Complexity

As we saw in §2.4.1, the search for a simultaneous lasso path can be coded with a simple backtracking search on the two BAs, which results in a LOGSPACE complexity in the size of the BAs. The size of BAs, however, is known to be worst case exponential in the size of the original clauses (i.e. contract specifications and queries) [24]. This results in a PSPACE upper bound for the permission decision problem. With a reduction from LTL satisfiability [21] we can prove that the problem is PSPACE-hard, which gives us the following complexity result: Theorem 3 Deciding permission is PSPACE-complete. This result is not completely unexpected because it is the same complexity class of the model checking problem. The algorithm used for model checking checks the non-emptiness of the intersection of the sets of runs. In our case, we need to check the existence of a whole projection class in the intersection, however we are able to prove that it can be done using the same basic technical idea, originated in the model checking context, of looking for a lasso path [23].

2.5

High level architecture

The algorithm in §2.4.1 suggests a straightforward implementation. At registration time, a contract LTL specification is translated to an equivalent BA representation that is the one stored. In our prototype we use an existing tool [10] to perform this stage. At query time, the query in LTL form is converted to an equivalent BA in the same way and then checked (with the algorithm described in §2.4) against all the contract BAs in the database. Given the high complexity of the permission problem, however, such an approach is impractical. We expect our contract database to be fairly static and that each contract will be queried multiple times. So all our techniques share the common idea of precomputing some auxiliary information during 9

init

purchase

init

state_1

purchase

dateChange use OR missedFlight

state_2

d a t e C h a n g e useFirst

state_2

state_3

use OR missedFlight

useFirst

state_4

askRefund useSecond refund

state_5

state_3

(a)

state_6

(b)

init dateChange state_1

*

init

miss

state_1

*

* flightCanceled c h a n g e A p p r o v e d

dateChange state_2

*

use state_3

state_2

*

*

state_3

*

requestChange

requestChange

(c)

changeApproved

state_4

*

(d)

Figure 2: B¨ uchi automata for (a) a ticket that allows no refunds and date changes only before the scheduled departure (Ticket C of Example 2), (b) a round-trip ticket allowing a single change before the first flight and a refund for the second one, (c) a query asking for two date changes, (d) a query asking for tickets that can be changed indefinitely even after a flight is canceled or it is missed and already rescheduled once. Double circled states are final. Note: BA (a) and (b) follow the same convention as Figure 1a. the registration step, in order to achieve faster online query processing. In this paper, we describe two optimization techniques.

3

Prefiltering optimization

The technique that we present in this section reduces the number of executions of the verification algorithm by pruning the number of contracts to test. At registration time, an index data structure is updated with information about the contract. At query time, the index is used to identify a set of contracts which is guaranteed to contain all contracts permitting the query. The permission algorithm can then run only on these contracts (candidates). At a high level, the technique extracts a necessary condition for permission from the query BA (called pruning condition), it then uses the index data structure to quickly extract the set of contracts satisfying that condition (i.e. candidates).

3.1

Pruning Conditions

Example 6 Let us consider the query BA in Figure 2c. Since permission is checked looking for a simultaneous lasso path, it is clear that, in order to permit this query, a contract BA must have some transitions that are ‘compatible’ (as in Definition 2, point 3.) with both formulas: dateChange and use. Otherwise, no simultaneous lasso path can be built. While this condition is necessary, it is clearly not sufficient: the contract in Figure 2a has such transitions but does not permit the stated query. We now generalize the idea of Example 6. From Theorem 1 we know that we have to build a simultaneous lasso path. This means that, by definition, a lasso path has to exist in the query BA. We also know that all transitions in this query lasso path have to be ‘compatible’ to their simultaneous

10

transitions in the contract BA. It follows that in the contract BA, at the very least, there has to exist one compatible label for every label in the query lasso path. Using this intuition, it is clear that if we find all lasso paths in the query BA, we can create a set of candidates by gathering all contracts whose BAs contain a compatible label for every label of a query lasso path. The problem is that the number of query lasso paths is exponential in the size of the query automaton. However, we never explicitly compute all lasso paths, as we can extract a more compact representation directly from the query BA. In order to explain this, we will assume to have a data structure that allows us to obtain S(λ), the set of contracts in the repository that contain a label compatible with a query BA label λ. We will describe the implementation of this data structure in §3.2. Example 7 Let us consider the query BA in Figure 2d. It only contains a single final state state 2. In order to build a lasso path knotted in state 2, we need to have both a prefix and a cycle. By visual inspection, there are only two possible prefixes that are simple paths (i.e. do not contain cycles): one with label f lightCanceled, the other with miss and changeApproved. We do not consider the self-loops in init and state 1 (and any other cycle) because their labels are not strictly necessary to build a prefix, so we cannot exclude any contract for not having them. Knowing all the possible prefixes, it follows that all the contracts that contain labels compatible with these prefixes are in the following set: S(f lightCancelled) ∪ (S(miss) ∩ S(changeApproved)) The idea is that this set will be smaller than the whole database. We can also prune other contracts by considering the cycles of the query lasso paths knotted in state 2. Since the transition from state 2 to state 3 is labeled with true, it provides no way to prune any service. It follows that in order to go from state 2 to state 4 we can only force the presence of a transition compatible with requestChange. In order to reach back state 2 we need changeApproved. Adding this additional condition to the previous one (i.e. intersecting the two sets), we know that we need to run the permission algorithm only for contracts in the following set: (S(f lightCancelled) ∪ (S(miss) ∩ S(changeApproved))) ∩ (S(requestChange) ∩ S(changeApproved)) In Example 7 we generated a set of candidates from a condition derived from the analysis of lasso paths knotted in state 2. We call conditions of this kind lasso pruning conditions. Intuitively, a lasso pruning condition for a knot k selects all contracts that permit the query with a simultaneous lasso path whose query lasso is knotted in k. Formally, we call pruning condition for a query BA A, a formula containing union and intersection operations and in which the elements are only S(λ), where λ is a disjunction-free propositional formula using variables in the query LTL formula. Definition 3 Given a CDB and a query q whose BA is A, a lasso pruning condition for f ∈ FA is a pruning condition evaluating to a set containing every contract s ∈ CDB s.t. there exists a simultaneous lasso path hsi , qi ii≥0 for s and q knotted in f . Many heuristics can be used to build a lasso pruning condition (e.g. considering only the transitions entering the knot). In Example 7, the lasso pruning condition is formed by a first part which considers the possible prefixes of the query lasso paths, and a second one which considers the possible cycles. It should be clear that this is a valid lasso pruning condition and that it is always possible to build such conditions (i.e. trivially, enumerating all lasso paths). Our specific implementation, similar to the one in Example 7, is detailed later. In Example 7, state 2 is the only final state, so its lasso pruning condition selects all contracts in the answer. In general, a query BA contains more than one final state.Since any simultaneous lasso path is enough to guarantee permission, we have to take the union of the sets retrieved by the lasso pruning conditions of every final state. Theorem 4 Given a CDB and a query q whose BA is A, let lf be a lasso pruning condition for f ∈ FA . The union of lf for all f ∈ FA is a superset of the result of q on CDB.

11

Algorithm 2: Implemented algorithm to compute pruning conditions for query BAs Input: A = {QA , iA , δA , FA } is the query BA Output: returns a pruning condition for A 1 begin 2 condition ⇐ true 3 C ⇐ connected components(A) 4 foreach c ∈ C do 5 foreach final state t ∈ c do 6 current path ⇐ ∅ 7 condition ⇐ condition ∨(( cycle condition(t, c) ) ∧ ( compute path from init(t) )) 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28

return condition end function cycle condition(t, c): DNF formula begin cycle ⇐ true foreach hi, λ, ti ∈ c do cycle ⇐ cycle ∨λ return cycle end function compute path from init(s: state): DNF formula begin paths ⇐ true if s = iA ∨ s ∈ current path then return paths current path ⇐ current path ∪{s} foreach hi, λ, si ∈ δA do ψ ⇐ compute path W from init (i) foreach χi with i χ = ψ do paths ⇐ paths ∨(χi ∧ λ) current path ⇐ current path \{s} return paths end

The algorithm that we use to compute the lasso pruning conditions computes an approximated version of the ones presented in Example 7. Given a final state f , it does not consider the possible cycles, but only the possible incoming transitions of f from inside its strongly connected component (in Example 7, it would be only changeApproved ). This is an approximation because those incoming transitions might not all be part of a cycle, moreover, we are not considering the rest of the cycles. However, it is much faster and in experimental results it has nearly the same number of false positives as the complete pruning conditions. A reference implementation of this pruning condition algorithm is shown in Algorithm 2. Lines 3-5 simply cycle through all the final states after having identified the connected components of the query BA. As already explained pruning conditions are disjunctions of lasso pruning conditions knotted on final states. Lines 7 generates this disjunction after having created the lasso pruning condition for the current final state t. Lasso pruning condition are formed by a condition derived from the possible paths from the initial state to t in conjunction with the condition on the cycle containing t. We generate these two conditions with the two functions cycle condition and compute path from init. It is immediate to see that the cycle condition computed in lines 12-15 simply takes the disjunctions of the incoming edges of the final state in its connected component. Note that we can consider only the connected component because we know that any cycle has to be contained in a connected component. The function compute path from init generates all backward paths from state s to the intial state, and generates a condition that is the disjunction of the conjunctions of all labels in every such paths. For clarity we show an exponential algorithm. In our implementation we exploit a memoization scheme that stores for

12

each state s the pruning condition generated from the possible paths from the init to s, which gives a linear time algorithm (w.r.t. the size of the BA). Moreover, we remove redundant conditions during the computation in order to minimize the number of conditions we have to iterate in lines 24-25.

3.2

Index data structure

In §3.1, we assumed a way to easily compute S(λ), i.e. the set of contracts in the repository that contain a label compatible (point 3 of Def.2) with a query BA label λ. We will now introduce the index data structure that we use in order to perform this operation. At a high level, we have to find a way to identify labels that are compatible with λ and then gather all contracts containing them. Assuming that a contract label γ involves all events in λ, it is easy to see that we can simply check if the set of literals in γ contains the ones in λ. However, by definition, we might have to consider not only events in γ but also the ones cited in its contract c. This is the case, for instance, if γ is the boolean formula ref und, in a contract that cites both ref und and dateChange. We have then that both labels, ref und ∧ dateChange and ref und ∧ ¬dateChange are compatible with γ. In order to treat compatibility checks as containment checks, we create a set for γ (called the expansion, E(γ)) that contains all literals in γ plus all literals (both positive and negative) for any remaining event cited in its contract. Example 8 Let us consider a transition label t = p ∧ c from a contract that cites the events p, c and m. Given the above definition we have that its expansion E(p ∧ c) = {p, c, m, ¬m}. We have that a query transition q = p ∧ m is compatible with t, while q 0 = p ∧ ¬c and q 00 = c ∧ r are not. Using expansions to check compatibility, our prefiltering index data structure is conceptually a map of sets of contracts, indexed by expansion. The idea is to store for each expansion the set of contract BAs in which there exists a label that generates that expansion. Having this data structure, it is immediate that, given an input formula λ, we can compute S(λ) by taking the union of the contract sets associated to the expansions compatible with λ. Example 9 We consider the two contracts in Figure 1a (Ticket A) and 2a (Ticket C), called A and C. The following prefiltering index contains all the expansions of the labels of both contracts along with the list of contracts in which a particular expansion appear. For readability: p = purchase, c = dateChange, m = missedFlight, u = use, r = refund. {p,¬c,¬m,¬u,¬r} {¬p,c,¬m,¬u,¬r} {¬p,¬c,m,¬u,¬r}

{A,C} {A,C} {A,C}

{¬p,¬c,¬m,u,¬r} {¬p,¬c,¬m,¬u,r}

{A,C} {A }

Note that in this example all the labels in the contract automata contained already all the events, but this is not always the case. If we consider the query in Figure 1b, it is easy to see that a lasso pruning condition will be S(m)∩S(r). Trivially, S(m) evaluates to {A, C} and S(r) to {A}. The whole condition selects the candidate {A}. Using the prefiltering technique, we avoid the execution of the permission checking algorithm on contract C. In order to implement and scale this index we face two problems. The first is that we want to identify all compatible expansions without a full index scan of the index. We solve this problem using a simple hierarchical data structure that enables access to any expansion set in a number of steps bounded by its size. An example of this structure is shown in Figure 3. The expansions are arranged in a graph so that if the literals in expansion a contain literals in expansion b an edge hb, ai is added to the graph. This structure has two important properties. The first is that it can be incrementally maintained (adding and removing expansions) with local modification to the graph structure. The second is that any expansion can be accessed in a number of steps that is linear to the size of the expansion. We achieve this second result by maintaining a direct list of all linked nodes. Note that this list is bounded by the number of possible literals in the system, which is usually contained. The second, and more important, scalability problem is due to the fact that the number of possible expansions is exponential in the size of the vocabulary. This means that the index grows exponentially in 13

{}

{!m,c}

{!m}

{c}

{m}

{!c}

{!m,m}

{m,c}

{!m,!c}

{!c,c}

{!m,m,c}

{!m,!c,c}

{!m,m,!c}

{m,!c}

{m,!c,c}

Figure 3: Hierarchical structure of a prefilter index with a vocabulary of two variables (m and c) and bounded to expansion sets of size 3. The nodes are labeled with the expansion sets and contain the sets of contracts with at least transition that generate it. Note that the structure does not need to contain all possible expansions like in this example. size, leading to intractability. In order to avoid this exponential blow up, we create a data structure that, while maintaining the soundness and completeness of the technique, returns for every label λ a superset S 0 (λ) of S(λ). This modification does not affect the soundness and completeness of the technique because pruning conditions needs to evaluate to a set that just contains a certain set of contracts (§3.1). Since conditions use only union and intersection operators, they are monotonic. This means that the fact that the S 0 () returned by the data structure are supersets of the required S() results in a set of candidates that is a superset of the one returned if we were using the correct S(). Since all candidates will be checked with our permission algorithm and we are returning a superset of the original set of candidates, soundness and completeness are preserved. This data structure avoids the exponential blow up by limiting the size of the expansions in the index structure to a predefined number k. The set of contracts associated to any expansion e of size k will then contain all contracts whose BA representations contains a label whose expansion contains e. In this way, when retrieving S 0 (λ) with |λ| > k, we can just pick any set of contracts (called expansion sets) associated with an expansion that is contained in the set of literals of λ. In order to refine the set of candidates we retrieve, we can take the intersection of many of these expansion sets. In our prototype, we implemented this refinement without any performance impact, since usually query labels are small and do not have a high number of k-subsets.

4

Bisimulation optimization

In this section we describe an optimization technique that is based on the observation that the query usually pertains to a much smaller set of events than the full contract specification. At registration time, we precompute a set of simplified versions of every contract BA. Intuitively, these simplified versions can be thought as ‘projections’ of the full contract specification on a subset of events (i.e. albeit smaller, they faithfully represent the behavior of the contract with respect only to the considered events). At query time, our system retrieves the appropriate precomputed ‘projection’ for each candidate contract that has to be checked for permission.

4.1

Simplified service automaton

We will show now how we are able to use a simplified version in lieu of the full contract BA.First we define a notion of equivalence with respect to a query q: two contracts BA A and B are equivalent if they both either permit or do not permit q. From Theorem 1, it follows that this is the case if and only if we can build simultaneous lasso paths with the query BA for both A and B.

14

init

!d

state_2

!d state_3

init

!d

state_4 !d

!d

!d

state_2

!d

!d

!d state_5

!d

!d

state_3

!d !d

!d state_6

!d

state_4

!d

!d

!d

(a)

state_6

!d

(b)

Figure 4: For readability: d = dateChange. (a) projection of BA in Figure 2b on ¬ dateChange. (b) simplified version of (a). Remember that in order to build a simultaneous lasso path we need two lasso paths (one for the contract BA, one for the query BA) so that in every instant their transitions have ‘compatible’ labels (point 3 of Definition 2). It follows that the only information that we really need in order to find this simultaneous lasso path is if a contract label is ‘compatible’ with a query label. For instance, given the definition of compatibility, if a query BA has transitions that contain only the labels true and dateChange, the only information we need to derive compatibility (besides knowing all events in the contract) is the presence of ¬dateChange in the contract transitions. It should be intuitive now why it is useful to introduce the concept of projection. Given a contract BA A and a set of literals L, we call projection of A on L (πL (A)) the BA built from A by keeping in every label only literals in L, e.g. Figure 4a represents the projection BA on ¬dateChange of BA in Figure 2b. This results in the following theorem, formalized in §C.1: Theorem 5 Given a BA A and a set of literals L, the projection of A on E is equivalent to A for all query BAs that refer to literals that are either negations of literals in L or pertaining to events not in cited in A. The use of projections alone does not improve the performance of the permission checking algorithm as the size of the projection BA is the same as the original one. Projections are, however, simpler BAs than their original counterparts in the sense that many transitions that had different labels have now the same label. This provides the opportunity to use a standard state reduction technique [9] that produces a BA that is equivalent to the original one (i.e. it accepts the same set of strings). This classical polynomial-time technique is based on collapsing bisimilar states and is the same used for the minimization of standard finite state automata [13]. In the case of BAs it does not output the minimum BA (BA minimization is PSPACE-hard [22]), it does provide, however, significant reductions in the number of states of our projections. This allows us to equivalently check our query on a contract BA that is a projection of our original one but that has significatly fewer states. Example 10 In Figure 4a we show a BA derived from a projection, where we have two states (state 4 and state 5) that can be collapsed. Intuitively, once the recognizing BA reaches either of those states, it will accept the exact same set of sequences, i.e. an infinite sequence of snapshots containing ¬dateChange. Its simplification is shown in 4b.

4.2

Using simplified BAs

The previous section details how we can equivalently use simplified projections to check permission. As we already anticipated at the beginning of the section, for every contract we precompute the projections at registration time, so that when our system has to check permission of a query, it can use the precomputed simplified projection. Clearly, the most appropriate projection is the smallest one that is still equivalent to the original BA for the current query.

15

Given the result in Theorem 5, it is easy to see that in order to have the smallest projection for every possible query, we need to have one projection for every subset of literals cited in the contract. The problem that immediately arises is the fact that the number of possible projections is, in the worst case, exponential in the size of the set of events of the contract. We have, however, two important observations that enable us to pursue the precomputation route. The first observation is that we do not necessarily need to precompute all projections, as we can use any projection that contains all required literals for the query, albeit with a possible perfomance penalty. The second observation is that, in practice, not all subsets of literals generate distinct simplified projection BAs, e.g. in our datasets the number of distinct simplified projection BAs was ∼5% of the number of subsets. Intuitively, removing a single event from a projection does not usually lead to a simplification of the observed behavior of the others. It follows that it is often the case that the simplifications are triggered only when whole sets of ‘independent’ events are excluded from the projection. Both these observations provide us great freedom in designing a system that balances runtime performance with precomputation time and storing needs. Storing needs are quite contained, since projections can be represented as partitions of collapsed states of the original BA. To address precomputation time, we designed a novel algorithm that efficiently computes all simplifications in parallel. We could not use the standard bisimulation algorithm (Paige and Tarjan [18]), as we would have had to run it for every subset of literals. Our algorithm avoids testing subsets that are not going to result in simplifications and reuses partial computations between different subsets. We describe this algorithm in §4.3. In our experiments we found it feasible to precompute all projections. However, in case contract complexity grows so much that full precomputation cannot be pursued, other possible approaches include the precomputation of subsets up to a certain size (queries are usually smaller than contracts and contain less literals) or the use of heuristics based on historical data and/or expected workloads to determine which simplification to precompute.

4.3

Computing all simplified BAs in parallel

First of all, notice that the bisimulation is an equivalence relation, so its computation can be thought as the partitioning of the set of nodes into subsets that contains only bisimilar nodes. Now we can state more formally that we want to compute for every service automaton a correspondence P between sets of literals and partitions. For every pair hL, ρi ∈ P , we have that ρ is a partition of the nodes in the service automaton that represents the bisimulation relation of the relevant service automaton w.r.t. the literals in L. In order to describe the main idea we briefly introduce the basic concepts behind the Paige and Tarjan algorithm. Considering a bisimulation relation that ignores the distinction between final and non-final states (i.e. point 1 of Definition 11), the basic algorithm proceeds by successively refining an initial partition that contains the set of all states. We say that a partition ρ is a refinement of ρ0 iff for every set B ∈ ρ there exists B 0 ∈ ρ0 s.t. B ∈ B 0 . In the algorithm, the refining is based on the concept of stability of a partition. Informally, a partition ρ is stable w.r.t. a set B ∈ ρ, if for all sets X ∈ ρ (and for all labels a), either all states in X lead to B (i.e. there exists a transition with label a to a node in B) or none of them. An operator can be designed that iteratively splits all sets X ∈ ρ w.r.t. a B ∈ rho in order to force the stability property. The paper [8] proves that applying this operator for all B ∈ rho defines a compound operator Φ whose fixpoint yields the coarsest partition of bisimilar states (i.e. the coarsest refinement of the initial partition that contains sets of bisimilar nodes). This means that a simplified version of the algorithm iteratively applies this splitting operator, generating a sequence of partitions that terminates with the coarsest one. Our case is a little different since we have to consider final states, so we use an initial partition containing a set with all final states and a set with all remaining ones (analogous to Hopcroft [13]). However, this does not change the properties of the algorithm. We exploit the following property. Theorem 6 Let L be a set of literals and ρL the coarsest bisimulation partition for it, we have that if L0 ⊇ L, then ρL0 is a refinement of ρL Intuitively, adding literals to the considered set will produce a more refined partition, until we consider all literals in the automaton which will produce the partition containing only singleton sets (assuming 16

the starting automaton does not contain redundant states). This property can be exploited by computing first the refinements of smaller subsets literals, since we know that all containing sets of literals will produce partitions that are refinements of those previously computed. In this way, we avoid repeating the same refinement steps multiple times. Another optimization comes from the fact that during the splitting process we generate and consider only those subsets of literals that could result in a split. This is done in the splitting operator by analyzing the labels of the transitions from the nodes in X to those in B (i.e. if a literal does not appear in any transition, it is useless to test sets of literals containing it). Usually, we do not have to test all subsets of literals for all sets.

5 5.1

Experimental evaluation Prototype Architecture

Our prototype is written in Java and consists of four independent modules. The first is the data generator and produces databases of LTL formulae as described in §5.2. It also converts the formulae to BAs using the freely available library LTL2BA [10]. Its output is a database text file. The second module generates the prefiltering index from a database text file. The third precomputes the simplified BAs from a database text file. The fourth module is the runtime module. It uses the prefiltering index and the simplified BAs in order to evaluate queries. It takes as input a query workload dataset text file and outputs statistics regarding their evaluation. We executed all code on an AMD Phenom 2.2 GHz with 4Gb of RAM and we used Sun 32bit JVM 1.6.0.10. The code does not utilize multithreading in order to isolate the technique performance from system level effects.

5.2

Data generation

To avoid a completely random generation of LTL formulae, we adopted a sophisticated generation method based on real-life usage patterns of temporal properties ([7]). The work in [7] surveys over 500 reallife specifications of temporal properties to be formally verified on finite state systems (theoretically equivalent to our contracts) and extracts recurrently appearing patterns that cover over 92% of the surveyed cases, along with their appearing frequencies. While the specifications of these systems come from different application domains (e.g. communication protocols, GUIs, distributed object systems, operating systems, databases), we contend that the sort of properties that form these specifications are very similar to the ones found in service contracts, as they are effectively describing the allowed temporal sequences of interactions between multiple actors. In [7], patterns are classified along two dimensions: required behavior and scope (i.e. temporal interval in which the behavior has to show itself). The possible scopes include global (the whole timeline), before (up to certain event), after (after a given event), between (any part of the timeline between specified event p and event q). The possible behaviors are: Absence An event does not occur in the scope. Existence A given event must occur within a scope. A variation can force the occurrence number to k. Universality A given event occurs throughout the scope. Precedence An event p must always be preceded by an event q. A variation can consider sets of events instead of p and q. Response An event p must always be followed by an event q. A variation can consider sets of events instead of p and q. As an example, we show in Table 1 (from [7]) the LTL expressions for the cited scopes. Contracts and queries are generated as conjunctions of clauses expressed as LTL properties. Our generator randomly generates LTL properties using the distribution reported in [7]; at each generation it substitutes the variable placeholders (e.g. p, q) with variables from the common vocabulary. Given a parameter n, it generates a contract or query specification formed by the conjunction of n properties thus generated. 17

Scope Global Before r After q Between q and r Global Before r After q Between q and r Global Before r After q Between q and r Global Before r After q Between q and r Global Before r After q Between q and r

Table 1: LTL patterns, from [7]. Description Absence G(¬p) p is never true Fr → (¬pUr) p is never true before r G(q → G(¬p)) p is never true after q G((q ∧ ¬r ∧ Fr) → (¬pUr) p is never true between q and r Existence F (p) p is eventually true ¬rW(p ∧ ¬r) p is true some time before r G(¬q) ∨ F(q ∧ Fp) p is true some time after q G(q ∧ ¬r → (¬rW(p ∧ ¬r))) p is true some time between q and r Universality Gp p is always true Fr → (pUr) p is true in every instant before r G((q ∧ ¬r ∧ Fr) → (pUq)) p is true in every instant after q G((q ∧ ¬r ∧ Fr) → (pUr)) p is true in any instant between q and r Precedence Fp → (¬pU(s ∨ G(¬p))) s precedes p at any time Fr → (¬pU(s ∨ r)) if p happens before r, s precedes p G(¬q) ∨ F(q ∧ (¬pU(s ∨ G(¬p)))) if p happens after r, s precedes p and follows q G((q ∧ ¬r ∧ Fr) → (¬pU(s ∨ r))) s predeces p, both events between q and r Response G(p → Fs) if p is true, s will follow Fr → (p → (¬rU(s ∧ ¬r)))Ur if p is true before r, s will follow p and preceed r G(q → G(p → Fs)) if p is true after q, s will follow p G((q ∧ ¬r ∧ Fr) → (p → (¬U(s ∧ s follows p, between q and r ¬r)))Ur) LTL pattern

Example 11 We report here one sample contract from our database of simple contracts. As already said, these contracts are generated by taking the conjunction of randomly generated temporal properties: P1 (G((p14 ∧ ¬p1 ∧ Fp1) → (p17Up1)))∧ P2 (G(¬p2))∧ P3 (¬p7U(p16 ∨ G¬p7))∧ P4 (G(p19 ∧ ¬p12 → ((p6 → (¬p12U(p1 ∧ ¬p12)))U(p12 ∨ G(p6 → (¬p12U(p1 ∧ ¬p12)))))))∧ P5 (G(p20 → Fp17)) Every property is generated from a pattern described in [7], from a vocabulary of 20 events: P1 This property forces the existence of p17 between p14 and p1. In our airline scenario it could force the eligibility of a partial refund only between the first leg use and the second one. P2 This property forces the absence of p2. In our scenario, it could be that this contract does not allow class upgrades. P3 This property forces the existence of p16 before p7. In our scenario P4 This property forces the response of p1 to p6 (i.e. if p1 happens p6 is bound to eventually happen), after p19 and until p12. In our scenario, a contract could postulate that a customer will be refunded if she asks for it but only if it happens after a flight is cancelled and before she is scheduled on another flight. 18

Dataset

size

Simple services Medium services Complex services Simple queries Medium queries Complex queries

3000 1000 1000 100 100 100

Table 2: Datasets statistics #LTL #states #states patterns avg stddev 5 30.94 34.24 6 41.82 43.23 7 50.85 47.5 1 2.31 1.41 2 5.44 4.81 3 9.6 11.11

#trans. avg 622.48 964.69 1291.63 5.2 23.86 92.84

#trans. stddev 1245.77 1628.66 1904.82 5.4 33.18 203.42

P5 This property forces the response of p17 to p20. In our scenario, this could model the fact that when a ticket is purchased it will eventually expire. Given the description of some random generated properties, we want to make two observations. First, the properties are often related between each other as some variables appear in multiple statements (e.g. p17 in P1 and P5). This creates complex interactions that model the fact that a contract is not a series of independent clauses. Second, the temporal patterns represents our contract fairly well. Indeed, if we look at the specification of the tickets in Example 12 in §A (i.e. the LTL formalization of the specification in Example 2) we can see that the properties that we use fall in the patterns that we use in our random generator. For instance, Ticket A property 1, is a simple absence pattern with an ‘after’ scope (i.e. ‘a date change cannot happen after a missed flight’). Note that some properties (e.g. C4 and C5) might be a conjunction of pattern properties. In the particular case of C4 and C5, they are the conjunction of absence patterns with ‘after’ scope. Although we believe that a real contract specification database would have been the best test case for our techniques, given these observations we contend that our random generation method produces fairly realistic test data and allows us to test our system on database sizes hard to obtain in practice. Table 2 shows the statistics of our generated datasets (i.e. contract databases and query workloads). Along with the name of the dataset we report its size (i.e. number of contracts or queries) and the number of LTL properties that form each contract or query. Since we found that the actual complexity of the LTL formulae is better characterized by the statistics of their associated BAs, we also report the average number of states of the BAs in the specification and its standard deviation along with the average number of transitions with its standard deviation.

5.3

Experiments

We measure the running times of query evaluation for the unoptimized system (§2.5) and for the system using both the prefiltering and the bisimulation techniques. The query evaluation time for the unoptimized system is the sum of the conversion time of the query from LTL to BA plus the evaluation on the contract database stored as set of contract BAs. The query evaluation time for the optimized system include the query conversion time plus the time used to evaluate the query using our precomputed data (i.e. prefilter index, simplified BAs) in addition to the contract BAs database. A first batch of measurements was aimed to prove scalability w.r.t. the number of contracts in the database. We evaluated all queries (all complexity levels) on databases of sizes 100 to 3000 of simple contracts (subsets of the Simple Contract database in Table 2). For every contract database size, we recorded the average query evaluation time for the unoptimized and optimized systems. To some readers, these databases might seem too small for practical use. Recall, however, that a complete implementation would use relational attributes (e.g. travel date and destination) to select the contracts that have to be checked for permission from a potentially much larger database. A second batch of measurements was aimed at scalability w.r.t. complexities of both queries and contract specification. We evaluated individually every query database against contract databases of size 1000 (simple, medium, complex). Again, for every combination of queries and contract database complexities we recorded the query evaluation time for the unoptimized and optimized systems.

19

80 speedup w.r.t. no optimization

1e+07

avg speedup (with stdev) avg scan time avg optimized time

1e+06

70

100000

60

10000

50

1000

40

100

30

10

20

1

10

0.1

0

running time (ms)

90

0.01 100

500 1000 2000 number of services in database

3000

Figure 5: Average speedup and running times (unoptimized and optimized) w.r.t. database sizes (small contracts, average of all query complexities)

5.4

Results

First batch - Scaling w.r.t. database size. In Figure 5 the two lines at the top of the graph represent the average running time in the case of the unoptimized approach, called scan time, and the one of the optimized system, called optimized time. The times scale is on the right y axis and it is in milliseconds and logscale. The bar chart on the same picture shows the average speedup (along with standard deviation bars) achieved by the optimized system compared to the unoptimized one (the speedup scale is on the left y axis). Notice that the running times for both evaluations scale nearly linearly w.r.t. the size of the database. The unoptimized time ranges from 2sec for a small 100 contracts database to 100sec in the case of a 3000 contracts database. These times would limit the usage of our approach in interactive systems. Our optimization techniques, however, achieve an average speedup of at least 20 (scaling nicely for larger databases), reducing the average running time in the case of the 3000 contracts database to few seconds instead of minutes. Note that the speedups increase together with the size of the database, a common effect of indexing schemes. Finally, we rarely have speedups less than 10. Second batch - Scaling w.r.t. complexity. Figure 6 reports the results for the scalabitility w.r.t. query and contract complexity. The bars represent the average speedups (along with standard deviation bars) of the optimized system compared to the unoptimized one w.r.t. complexities of both contracts in the database and queries. We explain the reduced speedup of complex queries noting that, using more variables, they cannot take advantage of the most simplified contract BAs. The bisimulation technique is also responsible for the increasing speedup w.r.t. complexities of the contracts. Usually complex contracts cite more variables and with the bisimulation technique the algorithm can ignore the behavior of all variables that are not used in the query. This keeps the complexity down to the minimum required to evaluate the query. Index building and size. For our optimized evaluation, each contract database required the precomputation of two data structures. The computation of the prefiltering index (§3.2) lasted less than 25 mins for our biggest database of 3000 contracts. The average insertion time was around 500ms. The size is also minimal as our biggest prefilter index (3000 contracts) was ∼10Mb. The computation of the simplified BAs is more expensive as we compute all possible projections. The workload is inherently parallel as each contract is simplified independently. For our 3000 contract database the average insertion time of a contract is 42secs, and the total computation of our biggest datasets took 11 hours on our workstation using three cores. We want to note that our technique does not require this massive precomputation to be effective, as motivated in §4.2. The size of the simplified

20

speedup w.r.t. no optimization

100

Simple Queries Medium Queries Complex Queries

80

60

40

20

0 Simple Services

Medium Services

Complex Services

Figure 6: Average speedup w.r.t. contracts and query complexities (database size = 1000 contracts) BAs data is in average around 80% of the database original size and in absolute terms quite contained: our biggest database of 3000 contracts had a combined size (database plus simplified BAs) of 112Mb.

6

Related work

To the best of our knowledge, this is the first work that deals with the discovery of contracts based on their temporal behavior. Previous works in electronic commerce and enterprise computing have dealt with the problem of managing contracts. The field of e-contracting [1] deals with the automation of contracting practices and it is mainly tailored for the B2B case. The direction that is most relevant to the present work is the verification and monitoring of real world contracts. As explained in [16], contracts are usually represented using rules written in some form of temporal and/or deontic logic (i.e. stating permissions, obligations and prohibitions), [14], [17]. Both [14],[17] perform verification on contracts specified with deontic constraints. These works focus on the internal consistency of contracts, leading to the need of more expressive specification language. Queries on a database of contracts are never considered and we are not aware of any contract indexing scheme. Moreover, in [14] the higher order logic used to represent contracts is internally translated to B¨ uchi Automata, opening the way to the application of our optimization techniques also for such higher order formalisms. Web Service search exhibits many similarities with our problem. Web Services are complex pieces of software that expose their functionality on the web. Many works have addressed this problem. Some of them do not consider the temporal behavior, focusing on search in their annotated natural language description ([6]), hence are not applicable to our context. In [11], a method based on graph matching is proposed in order to implement behavioral similarity search of web services. This system returns approximated results, expecting a domain-knowledgeable user to read the actual specification of the returned contracts. Our system, on the other hand, implements an exact query evaluation algorithm, which avoid this. In [20], web services are modeled as directed graphs with nodes representing either actions or messages. An exact evaluation algorithm is proposed to query for graph patterns in service specifications. We contend that such a system would not be suited for our context in which both contracts and queries are inherently declarative specifications and not directly related to a procedural implementation. The problem of querying behavioral objects is present also in the area of business process management. In [2], business processes are represented as directed graphs encoding the BPEL specification. An exact query evaluation algorithm is presented that allows to query the structure of the specification graphs 21

using graph queries. The framework is extended in [5], in order to consider some behavioral semantics. The premise of this line of work differs from ours in that business processes are handled as procedural implementations (workflows), while in our work we manage contracts in the form of declarative clauses. Asking for a full workflow registration by the contract provider would significantly hinder registration into the contract database, as contracts specify only some key properties of the workflow. Moreover, a workflow query will likely miss all contracts whose workflows do not match its structure yet are semantically equivalent to one that does. Finally, many techniques we used are inspired by the work done in model checking [23]. Our permission problem is a novel variation on the standard verification of satisfiability of LTL formulae.

7

Conclusions

We present a broker that enables providers to register service contracts and consumers to query them, in both cases based on the contracts’ temporal behavior. Our novel semantics for permission of a query by a contract takes into account the fact that contracts may not mention some of the events of interest to the query. Permission does not reduce to any standard temporal decision problem. We establish the complexity of the permission problem (PSPACE-complete), and design an algorithm for checking it. We show that the theoretical worst case is not an impediment to scalable implementation, presenting two distinct and complementary techniques for querying large collections of service contracts. We evaluate experimentally our implementation, showing that it scales well with both the number and the complexity of the contracts.

References [1] S. Angelov and P. Grefen. The business case for b2b e-contracting. In Proc. of the 6th Int. Conf. on Electronic Commerce, 2004. [2] C. Beeri, A. Eyal, S. Kamenkovich, and T. Milo. Querying business processes. In VLDB ’06, pages 343–354. VLDB Endowment, 2006. [3] J. Bengtsson, K. Larsen, F. Larsson, P. Pettersson, and W. Yi. Uppaal - a tool suite for automatic verification of real-time systems. In Hybrid Systems III. 1996. [4] M. Brambilla, A. Deutsch, L. Sui, and V. Vianu. The role of visual tools in a web application design and verification framework : A visual notation for ltl formulae. Int. Conf. on Web Engineering, 2005. [5] D. Deutch and T. Milo. Querying structural and behavioral properties of business processes. In Database Programming Languages. 2007. [6] X. Dong, A. Halevy, J. Madhavan, E. Nemes, and J. Zhang. Similarity search for web services. In VLDB, 2004. [7] M. B. Dwyer, G. S. Avrunin, and J. C. Corbett. Property specification patterns for finite-state verification. In Workshop on Formal Methods in Software Practice, 1998. [8] J. C. Fernandez. An implementation of an efficient algorithm for bisimulation equivalence. Sci. Comput. Program., 1990. [9] K. Fisler and M. Y. Vardi. Bisimulation minimization in an automata-theoretic verification framework. In Formal Methods in Computer-Aided Design, 1998. [10] P. Gastin and D. Oddoux. Fast ltl to b¨ uchi automata translation. In CAV, 2001. [11] D. Grigori, J. C. Corrales, and M. Bouzeghoub. Behavioral matchmaking for service retrieval. In Int. Conf. on Web Services, ’06. [12] G. J. Holzmann. The model checker spin. IEEE Trans. Soft. Eng., 1997.

22

[13] J. E. Hopcroft. An n log n algorithm for minimizing states in a finite automaton. Technical report, Stanford, CA, USA, 1971. [14] M. Kyas, C. Prisacariu, and G. Schneider. Run-time monitoring of electronic contracts. In ATVA ’08: Automated Technology for Verification and Analysis, 2008. [15] M. R. Lowry. Software construction and analysis tools for future space missions. In European Conf. on Theory and Practice of Software, 2002. [16] O. Marjanovic and Z. Milosevic. Towards formal modeling of e-contracts. In Enterprise Distributed Object Computing, 2001. [17] Z. Milosevic and R. G. Dromey. On expressing and monitoring behaviour in contracts. In EDOC, ’02. [18] R. Paige and R. E. Tarjan. Three partition refinement algorithms. SIAM J. Comput., 1987. [19] A. Pnueli. The temporal logic of programs. In FOCS, 1977. [20] Z. Shen and J. Su. Web service discovery based on behavior signatures. In Int. Conf. on Services Computing, 2005. [21] A. P. Sistla and E. M. Clarke. The complexity of propositional linear temporal logics. J. ACM, 1985. [22] L. J. Stockmeyer and A. R. Meyer. Word problems requiring exponential time. In STOC ’73: ACM Symp. on Theory of Computing. [23] M. Vardi and P. Wolper. An automata-theoretic approach to automatic program verification. In LICS, 1986. [24] P. Wolper, M. Y. Vardi, and P. A. Sistla. Reasoning about infinite computation paths. In FOCS, 1983.

A

Formalization

We consider a common vocabulary of propositional variables V, intuitively associated to the modeled events. LTL is essentially a propositional logic which, in addition to the usual boolean operators, uses some temporal operators with the following intuitive meaning: • Xp, p is true on the next instant • Fp, eventually p is true • pUq, q is eventually true and p is true until q is true • Gp, globally (i.e. in every instant) p is true • pBq, p is true before q is true • pWq, weak until : p is always true or p is true until q is true Note that all temporal operators can be derived from X and U (e.g. Fϕ ≡ true U ϕ, Gϕ ≡ ¬F(¬ϕ), ϕBψ ≡ ¬(¬ϕ U ψ)). Example 12 Let us see how the clauses of the contracts in Example 2 can be expressed in LTL. We assume a vocabulary V containing the variables { purchase, use, missedFlight, refund, dateChange } that are true in the instant in which the event happens.

23

C0 In this case, we force the fact that only one event happens in each instant, G(purchase → ¬use ∧ ¬missedF light ∧ ¬ref und∧ ¬dateChange), G(use → ¬purchase ∧ ¬missedF light ∧ ¬ref und∧ ¬dateChange), ..., C1 G(purchase → X(¬Fpurchase)) C2 G(purchase B (use ∨ missedF light ∨ ref und ∨ dateChange)) C3 G((missedF light → ¬Fuse) W dateChange) C4 G(ref und → ¬F(use ∨ missedF light ∨ ref und ∨ dateChange)) C5 G(use → ¬F(use ∨ missedF light ∨ ref und ∨ dateChange)) Ticket A

1. G(dateChange → ¬Fref und)

2. no clause necessary Ticket B

1. no clause necessary

2. G(missedF light → ¬FdateChange) Ticket C

1. G(¬ref und)

2. G(dateChange → X(¬FdateChange)) 3. G(missedF light → ¬FdateChange) Note that the specifications of the contracts will be the conjunction of all common clauses along with the specific ones (e.g. for Ticket A: C0,C1,C2,C3,C4,C5,TicketA). Also, note that some natural language clauses (e.g. ’Unlimited date changes’) did not require any additional LTL clause, as they are implicit in the LTL semantics. This is because they did not add any additional constraints to what was already specified in the common clauses (e.g. for date changes: C0,C2,C3,C4,C5). The formal counterpart of our intuitive notion of temporal sequence is the run. A run ρ is a function ρ(t) : N → T (V), where N are the natural numbers, T (V) is the set of truth assignments for the variables in V, and t a particular instant. The ’tail’ of the run starting at instant i is written as ρ|i (formally: ρ|i = ρ(t + i)). The semantics of LTL is given inductively for all boolean operators and for the universal temporal operators (X, U). A run ρ satisfies an LTL formula ϕ (i.e. ρ |= ϕ), iff, inductively: • ρ |= p, with p ∈ V, iff p is true in ρ(0) • ρ |= Xϕ, iff ρ|1 |= ϕ • ρ |= φUψ, iff ∃k ≥ 0 s.t. ρ|k |= ψ and ∀0 ≤ i < k(ρ|i |= φ) • closure w.r.t. boolean operators (∧, ¬) Remember from the discussion after Example 3 that our intuitive semantics did not take into account the behavior of events that were not explicitly cited in the contract. In particular, we wanted to avoid cases where the behavior of a variable not cited in the contract specification results in the contract permission of a query, that is not intended to be supported by the contract. In order to formalize this intuition we introduce the following concepts.

24

Definition 4 The projection of a run ρ w.r.t a set of variables V (a V -projection), is a sequence σ that assigns to the variables in V the same truth values assigned by ρ, and does not assign any truth value to other variables. A run ρ is compatible with a V -projection σ, if σ assigns in every instant the same truth values of ρ to all variables in V . The set of all runs compatible with a V -projection is called projection class. Definition 5 Given an LTL formula ϕ, let V be the set of its variables. We call P (ϕ) ( set of allowed projections of a contract specified by ϕ) the set of V -projections of all runs satisfying ϕ. Definition 6 Given an LTL formula ϕ, let V be the set of its variables. The contract defined by ϕ (referred to as C(ϕ)) permits an LTL query ψ iff there exists a V -projection σ ∈ P (ϕ) s.t. all runs compatible with σ satisfy ψ. Intuitively, forcing the property to be true for all runs in a projection class will avoid the fact that a particular behavior of some variables not mentioned in the contract will trigger a false positive permission of a query. For instance, let us consider that a query q is satisfied by a run ρ of a contract specification because of the behavior of variable v. Let us also assume that without the specific behavior of the variable v, q would not be permitted as it is the case in Example 3 for the variable ‘class upgrade’. We note that in the projection class of ρ there will be a run ρ0 that does not exhibit the same behavior for v, as v is not part of the vocabulary of the contract. It follows that q will not be satisfied by all runs in a projection class and thus it will not be permitted by the contract. We can now formally define the result of the contract search problem. Definition 7 Given a contract database CDB (i.e. a set of LTL specifications for service contracts) and an LTL formula φ, the result R of the query φ on CDB is the set of all and only the contracts that permit φ. Note on infinite runs.To some readers, the decision to choose an infinite list of truth assignments to model sequences might seem an unnecessary complication. We want, however, to make two important observations. The first is that using infinite runs does not reduce the expressive power of the formalism, as any finite sequence can be encoded simply appending an infinite list of dummy snapshots. The second is that we need infinite runs to identify a single witness run for every temporal property. Let us assume a standard contract requirement: being able to change the date of a flight an unlimited number of times. Any finite run would not be able to satisfy this property, preventing us from verifying properties with the exhibition of a single satisfying run.

B

Algorithm

It is easily seen that Definition 6 can be rephrased to: Definition 8 A contract C(ϕ) with vocabulary V permits a query ψ iff the intersection of the set of runs satisfying ϕ with the set of runs satisfying ψ, contains a projection class w.r.t. V . This alternative definition is more suited for the design of an algorithm as we could think of computing the intersection of the two sets, and then checking the presence of a projection class in the intersection set. The first part of this algorithm is very similar to the non-empty intersection problem, extensively studied in other domains (e.g. model checking [23]). To solve this problem the theoretical tool of choice are B¨ uchi automata (BA), which have a deeply exploited connection with LTL [24]. Using BAs, we are able to implement the stated algorithm with a single step: scanning the intersection of the two sets of runs and, at the same time, checking the existence of a complete projection class.

B.1

B¨ uchi automata

Formally, a B¨ uchi automaton is a tuple {Q, I, δ, F }, where Q is the set of states, I the initial states and F the final ones. The transition relation is δ ⊆ Q × Σ × Q, where Σ is the set of disjunction-free propositional formula over V (e.g. Figure 1). 25

In order to define the semantics we introduce the concept of path as a sequence of states σ = σ1 , σ2 , ..., linked by transitions in δ (i.e. ∀i(∃λ(hσi , λ, σi+1 i ∈ δ))). A path is called lasso path if it is formed by a finite prefix leading to a final state k (called knot) and by a cycle leading back to k. The cycle is then repeated infinitely many times. Intuitively, a run ρ satisfies a path σ if at any instant i the truth assignment ρ(i) satisfies a formula on the transition between σi and σi+1 . Now we can state the accepting condition: a BA A accepts a run ρ, if ρ satisfies a lasso path of A. We call BA(ϕ) any BA that accepts all and only the runs satisfying an LTL formula ϕ. We assume transition labels of BA(ϕ) contain only variables in ϕ. We consider now the problem of verifying that a contract permits a certain property. They are both specified as LTL formulae, but our approach will handle them in form of BAs. As in §2.4, we formalize this idea by introducing the concept of simultaneous lasso path. Definition 9 Given a contract defined by ϕ (referred to as C(ϕ)), with a set of variables V , and a query ψ. Let A(ϕ) = {QA , IA , δA , FA } be a BA for C(ϕ) and B(ψ) = {QB , IB , δB , FB } a BA for ψ, we define a simultaneous lasso path an infinite sequence of pairs hsi , qi ii≥0 s.t.: 1. hsi ii≥0 form a lasso path for A 2. hqi ii≥0 form a lasso path for B 3. for every i, there exist θi a formula s.t. hsi , θi , si+1 i ∈ δA and τi s.t. hqi , τi , qi+1 i ∈ δB , and we have that θi ∧ τi is satisfiable (i.e. they are not conflicting), and τi contains only variables in V (θi and τi are compatible). The intuition in §2.4 is then formalized as follows: Lemma 1 Let σ be a lasso path of a BA A(ϕ), P (ϕ) be the set of projection classes of the contract specified by ϕ, and ρ be a run in any p ∈ P (ϕ). We have: ρ |= σ → ∀ρ0 ∈ p(ρ0 |= σ) We assume w.l.o.g. that all the formulae in the transition of BA(ϕ) contain only the variables that appear in ϕ. Let us assume by contradiction that there exists a run ρ0 ∈ s and ρ0 2 σ. So there must exists an instant i where it does not exist a hσi , λ, σi+1 i ∈ δ ∧ρ0 (i) |= λ. However, we know that a λ0 exists for that particular instant i s.t. ρ |= λ0 . Since ρ0 is in s, we know that ∀i(∀v ∈ ϕ(ρ(i)(v) = ρ0 (i)(v))). We also assumed that λ0 does not contain any variable not in ϕ. It follows that ρ0 |= λ0 . Contradiction. Theorem 7 Let P (ϕ) the set of projection classes of a contract with variables in V and ψ a query. There exists σ = hsi , qi i a simultaneous lasso path for S(ϕ) and φ iff ∃s ∈ S(ϕ)(∀ρ ∈ s(ρ |= ψ)). ⇒) For every i, we have by definition that there exists a formula θi (hsi , θi , si+1 i ∈ δA ), a τi (hqi , τi , qi+1 i ∈ δB ) and a set of truth assignements Ti s.t. ∀ti ∈ Ti (ti |= θi ∧ ti |= τi ). Let ρ be a run ρ(i) = ti ∈ Ti , picking for each instant any ti ∈ Ti . Let σ be the laso path formed by si . It follows that ρ |= σ. From the result in Lemma 1, we know that ∀ρ0 ∈ s(ρ0 |= σ)). Since they all share the same assignments on the variables in V (by definition of service), it follows (by definition of service-query-lasopath) that ∀ρ0 ∈ s(∀i(ρ0 (i) |= τi )). Let π be the laso path formed by qi , the previous result means that ∀ρ0 ∈ s(ρ0 |= π), which (being π a laso path for B(ψ)) imply that ∀ρ0 ∈ s(ρ0 |= ψ), which is what we wanted to prove. ⇐) We assume that ∃s ∈ S(ϕ)(∀ρ ∈ s(ρ |= ψ)). From Lemma 1, we know that there exists a laso path σ in A(ϕ) s.t. ∀ρ ∈ s(ρ |= σ). Let ρ be a run in s, since ρ |= ψ, we know that there must be a laso path π in B(ψ) s.t. ρ |= π. Let us consider now the service-query-lasopath χ formed by the states of σ and π. We have to prove that χ is a valid service-query-lasopath by proving the third condition of the definition. For every instant i, let Ti = {ρ(i)|ρ ∈ s}. Since ϕ cannot distinguish on variables not in V , it follows that Ti contains all truth assignments that assign the same truth values to the variables in V . Again, we assume w.l.o.g. that all the formulae in the transition of BA(ϕ) contain only the variables that appear in ϕ. Since all ρ ∈ s satisfy ψ, we have that ∀ρ ∈ s(ρ |= π). We also know 26

that there exists a formula θi on a transition from σi to σi+1 in A and a formula τi between πi and πi+1 in B, so that ∀ρ ∈ s(ρ(i) |= θi ∧ ρ(i) |= τi ). It follows, by construction of Ti , that for every i, ∀ti ∈ Ti (ti |= θi ∧ ti |= τi ). This proves that χ is a valid service-query-lasopath. This theorem implies that the permission problem is decidable. This follows from the fact that we can check permission by looking for simultaneous-lassopaths. These are finite objects (completely described by the two prefixes and cycles of the lassopaths for the contract and query BAs) and any pair of BAs has a finite number of distinct simultaneous-lassopaths. It is immediately clear that this identifies a na¨ıve way to look for simultaneous-lassopaths, that consists in exhaustive exploration.

C

Bisimulation Optimization

C.1

Simplified service automaton

We say that a BA A is equivalent to a BA B w.r.t. a query BA Q iff there exists a simultaneous lasso path for A and Q iff there exists one for B and Q. We consider a query BA Q, and let LQ be the set of literals that appear in labels of Q. LnQ is the set that contains the negation of every literal in LQ . Definition 10 We call the relevant BA Ar of a service BA A w.r.t. a query BA Q, a BA that has the same states as A but that for every transition hs, χ, s0 i ∈ δA we have hs, χ0 , s0 i ∈ δAr s.t. χ0 is a conjunction of all the literals in χ that are in LnQ . We have: Theorem 8 Given a contract BA A and a query BA B, there exists a simultaneous lasso path for A and B iff there exists one for its relevant BA Ar and B. ⇒) Let hsi , qi ii≥0 be a simultaneous lasso path for A and B. We argue that it is a simultaneous lasso path also for Ar and B. Clearly hsi ii≥0 and hqi ii≥0 are lasso paths for Ar and B, respectively. Now, let us assume by contradiction that there exists θi s.t. hsi , θi , si+1 i ∈ δA compatible with τi r comatible with τi . By s.t. hqi , τi , qi+1 i ∈ δB but there are no transitions λi s.t. hsi , λi , si+1 i ∈ δA 0 0 r construction, there exists λi s.t. hsi , λi , si+1 i ∈ δA and that contains only the negated query literals of θi . It follows that λ0i is not conflicting with τi (otherwise θi would also be conflicting). We also know that τi does not contain any literal not in the contract vocabulary (otherwise it would not be compatible with θi ). So λ0i is compatible with τi . Contradiction. ⇐) Let hsi , qi ii≥0 be a simultaneous lasso path for Ar and B. We argue that it is a simultaneous lasso path also for A and B. Clearly hsi ii≥0 and hqi ii≥0 are lasso paths for Ar and B, respectively. r Now, let us assume by contradiction that there exists λi s.t. hsi , λi , si+1 i ∈ δA compatible with τi s.t. hqi , τi , qi+1 i ∈ δB but there are no transitions θi s.t. hsi , θi , si+1 i ∈ δA comatible with τi . By construction, there exists θi0 s.t. hsi , θi0 , si+1 i ∈ δA and that contains all and only the negated query literals of λi . It follows that θi0 is not conflicting with τi (otherwise λi would also be conflicting). We also know that τi does not contain any literal not in the contract vocabulary (otherwise it would not be compatible with θi ). So θi0 is compatible with τi . Contradiction. We make use of the bisimulation concept to reduce the number of states. Definition 11 Let a and b be two nodes of a BA A. We say that a ∼ b (read bisimulates) iff: (1) if a ∈ FA then b ∈ FA and vice versa, and (2) for every edge ha, λ, a0 i ∈ δA we have hb, λ, b0 i ∈ δA with a0 = b0 or a0 ∼ b0 , and vice versa. As with the minimization of traditional finite state automata, we create equivalence classes of BA states (called B(A)) w.r.t. bisimulation, and we use them as the states of a new BA. Definition 12 Given a BA A, we call As (its simplification) a BA s.t. (1) QAs = B(A), (2) IAs is any s ∈ B(A) that contains a state in IA , (3) FAs is any s ∈ B(A) that contains only states in FA , (4) let C(a) be the bisimulation equivalence class for the state a, for every ha, λ, a0 i ∈ δA we have hC(a), λ, C(a0 )i ∈ δAs .

27

The simplification of a BA is equivalent to the original one in the following sense: Theorem 9 For every path σ = σ1 , σ2 , ... for a BA A there exists a path σ s = σ1s , σ2s , ... for As and vice versa, where for all i, σis is the bisimulation class of σi . ⇒) Let us assume by contradiction that there exists a λ s.t. hσi , λ, σi+1 i ∈ δA , but there exists no λ0 s i ∈ δAs . This is a contradition with point 4 of Definition 12. s.t. hσis , λ0 , σi+1 ⇐) By induction on the steps of the path on As . Since σ1s is an initial state for As , by definition it contains σ1 the initial state for A. For the induction step we assume to have built a path σ1 , σ2 , ..., σi for A from the first states of s the path σ1s , σ2s , ..., σis for As . Also, since there exists a transition hσis , λ, σi+1 i ∈ δAs , we know, by s 0 s 0 0 0 . Since σi0 ∈ σi+1 construction that there must be a transition hσi , λ, σi+1 i ∈ δA s.t. σi ∈ σi and σi+1 s 00 and σi are in the same bisimulation class σi we know that there exists a transition hσi , λ, σi+1 i ∈ δA s 00 00 s.t. σi+1 ∈ σi + 1 . This means that we can build a path of length i + 1, σ1 , σ2 , ..., σi , σi+1 in As . Using Theorems 8 and 9, we prove now the main result that allows our bisimulation optimization technique. Theorem 10 Given a service BA A and a query BA B, there exists a simultaneous lasso path for A and B iff there exists one for the simplification of its relevant BA (Ar )s and B. ⇒) Let hsi , qi ii≥0 be a simultaneous lasso path for A and B. We argue that there exists a simultaneous lasso path also for (Ar )s and B. We already know from Theorem 8 that there exists a simultaneous lasso path hs0i , qi ii≥0 for Ar and r B. From Theorem 9 we know that we can build a path hs0s i ii≥0 in (A )s from the contract path 0 r hsi ii≥0 in A . It is easy to see that this is also a lasso path. By construction (definition of the simplified BA), we have the same transition labels on hs0s i ii≥0 as we have on hs0i ii≥0 . It follows that the compatibility requirement with the labels on the query r lasso path hqi ii≥0 is satisfied. So hs0s i , qi ii≥0 is a simultaneous lasso path for (A )s and B. ⇐) Let hssi , qi ii≥0 be a simultaneous lasso path for (Ar )s and B. We argue that there exists a simultaneous lasso path also for A and B. From Theorem 9 we know that we can build a path hsi ii≥0 in Ar from the contract path hssi ii≥0 in (Ar )s . It is easy to see that this is also a lasso path. Using the same induction as in Theorem 9, it is easy to see that we have the same transition labels on hsi ii≥0 as we have on hssi ii≥0 . It follows that the compatibility requirement with the labels on the query lasso path hqi ii≥0 is satisfied. So hsi , qi ii≥0 is a simultaneous lasso path for Ar and B. From Theorem 8 we know that hsi , qi ii≥0 is also a simultaneous lasso path for A and B. Note how Theorem 9 formalizes Theorem 5.

28