Data Feed architecture

MarkusJNZ

Hi, we have some data feeds which pull info from external sources.
Unfortunately, we have to use screen scraping as there are no XML
feeds. The data feeds are located in a variety of different
applications on different servers. I have to design a new
architecture. I have a fair idea of how I would do it, but if anyone
has any pointers to a good existing architecture design or *things not
to do*, please post.

TIA
Markus
===================
googlenews2006markusj
 
Nick Malik [Microsoft]

adapters, agents, and messageware.

I've done this a couple of times so far. I'll need to know more about the
technologies you are working with to help more, though.

What does your environment look like? Do you have BizTalk or an ESB running
yet? What time requirements do you have for the data?

--
--- Nick Malik [Microsoft]
MCSD, CFPS, Certified Scrummaster
http://blogs.msdn.com/nickmalik

Disclaimer: Opinions expressed in this forum are my own, and not
representative of my employer.
I do not answer questions on behalf of my employer. I'm just a
programmer helping programmers.
 
MarkusJNZ

Hi Nick, we do not have BizTalk, and I'm not too sure what you mean by
ESB, sorry.

Basically we have a number of distributed applications on a variety of
platforms (Classic ASP, .NET 1.1/2.0 and a Python Script). These
applications are scheduled via a scheduling program to go away and
"screen scrape" information at a specified time.

All information is then logged into a centralized database so the data
can be used at a later date.

Database-wise, we are using MSSQL 2005.
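For concreteness, the "log everything into a centralized database" step might look like the sketch below. The table and column names (scrape_log, source, scraped_at, payload) are invented for illustration, and sqlite3 stands in here for MSSQL 2005, which a real deployment would reach through an ODBC driver:

```python
# Sketch of the central-logging step. Schema names are made up;
# sqlite3 is an in-memory stand-in for the real MSSQL 2005 database.
import sqlite3
from datetime import datetime, timezone

def log_scrape(conn, source, payload):
    """Record one scraped item in the central table."""
    conn.execute(
        "INSERT INTO scrape_log (source, scraped_at, payload) VALUES (?, ?, ?)",
        (source, datetime.now(timezone.utc).isoformat(), payload),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scrape_log "
    "(id INTEGER PRIMARY KEY, source TEXT, scraped_at TEXT, payload TEXT)"
)
# Each distributed scraper app would call log_scrape with its own tag.
log_scrape(conn, "app-asp-classic", "<price>42.50</price>")
log_scrape(conn, "app-python-script", "<price>13.99</price>")
rows = conn.execute("SELECT source, payload FROM scrape_log ORDER BY id").fetchall()
print(rows)
```

The point of the single table is that reports and web pages can all query one place regardless of which scraper produced the row.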

TIA
Markus
 
Nick Malik [Microsoft]

Hi Markus,

From an architectural perspective, you have applications that draw data
using screen scraping. They interpret that data and store it in a database.
Part of what I need to know: how up to date does the data need to be?

Example:
Contoso Marine Supply is a catalog provider of small parts and fittings for
boaters. They have a Mainframe application, written in CICS, that is used
to enter catalog orders that arrive via a mail processing center.

At any time, the company employees can see the list of invoices that need to
be sent to the customer via a CICS screen on an IBM 3270 terminal.

If the system that prints and sends the invoices is on the Windows platform,
then it makes sense that the data is pulled periodically (perhaps nightly?)
and if a new invoice is found, then the necessary data is stored for
printing. We could also say that we print invoices twice a week.

In this scenario, the data needs to get to the Windows application twice a
week. We pull the data more often, which adds a level of *reliability*
(because if the mainframe or the windows server app are not running on
Tuesday at midnight, you can still pull the data on Wednesday for Thursday's
print run... this serves the reliable delivery of data).

A different scenario may be if the Windows server application is a Partner
Relationship Management system. In that case, the PRM system needs to know
about the orders as soon as they are entered, because a salesman may be
about to call on a particular supplier, and they need accurate and
up-to-date information about the orders that are coming through for their
parts. In this case, the time requirements would be pretty much 'as soon as
humanly possible' (I like the term "near real time").

So I'm asking about the time requirements. You've got some of the
picture... you have apps that pull data. Cool. What data do they pull and
why do they pull it? That's pretty important info if I'm going to be
helpful.

ESB = Enterprise Service Bus.

Please tell me what type of app you are screen scraping (CICS, UNIX, AS/400,
what?).


--
--- Nick Malik [Microsoft]
http://blogs.msdn.com/nickmalik
 
MarkusJNZ

Hi Nick, thanks for your help
Please see below
Nick said:
> From an architectural perspective, you have applications that draw data
> using screen scraping. They interpret that data and store it in a database.
> Part of what I need to know: how up to date does the data need to be?

The import is done on a daily basis, so information only needs to be
updated once a day from the existing data sources. Reports etc. are
viewed against this information all day long from many different
sources (web pages, applications, etc.).

> Please tell me what type of app you are screen scraping (CICS, UNIX,
> AS/400, what?).

It's just an external website. We just parse the HTML, retrieve the
information we need, and update the database.
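For what it's worth, that parse-the-HTML step can be done with nothing but the standard library. The sample markup and the "price" class below are made-up examples; a real target page would need its own rules:

```python
# Minimal screen-scraping sketch using only the standard library.
# The sample page and the "price" class are invented for illustration.
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Collects the text of every element whose class attribute is 'price'."""
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())

page = ('<html><body><span class="price">42.50</span>'
        '<span class="price">13.99</span></body></html>')
parser = PriceExtractor()
parser.feed(page)
print(parser.prices)  # extracted values, ready to be written to the database
```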
 
Nick Malik [Microsoft]

Markus said:
> The import is done on a daily basis, so information only needs to be
> updated once a day from the existing data sources.
>
> It's just an external website. We just parse the HTML, retrieve the
> information we need and update the database.

My prior responses were overkill.

For your architecture, I would suggest that you create an app with two basic
abilities:
1. the ability to specify as many target data pages as you want in an XML
file. That way, if you want to expand the list of pages you want to pull
data from, or if the information provider decides to break the information
up onto multiple pages, you can adapt quickly.

2. the ability to define what data you want from your target page, and how
to find it on the target page, using an XML description. That way, when
the target page changes in formatting or coding, you don't have to change
your C# code to allow you to get your data again.
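The two abilities above could be sketched roughly as follows. The config format (the <page> and <field> elements, the regex patterns) is invented for illustration, and a stub fetcher stands in for the real HTTP request; Nick's advice assumes C#, but the same shape works in the Python already in the stack:

```python
# Sketch of both abilities: the pages to scrape and the rules for
# finding data on them live in XML, not in code. The config schema
# and the sample pattern are made up for illustration.
import re
import xml.etree.ElementTree as ET

CONFIG = """
<feeds>
  <page url="http://example.com/rates">
    <field name="usd_rate" pattern="USD[^0-9]*([0-9.]+)"/>
  </page>
</feeds>
"""

def extract_fields(config_xml, fetch):
    """For each configured page, fetch its HTML and apply each field's regex."""
    results = {}
    for page in ET.fromstring(config_xml).findall("page"):
        html = fetch(page.get("url"))
        for field in page.findall("field"):
            match = re.search(field.get("pattern"), html)
            results[field.get("name")] = match.group(1) if match else None
    return results

# A dict-backed stub fetcher stands in for the real HTTP request.
sample = {"http://example.com/rates": "<td>USD: 1.4721</td>"}
values = extract_fields(CONFIG, fetch=sample.get)
print(values)
```

When the provider reshuffles the page, only the XML file changes; the code above stays untouched.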


I would suggest that you run your app as a service that runs nightly. I
notice that you posted your question to the ASP.Net newsgroup, so it is
possible that you are familiar only with creating web apps. Writing a
service is different, but not terribly difficult. Suggestion: Create a
command line utility that will do the work of pulling the data. Then either
write a service to call your command line utility, or simply schedule your
command line utility with the scheduling service in Windows. That makes it
easier to write and debug your code. Keep in mind that your app needs to
run without calling a user interface of any kind. No input from console, no
output to console (except debugging messages).
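The shape of such a command-line utility might look like the sketch below, where pull_data() is a placeholder for the real scrape-and-store work; the only thing the scheduler sees is the exit code:

```python
# Sketch of the no-UI command-line utility shape. pull_data() is a
# placeholder; nothing here prompts for or reads console input.
import sys

def pull_data():
    """Placeholder for the real scrape-and-store work."""
    return True  # pretend the pull succeeded

def main():
    """Do the work and return an exit code the scheduler can check."""
    try:
        ok = pull_data()
    except Exception as exc:
        # Debugging output only, and only to stderr.
        print(f"DEBUG: pull failed: {exc}", file=sys.stderr)
        return 1
    return 0 if ok else 1

exit_code = main()  # a service or scheduled task would call sys.exit(main())
```

Because it neither reads from nor writes to the console (beyond the stderr debug line), the same executable runs cleanly under the Windows scheduler or a thin service wrapper.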

Using a service will make it much easier to reliably get the data you want,
and you can change the frequency by which you pull data by simply changing
the scheduler or your service code.

Hope this helps.

--
--- Nick Malik [Microsoft]
http://blogs.msdn.com/nickmalik
 
MarkusJNZ

Thanks for your help Nick
Regards
Markus
 
