
storage interview questions

Top storage frequently asked interview questions

What are the merits of SAS vs SATA drives? [closed]

What are the merits of buying SAS drives over SATA drives, or vice versa?


Source: (StackOverflow)

ZFS vs XFS

We're considering building a ~16TB storage server. At the moment, we're considering both ZFS and XFS as filesystem. What are the advantages, disadvantages? What do we have to look for? Is there a third, better option?
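For concreteness, a rough sketch of what each route looks like for a box in that size range (six data disks and Linux-style device names used purely as placeholders):

# ZFS: pool, redundancy and filesystem come from one stack
zpool create tank raidz2 sdb sdc sdd sde sdf sdg
zfs create tank/data

# XFS: only a filesystem, so redundancy comes from a separate layer,
# e.g. Linux md RAID 6 (or a hardware RAID controller)
mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sd[b-g]
mkfs.xfs /dev/md0
mount /dev/md0 /data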


Source: (StackOverflow)


Are SSD drives as reliable as mechanical drives (2013)?

SSD drives have been around for several years now, but the issue of reliability still comes up.

I guess this is a follow-up to this question, posted 4 years ago and last updated in 2011. It's now 2013; has much changed? I'm looking for some real evidence, more than just a gut feeling. Maybe you're using them in your DC. What's been your experience?

Reliability of ssd drives


Source: (StackOverflow)

What are the different widely used RAID levels and when should I consider them?

This is a Canonical Question about RAID levels.

What are:

  • the RAID levels typically used (including the RAID-Z family)?
  • what deployments are they commonly found in?
  • the benefits and pitfalls of each? (a brief mdadm/ZFS sketch follows this list)
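For readers who want something concrete, here is a minimal sketch of how the common levels are created with Linux md and ZFS (the /dev/sd* names are placeholders, and the commands are alternatives, not meant to be run together):

# RAID 1: mirrored pair; usable capacity of one disk, survives one disk failure
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

# RAID 5: single parity; usable capacity of n-1 disks, survives one failure
mdadm --create /dev/md1 --level=5 --raid-devices=4 /dev/sd[b-e]

# RAID 6: double parity; usable capacity of n-2 disks, survives two failures
mdadm --create /dev/md2 --level=6 --raid-devices=4 /dev/sd[b-e]

# RAID 10: striped mirrors; usable capacity of n/2 disks, fast rebuilds
mdadm --create /dev/md3 --level=10 --raid-devices=4 /dev/sd[b-e]

# RAID-Z family (ZFS): raidz ~ single parity, raidz2 ~ double, raidz3 ~ triple
zpool create tank raidz2 sdb sdc sdd sde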

Source: (StackOverflow)

Why is Enterprise Storage so expensive?

This is a Canonical Question about the Cost of Enterprise Storage.
See also the following question:

Regarding general questions like:

  • Why do I have to pay 50 bucks a month per extra gigabyte of storage?
  • Our file server is always running out of space, why doesn't our sysadmin just throw an extra 1TB drive in there?
  • Why is SAN equipment so expensive?

The Answers here will attempt to provide a better understanding of how enterprise-level storage works and what influences the price. If you can expand on the Question or provide insight as to the Answer, please post.


Source: (StackOverflow)

What are the advantages of tape drives?

IBM still develops and sells tape drives today. Their capacity seems to be on a par with today's hard drives, but seek time and transfer rate are both significantly lower than those of hard drives.

So when are tape drives preferable to hard drives (or SSDs) today?


Source: (StackOverflow)

Fibre channel long distance woes

I need a fresh pair of eyes.

We're using a 15 km fibre optic line across which Fibre Channel and 10GbE are multiplexed (passive optical CWDM). For FC we have long-distance lasers rated up to 40 km (Skylane SFCxx0404F0D). The multiplexer is limited by the SFPs, which can do at most 4Gb Fibre Channel. The FC switch is a Brocade 5000 series. The respective wavelengths are 1550, 1570, 1590 and 1610 nm for FC and 1530 nm for 10GbE.

The problem is that the 4GbFC fabrics are almost never clean. Sometimes they are for a while, even with a lot of traffic on them. Then they may suddenly start producing errors (RX CRC, RX encoding, RX disparity, ...), even with only marginal traffic on them. I am attaching some error and traffic graphs. Errors are currently on the order of 50-100 errors per 5 minutes with 1 Gb/s of traffic.
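For reference, the counters behind those graphs can also be read directly on the FOS CLI (port 0 below is just an example):

porterrshow            # per-port CRC, enc-out and disparity error counters
portstatsshow 0        # detailed counters for a single port
portstatsclear 0       # clear them before a test run so the deltas are meaningful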


Optics

Here is the power output of one port summarized (collected using sfpshow on different switches)

SITE-A      units=uW (microwatt)    SITE-B
**********************************************
FAB1
SW1   TX 1234.3       RX   49.1       SW3   1550nm (ko)
      RX   95.2       TX 1175.6
FAB2
SW2   TX 1422.0       RX  104.6       SW4   1610nm (ok)
      RX   54.3       TX 1468.4      

What I find curious at this point is the asymmetry in the power levels: while SW2 transmits at 1422 µW, which SW4 receives at 104 µW, SW2 receives the SW4 signal (of similar original power) at only 54 µW.

Vice versa for SW1-3.

Anyway, the SFPs have RX sensitivity down to -18 dBm (about 16 µW), so in any case it should be fine... But nothing is.
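As a sanity check on those figures, the dBm/µW conversion is just P[mW] = 10^(dBm/10); a quick way to run the numbers in a shell:

awk 'BEGIN { printf "%.1f uW\n",  1000 * 10^(-18/10) }'              # -18 dBm  -> ~15.8 uW
awk 'BEGIN { printf "%.1f dBm\n", 10 * log(104.6/1000)/log(10) }'    # 104.6 uW -> ~-9.8 dBm
awk 'BEGIN { printf "%.1f dBm\n", 10 * log(54.3/1000)/log(10) }'     #  54.3 uW -> ~-12.7 dBm

So all the measured RX levels are well above the stated sensitivity floor.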

Some SFPs have been diagnosed as malfunctioning by the manufacturer (the 1550 nm ones shown above with "ko"). The 1610 nm ones are apparently OK; they have been tested using a traffic generator. The leased line has also been tested more than once. All is within tolerances. I'm awaiting the replacements, but for some reason I don't believe they will make things better, as the apparently good ones don't produce zero errors either.

Earlier there was active equipment involved (some kind of 4GFC retimer) before the signal was put on the line. No idea why. That equipment was eliminated because of the problems, so we now only have:

  • the long distance laser in the switch,
  • (new) 10m LC-SC monomode cable to the mux (for each fabric),
  • the leased line,
  • the same thing but reversed on the other side of the link.


FC switches

Here is a port config from the Brocade portcfgshow (it's like that on both sides, obviously)

Area Number:              0
Speed Level:              4G
Fill Word(On Active)      0(Idle-Idle)
Fill Word(Current)        0(Idle-Idle)
AL_PA Offset 13:          OFF
Trunk Port                ON
Long Distance             LS
VC Link Init              OFF
Desired Distance          32 Km
Reserved Buffers          70
Locked L_Port             OFF
Locked G_Port             OFF
Disabled E_Port           OFF
Locked E_Port             OFF
ISL R_RDY Mode            OFF
RSCN Suppressed           OFF
Persistent Disable        OFF
LOS TOV enable            OFF
NPIV capability           ON
QOS E_Port                OFF
Port Auto Disable:        OFF
Rate Limit                OFF
EX Port                   OFF
Mirror Port               OFF
Credit Recovery           ON
F_Port Buffers            OFF
Fault Delay:              0(R_A_TOV)
NPIV PP Limit:            126
CSCTL mode:               OFF

Forcing the links to 2GbFC produces no errors, but we bought 4GbFC and we want 4GbFC.

[error and traffic graphs]

I don't know where to look anymore. Any ideas what to try next or how to proceed?

If we can't make 4GbFC work reliably, I wonder what the people running 8 or 16 GbFC do... I don't suppose that "a few errors here and there" are acceptable.

Oh, and BTW, we are in contact with every one of the manufacturers (FC switch, MUX, SFPs, ...). Except for the SFPs to be changed (some have been changed before), nobody has a clue. Brocade SAN Health says the fabric is OK. The MUX, well, it's passive; it's only a prism, nature at its best.

Any shots in the dark?


APPENDIX: Answers to your questions

@Chopper3: This is the second generation of Brocades exhibiting the problem. Before we had 5000s; now we have 5100s. In the beginning, when we still had the active MUX, we rented a long-distance laser once to put directly into the switch in order to run tests for a day; during that day, of course, it was clean. But as I said, sometimes it's clean just like that. And sometimes it's not. Alternative switches would mean rebuilding the entire SAN with those just to test. Alternative SFPs, well, they're hard to come by just like that.

@longneck: The line is rented. It's dark fibre (9 µm monomode), so there's no one else on it. Sure, there are splices. I can't go and look, but I have to trust they have been done correctly. As I said, the line has been checked and rechecked (using an optical time-domain reflectometer). Obviously you don't have all this equipment yourself, because it's way too expensive.

@mdpc: What would be the "wrong" type of cable, according to you? Up to the switch everything is monomode, yes. The connectors are the correct ones too. Yeah, I know there are the green (angle-polished) ones where the fibre is cut off at a certain angle, etc. But we have the correct ones, as far as I know.


Progress Report #1

We have had two fabrics (=2x2 switches) with Brocade 5100s with FabricOS 6.4.1 and two fabrics (another 2x4 switches) on FabricOS 7.0.2.

On the long-distance ISLs (one in each fabric), it turned out that with FOS 6.4.1, setting a port to long distance issues warnings about the VC Init setting and, consequently, the fill word. But those are only warnings. FOS 7.0.2 requires you to modify VC Init and the fill word for long-distance links.

Setting a FOS 6.4.1 port to LS (static long-distance) mode with the wrong VC Init and fill word settings made the whole fabric inoperable (stuck in an SCN loop; use fabriclog -s to see it, you won't see it anywhere else, and no port error counters or anything else increase).
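For reference, the knobs involved on the FOS CLI are roughly the following (a sketch only, not our exact configuration; port 0 is an example, and the right fill word mode depends on the ASIC and the peer):

portdisable 0
portcfglongdistance 0 LS 1 32   # LS = static long distance, VC translation link init = 1, 32 km
portcfgfillword 0 3             # fill word mode 0-3; pick per the FOS admin guide for your ASIC
portenable 0
portcfgshow 0                   # confirm Fill Word, Long Distance, Desired Distance and buffers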

Currently I'm giving the one fabric with the IMHO more correct settings a beating and it seems to do fine, whereas the other one without much traffic still has errors here and there.

[progress graph]

In short:

  • We have eliminated the active part of the MUX (the FC retimer).
  • We are putting the long distance SFPs into the end equipment themselves.
  • Just to be sure we bought new monomode cables to connect the end equipment to the remaining passive part of the MUX.
  • We are now trying out several long distance configs.

It's almost black magic. Everything that happens is mostly empirical; no one seems to have a clue about the exact reasons for doing something. ("We tried this, and it didn't work; then we tried that, and it worked, so we stuck with that." But no one really seems to know why.)

I'll keep you updated.


Progress Report #2

We got the new lasers for one of the fabrics on warranty. It's ultra clean even on 4GbFC.

They're transmitting with roughly 2mW (3dBm) whereas the others are only at 1.5mW (1.5dBm) although that should really be enough.

The other fabric (where the lasers are apparently ok) still produces one or two CRCs infrequently.

Using sfpshow, the SFP producing the actual RX errors shows:

Status/Ctrl: 0x82
Alarm flags[0,1] = 0x5, 0x40
Warn Flags[0,1] = 0x5, 0x40

Now I'll have to find out what that means. Not sure if it was there before.

Well I'll first clear my head with a week of vacation. 8-)


Source: (StackOverflow)

How should I set up my Fibre Channel (FC) network?

How do FC switches work, and how should I configure mine?


Source: (StackOverflow)

Server vendor that allows 3rd party disks

As noted here, Dell is no longer allowing 3rd-party disks to be used with their latest servers. As in, they simply don't work. This means that if you buy one of these boxes and want to upgrade the storage later, you have to buy disks from Dell at a significant premium.

Dell has just given me a very strong reason to take my server business elsewhere. My company buys (instead of leasing) our servers, and typically uses them for 5 years. I need to be able to upgrade/repurpose storage periodically, and do not want to be locked into whatever Dell might have in stock, at inflated prices to boot. As you will see in the comments of the above link, it seems HP is doing the same thing.

I am looking for a server vendor that offers 3-5 year warranty with same day/next day onsite service, and allows me to use 3rd party disks.

Suggestions?


Source: (StackOverflow)

How to format as FAT32 from Windows 7/Vista

What is the best way to format a USB drive with FAT32 (for Mac compatibility) from within Windows 7/Vista?

I ask because Disk Management only lets you pick exFAT (because the disk is over 32 GB, I believe).

Doing it from the command line with diskpart doesn't seem to work either.


Source: (StackOverflow)

BBWC: in theory a good idea but has one ever saved your data?

I'm familiar with what a BBWC (battery-backed write cache) is intended to do, and I previously used them in my servers even with a good UPS. There are obviously failures it does not provide protection against. I'm curious to understand whether it actually offers any real benefit in practice.

(NB I'm specifically looking for responses from people who have BBWC and had crashes/failures and whether the BBWC helped recovery or not)

Update

After the feedback here, I'm increasingly skeptical as to whether a BBWC adds any value.

To have any confidence about data integrity, the filesystem MUST know when data has been committed to non-volatile storage (not necessarily the disk - a point I'll come back to). It's worth noting that a lot of disks lie about when data has been committed to the disk (http://brad.livejournal.com/2116715.html). While it seems reasonable to assume that disabling the on-disk cache might make the disks more honest, there's still no guarantee that this is the case either.

Due to the typically large buffers in a BBWC, a barrier can require significantly more data to be committed to disk, therefore causing delays on writes: the general advice is to disable barriers when using a non-volatile write-back cache (and to disable on-disk caching). However, this would appear to undermine the integrity of the write operation: just because more data is maintained in non-volatile storage does not mean that it will be more consistent. Indeed, arguably, without demarcation between logical transactions there seems to be less opportunity to ensure consistency than otherwise.
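For reference, the "general advice" above usually translates into something like this on Linux (ext4 shown as an example; device names are placeholders):

hdparm -W 0 /dev/sda                 # disable the volatile on-disk write cache
hdparm -W /dev/sda                   # confirm: write-caching = 0 (off)
mount -o barrier=0 /dev/sda1 /data   # ext4 behind a BBWC: barriers disabled
# without a BBWC, leave the default barrier=1 so journal writes are ordered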

If the BBWC were to acknowledge barriers at the point the data enters its non-volatile storage (rather than when it is committed to disk), then it would appear to satisfy the data integrity requirement without a performance penalty, implying that barriers should still be enabled. However, since these devices generally exhibit behaviour consistent with flushing the data to the physical device (significantly slower with barriers), and given the widespread advice to disable barriers, they cannot be behaving in this way. WHY NOT?

If the I/O in the OS is modelled as a series of streams, then there is some scope to minimise the blocking effect of a write barrier when write caching is managed by the OS, since at this level only the logical transaction (a single stream) needs to be committed. On the other hand, a BBWC with no knowledge of which bits of data make up a transaction would have to commit its entire cache to disk. Working out whether the kernel/filesystems actually implement this in practice would require a lot more effort than I'm willing to invest at the moment.

A combination of disks telling fibs about what has been committed and a sudden loss of power undoubtedly leads to corruption, and with a journalling or log-structured filesystem, which doesn't do a full fsck after an outage, it's unlikely that the corruption will be detected, let alone that an attempt will be made to repair it.

In terms of the modes of failure, in my experience most sudden power outages occur because of loss of mains power (easily mitigated with a UPS and managed shutdown). People pulling the wrong cable out of the rack implies poor datacentre hygiene (labelling and cable management). There are some types of sudden power-loss event which are not prevented by a UPS, such as a failure in the PSU or VRM; a BBWC with barriers would provide data integrity in the event of a failure there, but how common are such events? Very rare, judging by the lack of responses here.

Certainly moving the fault tolerance higher in the stack is significantly more expensive than a BBWC; however, implementing a server as a cluster has lots of other benefits for performance and availability.

An alternative way to mitigate the impact of sudden power loss would be to implement a SAN; AoE makes this a practical proposition (I don't really see the point of iSCSI), but again there's a higher cost.


Source: (StackOverflow)

Distributed Storage Filesystem - Which one/Is there a ready to use product?

With Hadoop and CouchDB all over the blogs and related news, what is a distributed, fault-tolerant storage engine that actually works?

  • CouchDB doesn't actually have any distribution features built in; to my knowledge, the glue to automagically distribute entries or even whole databases is simply missing.
  • Hadoop seems to be very widely used - at least it gets good press - but it still has a single point of failure: the NameNode. Plus, it's only mountable via FUSE, and I understand HDFS isn't actually the main goal of Hadoop.
  • GlusterFS does have a shared-nothing concept, but lately I have read several posts that lead me to believe it's not quite as stable.
  • Lustre also has a single point of failure, as it uses a dedicated metadata server.
  • Ceph seems to be the player of choice, but the homepage states it is still in its alpha stage.

So the question is: which distributed filesystem has the following feature set (in no particular order)?

  • POSIX-compatible
  • easy addition/removal of nodes
  • shared-nothing concept
  • runs on cheap hardware (AMD Geode or VIA Eden class processors)
  • authentication/authorization built-in
  • a network file system (I'd like to be able to mount it simultaneously on different hosts)

Nice to have:

  • locally accessible files: I can take a node down, mount the partition with a standard local filesystem (ext3/XFS/whatever...) and still access the files

I'm not looking for hosted applications, but rather something that will allow me to take, say, 10 GB from each of our hardware boxes and have that storage available on our network, easily mountable on a multitude of hosts.


Source: (StackOverflow)

ZFS Data Loss Scenarios

I'm looking toward building a largish ZFS pool (150TB+), and I'd like to hear people's experiences of data loss scenarios due to failed hardware, in particular distinguishing between instances where just some data is lost vs. the whole filesystem (or if there even is such a distinction in ZFS).

For example: let's say a vdev is lost due to a failure like an external drive enclosure losing power or a controller card failing. From what I've read, the pool should go into a faulted mode, but if the vdev is returned, should the pool recover? Or not? Or if the vdev is partially damaged, does one lose the whole pool, some files, etc.?

What happens if a ZIL device fails? Or just one of several ZILs?
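For context, a minimal sketch of the sort of layout in question, with the log devices mirrored so that a single ZIL-device failure doesn't matter (all device names are placeholders):

zpool create tank \
  raidz2 sdb sdc sdd sde sdf sdg \
  raidz2 sdh sdi sdj sdk sdl sdm      # a pool of this size is built from many such vdevs
zpool add tank log mirror ssd0 ssd1   # separate intent log (SLOG), mirrored
zpool status -v tank                  # per-vdev state and error counters after a failure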

Truly any and all anecdotes or hypothetical scenarios backed by deep technical knowledge are appreciated!

Thanks!

Update:

We're doing this on the cheap since we are a small business (9 people or so) but we generate a fair amount of imaging data.

The data is mostly smallish files, by my count about 500k files per TB.

The data is important but not uber-critical. We are planning to use the ZFS pool to mirror a 48TB "live" data array (in use for 3 years or so), and to use the rest of the storage for 'archived' data.

The pool will be shared using NFS.

The rack is supposedly on a building backup generator line, and we have two APC UPSes capable of powering the rack at full load for 5 mins or so.


Source: (StackOverflow)

What kind of storage do people actually use for VMware ESX servers?

VMware and many network evangelists try to tell you that sophisticated (=expensive) fiber SANs are the "only" storage option for VMware ESX and ESXi servers. Well, yes, of course. Using a SAN is fast, reliable and makes vMotion possible. Great. But: Can all ESX/ESXi users really afford SANs?

My theory is that less than 20% of all VMware ESX installations on this planet actually use Fibre Channel or iSCSI SANs. Most of these installations will be in larger companies that can afford them. I would predict that most VMware installations use "attached storage" (vmdks are stored on disks inside the server). Most of them run in SMEs, and there are so many of them!

We run two ESX 3.5 servers with attached storage and two ESX 4 servers with an iSCSI SAN. And the "real-life difference" between the two is barely noticeable :-)

Do you know of any official statistics for this question? What do you use as your storage medium?


Source: (StackOverflow)

Do I need to RAID Fusion-io cards?

Can I run reliably with a single Fusion-io card installed in a server, or do I need to deploy two cards in a software RAID setup?

Fusion-io isn't very clear (almost misleading) on the topic in its marketing materials. Given the cost of the cards, I'm curious how other engineers deploy them in real-world scenarios.

I plan to use the HP-branded Fusion-io ioDrive2 1.2TB card for a proprietary standalone database solution running on Linux. This is a single server setup with no real high-availability option. There is asynchronous replication with a 10-minute RPO that mirrors transaction logs to a second physical server.

Traditionally, I would specify a high-end HP ProLiant server with the top CPU stepping for this application. I need to go to SSD, and I'm able to acquire Fusion-io at a lower price than enterprise SAS SSD for the required capacity.

  • Do I need to run two ioDrive2 cards and join them with software RAID (md or ZFS), or is that unnecessary?
  • Should I be concerned about Fusion-io failure any more than I'd be concerned about a RAID controller failure or a motherboard failure?
  • System administrators like RAID. Does this require a different mindset, given the different interface and on-card wear-leveling/error-correction available in this form-factor?
  • What IS the failure rate of these devices?

Edit: I just read a Fusion-io reliability whitepaper from Dell, and the takeaway seems to be "Fusion-io cards have lots of internal redundancies... Don't worry about RAID!!".
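If the answer does turn out to be "mirror them", a minimal sketch of what that would look like with Linux md (the /dev/fio* names are what the ioMemory driver typically exposes, but treat them, and the mount point, as assumptions):

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/fioa /dev/fiob   # RAID 1 across two cards
mkfs.xfs /dev/md0                  # filesystem for the database on top of the mirror
mount /dev/md0 /var/lib/database
cat /proc/mdstat                   # watch sync/rebuild state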


Source: (StackOverflow)